[Platform] Add initial experimental support for Atlas 300I series #1333
Conversation
Signed-off-by: Vincent Yuan <[email protected]>
Signed-off-by: Yikun Jiang <[email protected]>
Co-authored-by: Vincent Yuan <[email protected]>

Signed-off-by: angazenn <[email protected]>
Signed-off-by: Yikun Jiang <[email protected]>
Co-authored-by: angazenn <[email protected]>

Signed-off-by: Yikun Jiang <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Codecov Report

❌ Patch coverage is …

Additional details and impacted files

@@ Coverage Diff @@
## main #1333 +/- ##
==========================================
- Coverage 27.73% 27.39% -0.34%
==========================================
Files 56 56
Lines 6004 6191 +187
==========================================
+ Hits 1665 1696 +31
- Misses 4339 4495 +156
Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
cc @Angazenn @farawayboat @leo-pony Please help review and test, thanks.
// Calculate mask for org_vocab range
// org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
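For context, the deleted note documents a simple range mask; a minimal sketch of that computation (illustrative values; the real tensors come from the vocab-parallel embedding kernel):

```python
# Minimal sketch of the masking the note describes: ids outside this shard's
# original-vocab range [org_vocab_start_index, org_vocab_end_index) are masked.
import torch

input_ = torch.tensor([3, 7, 12, 42])               # token ids (illustrative)
org_vocab_start_index, org_vocab_end_index = 5, 15  # this shard's vocab range

org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
print(org_vocab_mask)  # tensor([False,  True,  True, False])
```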
@farawayboat Why were these notes removed?
This was deleted during my local code formatting, and it should be added back.
OK, let's address these in a new PR.
@@ -0,0 +1,114 @@
name: 'image / openEuler'
Use `image / 310p-openEuler` to make the CI title clearer.
Sure, will address this in a new PR.
@@ -0,0 +1,110 @@
name: 'image / Ubuntu'
ditto
I didn't make any changes to these 3 PRs. Considering we need to publish the 310p image first, I will merge this. cc @jianzs @ganyi1996ppo Feel free to review if you have any comments. For this change we only did manual e2e testing without adding new unit tests, because the tests need to run on a real 310P image; we will add them next week.
Download image: 310p

Run with 310p:

Offline test:

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    trust_remote_code=True,
    tensor_parallel_size=1,
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on 310P
    disable_custom_all_reduce=True,  # IMPORTANT: required on 310P
    compilation_config={"custom_ops": ["+rms_norm", "+rotary_embedding"]},  # IMPORTANT: 310P needs these custom ops
)
# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
…eries (vllm-project#1333)"

This reverts commit 097e714.

Signed-off-by: Yikun Jiang <[email protected]>
vllm-ascend runs extremely slowly on the 300I Duo!
We have disabled the JIT compiler for the 310P and changed the data format of the weights in the vocabulary embedding and QKV projection layers to NZ. This improves performance.

diff --git a/vllm_ascend/worker/model_runner_v1.py b/vllm_ascend/worker/model_runner_v1.py
--- a/vllm_ascend/worker/model_runner_v1.py (revision e1123172d12afa15f306ba6e1e4c9d0c6d1d799e)
+++ b/vllm_ascend/worker/model_runner_v1.py (date 1751249710816)
@@ -90,6 +90,11 @@
 import vllm_ascend.envs as envs_ascend
+import torch_npu
+from vllm_ascend.utils import is_310p
+if is_310p():
+    torch_npu.npu.set_compile_mode(jit_compile=False)
+
 @dataclass
 class GraphCaptureContext:
@@ -1830,6 +1834,18 @@
         with DeviceMemoryProfiler() as m:  # noqa: SIM117
             self.model = get_model(vllm_config=self.vllm_config)
+            from vllm_ascend.utils import is_310p
+            if is_310p():
+                import torch_npu
+                from vllm.model_executor.layers.linear import (
+                    MergedColumnParallelLinear, QKVParallelLinear,
+                    RowParallelLinear)
+                from vllm.model_executor.layers.vocab_parallel_embedding import (
+                    VocabParallelEmbedding)
+                for module in self.model.modules():
+                    if isinstance(module, (VocabParallelEmbedding, MergedColumnParallelLinear,
+                                           QKVParallelLinear, RowParallelLinear)):
+                        module.weight.data = torch_npu.npu_format_cast(module.weight.data, 29)
+
         if hasattr(self, "drafter"):
             logger.info("Loading drafter model...")
             if self.use_aux_hidden_state_outputs:
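For reference, the same change as a standalone sketch (a minimal sketch, not the merged implementation; the helper name `apply_310p_tweaks` and the constant name `ACL_FORMAT_FRACTAL_NZ` are ours, everything else follows the diff above):

```python
# Standalone sketch of the 310P workaround above: disable the JIT compiler
# and cast selected weights to the NZ layout. Not the merged implementation.
import torch_npu
from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                               QKVParallelLinear,
                                               RowParallelLinear)
from vllm.model_executor.layers.vocab_parallel_embedding import (
    VocabParallelEmbedding)
from vllm_ascend.utils import is_310p

ACL_FORMAT_FRACTAL_NZ = 29  # the raw format id 29 passed to npu_format_cast


def apply_310p_tweaks(model) -> None:
    """Apply the 310P tweaks from the diff to an already-loaded model."""
    if not is_310p():
        return
    torch_npu.npu.set_compile_mode(jit_compile=False)  # disable JIT compiler
    nz_layers = (VocabParallelEmbedding, MergedColumnParallelLinear,
                 QKVParallelLinear, RowParallelLinear)
    for module in model.modules():
        if isinstance(module, nz_layers):
            # Re-layout the weight tensor as NZ for faster 310P matmuls.
            module.weight.data = torch_npu.npu_format_cast(
                module.weight.data, ACL_FORMAT_FRACTAL_NZ)
```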
Qwen3-0.6B performance, with --compilation-config '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}' and the code optimization applied:

============ Serving Benchmark Result ============
I tested it on Qwen3 8B; the throughput rose from 2.1 tokens/s to 9 tokens/s!
… 300I series (#1591)

### What this PR does / why we need it?
Since running on Atlas 300I Duo was initially supported in #1333, this PR disables the JIT compiler for the 310P and changes the data format to NZ for the weights in the vocabulary embedding and QKV projection layers, which helps improve performance. See #1563

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested manually: #1591 (comment)

Signed-off-by: Vincent Yuan <[email protected]>
…llm-project#1333)

### What this PR does / why we need it?
Add initial experimental support for Ascend 310P. This patch squashes the PRs below into one to help validation:
- vllm-project#914
- vllm-project#1318
- vllm-project#1327

### Does this PR introduce _any_ user-facing change?
Users can run vLLM on Atlas 300I Duo series.

### How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit tests missing because a real 310P image is needed for them; they will be added in a separate PR later.
- Manual e2e tests:
  - Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B: vllm-project#914 (comment)
  - Pangu MGoE 72B

The patch has been tested locally on Ascend 310P hardware to ensure that the changes do not break existing functionality and that the new features work as intended.

#### ENV information
CANN, NNAL version: 8.1.RC1

> [!IMPORTANT]
> PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 is required to support the NZ format and calling NNAL operators on 310P

#### Code example

##### Build vllm-ascend from source code
```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```

##### Run offline inference
```python
from vllm import LLM, SamplingParams

prompts = [
    "水的沸点是100摄氏度吗?请回答是或者否。",
    "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
    "水的沸点是100摄氏度吗?请回答是或者否。",
    "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)

# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops": ["none", "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

---------

Signed-off-by: Vincent Yuan <[email protected]>
Signed-off-by: Yikun Jiang <[email protected]>
Signed-off-by: angazenn <[email protected]>
Co-authored-by: Vincent Yuan <[email protected]>
Co-authored-by: angazenn <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>

What this PR does / why we need it?
Add initial experimental support for Ascend 310P. This patch squashes the PRs below into one to help validation:
- #914
- #1318
- #1327
Does this PR introduce any user-facing change?
Users can run vLLM on Atlas 300I Duo series.
How was this patch tested?
CI passed with:
- E2E image build for 310P
- CI test on A2 with e2e test and longterm test
- Unit tests missing because a real 310P image is needed for them; they will be added in a separate PR later.
- Manual e2e tests:
  - Qwen2.5-7b-instruct, Qwen2.5-0.5b, Qwen3-0.6B, Qwen3-4B, Qwen3-8B: #914 (comment)
  - Pangu MGoE 72B
The patch has been tested locally on Ascend 310P hardware to ensure that the changes do not break existing functionality and that the new features work as intended.
ENV information
CANN, NNAL version: 8.1.RC1
Important
PTA 2.5.1 version >= torch_npu-2.5.1.post1.dev20250528 is required to support the NZ format and calling NNAL operators on 310P.
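A quick way to check the installed build against that requirement (a minimal sketch; only the version string above comes from this PR):

```python
# Minimal sketch: print the installed torch_npu version so it can be compared
# against the 2.5.1.post1.dev20250528 requirement noted above.
from importlib.metadata import PackageNotFoundError, version

try:
    print(f"torch_npu {version('torch_npu')} "
          "(need >= 2.5.1.post1.dev20250528 for NZ format on 310P)")
except PackageNotFoundError:
    print("torch_npu is not installed")
```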
Code example
Build vllm-ascend from source code
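```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```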
Run offline inference
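```python
from vllm import LLM, SamplingParams

prompts = [
    "水的沸点是100摄氏度吗?请回答是或者否。",
    "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
    "水的沸点是100摄氏度吗?请回答是或者否。",
    "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)

# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops": ["none", "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```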
Co-authored-by: Vincent Yuan <[email protected]>
Co-authored-by: angazenn <[email protected]>
Co-authored-by: wangxiyuan <[email protected]>
Co-authored-by: leo-pony <[email protected]>
Co-authored-by: shen-shanshan <[email protected]>