Conversation

@Yikun
Collaborator

@Yikun Yikun commented Jun 20, 2025

What this PR does / why we need it?

Add initial experimental support for Ascend 310P. This patch squashes the PRs below into one to help validation:

- #914
- #1318
- #1327

Does this PR introduce any user-facing change?

Users can run vLLM on the Atlas 300I Duo series.

How was this patch tested?

CI passed with:

  • E2E image build for 310P
  • CI test on A2 with e2e test and long-term test
  • Unit tests are missing because a real 310P image is needed to run them; they will be added in a separate PR later.
  • Manual e2e test:
    • Qwen2.5-7B-Instruct, Qwen2.5-0.5B, Qwen3-0.6B, Qwen3-4B, Qwen3-8B: #914 (comment)
    • Pangu MGoE 72B

The patch has been tested locally on Ascend 310P hardware to ensure that the changes do not break existing functionality and that the new features work as intended.

ENV information

CANN, NNAL version: 8.1.RC1

Important

PTA 2.5.1 version must be >= torch_npu-2.5.1.post1.dev20250528 to support the NZ format and calling NNAL operators on 310P
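
To double-check that the installed PTA build meets this requirement, here is a rough version check (a sketch only; it assumes torch_npu exposes the full post/dev suffix via __version__, which may not hold for every build):

```python
# Rough check of the torch_npu (PTA) version requirement above.
from packaging.version import parse

import torch_npu

REQUIRED = "2.5.1.post1.dev20250528"  # from the ENV note above

installed = torch_npu.__version__
if parse(installed) < parse(REQUIRED):
    print(f"torch_npu {installed} may be too old for NZ format / NNAL operators on 310P "
          f"(need >= {REQUIRED}).")
else:
    print(f"torch_npu {installed} satisfies the 310P requirement.")
```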

Code example

Build vllm-ascend from source code

```shell
# download source code as vllm-ascend
cd vllm-ascend
export SOC_VERSION=Ascend310P3
pip install -v -e .
cd ..
```
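
After installing, a quick sanity check confirms the NPU backend and 310P detection work (a sketch; is_310p() is the helper from vllm_ascend.utils that also appears in the patch discussed later in this thread):

```python
# Post-install sanity check (sketch).
import torch
import torch_npu  # noqa: F401  registers the torch.npu backend

from vllm_ascend.utils import is_310p

print("NPU available:", torch.npu.is_available())
print("Detected 310P SoC:", is_310p())
```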

Run offline inference

```python
from vllm import LLM, SamplingParams

# Chinese yes/no prompts, e.g. "Is the boiling point of water 100 °C? Answer yes or no."
prompts = ["水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。",
           "水的沸点是100摄氏度吗?请回答是或者否。", "若腋下体温为38摄氏度,请问这人是否发烧?请回答是或者否。"]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, top_p=0.95, max_tokens=10)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on 310P
    disable_custom_all_reduce=True,
    trust_remote_code=True,
    tensor_parallel_size=2,
    compilation_config={"custom_ops": ['none', "+rms_norm", "+rotary_embedding"]},
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Co-authored-by: Vincent Yuan [email protected]
Co-authored-by: angazenn [email protected]
Co-authored-by: wangxiyuan [email protected]
Co-authored-by: leo-pony [email protected]
Co-authored-by: shen-shanshan [email protected]

farawayboat and others added 3 commits June 21, 2025 07:01

@codecov

codecov bot commented Jun 20, 2025

Codecov Report

❌ Patch coverage is 18.78173% with 160 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.39%. Comparing base (2009fdb) to head (c4ecef0).
⚠️ Report is 515 commits behind head on main.

Files with missing lines Patch % Lines
vllm_ascend/utils.py 36.06% 39 Missing ⚠️
...d/patch/platform/patch_common/patch_distributed.py 9.52% 38 Missing ⚠️
vllm_ascend/ops/fused_moe.py 3.57% 27 Missing ⚠️
vllm_ascend/attention/attention_v1.py 3.70% 26 Missing ⚠️
vllm_ascend/attention/attention.py 5.55% 17 Missing ⚠️
vllm_ascend/ops/layernorm.py 14.28% 6 Missing ⚠️
vllm_ascend/ops/activation.py 25.00% 3 Missing ⚠️
vllm_ascend/ops/common_fused_moe.py 40.00% 3 Missing ⚠️
vllm_ascend/ops/rotary_embedding.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1333      +/-   ##
==========================================
- Coverage   27.73%   27.39%   -0.34%     
==========================================
  Files          56       56              
  Lines        6004     6191     +187     
==========================================
+ Hits         1665     1696      +31     
- Misses       4339     4495     +156     
Flag Coverage Δ
unittests 27.39% <18.78%> (-0.34%) ⬇️

@Yikun Yikun added long-term-test enable long term test for PR ready-for-test start test by label for PR labels Jun 20, 2025
@Yikun
Collaborator Author

Yikun commented Jun 20, 2025

cc @Angazenn @farawayboat @leo-pony Please help review and test, thanks.

Comment on lines -148 to -149
// Calculate mask for org_vocab range
// org_vocab_mask = (input_ >= org_vocab_start_index) & (input_ < org_vocab_end_index)
Collaborator Author

@farawayboat Why remove these notes?

Contributor

This was deleted during my local code formatting, and it should be added back.

Collaborator Author

OK, let's address these in a new PR.

@Yikun Yikun mentioned this pull request Jun 20, 2025
29 tasks
@@ -0,0 +1,114 @@
name: 'image / openEuler'
Collaborator

Rename to image / 310p-openEuler to make the CI title clearer.

Collaborator Author

Sure, will address this in a new PR.

@@ -0,0 +1,110 @@
name: 'image / Ubuntu'
Collaborator

ditto

@Yikun
Collaborator Author

Yikun commented Jun 21, 2025

I didn't make any changes to these 3 PRs; considering we need to publish the 310P image first, I will merge this.

cc @jianzs @ganyi1996ppo Feel free to review if you have any comments.

For this change we only did manual e2e testing without adding new unit tests, because those need to run in a real 310P image. We will add them next week.

@Yikun
Collaborator Author

Yikun commented Jun 21, 2025

The long term CI also passed, I will merge this now.

@Yikun Yikun marked this pull request as ready for review June 21, 2025 00:57
@Yikun Yikun merged commit 097e714 into vllm-project:main Jun 21, 2025
36 checks passed
@Yikun
Collaborator Author

Yikun commented Jun 21, 2025

Download the 310P image

```shell
# Pull from the mirror first to speed things up; change main to the specific version you want
docker pull quay.nju.edu.cn/ascend/vllm-ascend:main-310p
docker pull quay.io/ascend/vllm-ascend:main-310p
```

Run with 310P:

```shell
export IMAGE=quay.io/ascend/vllm-ascend:main-310p
docker run --rm \
        --name vllm-ascend \
        --device /dev/davinci0 \
        --device /dev/davinci1 \
        --device /dev/davinci2 \
        --device /dev/davinci3 \
        --device /dev/davinci4 \
        --device /dev/davinci5 \
        --device /dev/davinci6 \
        --device /dev/davinci7 \
        --device /dev/davinci_manager \
        --device /dev/devmm_svm \
        --device /dev/hisi_hdc \
        -v /usr/local/dcmi:/usr/local/dcmi \
        -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
        -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
        -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
        -v /etc/ascend_install.info:/etc/ascend_install.info \
        -v /root/.cache:/root/.cache \
        -p 8000:8000 \
        -it $IMAGE bash

# Make sure the commits are included:
cd /vllm-workspace/vllm-ascend/;git log -1;cd -
```

Offline test

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
# Create an LLM.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    max_model_len=4096,
    max_num_seqs=4,
    trust_remote_code=True,
    tensor_parallel_size=1,
    dtype="float16",  # IMPORTANT: some ATB ops do not support bf16 on 310P
    disable_custom_all_reduce=True,  # IMPORTANT: required on 310P
    compilation_config={"custom_ops": ["+rms_norm", "+rotary_embedding"]},  # IMPORTANT: 310P needs these custom ops
)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Yikun added a commit to Yikun/vllm-ascend that referenced this pull request Jun 21, 2025
@Yikun Yikun mentioned this pull request Jun 28, 2025
40 tasks
@AlphaINF

vllm-ascend on the 300I Duo is extremely slow!
We tested it with Qwen3-8B and got only 2.1 token/s during decode.
However, the same model on MindIE reaches 10 token/s.

@farawayboat
Contributor

We have disabled the JIT compiler for the 310P and changed the data format to NZ for the weight in the vocabulary embedding and QKV projection layers. This will improve performance.

diff --git a/vllm_ascend/worker/model_runner_v1.py b/vllm_ascend/worker/model_runner_v1.py
--- a/vllm_ascend/worker/model_runner_v1.py	(revision e1123172d12afa15f306ba6e1e4c9d0c6d1d799e)
+++ b/vllm_ascend/worker/model_runner_v1.py	(date 1751249710816)
@@ -90,6 +90,10 @@
 
 import vllm_ascend.envs as envs_ascend
 
+import torch_npu
+if is_310p():
+    torch_npu.npu.set_compile_mode(jit_compile=False)
+
 
 @dataclass
 class GraphCaptureContext:
@@ -1830,6 +1834,18 @@
 
         with DeviceMemoryProfiler() as m:  # noqa: SIM117
             self.model = get_model(vllm_config=self.vllm_config)
+            from vllm_ascend.utils import is_310p
+            if is_310p():
+                import torch_npu
+                from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
+                                                               QKVParallelLinear,
+                                                               RowParallelLinear)
+                from vllm.model_executor.layers.vocab_parallel_embedding import VocabParallelEmbedding
+                for module in self.model.modules():
+                    if isinstance(module, (VocabParallelEmbedding, MergedColumnParallelLinear,
+                                           QKVParallelLinear, RowParallelLinear)):
+                        module.weight.data = torch_npu.npu_format_cast(module.weight.data, 29)
+
             if hasattr(self, "drafter"):
                 logger.info("Loading drafter model...")
                 if self.use_aux_hidden_state_outputs:
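
For experimenting outside the vLLM worker, the same idea can be written as a standalone helper (an illustrative sketch only, not part of the patch above; it reuses npu_format_cast with format id 29 from the diff and assumes torch_npu is installed and the model is already on an NPU device):

```python
# Standalone illustration of casting weights to the NZ layout (format id 29) on 310P.
import torch_npu  # provides npu_format_cast and the torch.npu backend
from torch import nn


def cast_weights_to_nz(model: nn.Module) -> None:
    """Cast Linear/Embedding weights that live on an NPU device to the NZ format."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)) and module.weight.device.type == "npu":
            module.weight.data = torch_npu.npu_format_cast(module.weight.data, 29)
```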

@leo-pony
Collaborator

leo-pony commented Jul 1, 2025

We have disabled the JIT compiler for the 310P and changed the data format to NZ for the weight in the vocabulary embedding and QKV projection layers. This will improve performance.

Qwen3-0.6B performance:
Tested five times; about a 200%-300% improvement.
--compilation-config set to: "all", "+rms_norm", "+rotary_embedding"
1st: Mean TPOT (ms): 885.89
2nd: Mean TPOT (ms): 567.10
3rd: Mean TPOT (ms): 136.19
4th: Mean TPOT (ms): 621.62
5th: Mean TPOT (ms): 636.84

--compilation-config set to: '{"custom_ops":["none", "+rms_norm", "+rotary_embedding"]}'
1st: Mean TPOT (ms): 182.84
2nd: Mean TPOT (ms): 130.00
3rd: Mean TPOT (ms): 145.05
4th: Mean TPOT (ms): 182.84
5th: Mean TPOT (ms): 189.21

With the code optimization above:
1st: Mean TPOT (ms): 66.35
2nd: Mean TPOT (ms): 64.19
3rd: Mean TPOT (ms): 63.71
4th: Mean TPOT (ms): 63.30
5th: Mean TPOT (ms): 63.06

============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 16.22
Total input tokens: 2000
Total generated tokens: 1280
Request throughput (req/s): 0.62
Output token throughput (tok/s): 78.92
Total Token throughput (tok/s): 202.24
---------------Time to First Token----------------
Mean TTFT (ms): 106.26
Median TTFT (ms): 106.71
P99 TTFT (ms): 136.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 63.06
Median TPOT (ms): 63.27
P99 TPOT (ms): 63.49
---------------Inter-token Latency----------------
Mean ITL (ms): 63.06
Median ITL (ms): 63.10
P99 ITL (ms): 69.43

@AlphaINF

AlphaINF commented Jul 4, 2025

We have disabled the JIT compiler for the 310P and changed the data format to NZ for the weight in the vocabulary embedding and QKV projection layers. This will improve performance.

I tested it on Qwen3-8B: throughput rose from 2.1 token/s to 9 token/s!
Besides, I tested it on Qwen2.5-VL-3B (which MindIE does not support) and recorded 12 token/s.

Yikun pushed a commit that referenced this pull request Jul 5, 2025
… 300I series (#1591)

### What this PR does / why we need it?

Since running on Atlas 300I Duo was initially supported in #1333,
this PR disables the JIT compiler for the 310P and changes the data
format to NZ for the weights in the vocabulary embedding and QKV
projection layers, which helps improve performance.

See #1563 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Test manually:
#1591 (comment)

Signed-off-by: Vincent Yuan <[email protected]>
@Yikun Yikun mentioned this pull request Jul 13, 2025
15 tasks
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
…llm-project#1333)

chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
… 300I series (vllm-project#1591)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
…llm-project#1333)

Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
… 300I series (vllm-project#1591)