[Bugfix][Worker] Clear NPU memory between test profiling #989

shen-shanshan · 2025-05-28T09:08:50Z

What this PR does / why we need it?

Clear NPU memory between profiling runs in tests.

If we don't clear the memory this way, CI may hit OOM errors between unit tests (UTs).

This modification is adapted from vllm/worker/worker.py; for more details, see https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L184-L186.

Does this PR introduce any user-facing change?

How was this patch tested?

MengqingCao · 2025-05-28T09:29:55Z

Thanks for this fix! I think we can directly move the gc collection and the peak memory stats reset into NPUPlatform.empty_cache().

shen-shanshan · 2025-06-03T06:33:55Z

> Thanks for this fix! I think we can directly move the gc collection and the peak memory stats reset into NPUPlatform.empty_cache().

I have combined the three calls into clear_npu_memory() in platform.

shen-shanshan · 2025-06-03T06:38:57Z

@wangxiyuan CI has passed, and this PR may be needed for #969.

MengqingCao · 2025-06-03T06:41:31Z

> Thanks for this fix! I think we can directly move the gc collection and the peak memory stats reset into NPUPlatform.empty_cache().
>
> I have combined the three calls into clear_npu_memory() in platform.

Why not just do it in https://github.com/vllm-project/vllm-ascend/pull/989/files#diff-efe99563375ba645b1c1befb237632c2c91f85b21d2ab2d1095eef82aecd3999L106 ?

shen-shanshan · 2025-06-03T06:51:40Z

> Thanks for this fix! I think we can directly move the gc collection and the peak memory stats reset into NPUPlatform.empty_cache().
>
> I have combined the three calls into clear_npu_memory() in platform.
>
> Why not just do it in https://github.com/vllm-project/vllm-ascend/pull/989/files#diff-efe99563375ba645b1c1befb237632c2c91f85b21d2ab2d1095eef82aecd3999L106 ?

Because there may be other scenarios that only need empty_cache() and do not need the other two calls.

wangxiyuan · 2025-06-03T07:03:26Z

Please merge this PR and #969 into one.

github-actions · 2025-06-04T08:26:32Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Potabk · 2025-06-05T06:45:21Z

Hope this can be merged; sleep mode v1 needs this.

shen-shanshan · 2025-06-05T07:03:36Z

> Hope this can be merged; sleep mode v1 needs this.

I will rebase soon.

momo609 · 2025-06-12T04:51:42Z

vllm_ascend/worker/worker.py

        # Profile the memory usage of the model and get the maximum number of
        # cache blocks that can be allocated with the remaining free memory.
-        NPUPlatform.empty_cache()
+        clear_npu_memory()


What is the difference between the two memory clear methods?

shen-shanshan · 2025-06-16

@momo609 clear_npu_memory() = gc.collect() + empty_cache() + reset_peak_memory_stats(), which makes the memory profiling in the dummy run more accurate.
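To illustrate the three-step composition described in this reply, here is a minimal sketch that runs without NPU hardware. `FakeDevice` and `clear_memory()` are hypothetical names invented for this example, not the code in this PR; on a real Ascend system the two device calls would presumably be the `empty_cache()` and `reset_peak_memory_stats()` functions that torch_npu provides, which is an assumption here. The toy allocator only shows why a stale peak value from an earlier run would inflate a later measurement if it were not reset.

```python
import gc


class FakeDevice:
    """Toy stand-in for an accelerator allocator (hypothetical, not a real NPU API)."""

    def __init__(self) -> None:
        self.allocated = 0  # bytes held by live tensors
        self.cached = 0     # bytes released by tensors but still held by the allocator
        self.peak = 0       # high-water mark of `allocated`

    def alloc(self, n: int) -> None:
        self.allocated += n
        self.peak = max(self.peak, self.allocated)

    def free(self, n: int) -> None:
        self.allocated -= n
        self.cached += n

    def empty_cache(self) -> None:
        self.cached = 0

    def reset_peak_memory_stats(self) -> None:
        self.peak = self.allocated


def clear_memory(device: FakeDevice) -> None:
    """Hypothetical analogue of clear_npu_memory(): the three steps, in order."""
    gc.collect()                      # drop unreachable Python objects first
    device.empty_cache()              # hand cached blocks back to the device
    device.reset_peak_memory_stats()  # forget the previous run's high-water mark


dev = FakeDevice()
dev.alloc(800)     # an earlier test allocates 800 bytes ...
dev.free(800)      # ... and releases them into the allocator cache
clear_memory(dev)  # without this call, dev.peak would still read 800 below
dev.alloc(300)     # the profiling run being measured
print(dev.peak, dev.cached)
```

Keeping `empty_cache()` available as a separate call, as discussed earlier in the thread, lets callers that only need to release cached blocks skip the garbage collection and the stats reset.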

shen-shanshan · 2025-06-18T01:25:03Z

CC: @wangxiyuan

Signed-off-by: shen-shanshan <[email protected]>

codecov · 2025-06-23T12:31:57Z

Codecov Report

❌ Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.34%. Comparing base (c30ddb8) to head (7c433f1).
⚠️ Report is 563 commits behind head on main.

Files with missing lines	Patch %	Lines
vllm_ascend/utils.py	40.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #989      +/-   ##
==========================================
- Coverage   27.39%   27.34%   -0.05%     
==========================================
  Files          56       56              
  Lines        6191     6183       -8     
==========================================
- Hits         1696     1691       -5     
+ Misses       4495     4492       -3

Flag	Coverage Δ
unittests	`27.34% <40.00%> (-0.05%)`	⬇️


shen-shanshan mentioned this pull request May 28, 2025

[Bugfix][CI] Update guided decoding backend list #969

Closed

github-actions bot added the module:core label May 29, 2025

shen-shanshan force-pushed the bugfix-2 branch from 7a8b57a to e3be772 Compare June 3, 2025 01:34

shen-shanshan force-pushed the bugfix-2 branch from e3be772 to d92cee4 Compare June 4, 2025 02:19

github-actions bot added the module:tests label Jun 4, 2025

wangxiyuan mentioned this pull request Jun 4, 2025

[release] 0.9.0rc1 release checklist #904

Closed

76 tasks

github-actions bot added the merge-conflicts label Jun 4, 2025

shen-shanshan force-pushed the bugfix-2 branch from 6296cce to 4ec38d9 Compare June 5, 2025 07:10

github-actions bot removed the merge-conflicts label Jun 5, 2025

github-actions bot added the merge-conflicts label Jun 5, 2025

shen-shanshan force-pushed the bugfix-2 branch from 4ec38d9 to 906ace1 Compare June 6, 2025 01:39

github-actions bot removed the merge-conflicts label Jun 6, 2025

shen-shanshan force-pushed the bugfix-2 branch 2 times, most recently from f56d491 to c509396 Compare June 6, 2025 09:10

github-actions bot added the merge-conflicts label Jun 6, 2025

shen-shanshan force-pushed the bugfix-2 branch from b777a29 to 6784aad Compare June 9, 2025 01:40

github-actions bot removed the merge-conflicts label Jun 9, 2025

github-actions bot added the merge-conflicts label Jun 9, 2025

github-actions bot added the merge-conflicts label Jun 11, 2025

shen-shanshan force-pushed the bugfix-2 branch from 22f398b to 1e26b53 Compare June 11, 2025 02:25

github-actions bot added merge-conflicts and removed merge-conflicts labels Jun 11, 2025

momo609 reviewed Jun 12, 2025

shen-shanshan force-pushed the bugfix-2 branch from 1e26b53 to 1f0392a Compare June 16, 2025 09:12

github-actions bot added merge-conflicts and removed merge-conflicts labels Jun 16, 2025

shen-shanshan force-pushed the bugfix-2 branch from 8e35549 to a49b8bd Compare June 17, 2025 07:50

github-actions bot added merge-conflicts and removed merge-conflicts labels Jun 17, 2025

shen-shanshan force-pushed the bugfix-2 branch from a49b8bd to 99be692 Compare June 17, 2025 09:54

github-actions bot removed the merge-conflicts label Jun 17, 2025

shen-shanshan force-pushed the bugfix-2 branch from 99be692 to 87494ec Compare June 20, 2025 07:09

github-actions bot removed the module:tests label Jun 20, 2025

github-actions bot added the merge-conflicts label Jun 21, 2025

shen-shanshan added 2 commits June 23, 2025 12:04

add clear_npu_memory() method

be58ae1

Signed-off-by: shen-shanshan <[email protected]>

rebase

7c433f1

Signed-off-by: shen-shanshan <[email protected]>

shen-shanshan force-pushed the bugfix-2 branch from 87494ec to 7c433f1 Compare June 23, 2025 12:05

shen-shanshan removed the merge-conflicts label Jun 23, 2025

github-actions bot added the merge-conflicts label Jun 27, 2025

shen-shanshan closed this Jun 27, 2025
