
Conversation

@mgoin (Member) commented Jul 30, 2025

@mgoin added the ci-failure (Issue about an unexpected test failure in CI) label Jul 30, 2025
@github-actions bot commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mgoin changed the title from "Fix CI OOM for test_shared_storage_connector_hashes" to "[CI Bugfix] Fix CI OOM for test_shared_storage_connector_hashes" Jul 30, 2025
@mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 30, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request aims to fix a recurring out-of-memory (OOM) error in the test_shared_storage_connector_hashes CI test. The changes add gpu_memory_utilization=0.4 and enforce_eager=True to the EngineArgs for this specific test. This is a sound approach to reducing memory consumption in a resource-constrained CI environment: it limits the KV cache size and disables CUDA graphs, which can be memory-intensive. The changes are localized to the test and appear correct.
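
For reference, a minimal sketch of the change this review describes, assuming the test constructs vLLM's EngineArgs directly (the model name below is a placeholder, not the one the test actually uses):

```python
# Sketch of the OOM fix: cap the GPU memory budget so the pre-allocated
# KV cache shrinks, and run eagerly so no CUDA graphs are captured.
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="facebook/opt-125m",   # placeholder model for illustration
    gpu_memory_utilization=0.4,  # use at most ~40% of GPU memory
    enforce_eager=True,          # skip CUDA graph capture
)
```

Both knobs are standard EngineArgs fields: lowering gpu_memory_utilization shrinks the KV cache vLLM pre-allocates at startup, and enforce_eager avoids the extra memory held by CUDA graph capture.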

@mergify bot added the v1 label Jul 30, 2025
Signed-off-by: mgoin <[email protected]>
@DarkLight1337 (Member) left a comment


Thanks for fixing

@DarkLight1337 merged commit 055bd39 into vllm-project:main Jul 31, 2025
44 checks passed
liuyumoye pushed a commit to liuyumoye/vllm that referenced this pull request Jul 31, 2025
```diff
 from vllm.multimodal.utils import encode_image_base64

-MODEL_NAME = "Qwen/Qwen2.5-VL-3B-Instruct"
+MODEL_NAME = "RedHatAI/Qwen2.5-VL-3B-Instruct-quantized.w4a16"
```
@tlrmchlsmth (Member) commented Aug 4, 2025


Looks like this model also breaks the CI unfortunately:

```
[2025-08-03T23:44:26Z] (EngineCore_0 pid=14405) ERROR 08-03 16:44:26 [core.py:683] ValueError: Failed to find a kernel that can implement the WNA16 linear layer. Reasons:
[2025-08-03T23:44:26Z] (EngineCore_0 pid=14405) ERROR 08-03 16:44:26 [core.py:683]  MacheteLinearKernel requires capability 90, current compute capability is 89
[2025-08-03T23:44:26Z] (EngineCore_0 pid=14405) ERROR 08-03 16:44:26 [core.py:683]  AllSparkLinearKernel cannot implement due to: For Ampere GPU, AllSpark does not support group_size = 128. Only group_size = -1 are supported.
[2025-08-03T23:44:26Z] (EngineCore_0 pid=14405) ERROR 08-03 16:44:26 [core.py:683]  MarlinLinearKernel cannot implement due to: Weight output_size_per_partition = 6840 is not divisible by min_thread_n = 64. Consider reducing tensor_parallel_size or running with --quantization gptq.
[2025-08-03T23:44:26Z] (EngineCore_0 pid=14405) ERROR 08-03 16:44:26 [core.py:683]  Dynamic4bitLinearKernel cannot implement due to: Only CPU is supported
[2025-08-03T23:44:26Z] (EngineCore_0 pid=14405) ERROR 08-03 16:44:26 [core.py:683]  BitBLASLinearKernel cannot implement due to: bitblas is not installed. Please install bitblas by running `pip install bitblas>=0.1.0`
[2025-08-03T23:44:26Z] (EngineCore_0 pid=14405) ERROR 08-03 16:44:26 [core.py:683]  ConchLinearKernel cannot implement due to: conch-triton-kernels is not installed, please install it via `pip install conch-triton-kernels` and try again!
[2025-08-03T23:44:26Z] (EngineCore_0 pid=14405) ERROR 08-03 16:44:26 [core.py:683]  ExllamaLinearKernel cannot implement due to: Exllama only supports float16 activations
```
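
For what it's worth, the failure surfaces at engine construction, so a minimal repro sketch looks like the following (model ID taken from the diff above; the exception type seen at the API boundary may differ from the in-engine ValueError):

```python
# On a GPU below SM 9.0 (such as the CI's SM 8.9 card), loading this w4a16
# checkpoint can fail during WNA16 kernel selection, as logged above.
from vllm import LLM

try:
    llm = LLM(model="RedHatAI/Qwen2.5-VL-3B-Instruct-quantized.w4a16")
except Exception as err:
    print(f"Engine init failed: {err}")
```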

A Member commented:

Did we change the device running this test?
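
For context, the log above reports compute capability 8.9 (an Ada-generation part such as an L4 or RTX 4090), while MacheteLinearKernel wants 9.0 (Hopper). A quick way to check what the CI machine actually exposes, assuming a CUDA-enabled PyTorch build:

```python
# Print the visible GPU's name and compute capability; Machete needs
# SM 9.0 (Hopper), but the failing run reported SM 8.9 (Ada).
import torch

major, minor = torch.cuda.get_device_capability()
print(f"{torch.cuda.get_device_name()}: SM {major}.{minor}")
```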

vadiklyutiy pushed a commit to CentML/vllm that referenced this pull request Aug 5, 2025
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
noamgat pushed a commit to noamgat/vllm that referenced this pull request Aug 9, 2025
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

Labels

ci-failure: Issue about an unexpected test failure in CI
ready: ONLY add when PR is ready to merge/full CI is needed
v1

Projects

Status: Done


3 participants