Conversation

@lsy323 (Collaborator) commented Jun 3, 2025

Somehow the test has been hanging. Buildkite log

This makes each TPU CI run take 4 hours; disable it to unblock CI.


tests/v1/entrypoints/llm/test_struct_output_generate.py::test_structured_output_with_reasoning_matrices[Qwen/Qwen3-1.7B-xgrammar-auto-deepseek_r1-None] INFO 06-03 22:27:53 [config.py:822] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
--
  | INFO 06-03 22:27:53 [config.py:1967] Disabled the custom all-reduce kernel because it is not supported on current platform.
  | INFO 06-03 22:27:53 [config.py:2176] Chunked prefill is enabled with max_num_batched_tokens=8192.
  | INFO 06-03 22:27:53 [tpu.py:105] [TPU] Forcing DYNAMO_ONCE compilation level
  | huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
  | To disable this warning, you can either:
  | - Avoid using `tokenizers` before the fork if possible
  | - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  | INFO 06-03 22:27:55 [core.py:455] Waiting for init message from front-end.
  | INFO 06-03 22:27:55 [tpu.py:105] [TPU] Forcing DYNAMO_ONCE compilation level
  | INFO 06-03 22:27:55 [core.py:70] Initializing a V1 LLM engine (v0.9.1.dev143+gfa98d7777) with config: model='Qwen/Qwen3-1.7B', speculative_config=None, tokenizer='Qwen/Qwen3-1.7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=None, decoding_config=DecodingConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=True, disable_additional_properties=False, reasoning_backend='deepseek_r1'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-1.7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":2,"debug_dump_path":"","cache_dir":"","backend":"openxla","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
  | INFO 06-03 22:27:55 [tpu_worker.py:294] tpu_commons not found, using vLLM's TPUWorker.
  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
  | INFO 06-03 22:27:55 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
  | WARNING 06-03 22:28:01 [tpu.py:178] Pin memory is not supported on TPU.
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1574] Using exponential token paddings:
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     16
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     32
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     64
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     128
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     256
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     512
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     1024
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     2048
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     4096
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1576]     8192
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1540] Preparing request paddings:
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1547]     8
  | INFO 06-03 22:28:01 [tpu_model_runner.py:1547]     16
  | INFO 06-03 22:28:01 [tpu_model_runner.py:969] Loading model from scratch...
  | INFO 06-03 22:28:01 [tpu.py:51] Cannot use None backend on TPU.
  | INFO 06-03 22:28:01 [tpu.py:54] Using Pallas V1 backend.
  | INFO 06-03 22:28:02 [weight_utils.py:292] Using model weights format ['*.safetensors']


Signed-off-by: Siyuan Liu <[email protected]>
github-actions bot commented Jun 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist bot (Contributor) left a comment

Hello @lsy323, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team, gemini-code-assist here to provide a summary of this pull request. This PR addresses an issue in the TPU CI pipeline where a specific test, test_structured_output_with_reasoning_matrices within test_struct_output_generate.py, has been hanging. This hang is causing the TPU CI runs to take an excessive amount of time (around 4 hours). To unblock the CI and allow other changes to be tested efficiently, this pull request temporarily skips the problematic test case in the TPU CI script.

Highlights

  • CI Fix: Skips a specific test (test_structured_output_with_reasoning_matrices) in the TPU CI buildkite script because it is currently hanging and significantly increasing CI run times.

Changelog

  • .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh
    • Modified the pytest command for test_struct_output_generate.py (line 153) to exclude the test_structured_output_with_reasoning_matrices test function using the -k 'not ...' flag.
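For readers unfamiliar with the mechanism: `-k 'not <name>'` filters matching tests out at pytest's collection phase, so the hanging test is never even started. A minimal standalone sketch of the same deselection (the demo file in /tmp and its test names are illustrative stand-ins, not the real vLLM tests):

```shell
# Write a throwaway test file; the second test stands in for the one
# that hangs on TPU CI.
cat > /tmp/demo_struct_output_test.py <<'EOF'
def test_fast():
    assert True

def test_structured_output_with_reasoning_matrices():
    assert True  # stands in for the hanging test
EOF

# -k deselects matching tests at collection time; pytest reports
# "1 passed, 1 deselected" and the excluded test body never runs.
python3 -m pytest -q /tmp/demo_struct_output_test.py \
    -k 'not test_structured_output_with_reasoning_matrices'
```

The same expression syntax supports `and`/`or`, so several flaky tests could be excluded with one `-k` string if needed.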

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Test hangs, CI waits,
Four hours pass, build fails slow,
Skip the test, move on.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the ci/build label Jun 3, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This PR provides a clear and effective solution to unblock the TPU CI by skipping the identified hanging test (test_structured_output_with_reasoning_matrices). The change is well-targeted and the PR description clearly explains the rationale. My main suggestion focuses on adding a comment to the script for future maintainability. Thank you for addressing this CI issue promptly!

Summary of Findings

  • In-script Documentation for Test Skip: To improve long-term maintainability, it's beneficial to add a comment in the CI script explaining why a specific test is being skipped. This comment should ideally reference the PR or an issue tracking the problem, providing context for future developers and facilitating future re-evaluation of the skip.

Merge Readiness

This pull request effectively addresses the immediate issue of CI hangs by skipping the problematic test. The change is minimal and targeted.
To enhance maintainability, I've suggested adding a comment in the script to document the reason for this test skip. This is a medium-severity suggestion aimed at ensuring the context for this temporary measure isn't lost over time.

While I am not authorized to approve pull requests, I recommend addressing this suggestion to improve the script's clarity. The PR is otherwise a sensible step to unblock the CI pipeline. It's also important that the underlying issue causing the test to hang is tracked and investigated separately, with the goal of eventually re-enabling this test.

"python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py"
run_and_track_test 11 "test_struct_output_generate.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py"
"python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k 'not test_structured_output_with_reasoning_matrices'"
@gemini-code-assist bot (Contributor) commented (severity: medium):

This modification to skip the test test_structured_output_with_reasoning_matrices is a good immediate step to unblock the CI pipeline, given the hanging issue described.

For long-term maintainability and context for other developers (or your future self!), would you consider adding a shell comment immediately before this run_and_track_test invocation or on the line before this specific command? This comment could briefly explain why the test is skipped and reference this PR or a tracking issue for the hang.

For example:

# Temporarily skipping test_structured_output_with_reasoning_matrices due to CI hangs.
# See PR #<this_pr_number> or issue #<issue_number_if_any> for details.
# Original command: python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py
run_and_track_test 11 "test_struct_output_generate.py" \
    "python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k 'not test_structured_output_with_reasoning_matrices'"

Adding such a comment would make it easier to track and eventually revisit this skip. What are your thoughts on this?

@mgoin mgoin added tpu Related to Google TPUs ready ONLY add when PR is ready to merge/full CI is needed labels Jun 3, 2025
@mgoin (Member) left a comment

Thanks for finding the current issue

@mergify mergify bot removed the tpu Related to Google TPUs label Jun 3, 2025
@yaochengji (Collaborator) left a comment

LGTM, thanks!

@yaochengji (Collaborator) commented:

Oh, we got


5.200       ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
--
  | 5.200           torch==2.7.0 from https://files.pythonhosted.org/packages/cc/2c/91d1de65573fce563f5284e69d9c56b57289625cffbbb6d533d5d56c36a5/torch-2.7.0-cp310-cp310-manylinux_2_28_x86_64.whl:
  | 5.200               Expected sha256 0b9960183b6e5b71239a3e6c883d8852c304e691c0b2955f7045e8a6d05b9183
  | 5.200                    Got        c74a63dbe482e161469797fb5b97adf97b05e3265b52855964420a343acad996


when building the Docker image
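As a side note for anyone debugging this locally: pip's hash check is just a sha256 digest over the downloaded wheel, so a mismatch like the one above can be reproduced (or ruled out) by hashing the downloaded file directly. A minimal sketch, using a throwaway file in place of the real torch wheel (the /tmp path and the "pinned digest" comparison are illustrative, not the actual artifacts from this build):

```shell
# Sketch: reproduce pip's hash check with sha256sum.
# /tmp/fake_wheel.whl stands in for the downloaded torch wheel; a real
# check would hash the file pip fetched and compare against the digest
# pinned in the requirements file.
printf 'not a real wheel' > /tmp/fake_wheel.whl
actual=$(sha256sum /tmp/fake_wheel.whl | awk '{print $1}')
echo "sha256: ${actual}"
# If this differs from the pinned digest, the download was corrupted or
# tampered with (or, as suspected here, a flaky mirror/cache served a
# different file).
```

If the locally computed digest matches the pinned one on a retry, that points at a transient download problem rather than a genuinely changed package.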

@mergify mergify bot added v1 tpu Related to Google TPUs labels Jun 4, 2025
@lsy323 (Collaborator, Author) commented Jun 4, 2025

Close this one, putting it together with #19108; both are fixing the CI issues at HEAD.

@lsy323 (Collaborator, Author) commented Jun 4, 2025

> Oh, we got `ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE` (torch==2.7.0 sha256 mismatch) when building the docker

Looks like a flaky issue; I didn't hit this in another PR, #19108.

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) June 4, 2025 06:07
@vllm-bot vllm-bot merged commit 8e972d9 into vllm-project:main Jun 4, 2025
39 of 40 checks passed
Labels

ci/build · ready (ONLY add when PR is ready to merge/full CI is needed) · tpu (Related to Google TPUs) · v1
