-
-
Couldn't load subscription status.
- Fork 10.8k
[TPU] Skip hanging tests #19115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TPU] Skip hanging tests #19115
Conversation
Signed-off-by: Siyuan Liu <[email protected]>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @lsy323, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team, gemini-code-assist here to provide a summary of this pull request. This PR addresses an issue in the TPU CI pipeline where a specific test, test_structured_output_with_reasoning_matrices within test_struct_output_generate.py, has been hanging. This hang is causing the TPU CI runs to take an excessive amount of time (around 4 hours). To unblock the CI and allow other changes to be tested efficiently, this pull request temporarily skips the problematic test case in the TPU CI script.
Highlights
- CI Fix: Skips a specific test (
test_structured_output_with_reasoning_matrices) in the TPU CI buildkite script because it is currently hanging and significantly increasing CI run times.
Changelog
- .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh
- Modified the
pytestcommand fortest_struct_output_generate.py(line 153) to exclude thetest_structured_output_with_reasoning_matricestest function using the-k 'not ...'flag.
- Modified the
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Test hangs, CI waits,
Four hours pass, build fails slow,
Skip the test, move on.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This PR provides a clear and effective solution to unblock the TPU CI by skipping the identified hanging test (test_structured_output_with_reasoning_matrices). The change is well-targeted and the PR description clearly explains the rationale. My main suggestion focuses on adding a comment to the script for future maintainability. Thank you for addressing this CI issue promptly!
Summary of Findings
- In-script Documentation for Test Skip: To improve long-term maintainability, it's beneficial to add a comment in the CI script explaining why a specific test is being skipped. This comment should ideally reference the PR or an issue tracking the problem, providing context for future developers and facilitating future re-evaluation of the skip.
Merge Readiness
This pull request effectively addresses the immediate issue of CI hangs by skipping the problematic test. The change is minimal and targeted.
To enhance maintainability, I've suggested adding a comment in the script to document the reason for this test skip. This is a medium-severity suggestion aimed at ensuring the context for this temporary measure isn't lost over time.
While I am not authorized to approve pull requests, I recommend addressing this suggestion to improve the script's clarity. The PR is otherwise a sensible step to unblock the CI pipeline. It's also important that the underlying issue causing the test to hang is tracked and investigated separately, with the goal of eventually re-enabling this test.
| "python3 -m pytest -s -v /workspace/vllm/tests/v1/tpu/test_pallas.py" | ||
| run_and_track_test 11 "test_struct_output_generate.py" \ | ||
| "python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py" | ||
| "python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k 'not test_structured_output_with_reasoning_matrices'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This modification to skip the test test_structured_output_with_reasoning_matrices is a good immediate step to unblock the CI pipeline, given the hanging issue described.
For long-term maintainability and context for other developers (or your future self!), would you consider adding a shell comment immediately before this run_and_track_test invocation or on the line before this specific command? This comment could briefly explain why the test is skipped and reference this PR or a tracking issue for the hang.
For example:
# Temporarily skipping test_structured_output_with_reasoning_matrices due to CI hangs.
# See PR #<this_pr_number> or issue #<issue_number_if_any> for details.
# Original command: python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py
run_and_track_test 11 "test_struct_output_generate.py" \
"python3 -m pytest -s -v /workspace/vllm/tests/v1/entrypoints/llm/test_struct_output_generate.py -k 'not test_structured_output_with_reasoning_matrices'"Adding such a comment would make it easier to track and eventually revisit this skip. What are your thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding the current issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
|
Oh, we got when building the docker |
Signed-off-by: Siyuan Liu <[email protected]>
|
close this one, putting this with #19108 together, both are fixing the CI issues are HEAD. |
Looks like flaky issue, didn't hit this in another PR #19108 |
Somehow the test have been hanging. Buildkite log
This makes each TPU CI run take 4 hr, disable it to unblock CI.