-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Revert "tests: add timeout_manager to tensorrt flow test cases (#5942)" #6309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis change set removes all custom timeout management logic and utilities from the integration test suite. It eliminates the Changes
Sequence Diagram(s)sequenceDiagram
participant TestFunction
participant UtilityFunction
participant Subprocess
TestFunction->>UtilityFunction: Call utility (no timeout)
UtilityFunction->>Subprocess: Run subprocess (no timeout arg)
Subprocess-->>UtilityFunction: Return result
UtilityFunction-->>TestFunction: Return result
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~15–20 minutes Suggested reviewers
Poem
Note ⚡️ Unit Test Generation - BetaCodeRabbit's unit test generation is now available in Beta! Automatically generate comprehensive unit tests for your code changes, ensuring better test coverage and catching edge cases you might miss. Our AI analyzes your code structure and creates tests that follow best practices and your project's testing patterns. Learn more here, or just try it under ✨ Finishing Touches. ✨ Finishing Touches
🧪 Generate unit tests
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
/bot run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tests/integration/defs/accuracy/accuracy_core.py (1)
700-707
:extra_mmlu_args
disappeared from the public API – breaks caller’s ability to forward MMLU flags
initialize_case()
still expectsextra_mmlu_args
, andmmlu()
later consumesself.extra_mmlu_args
.
After the signature rewrite, there is no way for callers to supply that argument, so any external test harness that previously relied on it silently loses functionality.@@ - extra_convert_args: Optional[list] = None, - extra_build_args: Optional[list] = None, - extra_summarize_args: Optional[list] = None, - extra_eval_long_context_args: Optional[list] = None, + extra_convert_args: Optional[list] = None, + extra_build_args: Optional[list] = None, + extra_summarize_args: Optional[list] = None, + extra_mmlu_args: Optional[list] = None, + extra_eval_long_context_args: Optional[list] = None, env: Optional[Dict[str, str]] = None): @@ - extra_eval_long_context_args=extra_eval_long_context_args, + extra_mmlu_args=extra_mmlu_args, + extra_eval_long_context_args=extra_eval_long_context_args,Add the parameter (with default
None
) and forward it toinitialize_case
to restore parity with the rest of the class.
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/accuracy_core.py (1)
704-723
: Re-entrance/timeout logic removal acknowledged, but consider minimal safeguardThe method now fires the full pipeline serially with no outer try/except/finally or timeout envelope.
Given the heavy external processes (pip
,mpirun
,trtllm-build
…), a single failure will raise and leave the workspace artefacts intmp
, which may bloat disk on CI.A tiny guard helps:
self.install_requirements() - self.initialize_case( + try: + self.initialize_case( ... env=env) - self.convert() - self.build() - self.evaluate() + self.convert() + self.build() + self.evaluate() + finally: + logger.info("CliFlowAccuracyTestHarness finished – workspace cleaned")Not strictly blocking, but improves robustness now that timeouts are gone.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
tests/integration/defs/accuracy/accuracy_core.py
(1 hunks)tests/integration/defs/accuracy/test_cli_flow.py
(3 hunks)tests/integration/defs/common.py
(5 hunks)tests/integration/defs/conftest.py
(0 hunks)tests/integration/defs/examples/test_commandr.py
(3 hunks)tests/integration/defs/examples/test_exaone.py
(4 hunks)tests/integration/defs/examples/test_gpt.py
(1 hunks)tests/integration/defs/examples/test_llama.py
(4 hunks)tests/integration/defs/trt_test_alternative.py
(4 hunks)tests/integration/defs/utils/__init__.py
(0 hunks)tests/integration/defs/utils/timeout_manager.py
(0 hunks)tests/integration/test_lists/qa/examples_test_list.txt
(1 hunks)
💤 Files with no reviewable changes (3)
- tests/integration/defs/utils/init.py
- tests/integration/defs/conftest.py
- tests/integration/defs/utils/timeout_manager.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (30)
tests/integration/test_lists/qa/examples_test_list.txt (2)
18-18
: LGTM: Timeout annotation removed consistently.The removal of the TIMEOUT (120) annotation for
test_llm_commandr_plus_4gpus_summary[enable_weight_only]
aligns with the broader removal of timeout management infrastructure.
22-31
: LGTM: Timeout adjustments align with timeout management removal.The changes to
test_llm_enc_dec_general
test variants are consistent:
- Line 22: Timeout reduced from 90 to 60 seconds
- Lines 23-31: Complete removal of TIMEOUT (90) annotations
These changes indicate the tests are expected to run more efficiently or within default timeout limits after the timeout management infrastructure removal.
tests/integration/defs/examples/test_commandr.py (4)
88-88
: LGTM: Function signature cleaned up correctly.The
timeout_manager
parameter has been properly removed from the function signature, aligning with the broader removal of timeout management infrastructure.
94-102
: LGTM: Timeout parameters removed from convert_weights call.The
convert_weights
function call has been properly updated to remove timeout-related parameters while maintaining all the necessary conversion arguments.
124-124
: LGTM: Direct check_call without timeout context.The
check_call
is now executed directly without timeout management context, consistent with the timeout infrastructure removal.
126-140
: LGTM: MPI calls simplified without timeout parameters.Both
venv_mpi_check_call
invocations have been properly updated to remove timeout-related parameters while preserving the essential MPI execution logic.tests/integration/defs/accuracy/test_cli_flow.py (2)
1158-1165
: LGTM: Timeout management removed from test_fp8_tp2pp2.The
timeout_manager
parameter has been correctly removed from both the method signature and theself.run()
call, maintaining all other test parameters and logic.
1175-1186
: LGTM: Timeout management removed from test_int8_plugin_tp8.The
timeout_manager
parameter has been properly eliminated from the method signature and the correspondingself.run()
invocation, preserving the parameterized test structure and all other arguments.tests/integration/defs/common.py (4)
46-63
: Function signature simplified correctly.The removal of
**kwargs
parameter and its forwarding simplifies the function interface. The implementation correctly removes the**kwargs
forwarding in thevenv.run_cmd
call.
66-83
: Function signature simplified correctly.The removal of
**kwargs
parameter is consistent with the timeout management removal. Thevenv.run_cmd
call correctly only passes the explicit parameters.
517-517
: Timeout parameter correctly removed.The
venv_check_call
invocation no longer includes a timeout parameter, which is consistent with the timeout management removal across the test suite.
619-619
: Timeout parameter correctly removed.The
venv_check_call
invocation no longer includes a timeout parameter, maintaining consistency with the timeout management removal.tests/integration/defs/examples/test_exaone.py (2)
36-78
: Timeout management correctly removed.The function signature and all operation calls have been updated to remove timeout management. The core test logic for weight conversion, engine building, and summary execution remains intact. The changes are systematic and maintain test functionality.
90-128
: Timeout management correctly removed from multi-GPU test.The function follows the same pattern as the single GPU test - timeout management removed while preserving the core test logic for multi-GPU execution with MPI. All operations maintain their functionality without timeout constraints.
tests/integration/defs/examples/test_gpt.py (1)
640-688
: Timeout management correctly removed from GPT-3 175B test.The function signature and all operations have been updated to remove timeout management. The checkpoint conversion, engine building with various plugin configurations, and MPI-based inference execution all maintain their original functionality without timeout constraints. The changes are consistent with the broader timeout management removal.
tests/integration/defs/trt_test_alternative.py (4)
211-211
: LGTM: Centralized timeout handlingThe change to use
get_pytest_timeout(timeout)
instead of direct timeout usage aligns with the broader refactoring to centralize timeout determination through pytest markers.
230-232
: LGTM: Simplified function signatureRemoving the explicit
timeout
parameter fromcheck_call
and delegating timeout handling to thecall
function is a clean simplification that aligns with the overall refactoring.
243-243
: LGTM: Consistent timeout handling patternThe use of
get_pytest_timeout(timeout)
and subsequent usage inprocess.communicate()
follows the same consistent pattern established in other functions.Also applies to: 249-249
201-346
: Overall refactoring looks good with proper centralization of timeout handlingThe systematic removal of explicit timeout management and centralization through
get_pytest_timeout()
is well-executed. The changes maintain consistent patterns across all affected functions while simplifying the interfaces.The refactoring successfully:
- Removes redundant timeout parameters from function signatures
- Centralizes timeout determination logic
- Maintains backward compatibility through proper fallbacks
- Includes appropriate error handling
tests/integration/defs/examples/test_llama.py (11)
3028-3030
: LGTM! Function signature correctly updated.The removal of the
timeout_manager
parameter is consistent with the broader timeout management removal described in the PR objectives.
3040-3045
: Clean timeout management removal.The core functionality for generating the evaluation dataset is preserved while cleanly removing timeout management. The subprocess call execution remains intact.
3048-3055
: Consistent timeout removal pattern.The weight conversion functionality is preserved while cleanly removing timeout context management. Changes are consistent with the overall timeout removal approach.
3058-3066
: Engine building functionality preserved.The timeout management removal maintains the engine building logic while eliminating timeout tracking overhead. Core test functionality remains intact.
3069-3083
: Passkey evaluation logic maintained.The core passkey evaluation functionality and MPI execution are preserved while cleanly removing timeout management. Test logic remains intact.
3384-3387
: Function signature consistently updated.The removal of the
timeout_manager
parameter follows the same clean pattern as other function signature changes in this revert.
3406-3416
: Weight conversion logic preserved.Timeout management is cleanly removed while maintaining the core weight conversion functionality and parameters.
3419-3434
: Engine building with conditional logic maintained.The build command execution and conditional gemm_allreduce plugin logic are preserved while cleanly removing timeout management.
3436-3442
: Dataset generation functionality intact.The synthetic dataset generation logic is preserved while cleanly removing timeout context management.
3445-3460
: Multi-process evaluation execution preserved.The MPI-based evaluation execution with multi-process coordination is maintained while removing timeout management overhead.
3463-3468
: MMLU test execution maintained.The MMLU accuracy test functionality is preserved while cleanly removing timeout management from the command execution.
PR_Github #12747 [ run ] triggered by Bot |
PR_Github #12747 [ run ] completed with state |
/bot run |
PR_Github #12772 [ run ] triggered by Bot |
PR_Github #12772 [ run ] completed with state |
…A#5942)" (NVIDIA#6309) Signed-off-by: Ransiki Zhang <[email protected]>
…A#5942)" (NVIDIA#6309) Signed-off-by: Lanyu Liao <[email protected]>
This reverts commit eb5cb5b.
Summary by CodeRabbit
Chores
Refactor
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.