Skip to content

Conversation

Tabrizian
Copy link
Member

@Tabrizian Tabrizian commented Jul 23, 2025

This reverts commit eb5cb5b.

Summary by CodeRabbit

  • Chores

    • Removed all timeout management logic and related utilities from integration tests, including fixtures, context managers, and decorators.
    • Simplified test and helper function signatures by eliminating timeout parameters.
    • Updated test execution lists to reflect removal or adjustment of timeout constraints.
    • Cleaned up utility modules by deleting timeout-related files and exports.
  • Refactor

    • Streamlined test execution flow by removing conditional timeout handling across multiple test files.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Copy link
Contributor

coderabbitai bot commented Jul 23, 2025

Walkthrough

This change set removes all custom timeout management logic and utilities from the integration test suite. It eliminates the TimeoutManager class, its related decorators, pytest fixtures, and all associated context management in test functions. Function signatures and test lists were updated to reflect the removal of timeout parameters and logic.

Changes

File(s) / Path(s) Change Summary
tests/integration/defs/accuracy/accuracy_core.py Removed timeout management logic from CliFlowAccuracyTestHarness.run; dropped timeout_manager parameter.
tests/integration/defs/accuracy/test_cli_flow.py Removed timeout_manager from test method signatures and calls.
tests/integration/defs/common.py Removed **kwargs and timeout handling from utility function signatures and calls.
tests/integration/defs/conftest.py Deleted pytest fixtures for timeout extraction and management.
tests/integration/defs/examples/test_commandr.py Removed all usage of timeout_manager from test function, including context managers and arguments.
tests/integration/defs/examples/test_exaone.py Removed timeout_manager parameter and all timeout context usage from test functions.
tests/integration/defs/examples/test_gpt.py Removed timeout_manager parameter and all timeout context usage from test function.
tests/integration/defs/examples/test_llama.py Removed timeout_manager parameter and all timeout context usage from test functions.
tests/integration/defs/trt_test_alternative.py Simplified timeout logic; removed command-line timeout parsing; now only uses pytest marker-based timeouts.
tests/integration/defs/utils/__init__.py Deleted file; removed exports for timeout management utilities.
tests/integration/defs/utils/timeout_manager.py Deleted file; removed all timeout management classes, functions, and decorators.
tests/integration/test_lists/qa/examples_test_list.txt Updated test timeout annotations, enabling or reducing timeouts for specific tests.

Sequence Diagram(s)

sequenceDiagram
    participant TestFunction
    participant UtilityFunction
    participant Subprocess

    TestFunction->>UtilityFunction: Call utility (no timeout)
    UtilityFunction->>Subprocess: Run subprocess (no timeout arg)
    Subprocess-->>UtilityFunction: Return result
    UtilityFunction-->>TestFunction: Return result
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–20 minutes

Suggested reviewers

  • kaiyux
  • litaotju
  • Shixiaowei02

Poem

🐇
No more ticking clocks or timeout fright,
The tests now run both day and night.
TimeoutManager hops away,
Leaving code less gray.
Simpler flows, less stress—what a delight!

— A jubilant rabbit

Note

⚡️ Unit Test Generation - Beta

CodeRabbit's unit test generation is now available in Beta! Automatically generate comprehensive unit tests for your code changes, ensuring better test coverage and catching edge cases you might miss. Our AI analyzes your code structure and creates tests that follow best practices and your project's testing patterns. Learn more here, or just try it under ✨ Finishing Touches.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Tabrizian Tabrizian requested a review from tburt-nv July 23, 2025 20:42
@Tabrizian Tabrizian enabled auto-merge (squash) July 23, 2025 20:45
@Tabrizian
Copy link
Member Author

/bot run

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
tests/integration/defs/accuracy/accuracy_core.py (1)

700-707: extra_mmlu_args disappeared from the public API – breaks caller’s ability to forward MMLU flags

initialize_case() still expects extra_mmlu_args, and mmlu() later consumes self.extra_mmlu_args.
After the signature rewrite, there is no way for callers to supply that argument, so any external test harness that previously relied on it silently loses functionality.

@@
-            extra_convert_args: Optional[list] = None,
-            extra_build_args: Optional[list] = None,
-            extra_summarize_args: Optional[list] = None,
-            extra_eval_long_context_args: Optional[list] = None,
+            extra_convert_args: Optional[list] = None,
+            extra_build_args: Optional[list] = None,
+            extra_summarize_args: Optional[list] = None,
+            extra_mmlu_args: Optional[list] = None,
+            extra_eval_long_context_args: Optional[list] = None,
             env: Optional[Dict[str, str]] = None):
@@
-            extra_eval_long_context_args=extra_eval_long_context_args,
+            extra_mmlu_args=extra_mmlu_args,
+            extra_eval_long_context_args=extra_eval_long_context_args,

Add the parameter (with default None) and forward it to initialize_case to restore parity with the rest of the class.

🧹 Nitpick comments (1)
tests/integration/defs/accuracy/accuracy_core.py (1)

704-723: Re-entrance/timeout logic removal acknowledged, but consider minimal safeguard

The method now fires the full pipeline serially with no outer try/except/finally or timeout envelope.
Given the heavy external processes (pip, mpirun, trtllm-build…), a single failure will raise and leave the workspace artefacts in tmp, which may bloat disk on CI.

A tiny guard helps:

         self.install_requirements()
-        self.initialize_case(
+        try:
+            self.initialize_case(
               ...
             env=env)
-        self.convert()
-        self.build()
-        self.evaluate()
+            self.convert()
+            self.build()
+            self.evaluate()
+        finally:
+            logger.info("CliFlowAccuracyTestHarness finished – workspace cleaned")

Not strictly blocking, but improves robustness now that timeouts are gone.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf4f4e8 and a4d403b.

📒 Files selected for processing (12)
  • tests/integration/defs/accuracy/accuracy_core.py (1 hunks)
  • tests/integration/defs/accuracy/test_cli_flow.py (3 hunks)
  • tests/integration/defs/common.py (5 hunks)
  • tests/integration/defs/conftest.py (0 hunks)
  • tests/integration/defs/examples/test_commandr.py (3 hunks)
  • tests/integration/defs/examples/test_exaone.py (4 hunks)
  • tests/integration/defs/examples/test_gpt.py (1 hunks)
  • tests/integration/defs/examples/test_llama.py (4 hunks)
  • tests/integration/defs/trt_test_alternative.py (4 hunks)
  • tests/integration/defs/utils/__init__.py (0 hunks)
  • tests/integration/defs/utils/timeout_manager.py (0 hunks)
  • tests/integration/test_lists/qa/examples_test_list.txt (1 hunks)
💤 Files with no reviewable changes (3)
  • tests/integration/defs/utils/init.py
  • tests/integration/defs/conftest.py
  • tests/integration/defs/utils/timeout_manager.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (30)
tests/integration/test_lists/qa/examples_test_list.txt (2)

18-18: LGTM: Timeout annotation removed consistently.

The removal of the TIMEOUT (120) annotation for test_llm_commandr_plus_4gpus_summary[enable_weight_only] aligns with the broader removal of timeout management infrastructure.


22-31: LGTM: Timeout adjustments align with timeout management removal.

The changes to test_llm_enc_dec_general test variants are consistent:

  • Line 22: Timeout reduced from 90 to 60 seconds
  • Lines 23-31: Complete removal of TIMEOUT (90) annotations

These changes indicate the tests are expected to run more efficiently or within default timeout limits after the timeout management infrastructure removal.

tests/integration/defs/examples/test_commandr.py (4)

88-88: LGTM: Function signature cleaned up correctly.

The timeout_manager parameter has been properly removed from the function signature, aligning with the broader removal of timeout management infrastructure.


94-102: LGTM: Timeout parameters removed from convert_weights call.

The convert_weights function call has been properly updated to remove timeout-related parameters while maintaining all the necessary conversion arguments.


124-124: LGTM: Direct check_call without timeout context.

The check_call is now executed directly without timeout management context, consistent with the timeout infrastructure removal.


126-140: LGTM: MPI calls simplified without timeout parameters.

Both venv_mpi_check_call invocations have been properly updated to remove timeout-related parameters while preserving the essential MPI execution logic.

tests/integration/defs/accuracy/test_cli_flow.py (2)

1158-1165: LGTM: Timeout management removed from test_fp8_tp2pp2.

The timeout_manager parameter has been correctly removed from both the method signature and the self.run() call, maintaining all other test parameters and logic.


1175-1186: LGTM: Timeout management removed from test_int8_plugin_tp8.

The timeout_manager parameter has been properly eliminated from the method signature and the corresponding self.run() invocation, preserving the parameterized test structure and all other arguments.

tests/integration/defs/common.py (4)

46-63: Function signature simplified correctly.

The removal of **kwargs parameter and its forwarding simplifies the function interface. The implementation correctly removes the **kwargs forwarding in the venv.run_cmd call.


66-83: Function signature simplified correctly.

The removal of **kwargs parameter is consistent with the timeout management removal. The venv.run_cmd call correctly only passes the explicit parameters.


517-517: Timeout parameter correctly removed.

The venv_check_call invocation no longer includes a timeout parameter, which is consistent with the timeout management removal across the test suite.


619-619: Timeout parameter correctly removed.

The venv_check_call invocation no longer includes a timeout parameter, maintaining consistency with the timeout management removal.

tests/integration/defs/examples/test_exaone.py (2)

36-78: Timeout management correctly removed.

The function signature and all operation calls have been updated to remove timeout management. The core test logic for weight conversion, engine building, and summary execution remains intact. The changes are systematic and maintain test functionality.


90-128: Timeout management correctly removed from multi-GPU test.

The function follows the same pattern as the single GPU test - timeout management removed while preserving the core test logic for multi-GPU execution with MPI. All operations maintain their functionality without timeout constraints.

tests/integration/defs/examples/test_gpt.py (1)

640-688: Timeout management correctly removed from GPT-3 175B test.

The function signature and all operations have been updated to remove timeout management. The checkpoint conversion, engine building with various plugin configurations, and MPI-based inference execution all maintain their original functionality without timeout constraints. The changes are consistent with the broader timeout management removal.

tests/integration/defs/trt_test_alternative.py (4)

211-211: LGTM: Centralized timeout handling

The change to use get_pytest_timeout(timeout) instead of direct timeout usage aligns with the broader refactoring to centralize timeout determination through pytest markers.


230-232: LGTM: Simplified function signature

Removing the explicit timeout parameter from check_call and delegating timeout handling to the call function is a clean simplification that aligns with the overall refactoring.


243-243: LGTM: Consistent timeout handling pattern

The use of get_pytest_timeout(timeout) and subsequent usage in process.communicate() follows the same consistent pattern established in other functions.

Also applies to: 249-249


201-346: Overall refactoring looks good with proper centralization of timeout handling

The systematic removal of explicit timeout management and centralization through get_pytest_timeout() is well-executed. The changes maintain consistent patterns across all affected functions while simplifying the interfaces.

The refactoring successfully:

  • Removes redundant timeout parameters from function signatures
  • Centralizes timeout determination logic
  • Maintains backward compatibility through proper fallbacks
  • Includes appropriate error handling
tests/integration/defs/examples/test_llama.py (11)

3028-3030: LGTM! Function signature correctly updated.

The removal of the timeout_manager parameter is consistent with the broader timeout management removal described in the PR objectives.


3040-3045: Clean timeout management removal.

The core functionality for generating the evaluation dataset is preserved while cleanly removing timeout management. The subprocess call execution remains intact.


3048-3055: Consistent timeout removal pattern.

The weight conversion functionality is preserved while cleanly removing timeout context management. Changes are consistent with the overall timeout removal approach.


3058-3066: Engine building functionality preserved.

The timeout management removal maintains the engine building logic while eliminating timeout tracking overhead. Core test functionality remains intact.


3069-3083: Passkey evaluation logic maintained.

The core passkey evaluation functionality and MPI execution are preserved while cleanly removing timeout management. Test logic remains intact.


3384-3387: Function signature consistently updated.

The removal of the timeout_manager parameter follows the same clean pattern as other function signature changes in this revert.


3406-3416: Weight conversion logic preserved.

Timeout management is cleanly removed while maintaining the core weight conversion functionality and parameters.


3419-3434: Engine building with conditional logic maintained.

The build command execution and conditional gemm_allreduce plugin logic are preserved while cleanly removing timeout management.


3436-3442: Dataset generation functionality intact.

The synthetic dataset generation logic is preserved while cleanly removing timeout context management.


3445-3460: Multi-process evaluation execution preserved.

The MPI-based evaluation execution with multi-process coordination is maintained while removing timeout management overhead.


3463-3468: MMLU test execution maintained.

The MMLU accuracy test functionality is preserved while cleanly removing timeout management from the command execution.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12747 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12747 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9491 completed with status: 'FAILURE'

@venkywonka
Copy link
Collaborator

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12772 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12772 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9512 completed with status: 'SUCCESS'

@Tabrizian Tabrizian merged commit 5fceaa6 into NVIDIA:main Jul 24, 2025
3 checks passed
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants