Conversation

crazydemo (Collaborator) commented Aug 8, 2025

Summary by CodeRabbit

  • New Features

    • Multi-turn multimodal conversation mode with CLI options to enable and configure turn count and media-driven turns.
  • Tests

    • Added multimodal multi-turn integration tests and test-list entries for multiple models; extended model coverage and adjusted the quickstart entry point. A duplicate test definition was accidentally introduced. Updated accuracy reference expectations for select models. Added resource- and skip-based gating for several tests.
  • Chores

    • Migrated CUDA-graph handling in tests to a config-based approach.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
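
For example, a typical invocation combining several of the documented options (the stage name here is purely illustrative) might look like:

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast --detailed-log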

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai bot (Contributor) commented Aug 8, 2025

📝 Walkthrough

Adds multiturn multimodal conversation handling to examples/llm-api/quickstart_multimodal.py, introduces multimodal multiturn integration tests and test-list entries, migrates CUDA-graph booleans to a cuda_graph_config object across accuracy tests, tightens resource gating in disaggregated serving tests, and adds FP8 accuracy variants to reference YAMLs.

Changes

Cohort / File(s) Change summary
Quickstart multimodal example
examples/llm-api/quickstart_multimodal.py
Added --multiturn and --conversation_turns CLI flags and implemented a per-turn multimodal loop that loads inputs via default_multimodal_input_loader (num_frames=8, image_data_format="pt", device="cpu"), optionally builds a LoRA request, calls llm.generate, captures per-turn metadata, and maintains running conversation_history. Single-turn path unchanged when multiturn disabled.
End-to-end integration tests
tests/integration/defs/test_e2e.py
Switched some example_root values to examples/llm-api; added test_ptp_quickstart_multimodal_multiturn exercising multimodal multiturn flows with per-model overrides and image inputs (models: gemma-3-27b-it, mistral-small-3.1-24b-instruct, Phi-4-multimodal-instruct). Note: the new test function was duplicated in the file.
Integration test lists
tests/integration/test_lists/qa/llm_function_full.txt
Added three new multiturn multimodal test entries referencing test_ptp_quickstart_multimodal_multiturn for gemma, mistral, and Phi-4 multimodal models.
Accuracy tests — CUDA-graph migration
tests/integration/defs/accuracy/test_llm_api_pytorch.py
Replaced boolean use_cuda_graph with cuda_graph_config (either CudaGraphConfig() or None) across tests and LLM initializations; added @pytest.mark.skip_less_device_memory(80000) to one test.
Disaggregated serving tests
tests/integration/defs/accuracy/test_disaggregated_serving.py
Replaced failing resource-check (pytest.fail) with pytest.skip; added @pytest.mark.skip_less_device(4) and @skip_pre_hopper gating on several classes/tests to gate runs by device/architecture.
Accuracy references
tests/integration/defs/accuracy/references/gsm8k.yaml, .../mmlu.yaml
Added FP8 / kv_cache FP8 entries (and an FP8 + spec_dec_algo: Eagle variant) under meta-llama/Llama-4-Maverick-17B-128E-Instruct with specified accuracy values.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant User
participant Quickstart as quickstart_multimodal.py
participant Loader as default_multimodal_input_loader
participant LLM as LLM
participant LoRA as LoRA (optional)

User->>Quickstart: run with --multiturn + prompts + media
loop for each turn
  Quickstart->>Loader: load inputs (conversation_history, media)
  alt LoRA enabled
    Quickstart->>LoRA: build LoRA request
  end
  Quickstart->>LLM: llm.generate(inputs, sampling_params, lora_request)
  LLM-->>Quickstart: assistant response
  Quickstart->>Quickstart: append response (and next user prompt if present) to conversation_history
end
Quickstart-->>User: print per-turn results and exit
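
A condensed Python sketch of that loop, reusing the names from the summary above (args, llm, model_type, sampling_params, model_class). It assumes the surrounding quickstart scaffolding and an outputs[0].outputs[0].text accessor for the reply, so treat it as illustrative rather than the exact script code:

conversation_history = ""
generated_outputs = []
max_turns = min(args.conversation_turns, len(args.prompt))
for i in range(max_turns):
    cur_user = args.prompt[i]
    # Prepend prior turns so the model sees the running conversation.
    cur_prompt = (f"{conversation_history} user: {cur_user}"
                  if conversation_history else cur_user)
    inputs = default_multimodal_input_loader(
        tokenizer=llm.tokenizer,
        model_dir=llm._hf_model_dir,
        model_type=model_type,
        modality=args.modality,
        prompts=[cur_prompt],
        media=args.media,
        image_data_format="pt",
        num_frames=8,
        device="cpu")
    lora_request = (model_class.lora_request(len(inputs), args.modality,
                                             llm._hf_model_dir)
                    if args.load_lora else None)
    outputs = llm.generate(inputs, sampling_params, lora_request=lora_request)
    response = outputs[0].outputs[0].text  # assumed accessor for the reply text
    generated_outputs.append({
        "turn": i + 1,
        "user_input": cur_user,
        "assistant_response": response,
        "media": args.media,
    })
    # Persist conversation context for the next turn.
    conversation_history = f"{cur_prompt} assistant: {response}"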

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes

Possibly related PRs

Suggested labels

Speculative Decoding, Disaggregated Serving, CI

Suggested reviewers

  • brb-nv
  • Shixiaowei02
  • litaotju
  • QiJune
  • amukkara

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
tests/integration/test_lists/qa/llm_function_full.txt (1)

603-605: Optional: expand coverage and watch runtime

  • Consider adding one multiturn case for a Qwen2-VL or NVILA model later for broader backend coverage, if runtime budget allows.
  • If these prove slow/flake in CI, annotate with TIMEOUT like other heavy tests in this file or place them in a nightly-focused list.

Happy to draft the additional test list entries or help profile runtime to set TIMEOUTs.

tests/integration/defs/test_e2e.py (1)

2600-2601: Avoid dumping full generated output to CI logs

Printing the entire model output can flood CI logs and isn’t needed given the detailed assertion. Remove or gate behind a debug flag.

-    print("output:", output)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 09038be and b2c460e.

📒 Files selected for processing (3)
  • examples/llm-api/quickstart_multimodal.py (3 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/test_e2e.py
  • examples/llm-api/quickstart_multimodal.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/test_e2e.py
  • examples/llm-api/quickstart_multimodal.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
  • tests/integration/defs/test_e2e.py
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py

2608-2608: Line too long (240 > 120)

(E501)

examples/llm-api/quickstart_multimodal.py

151-151: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


156-156: Do not use bare except

(E722)

🔇 Additional comments (5)
tests/integration/test_lists/qa/llm_function_full.txt (2)

603-605: LGTM: Multiturn VLM cases added in the right section with consistent IDs

Names and placement align with existing quickstart multimodal entries (e.g., Lines 600-602). Model IDs/paths look consistent with prior single-turn/2GPU variants.


603-605: Multiturn tests and CLI logic verified
I’ve confirmed that:

  • test_ptp_quickstart_multimodal_multiturn is defined in tests/integration/defs/test_e2e.py and parametrizes exactly the three new models (gemma-3-27b-it, mistral-small-3.1-24b-instruct, Phi-4-multimodal-instruct).
  • The CLI in examples/llm-api/quickstart_multimodal.py supports --multiturn and --conversation_turns and includes a torch.cuda.empty_cache() step.

No changes needed.

tests/integration/defs/test_e2e.py (2)

2571-2582: Consider explicitly passing conversation_turns to match the number of prompts

AI summary notes quickstart_multimodal.py adds a --conversation_turns option. If the script doesn’t infer turns from provided prompts reliably, explicitly set it to len(prompts) to avoid ambiguity.

Would the CLI require or benefit from:

  • f"--conversation_turns={len(accuracy_inputs['image']['prompt'])}"

2608-2610: Fix line length violation flagged by Ruff (E501)

Static analysis reports a too-long line here. The refactor above splits the zip args onto separate lines, resolving E501.

Likely an incorrect or invalid review comment.

examples/llm-api/quickstart_multimodal.py (1)

111-120: CLI flag looks good – consider clarifying the help text.

--conversation_turns controls maximum turns (it is clipped by prompt length).
Maybe say “Maximum number of …” to avoid confusion, but otherwise OK.

@crazydemo crazydemo force-pushed the add_multi-turn_test branch from 6618032 to 0e86043 on August 8, 2025 09:46
@crazydemo crazydemo marked this pull request as ready for review August 8, 2025 09:46
@crazydemo crazydemo requested a review from a team as a code owner August 8, 2025 09:46
@crazydemo crazydemo requested review from chzblych and QiJune August 8, 2025 09:46
@crazydemo crazydemo force-pushed the add_multi-turn_test branch from 0e86043 to 42cee24 on August 8, 2025 09:47
@crazydemo (Collaborator, Author) commented:

/bot run

@crazydemo crazydemo requested a review from Wanli-Jiang August 8, 2025 09:47
@tensorrt-cicd (Collaborator) commented:

PR_Github #14600 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #14600 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #30 completed with status: 'FAILURE'

@crazydemo crazydemo force-pushed the add_multi-turn_test branch 3 times, most recently from 7cc9490 to 302df88 on August 8, 2025 15:22
@crazydemo (Collaborator, Author) commented:

/bot run

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
examples/llm-api/quickstart_multimodal.py (3)

150-158: Narrow except and punctuate the docstring (Ruff D415, E722).

Apply this diff:

-def clear_cuda_cache():
-    """Clear CUDA cache to prevent memory issues"""
-    try:
-        import torch
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
-    except:
-        pass
+def clear_cuda_cache():
+    """Clear CUDA cache to prevent memory issues."""
+    try:
+        import torch
+    except ImportError:
+        # Torch not installed – nothing to clear.
+        return
+    try:
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+    except Exception as err:
+        # Log once, but don't kill the caller.
+        print(f"[warn] Failed to clear CUDA cache: {err}")

196-209: Respect CLI parameters instead of hard-coded loader defaults.

Apply this diff:

-                inputs = default_multimodal_input_loader(
+                inputs = default_multimodal_input_loader(
                     tokenizer=llm.tokenizer,
                     model_dir=llm._hf_model_dir,
                     model_type=model_type,
                     modality=args.modality,
                     prompts=[cur_prompt],
                     media=args.media,
-                    image_data_format="pt",
-                    num_frames=8,
-                    device="cpu")
+                    image_data_format=args.image_format,
+                    num_frames=args.num_frames,
+                    device=args.device)

193-239: Conversation history is not persisted between turns – context never grows.

Each iteration uses a fresh args.prompt[i]; the assistant reply appended to cur_prompt is discarded. Persist a running history so the model receives prior turns.

Apply this diff:

-        generated_outputs = []  # Store generated outputs for return
+        generated_outputs = []  # Store generated outputs for return
+        conversation_history = ""

@@
-                # Use multimodal input loader to process input with conversation context
-                cur_prompt = args.prompt[i]
+                # Use multimodal input loader to process input with conversation context
+                cur_user = args.prompt[i]
+                cur_prompt = (f"{conversation_history} user: {cur_user}"
+                              if conversation_history else cur_user)
@@
-                generated_outputs.append({
+                generated_outputs.append({
                     "turn": i + 1,
-                    "user_input": cur_prompt,
+                    "user_input": cur_user,
                     "assistant_response": response,
                     "media": args.media
                 })
@@
-                # Add to cur_prompt
-                cur_prompt = cur_prompt + response
+                # Persist conversation context for next turn
+                conversation_history = f"{cur_prompt} assistant: {response}"
tests/integration/defs/test_e2e.py (1)

2601-2620: Harden matching, fix Ruff E501, and remove noisy prints in multi-turn check

  • Add an assertion that parsed outputs count equals number of prompts to catch parser/format drift early.
  • Make keyword matching robust and case-insensitive on both sides.
  • Remove debug prints that spam CI logs.
  • Split long lines (Line 2610, assert message) to satisfy E501.
-    output = llm_venv.run_cmd(cmd, caller=check_output)
-    print("output:", output)
-    # Set match ratio based on model
-    match_ratio = 4.0 / 5
-    if model_name == "Phi-4-multimodal-instruct":
-        match_ratio = 0.6
-
-    # Check output accuracy
-    for prompt_output, prompt_keywords in zip(
-            parse_output(output), expected_keywords[model_name]["image"]):
-        matches = [
-            keyword in prompt_output.lower() for keyword in prompt_keywords
-        ]
-        obs_match_ratio = 1. * sum(matches) / len(matches)
-        print("prompt_output:", prompt_output)
-        print("prompt_keywords:", prompt_keywords)
-        print("matches:", matches)
-        print("obs_match_ratio:", obs_match_ratio)
-        assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+    output = llm_venv.run_cmd(cmd, caller=check_output)
+    outputs = parse_output(output)
+    expected = expected_keywords[model_name]["image"]
+    assert len(outputs) == len(expected), (
+        f"Expected {len(expected)} outputs, got {len(outputs)}. "
+        f"Raw output (truncated): {output[:1000]}..."
+    )
+    # Set match ratio based on model
+    match_ratio = 4.0 / 5
+    if model_name == "Phi-4-multimodal-instruct":
+        match_ratio = 0.6
+
+    # Check output accuracy
+    for prompt_output, prompt_keywords in zip(outputs, expected):
+        matches = [keyword.lower() in prompt_output.lower()
+                   for keyword in prompt_keywords]
+        obs_match_ratio = 1.0 * sum(matches) / len(matches)
+        assert obs_match_ratio >= match_ratio, (
+            f"Incorrect output!\nGenerated \"{prompt_output}\"\n"
+            f"Expected keywords \"{prompt_keywords}\"\n"
+            f"Matched keywords: {matches}\n"
+            f"Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+        )
🧹 Nitpick comments (2)
examples/llm-api/quickstart_multimodal.py (2)

111-119: Validate --conversation_turns to be >= 1.

Prevent silent no-op when negative by enforcing a positive integer at parse time.

Apply this diff:

-    parser.add_argument(
-        "--conversation_turns",
-        type=int,
-        default=2,
-        help="Number of conversation turns for automated testing.")
+    parser.add_argument(
+        "--conversation_turns",
+        type=positive_int,
+        default=2,
+        help="Number of conversation turns for automated testing (>= 1).")

Add this helper (outside the shown range, e.g., above add_multimodal_args):

def positive_int(value: str) -> int:
    iv = int(value)
    if iv < 1:
        raise argparse.ArgumentTypeError("conversation_turns must be >= 1")
    return iv

212-217: Redundant model_class is None check inside if args.load_lora.

model_class is guaranteed to be set in the earlier if args.load_lora: block; this branch cannot see model_class as None.

Apply this diff:

-                if args.load_lora:
-                    if model_class is None:
-                        raise ValueError(
-                            "model_class must be provided when load_lora is True"
-                        )
-                    lora_request = model_class.lora_request(
+                if args.load_lora:
+                    lora_request = model_class.lora_request(
                         len(inputs), args.modality, llm._hf_model_dir)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cc9490 and 302df88.

📒 Files selected for processing (3)
  • examples/llm-api/quickstart_multimodal.py (3 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/test_e2e.py
  • examples/llm-api/quickstart_multimodal.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/test_e2e.py
  • examples/llm-api/quickstart_multimodal.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.

Applied to files:

  • examples/llm-api/quickstart_multimodal.py
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py

2610-2610: Line too long (240 > 120)

(E501)

examples/llm-api/quickstart_multimodal.py

151-151: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


156-156: Do not use bare except

(E722)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
examples/llm-api/quickstart_multimodal.py (2)

253-253: No action needed.

Comment-only change for section separation.


199-206: default_multimodal_input_loader already handles 1-prompt -> N-media broadcasting.
The if len(media) > len(prompts) and len(prompts) == 1 branch wraps your list of media into a single-element media list to match the single prompt, and the subsequent assert len(media_or_embeddings) == len(prompts) enforces correct pairing. No changes needed in the example for prompt/media length alignment.
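
A schematic, self-contained sketch of the broadcasting rule described above; the helper name is hypothetical and this is not the actual default_multimodal_input_loader implementation:

def pair_media_with_prompts(prompts, media):
    """Pair media with prompts, broadcasting N media items onto a single prompt."""
    if len(media) > len(prompts) and len(prompts) == 1:
        # One prompt, several media items: all media belong to that prompt.
        media = [media]
    assert len(media) == len(prompts), "media and prompts must pair 1:1"
    return list(zip(prompts, media))

# One prompt with two images yields a single (prompt, [img1, img2]) pair.
pairs = pair_media_with_prompts(["Describe the images."],
                                ["img1.png", "img2.png"])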

@tensorrt-cicd (Collaborator) commented:

PR_Github #14620 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #14620 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #33 completed with status: 'FAILURE'

@crazydemo crazydemo force-pushed the add_multi-turn_test branch from 302df88 to 22670bf on August 10, 2025 15:56
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (6)
examples/llm-api/quickstart_multimodal.py (4)

150-158: Fix docstring punctuation and avoid bare except in clear_cuda_cache()

Align with Ruff D415/E722; narrow exceptions and log unexpected errors.

Apply:

 def clear_cuda_cache():
-    """Clear CUDA cache to prevent memory issues"""
+    """Clear CUDA cache to prevent memory issues."""
     try:
         import torch
         if torch.cuda.is_available():
             torch.cuda.empty_cache()
-    except:
-        pass
+    except ImportError:
+        # Torch not installed – nothing to clear.
+        return
+    except Exception as err:
+        # Log once, but don't kill the caller.
+        print(f"[warn] Failed to clear CUDA cache: {err}")

196-209: Respect CLI args for image format, frames, and device

Hard-coded values ignore user input; pass through args to the loader.

Apply:

                     prompts=[cur_prompt],
                     media=args.media,
-                    image_data_format="pt",
-                    num_frames=8,
-                    device="cpu")
+                    image_data_format=args.image_format,
+                    num_frames=args.num_frames,
+                    device=args.device)

239-246: Always clear CUDA cache via finally

Errors skip clear_cuda_cache(); use finally to guarantee cleanup.

Apply:

-            except Exception as e:
-                print(f"Error in turn {i+1}: {e}")
-                import traceback
-                traceback.print_exc()
-                continue
-
-            clear_cuda_cache()
+            except Exception as e:
+                print(f"Error in turn {i+1}: {e}")
+                import traceback
+                traceback.print_exc()
+            finally:
+                clear_cuda_cache()

236-238: Multi-turn: conversation context is not persisted to next turn

The updated prompt isn’t saved; next iteration reuses the original prompt. Persist it for true multi-turn behavior.

Apply:

-                # Add to cur_prompt
-                cur_prompt = cur_prompt + response
+                # Persist conversation context for next turn
+                args.prompt[i] = f"{cur_prompt} assistant: {response}"
tests/integration/defs/test_e2e.py (2)

2573-2584: Ensure the test exercises all prompts by adding --conversation_turns

Avoid coupling to script defaults; make intent explicit.

Apply:

         "--modality",
         "image",
         "--multiturn",
+        "--conversation_turns",
+        str(len(accuracy_inputs["image"]["prompt"])),
         "--prompt",

2601-2620: Harden checks, reduce log noise, and fix E501 with structured assertions

  • Parse once and assert counts.
  • Case-insensitive matching.
  • Remove verbose prints.
  • Break long assert to satisfy E501.

Apply:

-    output = llm_venv.run_cmd(cmd, caller=check_output)
-    print("output:", output)
+    output = llm_venv.run_cmd(cmd, caller=check_output)
+    outputs = parse_output(output)
+    expected = expected_keywords[model_name]["image"]
+    assert len(outputs) == len(expected), (
+        f"Expected {len(expected)} outputs, got {len(outputs)}. "
+        f"Raw output (truncated): {output[:1000]}..."
+    )
@@
-    for prompt_output, prompt_keywords in zip(
-            parse_output(output), expected_keywords[model_name]["image"]):
-        matches = [
-            keyword in prompt_output.lower() for keyword in prompt_keywords
-        ]
-        obs_match_ratio = 1. * sum(matches) / len(matches)
-        print("prompt_output:", prompt_output)
-        print("prompt_keywords:", prompt_keywords)
-        print("matches:", matches)
-        print("obs_match_ratio:", obs_match_ratio)
-        assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+    for prompt_output, prompt_keywords in zip(outputs, expected):
+        matches = [keyword.lower() in prompt_output.lower()
+                   for keyword in prompt_keywords]
+        obs_match_ratio = 1.0 * sum(matches) / len(matches)
+        assert obs_match_ratio >= match_ratio, (
+            f"Incorrect output!\nGenerated \"{prompt_output}\"\n"
+            f"Expected keywords \"{prompt_keywords}\"\n"
+            f"Matched keywords: {matches}\n"
+            f"Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+        )
🧹 Nitpick comments (2)
examples/llm-api/quickstart_multimodal.py (2)

111-120: Make conversation_turns optional; default to number of provided prompts

Defaulting to 2 is surprising. Let users omit it and derive from len(prompts).

Apply:

-    parser.add_argument(
-        "--conversation_turns",
-        type=int,
-        default=2,
-        help="Number of conversation turns for automated testing.")
+    parser.add_argument(
+        "--conversation_turns",
+        type=int,
+        default=None,
+        help="Number of conversation turns (default: use number of provided prompts).")

190-191: Derive max_turns when conversation_turns is omitted

Guard for None to run all provided prompts by default.

Apply:

-        max_turns = min(args.conversation_turns, len(args.prompt))
+        max_turns = (len(args.prompt) if args.conversation_turns is None
+                     else min(args.conversation_turns, len(args.prompt)))
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 302df88 and 22670bf.

📒 Files selected for processing (4)
  • examples/llm-api/quickstart_multimodal.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • examples/llm-api/quickstart_multimodal.py
  • tests/integration/defs/test_e2e.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • examples/llm-api/quickstart_multimodal.py
  • tests/integration/defs/test_e2e.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/defs/test_e2e.py
🪛 Ruff (0.12.2)
examples/llm-api/quickstart_multimodal.py

151-151: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


156-156: Do not use bare except

(E722)

tests/integration/defs/test_e2e.py

2610-2610: Line too long (240 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6)

506-507: Migration to CudaGraphConfig looks correct

Using cuda_graph_config=CudaGraphConfig() if cuda_graph else None is the right replacement for the deprecated boolean flag. No issues spotted.
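
For reference, a minimal sketch of the migration pattern being approved here; the import path for CudaGraphConfig and the model_dir/cuda_graph variables are assumptions, not verified project layout:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import CudaGraphConfig  # assumed import path

# Before: LLM(..., use_cuda_graph=cuda_graph)
# After: pass a config object to enable CUDA graphs, or None to disable them.
llm = LLM(
    model=model_dir,
    cuda_graph_config=CudaGraphConfig() if cuda_graph else None,
)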


530-531: LGTM: consistent CudaGraphConfig usage in chunked prefill test

Same pattern applied consistently. Good.


651-652: LGTM: FP8 chunked prefill with CudaGraphConfig

Change aligns with new API; parameters remain unchanged.


675-676: LGTM: FP4 chunked prefill with CudaGraphConfig

Correct and consistent with other tests.


1036-1037: LGTM: Conditional CUDA graph config in CUTEDSL test

Conditional construction via CudaGraphConfig() is correct.


1194-1195: LGTM: Conditional CUDA graph config in 4-GPU CUTEDSL test

Matches the new API across the suite.

@crazydemo (Collaborator, Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #14714 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #14714 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #48 completed with status: 'SUCCESS'

@crazydemo crazydemo force-pushed the add_multi-turn_test branch from 5e7b3ee to eaf5a2d on August 11, 2025 04:42
@crazydemo (Collaborator, Author) commented:

/bot reuse-pipeline

@tensorrt-cicd (Collaborator) commented:

PR_Github #14748 [ reuse-pipeline ] triggered by Bot

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (6)
examples/llm-api/quickstart_multimodal.py (4)

150-158: Fix bare except and finish the docstring sentence.

Ruff flags D415 and E722. Limit exceptions and add a period.

-def clear_cuda_cache():
-    """Clear CUDA cache to prevent memory issues"""
-    try:
-        import torch
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
-    except:
-        pass
+def clear_cuda_cache():
+    """Clear CUDA cache to prevent memory issues."""
+    try:
+        import torch
+    except ImportError:
+        # Torch not installed – nothing to clear.
+        return
+    try:
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+    except Exception as err:
+        # Log once, but don't kill the caller.
+        print(f"[warn] Failed to clear CUDA cache: {err}")

196-209: Respect user CLI args instead of hard-coded defaults.

Use args.image_format, args.num_frames, and args.device to match single-turn behavior and user expectations.

-                inputs = default_multimodal_input_loader(
+                inputs = default_multimodal_input_loader(
                     tokenizer=llm.tokenizer,
                     model_dir=llm._hf_model_dir,
                     model_type=model_type,
                     modality=args.modality,
                     prompts=[cur_prompt],
                     media=args.media,
-                    image_data_format="pt",
-                    num_frames=8,
-                    device="cpu")
+                    image_data_format=image_format,
+                    num_frames=args.num_frames,
+                    device=args.device)

239-246: Always clear CUDA cache using finally.

On exceptions, the continue skips clear_cuda_cache(). Use a finally block.

-            except Exception as e:
-                print(f"Error in turn {i+1}: {e}")
-                import traceback
-                traceback.print_exc()
-                continue
-
-            clear_cuda_cache()
+            except Exception as e:
+                print(f"Error in turn {i+1}: {e}")
+                import traceback
+                traceback.print_exc()
+            finally:
+                clear_cuda_cache()

236-238: Multi-turn context is not persisted — conversation state never grows.

You build cur_prompt but don’t save it, so the next turn reloads the original prompt. Persist it.

-                # Add to cur_prompt
-                cur_prompt = cur_prompt + response
+                # Persist conversation context for next turn
+                args.prompt[i] = f"{cur_prompt} assistant: {response}"

Alternative: maintain a history list per sample and concatenate before calling the loader for each turn.

tests/integration/defs/test_e2e.py (2)

2573-2584: Make the number of turns explicit to avoid silent truncation.

--conversation_turns defaults to 2; if more prompts are passed later, only two turns would run. Set it to the number of provided prompts for clarity and robustness.

         "--multiturn",
+        "--conversation_turns",
+        str(len(accuracy_inputs["image"]["prompt"])),
         "--prompt",

2609-2620: Harden matching (case-insensitive), remove noisy prints, and fix E501 long line.

This reduces flakiness, cleans logs, and complies with line length rules.

-    for prompt_output, prompt_keywords in zip(
-            parse_output(output), expected_keywords[model_name]["image"]):
-        matches = [
-            keyword in prompt_output.lower() for keyword in prompt_keywords
-        ]
-        obs_match_ratio = 1. * sum(matches) / len(matches)
-        print("prompt_output:", prompt_output)
-        print("prompt_keywords:", prompt_keywords)
-        print("matches:", matches)
-        print("obs_match_ratio:", obs_match_ratio)
-        assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+    for prompt_output, prompt_keywords in zip(outputs, expected):
+        matches = [keyword.lower() in prompt_output.lower()
+                   for keyword in prompt_keywords]
+        obs_match_ratio = 1.0 * sum(matches) / len(matches)
+        assert obs_match_ratio >= match_ratio, (
+            f"Incorrect output!\nGenerated \"{prompt_output}\"\n"
+            f"Expected keywords \"{prompt_keywords}\"\n"
+            f"Matched keywords: {matches}\n"
+            f"Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+        )
🧹 Nitpick comments (1)
tests/integration/defs/test_e2e.py (1)

2602-2602: Reduce CI log noise — drop raw output print and pre-parse instead.

Printing the entire raw output is noisy. Parse once and assert; include truncated raw text only in failure messages if needed.

-    output = llm_venv.run_cmd(cmd, caller=check_output)
-    print("output:", output)
+    output = llm_venv.run_cmd(cmd, caller=check_output)
+    outputs = parse_output(output)
+    expected = expected_keywords[model_name]["image"]
+    assert len(outputs) == len(expected), (
+        f"Expected {len(expected)} outputs, got {len(outputs)}. "
+        f"Raw output (truncated): {output[:1000]}..."
+    )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 22670bf and eaf5a2d.

📒 Files selected for processing (5)
  • examples/llm-api/quickstart_multimodal.py (3 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration/test_lists/qa/llm_function_full.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • examples/llm-api/quickstart_multimodal.py
  • tests/integration/defs/test_e2e.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • examples/llm-api/quickstart_multimodal.py
  • tests/integration/defs/test_e2e.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/test_e2e.py
🧬 Code Graph Analysis (2)
examples/llm-api/quickstart_multimodal.py (3)
tensorrt_llm/llmapi/llm.py (4)
  • prompt (79-80)
  • tokenizer (691-695)
  • tokenizer (698-699)
  • generate (235-313)
tensorrt_llm/inputs/utils.py (1)
  • default_multimodal_input_loader (475-635)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  • lora_request (267-288)
tests/integration/defs/test_e2e.py (4)
tests/integration/defs/conftest.py (2)
  • llm_models_root (77-83)
  • llm_root (180-181)
tests/integration/defs/triton_server/conftest.py (1)
  • llm_models_root (16-25)
tests/integration/defs/common.py (1)
  • parse_output (665-681)
tests/integration/defs/examples/test_llama.py (1)
  • parse_output (129-139)
🪛 Ruff (0.12.2)
examples/llm-api/quickstart_multimodal.py

151-151: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


156-156: Do not use bare except

(E722)

tests/integration/defs/test_e2e.py

2610-2610: Line too long (240 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)

624-624: Appropriate Hopper-gating for FP8 model — add an explicit reason if supported.

Adding @skip_pre_hopper to test_auto_dtype for Qwen3-8B-FP8 is correct (FP8 requires Hopper+). If skip_pre_hopper supports a reason message, consider adding one for clarity in CI logs.
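
If skip_pre_hopper is (or becomes) a thin wrapper over pytest.mark.skipif, attaching a reason could look roughly like the sketch below; the _sm_version helper and the SM90 threshold are illustrative assumptions, not the marker's actual definition:

import pytest
import torch

def _sm_version() -> int:
    """Return the compute capability of GPU 0 as an integer, e.g. 90 for Hopper."""
    major, minor = torch.cuda.get_device_capability(0)
    return major * 10 + minor

# Hypothetical wrapper; the real skip_pre_hopper marker may be defined differently.
skip_pre_hopper = pytest.mark.skipif(
    not torch.cuda.is_available() or _sm_version() < 90,
    reason="FP8 checkpoints require Hopper (SM90) or newer")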

examples/llm-api/quickstart_multimodal.py (1)

111-119: New CLI flags look good.

--multiturn and --conversation_turns are sensible and scoped. No issues.

@tensorrt-cicd (Collaborator) commented:

PR_Github #14748 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #14714 for commit eaf5a2d

@crazydemo crazydemo force-pushed the add_multi-turn_test branch from eaf5a2d to 0555625 on August 11, 2025 06:24
@tensorrt-cicd (Collaborator) commented:

PR_Github #15056 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #14714 for commit 9a298af

@StanleySun639 StanleySun639 enabled auto-merge (squash) August 13, 2025 04:40
@LarryXFly LarryXFly disabled auto-merge August 13, 2025 05:10
@LarryXFly LarryXFly merged commit 15bcf80 into NVIDIA:release/1.0 Aug 13, 2025
5 checks passed
dominicshanshan pushed commits referencing this pull request to dominicshanshan/TensorRT-LLM between Aug 22 and Aug 30, 2025
joyang-nv pushed a commit that referenced this pull request Sep 1, 2025