[TRTLLM-6975][test] Add multi-turn test cases for VLM models #6749
Conversation
📝 Walkthrough
Adds multiturn multimodal conversation handling to examples/llm-api/quickstart_multimodal.py, introduces multimodal multiturn integration tests and test-list entries, migrates CUDA-graph booleans to a cuda_graph_config object across accuracy tests, tightens resource gating in disaggregated serving tests, and adds FP8 accuracy variants to reference YAMLs.
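For orientation, here is a minimal sketch of the multi-turn flow described above, pieced together from the flags and calls quoted in the review comments below (--multiturn, --conversation_turns, default_multimodal_input_loader, clear_cuda_cache). The function name and the loader/clear_cache parameters are illustrative, not the exact code in the PR.

def run_multiturn(llm, args, model_type, sampling_params, loader, clear_cache):
    """Sketch: feed one prompt per turn and carry the reply into the next turn."""
    history = ""
    max_turns = min(args.conversation_turns, len(args.prompt))
    for i in range(max_turns):
        cur_prompt = f"{history} user: {args.prompt[i]}" if history else args.prompt[i]
        inputs = loader(tokenizer=llm.tokenizer,
                        model_dir=llm._hf_model_dir,
                        model_type=model_type,
                        modality=args.modality,
                        prompts=[cur_prompt],
                        media=args.media,
                        image_data_format=args.image_format,
                        num_frames=args.num_frames,
                        device=args.device)
        response = llm.generate(inputs, sampling_params)[0].outputs[0].text
        history = f"{cur_prompt} assistant: {response}"
        clear_cache()  # e.g. torch.cuda.empty_cache(), as the script does between turns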
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~35 minutes
Actionable comments posted: 4
🧹 Nitpick comments (2)
tests/integration/test_lists/qa/llm_function_full.txt (1)
603-605: Optional: expand coverage and watch runtime
- Consider adding one multiturn case for a Qwen2-VL or NVILA model later for broader backend coverage, if the runtime budget allows.
- If these prove slow or flaky in CI, annotate them with TIMEOUT like other heavy tests in this file, or place them in a nightly-focused list.
Happy to draft the additional test-list entries or help profile runtime to set TIMEOUTs.
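For illustration only, entries of that shape might look like the lines below; the parameter IDs and the exact TIMEOUT syntax are placeholders and should be copied from existing heavy entries in llm_function_full.txt rather than taken from here.
test_e2e.py::test_ptp_quickstart_multimodal_multiturn[gemma-3-27b-it] TIMEOUT (90)
test_e2e.py::test_ptp_quickstart_multimodal_multiturn[Phi-4-multimodal-instruct] TIMEOUT (90)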
tests/integration/defs/test_e2e.py (1)
2600-2601: Avoid dumping full generated output to CI logs
Printing the entire model output can flood CI logs and isn’t needed given the detailed assertion. Remove it or gate it behind a debug flag.
- print("output:", output)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/llm-api/quickstart_multimodal.py (3 hunks)
tests/integration/defs/test_e2e.py (1 hunks)
tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tests/integration/defs/test_e2e.py
examples/llm-api/quickstart_multimodal.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/integration/defs/test_e2e.py
examples/llm-api/quickstart_multimodal.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/qa/llm_function_full.txt
tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/test_lists/qa/llm_function_full.txt
tests/integration/defs/test_e2e.py
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py
2608-2608: Line too long (240 > 120)
(E501)
examples/llm-api/quickstart_multimodal.py
151-151: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
156-156: Do not use bare except
(E722)
🔇 Additional comments (5)
tests/integration/test_lists/qa/llm_function_full.txt (2)
603-605: LGTM: Multiturn VLM cases added in the right section with consistent IDs
Names and placement align with existing quickstart multimodal entries (e.g., lines 600-602). Model IDs/paths look consistent with prior single-turn/2-GPU variants.
603-605: Multiturn tests and CLI logic verified
I’ve confirmed that:
- test_ptp_quickstart_multimodal_multiturn is defined in tests/integration/defs/test_e2e.py and parametrizes exactly the three new models (gemma-3-27b-it, mistral-small-3.1-24b-instruct, Phi-4-multimodal-instruct).
- The CLI in examples/llm-api/quickstart_multimodal.py supports --multiturn and --conversation_turns and includes a torch.cuda.empty_cache() step.
No changes needed.
tests/integration/defs/test_e2e.py (2)
2571-2582: Consider explicitly passing conversation_turns to match the number of prompts
The AI summary notes that quickstart_multimodal.py adds a --conversation_turns option. If the script doesn’t infer turns from the provided prompts reliably, explicitly set it to len(prompts) to avoid ambiguity.
Would the CLI require or benefit from:
- f"--conversation_turns={len(accuracy_inputs['image']['prompt'])}"
2608-2610: Fix line length violation flagged by Ruff (E501)
Static analysis reports a too-long line here. The refactor above splits the zip args onto separate lines, resolving E501.
Likely an incorrect or invalid review comment.
examples/llm-api/quickstart_multimodal.py (1)
111-120: CLI flag looks good – consider clarifying the help text.
--conversation_turns controls the maximum number of turns (it is clipped by the number of prompts). Maybe say “Maximum number of …” in the help string to avoid confusion, but otherwise OK.
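For context, the clipping this comment refers to follows the pattern quoted later in this thread, roughly:
# Turns actually run: never more than the number of prompts supplied.
max_turns = min(args.conversation_turns, len(args.prompt))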
Force-pushed from 6618032 to 0e86043
Force-pushed from 0e86043 to 42cee24
/bot run
PR_Github #14600 [ run ] triggered by Bot
PR_Github #14600 [ run ] completed with state
Force-pushed from 7cc9490 to 302df88
/bot run
Actionable comments posted: 2
♻️ Duplicate comments (4)
examples/llm-api/quickstart_multimodal.py (3)
150-158: Narrow except and punctuate the docstring (Ruff D415, E722). Apply this diff:
-def clear_cuda_cache():
-    """Clear CUDA cache to prevent memory issues"""
-    try:
-        import torch
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
-    except:
-        pass
+def clear_cuda_cache():
+    """Clear CUDA cache to prevent memory issues."""
+    try:
+        import torch
+    except ImportError:
+        # Torch not installed – nothing to clear.
+        return
+    try:
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+    except Exception as err:
+        # Log once, but don't kill the caller.
+        print(f"[warn] Failed to clear CUDA cache: {err}")
196-209: Respect CLI parameters instead of hard-coded loader defaults. Apply this diff:
- inputs = default_multimodal_input_loader( + inputs = default_multimodal_input_loader( tokenizer=llm.tokenizer, model_dir=llm._hf_model_dir, model_type=model_type, modality=args.modality, prompts=[cur_prompt], media=args.media, - image_data_format="pt", - num_frames=8, - device="cpu") + image_data_format=args.image_format, + num_frames=args.num_frames, + device=args.device)
193-239: Conversation history is not persisted between turns – context never grows.
Each iteration uses a fresh args.prompt[i]; the assistant reply appended to cur_prompt is discarded. Persist a running history so the model receives prior turns. Apply this diff:
-    generated_outputs = []  # Store generated outputs for return
+    generated_outputs = []  # Store generated outputs for return
+    conversation_history = ""
@@
-        # Use multimodal input loader to process input with conversation context
-        cur_prompt = args.prompt[i]
+        # Use multimodal input loader to process input with conversation context
+        cur_user = args.prompt[i]
+        cur_prompt = (f"{conversation_history} user: {cur_user}"
+                      if conversation_history else cur_user)
@@
-        generated_outputs.append({
+        generated_outputs.append({
             "turn": i + 1,
-            "user_input": cur_prompt,
+            "user_input": cur_user,
             "assistant_response": response,
             "media": args.media
         })
@@
-        # Add to cur_prompt
-        cur_prompt = cur_prompt + response
+        # Persist conversation context for next turn
+        conversation_history = f"{cur_prompt} assistant: {response}"
tests/integration/defs/test_e2e.py (1)
2601-2620: Harden matching, fix Ruff E501, and remove noisy prints in multi-turn check
- Add an assertion that parsed outputs count equals number of prompts to catch parser/format drift early.
- Make keyword matching robust and case-insensitive on both sides.
- Remove debug prints that spam CI logs.
- Split long lines (Line 2610, assert message) to satisfy E501.
- output = llm_venv.run_cmd(cmd, caller=check_output) - print("output:", output) - # Set match ratio based on model - match_ratio = 4.0 / 5 - if model_name == "Phi-4-multimodal-instruct": - match_ratio = 0.6 - - # Check output accuracy - for prompt_output, prompt_keywords in zip( - parse_output(output), expected_keywords[model_name]["image"]): - matches = [ - keyword in prompt_output.lower() for keyword in prompt_keywords - ] - obs_match_ratio = 1. * sum(matches) / len(matches) - print("prompt_output:", prompt_output) - print("prompt_keywords:", prompt_keywords) - print("matches:", matches) - print("obs_match_ratio:", obs_match_ratio) - assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}" + output = llm_venv.run_cmd(cmd, caller=check_output) + outputs = parse_output(output) + expected = expected_keywords[model_name]["image"] + assert len(outputs) == len(expected), ( + f"Expected {len(expected)} outputs, got {len(outputs)}. " + f"Raw output (truncated): {output[:1000]}..." + ) + # Set match ratio based on model + match_ratio = 4.0 / 5 + if model_name == "Phi-4-multimodal-instruct": + match_ratio = 0.6 + + # Check output accuracy + for prompt_output, prompt_keywords in zip(outputs, expected): + matches = [keyword.lower() in prompt_output.lower() + for keyword in prompt_keywords] + obs_match_ratio = 1.0 * sum(matches) / len(matches) + assert obs_match_ratio >= match_ratio, ( + f"Incorrect output!\nGenerated \"{prompt_output}\"\n" + f"Expected keywords \"{prompt_keywords}\"\n" + f"Matched keywords: {matches}\n" + f"Observed match ratio {obs_match_ratio} below threshold {match_ratio}" + )
🧹 Nitpick comments (2)
examples/llm-api/quickstart_multimodal.py (2)
111-119: Validate --conversation_turns to be >= 1.
Prevent a silent no-op when negative by enforcing a positive integer at parse time.
Apply this diff:
-    parser.add_argument(
-        "--conversation_turns",
-        type=int,
-        default=2,
-        help="Number of conversation turns for automated testing.")
+    parser.add_argument(
+        "--conversation_turns",
+        type=positive_int,
+        default=2,
+        help="Number of conversation turns for automated testing (>= 1).")
Add this helper (outside the shown range, e.g., above add_multimodal_args):
def positive_int(value: str) -> int:
    iv = int(value)
    if iv < 1:
        raise argparse.ArgumentTypeError("conversation_turns must be >= 1")
    return iv
212-217: Redundant model_class is None check inside if args.load_lora.
model_class is guaranteed to be set in the earlier if args.load_lora: block; this branch cannot see model_class as None. Apply this diff:
-        if args.load_lora:
-            if model_class is None:
-                raise ValueError(
-                    "model_class must be provided when load_lora is True"
-                )
-            lora_request = model_class.lora_request(
+        if args.load_lora:
+            lora_request = model_class.lora_request(
                 len(inputs), args.modality, llm._hf_model_dir)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
examples/llm-api/quickstart_multimodal.py (3 hunks)
tests/integration/defs/test_e2e.py (1 hunks)
tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py : same Python guidelines as quoted in the first review details above.
Files:
tests/integration/defs/test_e2e.py
examples/llm-api/quickstart_multimodal.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/integration/defs/test_e2e.py
examples/llm-api/quickstart_multimodal.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tests/integration/defs/test_e2e.py
📚 Learning: 2025-08-06T21:22:55.018Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-06T21:22:55.018Z
Learning: Applies to **/*.py : When using try-except blocks in Python, limit the except to the smallest set of errors possible.
Applied to files:
examples/llm-api/quickstart_multimodal.py
🪛 Ruff (0.12.2)
tests/integration/defs/test_e2e.py
2610-2610: Line too long (240 > 120)
(E501)
examples/llm-api/quickstart_multimodal.py
151-151: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
156-156: Do not use bare except
(E722)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
examples/llm-api/quickstart_multimodal.py (2)
253-253: No action needed.
Comment-only change for section separation.
199-206: default_multimodal_input_loader already handles 1-prompt -> N-media broadcasting.
The if len(media) > len(prompts) and len(prompts) == 1 branch wraps your list of media into a single-element media list to match the single prompt, and the subsequent assert len(media_or_embeddings) == len(prompts) enforces correct pairing. No changes needed in the example for prompt/media length alignment.
PR_Github #14620 [ run ] triggered by Bot
PR_Github #14620 [ run ] completed with state
Force-pushed from 302df88 to 22670bf
Actionable comments posted: 0
♻️ Duplicate comments (6)
examples/llm-api/quickstart_multimodal.py (4)
150-158: Fix docstring punctuation and avoid bare except in clear_cuda_cache()
Align with Ruff D415/E722; narrow exceptions and log unexpected errors.
Apply:
def clear_cuda_cache(): - """Clear CUDA cache to prevent memory issues""" + """Clear CUDA cache to prevent memory issues.""" try: import torch if torch.cuda.is_available(): torch.cuda.empty_cache() - except: - pass + except ImportError: + # Torch not installed – nothing to clear. + return + except Exception as err: + # Log once, but don't kill the caller. + print(f"[warn] Failed to clear CUDA cache: {err}")
196-209: Respect CLI args for image format, frames, and device
Hard-coded values ignore user input; pass through args to the loader.
Apply:
prompts=[cur_prompt], media=args.media, - image_data_format="pt", - num_frames=8, - device="cpu") + image_data_format=args.image_format, + num_frames=args.num_frames, + device=args.device)
239-246: Always clear CUDA cache via finally
Errors skip clear_cuda_cache(); use finally to guarantee cleanup.
Apply:
- except Exception as e: - print(f"Error in turn {i+1}: {e}") - import traceback - traceback.print_exc() - continue - - clear_cuda_cache() + except Exception as e: + print(f"Error in turn {i+1}: {e}") + import traceback + traceback.print_exc() + finally: + clear_cuda_cache()
236-238: Multi-turn: conversation context is not persisted to the next turn
The updated prompt isn’t saved; the next iteration reuses the original prompt. Persist it for true multi-turn behavior.
Apply:
-            # Add to cur_prompt
-            cur_prompt = cur_prompt + response
+            # Persist conversation context for next turn
+            args.prompt[i] = f"{cur_prompt} assistant: {response}"
tests/integration/defs/test_e2e.py (1)
2573-2584: Ensure the test exercises all prompts by adding --conversation_turns
Avoid coupling to script defaults; make intent explicit.
Apply:
         "--modality", "image",
         "--multiturn",
+        "--conversation_turns",
+        str(len(accuracy_inputs["image"]["prompt"])),
         "--prompt",
2601-2620: Harden checks, reduce log noise, and fix E501 with structured assertions
- Parse once and assert counts.
- Case-insensitive matching.
- Remove verbose prints.
- Break long assert to satisfy E501.
Apply:
- output = llm_venv.run_cmd(cmd, caller=check_output) - print("output:", output) + output = llm_venv.run_cmd(cmd, caller=check_output) + outputs = parse_output(output) + expected = expected_keywords[model_name]["image"] + assert len(outputs) == len(expected), ( + f"Expected {len(expected)} outputs, got {len(outputs)}. " + f"Raw output (truncated): {output[:1000]}..." + ) @@ - for prompt_output, prompt_keywords in zip( - parse_output(output), expected_keywords[model_name]["image"]): - matches = [ - keyword in prompt_output.lower() for keyword in prompt_keywords - ] - obs_match_ratio = 1. * sum(matches) / len(matches) - print("prompt_output:", prompt_output) - print("prompt_keywords:", prompt_keywords) - print("matches:", matches) - print("obs_match_ratio:", obs_match_ratio) - assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}" + for prompt_output, prompt_keywords in zip(outputs, expected): + matches = [keyword.lower() in prompt_output.lower() + for keyword in prompt_keywords] + obs_match_ratio = 1.0 * sum(matches) / len(matches) + assert obs_match_ratio >= match_ratio, ( + f"Incorrect output!\nGenerated \"{prompt_output}\"\n" + f"Expected keywords \"{prompt_keywords}\"\n" + f"Matched keywords: {matches}\n" + f"Observed match ratio {obs_match_ratio} below threshold {match_ratio}" + )
🧹 Nitpick comments (2)
examples/llm-api/quickstart_multimodal.py (2)
111-120: Make conversation_turns optional; default to the number of provided prompts
Defaulting to 2 is surprising. Let users omit it and derive it from len(prompts).
Apply:
- parser.add_argument( - "--conversation_turns", - type=int, - default=2, - help="Number of conversation turns for automated testing.") + parser.add_argument( + "--conversation_turns", + type=int, + default=None, + help="Number of conversation turns (default: use number of provided prompts).")
190-191: Derive max_turns when conversation_turns is omitted
Guard for None to run all provided prompts by default.
Apply:
- max_turns = min(args.conversation_turns, len(args.prompt)) + max_turns = (len(args.prompt) if args.conversation_turns is None + else min(args.conversation_turns, len(args.prompt)))
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
examples/llm-api/quickstart_multimodal.py (3 hunks)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
tests/integration/defs/test_e2e.py (1 hunks)
tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py : same Python guidelines as quoted in the first review details above.
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
examples/llm-api/quickstart_multimodal.py
tests/integration/defs/test_e2e.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
examples/llm-api/quickstart_multimodal.py
tests/integration/defs/test_e2e.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/defs/test_e2e.py
🪛 Ruff (0.12.2)
examples/llm-api/quickstart_multimodal.py
151-151: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
156-156: Do not use bare except
(E722)
tests/integration/defs/test_e2e.py
2610-2610: Line too long (240 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6)
506-507: Migration to CudaGraphConfig looks correct
Using cuda_graph_config=CudaGraphConfig() if cuda_graph else None is the right replacement for the deprecated boolean flag. No issues spotted.
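For readers new to the change, this is the pattern now used across the updated accuracy tests (a sketch; the import path is an assumption, and the surrounding kwargs are illustrative, not the exact file contents):

# Assumed import location for the config object referenced above.
from tensorrt_llm.llmapi import CudaGraphConfig

cuda_graph = True  # parametrized in the real tests
llm_kwargs = dict(
    # Build a config object when CUDA graphs are requested, otherwise pass None.
    cuda_graph_config=CudaGraphConfig() if cuda_graph else None)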
530-531: LGTM: consistent CudaGraphConfig usage in chunked prefill test
Same pattern applied consistently. Good.
651-652: LGTM: FP8 chunked prefill with CudaGraphConfig
Change aligns with the new API; parameters remain unchanged.
675-676: LGTM: FP4 chunked prefill with CudaGraphConfig
Correct and consistent with other tests.
1036-1037: LGTM: Conditional CUDA graph config in CUTEDSL test
Conditional construction via CudaGraphConfig() is correct.
1194-1195: LGTM: Conditional CUDA graph config in 4-GPU CUTEDSL test
Matches the new API across the suite.
/bot run
PR_Github #14714 [ run ] triggered by Bot
PR_Github #14714 [ run ] completed with state
Force-pushed from 5e7b3ee to eaf5a2d
/bot reuse-pipeline
PR_Github #14748 [ reuse-pipeline ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (6)
examples/llm-api/quickstart_multimodal.py (4)
150-158: Fix bare except and finish the docstring sentence.
Ruff flags D415 and E722. Limit exceptions and add a period.
-def clear_cuda_cache():
-    """Clear CUDA cache to prevent memory issues"""
-    try:
-        import torch
-        if torch.cuda.is_available():
-            torch.cuda.empty_cache()
-    except:
-        pass
+def clear_cuda_cache():
+    """Clear CUDA cache to prevent memory issues."""
+    try:
+        import torch
+    except ImportError:
+        # Torch not installed – nothing to clear.
+        return
+    try:
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+    except Exception as err:
+        # Log once, but don't kill the caller.
+        print(f"[warn] Failed to clear CUDA cache: {err}")
196-209: Respect user CLI args instead of hard-coded defaults.
Use args.image_format, args.num_frames, and args.device to match single-turn behavior and user expectations.
-        inputs = default_multimodal_input_loader(
+        inputs = default_multimodal_input_loader(
             tokenizer=llm.tokenizer,
             model_dir=llm._hf_model_dir,
             model_type=model_type,
             modality=args.modality,
             prompts=[cur_prompt],
             media=args.media,
-            image_data_format="pt",
-            num_frames=8,
-            device="cpu")
+            image_data_format=args.image_format,
+            num_frames=args.num_frames,
+            device=args.device)
239-246: Always clear CUDA cache using finally.
On exceptions, the continue skips clear_cuda_cache(); use a finally block.
-        except Exception as e:
-            print(f"Error in turn {i+1}: {e}")
-            import traceback
-            traceback.print_exc()
-            continue
-
-        clear_cuda_cache()
+        except Exception as e:
+            print(f"Error in turn {i+1}: {e}")
+            import traceback
+            traceback.print_exc()
+        finally:
+            clear_cuda_cache()
236-238: Multi-turn context is not persisted — conversation state never grows.
You build cur_prompt but don't save it, so the next turn reloads the original prompt. Persist it.
-            # Add to cur_prompt
-            cur_prompt = cur_prompt + response
+            # Persist conversation context for next turn
+            args.prompt[i] = f"{cur_prompt} assistant: {response}"
Alternative: maintain a history list per sample and concatenate it before calling the loader on each turn, as sketched below.
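A minimal sketch of that alternative (illustrative names only; it keeps one running history per sample instead of mutating args.prompt):

def build_turn_prompt(history, user_msg):
    """Fold prior turns into the prompt handed to the input loader."""
    context = " ".join(
        f"user: {t['user']} assistant: {t['assistant']}" for t in history)
    return f"{context} user: {user_msg}" if context else user_msg

# Per turn:
#   cur_prompt = build_turn_prompt(history, args.prompt[i])
#   ... run default_multimodal_input_loader / llm.generate as before ...
#   history.append({"user": args.prompt[i], "assistant": response})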
tests/integration/defs/test_e2e.py (2)
2573-2584: Make the number of turns explicit to avoid silent truncation.
--conversation_turns defaults to 2; if more prompts are passed later, only two turns would run. Set it to the number of provided prompts for clarity and robustness.
         "--multiturn",
+        "--conversation_turns",
+        str(len(accuracy_inputs["image"]["prompt"])),
         "--prompt",
2609-2620: Harden matching (case-insensitive), remove noisy prints, and fix E501 long line.
This reduces flakiness, cleans logs, and complies with line length rules.
-    for prompt_output, prompt_keywords in zip(
-            parse_output(output), expected_keywords[model_name]["image"]):
-        matches = [
-            keyword in prompt_output.lower() for keyword in prompt_keywords
-        ]
-        obs_match_ratio = 1. * sum(matches) / len(matches)
-        print("prompt_output:", prompt_output)
-        print("prompt_keywords:", prompt_keywords)
-        print("matches:", matches)
-        print("obs_match_ratio:", obs_match_ratio)
-        assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+    for prompt_output, prompt_keywords in zip(outputs, expected):
+        matches = [keyword.lower() in prompt_output.lower()
+                   for keyword in prompt_keywords]
+        obs_match_ratio = 1.0 * sum(matches) / len(matches)
+        assert obs_match_ratio >= match_ratio, (
+            f"Incorrect output!\nGenerated \"{prompt_output}\"\n"
+            f"Expected keywords \"{prompt_keywords}\"\n"
+            f"Matched keywords: {matches}\n"
+            f"Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+        )
🧹 Nitpick comments (1)
tests/integration/defs/test_e2e.py (1)
2602-2602: Reduce CI log noise — drop raw output print and pre-parse instead.
Printing the entire raw output is noisy. Parse once and assert; include truncated raw text only in failure messages if needed.
-    output = llm_venv.run_cmd(cmd, caller=check_output)
-    print("output:", output)
+    output = llm_venv.run_cmd(cmd, caller=check_output)
+    outputs = parse_output(output)
+    expected = expected_keywords[model_name]["image"]
+    assert len(outputs) == len(expected), (
+        f"Expected {len(expected)} outputs, got {len(outputs)}. "
+        f"Raw output (truncated): {output[:1000]}..."
+    )
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
examples/llm-api/quickstart_multimodal.py (3 hunks)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1 hunks)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
tests/integration/defs/test_e2e.py (1 hunks)
tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration/test_lists/qa/llm_function_full.txt
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py : same Python guidelines as quoted in the first review details above.
Files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
examples/llm-api/quickstart_multimodal.py
tests/integration/defs/test_e2e.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
examples/llm-api/quickstart_multimodal.py
tests/integration/defs/test_e2e.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/test_e2e.py
🧬 Code Graph Analysis (2)
examples/llm-api/quickstart_multimodal.py (3)
tensorrt_llm/llmapi/llm.py (4)
  prompt (79-80)
  tokenizer (691-695)
  tokenizer (698-699)
  generate (235-313)
tensorrt_llm/inputs/utils.py (1)
  default_multimodal_input_loader (475-635)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1)
  lora_request (267-288)
tests/integration/defs/test_e2e.py (4)
tests/integration/defs/conftest.py (2)
  llm_models_root (77-83)
  llm_root (180-181)
tests/integration/defs/triton_server/conftest.py (1)
  llm_models_root (16-25)
tests/integration/defs/common.py (1)
  parse_output (665-681)
tests/integration/defs/examples/test_llama.py (1)
  parse_output (129-139)
🪛 Ruff (0.12.2)
examples/llm-api/quickstart_multimodal.py
151-151: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
156-156: Do not use bare except
(E722)
tests/integration/defs/test_e2e.py
2610-2610: Line too long (240 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
624-624: Appropriate Hopper-gating for FP8 model — add an explicit reason if supported.
Adding @skip_pre_hopper to test_auto_dtype for Qwen3-8B-FP8 is correct (FP8 requires Hopper+). If skip_pre_hopper supports a reason message, consider adding one for clarity in CI logs.
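For illustration only (this is not the repository's actual helper): gates of this kind are typically built on pytest.mark.skipif with an explicit reason, e.g.:

import pytest
import torch

# Hopper is SM90; anything older lacks the FP8 support this test needs.
skip_pre_hopper_demo = pytest.mark.skipif(
    not torch.cuda.is_available()
    or torch.cuda.get_device_capability() < (9, 0),
    reason="FP8 accuracy test requires Hopper (SM90) or newer")

The point is simply that a reason string makes the skip self-explanatory in CI logs.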
examples/llm-api/quickstart_multimodal.py (1)
111-119: New CLI flags look good.
--multiturn and --conversation_turns are sensible and scoped. No issues.
PR_Github #14748 [ reuse-pipeline ] completed with state
Force-pushed from eaf5a2d to 0555625
PR_Github #15056 [ reuse-pipeline ] completed with state
…6749) Signed-off-by: Ivy Zhang <[email protected]>
…6749) Signed-off-by: Ivy Zhang <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
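For example, a typical invocation combining a few of the flags above (the stage and GPU names are placeholders taken from the examples in this help text):
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1" --gpu-type "H100_PCIe"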
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.