fix: support mixture of text & multimodal prompts #6345

yechank-nvidia · 2025-07-25T00:40:51Z

Summary by CodeRabbit

Bug Fixes
- Improved handling of empty or missing multimodal data across several models, preventing errors when such data is not provided.
- Relaxed strict checks on the number of multimodal inputs, allowing more flexible input scenarios.
- Enhanced input processors to safely return results when multimodal data is absent.
New Features
- Added support for new multimodal input types: "multiple_image" and "mixture_text_image".
- Introduced a device selection option for input data processing in the multimodal quickstart example.
Refactor
- Clarified variable names related to multimodal embeddings for better readability.

coderabbitai · 2025-07-25T00:41:05Z

📝 Walkthrough

The changes relax strict checks and assertions on the presence and count of multimodal inputs in several model forward methods and input processors. Early returns and conditional logic are introduced to handle cases where multimodal data is absent, and variable names are clarified throughout. No public interfaces or method signatures are modified. Additionally, new multimodal modalities are added to example scripts and test suites, and input loading logic is extended to support these new modalities.

Changes

Change group	File(s)
Model forward methods with multimodal embedding handling	`tensorrt_llm/_torch/models/modeling_gemma3vl.py`, `modeling_hyperclovax.py`, `modeling_llama.py`, `modeling_llava_next.py`, `modeling_mistral.py`, `modeling_phi4mm.py`, `modeling_vila.py`
Example multimodal script and input loader	`examples/llm-api/quickstart_multimodal.py`, `tensorrt_llm/inputs/utils.py`
Integration tests and test lists	`tests/integration/defs/test_e2e.py`, `tests/integration/test_lists/qa/examples_test_list.txt`, `tests/integration/test_lists/qa/llm_sanity_test.txt`, `tests/integration/test_lists/test-db/l0_h100.yml`

Sequence Diagram(s)

sequenceDiagram
    participant InputProcessor
    participant Model

    InputProcessor->>InputProcessor: Receive text & multimodal data
    alt Multimodal data absent
        InputProcessor-->>Caller: Return tokenized IDs, empty multimodal dict
    else Multimodal data present
        InputProcessor->>InputProcessor: Preprocess multimodal data
        InputProcessor-->>Caller: Return tokenized IDs, multimodal dict
    end

    Caller->>Model: Call forward with input IDs and multimodal params
    alt Multimodal params present
        Model->>Model: Extract multimodal embeddings
        Model->>Model: Fuse embeddings with input tokens
    else Multimodal params absent
        Model->>Model: Proceed without multimodal embeddings
    end
    Model-->>Caller: Return model output
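
The flow in the diagram can be sketched in plain Python. This is only an illustration under stated assumptions: `process_inputs`, `forward`, the fake tokenizer, and the `"multimodal_data"` key are hypothetical stand-ins and are not the TensorRT-LLM input processors or model classes.

```python
from typing import Any, Dict, List, Optional, Tuple


def process_inputs(
    text: str,
    multi_modal_data: Optional[Dict[str, List[str]]] = None,
) -> Tuple[List[int], Dict[str, Any]]:
    """Hypothetical input processor: tokenize, then optionally preprocess media."""
    # Stand-in tokenizer: one fake token ID per whitespace-separated word.
    token_ids = [len(word) for word in text.split()]
    if not multi_modal_data:
        # Text-only prompt: return early with an empty multimodal dict
        # instead of asserting that media must be present.
        return token_ids, {}
    # Stand-in preprocessing: tag every media item as processed.
    processed = {
        kind: [f"processed:{item}" for item in items]
        for kind, items in multi_modal_data.items()
    }
    return token_ids, {"multimodal_data": processed}


def forward(token_ids: List[int], mm_params: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical forward pass: fuse embeddings only when they exist."""
    mm_embeds = mm_params.get("multimodal_data", {})
    if mm_embeds:
        num_items = sum(len(items) for items in mm_embeds.values())
        return {"tokens": token_ids, "fused": True, "num_mm_items": num_items}
    # No multimodal parameters: proceed with the text tokens alone.
    return {"tokens": token_ids, "fused": False, "num_mm_items": 0}
```

Both branches return the same output shape, so a caller can batch text-only and multimodal prompts without special-casing either one.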

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested reviewers

brb-nv
lfr-0531
symphonylyh

brb-nv

LGTM.

amukkara

LGTM

2ez4bz · 2025-07-25T06:59:15Z

Are there any unit tests we could be adding?

yechank-nvidia · 2025-07-29T04:39:23Z

@2ez4bz I added a test for the Mistral case. Other models can use it as a reference.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

tests/integration/test_lists/qa/llm_sanity_test.txt (1)

105-105: Add …-mixture_text_image-False variant for symmetry

All other multimodal test-list entries come in <modality>-True and <modality>-False pairs to exercise both code paths. Adding only the True variant leaves the “no mixture” branch untested and breaks the implicit pattern.

tests/integration/test_lists/qa/examples_test_list.txt (1)

539-539: Mirror the new test with the False flag and verify naming consistency

Include …-mixture_text_image-False to exercise the code path where mixed images & text are disabled (keeps parity with the existing image-False/True and video-False/True cases).

Double-check that the modality string mixture_text_image matches what the loader expects (no camel-case or dash variations).
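
A pairing check like the one these nitpicks describe could be automated with a small helper. The sketch below assumes each test-list entry ends in `-True` or `-False`, optionally followed by a closing `]`. The real entry format is elided in the comments above, so `find_unpaired_entries` and the sample IDs in its usage are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def find_unpaired_entries(entries: Iterable[str]) -> List[str]:
    """Return entry prefixes that have only one of the -True / -False variants."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for entry in entries:
        # Drop surrounding whitespace and a trailing pytest-style bracket.
        cleaned = entry.strip().rstrip("]")
        prefix, _, flag = cleaned.rpartition("-")
        if flag in ("True", "False"):
            seen[prefix].add(flag)
    # A prefix is paired only when both flags were observed.
    return sorted(p for p, flags in seen.items() if flags != {"True", "False"})
```

For example, passing `["t[m-image-True]", "t[m-image-False]", "t[m-mixture_text_image-True]"]` reports only the `mixture_text_image` prefix as unpaired.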

tensorrt_llm/inputs/utils.py (1)

473-585: Consider adding unit tests for the new modalities.

The implementation of "multiple_image" and "mixture_text_image" modalities is solid and well-integrated. Given that a commenter on the PR inquired about unit tests, consider adding test cases to verify:

- "multiple_image" modality processes multiple images correctly
- "mixture_text_image" modality handles mixed content with empty media slots
- Conditional "multi_modal_data" inclusion works for both text-only and multimodal prompts

Would you like me to generate unit test cases for these new modalities to ensure comprehensive coverage of the mixed text/multimodal prompt functionality?

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51c8045 and 0d2ee11.

📒 Files selected for processing (13)

examples/llm-api/quickstart_multimodal.py (5 hunks)
tensorrt_llm/_torch/models/modeling_gemma3vl.py (0 hunks)
tensorrt_llm/_torch/models/modeling_hyperclovax.py (1 hunks)
tensorrt_llm/_torch/models/modeling_llama.py (1 hunks)
tensorrt_llm/_torch/models/modeling_llava_next.py (2 hunks)
tensorrt_llm/_torch/models/modeling_mistral.py (2 hunks)
tensorrt_llm/_torch/models/modeling_phi4mm.py (1 hunks)
tensorrt_llm/_torch/models/modeling_vila.py (2 hunks)
tensorrt_llm/inputs/utils.py (3 hunks)
tests/integration/defs/test_e2e.py (3 hunks)
tests/integration/test_lists/qa/examples_test_list.txt (1 hunks)
tests/integration/test_lists/qa/llm_sanity_test.txt (1 hunks)
tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)

💤 Files with no reviewable changes (1)

tensorrt_llm/_torch/models/modeling_gemma3vl.py

✅ Files skipped from review due to trivial changes (1)

tests/integration/test_lists/test-db/l0_h100.yml

🚧 Files skipped from review as they are similar to previous changes (6)

tensorrt_llm/_torch/models/modeling_hyperclovax.py
tensorrt_llm/_torch/models/modeling_llama.py
tensorrt_llm/_torch/models/modeling_mistral.py
tensorrt_llm/_torch/models/modeling_llava_next.py
tensorrt_llm/_torch/models/modeling_phi4mm.py
tensorrt_llm/_torch/models/modeling_vila.py

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case, and prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

tests/integration/defs/test_e2e.py
examples/llm-api/quickstart_multimodal.py
tensorrt_llm/inputs/utils.py

**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

tests/integration/defs/test_e2e.py
examples/llm-api/quickstart_multimodal.py
tensorrt_llm/inputs/utils.py

🧠 Learnings (5)

📓 Common learnings

Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

tensorrt_llm/inputs/utils.py (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (11)

tests/integration/defs/test_e2e.py (3)

1942-1942: LGTM! Good addition of the new modality to test coverage.

The addition of "mixture_text_image" to the parametrized test extends coverage to the new modality, following the established pattern for multimodal testing.

1990-1999: Well-structured test data for the new mixture modality.

The test data effectively covers both text-only and image-based prompts within the same modality, with corresponding empty and populated media lists. This design properly validates the mixture functionality.
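
The shape described here, one text-only prompt with an empty media list next to one image prompt with a populated list, might look like the sketch below. The prompt strings and the URL are placeholders invented for illustration, not the values used in `test_e2e.py`, and `count_media_per_prompt` is a hypothetical helper.

```python
from typing import Dict, List

# Placeholder data: prompts and media are parallel lists, one media list per prompt.
MIXTURE_EXAMPLE: Dict[str, list] = {
    "prompts": [
        "Who invented the internet?",        # text-only prompt
        "Describe the scene in the image.",  # image prompt
    ],
    "media": [
        [],                                       # no media for the text-only prompt
        ["https://example.com/placeholder.png"],  # hypothetical image URL
    ],
}


def count_media_per_prompt(example: Dict[str, list]) -> List[int]:
    """Check that prompts and media line up, then return media counts per prompt."""
    prompts, media = example["prompts"], example["media"]
    if len(prompts) != len(media):
        raise ValueError("each prompt needs a media list, even if it is empty")
    return [len(items) for items in media]
```

Keeping an explicit empty list for the text-only prompt preserves the one-to-one alignment between prompts and media.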

2055-2057: Appropriate expected keywords for the mixture modality test.

The keywords align well with the test prompts - internet invention-related terms for the text prompt and landscape descriptors for the image prompt.

examples/llm-api/quickstart_multimodal.py (5)

58-77: Well-designed example data for new multimodal modalities.

The addition of "multiple_image" and "mixture_text_image" modalities follows the established pattern and provides appropriate test cases. The mixture modality correctly demonstrates both text-only and image-based prompts with corresponding media structures.

88-91: LGTM! Proper addition of new modalities to CLI choices.

The new modality options are correctly added to the argument parser choices, making them available for command-line usage.

107-110: Useful addition of device configuration option.

The new --device argument provides helpful flexibility for controlling input tensor device placement while maintaining backward compatibility with the "cpu" default.

162-166: Good repositioning of default data assignment logic.

Moving the default prompt and media assignment after model type determination is a logical improvement that ensures the model type is available before processing example data.

175-175: LGTM! Proper use of the configurable device parameter.

Replacing the hardcoded "cpu" with args.device correctly implements the device configuration feature introduced earlier.
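
A minimal sketch of the two CLI changes discussed above, using only the standard `argparse` module. The modality choices come from the diff quoted later in this thread and the "cpu" default from the review comment; the `--modality` default, the help strings, and `build_parser` itself are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser resembling the quickstart options (sketch only)."""
    parser = argparse.ArgumentParser(description="Multimodal quickstart sketch")
    parser.add_argument(
        "--modality",
        type=str,
        default="image",  # assumed default, not confirmed by the PR
        choices=[
            "image", "video", "audio", "image_audio",
            "multiple_image", "mixture_text_image",
        ],
        help="Which kind of example input to run.",
    )
    parser.add_argument(
        "--device",
        type=str,
        default="cpu",  # keeps the previous hardcoded behavior by default
        help="Device on which the input data is placed.",
    )
    return parser
```

With this in place, the loader call would receive `args.device` rather than a literal `"cpu"`, so existing invocations behave as before while a different device can be requested explicitly.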

tensorrt_llm/inputs/utils.py (3)

490-490: LGTM! Clean addition of multiple_image support.

The extension to support "multiple_image" modality alongside "image" is well-implemented. Both modalities are processed identically, which maintains consistency in the downstream pipeline while providing a more descriptive option for users working with multiple images.

533-541: LGTM! Appropriate handling for mixture_text_image modality.

The implementation correctly filters out empty media items while processing valid images. This design supports the mixed text/image use case where some positions in the media array may be empty (representing text-only segments). The conditional check if m: prevents potential errors from attempting to load empty or None media items.
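
The `if m:` guard can be illustrated as follows. This is a sketch and not the code in `tensorrt_llm/inputs/utils.py`: `load_mixture_media` is a hypothetical helper, and the loader is passed in as a callable so the example stays self-contained.

```python
from typing import Callable, Dict, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def load_mixture_media(
    media: Sequence[Optional[Sequence[str]]],
    load_image: Callable[[str], T],
) -> Dict[int, List[T]]:
    """Load images only for prompts that have media, keyed by prompt index."""
    loaded: Dict[int, List[T]] = {}
    for idx, m in enumerate(media):
        if m:  # skips both empty lists and None (text-only prompts)
            loaded[idx] = [load_image(item) for item in m]
    return loaded
```

Keying the result by prompt index is a design choice of this sketch: it drops the empty slots while still recording which prompt each set of images belongs to.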

573-583: LGTM! Improved formatting and conditional multimodal data inclusion.

The refactoring enhances code readability and correctly implements conditional inclusion of multimodal data. The key improvement is only adding "multi_modal_data" to the input dictionary when multimodal placeholders are actually present (mm_placeholder_counts is truthy). This prevents unnecessary inclusion of empty multimodal data in text-only prompts, which aligns perfectly with the PR objective of supporting mixed text and multimodal prompts.
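
The conditional inclusion described in this comment reduces to a small pattern. The sketch below reuses the names `mm_placeholder_counts` and `"multi_modal_data"` from the comment; the function `build_input` and the `"prompt"` key are assumptions for illustration.

```python
from typing import Any, Dict


def build_input(
    prompt: str,
    mm_placeholder_counts: Dict[str, int],
    mm_data: Dict[str, Any],
) -> Dict[str, Any]:
    """Attach multimodal data only when the prompt contains placeholders."""
    inputs: Dict[str, Any] = {"prompt": prompt}
    if mm_placeholder_counts:
        # Only prompts with at least one placeholder carry multimodal data.
        inputs["multi_modal_data"] = mm_data
    return inputs
```

A text-only prompt then yields a dictionary without the `"multi_modal_data"` key at all, which downstream code can test with a simple membership check.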

yechank-nvidia · 2025-07-29T10:20:16Z

/bot run

tensorrt-cicd · 2025-07-29T10:25:18Z

PR_Github #13352 [ run ] triggered by Bot

tensorrt-cicd · 2025-07-29T21:31:55Z

PR_Github #13352 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9981 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

chang-l · 2025-07-30T04:25:36Z

examples/llm-api/quickstart_multimodal.py

-                        choices=["image", "video", "audio", "image_audio"],
+                        choices=[
+                            "image", "video", "audio", "image_audio",
+                            "multiple_image", "mixture_text_image"


Sorry to comment again on a closed PR, but a quick question: do we actually need to define a new modality here (beyond image, video, etc.) when there are multiple images or videos?

Could we update the default loader to accommodate the various combinations (pure text, multiple images with text, a single image with text, etc.)?

yechank-nvidia self-assigned this Jul 25, 2025

yechank-nvidia requested a review from a team as a code owner July 25, 2025 00:40

yechank-nvidia requested review from Naveassaf and schetlur-nv July 25, 2025 00:40

coderabbitai bot requested review from FrankD412, hyukn, nv-yilinf and tijyojwad July 25, 2025 00:41

brb-nv approved these changes Jul 25, 2025

amukkara approved these changes Jul 25, 2025

hyukn reviewed Jul 25, 2025

tensorrt_llm/_torch/models/modeling_vila.py (review thread resolved)

yechank-nvidia added 2 commits July 29, 2025 13:36

fix: support mixture of text & multimodal prompts

69857bf

Signed-off-by: yechank <[email protected]>

add test mixture_text_image

0d2ee11

Signed-off-by: yechank <[email protected]>

yechank-nvidia force-pushed the fix_mixture_prompts branch from 51c8045 to 0d2ee11 Compare July 29, 2025 04:37

coderabbitai bot requested review from brb-nv, lfr-0531 and symphonylyh July 29, 2025 04:37

coderabbitai bot reviewed Jul 29, 2025

hyukn approved these changes Jul 30, 2025

hyukn merged commit d6eb8e2 into NVIDIA:main Jul 30, 2025
3 checks passed

chang-l reviewed Jul 30, 2025

lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025

fix: support mixture of text & multimodal prompts (NVIDIA#6345)

44e8ac8

Signed-off-by: yechank <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>

jain-ria pushed a commit to jain-ria/TensorRT-LLM that referenced this pull request Aug 7, 2025

fix: support mixture of text & multimodal prompts (NVIDIA#6345)

ac39af5

Signed-off-by: yechank <[email protected]>

coderabbitai bot mentioned this pull request Aug 11, 2025

[TRTLLM-6975][test] Add multi-turn test cases for VLM models #6749

Merged

This was referenced Aug 12, 2025

[TRTLLM-6771][feat] Support MMMU for multimodal models #6828

Merged

[TRTLLM-7094][feat] Gpt-oss reasoning content parsing in trtllm-serve #6854

Closed
