
Conversation

@nv-guomingz (Collaborator) commented Aug 7, 2025

Cherry-pick #5995 into 1.0 release branch.

Summary by CodeRabbit

  • Documentation
    • Updated multiple documentation files to replace the term "experimental" with "prototype" or "beta" for various features and components.
    • Clarified or removed references to experimental status in descriptions, notes, and section headers across user guides, READMEs, and advanced usage docs.
    • Improved explanations and recommendations for certain features (e.g., FP8 GEMV/GEMM plugin, weights loader).
    • Corrected minor reference links and simplified some feature descriptions for clarity.

nv-guomingz requested a review from a team as a code owner on August 7, 2025 02:26
@coderabbitai bot (Contributor) commented Aug 7, 2025

📝 Walkthrough

This change updates documentation across multiple files to revise feature status terminology, primarily replacing "experimental" with "prototype" or "beta" in various feature descriptions, notes, and section headers. Some explanatory notes about experimental status were removed or reworded, but no technical content, logic, or functionality was changed.

Changes

| Cohort / File(s) | Change Summary |
|------------------|----------------|
| **AutoDeploy/Disaggregated/General Feature Status**<br>`README.md`, `examples/auto_deploy/README.md`, `examples/disaggregated/README.md`, `examples/sample_weight_stripping/README.md`, `examples/models/core/deepseek_v3/README.md` | Changed feature status terminology from "experimental" to "prototype" in descriptions, headers, and disclaimers. |
| **Documentation: Model/Backend Status**<br>`docs/source/torch.md` | Updated PyTorch backend status from "experimental" to "beta" and simplified the description. |
| **Documentation: Disaggregated Service**<br>`docs/source/advanced/disaggregated-service.md` | Changed feature status from "Experimental" to "Prototype" in the title and note. |
| **Documentation: XQA Optimization**<br>`docs/source/advanced/gpt-attention.md` | Removed lines describing XQA as experimental and listing supported models. |
| **Documentation: EAGLE Model**<br>`docs/source/advanced/speculative-decoding.md`, `examples/eagle/README.md` | Removed or reworded notes about EAGLE-2 being experimental; merged model support statements. |
| **Documentation: Weights Loader**<br>`docs/source/architecture/model-weights-loader.md` | Removed the "experimental" qualifier and clarified the enabling condition for the weights loader. |
| **Documentation: Benchmarking**<br>`docs/source/performance/perf-benchmarking.md` | Removed the section on experimental ISL/OSL tuning hints in engine building. |
| **Documentation: Quantization/Precision**<br>`docs/source/reference/precision.md`, `examples/models/core/llama/README.md` | Removed "experimental" notes from quantization technique examples and FP8 plugin descriptions; clarified plugin usage and performance notes. |

Sequence Diagram(s)

No sequence diagram generated, as all changes are limited to documentation and do not affect control flow or feature logic.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

Documentation

Suggested reviewers

  • litaotju
  • syuoni
  • laikhtewari


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
docs/source/architecture/model-weights-loader.md (1)

252-253: Clarify wording of default-enablement sentence

“The weights loader is enabled for LLaMA family models and Qwen models by default with TensorRT flow only.” is hard to parse. Consider:

```diff
-The weights loader is enabled for LLaMA family models and Qwen models by default with TensorRT flow only.
+The weights loader is enabled by default for LLaMA-family and Qwen models **when the TensorRT flow is used**.
```
docs/source/advanced/speculative-decoding.md (1)

171-172: Grammar & spacing fix

Drop the redundant “of” and insert a missing space after “engine”.

```diff
-... performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+... performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported).
```
examples/models/core/llama/README.md (2)

679-681: Capitalize and tighten FP8 GEMV note

Minor style improvement for readability.

```diff
-Note: use FP8 GEMV to optimize performance in FP8 small-batch-size cases.
+Note: Use the FP8 GEMV plugin to optimize small-batch-size FP8 workloads.
```

697-699: Consistent kernel naming and punctuation

gemv/gemm mixes lower- and upper-case; stick to upper-case for acronyms and add a comma after “engine”.

```diff
-**Note**: FP8 gemv plugin uses CUDA cores to compute, by contrast to Tensor Core gemm kernel within cuBLAS.
+**Note**: The FP8 GEMV plugin uses CUDA cores, in contrast to the Tensor-Core GEMM kernels within cuBLAS.
```
examples/models/core/deepseek_v3/README.md (2)

33-33: Capitalize product name for consistency

In the bullet text, “triton inference server” should follow the official branding “Triton Inference Server” (capital “T” and “I”). This capitalization is already used elsewhere in project docs and improves professionalism.

```diff
-    - [tensorrtllm\_backend for triton inference server (Prototype)](#tensorrtllm_backend-for-triton-inference-server-prototype)
+    - [tensorrtllm\_backend for Triton Inference Server (Prototype)](#tensorrtllm_backend-for-triton-inference-server-prototype)
```

395-397: Minor wording & branding polish

  1. Heading: same capitalization concern as above – “Triton Inference Server” is the proper name.
  2. Body: “the pytorch path” → “the PyTorch path/back-end” to keep style consistent and capitalize the framework name.
```diff
-### tensorrtllm_backend for triton inference server (Prototype)
+### tensorrtllm_backend for Triton Inference Server (Prototype)

-To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), make sure the version is v0.19+ in which the pytorch path is added as a prototype feature.
+To serve the model using [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend.git), ensure you are on version v0.19 or later, where the **PyTorch** path is available as a prototype feature.
```
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR, between commits 13e0214 and e2b6edb.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • examples/eagle/README.md
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/sample_weight_stripping/README.md
  • docs/source/advanced/speculative-decoding.md
  • docs/source/torch.md
  • README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/models/core/deepseek_v3/README.md
📚 Learning: in tensorrt-llm, test files (files under tests/ directories) do not require nvidia copyright headers...
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • examples/sample_weight_stripping/README.md
  • docs/source/torch.md
  • README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/models/core/deepseek_v3/README.md
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • examples/sample_weight_stripping/README.md
  • docs/source/advanced/speculative-decoding.md
  • docs/source/torch.md
  • README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/models/core/deepseek_v3/README.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • examples/sample_weight_stripping/README.md
  • docs/source/advanced/speculative-decoding.md
  • README.md
  • docs/source/architecture/model-weights-loader.md
📚 Learning: in tensorrt_llm/executor/worker.py, the lora adapter cache optimization logic that checks `is_adapte...
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Applied to files:

  • docs/source/architecture/model-weights-loader.md
📚 Learning: ministral is a valid model name from mistral ai, distinct from the regular mistral models. in tensor...
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid model name from Mistral AI, distinct from the regular Mistral models. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • examples/models/core/deepseek_v3/README.md
📚 Learning: ministral is a valid and distinct model family from mistral ai, separate from their regular mistral ...
Learnt from: venkywonka
PR: NVIDIA/TensorRT-LLM#6650
File: tests/integration/test_lists/qa/llm_perf_cluster.yml:33-37
Timestamp: 2025-08-06T03:47:16.802Z
Learning: Ministral is a valid and distinct model family from Mistral AI, separate from their regular Mistral models. Ministral 8B is specifically designed for edge computing and on-device applications, released in October 2024. In TensorRT-LLM test configurations, "ministral_8b" and "ministral_8b_fp8" are correct model identifiers and should not be changed to "mistral_8b".

Applied to files:

  • examples/models/core/deepseek_v3/README.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

🔇 Additional comments (11)
docs/source/reference/precision.md (1)

104-107: GPT-J example link path verified

The directory examples/models/contrib/gptj/ exists in the repository, so the source:examples/models/contrib/gptj link will resolve correctly.

examples/disaggregated/README.md (1)

86-88: Update internal anchor references after header rename

Renaming the header to “Dynamic scaling (Prototype)” changes its markdown anchor (now #dynamic-scaling-prototype).
Please make sure any intra-document links or external references to the old #dynamic-scaling anchor are updated accordingly.
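
For example, a minimal markdown sketch of the link update implied here (assuming GitHub-style anchor generation, where heading text is lowercased, spaces become hyphens, and parentheses are dropped):

```markdown
<!-- hypothetical intra-document link, before and after the header rename -->
[Dynamic scaling](#dynamic-scaling)                        <!-- old anchor, now broken -->
[Dynamic scaling (Prototype)](#dynamic-scaling-prototype)  <!-- updated anchor -->
```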

README.md (1)

253-253: LGTM! Terminology update aligns with project-wide standardization.

The change from "experimental" to "prototype" for the AutoDeploy backend description is consistent with the broader documentation update effort across the project.

docs/source/advanced/disaggregated-service.md (2)

3-3: LGTM! Header terminology update is consistent.

The section header change from "Experimental" to "Prototype" aligns with the project-wide effort to standardize feature status descriptions.


7-7: LGTM! Note text updated consistently.

The note text properly reflects the new "prototype" status, maintaining consistency with the header change and the broader documentation update.

examples/sample_weight_stripping/README.md (2)

15-15: LGTM! Table of contents updated consistently.

The table of contents entry correctly reflects the section header change from "Experimental" to "Prototype".


242-242: LGTM! Section header terminology standardized.

The section header change from "Experimental" to "Prototype" is consistent with the project-wide documentation update and aligns with the table of contents change.

examples/auto_deploy/README.md (2)

9-9: LGTM! Feature description updated consistently.

The change from "experimental" to "prototype" in the AutoDeploy feature description aligns with the project-wide terminology standardization effort.


453-453: LGTM! Disclaimer updated with consistent terminology.

The disclaimer section properly reflects the new "prototype" status while maintaining the appropriate cautionary language about the feature being in active development and beta stage.

docs/source/torch.md (2)

5-5: LGTM! Feature status updated to beta.

The note correctly reflects the change from "experimental" to "beta" status for the PyTorch backend feature.


7-7: LGTM! Description simplified appropriately.

The removal of "experimental" qualifier from the backend description streamlines the text while maintaining clarity about it being a new backend based on PyTorch.

@nv-guomingz (Collaborator, Author):

/bot run --stage-list "A10-Build-docs"

@tensorrt-cicd (Collaborator):

PR_Github #14417 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator):

PR_Github #14417 [ run ] completed with state FAILURE
/LLM/release-1.0/L0_MergeRequest_PR pipeline #8 (Partly Tested) completed with status: 'FAILURE'

@nv-guomingz (Collaborator, Author):

/bot run --stage-list "A10-Build_Docs"

nv-guomingz closed this Aug 7, 2025
nv-guomingz force-pushed the user/guomingz/cherry-pick-5995 branch from e2b6edb to 53f94a4 on August 7, 2025 08:07
@tensorrt-cicd (Collaborator):

PR_Github #14426 [ ] completed with state FAILURE
Not allowed on merged PR

nv-guomingz deleted the user/guomingz/cherry-pick-5995 branch on September 30, 2025 07:59