
Conversation

@nv-guomingz
Collaborator

@nv-guomingz nv-guomingz commented Jul 14, 2025

Clean up the docs by removing the experimental label:

  • PyTorch Backend (Experimental --> Beta)
  • Disagg-serving (Experimental --> Prototype)
  • AutoDeploy (Experimental --> Prototype)
  • Use tensorrtllm_backend for Triton Inference Server (Experimental --> Prototype)

Summary by CodeRabbit

  • Documentation
    • Removed references to features and techniques being "experimental" or subject to change across multiple documentation pages and READMEs.
    • Clarified default behavior and support contexts for specific features in the documentation.
    • Updated explanations and recommendations for FP8 GEMV/GEMM plugin usage, providing more detail and clearer guidance.
    • Simplified or removed descriptions of deprecated or experimental build modes and configuration options.
    • Updated feature status descriptions from "experimental" to "prototype" or "beta" in various documentation and example READMEs.

@nv-guomingz nv-guomingz requested review from QiJune and lowsfer July 14, 2025 08:30
@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 107dbb3 to a808cc8 on July 14, 2025 08:39
@nv-guomingz nv-guomingz requested a review from a team as a code owner July 14, 2025 08:39
@nv-guomingz nv-guomingz requested a review from lucaslie July 14, 2025 08:39
@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from a808cc8 to d69b27e on July 14, 2025 08:48
@nv-guomingz nv-guomingz requested reviews from kaiyux and yweng0828 July 14, 2025 08:48
@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from d69b27e to 909bcb1 on July 14, 2025 08:56
@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 909bcb1 to e3f1e8c on July 14, 2025 11:37
@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch 2 times, most recently from cc18db1 to c6a80d1 on July 14, 2025 14:08
@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from c6a80d1 to ea7d44c on July 28, 2025 17:02
@coderabbitai
Contributor

coderabbitai bot commented Jul 28, 2025

📝 Walkthrough

This update modifies documentation files to remove or reword references to "experimental" status for several features, clarify default behaviors, and update technical explanations. No changes to code or public interfaces are present; all modifications are limited to documentation content and README files.

Changes

Experimental Status Removal (General)
Files: docs/source/advanced/gpt-attention.md, docs/source/torch.md, examples/eagle/README.md, docs/source/reference/precision.md, README.md, docs/source/advanced/disaggregated-service.md, examples/auto_deploy/README.md, examples/disaggregated/README.md, examples/models/core/deepseek_v3/README.md, examples/sample_weight_stripping/README.md
Summary: Removed or replaced references to features being "experimental" with "prototype" or "beta" status for XQA optimization, the PyTorch backend, EAGLE-2, quantization examples, the AutoDeploy backend, the disaggregated service, dynamic scaling, tensorrtllm_backend for Triton, and sample weight stripping. No functional changes made.

Speculative Decoding Documentation
Files: docs/source/advanced/speculative-decoding.md
Summary: Reworded the description of EAGLE speculative decoding to consolidate the EAGLE-1 and EAGLE-2 support mentions, removing the explicit note about EAGLE-2's experimental status.

Performance Benchmarking Documentation
Files: docs/source/performance/perf-benchmarking.md
Summary: Removed the section describing the experimental mode for building TensorRT-LLM engines with target ISL/OSL values, including example commands and explanations.

Model Weights Loader Clarification
Files: docs/source/architecture/model-weights-loader.md
Summary: Clarified that the weights loader is enabled by default for LLaMA and Qwen models only when using the TensorRT flow, specifying the context more precisely.

FP8 Plugin Documentation Update
Files: examples/models/core/llama/README.md
Summary: Updated the explanation of FP8 GEMV/GEMM plugin usage: replaced "Experimental" with "Note," provided a more detailed technical explanation of FP8 GEMV, and removed the warning about performance degradation for larger batch sizes.

Sequence Diagram(s)

No sequence diagrams are generated, as all changes are limited to documentation and do not affect control flow or feature implementation.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

Documentation

Suggested reviewers

  • litaotju
  • syuoni


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
docs/source/torch.md (1)

4-4: Re-phrase for a smoother reading flow

“launches a new backend” sounds like a one-off event. “introduces” (or “adds”) better reflects the documentation’s timeless nature.

-To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new backend based on PyTorch.
+To enhance usability and developer efficiency, TensorRT-LLM introduces a new backend based on PyTorch.
docs/source/advanced/speculative-decoding.md (1)

171-171: Minor grammar & spacing tidy-up

Remove the redundant “of”, add the missing space, and swap the en-dash for a hyphen to stay consistent.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model so that logits prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported).
examples/models/core/llama/README.md (2)

679-679: Capitalise sentence start & tighten wording

-Note: use FP8 GEMV to optimize performance in FP8 small-batch-size cases.
+Note: Use FP8 GEMV to optimise performance in small-batch-size FP8 scenarios.

697-697: Polish long explanatory note for readability

A few micro-fixes improve clarity:

-**Note**: FP8 gemv plugin uses CUDA cores to compute, by contrast to Tensor Core gemm kernel within cuBLAS. Over last year, as cuBLAS have improved their performance by a lot under small M case for Hopper(sm90), FP8 gemv kernel may or may not surpass cuBLAS, depending on specific gemm problem shape. Nonetheless, we still strongly recommend FP8 gemv kernel for Ada (sm89) as cuBLAS still falls behind gemv on it.
+**Note**: The FP8 GEMV plugin runs on CUDA cores, whereas cuBLAS uses Tensor-Core GEMM kernels. Over the last year cuBLAS performance for small-M cases on Hopper (SM90) has improved substantially, so FP8 GEMV may or may not outperform cuBLAS depending on the exact GEMM shape. We still strongly recommend FP8 GEMV on Ada (SM89), where cuBLAS continues to lag behind.
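For reference, a minimal sketch of how the plugin discussed above is typically selected at engine-build time. The checkpoint and output paths below are illustrative placeholders, not taken from this PR, and the --gemm_plugin fp8 flag should be confirmed against the llama README before use:

  # Illustrative sketch; paths are placeholders.
  # Selecting the FP8 GEMM plugin at build time routes small-M shapes
  # through the GEMV path discussed in the note above.
  trtllm-build \
      --checkpoint_dir ./llama_fp8_ckpt \
      --output_dir ./llama_fp8_engine \
      --gemm_plugin fp8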
docs/source/architecture/model-weights-loader.md (1)

252-252: Clarify “by default” clause

The current wording is slightly ambiguous about when the loader is active.

-The weights loader is enabled for LLaMA family models and Qwen models by default with TensorRT flow only.
+By default, the weights loader is enabled for LLaMA-family and Qwen models when using the TensorRT flow.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60e4d3a and ea7d44c.

📒 Files selected for processing (8)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/llama/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • examples/eagle/README.md
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
🧰 Additional context used
🧠 Learnings (2)
docs/source/architecture/model-weights-loader.md (2)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

docs/source/advanced/speculative-decoding.md (1)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from ea7d44c to daa23b3 on July 28, 2025 17:14
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Tighten wording and fix minor grammar issues

The sentence contains a few stylistic hiccups:
• “inside of” → “inside”
• missing space before the parenthesis after “engine”
• plural-singular mismatch in “draft tokens acceptance”
• superfluous comma after “Please”

Proposed tweak:

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ea7d44c and daa23b3.

📒 Files selected for processing (8)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/llama/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
  • examples/eagle/README.md
✅ Files skipped from review due to trivial changes (3)
  • docs/source/architecture/model-weights-loader.md
  • docs/source/reference/precision.md
  • docs/source/torch.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.598Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from daa23b3 to 7e67143 on July 28, 2025 17:22
@coderabbitai coderabbitai bot requested review from litaotju and yizhang-nv July 28, 2025 17:23
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/source/advanced/speculative-decoding.md (2)

171-171: Tighten wording and fix minor formatting issues
The phrase “inside of the TensorRT engine” is redundant, there’s a missing space before the parenthesis, and “draft tokens acceptance” should be singular.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). For details on building and running the model, see the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md).

169-170: Correct typo and clarify sentences
“asssembled” has an extra “s”, and both sentences read more naturally with slight re-phrasing.

-In the EAGLE-1 decoding tree needs to be known during the decoding. In the EAGLE-2 this tree is asssembled during the execution by searching for the most probable hypothesis along the beam.
+In EAGLE-1, the decoding tree must be defined before decoding begins. In EAGLE-2, this tree is assembled during execution by searching for the most probable hypothesis along the beam.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between daa23b3 and 7e67143.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • examples/eagle/README.md
  • docs/source/advanced/gpt-attention.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (9)
  • docs/source/advanced/disaggregated-service.md
  • examples/disaggregated/README.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
  • README.md
  • examples/models/core/deepseek_v3/README.md
  • examples/sample_weight_stripping/README.md
  • docs/source/architecture/model-weights-loader.md
  • docs/source/torch.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.598Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch 2 times, most recently from b569864 to 372fd83 on July 29, 2025 03:11
@nv-guomingz
Collaborator Author

/bot run --stage-list "A10-Build_Docs"

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Polish grammar & spacing for clarity

Minor wording and punctuation tweaks improve readability:

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logit prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

Key fixes:
• “inside the” instead of “inside of the”.
• Add space before the parenthesis.
• Oxford comma after “acceptance”.
• Singular “logit” and hyphenate compound nouns.
• Remove superfluous comma after “Please”.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7e67143 and 372fd83.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (8)
  • README.md
  • docs/source/advanced/disaggregated-service.md
  • examples/sample_weight_stripping/README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/disaggregated/README.md
  • docs/source/reference/precision.md
  • examples/auto_deploy/README.md
  • examples/models/core/deepseek_v3/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/models/core/llama/README.md
  • docs/source/torch.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)

Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #13281 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13281 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9920 (Partly Tested) completed with status: 'FAILURE'

@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 372fd83 to f0fe05c on August 6, 2025 05:29
@nv-guomingz nv-guomingz requested a review from a team as a code owner August 6, 2025 05:29
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
docs/source/advanced/speculative-decoding.md (1)

171-171: Drop “of” after “inside” and fix missing space before parenthesis

Small wording/formatting tweaks improve readability.

-... performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+... performed inside the TensorRT engine (both EAGLE-1 and EAGLE-2 are supported).
examples/disaggregated/README.md (1)

112-116: Fix typo in YAML key refresh_interval
refersh_interval is misspelled. Anyone copying this sample will hit a configuration error.

-  refersh_interval: 10.0
+  refresh_interval: 10.0
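For reference, a quick way to confirm the misspelled key does not linger elsewhere in the tree (plain grep; the search paths are illustrative):

  # Flag any remaining occurrences of the misspelled key so copied
  # configs do not silently carry it forward.
  grep -rn "refersh_interval" examples/ docs/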
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 372fd83 and f0fe05c.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • docs/source/advanced/disaggregated-service.md
  • examples/auto_deploy/README.md
  • examples/models/core/deepseek_v3/README.md
  • README.md
  • docs/source/reference/precision.md
  • docs/source/torch.md
  • docs/source/architecture/model-weights-loader.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/sample_weight_stripping/README.md
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md

86-86: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


86-86: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/disaggregated/README.md (1)

86-86: Status label update looks good
The heading change from “Experimental” to “Prototype” accurately reflects the new maturity stage and keeps terminology consistent across the docs.

@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from f0fe05c to 3aa3446 on August 6, 2025 16:18
@nv-guomingz
Collaborator Author

/bot skip --comment "docs only change"

@nv-guomingz nv-guomingz enabled auto-merge (squash) August 6, 2025 16:18
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
docs/source/advanced/speculative-decoding.md (1)

171-171: Tighten grammar & spacing for clarity

Minor wording polish:
• “inside of” → “inside” (redundant “of”).
• Insert Oxford comma after “acceptance”.
• Add space before the opening parenthesis.
• Drop comma after “Please”.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
examples/disaggregated/README.md (2)

110-116: Fix typo in key name – refresh_interval

refersh_interval will confuse users who copy-paste the YAML and may break config loaders that validate keys.

-  refersh_interval: 10.0
+  refresh_interval: 10.0

181-183: Correct section title – “Known Issues”

Minor wording nit:

-## Know Issues
+## Known Issues

This keeps terminology consistent across the docs.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0fe05c and 3aa3446.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • docs/source/architecture/model-weights-loader.md
  • README.md
  • examples/models/core/deepseek_v3/README.md
  • docs/source/torch.md
  • docs/source/advanced/disaggregated-service.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/sample_weight_stripping/README.md
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md

86-86: Unordered list style
Expected: dash; Actual: asterisk

(MD004, ul-style)


86-86: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

🔇 Additional comments (1)
examples/disaggregated/README.md (1)

86-86: Verify downstream links & anchor IDs after heading rename

Changing the heading from “Dynamic scaling (Experimental)” to “Dynamic scaling (Prototype)” also changes the automatic Markdown anchor (GitHub → #dynamic-scaling-prototype).
Please grep the repo (docs, blogs, notebooks) for #dynamic-scaling-experimental and update any in-page links or TOC generators that still point to the old anchor.
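A minimal sketch of that check, assuming a standard repository checkout (search paths are illustrative):

  # List references that still point at the old anchor so they can be
  # updated to #dynamic-scaling-prototype.
  grep -rn "dynamic-scaling-experimental" docs/ examples/ README.md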

@tensorrt-cicd
Collaborator

PR_Github #14318 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14318 [ skip ] completed with state SUCCESS
Skipping testing for commit 3aa3446

@nv-guomingz nv-guomingz changed the title from "doc: remove the outdated features which marked as Experimental" to "[None][doc]: remove the outdated features which marked as Experimental" on Aug 7, 2025
@nv-guomingz nv-guomingz force-pushed the user/guomingz/clean_experimental branch from 3aa3446 to 605a8c6 on August 7, 2025 01:40
@nv-guomingz
Collaborator Author

/bot skip --comment "docs only change"

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)

171-171: Remove redundant “of” and clean up punctuation

Minor wording and punctuation polish for clarity and consistency.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (both EAGLE-1 and EAGLE-2 are supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3aa3446 and 605a8c6.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • docs/source/advanced/disaggregated-service.md (1 hunks)
  • docs/source/advanced/gpt-attention.md (0 hunks)
  • docs/source/advanced/speculative-decoding.md (1 hunks)
  • docs/source/architecture/model-weights-loader.md (1 hunks)
  • docs/source/performance/perf-benchmarking.md (0 hunks)
  • docs/source/reference/precision.md (1 hunks)
  • docs/source/torch.md (1 hunks)
  • examples/auto_deploy/README.md (2 hunks)
  • examples/disaggregated/README.md (1 hunks)
  • examples/eagle/README.md (0 hunks)
  • examples/models/core/deepseek_v3/README.md (2 hunks)
  • examples/models/core/llama/README.md (2 hunks)
  • examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
  • docs/source/advanced/gpt-attention.md
  • examples/eagle/README.md
  • docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
  • examples/disaggregated/README.md
  • docs/source/advanced/disaggregated-service.md
  • examples/models/core/deepseek_v3/README.md
  • README.md
  • docs/source/architecture/model-weights-loader.md
  • examples/auto_deploy/README.md
  • docs/source/reference/precision.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/sample_weight_stripping/README.md
  • docs/source/torch.md
  • examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • docs/source/advanced/speculative-decoding.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md

[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...

(OUTSIDE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #14351 [ skip ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14351 [ skip ] completed with state SUCCESS
Skipping testing for commit 605a8c6

@nv-guomingz nv-guomingz merged commit f7f46a5 into NVIDIA:main Aug 7, 2025
3 of 4 checks passed
nv-guomingz added a commit to nv-guomingz/TensorRT-LLM that referenced this pull request Aug 7, 2025
@nv-guomingz nv-guomingz deleted the user/guomingz/clean_experimental branch September 30, 2025 07:46