[None][doc]: remove the outdated features which marked as Experimental #5995
Conversation
Force-pushed 107dbb3 to a808cc8 (Compare)
Force-pushed a808cc8 to d69b27e (Compare)
Force-pushed d69b27e to 909bcb1 (Compare)
Force-pushed 909bcb1 to e3f1e8c (Compare)
Force-pushed cc18db1 to c6a80d1 (Compare)
Force-pushed c6a80d1 to ea7d44c (Compare)
📝 Walkthrough

This update modifies documentation files to remove or reword references to "experimental" status for several features, clarify default behaviors, and update technical explanations. No changes to code or public interfaces are present; all modifications are limited to documentation content and README files.

No sequence diagrams are generated, as all changes are limited to documentation and do not affect control flow or feature implementation.

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Actionable comments posted: 1
🧹 Nitpick comments (5)
docs/source/torch.md (1)

4-4: Re-phrase for a smoother reading flow

“launches a new backend” sounds like a one-off event. “introduces” (or “adds”) better reflects the documentation’s timeless nature.

-To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new backend based on PyTorch.
+To enhance usability and developer efficiency, TensorRT-LLM introduces a new backend based on PyTorch.

docs/source/advanced/speculative-decoding.md (1)
171-171: Minor grammar & spacing tidy-up

Remove the redundant “of”, add the missing space, and swap the en-dash for a hyphen to stay consistent.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model so that logits prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported).

examples/models/core/llama/README.md (2)
679-679: Capitalise sentence start & tighten wording

-Note: use FP8 GEMV to optimize performance in FP8 small-batch-size cases.
+Note: Use FP8 GEMV to optimise performance in small-batch-size FP8 scenarios.

697-697: Polish long explanatory note for readability

A few micro-fixes improve clarity:

-**Note**: FP8 gemv plugin uses CUDA cores to compute, by contrast to Tensor Core gemm kernel within cuBLAS. Over last year, as cuBLAS have improved their performance by a lot under small M case for Hopper(sm90), FP8 gemv kernel may or may not surpass cuBLAS, depending on specific gemm problem shape. Nonetheless, we still strongly recommend FP8 gemv kernel for Ada (sm89) as cuBLAS still falls behind gemv on it.
+**Note**: The FP8 GEMV plugin runs on CUDA cores, whereas cuBLAS uses Tensor-Core GEMM kernels. Over the last year cuBLAS performance for small-M cases on Hopper (SM90) has improved substantially, so FP8 GEMV may or may not outperform cuBLAS depending on the exact GEMM shape. We still strongly recommend FP8 GEMV on Ada (SM89), where cuBLAS continues to lag behind.
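For intuition on the small-M case this note describes, here is a minimal sketch (plain NumPy, not TensorRT-LLM code; the LLaMA-7B-like dimensions are assumptions) showing why small-batch decode turns the projection GEMM into GEMV-shaped work:

```python
# Illustrative only: at batch size 1, a layer projection is effectively
# a matrix-vector product (GEMV) rather than a large GEMM.
import numpy as np

hidden, inter = 4096, 11008                          # assumed LLaMA-7B-like dims
W = np.random.randn(hidden, inter).astype(np.float32)

x = np.random.randn(1, hidden).astype(np.float32)    # M = 1 during decode
y = x @ W                                            # (1, hidden) @ (hidden, inter)
print(y.shape)                                       # (1, 11008): GEMV-shaped work
```

This is the regime the note is about: with M this small, a CUDA-core GEMV kernel can compete with or beat a Tensor-Core GEMM path, depending on the exact problem shape.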
docs/source/architecture/model-weights-loader.md (1)

252-252: Clarify “by default” clause

The current wording is slightly ambiguous about when the loader is active.

-The weights loader is enabled for LLaMA family models and Qwen models by default with TensorRT flow only.
+By default, the weights loader is enabled for LLaMA-family and Qwen models when using the TensorRT flow.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/source/advanced/gpt-attention.md (0 hunks)
- docs/source/advanced/speculative-decoding.md (1 hunks)
- docs/source/architecture/model-weights-loader.md (1 hunks)
- docs/source/performance/perf-benchmarking.md (0 hunks)
- docs/source/reference/precision.md (1 hunks)
- docs/source/torch.md (1 hunks)
- examples/eagle/README.md (0 hunks)
- examples/models/core/llama/README.md (2 hunks)
💤 Files with no reviewable changes (3)
- examples/eagle/README.md
- docs/source/advanced/gpt-attention.md
- docs/source/performance/perf-benchmarking.md
🧰 Additional context used
🧠 Learnings (2)
docs/source/architecture/model-weights-loader.md (2)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
docs/source/advanced/speculative-decoding.md (1)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md
[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...
(OUTSIDE_OF)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Force-pushed ea7d44c to daa23b3 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)
171-171: Tighten wording and fix minor grammar issues

The sentence contains a few stylistic hiccups:

• “inside of” → “inside”
• missing space before the parenthesis after “engine”
• plural-singular mismatch in “draft tokens acceptance”
• superfluous comma after “Please”

Proposed tweak:

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- docs/source/advanced/gpt-attention.md (0 hunks)
- docs/source/advanced/speculative-decoding.md (1 hunks)
- docs/source/architecture/model-weights-loader.md (1 hunks)
- docs/source/performance/perf-benchmarking.md (0 hunks)
- docs/source/reference/precision.md (1 hunks)
- docs/source/torch.md (1 hunks)
- examples/eagle/README.md (0 hunks)
- examples/models/core/llama/README.md (2 hunks)
💤 Files with no reviewable changes (3)
- docs/source/advanced/gpt-attention.md
- docs/source/performance/perf-benchmarking.md
- examples/eagle/README.md
✅ Files skipped from review due to trivial changes (3)
- docs/source/architecture/model-weights-loader.md
- docs/source/reference/precision.md
- docs/source/torch.md
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.598Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md
[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...
(OUTSIDE_OF)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Force-pushed daa23b3 to 7e67143 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (2)
docs/source/advanced/speculative-decoding.md (2)
171-171: Tighten wording and fix minor formatting issues

The phrase “inside of the TensorRT engine” is redundant, there’s a missing space before the parenthesis, and “draft tokens acceptance” should be singular.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). For details on building and running the model, see the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md).

169-170: Correct typo and clarify sentences

“asssembled” has an extra “s”, and both sentences read more naturally with slight re-phrasing.

-In the EAGLE-1 decoding tree needs to be known during the decoding. In the EAGLE-2 this tree is asssembled during the execution by searching for the most probable hypothesis along the beam.
+In EAGLE-1, the decoding tree must be defined before decoding begins. In EAGLE-2, this tree is assembled during execution by searching for the most probable hypothesis along the beam.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- README.md (1 hunks)
- docs/source/advanced/disaggregated-service.md (1 hunks)
- docs/source/advanced/gpt-attention.md (0 hunks)
- docs/source/advanced/speculative-decoding.md (1 hunks)
- docs/source/architecture/model-weights-loader.md (1 hunks)
- docs/source/performance/perf-benchmarking.md (0 hunks)
- docs/source/reference/precision.md (1 hunks)
- docs/source/torch.md (1 hunks)
- examples/auto_deploy/README.md (2 hunks)
- examples/disaggregated/README.md (1 hunks)
- examples/eagle/README.md (0 hunks)
- examples/models/core/deepseek_v3/README.md (2 hunks)
- examples/models/core/llama/README.md (2 hunks)
- examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
- examples/eagle/README.md
- docs/source/advanced/gpt-attention.md
- docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (9)
- docs/source/advanced/disaggregated-service.md
- examples/disaggregated/README.md
- examples/auto_deploy/README.md
- docs/source/reference/precision.md
- README.md
- examples/models/core/deepseek_v3/README.md
- examples/sample_weight_stripping/README.md
- docs/source/architecture/model-weights-loader.md
- docs/source/torch.md
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.598Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md
[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...
(OUTSIDE_OF)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
Force-pushed b569864 to 372fd83 (Compare)
/bot run --stage-list "A10-Build_Docs"
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)
171-171: Polish grammar & spacing for clarity

Minor wording and punctuation tweaks improve readability:

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logit prediction, draft-token acceptance, and draft-token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

Key fixes:

• “inside the” instead of “inside of the”.
• Add space before the parenthesis.
• Oxford comma after “acceptance”.
• Singular “logit” and hyphenate compound nouns.
• Remove superfluous comma after “Please”.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- README.md (1 hunks)
- docs/source/advanced/disaggregated-service.md (1 hunks)
- docs/source/advanced/gpt-attention.md (0 hunks)
- docs/source/advanced/speculative-decoding.md (1 hunks)
- docs/source/architecture/model-weights-loader.md (1 hunks)
- docs/source/performance/perf-benchmarking.md (0 hunks)
- docs/source/reference/precision.md (1 hunks)
- docs/source/torch.md (1 hunks)
- examples/auto_deploy/README.md (2 hunks)
- examples/disaggregated/README.md (1 hunks)
- examples/eagle/README.md (0 hunks)
- examples/models/core/deepseek_v3/README.md (2 hunks)
- examples/models/core/llama/README.md (2 hunks)
- examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
- docs/source/advanced/gpt-attention.md
- examples/eagle/README.md
- docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (8)
- README.md
- docs/source/advanced/disaggregated-service.md
- examples/sample_weight_stripping/README.md
- docs/source/architecture/model-weights-loader.md
- examples/disaggregated/README.md
- docs/source/reference/precision.md
- examples/auto_deploy/README.md
- examples/models/core/deepseek_v3/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/models/core/llama/README.md
- docs/source/torch.md
🧰 Additional context used
🧠 Learnings (1)
docs/source/advanced/speculative-decoding.md (2)
Learnt from: yechank-nvidia
PR: #6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using from_shared_tensor() is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call strip_for_generation() to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Learnt from: moraxu
PR: #6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md
[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...
(OUTSIDE_OF)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #13281 [ run ] triggered by Bot

PR_Github #13281 [ run ] completed with state
Force-pushed 372fd83 to f0fe05c (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (2)
docs/source/advanced/speculative-decoding.md (1)
171-171: Drop “of” after “inside” and fix missing space before parenthesis

Small wording/formatting tweaks improve readability.

-... performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported).
+... performed inside the TensorRT engine (both EAGLE-1 and EAGLE-2 are supported).

examples/disaggregated/README.md (1)

112-116: Fix typo in YAML key refresh_interval

refersh_interval is misspelled. Anyone copying this sample will hit a configuration error.

-  refersh_interval: 10.0
+  refresh_interval: 10.0
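One way to catch this class of typo early is to validate keys when loading the sample (a minimal sketch, assuming PyYAML is installed; the file name and every key other than refresh_interval are hypothetical placeholders, not the real schema):

```python
# check_config.py — illustrative only, not part of the repo.
import yaml

KNOWN_KEYS = {"refresh_interval"}  # extend with the real config schema

with open("disagg_config.yaml") as f:
    cfg = yaml.safe_load(f) or {}

# Flag any top-level key that the schema does not recognise.
unknown = set(cfg) - KNOWN_KEYS
if unknown:
    raise SystemExit(f"Unknown config keys (possible typos): {sorted(unknown)}")
```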
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- README.md (1 hunks)
- docs/source/advanced/disaggregated-service.md (1 hunks)
- docs/source/advanced/gpt-attention.md (0 hunks)
- docs/source/advanced/speculative-decoding.md (1 hunks)
- docs/source/architecture/model-weights-loader.md (1 hunks)
- docs/source/performance/perf-benchmarking.md (0 hunks)
- docs/source/reference/precision.md (1 hunks)
- docs/source/torch.md (1 hunks)
- examples/auto_deploy/README.md (2 hunks)
- examples/disaggregated/README.md (1 hunks)
- examples/eagle/README.md (0 hunks)
- examples/models/core/deepseek_v3/README.md (2 hunks)
- examples/models/core/llama/README.md (2 hunks)
- examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
- docs/source/advanced/gpt-attention.md
- examples/eagle/README.md
- docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
- docs/source/advanced/disaggregated-service.md
- examples/auto_deploy/README.md
- examples/models/core/deepseek_v3/README.md
- README.md
- docs/source/reference/precision.md
- docs/source/torch.md
- docs/source/architecture/model-weights-loader.md
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/sample_weight_stripping/README.md
- examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
docs/source/advanced/speculative-decoding.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md
[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...
(OUTSIDE_OF)
🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md
86-86: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
86-86: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/disaggregated/README.md (1)
86-86: Status label update looks good
The heading change from “Experimental” to “Prototype” accurately reflects the new maturity stage and keeps terminology consistent across the docs.
Force-pushed f0fe05c to 3aa3446 (Compare)
/bot skip --comment "docs only change"
Actionable comments posted: 0
🧹 Nitpick comments (3)
docs/source/advanced/speculative-decoding.md (1)
171-171: Tighten grammar & spacing for clarity

Minor wording polish:

• “inside of” → “inside” (redundant “of”).
• Insert Oxford comma after “acceptance”.
• Add space before the opening parenthesis.
• Drop comma after “Please”.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

examples/disaggregated/README.md (2)

110-116: Fix typo in key name – refresh_interval

refersh_interval will confuse users who copy-paste the YAML and may break config loaders that validate keys.

-  refersh_interval: 10.0
+  refresh_interval: 10.0

181-183: Correct section title – “Known Issues”

Minor wording nit:

-## Know Issues
+## Known Issues

This keeps terminology consistent across the docs.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- README.md (1 hunks)
- docs/source/advanced/disaggregated-service.md (1 hunks)
- docs/source/advanced/gpt-attention.md (0 hunks)
- docs/source/advanced/speculative-decoding.md (1 hunks)
- docs/source/architecture/model-weights-loader.md (1 hunks)
- docs/source/performance/perf-benchmarking.md (0 hunks)
- docs/source/reference/precision.md (1 hunks)
- docs/source/torch.md (1 hunks)
- examples/auto_deploy/README.md (2 hunks)
- examples/disaggregated/README.md (1 hunks)
- examples/eagle/README.md (0 hunks)
- examples/models/core/deepseek_v3/README.md (2 hunks)
- examples/models/core/llama/README.md (2 hunks)
- examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
- docs/source/advanced/gpt-attention.md
- examples/eagle/README.md
- docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
- docs/source/architecture/model-weights-loader.md
- README.md
- examples/models/core/deepseek_v3/README.md
- docs/source/torch.md
- docs/source/advanced/disaggregated-service.md
- examples/auto_deploy/README.md
- docs/source/reference/precision.md
🚧 Files skipped from review as they are similar to previous changes (2)
- examples/sample_weight_stripping/README.md
- examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
docs/source/advanced/speculative-decoding.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md
[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...
(OUTSIDE_OF)
🪛 markdownlint-cli2 (0.17.2)
examples/disaggregated/README.md
86-86: Unordered list style
Expected: dash; Actual: asterisk
(MD004, ul-style)
86-86: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
🔇 Additional comments (1)
examples/disaggregated/README.md (1)
86-86: Verify downstream links & anchor IDs after heading rename

Changing the heading from “Dynamic scaling (Experimental)” to “Dynamic scaling (Prototype)” also changes the automatic Markdown anchor (GitHub → #dynamic-scaling-prototype).
Please grep the repo (docs, blogs, notebooks) for #dynamic-scaling-experimental and update any in-page links or TOC generators that still point to the old anchor.
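A minimal sketch of that sweep (plain Python; the script name is hypothetical, and it assumes the stale links live in Markdown files under the repo root):

```python
# find_stale_anchors.py — illustrative helper, not part of the repo.
from pathlib import Path

OLD_ANCHOR = "#dynamic-scaling-experimental"  # anchor generated by the old heading

# Report every Markdown line that still links to the old anchor.
for path in Path(".").rglob("*.md"):
    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
        if OLD_ANCHOR in line:
            print(f"{path}:{lineno}: {line.strip()}")
```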
PR_Github #14318 [ skip ] triggered by Bot

PR_Github #14318 [ skip ] completed with state
Signed-off-by: nv-guomingz <[email protected]>
Force-pushed 3aa3446 to 605a8c6 (Compare)
/bot skip --comment "docs only change"
Actionable comments posted: 0
🧹 Nitpick comments (1)
docs/source/advanced/speculative-decoding.md (1)
171-171: Remove redundant “of” and clean up punctuation

Minor wording and punctuation polish for clarity and consistency.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (both EAGLE-1 and EAGLE-2 are supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- README.md (1 hunks)
- docs/source/advanced/disaggregated-service.md (1 hunks)
- docs/source/advanced/gpt-attention.md (0 hunks)
- docs/source/advanced/speculative-decoding.md (1 hunks)
- docs/source/architecture/model-weights-loader.md (1 hunks)
- docs/source/performance/perf-benchmarking.md (0 hunks)
- docs/source/reference/precision.md (1 hunks)
- docs/source/torch.md (1 hunks)
- examples/auto_deploy/README.md (2 hunks)
- examples/disaggregated/README.md (1 hunks)
- examples/eagle/README.md (0 hunks)
- examples/models/core/deepseek_v3/README.md (2 hunks)
- examples/models/core/llama/README.md (2 hunks)
- examples/sample_weight_stripping/README.md (2 hunks)
💤 Files with no reviewable changes (3)
- docs/source/advanced/gpt-attention.md
- examples/eagle/README.md
- docs/source/performance/perf-benchmarking.md
✅ Files skipped from review due to trivial changes (7)
- examples/disaggregated/README.md
- docs/source/advanced/disaggregated-service.md
- examples/models/core/deepseek_v3/README.md
- README.md
- docs/source/architecture/model-weights-loader.md
- examples/auto_deploy/README.md
- docs/source/reference/precision.md
🚧 Files skipped from review as they are similar to previous changes (3)
- examples/sample_weight_stripping/README.md
- docs/source/torch.md
- examples/models/core/llama/README.md
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()...
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
docs/source/advanced/speculative-decoding.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
docs/source/advanced/speculative-decoding.md
🪛 LanguageTool
docs/source/advanced/speculative-decoding.md
[style] ~171-~171: This phrase is redundant. Consider using “inside”.
Context: ...nd draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2...
(OUTSIDE_OF)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #14351 [ skip ] triggered by Bot

PR_Github #14351 [ skip ] completed with state
…A#5995) Signed-off-by: nv-guomingz <[email protected]>
Clean up the docs by removing the experimental label.

Summary by CodeRabbit