
Commit 372fd83

doc: update the outdated features status.
Signed-off-by: nv-guomingz <[email protected]>
Parent: 64ba483

File tree: 14 files changed, +18 −32 lines


README.md

Lines changed: 1 addition & 1 deletion

@@ -243,5 +243,5 @@ Deprecation is used to inform developers that some APIs and tools are no longer
 ## Useful Links
 - [Quantized models on Hugging Face](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4): A growing collection of quantized (e.g., FP8, FP4) and optimized LLMs, including [DeepSeek FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), ready for fast inference with TensorRT-LLM.
 - [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo): A datacenter scale distributed inference serving framework that works seamlessly with TensorRT-LLM.
-- [AutoDeploy](./examples/auto_deploy/README.md): An experimental backend for TensorRT-LLM to simplify and accelerate the deployment of PyTorch models.
+- [AutoDeploy](./examples/auto_deploy/README.md): A prototype backend for TensorRT-LLM to simplify and accelerate the deployment of PyTorch models.
 - [WeChat Discussion Group](https://github.com/NVIDIA/TensorRT-LLM/issues/5359): A real-time channel for TensorRT-LLM Q&A and news.

docs/source/advanced/disaggregated-service.md

Lines changed: 2 additions & 2 deletions

@@ -1,10 +1,10 @@
 (disaggregated-service)=

-# Disaggregated-Service (Experimental)
+# Disaggregated-Service (Prototype)

 ```{note}
 Note:
-This feature is currently experimental, and the related API is subjected to change in future versions.
+This feature is currently in prototype, and the related API is subjected to change in future versions.
 ```
 Currently TRT-LLM supports `disaggregated-service`, where the context and generation phases of a request can run on different executors. TRT-LLM's disaggregated service relies on the executor API, please make sure to read the [executor page](executor.md) before reading the document.
docs/source/advanced/gpt-attention.md

Lines changed: 0 additions & 2 deletions

@@ -112,8 +112,6 @@ printed.
 #### XQA Optimization

 Another optimization for MQA/GQA in generation phase called XQA optimization.
-It is still experimental feature and support limited configurations. LLAMA2 70B
-is one model that it supports.

 Support matrix of the XQA optimization:
 - FP16 / BF16 compute data type.

docs/source/advanced/speculative-decoding.md

Lines changed: 1 addition & 1 deletion

@@ -168,7 +168,7 @@ TensorRT-LLM implements the ReDrafter model such that logits prediction, beam se

 The EAGLE approach enhances the single-model Medusa method by predicting and verifying tokens using the same model. Similarly to ReDrafter, it predicts draft tokens using a recurrent predictor where each draft token depends on the previous one. However, unlike ReDrafter, it uses a single-layer transformer model to predict draft tokens from previous hidden states and decoded tokens. In the EAGLE-1 decoding tree needs to be known during the decoding. In the EAGLE-2 this tree is asssembled during the execution by searching for the most probable hypothesis along the beam.

-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine. EAGLE-1 and EAGLE-2 are both supported, while EAGLE-2 is currently in the experimental stage. Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine(EAGLE-1 and EAGLE-2 are both supported). Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.

 ### Disaggregated Serving
docs/source/architecture/model-weights-loader.md

Lines changed: 1 addition & 1 deletion

@@ -249,7 +249,7 @@ for tllm_key, param in tqdm(trtllm_model.named_parameters()):
 In this mode, every precision require user's own support.

 ## Trouble shooting
-The weights loader is an experimental feature for now, and is enabled for LLaMA family models and Qwen models by default.
+The weights loader is enabled for LLaMA family models and Qwen models by default with TensorRT flow only.

 If users are encountered with failure caused by `ModelWeightsLoader`, a workaround is passing environmental variable `TRTLLM_DISABLE_UNIFIED_CONVERTER=1` to disable the model weights loader and fallback to the legacy path.
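The workaround above boils down to setting one environment variable before the conversion step. A minimal sketch, assuming the usual LLaMA conversion flow under `examples/models/core/llama` (the script name and flags here are assumptions for illustration, not part of this commit's diff):

```shell
# Sketch: disable the unified ModelWeightsLoader and fall back to the legacy
# per-model conversion path. Script path and flags are illustrative.
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python examples/models/core/llama/convert_checkpoint.py \
    --model_dir ./llama-model \
    --output_dir ./tllm-checkpoint \
    --dtype float16
```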

docs/source/performance/perf-benchmarking.md

Lines changed: 0 additions & 9 deletions

@@ -236,15 +236,6 @@ The following command builds an FP8 quantized engine by specifying the engine tu
 trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --max_batch_size 1024 --max_num_tokens 2048
 ```

-- [Experimental] Build engine with target ISL/OSL for optimization:
-In this experimental mode, you can provide hints to `trtllm-bench`'s tuning heuristic to optimize the engine on specific ISL and OSL targets.
-Generally, the target ISL and OSL aligns with the average ISL and OSL of the dataset, but you can experiment with different values to optimize the engine using this mode.
-The following command builds an FP8 quantized engine and optimizes for ISL:OSL targets of 128:128.
-
-```shell
-trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --target_isl 128 --target_osl 128
-```
-

 #### Parallelism Mapping Support
 The `trtllm-bench build` subcommand supports combinations of tensor-parallel (TP) and pipeline-parallel (PP) mappings as long as the world size (`tp_size x pp_size`) `<=` `8`. The parallelism mapping in build subcommad is controlled by `--tp_size` and `--pp_size` options. The following command builds an engine with TP2-PP2 mapping.
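The TP2-PP2 command referenced by that last context line falls outside the hunk shown here; as a rough sketch based only on the `trtllm-bench ... build` pattern and the `--tp_size`/`--pp_size` flags named above (world size 2 x 2 = 4, within the `<= 8` limit):

```shell
# Sketch: build with a TP2-PP2 mapping; flags follow the pattern shown above.
trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --tp_size 2 --pp_size 2
```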

docs/source/reference/precision.md

Lines changed: 1 addition & 2 deletions

@@ -103,8 +103,7 @@ Python function, for details.

 This release includes examples of applying GPTQ to [GPT-NeoX](source:examples/models/core/gpt)
 and [LLaMA-v2](source:examples/models/core/llama), as well as an example of using AWQ with
-[GPT-J](source:examples/models/contrib/gpt). Those examples are experimental implementations and
-are likely to evolve in a future release.
+[GPT-J](source:examples/models/contrib/gptj).

 ## FP8 (Hopper)
docs/source/torch.md

Lines changed: 2 additions & 3 deletions

@@ -2,10 +2,9 @@

 ```{note}
 Note:
-This feature is currently experimental, and the related API is subjected to change in future versions.
+This feature is currently in beta, and the related API is subjected to change in future versions.
 ```
-
-To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new experimental backend based on PyTorch.
+To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new backend based on PyTorch.

 The PyTorch backend of TensorRT-LLM is available in version 0.17 and later. You can try it via importing `tensorrt_llm._torch`.
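A quick way to try the import that doc names (requires TensorRT-LLM 0.17 or later; the one-liner below is just a sketch):

```shell
# Sketch: verify the PyTorch backend module is importable.
python3 -c "import tensorrt_llm._torch; print('PyTorch backend available')"
```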

examples/auto_deploy/README.md

Lines changed: 2 additions & 2 deletions

@@ -6,7 +6,7 @@

 <div align="left">

-AutoDeploy is an experimental feature in beta stage designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.
+AutoDeploy is a prototype feature in beta stage designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.

 ______________________________________________________________________


@@ -450,4 +450,4 @@ the current progress in AutoDeploy and where you can help.

 ## Disclaimer

-This project is in active development and is currently in an early (beta) stage. The code is experimental, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability. Use at your own risk.
+This project is in active development and is currently in an early (beta) stage. The code is in prototype, subject to change, and may include backward-incompatible updates. While we strive for correctness, we provide no guarantees regarding functionality, stability, or reliability. Use at your own risk.
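For a taste of the in-framework flow AutoDeploy describes, its example runner can be invoked roughly as below; the script name, flag, and model ID are assumptions here, so defer to `examples/auto_deploy/README.md` for the authoritative steps:

```shell
# Sketch only: script name, flag, and model ID are assumed, not taken from this diff.
cd examples/auto_deploy
python build_and_run_ad.py --model "meta-llama/Meta-Llama-3.1-8B-Instruct"
```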

examples/disaggregated/README.md

Lines changed: 1 addition & 1 deletion

@@ -83,7 +83,7 @@ Or using the provided client parsing the prompts from a file and sending request
 python3 ./clients/disagg_client.py -c disagg_config.yaml -p ./clients/prompts.json -e chat
 ```

-## Dynamic scaling (Experimental)
+## Dynamic scaling (Prototype)

 Currently, trtllm supports dynamic addition and removal of servers by leveraging ETCD. To enable this feature, you should start the context and generation servers with an additional flag ```--metadata_server_config_file``` and ```--server_role```.
 Before launching the context and generation servers, you should first start the ETCD server. By default, the ETCD server listens for client requests at ```localhost:2379```.
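Putting that startup order together: bring up ETCD first, then pass the two extra flags to each context and generation server. A sketch under the defaults stated above; the server launch command, config file name, and role values are placeholders, not taken from this diff:

```shell
# Sketch: start the ETCD server first; it listens on localhost:2379 by default.
etcd &

# Then launch each server with the two extra flags (launch command elided):
#   <server launch command> --metadata_server_config_file ./metadata_config.yaml --server_role <context|generation>
```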
