
Commit cc18db1

doc: remove outdated features marked as Experimental

Signed-off-by: nv-guomingz <[email protected]>
1 parent: c7ffadf

File tree: 7 files changed, +5 −21 lines changed


docs/source/advanced/gpt-attention.md
Lines changed: 0 additions & 2 deletions

@@ -112,8 +112,6 @@ printed.
 #### XQA Optimization
 
 Another optimization for MQA/GQA in the generation phase is called XQA optimization.
-It is still experimental feature and support limited configurations. LLAMA2 70B
-is one model that it supports.
 
 Support matrix of the XQA optimization:
 - FP16 / BF16 compute data type.
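
For reference, a sketch of toggling XQA at build time. The `--enable_xqa` flag is an assumption here (it has existed as a `trtllm-build` option in some releases and is on by default in others), and the checkpoint path is hypothetical:

```bash
# Hypothetical: build with XQA explicitly enabled (flag availability is version-dependent)
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --gpt_attention_plugin float16 \
    --enable_xqa enable
```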

docs/source/advanced/speculative-decoding.md
Lines changed: 1 addition & 1 deletion

@@ -167,7 +167,7 @@ TensorRT-LLM implements the ReDrafter model such that logits prediction, beam se
 
 The EAGLE approach enhances the single-model Medusa method by predicting and verifying tokens using the same model. Similarly to ReDrafter, it predicts draft tokens using a recurrent predictor where each draft token depends on the previous one. However, unlike ReDrafter, it uses a single-layer transformer model to predict draft tokens from previous hidden states and decoded tokens. In EAGLE-1, the decoding tree needs to be known during decoding. In EAGLE-2, this tree is assembled during execution by searching for the most probable hypotheses along the beam.
 
-Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft tokens acceptance and draft token generation are performed inside of the TensorRT engine. EAGLE-1 and EAGLE-2 are both supported, while EAGLE-2 is currently in the experimental stage. Please, visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
+Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits prediction, draft token acceptance, and draft token generation are performed inside the TensorRT engine (EAGLE-1 and EAGLE-2 are both supported). Please visit the [EAGLE README](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/eagle/README.md) for information about building and running the model.
 
 ## Lookahead Decoding
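
For context, building an EAGLE engine is sketched below; `--speculative_decoding_mode eagle` and `--max_draft_len` follow the EAGLE example flow, while the checkpoint and output paths are hypothetical:

```bash
# Hypothetical paths; EAGLE-1 and EAGLE-2 share the same engine
trtllm-build --checkpoint_dir ./tllm_checkpoint \
    --output_dir ./eagle_engine \
    --gemm_plugin float16 \
    --speculative_decoding_mode eagle \
    --max_draft_len 63
```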

docs/source/architecture/model-weights-loader.md
Lines changed: 1 addition & 1 deletion

@@ -249,7 +249,7 @@ for tllm_key, param in tqdm(trtllm_model.named_parameters()):
 In this mode, every precision requires the user's own support.
 
 ## Troubleshooting
-The weights loader is an experimental feature for now, and is enabled for LLaMA family models and Qwen models by default.
+The weights loader is enabled for LLaMA family models and Qwen models by default, with the TensorRT flow only.
 
 If users encounter a failure caused by `ModelWeightsLoader`, a workaround is to pass the environment variable `TRTLLM_DISABLE_UNIFIED_CONVERTER=1` to disable the model weights loader and fall back to the legacy path.
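
As an illustration of the workaround: only the environment variable below comes from the text; the conversion script name and paths are hypothetical:

```bash
# Disable the unified weights loader and fall back to the legacy conversion path
export TRTLLM_DISABLE_UNIFIED_CONVERTER=1
python convert_checkpoint.py --model_dir ./Llama-2-7b-hf --output_dir ./tllm_checkpoint
```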

docs/source/performance/perf-benchmarking.md
Lines changed: 0 additions & 9 deletions

@@ -236,15 +236,6 @@ The following command builds an FP8 quantized engine by specifying the engine tu
 trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --max_batch_size 1024 --max_num_tokens 2048
 ```
 
-- [Experimental] Build engine with target ISL/OSL for optimization:
-In this experimental mode, you can provide hints to `trtllm-bench`'s tuning heuristic to optimize the engine on specific ISL and OSL targets.
-Generally, the target ISL and OSL aligns with the average ISL and OSL of the dataset, but you can experiment with different values to optimize the engine using this mode.
-The following command builds an FP8 quantized engine and optimizes for ISL:OSL targets of 128:128.
-
-```shell
-trtllm-bench --model meta-llama/Llama-3.1-8B build --quantization FP8 --max_seq_len 4096 --target_isl 128 --target_osl 128
-```
-
 
 #### Parallelism Mapping Support
 The `trtllm-bench build` subcommand supports combinations of tensor-parallel (TP) and pipeline-parallel (PP) mappings as long as the world size (`tp_size x pp_size`) is `<= 8`. The parallelism mapping in the build subcommand is controlled by the `--tp_size` and `--pp_size` options. The following command builds an engine with a TP2-PP2 mapping.
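
The TP2-PP2 command itself falls outside this hunk; a sketch consistent with the options named above (world size 2 x 2 = 4 <= 8) might be:

```shell
trtllm-bench --model meta-llama/Llama-3.1-8B build --tp_size 2 --pp_size 2
```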

docs/source/torch.md
Lines changed: 1 addition & 5 deletions

@@ -1,11 +1,7 @@
 # PyTorch Backend
 
-```{note}
-Note:
-This feature is currently experimental, and the related API is subjected to change in future versions.
-```
 
-To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new experimental backend based on PyTorch.
+To enhance the usability of the system and improve developer efficiency, TensorRT-LLM launches a new backend based on PyTorch.
 
 The PyTorch backend of TensorRT-LLM is available in version 0.17 and later. You can try it via importing `tensorrt_llm._torch`.
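
A minimal sketch of trying the backend, assuming the high-level `LLM` entry point under `tensorrt_llm._torch` (the model id and `generate` call are illustrative, not confirmed by this diff):

```python
from tensorrt_llm._torch import LLM  # assumed entry point of the PyTorch backend

llm = LLM(model="meta-llama/Llama-3.1-8B")  # hypothetical model id
print(llm.generate("Hello, my name is"))
```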

examples/eagle/README.md
Lines changed: 0 additions & 1 deletion

@@ -98,7 +98,6 @@ To run non-greedy sampling and use typical acceptance, set `--eagle_posterior_th
 `--temperature` can be specified as well. When no `--eagle_posterior_threshold` is specified or `--temperature=0.0` is set, greedy sampling is used.
 
 #### Run EAGLE-2
-**EAGLE-2 is still under the experimental stage.**
 
 EAGLE-2 can be enabled with 2 runtime flags (`--eagle_use_dynamic_tree` and `--eagle_dynamic_tree_max_top_k=N`). The same engine can be used for EAGLE-1 and EAGLE-2. Eagle choices must not be set in case of EAGLE-2. EAGLE-2 will generate the tree corresponding to the choices dynamically at runtime. For more details, please refer to the [EAGLE-2 paper](https://arxiv.org/pdf/2406.16858).
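
For illustration, a run enabling EAGLE-2 might look like the following; the two EAGLE-2 flags come from the text above, while the script path, engine directory, and tokenizer directory are assumptions:

```bash
python ../run.py --engine_dir ./eagle_engine \
    --tokenizer_dir ./vicuna-7b-v1.3 \
    --max_output_len 100 \
    --eagle_use_dynamic_tree \
    --eagle_dynamic_tree_max_top_k=10 \
    --input_text "Once upon a time"
```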

examples/models/core/llama/README.md
Lines changed: 2 additions & 2 deletions

@@ -672,7 +672,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
 The peak GPU memory consumption when doing FP8 quantization is more than 210GB (there is also some activation memory occupation when doing calibration).
 So you need a node with at least 4 H100 (A100) GPUs to run the quantization command. After quantization, 2 GPUs are enough for building and running.
 
-Experimental: use FP8 GEMV to optimize performance in FP8 small-batch-size cases.
+Note: use FP8 GEMV to optimize performance in FP8 small-batch-size cases.
 
 ```bash
 # Quantize HF LLaMA 7B into FP8 and export trtllm checkpoint

@@ -690,7 +690,7 @@ trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp8 \
     --gemm_plugin fp8
 ```
 
-**Note**: FP8 gemm plugin is an experimental feature aimed to improve performance in small-batch-size cases(e.g. BS<=4). Although inputs with batch size larger than 4 can be correctly inferenced, the performance may decrease as batch size grows.
+**Note**: The FP8 GEMV plugin uses CUDA cores, in contrast to the Tensor Core GEMM kernels in cuBLAS. Over the last year, cuBLAS has improved its small-M performance substantially on Hopper (sm90), so the FP8 GEMV kernel may or may not surpass cuBLAS, depending on the specific GEMM problem shape. Nonetheless, we still strongly recommend the FP8 GEMV kernel on Ada (sm89), where cuBLAS still lags behind GEMV.
 
 ### Groupwise quantization (AWQ/GPTQ)
 One can enable AWQ/GPTQ INT4 weight-only quantization with these options when building the engine with `trtllm-build`:
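
The options themselves fall outside this hunk; a sketch of a typical INT4-AWQ flow under assumed paths (the quantization script location and directories are hypothetical):

```bash
# Quantize HF LLaMA 7B into INT4-AWQ and export a TensorRT-LLM checkpoint
python ../../../quantization/quantize.py --model_dir ./llama-7b-hf \
    --dtype float16 --qformat int4_awq --output_dir ./tllm_checkpoint_1gpu_awq
# Build an engine from the quantized checkpoint
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_awq --output_dir ./engine_1gpu_awq
```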
