
Commit a2718a0

QiJune, schetlur-nv, and chzblych authored and committed
add release notes for 0.21 release (NVIDIA#6049)
Signed-off-by: junq <[email protected]>
Signed-off-by: Sharan Chetlur <[email protected]>
Signed-off-by: QI JUN <[email protected]>
Co-authored-by: Sharan Chetlur <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Ransiki Zhang <[email protected]>
1 parent d719f76 commit a2718a0

File tree: 1 file changed, +70 −0 lines

docs/source/release-notes.md

Lines changed: 70 additions & 0 deletions
@@ -4,6 +4,76 @@

All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 0.21.0

### Key Features and Enhancements
- **Model Support**
  - Added Gemma3 VLM support
- **Features**
  - Added large-scale EP (expert parallelism) support
  - Integrated NIXL into the communication layer of the disaggregated service
  - Added fabric memory support for KV cache transfer
  - Added MCP support in ScaffoldingLLM
  - Added support for `w4a8_mxfp4_fp8` quantization
  - Added support for FP8 row-wise quantization
  - Added generation logits support in the TRTLLM sampler
  - Added log probs support in the TRTLLM sampler (see the sketch after this list)
  - Optimized TRTLLM sampler performance for the single-beam, single-step case
  - Enabled disaggregated serving for Qwen-3
  - Added EAGLE3 support for Qwen-3
  - Fused finalize and allreduce for the Qwen-MoE model
  - Refactored the fused MoE module
  - Added support for chunked attention on Blackwell and Hopper
  - Introduced sliding-window attention kernels for the generation phase on Blackwell
  - Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance at large batch sizes
  - Added FP8 block-scale GEMM support on SM89
  - Enabled the overlap scheduler between draft forward passes
  - Added piecewise CUDA graph support for MLA
  - Added model-agnostic one-engine EAGLE3
  - Enabled Finalize + Allreduce + add + rmsnorm fusion
  - Integrated TRT-LLM Gen FP8 block-scale MoE with the PyTorch workflow kernel autotuner
  - Added support for EAGLE3 + disaggregated serving in the two-model speculative decoding flow
  - Validated Llama 3.1 models on H200 NVL
- **Benchmark**
  - Added an `all_reduce.py` benchmark script for testing
  - Added a beam-width option to the `trtllm-bench` latency command
  - Fixed `trtllm-bench` `iter_stats` and `cuda_graph_batch_sizes` errors
  - Enabled `trtllm-bench` to run LoRA and added basic end-to-end perf testing capability for LoRA
  - Added `post_proc` support for bench
  - Added a `no_kv_cache_reuse` option and streaming support for `trtllm serve bench`
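
The sampler items above expose log probs and generation logits through the LLM API. As a minimal, hypothetical sketch of what such a request might look like (the `logprobs` and `return_generation_logits` field names, the `logprobs` output attribute, and the checkpoint path are assumptions, not taken from these notes):

```python
# Hypothetical sketch: requesting log probs and generation logits via the LLM API.
# Names marked "assumed" are illustrative only and may differ from the shipped API.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed checkpoint

params = SamplingParams(
    max_tokens=32,
    logprobs=1,                     # assumed: return the log prob of each sampled token
    return_generation_logits=True,  # assumed: also return raw generation logits
)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)      # assumed attribute holding per-token log probs
```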

### Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.05-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.05-py3`.
- The dependent public PyTorch version is updated to 2.7.1 (see the version check after this list).
- The dependent TensorRT version is updated to 10.11.
- The dependent NVIDIA ModelOpt version is updated to 0.31.
- The dependent NCCL version is updated to 2.27.5.
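
Since several dependency pins change in this release, a quick sanity check of a local environment against the PyTorch and TensorRT versions listed above (a minimal sketch; only the two Python-visible pins are checked):

```python
# Verify that the installed torch and tensorrt wheels match the pins listed above.
import torch
import tensorrt

assert torch.__version__.startswith("2.7.1"), f"unexpected torch {torch.__version__}"
assert tensorrt.__version__.startswith("10.11"), f"unexpected tensorrt {tensorrt.__version__}"
print("torch", torch.__version__, "| tensorrt", tensorrt.__version__)
```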

### API Changes
- Set `_AutoDeployLlmArgs` as the primary config object
- Removed the decoder request from the decoder interface
- Enhanced `torch_compile_config` in the LLM API arguments (see the sketch after this list)
- Removed the redundant `use_kv_cache` field from `PytorchConfig`
- Moved `allreduce_strategy` from the committed API to the reference API
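
As a sketch of where `torch_compile_config` sits in the LLM arguments (the keyword itself is named above; the nested option shown is an assumption, so consult the LLM API reference for the actual schema in this release):

```python
# Hypothetical sketch: passing torch_compile_config through the LLM constructor.
# The dict form and the "enable_inductor" key are assumptions for illustration only.
from tensorrt_llm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",        # assumed checkpoint
    torch_compile_config={"enable_inductor": True},  # assumed option name
)
```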

### Fixed Issues
- Fixed a disaggregated-service hang when MNNVL two-shot AllReduce is enabled (#4678)
- Fixed the EP load balancer with the MTP layer and route offset by EP rank (#4767)
- Fixed CUDA graph padding for speculative decoding (#4853)
- Fixed a Llama 4 long-context issue (#4809)
- Fixed the `max_num_sequences` calculation with overlap scheduling (#4532)
- Fixed chunked prefill combined with overlap scheduling (#5761)
- Fixed a `trtllm-bench` hang caused by LLM API IPC (#4798)
- Fixed an index-out-of-bounds error in speculative decoding (#5954)
- Fixed an MTP illegal memory access during CUDA graph warmup (#5947)
- Fixed a "no free slots" error with speculative decoding + disaggregated serving (#5975)
- Fixed an off-by-one attention window size for Gemma3 1B (#5564)

### Known Issues
- `accuracy/test_cli_flow::TestGpt2::test_beam_search_large` is broken.
- Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.

## TensorRT-LLM Release 0.20.0

### Key Features and Enhancements
