 
All published functionality in the Release Notes has been fully tested and verified, and known limitations are documented. To share feedback about this release, visit our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).
 
## TensorRT-LLM Release 0.21.0

### Key Features and Enhancements
- **Model Support**
  - Added Gemma3 VLM support
- **Features**
  - Added large-scale EP support
  - Integrated NIXL into the communication layer of the disaggregated service
  - Added fabric memory support for KV cache transfer
  - Added MCP support in ScaffoldingLLM
  - Added support for w4a8_mxfp4_fp8 quantization
  - Added support for FP8 rowwise quantization
  - Added generation logits support in the TRTLLM Sampler
  - Added log probs support in the TRTLLM Sampler (see the example after this list)
  - Optimized TRTLLM Sampler performance for single-beam, single-step decoding
  - Enabled disaggregated serving for Qwen-3
  - Added EAGLE3 support for Qwen-3
  - Fused finalize and allreduce for the Qwen-MoE model
  - Refactored the fused MoE module
  - Added support for chunked attention on Blackwell and Hopper
  - Introduced sliding-window attention kernels for the generation phase on Blackwell
  - Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance at large batch sizes
  - Added FP8 block-scale GEMM support on SM89
  - Enabled the overlap scheduler between draft forwards
  - Added piecewise CUDA graph support for MLA
  - Added model-agnostic one-engine EAGLE3 support
  - Enabled finalize + allreduce + add + RMSNorm fusion
  - Integrated TRT-LLM Gen FP8 block-scale MoE with the PyTorch workflow kernel autotuner
  - Added support for EAGLE3 + disaggregated serving in the two-model speculative decoding flow
  - Validated Llama 3.1 models on H200 NVL
- **Benchmark**
  - Added an all_reduce.py benchmark script for testing
  - Added beam width to the trtllm-bench latency command
  - Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
  - Enabled trtllm-bench to run LoRA and added basic end-to-end LoRA perf testing capability
  - Added post_proc support for benchmarking
  - Added a no_kv_cache_reuse option and streaming support for trtllm serve bench

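As a minimal illustration of the sampler features listed above (generation logits and log probs), the sketch below requests both through the LLM API. It is a sketch rather than the confirmed interface: the model name is a placeholder, and the `logprobs` and `return_generation_logits` fields of `SamplingParams` are assumptions that may differ in your installed version.

```python
# Minimal sketch (assumed field names): request log probs and generation logits
# from the LLM API so the sampler returns them with each completion.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder checkpoint

params = SamplingParams(
    max_tokens=32,
    logprobs=2,                     # assumed: top-2 log probs per generated token
    return_generation_logits=True,  # assumed: also return per-step logits
)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)           # token-level log probabilities
    print(completion.generation_logits)  # logits for the generated tokens
```

The same options apply when streaming; consult the LLM API reference for the exact fields supported in this release.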
### Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.05-py3`.
- The base Docker image for TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.05-py3`.
- The dependent public PyTorch version is updated to 2.7.1.
- The dependent TensorRT version is updated to 10.11.
- The dependent NVIDIA ModelOpt version is updated to 0.31.
- The dependent NCCL version is updated to 2.27.5.

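The dependency bumps above can be verified quickly in a local installation; the snippet below is a minimal check, with the expected values taken from this list.

```python
# Minimal environment check against the dependency versions listed above.
import torch
import tensorrt
import tensorrt_llm

print("TensorRT-LLM:", tensorrt_llm.__version__)  # expected 0.21.0
print("PyTorch:", torch.__version__)              # expected 2.7.1
print("TensorRT:", tensorrt.__version__)          # expected 10.11.x
print("NCCL:", torch.cuda.nccl.version())         # expected (2, 27, 5)
```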
### API Changes
- Set _AutoDeployLlmArgs as the primary config object
- Removed the decoder request from the decoder interface
- Enhanced the torch_compile_config in LLM args
- Removed the redundant use_kv_cache field from PytorchConfig
- Moved allreduce_strategy from the committed API to reference

### Fixed Issues
- Fixed a disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
- Fixed the EP load balancer with the MTP layer and route offset by EP rank (#4767)
- Fixed CUDA graph padding for speculative decoding (#4853)
- Fixed a Llama 4 long-context issue (#4809)
- Fixed the max_num_sequences calculation with overlap scheduling (#4532)
- Fixed chunked prefill + overlap scheduling (#5761)
- Fixed a trtllm-bench hang caused by LLM API IPC (#4798)
- Fixed an index-out-of-bounds error in speculative decoding (#5954)
- Fixed an MTP illegal memory access during CUDA graph warmup (#5947)
- Fixed a "no free slots" error with speculative decoding + disaggregated serving (#5975)
- Fixed an off-by-one attention window size for Gemma3 1B (#5564)

### Known Issues
- `accuracy/test_cli_flow::TestGpt2::test_beam_search_large` is broken.
- Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.

## TensorRT-LLM Release 0.20.0
 
### Key Features and Enhancements
 | 