All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, visit our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 0.20.0

### Key Features and Enhancements
- **Model Support**
  - Added Qwen3 support. Refer to the "Qwen3" section in `examples/models/core/qwen/README.md`.
  - Added HyperCLOVAX-SEED-Vision support in the PyTorch flow. Refer to `examples/models/contrib/hyperclovax/README.md`.
  - Added Dynasor-CoT in the scaffolding examples. Refer to `examples/scaffolding/contrib/Dynasor/README.md`.
  - Added Mistral Small 3.1 24B VLM support in the TRT workflow
  - Added Gemma3-1b-it support in the PyTorch workflow
  - Added Nemotron-H model support
  - Added Eagle-3 support for LLAMA4
- **PyTorch workflow**
  - Added LoRA support
  - Added support for returning logits
  - Adopted the new logprob definition in the PyTorch flow
  - Enabled per-request stats with the PyTorch backend
  - Enabled `LogitsProcessor` in the PyTorch backend
- **Benchmark**
  - Added beam width support to the low-latency benchmark.
  - Fixed `trtllm-bench` `iter_stats` and `cuda_graph_batch_sizes` errors.
  - Removed the deprecated Python runtime benchmark
  - Added benchmark support for scaffolding
- **Multimodal models**
  - Added support in `trtllm-serve`
  - Added support in `trtllm-bench`; currently limited to image inputs only
- Supported DeepSeek-R1 W4A8 on Hopper
- Added RTX Pro 6000 support on a single GPU
- Integrated the Llama4 input processor
- Added CGA reduction FMHA kernels on Blackwell
- Enabled chunked context for FlashInfer
- Supported KV cache reuse for MLA
- Added Piecewise CUDA Graph support
- Supported multiple LoRA adapters and TP
- Added a KV cache-aware router for disaggregated serving
- Added unfused attention for native support
- Added a `group_rms_norm` kernel that normalizes multiple inputs in a single operator (a reference sketch follows this list)
- Added a smart router for the MoE module
- Added head size 72 support for the QKV preprocessing kernel
- Added MNNVL MoE A2A support
- Optimized large embedding tables in multimodal models
- Supported Top-K logprobs and `prompt_logprobs` in the LLM API (a usage sketch follows this list)
- Enabled the overlap scheduler in the TRT workflow via the executor API

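The `group_rms_norm` item above fuses RMS normalization of several tensors into a single kernel launch. The snippet below is only a semantic sketch in plain PyTorch of what such a grouped operator computes; it does not call the TensorRT-LLM kernel API, and the per-tensor weight and epsilon handling are assumptions for illustration.

```python
# Semantic sketch (plain PyTorch) of a grouped RMS norm: every input in the group is
# RMS-normalized independently, while a fused kernel would process all of them in a
# single launch. This is NOT the TensorRT-LLM kernel API; the weights and epsilon
# handling are illustrative assumptions.
import torch


def group_rms_norm_reference(inputs, weights, eps=1e-6):
    outputs = []
    for x, w in zip(inputs, weights):
        # RMS over the hidden dimension, then scale by the per-tensor weight.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        outputs.append(x * rms * w)
    return outputs


hidden = 4096
xs = [torch.randn(8, hidden), torch.randn(8, hidden)]
ws = [torch.ones(hidden), torch.ones(hidden)]
print([y.shape for y in group_rms_norm_reference(xs, ws)])
```
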
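The Top-K logprobs and `prompt_logprobs` item above is exposed through the LLM API. The usage sketch below assumes that `SamplingParams` accepts integer `logprobs` and `prompt_logprobs` fields and that each completion output carries a `logprobs` attribute; check the 0.20.0 LLM API reference for the exact field names and result layout.

```python
# Minimal sketch of requesting Top-K logprobs and prompt logprobs through the LLM API.
# Assumptions (verify against the 0.20.0 API reference): SamplingParams takes integer
# `logprobs` / `prompt_logprobs` fields, and completion outputs expose a `logprobs`
# attribute with per-token top-k entries. The model name is just an example.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

sampling_params = SamplingParams(
    max_tokens=32,
    logprobs=5,          # top-5 log-probabilities for each generated token
    prompt_logprobs=5,   # top-5 log-probabilities for each prompt token
)

for output in llm.generate(["The capital of France is"], sampling_params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token top-k log-probability entries (assumed layout)
```
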
### Infrastructure Changes
- **The TRT-LLM team now formally releases a Docker image on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)**.
- The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI
- The dependent TensorRT version is updated to 10.10.0
- The dependent CUDA version is updated to 12.9.0
- The dependent public PyTorch version is updated to 2.7.0
- The dependent NVIDIA ModelOpt version is updated to 0.29.0
- The dependent NCCL version is maintained at 2.25.1
- Open-sourced XQA kernels
- The dependent datasets version was upgraded to 3.1.0
- Migrated the Triton backend to the TensorRT-LLM repo as a TensorRT-LLM submodule
- Downgraded the GCC toolset version from 13 to 11

### API Changes
- [Breaking Change] Enabled scheduling overlap by default
- Removed the deprecated GptSession/V1 path from the TRT workflow
- Set `_AutoDeployLlmArgs` as the primary config object
- Allowed overriding CLI arguments with a YAML file in `trtllm-serve`
- Introduced a multimodal embedding field in `LlmRequest`

### Fixed Issues
- Fixed a hang when the context server does not have enough capacity for the KV cache (#3095)
- Fixed C++ decoder synchronization in PyTorch (#3106)
- Fixed a bug where a CUDA stream created as a default parameter was initialized at import time (#3764)
- Fixed an attention DP bug in the Qwen3 MoE model (#4141)
- Fixed an illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
- Reset planned states to avoid a memory leak in `TrtllmAttentionWrapper` (#4227)

### Known Issues
- Multi-GPU model support on RTX Pro 6000 is not yet available


## TensorRT-LLM Release 0.19.0
