Commit cbc6455

doc: add release notes for v0.20.0 (#5150)

Signed-off-by: nv-guomingz <[email protected]>

File changed: docs/source/release-notes.md (+76 -0)

@@ -4,6 +4,82 @@

All published functionality in the Release Notes has been fully tested and verified with known limitations documented. To share feedback about this release, access our [NVIDIA Developer Forum](https://forums.developer.nvidia.com/).

## TensorRT-LLM Release 0.20.0

### Key Features and Enhancements

- **Model Support**
  - Added Qwen3 support. Refer to the “Qwen3” section in `examples/models/core/qwen/README.md`.
  - Added HyperCLOVAX-SEED-Vision support in the PyTorch flow. Refer to `examples/models/contrib/hyperclovax/README.md`.
  - Added Dynasor-CoT in the scaffolding examples. Refer to `examples/scaffolding/contrib/Dynasor/README.md`.
  - Added Mistral Small 3.1 24B VLM support in the TRT workflow.
  - Added Gemma3-1b-it support in the PyTorch workflow.
  - Added Nemotron-H model support.
  - Added Eagle-3 support for LLAMA4.
- **PyTorch workflow**
  - Added LoRA support.
  - Added support for returning logits.
  - Adopted the new logprob definition in the PyTorch flow.
  - Enabled per-request stats with the PyTorch backend.
  - Enabled `LogitsProcessor` in the PyTorch backend.
- **Benchmark**
  - Added beam width to the low-latency benchmark.
  - Fixed trtllm-bench `iter_stats` and `cuda_graph_batch_sizes` errors.
  - Removed the deprecated Python runtime benchmark.
  - Added benchmark support for scaffolding.
- **Multimodal models**
  - Added support in trtllm-serve.
  - Added support in trtllm-bench; support is currently limited to image inputs.
- Supported DeepSeek-R1 W4A8 on Hopper.
- Added RTX Pro 6000 support on a single GPU.
- Integrated the Llama4 input processor.
- Added CGA reduction FMHA kernels on Blackwell.
- Enabled chunked context for FlashInfer.
- Supported KV cache reuse for MLA.
- Added Piecewise CUDA Graph support.
- Supported multiple LoRA adapters and TP.
- Added a KV cache-aware router for disaggregated serving.
- Added unfused attention for native support.
- Added a `group_rms_norm` kernel to normalize multiple inputs in a single operator.
- Added a smart router for the MoE module.
- Added head size 72 support for the QKV preprocessing kernel.
- Added MNNVL MoE A2A support.
- Optimized large embedding tables in multimodal models.
- Supported top-K `logprobs` and `prompt_logprobs` in the LLM API (see the sketch after this list).
- Enabled the overlap scheduler in the TRT workflow via the executor API.
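
To illustrate the new logprobs options, here is a minimal sketch using the LLM API. The model name is a placeholder, and the `logprobs`/`prompt_logprobs` fields of `SamplingParams` are assumed to take top-K counts; verify both against the 0.20.0 API reference.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder checkpoint; any supported model works.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Request the top-5 logprobs for each generated token and logprobs for
# prompt tokens (both newly supported in this release).
params = SamplingParams(max_tokens=32, logprobs=5, prompt_logprobs=1)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token top-K logprobs
```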

### Infrastructure Changes

- **The TRT-LLM team formally releases a docker image on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).**
- The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI.
- The dependent TensorRT version is updated to 10.10.0.
- The dependent CUDA version is updated to 12.9.0.
- The dependent public PyTorch version is updated to 2.7.0.
- The dependent NVIDIA ModelOpt version is updated to 0.29.0.
- The dependent NCCL version is maintained at 2.25.1.
- Open-sourced XQA kernels.
- The dependent `datasets` version is upgraded to 3.1.0.
- Migrated the Triton backend into the TensorRT-LLM repository as a submodule.
- Downgraded the GCC toolset version from 13 to 11.

### API Changes

- [Breaking Change] Enabled scheduling overlap by default.
- Removed the deprecated GptSession/V1 path from the TRT workflow.
- Set `_AutoDeployLlmArgs` as the primary config object.
- Allowed overriding CLI arguments with a YAML file in trtllm-serve (see the sketch after this list).
- Introduced a multimodal embedding field in `LlmRequest`.
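
As a sketch of the YAML override mechanism, assuming the `--extra_llm_api_options` flag and key names that mirror LLM API arguments (both should be checked against the 0.20.0 trtllm-serve documentation):

```yaml
# extra_options.yaml -- key names assumed; they mirror LLM API arguments.
kv_cache_config:
  free_gpu_memory_fraction: 0.85  # cap KV cache memory usage
enable_chunked_prefill: true
```

The file would then be passed as `trtllm-serve <model> --extra_llm_api_options extra_options.yaml`, with values from the YAML taking precedence over the corresponding CLI defaults.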

### Fixed Issues

- Fixed a hang when the context server doesn't have enough capacity for the KV cache (#3095)
- Fixed C++ decoder synchronization in PyTorch (#3106)
- Fixed a bug where a CUDA stream created as a default parameter was initialized at import time (#3764)
- Fixed an attention DP bug on the Qwen3 MoE model (#4141)
- Fixed an illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
- Reset planned states to avoid a memory leak in `TrtllmAttentionWrapper` (#4227)

### Known Issues

- Multi-GPU model support on RTX Pro 6000

## TensorRT-LLM Release 0.19.0