Commit 82cb10d

Update link
Signed-off-by: Kaiyu Xie <[email protected]>
1 parent f0dc52b commit 82cb10d

1 file changed: +21 -21 lines changed

docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md

Lines changed: 21 additions & 21 deletions
@@ -2,27 +2,27 @@
 
 By NVIDIA TensorRT-LLM Team
 
-- [Disaggregated Serving in TensorRT-LLM](#Disaggregated-Serving-in-TensorRT-LLM)
-- [Motivation](#Motivation)
-- [Disaggregated Serving in TensorRT-LLM](#Disaggregated-Serving-in-TensorRT-LLM)
-- [trtllm-serve](#trtllm-serve)
-- [Dynamo](#Dynamo)
-- [Triton Inference Server](#Triton-Inference-Server)
-- [KV Cache Exchange](#KV-Cache-Exchange)
-- [Multi-backend Support](#Multi-backend-Support)
-- [Overlap Optimization](#Overlap-Optimization)
-- [Cache Layout Transformation](#Cache-Layout-Transformation)
-- [Performance Studies](#Performance-Studies)
-- [Measurement Methodology](#Measurement-Methodology)
-- [DeepSeek R1](#DeepSeek-R1)
-- [ISL 4400 - OSL 1200 (Machine Translation Dataset)](#ISL-4400---OSL-1200-Machine-Translation-Dataset)
-- [ISL 8192 - OSL 256 (Synthetic Dataset)](#ISL-8192---OSL-256-Synthetic-Dataset)
-- [ISL 4096 - OSL 1024 (Machine Translation Dataset)](#ISL-4096---OSL-1024-Machine-Translation-Dataset)
-- [Qwen 3](#Qwen-3)
-- [ISL 8192 - OSL 1024 (Machine Translation Dataset)](#ISL-8192---OSL-1024-Machine-Translation-Dataset)
-- [Reproducing Steps](#Reproducing-Steps)
-- [Future Work](#Future-Work)
-- [Acknowledgement](#Acknowledgement)
+- [Disaggregated Serving in TensorRT-LLM](#disaggregated-serving-in-tensorrt-llm)
+- [Motivation](#motivation)
+- [Disaggregated Serving in TensorRT-LLM](#disaggregated-serving-in-tensorrt-llm-1)
+- [trtllm-serve](#trtllm-serve)
+- [Dynamo](#dynamo)
+- [Triton Inference Server](#triton-inference-server)
+- [KV Cache Exchange](#kv-cache-exchange)
+- [Multi-backend Support](#multi-backend-support)
+- [Overlap Optimization](#overlap-optimization)
+- [Cache Layout Transformation](#cache-layout-transformation)
+- [Performance Studies](#performance-studies)
+- [Measurement Methodology](#measurement-methodology)
+- [DeepSeek R1](#deepseek-r1)
+- [ISL 4400 - OSL 1200 (Machine Translation Dataset)](#isl-4400---osl-1200-machine-translation-dataset)
+- [ISL 8192 - OSL 256 (Synthetic Dataset)](#isl-8192---osl-256-synthetic-dataset)
+- [ISL 4096 - OSL 1024 (Machine Translation Dataset)](#isl-4096---osl-1024-machine-translation-dataset)
+- [Qwen 3](#qwen-3)
+- [ISL 8192 - OSL 1024 (Machine Translation Dataset)](#isl-8192---osl-1024-machine-translation-dataset)
+- [Reproducing Steps](#reproducing-steps)
+- [Future Work](#future-work)
+- [Acknowledgement](#acknowledgement)
 
 In the past tech blogs, we have introduced optimization specifically for [low-latency](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md) and [throughput](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md) oriented optimizations. For production deployment, users also care about per GPU throughput satisfying certain latency constraints. In this tech blog, we will introduce the design concept and usage of the TensorRT-LLM disaggregated serving which directly targets throughput@latency performance scenarios, together with performance study results.
 

@@ -277,7 +277,7 @@ We also conducted performance evaluations of Qwen 3 on GB200 GPUs. The data indi
 
 ### Reproducing Steps
 
-We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/scripts/disaggregated).
+We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/disaggregated/slurm).
 
 ## Future Work
 
