Commit f08286c

doc: Refactor documents and examples of disaggregated serving and wide ep (#6054)
Signed-off-by: Kaiyu Xie <[email protected]>
1 parent 8ecdeee commit f08286c

File tree

17 files changed: +168 -922 lines changed

docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md

Lines changed: 21 additions & 21 deletions
@@ -2,27 +2,27 @@

 By NVIDIA TensorRT-LLM Team

-- [Disaggregated Serving in TensorRT-LLM](#Disaggregated-Serving-in-TensorRT-LLM)
-- [Motivation](#Motivation)
-- [Disaggregated Serving in TensorRT-LLM](#Disaggregated-Serving-in-TensorRT-LLM)
+- [Disaggregated Serving in TensorRT-LLM](#disaggregated-serving-in-tensorrt-llm)
+- [Motivation](#motivation)
+- [Disaggregated Serving in TensorRT-LLM](#disaggregated-serving-in-tensorrt-llm-1)
 - [trtllm-serve](#trtllm-serve)
-- [Dynamo](#Dynamo)
-- [Triton Inference Server](#Triton-Inference-Server)
-- [KV Cache Exchange](#KV-Cache-Exchange)
-- [Multi-backend Support](#Multi-backend-Support)
-- [Overlap Optimization](#Overlap-Optimization)
-- [Cache Layout Transformation](#Cache-Layout-Transformation)
-- [Performance Studies](#Performance-Studies)
-- [Measurement Methodology](#Measurement-Methodology)
-- [DeepSeek R1](#DeepSeek-R1)
-- [ISL 4400 - OSL 1200 (Machine Translation Dataset)](#ISL-4400---OSL-1200-Machine-Translation-Dataset)
-- [ISL 8192 - OSL 256 (Synthetic Dataset)](#ISL-8192---OSL-256-Synthetic-Dataset)
-- [ISL 4096 - OSL 1024 (Machine Translation Dataset)](#ISL-4096---OSL-1024-Machine-Translation-Dataset)
-- [Qwen 3](#Qwen-3)
-- [ISL 8192 - OSL 1024 (Machine Translation Dataset)](#ISL-8192---OSL-1024-Machine-Translation-Dataset)
-- [Reproducing Steps](#Reproducing-Steps)
-- [Future Work](#Future-Work)
-- [Acknowledgement](#Acknowledgement)
+- [Dynamo](#dynamo)
+- [Triton Inference Server](#triton-inference-server)
+- [KV Cache Exchange](#kv-cache-exchange)
+- [Multi-backend Support](#multi-backend-support)
+- [Overlap Optimization](#overlap-optimization)
+- [Cache Layout Transformation](#cache-layout-transformation)
+- [Performance Studies](#performance-studies)
+- [Measurement Methodology](#measurement-methodology)
+- [DeepSeek R1](#deepseek-r1)
+- [ISL 4400 - OSL 1200 (Machine Translation Dataset)](#isl-4400---osl-1200-machine-translation-dataset)
+- [ISL 8192 - OSL 256 (Synthetic Dataset)](#isl-8192---osl-256-synthetic-dataset)
+- [ISL 4096 - OSL 1024 (Machine Translation Dataset)](#isl-4096---osl-1024-machine-translation-dataset)
+- [Qwen 3](#qwen-3)
+- [ISL 8192 - OSL 1024 (Machine Translation Dataset)](#isl-8192---osl-1024-machine-translation-dataset)
+- [Reproducing Steps](#reproducing-steps)
+- [Future Work](#future-work)
+- [Acknowledgement](#acknowledgement)

 In the past tech blogs, we have introduced optimization specifically for [low-latency](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md) and [throughput](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md) oriented optimizations. For production deployment, users also care about per GPU throughput satisfying certain latency constraints. In this tech blog, we will introduce the design concept and usage of the TensorRT-LLM disaggregated serving which directly targets throughput@latency performance scenarios, together with performance study results.

@@ -277,7 +277,7 @@ We also conducted performance evaluations of Qwen 3 on GB200 GPUs. The data indi

 ### Reproducing Steps

-We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/scripts/disaggregated).
+We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/disaggregated/slurm).

 ## Future Work
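
Note on the table-of-contents hunk in this file: every anchor is lowercased, and the second "Disaggregated Serving in TensorRT-LLM" entry gains a `-1` suffix. This matches how GitHub appears to derive heading anchors when rendering Markdown (lowercase, drop punctuation other than hyphens, turn spaces into hyphens, number repeated headings). The sketch below is an approximation under those assumptions, not GitHub's actual implementation; `make_slugger` is a hypothetical helper name and it ignores edge cases such as emoji or inline markup.

```python
import re
from collections import defaultdict


def make_slugger():
    """Return a function that maps heading text to an approximate GitHub anchor.

    Assumption: this mirrors the commonly documented rule only; it is not
    taken from GitHub's source code.
    """
    seen = defaultdict(int)  # how many times each base slug has been produced

    def slug(heading: str) -> str:
        base = heading.strip().lower()
        # Keep word characters, hyphens and spaces; drop other punctuation.
        base = re.sub(r"[^\w\- ]", "", base)
        # Each space becomes one hyphen, so " - " turns into "---".
        base = base.replace(" ", "-")
        count = seen[base]
        seen[base] += 1
        # Repeated headings get -1, -2, ... appended.
        return base if count == 0 else f"{base}-{count}"

    return slug


if __name__ == "__main__":
    slug = make_slugger()
    for title in [
        "Disaggregated Serving in TensorRT-LLM",
        "Motivation",
        "Disaggregated Serving in TensorRT-LLM",
        "ISL 4400 - OSL 1200 (Machine Translation Dataset)",
    ]:
        print(f"[{title}](#{slug(title)})")
```

Running the headings of a document through a helper like this, in document order, is one way to check links such as the ones changed here before committing.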

docs/source/scripts/disaggregated/disaggr_torch.slurm

Lines changed: 0 additions & 112 deletions
This file was deleted.
