 
 By NVIDIA TensorRT-LLM Team
 
-- [Disaggregated Serving in TensorRT-LLM](#Disaggregated-Serving-in-TensorRT-LLM)
- - [Motivation](#Motivation)
- - [Disaggregated Serving in TensorRT-LLM](#Disaggregated-Serving-in-TensorRT-LLM)
+- [Disaggregated Serving in TensorRT-LLM](#disaggregated-serving-in-tensorrt-llm)
+ - [Motivation](#motivation)
+ - [Disaggregated Serving in TensorRT-LLM](#disaggregated-serving-in-tensorrt-llm-1)
 - [trtllm-serve](#trtllm-serve)
- - [Dynamo](#Dynamo)
- - [Triton Inference Server](#Triton-Inference-Server)
- - [KV Cache Exchange](#KV-Cache-Exchange)
- - [Multi-backend Support](#Multi-backend-Support)
- - [Overlap Optimization](#Overlap-Optimization)
- - [Cache Layout Transformation](#Cache-Layout-Transformation)
- - [Performance Studies](#Performance-Studies)
- - [Measurement Methodology](#Measurement-Methodology)
- - [DeepSeek R1](#DeepSeek-R1)
- - [ISL 4400 - OSL 1200 (Machine Translation Dataset)](#ISL-4400---OSL-1200-Machine-Translation-Dataset)
- - [ISL 8192 - OSL 256 (Synthetic Dataset)](#ISL-8192---OSL-256-Synthetic-Dataset)
- - [ISL 4096 - OSL 1024 (Machine Translation Dataset)](#ISL-4096---OSL-1024-Machine-Translation-Dataset)
- - [Qwen 3](#Qwen-3)
- - [ISL 8192 - OSL 1024 (Machine Translation Dataset)](#ISL-8192---OSL-1024-Machine-Translation-Dataset)
- - [Reproducing Steps](#Reproducing-Steps)
- - [Future Work](#Future-Work)
- - [Acknowledgement](#Acknowledgement)
+ - [Dynamo](#dynamo)
+ - [Triton Inference Server](#triton-inference-server)
+ - [KV Cache Exchange](#kv-cache-exchange)
+ - [Multi-backend Support](#multi-backend-support)
+ - [Overlap Optimization](#overlap-optimization)
+ - [Cache Layout Transformation](#cache-layout-transformation)
+ - [Performance Studies](#performance-studies)
+ - [Measurement Methodology](#measurement-methodology)
+ - [DeepSeek R1](#deepseek-r1)
+ - [ISL 4400 - OSL 1200 (Machine Translation Dataset)](#isl-4400---osl-1200-machine-translation-dataset)
+ - [ISL 8192 - OSL 256 (Synthetic Dataset)](#isl-8192---osl-256-synthetic-dataset)
+ - [ISL 4096 - OSL 1024 (Machine Translation Dataset)](#isl-4096---osl-1024-machine-translation-dataset)
+ - [Qwen 3](#qwen-3)
+ - [ISL 8192 - OSL 1024 (Machine Translation Dataset)](#isl-8192---osl-1024-machine-translation-dataset)
+ - [Reproducing Steps](#reproducing-steps)
+ - [Future Work](#future-work)
+ - [Acknowledgement](#acknowledgement)
 
 In the past tech blogs, we have introduced optimization specifically for [low-latency](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md) and [throughput](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md) oriented optimizations. For production deployment, users also care about per GPU throughput satisfying certain latency constraints. In this tech blog, we will introduce the design concept and usage of the TensorRT-LLM disaggregated serving which directly targets throughput@latency performance scenarios, together with performance study results.
 
@@ -277,7 +277,7 @@ We also conducted performance evaluations of Qwen 3 on GB200 GPUs. The data indi
 
 ### Reproducing Steps
 
-We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/scripts/disaggregated).
+We provide a set of scripts to reproduce the performance data presented in this paper. Please refer to the usage instructions described in [this document](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/disaggregated/slurm).
 
 ## Future Work
 