Commit d69ced8

kaiyux authored and amirkl94 committed
doc: Minor fixes and clarification (NVIDIA#4975)
Signed-off-by: Kaiyu Xie <[email protected]>
1 parent c0b9b6f · commit d69ced8

File tree: 1 file changed (+8, -4)


examples/models/core/deepseek_v3/README.md

Lines changed: 8 additions & 4 deletions
````diff
@@ -30,7 +30,7 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
   - [trtllm-serve](#trtllm-serve)
   - [Disaggregated Serving](#disaggregated-serving)
     - [Dynamo](#dynamo)
-  - [tensorrtllm_backend for triton inference server (Experimental)](#tensorrtllm_backend-for-triton-inference-server-experimental)
+  - [tensorrtllm\_backend for triton inference server (Experimental)](#tensorrtllm_backend-for-triton-inference-server-experimental)
 - [Advanced Usages](#advanced-usages)
   - [Multi-node](#multi-node)
     - [mpirun](#mpirun)
@@ -40,6 +40,8 @@ Please refer to [this guide](https://nvidia.github.io/TensorRT-LLM/installation/
   - [FlashMLA](#flashmla)
   - [FP8 KV Cache and MLA](#fp8-kv-cache-and-mla)
   - [W4AFP8](#w4afp8)
+    - [Activation calibration](#activation-calibration)
+    - [Weight quantization and assembling](#weight-quantization-and-assembling)
   - [KV Cache Reuse](#kv-cache-reuse)
 - [Notes and Troubleshooting](#notes-and-troubleshooting)
 - [Known Issues](#known-issues)
@@ -227,6 +229,8 @@ trtllm-eval --model <YOUR_MODEL_DIR> \
 ## Serving
 ### trtllm-serve
 
+Take max-throughput scenario on B200 as an example, the settings are extracted from the [blog](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-max-throughput). **For users' own models and cases, the specific settings could be different to get best performance.**
+
 To serve the model using `trtllm-serve`:
 
 ```bash
@@ -253,12 +257,12 @@ trtllm-serve \
     --host localhost \
     --port 8000 \
     --backend pytorch \
-    --max_batch_size 161 \
-    --max_num_tokens 1160 \
+    --max_batch_size 384 \
+    --max_num_tokens 1536 \
     --tp_size 8 \
     --ep_size 8 \
     --pp_size 1 \
-    --kv_cache_free_gpu_memory_fraction 0.95 \
+    --kv_cache_free_gpu_memory_fraction 0.85 \
     --extra_llm_api_options ./extra-llm-api-config.yml
 ```
 
````
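The command points `--extra_llm_api_options` at `./extra-llm-api-config.yml`, which the diff does not show. Below is a minimal sketch of what such a file can contain; the key names (`enable_attention_dp`, `cuda_graph_config`) are illustrative of the PyTorch-backend options and may differ across TensorRT-LLM versions, so the linked best-practices blog remains the authoritative source for the B200 max-throughput settings.

```bash
# Illustrative sketch only: exact keys depend on the TensorRT-LLM version;
# see the linked blog post for the settings used in the B200 runs.
cat > ./extra-llm-api-config.yml <<'EOF'
# Shard attention across the 8 ranks alongside expert parallelism (ep_size 8).
enable_attention_dp: true
# Capture padded CUDA graphs up to the serving batch size (matches --max_batch_size 384).
cuda_graph_config:
  enable_padding: true
  max_batch_size: 384
EOF
```

Note that the commit also lowers `--kv_cache_free_gpu_memory_fraction` from 0.95 to 0.85, plausibly to leave headroom for activations and CUDA graphs at the larger batch size.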
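Once `trtllm-serve` is up, it exposes an OpenAI-compatible HTTP API on the configured host and port. A quick smoke test is sketched below; the `"model"` value is a placeholder for whatever name or path the server was started with.

```bash
# Check that the server is live, then send a small chat completion request.
curl -s http://localhost:8000/health

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<YOUR_MODEL_DIR>",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32
      }'
```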