
Commit d010b20

[TRTLLM-7030][fix] BREAKING CHANGE: Mismatch between docs and actual commands (#7191)
Signed-off-by: Shixiaowei02 <[email protected]>
1 parent 5d16518 commit d010b20

50 files changed: +144 −140 lines


benchmarks/cpp/README.md

Lines changed: 2 additions & 2 deletions

@@ -336,15 +336,15 @@ cd cpp/build
 `disaggServerBenchmark` only supports `decoder-only` models.
 Here is the basic usage:
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1
 mpirun -n ${proc} benchmarks/disaggServerBenchmark --context_engine_dirs ${context_engine_0},${context_engine_1}...,${context_engine_{m-1}} \
 --generation_engine_dirs ${generation_engine_0},${generation_engine_1}...,${generation_engine_{n-1}} --dataset ${dataset_path}
 ```
 This command will launch m context engines and n generation engines. You need to ensure `proc` is equal to the sum of the number of processes required for each engine plus 1. Since we use orchestrator mode for `disaggServerBenchmark`, we need an additional process as the orchestrator. For example, if there are two context engines (one is TP2_PP1, another is TP1_PP1) and two generation engines (one is TP2_PP1, another is TP1_PP1), then the `proc` value should be set to 7.

 for example:
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1
 mpirun -n 7 benchmarks/disaggServerBenchmark --context_engine_dirs ${llama_7b_tp2_pp1_dir},${llama_7b_tp1_pp1_dir} --generation_engine_dirs ${llama_7b_tp1_pp1_dir},${llama_7b_tp2_pp1_dir} --dataset ${dataset_path}

 # need 6 gpus and 7 processes to launch the benchmark.
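The `proc` rule in this README lends itself to a quick sanity check. The sketch below is purely illustrative; the `required_procs` helper is hypothetical and not part of the repository.

```python
# Hypothetical helper: each engine needs tensor_parallel * pipeline_parallel
# ranks, and orchestrator mode adds one extra process on top.
def required_procs(engines: list[tuple[int, int]]) -> int:
    """engines: one (tp, pp) tuple per context/generation engine."""
    return sum(tp * pp for tp, pp in engines) + 1

# Two context engines (TP2_PP1, TP1_PP1) plus two generation engines
# (TP1_PP1, TP2_PP1), as in the example above: 6 GPUs + 1 orchestrator = 7.
assert required_procs([(2, 1), (1, 1), (1, 1), (2, 1)]) == 7
```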

docs/source/advanced/disaggregated-service.md

Lines changed: 0 additions & 11 deletions

@@ -66,17 +66,6 @@ A. Yes, it's recommended that different executor use different GPUs . We support

 ### Debugging FAQs

-*Q. How to handle error `Disaggregated serving is not enabled, please check the configuration?`*
-
-A. please set `backendType` of `CacheTransceiverConfig`.
-```cpp
-ExecutorConfig executorConfig{...};
-
-executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendType::DEFAULT));
-```
-
-When the environment variable `TRTLLM_USE_MPI_KVCACHE=1` is set, TRT-LLM will transfer the KV cache using `CUDA-aware MPI`. All executor processes involved must share the same MPI world communicator. Consequently, with `TRTLLM_USE_MPI_KVCACHE=1`, TRT-LLM only supports launching multiple executors via `MPI`. Additionally, the `CommunicationMode` for the executors must be set to `kLEADER` or `kORCHESTRATOR` with `SpawnProcesses=false` for the `disaggregated-service`. These restrictions do not apply when `TRTLLM_USE_UCX_KVCACHE=1` is set.
-
 *Q. Does TRT-LLM support using GPU direct RDMA for inter-node KV Cache transfer?*

 A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
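With the C++ `ExecutorConfig` snippet above removed, the backend is selected through `cache_transceiver_config` instead. As a minimal sketch, assuming the pydantic `CacheTransceiverConfig` is importable from the module path shown in the `llm_args.py` change later in this commit:

```python
# Sketch only: select the KV cache transceiver backend via the LLM API config
# object instead of the removed ExecutorConfig C++ snippet. Field names follow
# the llm_args.py change in this commit; the import path is an assumption.
from tensorrt_llm.llmapi.llm_args import CacheTransceiverConfig

cache_cfg = CacheTransceiverConfig(backend="DEFAULT")  # or "UCX", "NIXL", "MPI"
```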

examples/cpp/executor/README.md

Lines changed: 2 additions & 2 deletions

@@ -124,10 +124,10 @@ From the `examples/cpp/executor/build` folder, you can also run the `executorExa
 ```
 ./executorExampleDisaggregated -h
 ```
-Note setting `TRTLLM_USE_MPI_KVCACHE=1` is required to run disaggregated executor.
+Note setting `TRTLLM_USE_UCX_KVCACHE=1` is required to run disaggregated executor.
 For example, you can run:
 ```
-export TRTLLM_USE_MPI_KVCACHE=1
+export TRTLLM_USE_UCX_KVCACHE=1

 mpirun -n <num_ranks> --allow-run-as-root --oversubscribe ./executorExampleDisaggregated --context_engine_dir <path_to_context_engine_dir> --context_rank_size <num_ranks_for_context> --generation_engine_dir <path_to_generation_engine_dir> --generation_rank_size <num_ranks_for_generation> --input_tokens ../inputTokens.csv

examples/disaggregated/README.md

Lines changed: 31 additions & 16 deletions

@@ -12,24 +12,39 @@ cache_transceiver_config:
   max_tokens_in_buffer: <int>
 ```
-`backend` specifies the communication backend for transferring the kvCache, valid options include `DEFAULT`,`UCX`, `NIXL`, and `MPI`, the default backend is UCX.
+`backend` specifies the communication backend for transferring the KV cache; valid options include `DEFAULT`, `UCX`, `NIXL`, and `MPI`, and the default backend is `UCX`.

-`max_tokens_in_buffer` defines the buffer size for kvCache transfers, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.
+`max_tokens_in_buffer` defines the buffer size for KV cache transfers; it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests for optimal performance.

-You can use multiple `trtllm-serve` commands to launch the context and generation servers that will be used
-for disaggregated serving. For example, you could launch two context servers and one generation servers as follows:
+You can use multiple `trtllm-serve` commands to launch the context and generation servers required for disaggregated serving. For instance, you might start two context servers and one generation server as shown below.

-```bash
-# Generate context_extra-llm-api-config.yml
-# Overlap scheduler for context servers are disabled because it's not supported for disaggregated context servers yet
-echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > context_extra-llm-api-config.yml
+Begin by creating `ctx_extra-llm-api-config.yml` and `gen_extra-llm-api-config.yml` following the specified format.

-# Start context servers
-CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_0 &
-CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_1 &
+```yaml
+# ctx_extra-llm-api-config.yml
+
+# The overlap scheduler for context servers is currently disabled, as it is
+# not yet supported in disaggregated context server architectures.
+disable_overlap_scheduler: True
+cache_transceiver_config:
+  backend: UCX
+  max_tokens_in_buffer: 2048
+```

-# Generate gen_extra-llm-api-config.yml
-echo -e "cache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > gen_extra-llm-api-config.yml
+```yaml
+# gen_extra-llm-api-config.yml
+
+cache_transceiver_config:
+  backend: UCX
+  max_tokens_in_buffer: 2048
+```
+
+Then, start the context and generation servers separately.
+
+```bash
+# Start context servers
+CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_0 &
+CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --extra_llm_api_options ./ctx_extra-llm-api-config.yml &> log_ctx_1 &

 # Start generation servers
 CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &

@@ -95,8 +110,8 @@ After this, you can enable the dynamic scaling feature for the use case above as
 export TRTLLM_USE_UCX_KVCACHE=1

 # Context servers
-CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
-CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./context_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &
+CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_0 &
+CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --server_role CONTEXT --extra_llm_api_options ./ctx_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_ctx_1 &

 # Generation servers
 CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --server_role GENERATION --extra_llm_api_options ./gen_extra-llm-api-config.yml --metadata_server_config_file ./metadata_config.yml &> log_gen_0 &

@@ -180,4 +195,4 @@ trtllm-serve disaggregated -c disagg_config.yaml
 ## Known Issues

-The MPI communication backend for kvCache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and kvCache transfer.
+The MPI communication backend for KV cache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and KV cache transfer.
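Once the servers and the disaggregated router from `trtllm-serve disaggregated -c disagg_config.yaml` are up, the pipeline can be exercised end to end. A minimal client sketch follows, assuming the OpenAI-compatible endpoint that `trtllm-serve` exposes and the port 8000 used by the test configs in this commit:

```python
# Minimal client sketch. Assumes `trtllm-serve disaggregated` is listening on
# localhost:8000 (as in the test configs in this commit) and serves the
# OpenAI-compatible /v1/completions route.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Disaggregated serving splits context and generation",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```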

examples/disaggregated/disagg_config.yaml

Lines changed: 2 additions & 2 deletions

@@ -11,14 +11,14 @@ context_servers:
   kv_cache_config:
     free_gpu_memory_fraction: 0.2
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8001"
 generation_servers:
   num_instances: 1
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8002"

examples/disaggregated/slurm/gen_yaml.py

Lines changed: 2 additions & 2 deletions

@@ -197,7 +197,7 @@ def gen_config_file(config_path: str,
             },
             'cache_transceiver_config': {
                 'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
-                'backend': 'default',
+                'backend': 'DEFAULT',
             },
         },
         'generation_servers': {
@@ -225,7 +225,7 @@ def gen_config_file(config_path: str,
             },
             'cache_transceiver_config': {
                 'max_tokens_in_buffer': cache_transceiver_max_num_tokens,
-                'backend': 'DEFAULT',
             },
             'stream_interval': 20,
         }
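For reference, a short sketch of how a fragment like the one `gen_config_file` builds serializes into the YAML shown in `disagg_config.yaml`. The value 4096 is a stand-in for `cache_transceiver_max_num_tokens`, and PyYAML is assumed to be available:

```python
# Sketch: dump a config fragment the way gen_yaml.py's dict would serialize.
import yaml  # PyYAML, assumed available

fragment = {
    'cache_transceiver_config': {
        'max_tokens_in_buffer': 4096,  # placeholder for cache_transceiver_max_num_tokens
        'backend': 'DEFAULT',
    },
}
print(yaml.safe_dump(fragment, sort_keys=False))
# cache_transceiver_config:
#   max_tokens_in_buffer: 4096
#   backend: DEFAULT
```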

tensorrt_llm/llmapi/llm_args.py

Lines changed: 1 addition & 1 deletion

@@ -1039,7 +1039,7 @@ class CacheTransceiverConfig(StrictBaseModel, PybindMirror):
     Configuration for the cache transceiver.
     """

-    backend: Optional[Literal["default", "ucx", "nixl", "mpi"]] = Field(
+    backend: Optional[Literal["DEFAULT", "UCX", "NIXL", "MPI"]] = Field(
         default=None,
         description=
         "The communication backend type to use for the cache transceiver.")

tests/integration/defs/accuracy/test_disaggregated_serving.py

Lines changed: 20 additions & 20 deletions

@@ -260,7 +260,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
         "disable_overlap_scheduler": True,
         "kv_cache_config": kv_cache_config,
         "cache_transceiver_config": {
-            "backend": "default"
+            "backend": "DEFAULT"
         }
     }
     gen_server_config = {
@@ -269,7 +269,7 @@ def run_parallel_test(model_name: str, model_path: str, ctx_pp: int,
         "disable_overlap_scheduler": True,
         "kv_cache_config": kv_cache_config,
         "cache_transceiver_config": {
-            "backend": "default"
+            "backend": "DEFAULT"
         }
     }

@@ -309,8 +309,8 @@ def test_auto_dtype(self, disable_overlap_scheduler):
         gen_server_config = {
             "disable_overlap_scheduler": disable_overlap_scheduler
         }
-        ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
-        gen_server_config["cache_transceiver_config"] = {"backend": "default"}
+        ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
+        gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
         disaggregated_server_config = {
             "hostname": "localhost",
             "port": 8000,
@@ -351,15 +351,15 @@ def test_ngram(self):
             "disable_overlap_scheduler": True,
             "kv_cache_config": kv_cache_config,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": True,
             "speculative_config": speculative_decoding_config,
             "kv_cache_config": kv_cache_config,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         disaggregated_server_config = {
@@ -404,7 +404,7 @@ def test_eagle3(self, overlap_scheduler, eagle3_one_model):
             "max_num_tokens": 13393 * 2,
             "max_batch_size": 1,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             },
             "cuda_graph_config": None,
         }
@@ -418,7 +418,7 @@ def test_eagle3(self, overlap_scheduler, eagle3_one_model):
             "max_num_tokens": 13393 * 2,
             "max_batch_size": 16,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             },
             "cuda_graph_config": None,
         }
@@ -472,8 +472,8 @@ class TestLlama4ScoutInstruct(LlmapiAccuracyTestHarness):
     def test_auto_dtype(self, overlap_scheduler):
         ctx_server_config = {"disable_overlap_scheduler": True}
         gen_server_config = {"disable_overlap_scheduler": overlap_scheduler}
-        ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
-        gen_server_config["cache_transceiver_config"] = {"backend": "default"}
+        ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
+        gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
         # Keep this low to avoid warmup OOM in CI
         ctx_server_config["max_seq_len"] = 8192
         gen_server_config["max_seq_len"] = 8192
@@ -513,13 +513,13 @@ def test_nixl_backend(self):
         ctx_server_config = {
             "disable_overlap_scheduler": True,
             "cache_transceiver_config": {
-                "backend": "nixl"
+                "backend": "NIXL"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": True,
             "cache_transceiver_config": {
-                "backend": "nixl"
+                "backend": "NIXL"
             }
         }
         disaggregated_server_config = {
@@ -550,8 +550,8 @@ def test_auto_dtype(self, overlap_scheduler, mtp_nextn):
         ctx_server_config = {"disable_overlap_scheduler": True}
         gen_server_config = {"disable_overlap_scheduler": not overlap_scheduler}
-        ctx_server_config["cache_transceiver_config"] = {"backend": "default"}
-        gen_server_config["cache_transceiver_config"] = {"backend": "default"}
+        ctx_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
+        gen_server_config["cache_transceiver_config"] = {"backend": "DEFAULT"}
         if mtp_nextn > 0:
             ctx_server_config["speculative_config"] = {
                 "decoding_type": "MTP",
@@ -597,14 +597,14 @@ def test_auto_dtype(self, overlap_scheduler):
             "disable_overlap_scheduler": True,
             "cuda_graph_config": None,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": overlap_scheduler,
             "cuda_graph_config": None,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         ctx_server_config["kv_cache_config"] = {
@@ -648,13 +648,13 @@ def test_nixl_backend(self):
         ctx_server_config = {
             "disable_overlap_scheduler": True,
             "cache_transceiver_config": {
-                "backend": "nixl"
+                "backend": "NIXL"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": True,
             "cache_transceiver_config": {
-                "backend": "nixl"
+                "backend": "NIXL"
             }
         }
         ctx_server_config["cache_transceiver_config"]
@@ -686,14 +686,14 @@ def test_auto_dtype(self, overlap_scheduler):
             "disable_overlap_scheduler": True,
             "cuda_graph_config": None,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         gen_server_config = {
             "disable_overlap_scheduler": overlap_scheduler,
             "cuda_graph_config": None,
             "cache_transceiver_config": {
-                "backend": "default"
+                "backend": "DEFAULT"
             }
         }
         disaggregated_server_config = {

tests/integration/defs/disaggregated/test_configs/disagg_config_cache_aware_balance.yaml

Lines changed: 2 additions & 2 deletions

@@ -21,7 +21,7 @@ context_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   urls:
     - "localhost:8001"
     - "localhost:8002"
@@ -35,7 +35,7 @@ generation_servers:
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   cache_transceiver_config:
-    backend: default
+    backend: DEFAULT
   kv_cache_config:
     enable_block_reuse: True
     enable_partial_reuse: False

tests/integration/defs/disaggregated/test_configs/disagg_config_cache_aware_balance_deepseek_v3.yaml

Lines changed: 2 additions & 2 deletions

@@ -17,7 +17,7 @@ context_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.1
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8001"
     - "localhost:8002"
@@ -33,7 +33,7 @@ generation_servers:
     event_buffer_max_size: 1024
     free_gpu_memory_fraction: 0.1
   cache_transceiver_config:
-    backend: "default"
+    backend: "DEFAULT"
   urls:
     - "localhost:8003"
     - "localhost:8004"
