This command will launch m context engines and n generation engines. You need to ensure `proc` equals the total number of processes required by all engines plus 1: since `disaggServerBenchmark` runs in orchestrator mode, one additional process acts as the orchestrator. For example, with two context engines (one TP2_PP1, the other TP1_PP1) and two generation engines (one TP2_PP1, the other TP1_PP1), the engines require 2 + 1 + 2 + 1 = 6 processes, so `proc` should be set to 7.
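The rule above can be sketched as a small shell computation (the engine sizes are taken from the example; each engine needs TP × PP ranks):

```shell
# proc = sum of (TP * PP) over all engines, plus 1 orchestrator process
engine_ranks=(2 1 2 1)   # context: TP2_PP1, TP1_PP1; generation: TP2_PP1, TP1_PP1
proc=1                   # start at 1 to account for the orchestrator
for r in "${engine_ranks[@]}"; do
  proc=$((proc + r))
done
echo "$proc"             # prints 7
```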
When the environment variable `TRTLLM_USE_MPI_KVCACHE=1` is set, TRT-LLM will transfer the KV cache using `CUDA-aware MPI`. All executor processes involved must share the same MPI world communicator. Consequently, with `TRTLLM_USE_MPI_KVCACHE=1`, TRT-LLM only supports launching multiple executors via `MPI`. Additionally, the `CommunicationMode` for the executors must be set to `kLEADER` or `kORCHESTRATOR` with `SpawnProcesses=false` for the `disaggregated-service`. These restrictions do not apply when `TRTLLM_USE_UCX_KVCACHE=1` is set.
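Because all executor processes must share one MPI world communicator, the entire benchmark is started by a single `mpirun`. A hedged sketch (the benchmark arguments are elided here and must match your engine setup; `-n 7` follows the example above):

```bash
# Enable CUDA-aware MPI KV cache transfer, then launch every
# executor process plus the orchestrator in one MPI world.
export TRTLLM_USE_MPI_KVCACHE=1
mpirun -n 7 disaggServerBenchmark ...
```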
*Q. Does TRT-LLM support using GPU direct RDMA for inter-node KV cache transfer?*

A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
`examples/disaggregated/README.md`:
```yaml
cache_transceiver_config:
  backend: <str>
  max_tokens_in_buffer: <int>
```
`backend` specifies the communication backend for transferring the KV cache. Valid options include `DEFAULT`, `UCX`, `NIXL`, and `MPI`; the default backend is `UCX`.
`max_tokens_in_buffer` defines the buffer size for KV cache transfers. For optimal performance, it is recommended to set this value greater than or equal to the maximum ISL (Input Sequence Length) across all requests.
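For example, the two fields above might be filled in as follows (the values are illustrative, not recommendations):

```yaml
cache_transceiver_config:
  backend: UCX
  max_tokens_in_buffer: 4096   # >= the maximum input sequence length of your workload
```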
You can use multiple `trtllm-serve` commands to launch the context and generation servers required for disaggregated serving. For instance, you might start two context servers and one generation server as shown below.
```bash
# Generate context_extra-llm-api-config.yml
# The overlap scheduler for context servers is disabled because it is not yet supported for disaggregated context servers
echo -e "disable_overlap_scheduler: True" > context_extra-llm-api-config.yml
```
The MPI communication backend for KV cache transfer has been deprecated and may not be supported in the future. When using the MPI backend, the environment variable `TRTLLM_USE_MPI_KVCACHE=1` should be set to avoid conflicts between mpi4py and KV cache transfer.