
Commit 091a1db

chore: [Breaking Change] Rename cuda_graph_config padding_enabled field to enable_padding.
Signed-off-by: nv-guomingz <[email protected]>
1 parent 6490a27 commit 091a1db

File tree

18 files changed (+43, -43 lines)

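All 18 files apply the same mechanical rename inside `cuda_graph_config`. For quick reference, an extra LLM API config written against the new field name looks like this (a sketch; the batch sizes are illustrative values taken from the diffs below):

```yaml
cuda_graph_config:
  # was: padding_enabled: true
  enable_padding: true
  batch_sizes:
  - 1
  - 2
```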

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 2 additions & 2 deletions
@@ -196,7 +196,7 @@ We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 896
   - 512
@@ -263,7 +263,7 @@ YOUR_DATA_PATH=./dataset.txt
 
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2

docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ These optimizations target the overall execution flow, scheduling, and resource
 
 There is a feature called CUDA Graph padding in TensorRT-LLM, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio; it tries to pad a batch to the nearest one with a captured CUDA Graph. Normally you should enable the CUDA Graph padding feature to increase the CUDA Graph hit rate, but the padding itself has some overhead due to wasted tokens computation.
 
-Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n padding_enabled: False`, see API here [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41)
+Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n enable_padding: False`, see API here [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41)
 
 * Overlap Scheduler:
 
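Written out as a standalone YAML file rather than the inline `cuda_graph_config:\n enable_padding: False` notation quoted above, the opt-out described in this blog passage would look roughly like this (a sketch following the other config files in this commit):

```yaml
# Disable CUDA Graph padding to avoid the wasted-token computation,
# at the cost of a lower CUDA Graph hit rate.
cuda_graph_config:
  enable_padding: false
```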
docs/source/performance/perf-overview.md

Lines changed: 1 addition & 1 deletion
@@ -201,7 +201,7 @@ trtllm-bench --model $model_name throughput --dataset $dataset_file --backend py
 `llm_options.yml`
 ```yaml
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2

docs/source/scripts/disaggregated/gen_yaml.py

Lines changed: 1 addition & 1 deletion
@@ -190,7 +190,7 @@ def gen_config_file(config_path: str,
     'max_seq_len': 8576,
     'free_gpu_memory_fraction': gen_gpu_memory_fraction,
     'cuda_graph_config': {
-        'padding_enabled': True,
+        'enable_padding': True,
         'batch_sizes': gen_cuda_graph_batch_sizes,
     },
     'print_iter_log': True,

examples/llm-api/quickstart_advanced.py

Lines changed: 1 addition & 1 deletion
@@ -188,7 +188,7 @@ def setup_llm(args):
 
     cuda_graph_config = CudaGraphConfig(
         batch_sizes=args.cuda_graph_batch_sizes,
-        padding_enabled=args.cuda_graph_padding_enabled,
+        enable_padding=args.cuda_graph_padding_enabled,
     ) if args.use_cuda_graph else None
     llm = LLM(
         model=args.model_dir,

examples/models/core/deepseek_v3/README.md

Lines changed: 5 additions & 5 deletions
@@ -142,7 +142,7 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes: [1, 4, 8, 12]
 EOF
 
@@ -169,7 +169,7 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes: [1, 2]
 moe_max_num_tokens: 16384
 EOF
@@ -237,7 +237,7 @@ To serve the model using `trtllm-serve`:
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2
@@ -316,7 +316,7 @@ export TRTLLM_USE_UCX_KVCACHE=1
 
 cat >./gen-extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2
@@ -538,7 +538,7 @@ python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
 
 cat >/path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2

examples/models/core/qwen/README.md

Lines changed: 2 additions & 2 deletions
@@ -745,7 +745,7 @@ To serve the model using `trtllm-serve`:
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2
@@ -821,7 +821,7 @@ export TRTLLM_USE_UCX_KVCACHE=1
 
 cat >./gen-extra-llm-api-config.yml <<EOF
 cuda_graph_config:
-  padding_enabled: true
+  enable_padding: true
   batch_sizes:
   - 1
   - 2

examples/wide_ep/slurm_scripts/gen_yaml.py

Lines changed: 1 addition & 1 deletion
@@ -196,7 +196,7 @@ def gen_config_file(config_path: str,
     'max_seq_len': 2176,
     'free_gpu_memory_fraction': gen_gpu_memory_fraction,
     'cuda_graph_config': {
-        'padding_enabled': True,
+        'enable_padding': True,
         'batch_sizes': gen_cuda_graph_batch_sizes,
     },
     'print_iter_log': True,

tensorrt_llm/_torch/pyexecutor/model_engine.py

Lines changed: 3 additions & 3 deletions
@@ -309,7 +309,7 @@ def get_rank_model_storage(model):
 def _filter_cuda_graph_batch_sizes(cuda_graph_batch_sizes: list[int],
                                    max_batch_size: int, max_num_tokens: int,
                                    max_draft_len: int,
-                                   padding_enabled: bool) -> list[int]:
+                                   enable_padding: bool) -> list[int]:
     # This is the largest possible batch size for a pure decoding batch.
     max_cuda_graph_bs = min(max_batch_size,
                             int(max_num_tokens / (1 + max_draft_len)))
@@ -326,8 +326,8 @@ def _filter_cuda_graph_batch_sizes(cuda_graph_batch_sizes: list[int],
         # is that if the user is OK padding to a batch size B, they should also
         # be OK with padding to some size B' < B since the performance will generally
         # just be better in the smaller case.
-        if padding_enabled and (i == 0
-                                or result[i - 1] != max_cuda_graph_bs):
+        if enable_padding and (i == 0
+                               or result[i - 1] != max_cuda_graph_bs):
             logger.warning(
                 "CUDA graph padding is enabled, but one of the given CUDA graph "
                 f"batch sizes ({bs}) is larger than the executor's max batch size "

tensorrt_llm/bench/benchmark/utils/general.py

Lines changed: 1 addition & 1 deletion
@@ -152,7 +152,7 @@ def get_settings(params: dict, dataset_metadata: DatasetMetadata, model: str,
         pass
 
     cuda_graph_config = {
-        "padding_enabled": True,
+        "enable_padding": True,
         "max_batch_size": max_batch_size
     }
 
