@@ -134,9 +134,8 @@ To do the benchmark, run the following command:
YOUR_DATA_PATH=<your dataset file following the format>

cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
Collaborator: I think we should throw a "deprecated" message when the old config type is detected. Currently, old configs are still accepted, but all the fields under pytorch_backend_config are ignored, so none of those settings take effect. This causes confusion for users (like myself).

Collaborator: I think this is a very fair request. @Superjomn
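Below is a minimal sketch of the behavior requested in this thread; it is not part of the PR. The helper name `warn_on_legacy_config` and its call site are hypothetical, shown only to illustrate detecting a leftover `pytorch_backend_config` section, warning, and merging its fields so they are not silently ignored.

```python
# Hypothetical sketch (not in this PR): detect the legacy nested section,
# emit a deprecation warning, and merge its fields instead of dropping them.
import warnings


def warn_on_legacy_config(extra_llm_api_options: dict) -> dict:
    legacy = extra_llm_api_options.pop("pytorch_backend_config", None)
    if legacy:
        warnings.warn(
            "'pytorch_backend_config' is deprecated; move its fields to the "
            "top level of the extra LLM API options file.",
            DeprecationWarning,
        )
        # Keep legacy values effective, but let explicit top-level keys win.
        for key, value in legacy.items():
            extra_llm_api_options.setdefault(key, value)
    return extra_llm_api_options
```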

-  use_cuda_graph: true
-  moe_backend: TRTLLM
+use_cuda_graph: true
+moe_backend: TRTLLM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
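For orientation, a rough sketch of why the flattened layout works: with this change, the keys that used to sit under `pytorch_backend_config` are read as top-level LLM API options, so an options file can be loaded and forwarded directly. The loader below is illustrative only; the actual plumbing inside `trtllm-bench`/`trtllm-serve` may differ, and nested sections such as `speculative_config` may need dedicated handling.

```python
# Illustrative only: flattened extra-llm-api-config.yml keys map onto
# top-level LLM API arguments (the real tooling may wire this differently).
import yaml

from tensorrt_llm import LLM

with open("extra-llm-api-config.yml") as f:
    extra_options = yaml.safe_load(f)  # e.g. {"use_cuda_graph": True, "moe_backend": "TRTLLM"}

llm = LLM(model="deepseek-ai/DeepSeek-R1", backend="pytorch", **extra_options)
```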
@@ -202,21 +201,20 @@ python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
YOUR_DATA_PATH=./dataset.txt

cat >./extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-  print_iter_log: true
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
enable_attention_dp: true
EOF

@@ -257,8 +255,7 @@ To do the benchmark, run the following command:
YOUR_DATA_PATH=<your dataset file following the format>

cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
+use_cuda_graph: true
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
@@ -307,10 +304,9 @@ python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \
YOUR_DATA_PATH=./dataset.txt

cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_batch_sizes:
-  - 128
+use_cuda_graph: true
+cuda_graph_batch_sizes:
+- 128
enable_attention_dp: true
EOF

@@ -121,9 +121,8 @@ To benchmark min-latency performance with MTP, you need to follow [this document
YOUR_DATA_PATH=<your dataset file following the format>

cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  moe_backend: TRTLLM
+use_cuda_graph: true
+moe_backend: TRTLLM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
@@ -177,9 +176,8 @@ To benchmark min-latency performance with MTP Relaxed Acceptance, you need to fo
YOUR_DATA_PATH=<your dataset file following the format>

cat >./extra-llm-api-config.yml<<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  moe_backend: TRTLLM
+use_cuda_graph: true
+moe_backend: TRTLLM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
docs/source/performance/perf-benchmarking.md (3 changes: 1 addition & 2 deletions)
@@ -628,8 +628,7 @@ If you would like to force the KV cache quantization, you can specify the followi
when the checkpoint precision is `null`:

```yaml
-pytorch_backend_config:
-  kv_cache_dtype: "fp8"
+kv_cache_dtype: "fp8"
```

```{tip}
docs/source/performance/perf-overview.md (8 changes: 3 additions & 5 deletions)
@@ -200,11 +200,9 @@ trtllm-bench --model $model_name throughput --dataset $dataset_file --backend py

`llm_options.yml`
```yaml
-
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
- 1
- 2
- 4
docs/source/torch/attention.md (2 changes: 1 addition & 1 deletion)
@@ -16,7 +16,7 @@ The following sections explain how to use these implementations and provide a br


There are currently three available attention backends: the vanilla backend, the TRT-LLM backend, and the Flashinfer backend.
-You can specify the desired attention backend using `PyTorchConfig.attn_backend`. For instance, to utilize the Flashinfer backend, you can create a `PyTorchConfig` with `attn_backend = "flashinfer"` and then pass it to the `LLM` constructor as follows: `LLM(pytorch_backend_config=pytorch_config)`. This will enable the use of the Flashinfer backend for your model.
+You can specify the desired attention backend using `PyTorchConfig.attn_backend`. For instance, to utilize the Flashinfer backend, you can pass `attn_backend="flashinfer"` to the `LLM` constructor as follows: `LLM(attn_backend="flashinfer")`. This will enable the use of the Flashinfer backend for your model.

The vanilla backend, `VanillaAttention`, is a reference implementation designed primarily for inflight batching and linear KV cache support. While it serves as a useful baseline, it is not recommended for production use due to its limited optimizations.

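A short, self-contained example matching the updated sentence above; the model name is a placeholder borrowed from elsewhere in this PR.

```python
# Select the Flashinfer attention backend by passing attn_backend directly
# to the LLM constructor, per the updated documentation above.
from tensorrt_llm import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    backend="pytorch",
    attn_backend="flashinfer",
)
```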
examples/auto_deploy/README.md (2 changes: 1 addition & 1 deletion)
@@ -265,7 +265,7 @@ llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
    backend="autodeploy",
    build_config=build_config,
-    pytorch_backend_config=ad_config,
+    auto_deploy_config=ad_config,
    tensor_parallel_size=<NUM_WORLD_RANK>,
)

examples/auto_deploy/build_and_run_ad.py (2 changes: 1 addition & 1 deletion)
@@ -73,7 +73,7 @@ def build_llm_from_config(config: SimpleConfig) -> LLM:
        model=factory.model,
        backend="autodeploy",
        build_config=build_config,
-        pytorch_backend_config=ad_config,
+        auto_deploy_config=ad_config,
        tensor_parallel_size=config.world_size,
        tokenizer=factory.init_tokenizer() if config.customize_tokenizer else None,
    )
examples/disaggregated/README.md (7 changes: 3 additions & 4 deletions)
@@ -9,7 +9,7 @@ You can use multiple `trtllm-serve` commands to launch the context and generatio
for disaggregated serving. For example, you could launch two context servers and one generation servers as follows:

```
echo -e "pytorch_backend_config:\n disable_overlap_scheduler: True\ncache_transceiver_config:\n max_num_tokens: 2048" > context_extra-llm-api-config.yml
echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\nmax_num_tokens: 2048" > context_extra-llm-api-config.yml
echo -e "cache_transceiver_config:\n max_num_tokens: 2048" > gen_extra-llm-api-config.yml

export TRTLLM_USE_UCX_KVCACHE=1
@@ -63,9 +63,8 @@ hostname: localhost
port: 8000
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
backend: "pytorch"
-pytorch_backend_config:
-  use_cuda_graph: False
-  disable_overlap_scheduler: True
+use_cuda_graph: False
+disable_overlap_scheduler: True
context_servers:
  num_instances: 1
  tensor_parallel_size: 1
examples/disaggregated/disagg_config.yaml (5 changes: 2 additions & 3 deletions)
@@ -3,9 +3,8 @@ port: 8000
model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
free_gpu_memory_fraction: 0.25
backend: "pytorch"
-pytorch_backend_config:
-  use_cuda_graph: False
-  disable_overlap_scheduler: True
+use_cuda_graph: False
+disable_overlap_scheduler: True
context_servers:
  num_instances: 1
  tensor_parallel_size: 1
examples/llm-api/llm_inference_kv_events.py (6 changes: 2 additions & 4 deletions)
@@ -1,17 +1,15 @@
### Get KV Cache Events

from tensorrt_llm import LLM, SamplingParams
-from tensorrt_llm._torch.pyexecutor.config import PyTorchConfig
from tensorrt_llm.llmapi import KvCacheConfig


def main():
-    pytorch_config = PyTorchConfig(autotuner_enabled=False,
-                                   kv_cache_dtype='auto')

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
              tensor_parallel_size=2,
-              pytorch_backend_config=pytorch_config,
+              autotuner_enabled=False,
+              kv_cache_dtype='auto',
              kv_cache_config=KvCacheConfig(enable_block_reuse=True,
                                            event_buffer_max_size=1024),
              backend="pytorch")
examples/llm-api/llm_mgmn_trtllm_bench.sh (7 changes: 3 additions & 4 deletions)
@@ -74,10 +74,9 @@ srun -l \

# This is optional
cat > /tmp/pytorch_extra_args.txt << EOF
-pytorch_backend_config:
-  use_cuda_graph: false
-  cuda_graph_padding_enabled: false
-  print_iter_log: true
+use_cuda_graph: false
+cuda_graph_padding_enabled: false
+print_iter_log: true
enable_attention_dp: false
EOF

examples/llm-eval/lm-eval-harness/lm_eval_tensorrt_llm.py (3 changes: 1 addition & 2 deletions)
@@ -100,7 +100,6 @@ def __init__(
        if hasattr(PyTorchConfig, "moe_backend"):
            pytorch_config_params["moe_backend"] = self.moe_backend
            print(f"Info: moe_backend is set to {self.moe_backend}")
-        pytorch_config = PyTorchConfig(**pytorch_config_params)

        # stop words not currently supported by torch backend
        self.use_stop_words = False
@@ -110,7 +109,7 @@ def __init__(
            tensor_parallel_size=tp,
            trust_remote_code=trust_remote_code,
            enable_chunked_prefill=False,
-            pytorch_backend_config=pytorch_config,
+            **pytorch_config_params,
            tokenizer=self.tokenizer,
            kv_cache_config=trt_kv_cache_config,
            moe_expert_parallel_size=self.moe_expert_parallel_size,
examples/models/core/deepseek_v3/README.md (82 changes: 38 additions & 44 deletions)
@@ -140,10 +140,9 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
--num-requests 24 > /tmp/benchmarking_64k.txt

cat <<EOF > /tmp/extra-llm-api-config.yml
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes: [1, 4, 8, 12]
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes: [1, 4, 8, 12]
EOF

trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
@@ -168,11 +167,10 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
--num-requests 4 > /tmp/benchmarking_128k.txt

cat <<EOF > /tmp/extra-llm-api-config.yml
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes: [1, 2]
-  moe_max_num_tokens: 16384
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes: [1, 2]
+moe_max_num_tokens: 16384
EOF

trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
@@ -193,8 +191,7 @@ Evaluate the model accuracy using `trtllm-eval`.
1. (Optional) Prepare an advanced configuration file:
```bash
cat >./extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
+use_cuda_graph: true
enable_attention_dp: true
EOF
```
@@ -236,21 +233,20 @@ To serve the model using `trtllm-serve`:

```bash
cat >./extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-  print_iter_log: true
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
enable_attention_dp: true
EOF

@@ -427,21 +423,20 @@ python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
--input-mean=1024 --output-mean=2048 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt

cat >/path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-  print_iter_log: true
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
enable_attention_dp: true
EOF
```
@@ -605,9 +600,8 @@ To enable FP8 MLA, modify the `kv_cache_quant_algo` property. The following show
Alternatively, configure FP8 MLA through the `kv_cache_dtype` of the PyTorch backend config. An example is to use `--kv_cache_dtype` of `quickstart_advanced.py`. Also, you can edit `extra-llm-api-config.yml` consumed by `--extra_llm_api_options` of `trtllm-serve`, `trtllm-bench` and so on:
```yaml
# ...
-pytorch_backend_config:
-  kv_cache_dtype: fp8
-# ...
+kv_cache_dtype: fp8
+# ...
```

### W4AFP8
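As a companion to the FP8 MLA hunk above, a hedged sketch of the same setting passed programmatically, following the pattern in the `llm_inference_kv_events.py` change where `kv_cache_dtype` is handed straight to the `LLM` constructor; the model name is a placeholder.

```python
# Sketch: kv_cache_dtype as a direct LLM argument, mirroring the flattened
# YAML key above (model name is a placeholder).
from tensorrt_llm import LLM

llm = LLM(model="deepseek-ai/DeepSeek-R1", backend="pytorch", kv_cache_dtype="fp8")
```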
examples/models/core/qwen/README.md (29 changes: 14 additions & 15 deletions)
@@ -653,21 +653,20 @@ To serve the model using `trtllm-serve`:

```bash
cat >./extra-llm-api-config.yml <<EOF
-pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
-  print_iter_log: true
+use_cuda_graph: true
+cuda_graph_padding_enabled: true
+cuda_graph_batch_sizes:
+- 1
+- 2
+- 4
+- 8
+- 16
+- 32
+- 64
+- 128
+- 256
+- 384
+print_iter_log: true
enable_attention_dp: true
EOF
