46 commits
bd36c9f
[#5249][AutoDeploy] Refactor the signatures of AD graph transformatio…
galagam Jul 8, 2025
8f4326b
Fix AD trtllm-bench integration test and enable trtllm-bench integrat…
nzmora-nvidia Jul 9, 2025
084703e
Revert "Fix AD trtllm-bench integration test and enable trtllm-bench …
lucaslie Jul 9, 2025
a38f2de
[AutoDeploy] yaml config loader for dynamic settings (closes #4366) (…
lucaslie Jul 9, 2025
0605b83
[AutoDeploy] Refining AD configurability (#75)
lucaslie Jul 9, 2025
1ce910b
[TRTLLM-5446 Part1] Add torch ref backend (#74)
nvchenghaoz Jul 10, 2025
aeac894
Fix trtllm-bench test
nzmora-nvidia Jul 11, 2025
e6bd1f4
Fix the unit test failure (#83)
nvchenghaoz Jul 11, 2025
79706cb
feat:[AutoDeploy] Support Quantized MoE matcher - Step1 (#68)
Fridah-nv Jul 11, 2025
deefc58
Fix loading of aliased weights (#85)
suyoggupta Jul 14, 2025
d16cd22
fix overlap scheduler in AD (#79)
suyoggupta Jul 15, 2025
e6b2d00
[AutoDeploy] Improved config deepmerge handling (#82)
lucaslie Jul 15, 2025
89cce46
[AutoDeploy][1/n] Modular InferenceOptimizer; fixes #4328, #4367, #43…
lucaslie Jul 15, 2025
644c6d3
[feat] TP Sharding logic split to pattern detection and executor (fix…
greg-kwasniewski1 Jul 16, 2025
3622dff
[None] Add sink/sliding window support for Triton (#77)
nvchenghaoz Jul 16, 2025
cfe2b9c
Revert "[None] Add sink/sliding window support for Triton (#77)" (#92)
lucaslie Jul 17, 2025
963e13d
[AutoDeploy][2/n] Modular InferenceOptimizer (#90)
lucaslie Jul 17, 2025
5b69d2c
[None] Add the torch source implementation for new params. (#89)
nvchenghaoz Jul 17, 2025
c2d2065
[AutoDeploy] Modular export patches + registry; fixes #5728 (#91)
lucaslie Jul 17, 2025
1ec1448
Fix cudagraphs, add rms norm pattern matcher (#87)
suyoggupta Jul 17, 2025
4d89913
move assertion check cleanup back to stock export (#93)
lucaslie Jul 17, 2025
d84ce33
[feat] Sharding logic split to pattern detection and executor for EP …
greg-kwasniewski1 Jul 18, 2025
a8b54f9
Attention Pattern Matcher (closes #4404) (#88)
Fridah-nv Jul 18, 2025
29bb062
Add sinks / sliding window for Triton backend (#95)
nvchenghaoz Jul 18, 2025
810df7d
Revert "[None] Add the torch source implementation for new params. (#…
lucaslie Jul 18, 2025
a9e227e
Revert "Attention Pattern Matcher (closes #4404) (#88)"
lucaslie Jul 21, 2025
0c224e4
Change the all-reduce strategy to NCCL (#99)
nzmora-nvidia Jul 23, 2025
b8c5b9c
respect max_seq_len setting for pos embeddings (#103)
lucaslie Jul 23, 2025
9662d81
feat: update graph test helper for testing new inf optimizer (#106)
h-guo18 Jul 23, 2025
a83a455
[Fix] Fix the torch backend (#108)
nvchenghaoz Jul 24, 2025
b9b06c7
[Reopen #88]Attention Pattern Matcher with torch._inductor pattern ma…
Fridah-nv Jul 25, 2025
4e10f76
improve error handling and graph clean-up (#105)
lucaslie Jul 25, 2025
7810dd0
[Update #89] Add torch ref attention (#107)
nvchenghaoz Jul 26, 2025
b47abb4
Move quantization &quant_moe to new inf optimizer (#112)
h-guo18 Jul 26, 2025
4f8f767
add eager attention pattern that does not cast attn weight to fp 32 (…
Fridah-nv Jul 29, 2025
913695f
add a new rope pattern for llama4 scout (#97)
Fridah-nv Jul 29, 2025
f6834b1
Move attention transforms to new inference optimizer (#115)
h-guo18 Jul 31, 2025
5d9b1b9
TRTLLM-6142: set torch recompile_limit based on cuda_graph_batch_size…
MrGeva Jul 31, 2025
c4c6138
avoid gpu->cpu transfer when using overlap scheduler
suyoggupta Jul 27, 2025
2a49116
prealloc
suyoggupta Jul 28, 2025
9883468
optimize prepare input
suyoggupta Jul 29, 2025
8f07e6a
draft - create input_ids on GPU
Jul 29, 2025
b5b8f6d
better indexing
Jul 29, 2025
35e20f2
skip placeholder step
Jul 29, 2025
5946ade
eliminate all syncevents - still have idle time
Jul 31, 2025
27a3d7f
remove dummy gen request handling
Jul 31, 2025
Empty file added benchmarks/cpp/__init__.py
6 changes: 4 additions & 2 deletions examples/auto_deploy/.vscode/launch.json
@@ -16,8 +16,10 @@
"--args.model-factory=AutoModelForCausalLM",
"--benchmark.enabled=false",
"--prompt.batch-size=2",
"--args.model-kwargs",
"num_hidden_layers=3,num_attention_heads=32",
"--args.model-kwargs.num-hidden-layers=3",
"--args.model-kwargs.num-attention-heads=32",
"--prompt.sp-kwargs.max-tokens=128",
// "--dry-run", // uncomment to print the final config and return
],
"console": "integratedTerminal",
"justMyCode": false,
218 changes: 197 additions & 21 deletions examples/auto_deploy/README.md
@@ -6,7 +6,7 @@

<div align="left">

AutoDeploy is designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.
AutoDeploy is an experimental feature, currently in beta, designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV caching, and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.

______________________________________________________________________

@@ -146,7 +146,7 @@ Below is a non-exhaustive list of common config options:
| `--args.skip-loading-weights` | Only load the architecture, not the weights |
| `--args.model-kwargs` | Extra kwargs passed to the model initializer in the model factory |
| `--args.tokenizer-kwargs` | Extra kwargs passed to the tokenizer initializer in the model factory |
| `--args.world-size` | The number of GPUs for Tensor Parallel |
| `--args.world-size` | The number of GPUs used for auto-sharding the model |
| `--args.runtime` | Specifies which type of Engine to use during runtime (`"demollm"` or `"trtllm"`) |
| `--args.compile-backend` | Specifies how to compile the graph at the end |
| `--args.attn-backend` | Specifies kernel implementation for attention |
@@ -157,7 +157,7 @@ Below is a non-exhaustive list of common config options:
| `--prompt.batch-size` | Number of queries to generate |
| `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |

For default values and additional configuration options, refer to the `ExperimentConfig` class in [build_and_run_ad.py](./build_and_run_ad.py) file.
For default values and additional configuration options, refer to the [`ExperimentConfig`](./build_and_run_ad.py) class in the [build_and_run_ad.py](./build_and_run_ad.py) file.

Here is a more complete example of using the script:

@@ -172,7 +172,7 @@ python build_and_run_ad.py \
--benchmark.enabled True
```

#### Logging Level
### Logging Level

Use the following env variable to specify the logging level of our built-in logger, ordered by
decreasing verbosity:
@@ -223,17 +223,14 @@ AutoDeploy can be seamlessly integrated into your existing workflows using TRT-LLM's LLM API.

Here is an example of how you can build an LLM object with AutoDeploy integration:

<details>
<summary>Click to expand the example</summary>

```python
from tensorrt_llm._torch.auto_deploy import LLM


# Construct the LLM high-level interface object with autodeploy as backend
llm = LLM(
model=<HF_MODEL_CARD_OR_DIR>,
world_size=<NUM_WORLD_RANK>,
world_size=<DESIRED_WORLD_SIZE>,
compile_backend="torch-compile",
model_kwargs={"num_hidden_layers": 2}, # test with smaller model configuration
attn_backend="flashinfer", # choose between "triton" and "flashinfer"
@@ -249,28 +246,207 @@ llm = LLM(

```
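
Once constructed, the `llm` object can be used like the standard TRT-LLM `LLM` API object. The following is a minimal sketch, assuming the usual `generate()` interface and `SamplingParams` (not shown in the example above):

```python
from tensorrt_llm import SamplingParams

# Minimal sketch: run generation with the AutoDeploy-backed LLM object.
# Assumes `llm` was constructed as in the example above.
sampling_params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["What is the capital of France?"], sampling_params)
print(outputs[0].outputs[0].text)
```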

Please consult the [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py) and the
[`AutoDeployConfig` class](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
for more detail on how AutoDeploy is configured via the `**kwargs` of the `LLM` API.

### Expert Configuration of LLM API

For expert TensorRT-LLM users, we also expose the full set of [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
*at your own risk* (the argument list diverges from TRT-LLM's argument list):

<details>
<summary>Click to expand for more details on using LlmArgs directly</summary>

- All config fields that are used by the AutoDeploy core pipeline (i.e. the `InferenceOptimizer`) are
_exclusively_ exposed in the [`AutoDeployConfig` class](../../tensorrt_llm/_torch/auto_deploy/llm_args.py).
Please make sure to refer to those first.
- For expert users we expose the full set of [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
that can be used to configure the [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py) including runtime options.
- Note that some fields in the full [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
object are overlapping, duplicated, and/or _ignored_ in AutoDeploy, particularly arguments
pertaining to configuring the model itself since AutoDeploy's model ingestion+optimize pipeline
significantly differs from the default manual workflow in TensorRT-LLM.
- However, with the proper care the full [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
objects can be used to configure advanced runtime options in TensorRT-LLM.
- Note that any valid field can simply be provided as a keyword argument ("`**kwargs`") to the
  [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py), as shown in the sketch below.
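
As an illustration, the sketch below passes a couple of runtime fields alongside AutoDeploy options as plain keyword arguments. The runtime field names (`max_batch_size`, `max_seq_len`) are taken from the YAML examples further below and should be verified against [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py):

```python
from tensorrt_llm._torch.auto_deploy import LLM

# Sketch only: AutoDeployConfig options and LlmArgs runtime options passed
# together as **kwargs. Verify field availability in llm_args.py.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    world_size=2,
    compile_backend="torch-compile",
    attn_backend="flashinfer",
    max_batch_size=32,  # assumed runtime option from LlmArgs
    max_seq_len=4096,   # assumed runtime option from LlmArgs
)
```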

</details>

For more examples on TRT-LLM LLM API, visit [`this page`](https://nvidia.github.io/TensorRT-LLM/examples/llm_api_examples.html).
### Expert Configuration of `build_and_run_ad.py`

______________________________________________________________________
For expert users, `build_and_run_ad.py` provides advanced configuration capabilities through a flexible argument parser powered by Pydantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and leverage sophisticated configuration precedence rules to create complex deployment configurations.

## Roadmap
<details>
<summary>Click to expand for detailed configuration examples</summary>

1. **Model Coverage:**
#### CLI Arguments with Dot Notation

- Expand support for additional LLM variants and features:
- LoRA
- Speculative Decoding
- Model specialization for disaggregated serving
The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the [`ExperimentConfig`](./build_and_run_ad.py) and nested [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)/[`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) objects:

1. **Performance Optimization:**
```bash
# Configure model parameters
# NOTE: config values like num_hidden_layers are automatically resolved into the appropriate nested
# dict value ``{"args": {"model_kwargs": {"num_hidden_layers": 10}}}`` although not explicitly
# specified as CLI arg
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--args.model-kwargs.num-hidden-layers=10 \
--args.model-kwargs.hidden-size=2048 \
--args.tokenizer-kwargs.padding-side=left

- Enhance inference speed and efficiency with:
- MoE fusion and all-reduce fusion techniques
- Reuse of TRT-LLM PyTorch operators for greater efficiency
# Configure runtime and backend settings
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.world-size=2 \
--args.compile-backend=torch-opt \
--args.attn-backend=flashinfer

______________________________________________________________________
# Configure prompting and benchmarking
python build_and_run_ad.py \
--model "microsoft/phi-4" \
--prompt.batch-size=4 \
--prompt.sp-kwargs.max-tokens=200 \
--prompt.sp-kwargs.temperature=0.7 \
--benchmark.enabled=true \
--benchmark.bs=8 \
--benchmark.isl=1024
```

#### YAML Configuration Files

Both [`ExperimentConfig`](./build_and_run_ad.py) and [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)/[`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) inherit from [`DynamicYamlMixInForSettings`](../../tensorrt_llm/_torch/auto_deploy/utils/_config.py), enabling you to provide multiple YAML configuration files that are automatically deep-merged at runtime.

Create a YAML configuration file (e.g., `my_config.yaml`):

```yaml
# my_config.yaml
args:
model_kwargs:
num_hidden_layers: 12
hidden_size: 1024
world_size: 4
compile_backend: torch-compile
attn_backend: triton
max_seq_len: 2048
max_batch_size: 16
transforms:
sharding:
strategy: auto
quantization:
enabled: false

prompt:
batch_size: 8
sp_kwargs:
max_tokens: 150
temperature: 0.8
top_k: 50

benchmark:
enabled: true
num: 20
bs: 4
isl: 1024
osl: 256
```

Create an additional override file (e.g., `production.yaml`):

```yaml
# production.yaml
args:
world_size: 8
compile_backend: torch-opt
max_batch_size: 32

benchmark:
enabled: false
```

Then use these configurations:

```bash
# Using single YAML config
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml

# Using multiple YAML configs (deep merged in order, later files have higher priority)
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml production.yaml

# Targeting nested AutoDeployConfig with separate YAML
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs my_config.yaml \
--args.yaml-configs autodeploy_overrides.yaml
```

#### Configuration Precedence and Deep Merging

The configuration system follows a strict precedence order where higher priority sources override lower priority ones:

1. **CLI Arguments** (highest priority) - Direct command line arguments
1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
1. **Default Settings** (lowest priority) - Built-in defaults from the config classes

**Deep Merging**: Unlike simple overwriting, deep merging intelligently combines nested dictionaries recursively. For example:

```yaml
# Base config
args:
model_kwargs:
num_hidden_layers: 10
hidden_size: 1024
max_seq_len: 2048
```

```yaml
# Override config
args:
model_kwargs:
hidden_size: 2048 # This will override
# num_hidden_layers: 10 remains unchanged
world_size: 4 # This gets added
```
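
For illustration, deep-merging the override into the base config yields the following effective settings:

```yaml
# Effective merged result (illustrative)
args:
  model_kwargs:
    num_hidden_layers: 10 # preserved from the base config
    hidden_size: 2048 # overridden by the override config
  max_seq_len: 2048 # preserved from the base config
  world_size: 4 # added by the override config
```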

**Nested Config Behavior**: When using nested configurations, outer YAML configs become init settings for inner objects, giving them higher precedence:

```bash
# The outer yaml-configs affects the entire ExperimentConfig
# The inner args.yaml-configs affects only the AutoDeployConfig
python build_and_run_ad.py \
--model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
--yaml-configs experiment_config.yaml \
--args.yaml-configs autodeploy_config.yaml \
--args.world-size=8 # CLI override beats both YAML configs
```

#### Built-in Default Configuration

Both [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) and [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) classes automatically load a built-in [`default.yaml`](../../tensorrt_llm/_torch/auto_deploy/config/default.yaml) configuration file that provides sensible defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the [`_get_config_dict()`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) function and defines default transform configurations for graph optimization stages.

The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:

```bash
# View the default configuration
cat tensorrt_llm/_torch/auto_deploy/config/default.yaml

# Override specific transform settings
python build_and_run_ad.py \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--args.transforms.export-to-gm.strict=true
```

</details>

## Roadmap

Check out our [GitHub Project Board](https://github.com/orgs/NVIDIA/projects/83) to learn more about
the current progress in AutoDeploy and where you can help.

## Disclaimer
