
Commit 41fb8aa

Authored by: lucaslie, galagam, nzmora-nvidia, nvchenghaoz, Fridah-nv
[AutoDeploy] merge feat/ad-2025-07-07 (#6196)
Signed-off-by: Gal Hubara Agam <[email protected]>
Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Suyog Gupta <[email protected]>
Co-authored-by: Gal Hubara-Agam <[email protected]>
Co-authored-by: Neta Zmora <[email protected]>
Co-authored-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
Co-authored-by: Suyog Gupta <[email protected]>
Co-authored-by: Grzegorz Kwasniewski <[email protected]>
1 parent 5234502 commit 41fb8aa

File tree

107 files changed: +7024, -1376 lines changed


benchmarks/cpp/__init__.py

Whitespace-only changes.

benchmarks/cpp/utils/__init__.py

Whitespace-only changes.

examples/auto_deploy/.vscode/launch.json

Lines changed: 4 additions & 2 deletions
@@ -16,8 +16,10 @@
         "--args.model-factory=AutoModelForCausalLM",
         "--benchmark.enabled=false",
         "--prompt.batch-size=2",
-        "--args.model-kwargs",
-        "num_hidden_layers=3,num_attention_heads=32",
+        "--args.model-kwargs.num-hidden-layers=3",
+        "--args.model-kwargs.num-attention-heads=32",
+        "--prompt.sp-kwargs.max-tokens=128",
+        // "--dry-run", // uncomment to print the final config and return
       ],
       "console": "integratedTerminal",
       "justMyCode": false,

examples/auto_deploy/README.md

Lines changed: 197 additions & 21 deletions
@@ -6,7 +6,7 @@

 <div align="left">

-AutoDeploy is designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.
+AutoDeploy is an experimental feature in beta stage designed to simplify and accelerate the deployment of PyTorch models, including off-the-shelf models like those from Hugging Face, to TensorRT-LLM. It automates graph transformations to integrate inference optimizations such as tensor parallelism, KV-caching and quantization. AutoDeploy supports optimized in-framework deployment, minimizing the amount of manual modification needed.

 ______________________________________________________________________

@@ -146,7 +146,7 @@ Below is a non-exhaustive list of common config options:
 | `--args.skip-loading-weights` | Only load the architecture, not the weights |
 | `--args.model-kwargs` | Extra kwargs that are being passed to the model initializer in the model factory |
 | `--args.tokenizer-kwargs` | Extra kwargs that are being passed to the tokenizer initializer in the model factory |
-| `--args.world-size` | The number of GPUs for Tensor Parallel |
+| `--args.world-size` | The number of GPUs used for auto-sharding the model |
 | `--args.runtime` | Specifies which type of Engine to use during runtime (`"demollm"` or `"trtllm"`) |
 | `--args.compile-backend` | Specifies how to compile the graph at the end |
 | `--args.attn-backend` | Specifies kernel implementation for attention |
@@ -157,7 +157,7 @@ Below is a non-exhaustive list of common config options:
 | `--prompt.batch-size` | Number of queries to generate |
 | `--benchmark.enabled` | Whether to run the built-in benchmark (true/false) |

-For default values and additional configuration options, refer to the `ExperimentConfig` class in [build_and_run_ad.py](./build_and_run_ad.py) file.
+For default values and additional configuration options, refer to the [`ExperimentConfig`](./build_and_run_ad.py) class in [build_and_run_ad.py](./build_and_run_ad.py) file.

 Here is a more complete example of using the script:

@@ -172,7 +172,7 @@ python build_and_run_ad.py \
   --benchmark.enabled True
 ```

-#### Logging Level
+### Logging Level

 Use the following env variable to specify the logging level of our built-in logger ordered by
 decreasing verbosity;
@@ -223,17 +223,14 @@ AutoDeploy can be seamlessly integrated into your existing workflows using TRT-L

 Here is an example of how you can build an LLM object with AutoDeploy integration:

-<details>
-<summary>Click to expand the example</summary>
-
 ```
 from tensorrt_llm._torch.auto_deploy import LLM


 # Construct the LLM high-level interface object with autodeploy as backend
 llm = LLM(
     model=<HF_MODEL_CARD_OR_DIR>,
-    world_size=<NUM_WORLD_RANK>,
+    world_size=<DESIRED_WORLD_SIZE>,
     compile_backend="torch-compile",
     model_kwargs={"num_hidden_layers": 2}, # test with smaller model configuration
     attn_backend="flashinfer", # choose between "triton" and "flashinfer"
@@ -249,28 +246,207 @@ llm = LLM(

 ```

+Please consult the [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py) and the
+[`AutoDeployConfig` class](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
+for more detail on how AutoDeploy is configured via the `**kwargs` of the `LLM` API.
+
+### Expert Configuration of LLM API
+
+For expert TensorRT-LLM users, we also expose the full set of [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
+*at your own risk* (the argument list diverges from TRT-LLM's argument list):
+
+<details>
+<summary>Click to expand for more details on using LlmArgs directly</summary>
+
+- All config fields that are used by the AutoDeploy core pipeline (i.e. the `InferenceOptimizer`) are
+  _exclusively_ exposed in the [`AutoDeployConfig` class](../../tensorrt_llm/_torch/auto_deploy/llm_args.py).
+  Please make sure to refer to those first.
+- For expert users, we expose the full set of [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
+  that can be used to configure the [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py), including runtime options.
+- Note that some fields in the full [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
+  object are overlapping, duplicated, and/or _ignored_ in AutoDeploy, particularly arguments
+  pertaining to configuring the model itself, since AutoDeploy's model ingestion+optimize pipeline
+  significantly differs from the default manual workflow in TensorRT-LLM.
+- However, with proper care, the full [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)
+  object can be used to configure advanced runtime options in TensorRT-LLM.
+- Note that any valid field can simply be provided as a keyword argument ("`**kwargs`") to the
+  [AutoDeploy `LLM` API](../../tensorrt_llm/_torch/auto_deploy/llm.py).
+
 </details>
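As a minimal sketch of the `**kwargs` usage described in the bullets above (the field names follow the examples in this README and should be verified against [`llm_args.py`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py); the model card is only a placeholder):

```python
# Minimal sketch: AutoDeployConfig fields passed straight through the LLM(**kwargs) interface.
# Field names follow the examples in this README; verify them against llm_args.py before use.
from tensorrt_llm._torch.auto_deploy import LLM

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder HF model card or local directory
    world_size=2,                     # AutoDeployConfig: number of GPUs used for auto-sharding
    compile_backend="torch-compile",  # AutoDeployConfig: graph compilation backend
    attn_backend="flashinfer",        # AutoDeployConfig: attention kernel backend
    max_seq_len=2048,                 # runtime option forwarded through the full LlmArgs
)
```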

-For more examples on TRT-LLM LLM API, visit [`this page`](https://nvidia.github.io/TensorRT-LLM/examples/llm_api_examples.html).
+### Expert Configuration of `build_and_run_ad.py`

-______________________________________________________________________
+For expert users, `build_and_run_ad.py` provides advanced configuration capabilities through a flexible argument parser powered by Pydantic Settings and OmegaConf. You can use dot notation for CLI arguments, provide multiple YAML configuration files, and leverage sophisticated configuration precedence rules to create complex deployment configurations.

-## Roadmap
+<details>
+<summary>Click to expand for detailed configuration examples</summary>

-1. **Model Coverage:**
+#### CLI Arguments with Dot Notation

-   - Expand support for additional LLM variants and features:
-     - LoRA
-     - Speculative Decoding
-     - Model specialization for disaggregated serving
+The script supports flexible CLI argument parsing using dot notation to modify nested configurations dynamically. You can target any field in both the [`ExperimentConfig`](./build_and_run_ad.py) and nested [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)/[`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) objects:

-1. **Performance Optimization:**
+```bash
+# Configure model parameters
+# NOTE: config values like num_hidden_layers are automatically resolved into the appropriate nested
+# dict value ``{"args": {"model_kwargs": {"num_hidden_layers": 10}}}`` although not explicitly
+# specified as CLI arg
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --args.model-kwargs.num-hidden-layers=10 \
+  --args.model-kwargs.hidden-size=2048 \
+  --args.tokenizer-kwargs.padding-side=left

-   - Enhance inference speed and efficiency with:
-     - MoE fusion and all-reduce fusion techniques
-     - Reuse of TRT-LLM PyTorch operators for greater efficiency
+# Configure runtime and backend settings
+python build_and_run_ad.py \
+  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+  --args.world-size=2 \
+  --args.compile-backend=torch-opt \
+  --args.attn-backend=flashinfer

-______________________________________________________________________
+# Configure prompting and benchmarking
+python build_and_run_ad.py \
+  --model "microsoft/phi-4" \
+  --prompt.batch-size=4 \
+  --prompt.sp-kwargs.max-tokens=200 \
+  --prompt.sp-kwargs.temperature=0.7 \
+  --benchmark.enabled=true \
+  --benchmark.bs=8 \
+  --benchmark.isl=1024
+```
+
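The NOTE in the snippet above (dot-notation flags resolving into a nested config dictionary) can be illustrated in isolation with OmegaConf, which the text above names as one of the parser's building blocks. This is only a sketch of the nesting behavior, not the actual parsing code in `build_and_run_ad.py`; dashes are assumed to be normalized to underscores:

```python
# Sketch of dot-notation -> nested dict resolution using OmegaConf (named above as a parser
# building block). Not the script's actual parser; dashes are shown already normalized to
# underscores, matching the resolved {"args": {"model_kwargs": {...}}} structure in the NOTE.
from omegaconf import OmegaConf

cfg = OmegaConf.from_dotlist([
    "args.model_kwargs.num_hidden_layers=10",
    "args.model_kwargs.hidden_size=2048",
    "args.tokenizer_kwargs.padding_side=left",
])
print(OmegaConf.to_yaml(cfg))
# args:
#   model_kwargs:
#     num_hidden_layers: 10
#     hidden_size: 2048
#   tokenizer_kwargs:
#     padding_side: left
```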
+#### YAML Configuration Files
+
+Both [`ExperimentConfig`](./build_and_run_ad.py) and [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py)/[`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) inherit from [`DynamicYamlMixInForSettings`](../../tensorrt_llm/_torch/auto_deploy/utils/_config.py), enabling you to provide multiple YAML configuration files that are automatically deep-merged at runtime.
+
+Create a YAML configuration file (e.g., `my_config.yaml`):
+
+```yaml
+# my_config.yaml
+args:
+  model_kwargs:
+    num_hidden_layers: 12
+    hidden_size: 1024
+  world_size: 4
+  compile_backend: torch-compile
+  attn_backend: triton
+  max_seq_len: 2048
+  max_batch_size: 16
+  transforms:
+    sharding:
+      strategy: auto
+    quantization:
+      enabled: false
+
+prompt:
+  batch_size: 8
+  sp_kwargs:
+    max_tokens: 150
+    temperature: 0.8
+    top_k: 50
+
+benchmark:
+  enabled: true
+  num: 20
+  bs: 4
+  isl: 1024
+  osl: 256
+```
+
+Create an additional override file (e.g., `production.yaml`):
+
+```yaml
+# production.yaml
+args:
+  world_size: 8
+  compile_backend: torch-opt
+  max_batch_size: 32
+
+benchmark:
+  enabled: false
+```
+
+Then use these configurations:
+
+```bash
+# Using single YAML config
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --yaml-configs my_config.yaml

+# Using multiple YAML configs (deep merged in order, later files have higher priority)
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --yaml-configs my_config.yaml production.yaml

+# Targeting nested AutoDeployConfig with separate YAML
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --yaml-configs my_config.yaml \
+  --args.yaml-configs autodeploy_overrides.yaml
+```
+
+#### Configuration Precedence and Deep Merging
+
+The configuration system follows a strict precedence order where higher priority sources override lower priority ones:
+
+1. **CLI Arguments** (highest priority) - Direct command line arguments
+1. **YAML Configs** - Files specified via `--yaml-configs` and `--args.yaml-configs`
+1. **Default Settings** (lowest priority) - Built-in defaults from the config classes
+
+**Deep Merging**: Unlike simple overwriting, deep merging intelligently combines nested dictionaries recursively. For example:
+
+```yaml
+# Base config
+args:
+  model_kwargs:
+    num_hidden_layers: 10
+    hidden_size: 1024
+  max_seq_len: 2048
+```
+
+```yaml
+# Override config
+args:
+  model_kwargs:
+    hidden_size: 2048 # This will override
+    # num_hidden_layers: 10 remains unchanged
+  world_size: 4 # This gets added
+```
+
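The result of deep-merging the two snippets above can be reproduced with a few lines of OmegaConf; this is only an illustrative sketch of the merge semantics described here, not the loading code used by the script:

```python
# Illustrative deep merge of the "Base config" and "Override config" snippets above.
# OmegaConf.merge recursively combines nested keys: later configs win per key, while
# untouched nested keys from earlier configs are preserved.
from omegaconf import OmegaConf

base = OmegaConf.create(
    {"args": {"model_kwargs": {"num_hidden_layers": 10, "hidden_size": 1024}, "max_seq_len": 2048}}
)
override = OmegaConf.create({"args": {"model_kwargs": {"hidden_size": 2048}, "world_size": 4}})

merged = OmegaConf.merge(base, override)
print(OmegaConf.to_yaml(merged))
# args:
#   model_kwargs:
#     num_hidden_layers: 10   (kept from base)
#     hidden_size: 2048       (overridden)
#   max_seq_len: 2048         (kept from base)
#   world_size: 4             (added by override)
```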
+**Nested Config Behavior**: When using nested configurations, outer YAML configs become init settings for inner objects, giving them higher precedence:
+
+```bash
+# The outer yaml-configs affects the entire ExperimentConfig
+# The inner args.yaml-configs affects only the AutoDeployConfig
+python build_and_run_ad.py \
+  --model "meta-llama/Meta-Llama-3.1-8B-Instruct" \
+  --yaml-configs experiment_config.yaml \
+  --args.yaml-configs autodeploy_config.yaml \
+  --args.world-size=8 # CLI override beats both YAML configs
+```
+
+#### Built-in Default Configuration
+
+Both [`AutoDeployConfig`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) and [`LlmArgs`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) classes automatically load a built-in [`default.yaml`](../../tensorrt_llm/_torch/auto_deploy/config/default.yaml) configuration file that provides sensible defaults for the AutoDeploy inference optimizer pipeline. This file is specified in the [`_get_config_dict()`](../../tensorrt_llm/_torch/auto_deploy/llm_args.py) function and defines default transform configurations for graph optimization stages.
+
+The built-in defaults are automatically merged with your configurations at the lowest priority level, ensuring that your custom settings always override the defaults. You can inspect the current default configuration to understand the baseline transform pipeline:
+
+```bash
+# View the default configuration
+cat tensorrt_llm/_torch/auto_deploy/config/default.yaml

+# Override specific transform settings
+python build_and_run_ad.py \
+  --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
+  --args.transforms.export-to-gm.strict=true
+```
+
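Beyond `cat`, the defaults can also be inspected programmatically. A small sketch, assuming OmegaConf is available and using the repository-relative path shown above (the presence of a top-level `transforms` section is an assumption based on this README):

```python
# Sketch: load and inspect the built-in defaults that are merged in at the lowest precedence.
# Assumes the repo-relative path from the README and that a top-level "transforms" section exists.
from omegaconf import OmegaConf

defaults = OmegaConf.load("tensorrt_llm/_torch/auto_deploy/config/default.yaml")
print(OmegaConf.to_yaml(defaults))                   # full default configuration
print(list(defaults.get("transforms", {}).keys()))   # transform stage names, if present
```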
+</details>
+
+## Roadmap
+
+Check out our [GitHub Project Board](https://github.com/orgs/NVIDIA/projects/83) to learn more about
+the current progress in AutoDeploy and where you can help.

 ## Disclaimer
