
Conversation

galagam (Collaborator) commented Jul 31, 2025

Summary by CodeRabbit

  • New Features

    • Introduced modular and extensible graph transformation and export frameworks with configurable pipelines and patch management.
    • Added backend-specific RMSNorm fusion and support for quantized Mixture-of-Experts (MoE) in FP8 and FP4 formats.
    • Implemented advanced attention mechanisms with sliding window and sink token features, including a pure PyTorch backend.
    • Enhanced configuration management with dynamic YAML merging and deep configuration overrides (see the merge sketch after this list).
    • Added comprehensive testing utilities and reference implementations for attention and MoE operators.
  • Improvements

    • Refactored attention, MoE, and sharding pattern matching for better modularity, extensibility, and in-place graph transformations.
    • Expanded expert-level documentation and usage guidance for advanced configuration and deployment scenarios.
    • Improved test coverage for quantized MoE, attention backends, sharding detection, and transformation correctness.
    • Updated export process to use official PyTorch export APIs with improved patch application and deduplication.
    • Enhanced device handling, memory management, and batch size ordering in CUDA graph capture for reliability.
  • Bug Fixes

    • Fixed parameter loading and alias handling during export to maintain correct state dict semantics.
    • Corrected memory size calculations and logging units in cache resizing utilities.
  • Documentation

    • Significantly expanded and clarified user and expert documentation, including configuration, advanced usage, and roadmap references.
  • Chores

    • Cleaned up deprecated modules and imports; replaced legacy export and transformation calls with new modular optimizer.
    • Updated test utilities and parameterizations for consistency with new transformation and export frameworks.
    • Added new test cases for parallel config validation, quantized MoE, and attention backends.
    • Removed obsolete tests and streamlined attention mask handling in test models.
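
The deep configuration merging mentioned above can be pictured with OmegaConf, which this PR adds as a dependency. The keys and transform names below are purely illustrative, not the actual default.yaml schema:

```python
# Minimal sketch of deep config merging with OmegaConf (added as a dependency
# in this PR). The keys below are hypothetical; in practice the dicts would
# come from OmegaConf.load("default.yaml") and user-provided override files.
from omegaconf import OmegaConf

default_cfg = OmegaConf.create(
    {
        "transforms": {
            "quantize_moe": {"enabled": False},
            "fuse_rmsnorm": {"backend": "triton"},
        }
    }
)

# A user override only needs to specify the leaves it wants to change.
user_cfg = OmegaConf.create({"transforms": {"quantize_moe": {"enabled": True}}})

merged = OmegaConf.merge(default_cfg, user_cfg)
assert merged.transforms.quantize_moe.enabled is True
assert merged.transforms.fuse_rmsnorm.backend == "triton"  # sibling keys survive the merge
print(OmegaConf.to_yaml(merged))
```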

galagam and others added 30 commits July 21, 2025 07:25
…rmations to return None (#71)

* Refactor the signatures of AD graph transformations to return None (NVIDIA#5249)

Refactor signatures of AD graph transformations from
  gm = transformation(gm)
to
  transformation(gm)

The AD graph transformations modify the input GraphModule in-place, so the
previous signature style was misleading.

Signed-off-by: Gal Hubara Agam <[email protected]>
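
The convention described in the commit above can be illustrated with a toy pass; the transformation below is hypothetical and only demonstrates the "mutate the GraphModule in place, return None" signature:

```python
# Toy example of the new in-place signature. The pass itself is illustrative
# and is not one of the actual AD graph transformations.
import operator

import torch
from torch.fx import GraphModule, symbolic_trace


def eliminate_noop_add(gm: GraphModule) -> None:
    """Remove `x + 0` nodes by rewiring their users to the original input, in place."""
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target is operator.add and node.args[1:] == (0,):
            node.replace_all_uses_with(node.args[0])
            gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()


class Tiny(torch.nn.Module):
    def forward(self, x):
        return (x + 0) * 2


gm = symbolic_trace(Tiny())
eliminate_noop_add(gm)  # new style: mutate `gm` in place, nothing is returned
# old style would have been: gm = eliminate_noop_add(gm)
print(gm.code)
```
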
…ion (#76)

* Fix trtllm-bench test and enable trtllm-bench integration

Signed-off-by: Neta Zmora <[email protected]>

* Remove unneeded __init__.py

Signed-off-by: Neta Zmora <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
) (#73)

* yaml config loader for dynamic settings

Signed-off-by: Lucas Liebenwein <[email protected]>

* updates for yaml mixin

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* [AutoDeploy] Refining AD configurability

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressed reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the Torch backend and update the test to use the torch backend.

Signed-off-by: nvchenghaoz <[email protected]>

* Add the sinks and fix the test failures

Signed-off-by: nvchenghaoz <[email protected]>

* address reviewer's comments

Signed-off-by: nvchenghaoz <[email protected]>

* use custom op convention

Signed-off-by: nvchenghaoz <[email protected]>

* move the ref to the utils_test

Signed-off-by: nvchenghaoz <[email protected]>

* Add torch backend tests in ad_build_small_single.py

Signed-off-by: nvchenghaoz <[email protected]>

* Address hidden comments...

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
* add torch_fp8_moe and fp8 linear support in pattern matcher, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* add torch-fp4-moe and fp4 support in pattern matcher, unit test has acc issue and e2e mixtral fp4 has kernel error wo moe matching

Signed-off-by: Frida Hou <[email protected]>

* add pre-commit hook

Signed-off-by: Frida Hou <[email protected]>

* hacky fix for e2e run of mixtral FP4 and fp4 op unit test

Signed-off-by: Frida Hou <[email protected]>

* EP support for torch_fp4_moe and torch_fp8_moe

Signed-off-by: Frida Hou <[email protected]>

* fix rebase: op rename, shard_load_hook bug in FP4

Signed-off-by: Frida Hou <[email protected]>

* fix pre-commit

Signed-off-by: Frida Hou <[email protected]>

* fix weight loading-load_hook issue for FP4, update function to handle exclude_modules in hf_quant_config

Signed-off-by: Frida Hou <[email protected]>

* addressing feedback, add moe op template, update op names,other minor refinements

Signed-off-by: Frida Hou <[email protected]>

* move common functionality to utility

Signed-off-by: Frida Hou <[email protected]>

* fix FP4QuantizationImpl register from rebase

Signed-off-by: Frida Hou <[email protected]>

* add quantize_moe pass for patched torch_moe op

Signed-off-by: Frida Hou <[email protected]>

* add transformation unit tests for FP8 and FP4

Signed-off-by: Frida Hou <[email protected]>

* update should_skip_quantization to fix bmm unit test

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel and utils to extract weight for dynamic BMM case

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel to drop linear op

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* remove assert, add qwen small to tests

* lint

Signed-off-by: Suyog Gupta <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
* fix overlap scheduler in AD

Signed-off-by: Suyog Gupta <[email protected]>

* cleanups

Signed-off-by: Suyog Gupta <[email protected]>

* fix nest sequences

Signed-off-by: Suyog Gupta <[email protected]>

* nits

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* clean logic and max_beam_width arg

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
NVIDIA#5916) (#86)

* introduced basic sharding config logic

* transformation_executor works for TP parallelism. Updated test_graph_sharding

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Switched from dataclass to pydantic. Added run_pattern_detection_test functionality, applied to test_graph_sharding

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Restructured transformation execution logic. transformation_executor applies any generic transformations

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Detection + execution logic moved only to sharding. Transformation work on node.name

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Removed redundant params

Signed-off-by: greg-kwasniewski1 <[email protected]>

---------

Signed-off-by: greg-kwasniewski1 <[email protected]>
* Add sink/sliding window support for Triton

Signed-off-by: nvchenghaoz <[email protected]>

* Add the test and fix the functional implementations

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
* moving more transforms into the modular system

Signed-off-by: Lucas Liebenwein <[email protected]>

* fixes for some configs

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the torch ref implementation for new params.

Signed-off-by: nvchenghaoz <[email protected]>

* Remove comment

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
* Modular export patches + registry; fixes NVIDIA#5728

Signed-off-by: Lucas Liebenwein <[email protected]>

* patch library for models

Signed-off-by: Lucas Liebenwein <[email protected]>

* unit test fixes

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* fix overlap scheduler in AD

Signed-off-by: Suyog Gupta <[email protected]>

* cleanups

Signed-off-by: Suyog Gupta <[email protected]>

* fix nest sequences

Signed-off-by: Suyog Gupta <[email protected]>

* nits

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* cudagraph fixes + rms norm

Signed-off-by: Suyog Gupta <[email protected]>

* fix test

Signed-off-by: Suyog Gupta <[email protected]>

* revert ad_executor changes

Signed-off-by: Suyog Gupta <[email protected]>

* Review comments + make sure num_pages >= max batch size

* wrapping reviewer feedback and open items

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
…and BMM (fixes NVIDIA#5916) (#94)

* Updated tests

Signed-off-by: greg-kwasniewski1 <[email protected]>

* fixed tp sharding bug

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Fixed sharding tests

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Fixed sharding tests 1.1

Signed-off-by: greg-kwasniewski1 <[email protected]>

* import fix

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
* WIP for attention matching: repeat_kv, eager_attention_matching

Signed-off-by: Frida Hou <[email protected]>

* works e2e with llama2 and llama3.1, eager and sdpa

Signed-off-by: Frida Hou <[email protected]>

* update for unit test test_attention_matcher

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* unify into one transformation, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* update hf_test to verify transformed output, update move_to_device to recompile graph

Signed-off-by: Frida Hou <[email protected]>

* update after rebase

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* update docstring

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* Change the all-reduce strategy to NCCL

When the strategy is set to AUTO and world_size > 1, we experience hangs and CUDA memory errors.

* This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
* Without this change, the test test_ad_build_small_multi.py fails (tp==2)
* This is a temporary change until we understand why this hang is happening.
* On dllcuster this issue does not manifest.

Signed-off-by: Neta Zmora <[email protected]>

* Re-enable test_ad_build_small_multi.py

tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py

Signed-off-by: Neta Zmora <[email protected]>

* fix kvcache mem size compute - convert to MB

Signed-off-by: Gal Agam <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* Fix the torch backend Attention

Signed-off-by: nvchenghaoz <[email protected]>

* Revert "attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests"

This reverts commit 5743fb3.

---------

Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
Fridah-nv and others added 8 commits July 25, 2025 15:22
…tcher (#101)

* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* update matcher to only handle causal attn mask and set is_causal=True

Signed-off-by: Frida Hou <[email protected]>

* separate into three transformations

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* improve error handling and graph clean-up

Signed-off-by: Lucas Liebenwein <[email protected]>

* fix: avoid modify immutable type TransformInfo

Signed-off-by: haoguo <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]>
Co-authored-by: haoguo <[email protected]>
* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* Update the torch ref op

Signed-off-by: nvchenghaoz <[email protected]>

* Revert "attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests"

This reverts commit 5743fb3.

---------

Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
* refactor: move quantization and quant_moe to new inf optimizer

Signed-off-by: haoguo <[email protected]>

* refactor: use quant_config from factory instead of new config type

Signed-off-by: haoguo <[email protected]>

* refactor: del old files; update default.yaml

Signed-off-by: haoguo <[email protected]>

* move helper class FakeFactory to _graph_test_helpers.py

Signed-off-by: haoguo <[email protected]>

* polish: remove unreachable branch in quantization.py

Co-authored-by: Fridah-nv <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* style: run pre-commit

Signed-off-by: haoguo <[email protected]>

* fix to fetch hf_quant_config from fetched dir

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Co-authored-by: Fridah-nv <[email protected]>
* refactor: merge attn updates;move to new inf optimizer

Signed-off-by: haoguo <[email protected]>

* minor: fix import

Signed-off-by: haoguo <[email protected]>

* doc: fix file docstring

Signed-off-by: haoguo <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* polish: use config.run_shape_prop for shape prop

Signed-off-by: haoguo <[email protected]>

* polish: remove redundant canonicalize()

Signed-off-by: haoguo <[email protected]>

---------

Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
…s and refactored (#119)

* refactored compile_limit

Signed-off-by: Eran Geva <[email protected]>

* removed changes made to TorchCompileCompiler

Signed-off-by: Eran Geva <[email protected]>

* set cache_size_limit in TorchCompileCompiler

Signed-off-by: Eran Geva <[email protected]>

---------

Signed-off-by: Eran Geva <[email protected]>
coderabbitai bot (Contributor) commented Jul 31, 2025

Caution: Review failed. The pull request is closed.

Walkthrough

This update introduces a major overhaul of the AutoDeploy framework for PyTorch model deployment and optimization. Key changes include a modular transformation pipeline, deep YAML configuration support, enhanced quantization and sharding transforms, new backend and custom operator support, and a comprehensive refactor of testing utilities. Many transformations now operate in-place, and the export/patching logic is modularized for extensibility.
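
As a rough sketch of what such a registry-driven, in-place pipeline can look like (all class, registry, and stage names below are hypothetical placeholders, not the actual interface under tensorrt_llm/_torch/auto_deploy/transform/):

```python
# Hypothetical registry-based transform pipeline: transforms are registered by
# name, configured via typed (pydantic) configs, and applied to the FX graph
# module in place.
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

from pydantic import BaseModel
from torch.fx import GraphModule


class TransformConfig(BaseModel):
    """Typed per-transform config, e.g. deserialized from a YAML pipeline file."""
    enabled: bool = True


class BaseTransform(ABC):
    def __init__(self, config: TransformConfig):
        self.config = config

    @abstractmethod
    def apply(self, gm: GraphModule) -> None:
        """Mutate the graph module in place; transforms do not return a new GM."""


_REGISTRY: Dict[str, Type[BaseTransform]] = {}


def register_transform(name: str) -> Callable[[Type[BaseTransform]], Type[BaseTransform]]:
    def wrap(cls: Type[BaseTransform]) -> Type[BaseTransform]:
        _REGISTRY[name] = cls
        return cls
    return wrap


@register_transform("cleanup_noop_slice")
class CleanupNoopSlice(BaseTransform):
    def apply(self, gm: GraphModule) -> None:
        ...  # pattern-match and erase no-op slice nodes here


def run_pipeline(gm: GraphModule, stages: List[dict]) -> None:
    """Run the configured stages in order, each editing `gm` in place."""
    for stage in stages:
        cls = _REGISTRY[stage["name"]]
        transform = cls(TransformConfig(**stage.get("config", {})))
        if transform.config.enabled:
            transform.apply(gm)
            gm.recompile()


# Example stage list, e.g. the kind of data a YAML pipeline config could yield:
# run_pipeline(gm, [{"name": "cleanup_noop_slice", "config": {"enabled": True}}])
```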

Changes

Each cohort below is followed by its file(s) and a change summary:
  • AutoDeploy Example Config & CLI
examples/auto_deploy/.vscode/launch.json, examples/auto_deploy/README.md, examples/auto_deploy/build_and_run_ad.py, examples/auto_deploy/build_and_run_flux.py
Updated CLI argument parsing, improved documentation, support for dynamic YAML configs, and enhanced CLI flexibility for model deployment.
  • Requirements & Packaging
requirements.txt, setup.py
Added YAML and OmegaConf dependencies. Updated package data and extraction logic to include YAML files.
  • AutoDeploy Config & Utilities
tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/utils/_config.py, tensorrt_llm/_torch/auto_deploy/utils/node_utils.py, tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py, tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
Refactored configuration classes, added dynamic YAML/deep merge support, improved node utilities, and enhanced quantization helpers.
  • AutoDeploy Core & Model Factory
tensorrt_llm/_torch/auto_deploy/__init__.py, tensorrt_llm/_torch/auto_deploy/models/__init__.py, tensorrt_llm/_torch/auto_deploy/models/factory.py, tensorrt_llm/_torch/auto_deploy/models/hf.py, tensorrt_llm/_torch/auto_deploy/models/patches/__init__.py, tensorrt_llm/_torch/auto_deploy/models/patches/decilm.py, tensorrt_llm/_torch/auto_deploy/models/patches/deepseek.py, tensorrt_llm/_torch/auto_deploy/models/patches/mixtral.py, tensorrt_llm/_torch/auto_deploy/models/patches/phi.py, tensorrt_llm/_torch/auto_deploy/models/patches/qwen3.py
Streamlined model factory imports, modularized patch registration, improved default config handling, and migrated patch logic to class-based system.
  • Custom Operators and Backends
tensorrt_llm/_torch/auto_deploy/custom_ops/__init__.py, tensorrt_llm/_torch/auto_deploy/custom_ops/_triton_attention_internal.py, tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_moe.py, tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
Added new and refactored custom ops for attention, RMSNorm, and MoE, supporting new backends (Torch, Triton, FlashInfer), quantized variants, and advanced features (sliding window, sinks, logit cap); see the sketch after this table.
  • CUDA Graph Backend
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Improved batch size sorting, memory pool management, and async input buffer updates for CUDA graph execution.
  • Distributed Utilities
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
Changed allreduce strategy to NCCL as a temporary workaround.
  • Export System & Patch Framework
tensorrt_llm/_torch/auto_deploy/export/__init__.py, tensorrt_llm/_torch/auto_deploy/export/export.py, tensorrt_llm/_torch/auto_deploy/export/interface.py, tensorrt_llm/_torch/auto_deploy/export/library/__init__.py, tensorrt_llm/_torch/auto_deploy/export/library/autocast_noop.py, tensorrt_llm/_torch/auto_deploy/export/library/linear.py, tensorrt_llm/_torch/auto_deploy/export/library/modelopt_context.py, tensorrt_llm/_torch/auto_deploy/export/library/sdpa.py, tensorrt_llm/_torch/auto_deploy/export/library/sdpa_kernel_noop.py, tensorrt_llm/_torch/auto_deploy/export/library/tensor_meta_device.py, tensorrt_llm/_torch/auto_deploy/export/library/torch_modulelist_getitem.py, tensorrt_llm/_torch/auto_deploy/export/library/torch_where.py, tensorrt_llm/_torch/auto_deploy/export/library/transformers_sdpa_mask.py
Introduced a modular export patch framework, implemented multiple patches for PyTorch/transformers quirks, and centralized export logic with deduplication and device cleanup.
  • Transform Pipeline & Registry
tensorrt_llm/_torch/auto_deploy/transform/__init__.py, tensorrt_llm/_torch/auto_deploy/transform/interface.py, tensorrt_llm/_torch/auto_deploy/transform/library/__init__.py, tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py, tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_input_constraints.py, tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_noop_add.py, tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_noop_slice.py, tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py, tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py, tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py, tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
Introduced a modular, extensible transformation pipeline with registry, typed configs, and new transforms for model build, export, cleanup, quantization, and MoE quantization.
  • Transformations: In-Place Refactor & Library
tensorrt_llm/_torch/auto_deploy/transformations/__init__.py, tensorrt_llm/_torch/auto_deploy/transformations/_graph.py, tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py, tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py, tensorrt_llm/_torch/auto_deploy/transformations/library/collectives.py, tensorrt_llm/_torch/auto_deploy/transformations/library/eliminate_redundant_transposes.py, tensorrt_llm/_torch/auto_deploy/transformations/library/fused_moe.py, tensorrt_llm/_torch/auto_deploy/transformations/library/fusion.py, tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py, tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py, tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py, tensorrt_llm/_torch/auto_deploy/transformations/library/visualization.py, tensorrt_llm/_torch/auto_deploy/transformations/transform.py
Refactored all transformation functions to operate in-place (no longer return new GMs), modularized sharding and fusion, improved pattern matching, and updated for new optimizer pipeline.
  • Removed/Deprecated Files
tensorrt_llm/_torch/auto_deploy/transformations/export.py, tensorrt_llm/_torch/auto_deploy/transformations/library/ep_sharding.py
Removed legacy export and expert parallel sharding logic, replaced by modular transform and registry-based system.
  • AutoDeploy Shim & Engine
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
Updated ADEngine to support new config, device handling, max_beam_width, and improved input preparation.
  • Default Config & Data
tensorrt_llm/_torch/auto_deploy/config/default.yaml
Added default YAML configuration for transforms and pipeline stages.
  • Testing Utilities & Test Refactor
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py, tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py, tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
Added and refactored test helpers for graph transforms, sharding detection, and attention reference implementations.
  • Unit & Integration Tests
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py, tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py, tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py, tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_bmm_sharding.py, tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py, tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/compile/test_captured_graph.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/compile/test_compiler.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_ad_moe_op.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_attention_op.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_torch_attention_op.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_attention_with_kv_cache.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_triton_rms_norm.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_deepseek_patches.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rmsnorm.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_rope_transformation.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/test_export.py
Refactored and extended tests for new transform pipeline, sharding detection, quantization, MoE fusion, attention ops, and configuration validation. Many tests now use in-place transforms and new helper utilities.
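
To make the attention features above more concrete (see the Custom Operators and Backends entry), here is a rough pure-PyTorch formulation of causal attention with a sliding window and per-head sink logits. It is a sketch of one common formulation, not the PR's torch_backend_attention op; the tensor shapes and sink semantics are assumptions:

```python
# Illustrative reference for causal attention with a sliding window and
# per-head "sink" logits. One common formulation; not the PR's actual op.
from typing import Optional

import torch
import torch.nn.functional as F


def sliding_window_sink_attention(
    q: torch.Tensor,                       # [batch, heads, seq, head_dim]
    k: torch.Tensor,                       # [batch, heads, seq, head_dim]
    v: torch.Tensor,                       # [batch, heads, seq, head_dim]
    sinks: Optional[torch.Tensor] = None,  # [heads] sink logits
    window: Optional[int] = None,          # attend to at most the last `window` keys
) -> torch.Tensor:
    b, h, s, d = q.shape
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d**0.5

    # Causal mask, optionally narrowed to a sliding window.
    idx = torch.arange(s, device=q.device)
    keep = idx[None, :] <= idx[:, None]                        # key j <= query i
    if window is not None:
        keep = keep & (idx[None, :] > idx[:, None] - window)   # j > i - window
    scores = scores.masked_fill(~keep, float("-inf"))

    if sinks is not None:
        # One sink logit per head becomes an extra softmax column: it absorbs
        # probability mass but contributes no value vector.
        sink_col = sinks.view(1, h, 1, 1).expand(b, h, s, 1)
        scores = torch.cat([scores, sink_col], dim=-1)

    probs = F.softmax(scores, dim=-1)
    if sinks is not None:
        probs = probs[..., :-1]                                # drop the sink column

    return torch.einsum("bhqk,bhkd->bhqd", probs, v)


# Sanity check: with no window and no sinks this matches plain causal SDPA.
q = torch.randn(1, 2, 5, 8)
k = torch.randn(1, 2, 5, 8)
v = torch.randn(1, 2, 5, 8)
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
assert torch.allclose(sliding_window_sink_attention(q, k, v), ref, atol=1e-5)
```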

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Script
    participant ConfigLoader
    participant ModelFactory
    participant TransformPipeline
    participant ExportSystem
    participant CustomOps
    participant OptimizedModel

    User->>CLI/Script: Launch with CLI args/YAML configs
    CLI/Script->>ConfigLoader: Parse, merge, and validate configs
    ConfigLoader->>ModelFactory: Instantiate with config
    ModelFactory->>TransformPipeline: Build initial model
    TransformPipeline->>ExportSystem: Export to FX GraphModule (with patches)
    ExportSystem->>CustomOps: Register and patch ops as needed
    TransformPipeline->>TransformPipeline: Apply transforms (quantization, sharding, fusion, etc.)
    TransformPipeline->>OptimizedModel: Return optimized model/graph
    OptimizedModel-->>User: Ready for inference/deployment

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related PRs

  • [AutoDeploy] merge feat/ad-2025-07-07 #6196: Shares identical changes in the AutoDeploy example, configuration, and CLI, including refactoring of ExperimentConfig and CLI argument handling; directly related at the code level.

Suggested reviewers

  • litaotju
  • pcastonguay
  • nv-guomingz
  • shaharmor98

Poem

In the warren of code, where the YAML files grow,
A rabbit found transforms, all lined up in a row.
With quantization and sharding, and configs so neat,
The pipeline now hops with in-place repeat.
Custom ops sparkle, tests multiply,
This modular meadow makes bunnies hop high!
🐇✨


galagam force-pushed the user/ghubaraagam/prepare-inputs-2 branch from bc5b1b6 to 27a3d7f on July 31, 2025 12:27
galagam closed this Jul 31, 2025