
Conversation

galagam (Collaborator) commented Jul 31, 2025

Summary by CodeRabbit

  • New Features

    • Introduced modular and extensible graph transformation and export frameworks with configurable pipelines and patch management.
    • Added backend-specific RMSNorm fusion and support for quantized Mixture-of-Experts (MoE) in FP8 and FP4 formats.
    • Implemented advanced attention mechanisms with sliding window and sink token features, including a pure PyTorch backend.
    • Enhanced configuration management with dynamic YAML merging and deep configuration overrides (see the merge sketch after this list).
    • Added comprehensive testing utilities and reference implementations for attention and MoE operators.
  • Improvements

    • Refactored attention, MoE, and sharding pattern matching for better modularity, extensibility, and in-place graph transformations.
    • Expanded expert-level documentation and usage guidance for advanced configuration and deployment scenarios.
    • Improved test coverage for quantized MoE, attention backends, sharding detection, and transformation correctness.
    • Updated export process to use official PyTorch export APIs with improved patch application and deduplication.
    • Enhanced device handling, memory management, and batch size ordering in CUDA graph capture for reliability.
  • Bug Fixes

    • Fixed parameter loading and alias handling during export to maintain correct state dict semantics.
    • Corrected memory size calculations and logging units in cache resizing utilities.
  • Documentation

    • Significantly expanded and clarified user and expert documentation, including configuration, advanced usage, and roadmap references.
  • Chores

    • Cleaned up deprecated modules and imports; replaced legacy export and transformation calls with new modular optimizer.
    • Updated test utilities and parameterizations for consistency with new transformation and export frameworks.
    • Added new test cases for parallel config validation, quantized MoE, and attention backends.
    • Removed obsolete tests and streamlined attention mask handling in test models.
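
The deep configuration merging mentioned above can be pictured with OmegaConf, which this PR adds as a dependency. The keys and transform names below are purely illustrative, not the actual default.yaml schema:

```python
# Minimal sketch of deep config merging with OmegaConf (added as a dependency
# in this PR). The keys below are hypothetical; in practice the dicts would
# come from OmegaConf.load("default.yaml") and user-provided override files.
from omegaconf import OmegaConf

default_cfg = OmegaConf.create(
    {
        "transforms": {
            "quantize_moe": {"enabled": False},
            "fuse_rmsnorm": {"backend": "triton"},
        }
    }
)

# A user override only needs to specify the leaves it wants to change.
user_cfg = OmegaConf.create({"transforms": {"quantize_moe": {"enabled": True}}})

merged = OmegaConf.merge(default_cfg, user_cfg)
assert merged.transforms.quantize_moe.enabled is True
assert merged.transforms.fuse_rmsnorm.backend == "triton"  # sibling keys survive the merge
print(OmegaConf.to_yaml(merged))
```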

galagam and others added 30 commits July 21, 2025 07:25
…rmations to return None (#71)

* Refactor the signatures of AD graph transformations to return None (NVIDIA#5249)

Refactor signatures of AD graph transformations from
  gm = transformation(gm)
to
  transformation(gm)

The AD graph transformations modify the input GraphModule in-place, so the
previous signature style was misleading.

Signed-off-by: Gal Hubara Agam <[email protected]>
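
The convention described in the commit above can be illustrated with a toy pass; the transformation below is hypothetical and only demonstrates the "mutate the GraphModule in place, return None" signature:

```python
# Toy example of the new in-place signature. The pass itself is illustrative
# and is not one of the actual AD graph transformations.
import operator

import torch
from torch.fx import GraphModule, symbolic_trace


def eliminate_noop_add(gm: GraphModule) -> None:
    """Remove `x + 0` nodes by rewiring their users to the original input, in place."""
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target is operator.add and node.args[1:] == (0,):
            node.replace_all_uses_with(node.args[0])
            gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()


class Tiny(torch.nn.Module):
    def forward(self, x):
        return (x + 0) * 2


gm = symbolic_trace(Tiny())
eliminate_noop_add(gm)  # new style: mutate `gm` in place, nothing is returned
# old style would have been: gm = eliminate_noop_add(gm)
print(gm.code)
```
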
…ion (#76)

* Fix trtllm-bench test and enable trtllm-bench integration

Signed-off-by: Neta Zmora <[email protected]>

* Remove unneeded __init__.py

Signed-off-by: Neta Zmora <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
) (#73)

* yaml config loader for dynamic settings

Signed-off-by: Lucas Liebenwein <[email protected]>

* updates for yaml mixin

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* [AutoDeploy] Refining AD configurability

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressed reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the Torch backend and update the test to use the torch backend.

Signed-off-by: nvchenghaoz <[email protected]>

* Add the sinks and fix the test failures

Signed-off-by: nvchenghaoz <[email protected]>

* address reviewer's comments

Signed-off-by: nvchenghaoz <[email protected]>

* use custom op convention

Signed-off-by: nvchenghaoz <[email protected]>

* move the ref to the utils_test

Signed-off-by: nvchenghaoz <[email protected]>

* Add torch backend tests in ad_build_small_single.py

Signed-off-by: nvchenghaoz <[email protected]>

* Address hidden comments...

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
* add torch_fp8_moe and fp8 linear support in pattern matcher, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* add torch-fp4-moe and fp4 support in pattern matcher, unit test has acc issue and e2e mixtral fp4 has kernel error wo moe matching

Signed-off-by: Frida Hou <[email protected]>

* add pre-commit hook

Signed-off-by: Frida Hou <[email protected]>

* hacky fix for e2e run of mixtral FP4 and fp4 op unit test

Signed-off-by: Frida Hou <[email protected]>

* EP support for torch_fp4_moe and torch_fp8_moe

Signed-off-by: Frida Hou <[email protected]>

* fix rebase: op rename, shard_load_hook bug in FP4

Signed-off-by: Frida Hou <[email protected]>

* fix pre-commit

Signed-off-by: Frida Hou <[email protected]>

* fix weight loading-load_hook issue for FP4, update function to handle exclude_modules in hf_quant_config

Signed-off-by: Frida Hou <[email protected]>

* addressing feedback, add moe op template, update op names,other minor refinements

Signed-off-by: Frida Hou <[email protected]>

* move common functionality to utility

Signed-off-by: Frida Hou <[email protected]>

* fix FP4QuantizationImpl register from rebase

Signed-off-by: Frida Hou <[email protected]>

* add quantize_moe pass for patched torch_moe op

Signed-off-by: Frida Hou <[email protected]>

* add transformation unit tests for FP8 and FP4

Signed-off-by: Frida Hou <[email protected]>

* update should_skip_quantization to fix bmm unit test

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel and utils to extract weight for dynamic BMM case

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel to drop linear op

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* remove assert, add qwen small to tests

* lint

Signed-off-by: Suyog Gupta <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
* fix overlap scheduler in AD

Signed-off-by: Suyog Gupta <[email protected]>

* cleanups

Signed-off-by: Suyog Gupta <[email protected]>

* fix nest sequences

Signed-off-by: Suyog Gupta <[email protected]>

* nits

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* clean logic and max_beam_width arg

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
NVIDIA#5916) (#86)

* introduced basic sharding config logic

* transformation_executor works for TP parallelism. Updated test_graph_sharding

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Switched from dataclass to pydantic. Added run_pattern_detection_test functionality, applied to test_graph_sharding

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Restructured transformation execution logic. transformation_executor applies any generic transformations

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Detection + execution logic moved only to sharding. Transformation work on node.name

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Removed redundant params

Signed-off-by: greg-kwasniewski1 <[email protected]>

---------

Signed-off-by: greg-kwasniewski1 <[email protected]>
* Add sink/sliding window support for Triton

Signed-off-by: nvchenghaoz <[email protected]>

* Add the test and fix the functional implementations

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
* moving more transforms into the modular system

Signed-off-by: Lucas Liebenwein <[email protected]>

* fixes for some configs

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the torch ref implementation for new params.

Signed-off-by: nvchenghaoz <[email protected]>

* Remove comment

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
* Modular export patches + registry; fixes NVIDIA#5728

Signed-off-by: Lucas Liebenwein <[email protected]>

* patch library for models

Signed-off-by: Lucas Liebenwein <[email protected]>

* unit test fixes

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* fix overlap scheduler in AD

Signed-off-by: Suyog Gupta <[email protected]>

* cleanups

Signed-off-by: Suyog Gupta <[email protected]>

* fix nest sequences

Signed-off-by: Suyog Gupta <[email protected]>

* nits

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* cudagraph fixes + rms norm

Signed-off-by: Suyog Gupta <[email protected]>

* fix test

Signed-off-by: Suyog Gupta <[email protected]>

* revert ad_executor changes

Signed-off-by: Suyog Gupta <[email protected]>

* Review comments + make sure num_pages >= max batch size

* wrapping reviewer feedback and open items

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
…and BMM (fixes NVIDIA#5916) (#94)

* Updated tests

Signed-off-by: greg-kwasniewski1 <[email protected]>

* fixed tp sharding bug

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Fixed sharding tests

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Fixed sharding tests 1.1

Signed-off-by: greg-kwasniewski1 <[email protected]>

* import fix

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
* WIP for attention matching: repeat_kv, eager_attention_matching

Signed-off-by: Frida Hou <[email protected]>

* works e2e with llama2 and llama3.1, eager and sdpa

Signed-off-by: Frida Hou <[email protected]>

* update for unit test test_attention_matcher

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* unify into one transformation, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* update hf_test to verify transformed output, update move_to_device to recompile graph

Signed-off-by: Frida Hou <[email protected]>

* update after rebase

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* update docstring

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* Change the all-reduce strategy to NCCL

When the strategy is set to AUTO and world_size > 1, we experience hangs and CUDA memory errors.

* This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
* Without this change, the test test_ad_build_small_multi.py fails (tp==2)
* This is a temporary change until we understand why this hang is happening.
* On dllcuster this issue does not manifest.

Signed-off-by: Neta Zmora <[email protected]>

* Re-enable test_ad_build_small_multi.py

tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py

Signed-off-by: Neta Zmora <[email protected]>

* fix kvcache mem size compute - convert to MB

Signed-off-by: Gal Agam <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* Fix the torch backend Attention

Signed-off-by: nvchenghaoz <[email protected]>

* Revert "attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests"

This reverts commit 5743fb3.

---------

Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
Fridah-nv and others added 8 commits July 25, 2025 15:22
…tcher (#101)

* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* update matcher to only handle causal attn mask and set is_causal=True

Signed-off-by: Frida Hou <[email protected]>

* separate into three transformations

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* improve error handling and graph clean-up

Signed-off-by: Lucas Liebenwein <[email protected]>

* fix: avoid modify immutable type TransformInfo

Signed-off-by: haoguo <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]>
Co-authored-by: haoguo <[email protected]>
* attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* Update the torch ref op

Signed-off-by: nvchenghaoz <[email protected]>

* Revert "attention matcher with torch._inductor pattern matcher,matching repeat kv, sdpa and group attention, update unit tests"

This reverts commit 5743fb3.

---------

Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
* refactor: move quantization and quant_moe to new inf optimizer

Signed-off-by: haoguo <[email protected]>

* refactor: use quant_config from factory instead of new config type

Signed-off-by: haoguo <[email protected]>

* refactor: del old files; update default.yaml

Signed-off-by: haoguo <[email protected]>

* move helper class FakeFactory to _graph_test_helpers.py

Signed-off-by: haoguo <[email protected]>

* polish: remove unreachable branch in quantization.py

Co-authored-by: Fridah-nv <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* style: run pre-commit

Signed-off-by: haoguo <[email protected]>

* fix to fetch hf_quant_config from fetched dir

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Co-authored-by: Fridah-nv <[email protected]>
* refactor: merge attn updates;move to new inf optimizer

Signed-off-by: haoguo <[email protected]>

* minor: fix import

Signed-off-by: haoguo <[email protected]>

* doc: fix file docstring

Signed-off-by: haoguo <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* Update tensorrt_llm/_torch/auto_deploy/transform/library/attention.py

Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: h-guo18 <[email protected]>

* polish: use config.run_shape_prop for shape prop

Signed-off-by: haoguo <[email protected]>

* polish: remove redundant canonicalize()

Signed-off-by: haoguo <[email protected]>

---------

Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
…s and refactored (#119)

* refactored compile_limit

Signed-off-by: Eran Geva <[email protected]>

* removed changes made to TorchCompileCompiler

Signed-off-by: Eran Geva <[email protected]>

* set cache_size_limit in TorchCompileCompiler

Signed-off-by: Eran Geva <[email protected]>

---------

Signed-off-by: Eran Geva <[email protected]>
coderabbitai bot (Contributor) commented Jul 31, 2025

Caution: Review failed. The pull request is closed.

Walkthrough

This update introduces a major overhaul of the AutoDeploy framework for PyTorch model deployment and optimization. Key changes include a modular transformation pipeline, deep YAML configuration support, enhanced quantization and sharding transforms, new backend and custom operator support, and a comprehensive refactor of testing utilities. Many transformations now operate in-place, and the export/patching logic is modularized for extensibility.
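
As a rough sketch of what such a registry-driven, in-place pipeline can look like (all class, registry, and stage names below are hypothetical placeholders, not the actual interface under tensorrt_llm/_torch/auto_deploy/transform/):

```python
# Hypothetical registry-based transform pipeline: transforms are registered by
# name, configured via typed (pydantic) configs, and applied to the FX graph
# module in place.
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

from pydantic import BaseModel
from torch.fx import GraphModule


class TransformConfig(BaseModel):
    """Typed per-transform config, e.g. deserialized from a YAML pipeline file."""
    enabled: bool = True


class BaseTransform(ABC):
    def __init__(self, config: TransformConfig):
        self.config = config

    @abstractmethod
    def apply(self, gm: GraphModule) -> None:
        """Mutate the graph module in place; transforms do not return a new GM."""


_REGISTRY: Dict[str, Type[BaseTransform]] = {}


def register_transform(name: str) -> Callable[[Type[BaseTransform]], Type[BaseTransform]]:
    def wrap(cls: Type[BaseTransform]) -> Type[BaseTransform]:
        _REGISTRY[name] = cls
        return cls
    return wrap


@register_transform("cleanup_noop_slice")
class CleanupNoopSlice(BaseTransform):
    def apply(self, gm: GraphModule) -> None:
        ...  # pattern-match and erase no-op slice nodes here


def run_pipeline(gm: GraphModule, stages: List[dict]) -> None:
    """Run the configured stages in order, each editing `gm` in place."""
    for stage in stages:
        cls = _REGISTRY[stage["name"]]
        transform = cls(TransformConfig(**stage.get("config", {})))
        if transform.config.enabled:
            transform.apply(gm)
            gm.recompile()


# Example stage list, e.g. the kind of data a YAML pipeline config could yield:
# run_pipeline(gm, [{"name": "cleanup_noop_slice", "config": {"enabled": True}}])
```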

Changes

Each cohort below is followed by its file(s) and a change summary:
  • AutoDeploy Example Config & CLI
examples/auto_deploy/.vscode/launch.json, examples/auto_deploy/README.md, examples/auto_deploy/build_and_run_ad.py, examples/auto_deploy/build_and_run_flux.py
Updated CLI argument parsing, improved documentation, support for dynamic YAML configs, and enhanced CLI flexibility for model deployment.
  • Requirements & Packaging
requirements.txt, setup.py
Added YAML and OmegaConf dependencies. Updated package data and extraction logic to include YAML files.
  • AutoDeploy Config & Utilities
tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/utils/_config.py, tensorrt_llm/_torch/auto_deploy/utils/node_utils.py, tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py, tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
Refactored configuration classes, added dynamic YAML/deep merge support, improved node utilities, and enhanced quantization helpers.
  • AutoDeploy Core & Model Factory
tensorrt_llm/_torch/auto_deploy/__init__.py, tensorrt_llm/_torch/auto_deploy/models/__init__.py, tensorrt_llm/_torch/auto_deploy/models/factory.py, tensorrt_llm/_torch/auto_deploy/models/hf.py, tensorrt_llm/_torch/auto_deploy/models/patches/__init__.py, tensorrt_llm/_torch/auto_deploy/models/patches/decilm.py, tensorrt_llm/_torch/auto_deploy/models/patches/deepseek.py, tensorrt_llm/_torch/auto_deploy/models/patches/mixtral.py, tensorrt_llm/_torch/auto_deploy/models/patches/phi.py, tensorrt_llm/_torch/auto_deploy/models/patches/qwen3.py
Streamlined model factory imports, modularized patch registration, improved default config handling, and migrated patch logic to class-based system.
  • Custom Operators and Backends
tensorrt_llm/_torch/auto_deploy/custom_ops/__init__.py, tensorrt_llm/_torch/auto_deploy/custom_ops/_triton_attention_internal.py, tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_moe.py, tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
Added new and refactored custom ops for attention, RMSNorm, and MoE, supporting new backends (Torch, Triton, FlashInfer), quantized variants, and advanced features (sliding window, sinks, logit cap); see the sketch after this table.
  • CUDA Graph Backend
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Improved batch size sorting, memory pool management, and async input buffer updates for CUDA graph execution.
  • Distributed Utilities
tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py
Changed allreduce strategy to NCCL as a temporary workaround.
  • Export System & Patch Framework
tensorrt_llm/_torch/auto_deploy/export/__init__.py, tensorrt_llm/_torch/auto_deploy/export/export.py, tensorrt_llm/_torch/auto_deploy/export/interface.py, tensorrt_llm/_torch/auto_deploy/export/library/__init__.py, tensorrt_llm/_torch/auto_deploy/export/library/autocast_noop.py, tensorrt_llm/_torch/auto_deploy/export/library/linear.py, tensorrt_llm/_torch/auto_deploy/export/library/modelopt_context.py, tensorrt_llm/_torch/auto_deploy/export/library/sdpa.py, tensorrt_llm/_torch/auto_deploy/export/library/sdpa_kernel_noop.py, tensorrt_llm/_torch/auto_deploy/export/library/tensor_meta_device.py, tensorrt_llm/_torch/auto_deploy/export/library/torch_modulelist_getitem.py, tensorrt_llm/_torch/auto_deploy/export/library/torch_where.py, tensorrt_llm/_torch/auto_deploy/export/library/transformers_sdpa_mask.py
Introduced a modular export patch framework, implemented multiple patches for PyTorch/transformers quirks, and centralized export logic with deduplication and device cleanup.
  • Transform Pipeline & Registry
tensorrt_llm/_torch/auto_deploy/transform/__init__.py, tensorrt_llm/_torch/auto_deploy/transform/interface.py, tensorrt_llm/_torch/auto_deploy/transform/library/__init__.py, tensorrt_llm/_torch/auto_deploy/transform/library/build_model.py, tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_input_constraints.py, tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_noop_add.py, tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_noop_slice.py, tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py, tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py, tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py, tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
Introduced a modular, extensible transformation pipeline with registry, typed configs, and new transforms for model build, export, cleanup, quantization, and MoE quantization.
  • Transformations: In-Place Refactor & Library
tensorrt_llm/_torch/auto_deploy/transformations/__init__.py, tensorrt_llm/_torch/auto_deploy/transformations/_graph.py, tensorrt_llm/_torch/auto_deploy/transformations/library/__init__.py, tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py, tensorrt_llm/_torch/auto_deploy/transformations/library/collectives.py, tensorrt_llm/_torch/auto_deploy/transformations/library/eliminate_redundant_transposes.py, tensorrt_llm/_torch/auto_deploy/transformations/library/fused_moe.py, tensorrt_llm/_torch/auto_deploy/transformations/library/fusion.py, tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py, tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py, tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py, tensorrt_llm/_torch/auto_deploy/transformations/library/visualization.py, tensorrt_llm/_torch/auto_deploy/transformations/transform.py
Refactored all transformation functions to operate in-place (no longer return new GMs), modularized sharding and fusion, improved pattern matching, and updated for new optimizer pipeline.
  • Removed/Deprecated Files
tensorrt_llm/_torch/auto_deploy/transformations/export.py, tensorrt_llm/_torch/auto_deploy/transformations/library/ep_sharding.py
Removed legacy export and expert parallel sharding logic, replaced by modular transform and registry-based system.
  • AutoDeploy Shim & Engine
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
Updated ADEngine to support new config, device handling, max_beam_width, and improved input preparation.
  • Default Config & Data
tensorrt_llm/_torch/auto_deploy/config/default.yaml
Added default YAML configuration for transforms and pipeline stages.
  • Testing Utilities & Test Refactor
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py, tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py, tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
Added and refactored test helpers for graph transforms, sharding detection, and attention reference implementations.
  • Unit & Integration Tests
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py, tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py, tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_allreduce_residual_rmsnorm_fusion.py, tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_bmm_sharding.py, tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py, tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/compile/test_captured_graph.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/compile/test_compiler.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_ad_moe_op.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_attention_op.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_torch_attention_op.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_attention_with_kv_cache.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_triton_rms_norm.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_deepseek_patches.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_fuse_rmsnorm.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quant_moe.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_quantization.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_rope_transformation.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/test_export.py
Refactored and extended tests for new transform pipeline, sharding detection, quantization, MoE fusion, attention ops, and configuration validation. Many tests now use in-place transforms and new helper utilities.
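
To make the attention features above more concrete (see the Custom Operators and Backends entry), here is a rough pure-PyTorch formulation of causal attention with a sliding window and per-head sink logits. It is a sketch of one common formulation, not the PR's torch_backend_attention op; the tensor shapes and sink semantics are assumptions:

```python
# Illustrative reference for causal attention with a sliding window and
# per-head "sink" logits. One common formulation; not the PR's actual op.
from typing import Optional

import torch
import torch.nn.functional as F


def sliding_window_sink_attention(
    q: torch.Tensor,                       # [batch, heads, seq, head_dim]
    k: torch.Tensor,                       # [batch, heads, seq, head_dim]
    v: torch.Tensor,                       # [batch, heads, seq, head_dim]
    sinks: Optional[torch.Tensor] = None,  # [heads] sink logits
    window: Optional[int] = None,          # attend to at most the last `window` keys
) -> torch.Tensor:
    b, h, s, d = q.shape
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) / d**0.5

    # Causal mask, optionally narrowed to a sliding window.
    idx = torch.arange(s, device=q.device)
    keep = idx[None, :] <= idx[:, None]                        # key j <= query i
    if window is not None:
        keep = keep & (idx[None, :] > idx[:, None] - window)   # j > i - window
    scores = scores.masked_fill(~keep, float("-inf"))

    if sinks is not None:
        # One sink logit per head becomes an extra softmax column: it absorbs
        # probability mass but contributes no value vector.
        sink_col = sinks.view(1, h, 1, 1).expand(b, h, s, 1)
        scores = torch.cat([scores, sink_col], dim=-1)

    probs = F.softmax(scores, dim=-1)
    if sinks is not None:
        probs = probs[..., :-1]                                # drop the sink column

    return torch.einsum("bhqk,bhkd->bhqd", probs, v)


# Sanity check: with no window and no sinks this matches plain causal SDPA.
q = torch.randn(1, 2, 5, 8)
k = torch.randn(1, 2, 5, 8)
v = torch.randn(1, 2, 5, 8)
ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)
assert torch.allclose(sliding_window_sink_attention(q, k, v), ref, atol=1e-5)
```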

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Script
    participant ConfigLoader
    participant ModelFactory
    participant TransformPipeline
    participant ExportSystem
    participant CustomOps
    participant OptimizedModel

    User->>CLI/Script: Launch with CLI args/YAML configs
    CLI/Script->>ConfigLoader: Parse, merge, and validate configs
    ConfigLoader->>ModelFactory: Instantiate with config
    ModelFactory->>TransformPipeline: Build initial model
    TransformPipeline->>ExportSystem: Export to FX GraphModule (with patches)
    ExportSystem->>CustomOps: Register and patch ops as needed
    TransformPipeline->>TransformPipeline: Apply transforms (quantization, sharding, fusion, etc.)
    TransformPipeline->>OptimizedModel: Return optimized model/graph
    OptimizedModel-->>User: Ready for inference/deployment

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related PRs

  • [AutoDeploy] merge feat/ad-2025-07-07 #6196: Shares identical changes in the AutoDeploy example, configuration, and CLI, including refactoring of ExperimentConfig and CLI argument handling; directly related at the code level.

Suggested reviewers

  • litaotju
  • pcastonguay
  • nv-guomingz
  • shaharmor98

Poem

In the warren of code, where the YAML files grow,
A rabbit found transforms, all lined up in a row.
With quantization and sharding, and configs so neat,
The pipeline now hops with in-place repeat.
Custom ops sparkle, tests multiply,
This modular meadow makes bunnies hop high!
🐇✨


galagam force-pushed the user/ghubaraagam/prepare-inputs-2 branch from bc5b1b6 to 27a3d7f on July 31, 2025 12:27
galagam closed this Jul 31, 2025