
Conversation

@h-guo18
Collaborator

@h-guo18 h-guo18 commented Jul 24, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a modular, staged graph transformation and export pipeline for PyTorch models, including dynamic YAML configuration, deep merging, and advanced CLI argument handling (see the configuration-merge sketch after this summary).
    • Added backend-specific RMSNorm, quantized MoE (FP8/NVFP4), and Torch attention custom operators, with corresponding graph pattern fusion transforms.
    • Integrated a flexible patch system for export compatibility (e.g., SDPA, ModuleList, linear ops, meta device, transformers).
    • Added a comprehensive inference optimizer for efficient model deployment.
  • Enhancements

    • Improved configuration validation, error checking, and support for nested YAML files.
    • Expanded attention and MoE quantization support, including new test coverage.
    • Optimized CUDA graph capture with memory pool reuse and improved logging.
  • Bug Fixes

    • Fixed parameter deduplication, device info cleanup, and input constraint handling in exported graphs.
    • Corrected sharding, caching, and input preparation logic for distributed and single-GPU scenarios.
  • Documentation

    • Updated and expanded documentation for configuration, advanced usage, and expert options.
  • Tests

    • Added and refactored extensive unit and integration tests for new transforms, quantization, sharding, custom ops, and export compatibility.
  • Chores

    • Refactored and reorganized codebase for modularity, maintainability, and extensibility, including deprecating legacy transformation modules.
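
As a concrete illustration of the dynamic YAML configuration and deep merging called out above, the flow could look roughly like the following minimal sketch using omegaconf (which this change set adds to requirements.txt). The file paths and the dotlist override are illustrative assumptions, not the actual AutoDeploy implementation:

    from omegaconf import OmegaConf

    # Illustrative only: deep-merge a base config with an experiment config and a CLI-style override.
    base = OmegaConf.load("configs/default.yaml")             # hypothetical path
    experiment = OmegaConf.load("configs/experiment.yaml")    # hypothetical path
    cli = OmegaConf.from_dotlist(["args.max_beam_width=4"])   # hypothetical CLI override
    merged = OmegaConf.merge(base, experiment, cli)           # later sources win; nested dicts merge recursively
    print(OmegaConf.to_yaml(merged))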

Description

Issue #4403. This PR moves fuse_rmsnorm only.
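
For context, fuse_rmsnorm pattern-matches the decomposed RMSNorm subgraph in the exported FX graph and replaces it with a fused backend op. Below is a minimal sketch of the unfused reference pattern such a fusion typically targets (standard RMSNorm math; the fused op name and the transform registration API are intentionally not shown here):

    import torch

    def rmsnorm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Decomposed RMSNorm as it commonly appears in exported HF-style graphs:
        # scale by the reciprocal root-mean-square over the last dimension, then apply the learned weight.
        variance = x.pow(2).mean(-1, keepdim=True)
        return weight * (x * torch.rsqrt(variance + eps))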

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
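
For example, an illustrative pre-merge invocation (posted as a PR comment) that disables fail-fast and restricts testing to a single documented stage:

    /bot run --disable-fail-fast --stage-list "A10-PyTorch-1"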

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of the tree.
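
An illustrative invocation (the comment text is just an example):

    /bot skip --comment "Docs-only change, CI not required"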

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of the tree.

galagam and others added 30 commits July 21, 2025 07:25
…rmations to return None (#71)

* Refactor the signatures of AD graph transformations to return None (NVIDIA#5249)

Refactor signatures of AD graph transformations from
  gm = transformation(gm)
to
  transformation(gm)

Since the AD graph transformations modify the input GraphModule
in-place, the previous signature style was misleading.

Signed-off-by: Gal Hubara Agam <[email protected]>
…ion (#76)

* Fix trtllm-bench test and enable trtllm-bench integration

Signed-off-by: Neta Zmora <[email protected]>

* Remove unneeded __init__.py

Signed-off-by: Neta Zmora <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
) (#73)

* yaml config loader for dynamic settings

Signed-off-by: Lucas Liebenwein <[email protected]>

* updates for yaml mixin

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* [AutoDeploy] Refining AD configurability

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressed reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the Torch backend and update the test to use the torch backend.

Signed-off-by: nvchenghaoz <[email protected]>

* Add the sinks and fix the test failures

Signed-off-by: nvchenghaoz <[email protected]>

* address reviewer's comments

Signed-off-by: nvchenghaoz <[email protected]>

* use custom op convention

Signed-off-by: nvchenghaoz <[email protected]>

* move the ref to the utils_test

Signed-off-by: nvchenghaoz <[email protected]>

* Add torch backend tests in ad_build_small_single.py

Signed-off-by: nvchenghaoz <[email protected]>

* Address hidden comments...

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
* add torch_fp8_moe and fp8 linear support in pattern matcher, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* add torch-fp4-moe and fp4 support in pattern matcher, unit test has acc issue and e2e mixtral fp4 has kernel error wo moe matching

Signed-off-by: Frida Hou <[email protected]>

* add pre-commit hook

Signed-off-by: Frida Hou <[email protected]>

* hacky fix for e2e run of mixtral FP4 and fp4 op unit test

Signed-off-by: Frida Hou <[email protected]>

* EP support for torch_fp4_moe and torch_fp8_moe

Signed-off-by: Frida Hou <[email protected]>

* fix rebase: op rename, shard_load_hook bug in FP4

Signed-off-by: Frida Hou <[email protected]>

* fix pre-commit

Signed-off-by: Frida Hou <[email protected]>

* fix weight loading-load_hook issue for FP4, update function to handle exclude_modules in hf_quant_config

Signed-off-by: Frida Hou <[email protected]>

* addressing feedback, add moe op template, update op names,other minor refinements

Signed-off-by: Frida Hou <[email protected]>

* move common functionality to utility

Signed-off-by: Frida Hou <[email protected]>

* fix FP4QuantizationImpl register from rebase

Signed-off-by: Frida Hou <[email protected]>

* add quantize_moe pass for patched torch_moe op

Signed-off-by: Frida Hou <[email protected]>

* add transformation unit tests for FP8 and FP4

Signed-off-by: Frida Hou <[email protected]>

* update should_skip_quantization to fix bmm unit test

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel and utils to extract weight for dynamic BMM case

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel to drop linear op

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* remove assert, add qwen small to tests

* lint

Signed-off-by: Suyog Gupta <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
* fix overlap scheduler in AD

Signed-off-by: Suyog Gupta <[email protected]>

* cleanups

Signed-off-by: Suyog Gupta <[email protected]>

* fix nest sequences

Signed-off-by: Suyog Gupta <[email protected]>

* nits

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* clean logic and max_beam_width arg

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
NVIDIA#5916) (#86)

* introduced basic sharding config logic

* transformation_executor works for TP parallelism. Updated test_graph_sharding

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Switched from dataclass to pydantic. Added run_pattern_detection_test functionality, applied to test_graph_sharding

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Restructured transformation execution logic. transformation_executor applies any generic transformations

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Detection + execution logic moved only to sharding. Transformation work on node.name

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Removed redundant params

Signed-off-by: greg-kwasniewski1 <[email protected]>

---------

Signed-off-by: greg-kwasniewski1 <[email protected]>
* Add sink/sliding window support for Triton

Signed-off-by: nvchenghaoz <[email protected]>

* Add the test and fix the functional implementations

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
* moving more transforms into the modular system

Signed-off-by: Lucas Liebenwein <[email protected]>

* fixes for some configs

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the torch ref implementation for new params.

Signed-off-by: nvchenghaoz <[email protected]>

* Remove comment

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
* Modular export patches + registry; fixes NVIDIA#5728

Signed-off-by: Lucas Liebenwein <[email protected]>

* patch library for models

Signed-off-by: Lucas Liebenwein <[email protected]>

* unit test fixes

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* fix overlap scheduler in AD

Signed-off-by: Suyog Gupta <[email protected]>

* cleanups

Signed-off-by: Suyog Gupta <[email protected]>

* fix nest sequences

Signed-off-by: Suyog Gupta <[email protected]>

* nits

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* cudagraph fixes + rms norm

Signed-off-by: Suyog Gupta <[email protected]>

* fix test

Signed-off-by: Suyog Gupta <[email protected]>

* revert ad_executor changes

Signed-off-by: Suyog Gupta <[email protected]>

* Review comments + make sure num_pages >= max batch size

* wrapping reviewer feedback and open items

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
…and BMM (fixes NVIDIA#5916) (#94)

* Updated tests

Signed-off-by: greg-kwasniewski1 <[email protected]>

* fixed tp sharding bug

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Fixed sharding tests

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Fixed sharding tests 1.1

Signed-off-by: greg-kwasniewski1 <[email protected]>

* import fix

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
* WIP for attention matching: repeat_kv, eager_attention_matching

Signed-off-by: Frida Hou <[email protected]>

* works e2e with llama2 and llama3.1, eager and sdpa

Signed-off-by: Frida Hou <[email protected]>

* update for unit test test_attention_matcher

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* unify into one transformation, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* update hf_test to verify transformed output, update move_to_devide to recompile graph

Signed-off-by: Frida Hou <[email protected]>

* update after rebase

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* update docstring

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* Change the all-reduce strategy to NCCL

When the strategy is set to AUTO and world_size>1 we experience hangs and CUDA
memory errors.

* This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
* Without this change test test_ad_build_small_multi.py fails (tp==2)
* This is a temporary change until we understand why this hang is happening.
* On dllcuster this issue does not manifest.

Signed-off-by: Neta Zmora <[email protected]>

* Re-enable test_ad_build_small_multi.py

tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py

Signed-off-by: Neta Zmora <[email protected]>

* fix kvcache mem size compute - convert to MB

Signed-off-by: Gal Agam <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
@h-guo18 h-guo18 requested review from a team as code owners July 24, 2025 03:45
@h-guo18 h-guo18 requested review from FrankD412 and lucaslie July 24, 2025 03:45
@coderabbitai
Contributor

coderabbitai bot commented Jul 24, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This update introduces a major modularization and refactor of the AutoDeploy graph transformation, export, and configuration system. Key changes include new modular export and transform frameworks, dynamic YAML-based config merging, new backend-specific custom ops, quantization and sharding enhancements, and extensive updates to test infrastructure. Numerous new modules and classes were added, legacy transformation code was deprecated or replaced, and documentation was expanded.

Changes

File(s) / path(s), with change summary:

  • examples/auto_deploy/.vscode/launch.json, README.md, build_and_run_ad.py: Refactored experiment config for dynamic YAML merging, enhanced CLI arg parsing, clarified docs, and improved prompt/model kwarg handling.
  • tensorrt_llm/_torch/auto_deploy/export/, transform/, utils/_config.py: Introduced modular export and transform frameworks with patch/transform registries (see the registry sketch after this list), deep YAML config merging, and new patch/transform libraries.
  • tensorrt_llm/_torch/auto_deploy/llm_args.py, models/hf.py, shim/ad_executor.py: Refactored LLM argument/config structure for stricter validation, dynamic merging, and support for new fields (e.g., max_beam_width).
  • tensorrt_llm/_torch/auto_deploy/custom_ops/, custom_ops/torch_backend_attention.py, custom_ops/rms_norm.py, custom_ops/torch_moe.py: Added/extended custom ops for Torch/Triton/FlashInfer backends, including new RMSNorm and quantized MoE implementations, and enhanced attention ops with sinks and sliding-window support.
  • tensorrt_llm/_torch/auto_deploy/transform/library/, transformations/library/: Added new graph transforms for model building, export, RMSNorm fusion, quantized MoE, input constraint cleanup, and more.
  • tensorrt_llm/_torch/auto_deploy/transformations/ (legacy): Deprecated/replaced legacy transformation and export utilities; removed or refactored direct graph returns to in-place mutation.
  • tensorrt_llm/_torch/auto_deploy/models/patches/: Modularized model-specific export patches; encapsulated monkey-patches in patch classes for registry-based management.
  • tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py, utils/node_utils.py: Added quantization skip/extract helpers; improved pattern matching and node filtering utilities.
  • tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py: Refactored sharding logic: introduced typed config objects, deferred application, and modularized detection/execution of TP/BMM/EP sharding.
  • tensorrt_llm/_torch/auto_deploy/models/__init__.py, models/factory.py: Reduced wildcard imports; minor signature simplifications.
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py: CUDA graph batch sizes now sorted descending; introduced memory pool reuse and warm-up logging.
  • requirements.txt, setup.py: Added omegaconf and enabled YAML support for pydantic-settings; included YAML files in package data.
  • tensorrt_llm/bench/benchmark/throughput.py: Adjusted backend-specific argument handling for AutoDeployLLM instantiation.
  • tests/unittest/_torch/auto_deploy/: Extensive test refactor: new reference modules, quantized MoE/attention tests, modular optimizer usage, pattern detection tests, and in-place transform handling.
  • Miscellaneous (models/patches/*.py, transformations/*.py, etc.): Minor comment, import, and docstring updates; migration to new modular interfaces.
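
The patch/transform registries mentioned above generally follow a decorator-based registration pattern. Here is a minimal, self-contained sketch of that idea; the class name, method names, and the empty transform body are illustrative assumptions rather than the actual tensorrt_llm API:

    from typing import Callable, Dict

    import torch.fx as fx

    class TransformRegistry:
        """Illustrative registry mapping a transform name to a function that mutates a GraphModule in place."""

        _transforms: Dict[str, Callable[[fx.GraphModule], None]] = {}

        @classmethod
        def register(cls, name: str):
            def decorator(fn: Callable[[fx.GraphModule], None]):
                cls._transforms[name] = fn
                return fn
            return decorator

        @classmethod
        def get(cls, name: str) -> Callable[[fx.GraphModule], None]:
            return cls._transforms[name]

    @TransformRegistry.register("fuse_rmsnorm")
    def fuse_rmsnorm(gm: fx.GraphModule) -> None:
        # Would match the decomposed RMSNorm pattern and swap in a fused op; mutates gm in place and
        # returns None, consistent with the in-place transformation signatures adopted in this change set.
        ...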

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Script
    participant ConfigLoader
    participant ExportRegistry
    participant TransformRegistry
    participant ModelFactory
    participant GraphModule
    participant CustomOps

    User->>CLI/Script: Launch with CLI args/YAML
    CLI/Script->>ConfigLoader: Parse CLI args, merge YAML configs
    ConfigLoader->>ConfigLoader: Deep merge, validate config
    CLI/Script->>ModelFactory: Create model factory from config
    CLI/Script->>ExportRegistry: Apply export patches (as context managers)
    ExportRegistry->>ModelFactory: Build model (possibly on meta device)
    ModelFactory->>GraphModule: Export model to FX graph
    ExportRegistry->>GraphModule: Deduplicate params, clean up devices
    CLI/Script->>TransformRegistry: Apply graph transforms in stage order
    TransformRegistry->>GraphModule: Apply transforms (e.g., fuse RMSNorm, quantize MoE)
    GraphModule->>CustomOps: Replace patterns with backend-specific ops
    CLI/Script->>GraphModule: Finalize, run inference/benchmark
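
The "Apply export patches (as context managers)" step in the diagram can be pictured with a small, hypothetical sketch: each patch temporarily swaps a function for an export-friendly wrapper and restores it afterwards, and a stack of such patches stays active only for the duration of the export. The patch below is a stand-in, not the actual SDPA patch shipped in this change:

    import contextlib
    from contextlib import contextmanager

    import torch.nn.functional as F

    @contextmanager
    def sdpa_export_patch():
        # Hypothetical export patch: wrap scaled_dot_product_attention during export, then restore it.
        original = F.scaled_dot_product_attention

        def export_safe_sdpa(*args, **kwargs):
            # A real patch would substitute an export-friendly implementation here.
            return original(*args, **kwargs)

        F.scaled_dot_product_attention = export_safe_sdpa
        try:
            yield
        finally:
            F.scaled_dot_product_attention = original

    with contextlib.ExitStack() as stack:
        for patch in (sdpa_export_patch,):  # in practice, patches would come from the export patch registry
            stack.enter_context(patch())
        # ... the FX export of the model would run here ...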

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

This is a critical, high-complexity refactor involving new frameworks, deep config changes, new custom ops, sharding/quantization logic, and extensive test updates across many files.

Possibly related PRs

  • [AutoDeploy] merge feat/ad-2025-07-07 #6196: Shares identical changes to launch config, README, experiment config refactor, dynamic YAML merging, CLI argument parsing, and export submodule import; these changes are directly related at the code level.

Suggested labels

Community want to contribute

Suggested reviewers

  • shaharmor98
  • nv-guomingz
  • litaotju

Poem

A rabbit hops through fields of code,
With YAML, ops, and graphs bestowed.
It patches, fuses, quantizes, too—
Modular magic, configs anew!
From sharding fields to custom norms,
This bunny leaps through transform storms.
Review this garden, see it bloom—
🐇✨ Modular AutoDeploy in full costume!


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.


@h-guo18 h-guo18 closed this Jul 24, 2025
