
Conversation

@h-guo18
Collaborator

@h-guo18 h-guo18 commented Jul 24, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a modular, staged graph transformation and export pipeline for PyTorch models, including dynamic YAML configuration, deep merging, and advanced CLI argument handling (see the configuration-merge sketch after this summary).
    • Added backend-specific RMSNorm, quantized MoE (FP8/NVFP4), and Torch attention custom operators, with corresponding graph pattern fusion transforms.
    • Integrated a flexible patch system for export compatibility (e.g., SDPA, ModuleList, linear ops, meta device, transformers).
    • Added a comprehensive inference optimizer for efficient model deployment.
  • Enhancements

    • Improved configuration validation, error checking, and support for nested YAML files.
    • Expanded attention and MoE quantization support, including new test coverage.
    • Optimized CUDA graph capture with memory pool reuse and improved logging.
  • Bug Fixes

    • Fixed parameter deduplication, device info cleanup, and input constraint handling in exported graphs.
    • Corrected sharding, caching, and input preparation logic for distributed and single-GPU scenarios.
  • Documentation

    • Updated and expanded documentation for configuration, advanced usage, and expert options.
  • Tests

    • Added and refactored extensive unit and integration tests for new transforms, quantization, sharding, custom ops, and export compatibility.
  • Chores

    • Refactored and reorganized codebase for modularity, maintainability, and extensibility, including deprecating legacy transformation modules.
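
As a concrete illustration of the dynamic YAML configuration and deep merging called out above, the flow could look roughly like the following minimal sketch using omegaconf (which this change set adds to requirements.txt). The file paths and the dotlist override are illustrative assumptions, not the actual AutoDeploy implementation:

    from omegaconf import OmegaConf

    # Illustrative only: deep-merge a base config with an experiment config and a CLI-style override.
    base = OmegaConf.load("configs/default.yaml")             # hypothetical path
    experiment = OmegaConf.load("configs/experiment.yaml")    # hypothetical path
    cli = OmegaConf.from_dotlist(["args.max_beam_width=4"])   # hypothetical CLI override
    merged = OmegaConf.merge(base, experiment, cli)           # later sources win; nested dicts merge recursively
    print(OmegaConf.to_yaml(merged))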

Description

Issue #4403. This PR moves fuse_rmsnorm only.
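
For context, fuse_rmsnorm pattern-matches the decomposed RMSNorm subgraph in the exported FX graph and replaces it with a fused backend op. Below is a minimal sketch of the unfused reference pattern such a fusion typically targets (standard RMSNorm math; the fused op name and the transform registration API are intentionally not shown here):

    import torch

    def rmsnorm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
        # Decomposed RMSNorm as it commonly appears in exported HF-style graphs:
        # scale by the reciprocal root-mean-square over the last dimension, then apply the learned weight.
        variance = x.pow(2).mean(-1, keepdim=True)
        return weight * (x * torch.rsqrt(variance + eps))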

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
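
For example, an illustrative pre-merge invocation (posted as a PR comment) that disables fail-fast and restricts testing to a single documented stage:

    /bot run --disable-fail-fast --stage-list "A10-PyTorch-1"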

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of the tree.
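
An illustrative invocation (the comment text is just an example):

    /bot skip --comment "Docs-only change, CI not required"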

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action also kills all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of care and validation can break the top of the tree.

galagam and others added 30 commits July 21, 2025 07:25
…rmations to return None (#71)

* Refactor the signatures of AD graph transformations to return None (NVIDIA#5249)

Refactor signatures of AD graph transformations from
  gm = transformation(gm)
to
  transformation(gm)

Since the AD graph transformations modify the input GraphModule
in-place, the previous signature style was misleading.

Signed-off-by: Gal Hubara Agam <[email protected]>
…ion (#76)

* Fix trtllm-bench test and enable trtllm-bench integration

Signed-off-by: Neta Zmora <[email protected]>

* Remove unneeded __init__.py

Signed-off-by: Neta Zmora <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
) (#73)

* yaml config loader for dynamic settings

Signed-off-by: Lucas Liebenwein <[email protected]>

* updates for yaml mixin

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* [AutoDeploy] Refining AD configurability

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressed reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the Torch backend and update the test to use the torch backend.

Signed-off-by: nvchenghaoz <[email protected]>

* Add the sinks and fix the test failures

Signed-off-by: nvchenghaoz <[email protected]>

* address reviewer's comments

Signed-off-by: nvchenghaoz <[email protected]>

* use custom op convention

Signed-off-by: nvchenghaoz <[email protected]>

* move the ref to the utils_test

Signed-off-by: nvchenghaoz <[email protected]>

* Add torch backend tests in ad_build_small_single.py

Signed-off-by: nvchenghaoz <[email protected]>

* Address hidden comments...

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
* add torch_fp8_moe and fp8 linear support in pattern matcher, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* add torch-fp4-moe and fp4 support in pattern matcher, unit test has acc issue and e2e mixtral fp4 has kernel error wo moe matching

Signed-off-by: Frida Hou <[email protected]>

* add pre-commit hook

Signed-off-by: Frida Hou <[email protected]>

* hacky fix for e2e run of mixtral FP4 and fp4 op unit test

Signed-off-by: Frida Hou <[email protected]>

* EP support for torch_fp4_moe and torch_fp8_moe

Signed-off-by: Frida Hou <[email protected]>

* fix rebase: op rename, shard_load_hook bug in FP4

Signed-off-by: Frida Hou <[email protected]>

* fix pre-commit

Signed-off-by: Frida Hou <[email protected]>

* fix weight loading-load_hook issue for FP4, update function to handle exclude_modules in hf_quant_config

Signed-off-by: Frida Hou <[email protected]>

* addressing feedback, add moe op template, update op names,other minor refinements

Signed-off-by: Frida Hou <[email protected]>

* move common functionality to utility

Signed-off-by: Frida Hou <[email protected]>

* fix FP4QuantizationImpl register from rebase

Signed-off-by: Frida Hou <[email protected]>

* add quantize_moe pass for patched torch_moe op

Signed-off-by: Frida Hou <[email protected]>

* add transformation unit tests for FP8 and FP4

Signed-off-by: Frida Hou <[email protected]>

* update should_skip_quantization to fix bmm unit test

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel and utils to extract weight for dynamic BMM case

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel to drop linear op

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* remove assert, add qwen small to tests

* lint

Signed-off-by: Suyog Gupta <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
* fix overlap scheduler in AD

Signed-off-by: Suyog Gupta <[email protected]>

* cleanups

Signed-off-by: Suyog Gupta <[email protected]>

* fix nest sequences

Signed-off-by: Suyog Gupta <[email protected]>

* nits

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* clean logic and max_beam_width arg

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
NVIDIA#5916) (#86)

* introduced basic sharding config logic

* transformation_executor works for TP parallelism. Updated test_graph_sharding

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Switched from dataclass to pydantic. Added run_pattern_detection_test functionality, applied to test_graph_sharding

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Restructured transformation execution logic. transformation_executor applies any generic transformations

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Detection + execution logic moved only to sharding. Transformation work on node.name

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Removed redundant params

Signed-off-by: greg-kwasniewski1 <[email protected]>

---------

Signed-off-by: greg-kwasniewski1 <[email protected]>
* Add sink/sliding window support for Triton

Signed-off-by: nvchenghaoz <[email protected]>

* Add the test and fix the functional implementations

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
* moving more transforms into the modular system

Signed-off-by: Lucas Liebenwein <[email protected]>

* fixes for some configs

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the torch ref implementation for new params.

Signed-off-by: nvchenghaoz <[email protected]>

* Remove comment

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
* Modular export patches + registry; fixes NVIDIA#5728

Signed-off-by: Lucas Liebenwein <[email protected]>

* patch library for models

Signed-off-by: Lucas Liebenwein <[email protected]>

* unit test fixes

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* fix overlap scheduler in AD

Signed-off-by: Suyog Gupta <[email protected]>

* cleanups

Signed-off-by: Suyog Gupta <[email protected]>

* fix nest sequences

Signed-off-by: Suyog Gupta <[email protected]>

* nits

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* avoid hardcoding max beam width

Signed-off-by: Suyog Gupta <[email protected]>

* cudagraph fixes + rms norm

Signed-off-by: Suyog Gupta <[email protected]>

* fix test

Signed-off-by: Suyog Gupta <[email protected]>

* revert ad_executor changes

Signed-off-by: Suyog Gupta <[email protected]>

* Review comments + make sure num_pages >= max batch size

* wrapping reviewer feedback and open items

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
…and BMM (fixes NVIDIA#5916) (#94)

* Updated tests

Signed-off-by: greg-kwasniewski1 <[email protected]>

* fixed tp sharding bug

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Fixed sharding tests

Signed-off-by: greg-kwasniewski1 <[email protected]>

* Fixed sharding tests 1.1

Signed-off-by: greg-kwasniewski1 <[email protected]>

* import fix

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
* WIP for attention matching: repeat_kv, eager_attention_matching

Signed-off-by: Frida Hou <[email protected]>

* works e2e with llama2 and llama3.1, eager and sdpa

Signed-off-by: Frida Hou <[email protected]>

* update for unit test test_attention_matcher

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* unify into one transformation, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* update hf_test to verify transformed output, update move_to_devide to recompile graph

Signed-off-by: Frida Hou <[email protected]>

* update after rebase

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

* update docstring

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* Change the all-reduce strategy to NCCL

When the strategy is set to AUTO and world_size>1 we experience hangs and CUDA
memory errors.

* This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
* Without this change test test_ad_build_small_multi.py fails (tp==2)
* This is a temporary change until we understand why this hang is happening.
* On dllcuster this issue does not manifest.

Signed-off-by: Neta Zmora <[email protected]>

* Re-enable test_ad_build_small_multi.py

tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py

Signed-off-by: Neta Zmora <[email protected]>

* fix kvcache mem size compute - convert to MB

Signed-off-by: Gal Agam <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
@h-guo18 h-guo18 requested review from a team as code owners July 24, 2025 03:45
@h-guo18 h-guo18 requested review from FrankD412 and lucaslie July 24, 2025 03:45
@coderabbitai
Contributor

coderabbitai bot commented Jul 24, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This update introduces a major modularization and refactor of the AutoDeploy graph transformation, export, and configuration system. Key changes include new modular export and transform frameworks, dynamic YAML-based config merging, new backend-specific custom ops, quantization and sharding enhancements, and extensive updates to test infrastructure. Numerous new modules and classes were added, legacy transformation code was deprecated or replaced, and documentation was expanded.

Changes

File(s) / path(s), with change summary:

  • examples/auto_deploy/.vscode/launch.json, README.md, build_and_run_ad.py: Refactored experiment config for dynamic YAML merging, enhanced CLI arg parsing, clarified docs, and improved prompt/model kwarg handling.
  • tensorrt_llm/_torch/auto_deploy/export/, transform/, utils/_config.py: Introduced modular export and transform frameworks with patch/transform registries (see the registry sketch after this list), deep YAML config merging, and new patch/transform libraries.
  • tensorrt_llm/_torch/auto_deploy/llm_args.py, models/hf.py, shim/ad_executor.py: Refactored LLM argument/config structure for stricter validation, dynamic merging, and support for new fields (e.g., max_beam_width).
  • tensorrt_llm/_torch/auto_deploy/custom_ops/, custom_ops/torch_backend_attention.py, custom_ops/rms_norm.py, custom_ops/torch_moe.py: Added/extended custom ops for Torch/Triton/FlashInfer backends, including new RMSNorm and quantized MoE implementations, and enhanced attention ops with sinks and sliding-window support.
  • tensorrt_llm/_torch/auto_deploy/transform/library/, transformations/library/: Added new graph transforms for model building, export, RMSNorm fusion, quantized MoE, input constraint cleanup, and more.
  • tensorrt_llm/_torch/auto_deploy/transformations/ (legacy): Deprecated/replaced legacy transformation and export utilities; removed or refactored direct graph returns to in-place mutation.
  • tensorrt_llm/_torch/auto_deploy/models/patches/: Modularized model-specific export patches; encapsulated monkey-patches in patch classes for registry-based management.
  • tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py, utils/node_utils.py: Added quantization skip/extract helpers; improved pattern matching and node filtering utilities.
  • tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py: Refactored sharding logic: introduced typed config objects, deferred application, and modularized detection/execution of TP/BMM/EP sharding.
  • tensorrt_llm/_torch/auto_deploy/models/__init__.py, models/factory.py: Reduced wildcard imports; minor signature simplifications.
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py: CUDA graph batch sizes now sorted descending; introduced memory pool reuse and warm-up logging.
  • requirements.txt, setup.py: Added omegaconf and enabled YAML support for pydantic-settings; included YAML files in package data.
  • tensorrt_llm/bench/benchmark/throughput.py: Adjusted backend-specific argument handling for AutoDeployLLM instantiation.
  • tests/unittest/_torch/auto_deploy/: Extensive test refactor: new reference modules, quantized MoE/attention tests, modular optimizer usage, pattern detection tests, and in-place transform handling.
  • Miscellaneous (models/patches/*.py, transformations/*.py, etc.): Minor comment, import, and docstring updates; migration to new modular interfaces.
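
The patch/transform registries mentioned above generally follow a decorator-based registration pattern. Here is a minimal, self-contained sketch of that idea; the class name, method names, and the empty transform body are illustrative assumptions rather than the actual tensorrt_llm API:

    from typing import Callable, Dict

    import torch.fx as fx

    class TransformRegistry:
        """Illustrative registry mapping a transform name to a function that mutates a GraphModule in place."""

        _transforms: Dict[str, Callable[[fx.GraphModule], None]] = {}

        @classmethod
        def register(cls, name: str):
            def decorator(fn: Callable[[fx.GraphModule], None]):
                cls._transforms[name] = fn
                return fn
            return decorator

        @classmethod
        def get(cls, name: str) -> Callable[[fx.GraphModule], None]:
            return cls._transforms[name]

    @TransformRegistry.register("fuse_rmsnorm")
    def fuse_rmsnorm(gm: fx.GraphModule) -> None:
        # Would match the decomposed RMSNorm pattern and swap in a fused op; mutates gm in place and
        # returns None, consistent with the in-place transformation signatures adopted in this change set.
        ...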

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Script
    participant ConfigLoader
    participant ExportRegistry
    participant TransformRegistry
    participant ModelFactory
    participant GraphModule
    participant CustomOps

    User->>CLI/Script: Launch with CLI args/YAML
    CLI/Script->>ConfigLoader: Parse CLI args, merge YAML configs
    ConfigLoader->>ConfigLoader: Deep merge, validate config
    CLI/Script->>ModelFactory: Create model factory from config
    CLI/Script->>ExportRegistry: Apply export patches (as context managers)
    ExportRegistry->>ModelFactory: Build model (possibly on meta device)
    ModelFactory->>GraphModule: Export model to FX graph
    ExportRegistry->>GraphModule: Deduplicate params, clean up devices
    CLI/Script->>TransformRegistry: Apply graph transforms in stage order
    TransformRegistry->>GraphModule: Apply transforms (e.g., fuse RMSNorm, quantize MoE)
    GraphModule->>CustomOps: Replace patterns with backend-specific ops
    CLI/Script->>GraphModule: Finalize, run inference/benchmark
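
The "Apply export patches (as context managers)" step in the diagram can be pictured with a small, hypothetical sketch: each patch temporarily swaps a function for an export-friendly wrapper and restores it afterwards, and a stack of such patches stays active only for the duration of the export. The patch below is a stand-in, not the actual SDPA patch shipped in this change:

    import contextlib
    from contextlib import contextmanager

    import torch.nn.functional as F

    @contextmanager
    def sdpa_export_patch():
        # Hypothetical export patch: wrap scaled_dot_product_attention during export, then restore it.
        original = F.scaled_dot_product_attention

        def export_safe_sdpa(*args, **kwargs):
            # A real patch would substitute an export-friendly implementation here.
            return original(*args, **kwargs)

        F.scaled_dot_product_attention = export_safe_sdpa
        try:
            yield
        finally:
            F.scaled_dot_product_attention = original

    with contextlib.ExitStack() as stack:
        for patch in (sdpa_export_patch,):  # in practice, patches would come from the export patch registry
            stack.enter_context(patch())
        # ... the FX export of the model would run here ...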

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

This is a critical, high-complexity refactor involving new frameworks, deep config changes, new custom ops, sharding/quantization logic, and extensive test updates across many files.

Possibly related PRs

  • [AutoDeploy] merge feat/ad-2025-07-07 #6196: Shares identical changes to launch config, README, experiment config refactor, dynamic YAML merging, CLI argument parsing, and export submodule import; these changes are directly related at the code level.

Suggested labels

Community want to contribute

Suggested reviewers

  • shaharmor98
  • nv-guomingz
  • litaotju

Poem

A rabbit hops through fields of code,
With YAML, ops, and graphs bestowed.
It patches, fuses, quantizes, too—
Modular magic, configs anew!
From sharding fields to custom norms,
This bunny leaps through transform storms.
Review this garden, see it bloom—
🐇✨ Modular AutoDeploy in full costume!


🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.


@h-guo18 h-guo18 closed this Jul 24, 2025
