Move Fuse RMSNorm to new Inf Optimizer #6318
Conversation
…rmations to return None (#71)
* Refactor the signatures of AD graph transformations to return None (NVIDIA#5249): change them from `gm = transformation(gm)` to `transformation(gm)`. Since AD graph transformations modify the input GraphModule in place, the previous signature style was misleading.
Signed-off-by: Gal Hubara Agam <[email protected]>
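As a minimal sketch of the convention this commit describes (illustrative only; the dropout-removal pass below is a hypothetical example, not AutoDeploy code):

```python
import torch
from torch.fx import GraphModule

# Old, misleading style implied a new module was returned:
#   gm = transformation(gm)
# New style: the transform mutates the GraphModule in place and returns None.
def drop_inference_noops(gm: GraphModule) -> None:
    """Example in-place transform: remove dropout nodes, a no-op at inference."""
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target is torch.nn.functional.dropout:
            node.replace_all_uses_with(node.args[0])  # forward the input tensor
            gm.graph.erase_node(node)
    gm.recompile()  # regenerate forward() after editing the graph
```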
…ion (#76)
* Fix trtllm-bench test and enable trtllm-bench integration
* Remove unneeded __init__.py
Signed-off-by: Neta Zmora <[email protected]>
) (#73)
* yaml config loader for dynamic settings
* updates for yaml mixin
* addressing reviewer feedback
Signed-off-by: Lucas Liebenwein <[email protected]>
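The YAML loader commit above implies a deep merge of layered config files; a hedged sketch of that idea follows (the `deep_merge` helper and file layout are assumptions for illustration, not the PR's actual loader):

```python
from typing import Any, Dict
import yaml

def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge `override` into `base`; nested dicts are merged,
    scalar and list values in `override` win."""
    merged = dict(base)
    for key, value in override.items():
        if key in merged and isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def load_configs(*paths: str) -> Dict[str, Any]:
    """Merge YAML files left to right; later files override earlier ones."""
    config: Dict[str, Any] = {}
    for path in paths:
        with open(path) as f:
            config = deep_merge(config, yaml.safe_load(f) or {})
    return config
```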
* [AutoDeploy] Refining AD configurability
* addressed reviewer feedback
Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the Torch backend and update the test to use the torch backend
* Add the sinks and fix the test failures
* address reviewer's comments
* use custom op convention
* move the ref to the utils_test
* Add torch backend tests in ad_build_small_single.py
* Address hidden comments...
Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
* add torch_fp8_moe and fp8 linear support in pattern matcher, update unit tests
* add torch_fp4_moe and fp4 support in pattern matcher; the unit test has an accuracy issue and e2e Mixtral FP4 hits a kernel error without MoE matching
* add pre-commit hook
* hacky fix for e2e run of Mixtral FP4 and the fp4 op unit test
* EP support for torch_fp4_moe and torch_fp8_moe
* fix rebase: op rename, shard_load_hook bug in FP4
* fix pre-commit
* fix weight-loading load_hook issue for FP4; update function to handle exclude_modules in hf_quant_config
* addressing feedback: add MoE op template, update op names, other minor refinements
* move common functionality to utility
* fix FP4QuantizationImpl register from rebase
* add quantize_moe pass for patched torch_moe op
* add transformation unit tests for FP8 and FP4
* update should_skip_quantization to fix bmm unit test
* update BMMDynamicModel and utils to extract weight for the dynamic BMM case
* update BMMDynamicModel to drop linear op
* minor
Signed-off-by: Frida Hou <[email protected]>
* remove assert, add qwen small to tests
* lint
Signed-off-by: Suyog Gupta <[email protected]>
* fix overlap scheduler in AD
* cleanups
* fix nest sequences
* nits
* avoid hardcoding max beam width
* clean logic and max_beam_width arg
Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
…IA#4367, NVIDIA#4366 (#84) Signed-off-by: Lucas Liebenwein <[email protected]>
NVIDIA#5916) (#86)
* introduced basic sharding config logic
* transformation_executor works for TP parallelism; updated test_graph_sharding
* switched from dataclass to pydantic; added run_pattern_detection_test functionality, applied to test_graph_sharding
* restructured transformation execution logic; transformation_executor applies any generic transformations
* detection + execution logic moved into sharding only; transformations work on node.name
* removed redundant params
Signed-off-by: greg-kwasniewski1 <[email protected]>
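A hedged sketch of what a pydantic-based sharding config of this kind could look like (class and field names are hypothetical, not the PR's actual models):

```python
from typing import List, Literal
from pydantic import BaseModel, Field

class ShardingTransform(BaseModel):
    """One detected sharding opportunity, keyed by node name."""
    node_name: str
    strategy: Literal["column", "row"] = "column"

class ShardingConfig(BaseModel):
    """Config consumed by a transformation executor applying TP sharding."""
    world_size: int = Field(default=1, ge=1)  # validated, unlike a dataclass
    transforms: List[ShardingTransform] = []
```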
* Add sink/sliding-window support for Triton
* Add the test and fix the functional implementations
Signed-off-by: nvchenghaoz <[email protected]>
This reverts commit a37797b. Signed-off-by: Lucas Liebenwein <[email protected]>
* moving more transforms into the modular system
* fixes for some configs
Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the torch ref implementation for new params
* Remove comment
Signed-off-by: nvchenghaoz <[email protected]>
* Modular export patches + registry; fixes NVIDIA#5728
* patch library for models
* unit test fixes
* addressing reviewer feedback
Signed-off-by: Lucas Liebenwein <[email protected]>
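The commit describes export patches applied as context managers through a registry; a hedged sketch of that pattern (registry and patch names are assumptions, not the actual AutoDeploy API):

```python
import contextlib
from typing import Callable, Dict, Iterable, Iterator

# Registry mapping patch names to context-manager factories.
_EXPORT_PATCHES: Dict[str, Callable[[], contextlib.AbstractContextManager]] = {}

def register_export_patch(name: str):
    """Decorator registering a context-manager factory under `name`."""
    def wrapper(fn):
        _EXPORT_PATCHES[name] = fn
        return fn
    return wrapper

@register_export_patch("skip_weight_init")
@contextlib.contextmanager
def skip_weight_init() -> Iterator[None]:
    # Hypothetical patch: temporarily disable weight init so the model
    # can be built cheaply (e.g., on the meta device) before export.
    yield

@contextlib.contextmanager
def apply_export_patches(names: Iterable[str]) -> Iterator[None]:
    """Enter all requested patches for the duration of the export."""
    with contextlib.ExitStack() as stack:
        for name in names:
            stack.enter_context(_EXPORT_PATCHES[name]())
        yield
```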
* fix overlap scheduler in AD
* cleanups
* fix nest sequences
* nits
* avoid hardcoding max beam width
* cudagraph fixes + rms norm
* fix test
* revert ad_executor changes
* review comments + make sure num_pages >= max batch size
* wrapping reviewer feedback and open items
Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
…and BMM (fixes NVIDIA#5916) (#94)
* Updated tests
* fixed tp sharding bug
* Fixed sharding tests
* Fixed sharding tests 1.1
* import fix
Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
* WIP for attention matching: repeat_kv, eager_attention_matching
* works e2e with Llama 2 and Llama 3.1, eager and SDPA
* update for unit test test_attention_matcher
* unify into one transformation, update unit tests
* update hf_test to verify transformed output; update move_to_device to recompile the graph
* update after rebase
* update docstring
* minor
Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
…)" This reverts commit c245cf3.
This reverts commit a8b54f9.
* Change the all-reduce strategy to NCCL. When the strategy is set to AUTO and world_size > 1, we experience hangs and CUDA memory errors:
  * This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
  * Without this change, test_ad_build_small_multi.py fails (tp==2)
  * This is a temporary change until we understand why the hang is happening
  * On dllcuster this issue does not manifest
* Re-enable test_ad_build_small_multi.py (tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py)
* fix kv-cache mem size compute: convert to MB
Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]>
Caution: Review failed. The pull request is closed.

Walkthrough

This update introduces a major modularization and refactor of the AutoDeploy graph transformation, export, and configuration system. Key changes include new modular export and transform frameworks, dynamic YAML-based config merging, new backend-specific custom ops, quantization and sharding enhancements, and extensive updates to test infrastructure. Numerous new modules and classes were added, legacy transformation code was deprecated or replaced, and documentation was expanded.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI/Script
    participant ConfigLoader
    participant ExportRegistry
    participant TransformRegistry
    participant ModelFactory
    participant GraphModule
    participant CustomOps
    User->>CLI/Script: Launch with CLI args/YAML
    CLI/Script->>ConfigLoader: Parse CLI args, merge YAML configs
    ConfigLoader->>ConfigLoader: Deep merge, validate config
    CLI/Script->>ModelFactory: Create model factory from config
    CLI/Script->>ExportRegistry: Apply export patches (as context managers)
    ExportRegistry->>ModelFactory: Build model (possibly on meta device)
    ModelFactory->>GraphModule: Export model to FX graph
    ExportRegistry->>GraphModule: Deduplicate params, clean up devices
    CLI/Script->>TransformRegistry: Apply graph transforms in stage order
    TransformRegistry->>GraphModule: Apply transforms (e.g., fuse RMSNorm, quantize MoE)
    GraphModule->>CustomOps: Replace patterns with backend-specific ops
    CLI/Script->>GraphModule: Finalize, run inference/benchmark
```
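The staged transform registry in the diagram can be pictured roughly as follows; this is a hedged sketch, and the class, decorator, and stage numbers are assumptions rather than the actual AutoDeploy interfaces:

```python
from typing import Callable, Dict, List, Tuple
from torch.fx import GraphModule

TransformFn = Callable[[GraphModule], None]  # transforms mutate gm in place

class TransformRegistry:
    _transforms: Dict[str, Tuple[int, TransformFn]] = {}

    @classmethod
    def register(cls, name: str, stage: int) -> Callable[[TransformFn], TransformFn]:
        """Decorator: file a transform under `name` at the given stage."""
        def wrapper(fn: TransformFn) -> TransformFn:
            cls._transforms[name] = (stage, fn)
            return fn
        return wrapper

    @classmethod
    def apply_all(cls, gm: GraphModule, enabled: List[str]) -> None:
        """Run the enabled transforms in ascending stage order."""
        for _, fn in sorted((cls._transforms[n] for n in enabled), key=lambda t: t[0]):
            fn(gm)

@TransformRegistry.register("fuse_rmsnorm", stage=2)
def fuse_rmsnorm(gm: GraphModule) -> None:
    ...  # rewrite decomposed RMSNorm subgraphs into a single fused op
```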
Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90+ minutes

This is a critical, high-complexity refactor involving new frameworks, deep config changes, new custom ops, sharding/quantization logic, and extensive test updates across many files.
Description
Issue #4403. This PR moves only the fuse_rmsnorm transform to the new inference optimizer.
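For orientation, the kind of rewrite a fuse_rmsnorm pass performs can be sketched with torch.fx's subgraph rewriter; this is illustrative only, not the PR's implementation (the real pass is registered with the new inference optimizer and targets an actual fused kernel):

```python
import torch
from torch.fx import GraphModule, subgraph_rewriter

def rmsnorm_pattern(x, weight, eps):
    # Decomposed RMSNorm as it typically appears in exported graphs.
    variance = x.pow(2).mean(-1, keepdim=True)
    return weight * (x * torch.rsqrt(variance + eps))

def fused_rmsnorm(x, weight, eps):
    # Stand-in for a fused kernel call with the same semantics.
    return weight * (x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps))

def fuse_rmsnorm(gm: GraphModule) -> None:
    """In-place transform: collapse decomposed RMSNorm subgraphs to one call."""
    subgraph_rewriter.replace_pattern(gm, rmsnorm_pattern, fused_rmsnorm)
    gm.recompile()
```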
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

run

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
* --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
* --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
* --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
* --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: does NOT update GitHub check status.
* --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
* --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
* --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
* --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
* --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
* --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
* --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
* --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
* --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
* --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: does NOT update GitHub check status.
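A few illustrative invocations combining the flags documented above (stage and GPU names are the examples from the help text, not a recommendation for this PR):

```
/bot run --disable-fail-fast
/bot run --stage-list "A10-PyTorch-1" --gpu-type "A30"
/bot run --post-merge
```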
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill

Kill all running builds associated with the pull request.
skip
skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.