DRAFT: Prepare inputs optimizations #6518
Conversation
…rmations to return None (#71)

* Refactor the signatures of AD graph transformations to return None (NVIDIA#5249): change the signature from `gm = transformation(gm)` to `transformation(gm)`. Since AD graph transformations modify the input GraphModule in-place, the previous signature style was misleading.

Signed-off-by: Gal Hubara Agam <[email protected]>
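A minimal sketch of the signature change this commit describes; the transform body below is illustrative, not the actual AutoDeploy pass.

```python
from torch.fx import GraphModule

# Old style (misleading, since the input is mutated anyway):
#   gm = transformation(gm)
#
# New style: mutate the GraphModule in place and return None.
def example_transformation(gm: GraphModule) -> None:
    """Illustrative in-place pass: drop dead nodes and regenerate the module code."""
    gm.graph.eliminate_dead_code()
    gm.recompile()
```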
…ion (#76)

* Fix trtllm-bench test and enable trtllm-bench integration
* Remove unneeded __init__.py

Signed-off-by: Neta Zmora <[email protected]>
) (#73)

* yaml config loader for dynamic settings
* updates for yaml mixin
* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>
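A hedged sketch of the deep-merge behavior a YAML config loader like this typically provides; the file name, keys, and `deep_merge` helper are all hypothetical, not the actual AutoDeploy loader.

```python
import os
import yaml

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override values win on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"transforms": {"quantize": {"enabled": False}}}  # hypothetical built-in defaults

overrides = {}
if os.path.exists("ad_overrides.yaml"):  # hypothetical user-supplied YAML
    with open("ad_overrides.yaml") as f:
        overrides = yaml.safe_load(f) or {}

config = deep_merge(defaults, overrides)
```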
* [AutoDeploy] Refining AD configurability
* addressed reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the Torch backend and update the test to use the torch backend
* Add the sinks and fix the test failures
* address reviewer's comments
* use custom op convention
* move the ref to the utils_test
* Add torch backend tests in ad_build_small_single.py
* Address hidden comments

Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
* add torch_fp8_moe and fp8 linear support in pattern matcher, update unit tests
* add torch_fp4_moe and fp4 support in pattern matcher; the unit test has an accuracy issue and e2e mixtral fp4 hits a kernel error without moe matching
* add pre-commit hook
* hacky fix for e2e run of mixtral FP4 and fp4 op unit test
* EP support for torch_fp4_moe and torch_fp8_moe
* fix rebase: op rename, shard_load_hook bug in FP4
* fix pre-commit
* fix weight-loading load_hook issue for FP4; update function to handle exclude_modules in hf_quant_config
* addressing feedback: add moe op template, update op names, other minor refinements
* move common functionality to utility
* fix FP4QuantizationImpl register from rebase
* add quantize_moe pass for patched torch_moe op
* add transformation unit tests for FP8 and FP4
* update should_skip_quantization to fix bmm unit test
* update BMMDynamicModel and utils to extract weight for dynamic BMM case
* update BMMDynamicModel to drop linear op
* minor fixes

Signed-off-by: Frida Hou <[email protected]>
* remove assert, add qwen small to tests
* lint

Signed-off-by: Suyog Gupta <[email protected]>
* fix overlap scheduler in AD
* cleanups
* fix nest sequences
* nits
* avoid hardcoding max beam width
* clean logic and max_beam_width arg

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
…IA#4367, NVIDIA#4366 (#84)

Signed-off-by: Lucas Liebenwein <[email protected]>
NVIDIA#5916) (#86)

* introduced basic sharding config logic
* transformation_executor works for TP parallelism; updated test_graph_sharding
* switched from dataclass to pydantic; added run_pattern_detection_test functionality, applied to test_graph_sharding
* restructured transformation execution logic; transformation_executor applies any generic transformations
* detection + execution logic moved only to sharding; transformations work on node.name
* removed redundant params

Signed-off-by: greg-kwasniewski1 <[email protected]>
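The commit mentions moving the sharding config from a dataclass to pydantic; below is a small sketch of what such a model could look like. All field and class names are invented for illustration and are not the actual TensorRT-LLM classes.

```python
from enum import Enum
from typing import List
from pydantic import BaseModel, Field

class SplitDim(str, Enum):
    ROW = "row"
    COLUMN = "column"

class TPShardSpec(BaseModel):
    """One detected sharding opportunity, keyed by node name as the commit describes."""
    node_name: str
    split_dim: SplitDim = SplitDim.COLUMN
    world_size: int = Field(default=1, ge=1)

class ShardingConfig(BaseModel):
    tp_shards: List[TPShardSpec] = Field(default_factory=list)
```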
* Add sink/sliding window support for Triton
* Add the test and fix the functional implementations

Signed-off-by: nvchenghaoz <[email protected]>
This reverts commit a37797b.

Signed-off-by: Lucas Liebenwein <[email protected]>
* moving more transforms into the modular system
* fixes for some configs

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the torch ref implementation for new params
* Remove comment

Signed-off-by: nvchenghaoz <[email protected]>
* Modular export patches + registry; fixes NVIDIA#5728
* patch library for models
* unit test fixes
* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>
* fix overlap scheduler in AD
* cleanups
* fix nest sequences
* nits
* avoid hardcoding max beam width
* cudagraph fixes + rms norm
* fix test
* revert ad_executor changes
* review comments + make sure num_pages >= max batch size
* wrapping reviewer feedback and open items

Signed-off-by: Suyog Gupta <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
…and BMM (fixes NVIDIA#5916) (#94)

* Updated tests
* fixed tp sharding bug
* Fixed sharding tests
* Fixed sharding tests 1.1
* import fix

Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
* WIP for attention matching: repeat_kv, eager_attention_matching
* works e2e with llama2 and llama3.1, eager and sdpa
* update for unit test test_attention_matcher
* unify into one transformation, update unit tests
* update hf_test to verify transformed output; update move_to_devide to recompile graph
* update after rebase
* update docstring
* minor fixes

Signed-off-by: Frida Hou <[email protected]>
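For context, the repeat_kv pattern these attention-matching commits target is the widely used grouped-query-attention helper from Hugging Face Llama-style models; a reference version is sketched below. This is the pattern being matched, not the AutoDeploy matcher itself.

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand KV heads from (batch, num_kv_heads, seq, head_dim) to
    (batch, num_kv_heads * n_rep, seq, head_dim) so they line up with the query heads."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, slen, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)
```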
Signed-off-by: nvchenghaoz <[email protected]>
…)" This reverts commit c245cf3.
This reverts commit a8b54f9.
* Change the all-reduce strategy to NCCL. When the strategy is set to AUTO and world_size > 1 we experience hangs and CUDA memory errors.
  * This is the same issue as https://nvbugspro.nvidia.com/bug/5331013
  * Without this change, test_ad_build_small_multi.py fails (tp==2)
  * This is a temporary change until we understand why this hang is happening.
  * On dllcuster this issue does not manifest.
* Re-enable test_ad_build_small_multi.py (tests/unittest/_torch/auto_deploy/unit/multigpu/test_ad_build_small_multi.py)
* fix kvcache mem size compute - convert to MB

Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Gal Agam <[email protected]>
Co-authored-by: Gal Agam <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]>
* attention matcher with torch._inductor pattern matcher: matching repeat_kv, sdpa and group attention; update unit tests
* Fix the torch backend Attention
* Revert "attention matcher with torch._inductor pattern matcher, matching repeat kv, sdpa and group attention, update unit tests" (this reverts commit 5743fb3)

Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
…tcher (#101)

* attention matcher with torch._inductor pattern matcher: matching repeat_kv, sdpa and group attention; update unit tests
* update matcher to only handle causal attn mask and set is_causal=True
* separate into three transformations

Signed-off-by: Frida Hou <[email protected]>
* improve error handling and graph clean-up
* fix: avoid modifying the immutable type TransformInfo

Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: haoguo <[email protected]>
Co-authored-by: haoguo <[email protected]>
* attention matcher with torch._inductor pattern matcher: matching repeat_kv, sdpa and group attention; update unit tests
* Update the torch ref op
* Revert "attention matcher with torch._inductor pattern matcher, matching repeat kv, sdpa and group attention, update unit tests" (this reverts commit 5743fb3)

Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou <[email protected]>
* refactor: move quantization and quant_moe to the new inf optimizer
* refactor: use quant_config from the factory instead of a new config type
* refactor: delete old files; update default.yaml
* move helper class FakeFacotry to _graph_test_helpers.py
* polish: remove unreachable branch in quantization.py
* style: run pre-commit
* fix to fetch hf_quant_config from the fetched dir

Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Co-authored-by: Fridah-nv <[email protected]>
…121) Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
* refactor: merge attn updates; move to the new inf optimizer
* minor: fix import
* doc: fix file docstring
* apply review suggestions to tensorrt_llm/_torch/auto_deploy/transform/library/attention.py
* polish: use config.run_shape_prop for shape prop
* polish: remove redundant canonicalize()

Signed-off-by: haoguo <[email protected]>
Signed-off-by: h-guo18 <[email protected]>
Co-authored-by: Lucas Liebenwein <[email protected]>
…s and refactored (#119)

* refactored compile_limit
* removed changes made to TorchCompileCompiler
* set cache_size_limit in TorchCompileCompiler

Signed-off-by: Eran Geva <[email protected]>
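The cache_size_limit mentioned above is torch._dynamo's per-function recompilation cap. Below is a hedged sketch of raising it; the value chosen and the exact place TorchCompileCompiler sets it are assumptions, not the actual integration.

```python
import torch
import torch._dynamo as dynamo

# Allow more recompilations per compiled function before torch.compile falls back to eager.
dynamo.config.cache_size_limit = 64  # illustrative value; the default is version-dependent

@torch.compile
def scaled(x: torch.Tensor) -> torch.Tensor:
    return x * 2.0
```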
Caution: Review failed. The pull request is closed.

Walkthrough

This update introduces a major overhaul of the AutoDeploy framework for PyTorch model deployment and optimization. Key changes include a modular transformation pipeline, deep YAML configuration support, enhanced quantization and sharding transforms, new backend and custom operator support, and a comprehensive refactor of testing utilities. Many transformations now operate in-place, and the export/patching logic is modularized for extensibility.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant CLI/Script
participant ConfigLoader
participant ModelFactory
participant TransformPipeline
participant ExportSystem
participant CustomOps
participant OptimizedModel
User->>CLI/Script: Launch with CLI args/YAML configs
CLI/Script->>ConfigLoader: Parse, merge, and validate configs
ConfigLoader->>ModelFactory: Instantiate with config
ModelFactory->>TransformPipeline: Build initial model
TransformPipeline->>ExportSystem: Export to FX GraphModule (with patches)
ExportSystem->>CustomOps: Register and patch ops as needed
TransformPipeline->>TransformPipeline: Apply transforms (quantization, sharding, fusion, etc.)
TransformPipeline->>OptimizedModel: Return optimized model/graph
OptimizedModel-->>User: Ready for inference/deployment
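A hedged Python sketch of the flow in the walkthrough and diagram above: a registry of in-place graph transforms applied in configured stages. All class and function names here are illustrative stand-ins, not the actual AutoDeploy API.

```python
from typing import Callable, Dict, List
from torch.fx import GraphModule

# Hypothetical registry of in-place graph transforms (quantization, sharding, fusion, ...).
TRANSFORMS: Dict[str, Callable[[GraphModule], None]] = {}

def register_transform(name: str):
    def decorator(fn: Callable[[GraphModule], None]):
        TRANSFORMS[name] = fn
        return fn
    return decorator

@register_transform("cleanup")
def cleanup(gm: GraphModule) -> None:
    """Example stage: drop dead nodes and regenerate the module code."""
    gm.graph.eliminate_dead_code()
    gm.recompile()

def run_pipeline(gm: GraphModule, stages: List[str]) -> GraphModule:
    """Apply the configured transform stages in order; each mutates gm in place."""
    for name in stages:
        TRANSFORMS[name](gm)
    return gm
```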
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~90+ minutes

Possibly related PRs
Suggested reviewers
Signed-off-by: Suyog Gupta <[email protected]>
Force-pushed from bc5b1b6 to 27a3d7f
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Documentation
Chores