
Conversation

@lucaslie (Member) commented on Jul 18, 2025

Summary by CodeRabbit

  • New Features

    • Introduced a modular graph transformation and export pipeline with configurable stages and patch systems.
    • Added backend-specific custom operators and transformations for attention, RMSNorm, and Mixture-of-Experts (MoE), including quantized FP8/FP4 and sliding window attention support.
    • Enabled dynamic layered YAML configuration loading with deep merging and strict validation.
    • Enhanced CLI and YAML configuration for expert usage with detailed documentation.
    • Added comprehensive testing for configuration, quantization, sharding, and custom operator correctness.
    • Added a pure PyTorch backend for cached multi-head attention with advanced features and a torch-based reference implementation for attention tests.
    • Introduced new AutoDeployConfig and refined LlmArgs for improved configuration management and validation.
    • Added support for max beam width and new tokens handling in the AutoDeploy executor.
  • Bug Fixes

    • Improved device and parameter deduplication during export.
    • Fixed parameter and buffer loading, caching, and quantization issues in graph transformations.
    • Addressed issues with tensor creation on meta device and patched PyTorch functions for export compatibility.
  • Refactor

    • Consolidated pattern matching, quantization, and sharding logic into in-place graph transformations.
    • Modularized patch and transform registration for easier maintenance.
    • Updated tests to align with new transformation APIs and in-place graph modifications.
    • Refactored sharding transforms to a declarative, two-phase detection and application model.
    • Simplified import structures and unified export function usage.
    • Replaced legacy export pipeline with a new robust export module and patch system.
    • Replaced manual attention reference computations in tests with centralized torch-based reference implementations.
  • Chores

    • Updated dependencies to support YAML-based configuration.
    • Added new tests covering configuration loading, sharding detection, and operator correctness.
    • Added auto-import modules for export patches, model patches, and transforms.
    • Improved logging and code comments for clarity and maintainability.
  • Documentation

    • Expanded README and in-code docs with detailed expert usage, configuration options, and best practices.
    • Added detailed docstrings for new configuration, transform, and export interfaces.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from the specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the --stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
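
For example, a typical invocation combining the documented options above (the stage name reuses the example given for --stage-list) might look like:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"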

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping tests without due care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing a pipeline without due care and validation can break the top of tree.

coderabbitai bot (Contributor) commented on Jul 18, 2025

Walkthrough

This update introduces a modular, extensible transformation and export pipeline for PyTorch models in the AutoDeploy system. It replaces legacy monolithic transformation code with a registry-based system for export patches and graph transforms, adds deep YAML config merging, and expands quantization, sharding, and MoE support. The update includes new backend-specific custom ops, extensive test coverage, and improved configuration management.
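
As a rough illustration of the deep YAML merging mentioned above, the sketch below uses OmegaConf directly; the config contents and keys are invented for the example and do not reflect the actual AutoDeploy configuration schema.

from omegaconf import OmegaConf

# Hypothetical layered configs; the keys are illustrative only.
default_cfg = OmegaConf.create({"transforms": {"export_to_gm": {"stage": "export"}}})
user_cfg = OmegaConf.create({"transforms": {"export_to_gm": {"clone_state_dict": True}}})

# OmegaConf.merge performs a deep (recursive) merge; later configs win on conflicting keys.
merged = OmegaConf.merge(default_cfg, user_cfg)
print(OmegaConf.to_yaml(merged))
# transforms:
#   export_to_gm:
#     stage: export
#     clone_state_dict: true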

Changes

File(s) / Area Change Summary
tensorrt_llm/_torch/auto_deploy/export/, tensorrt_llm/_torch/auto_deploy/transform/, tensorrt_llm/_torch/auto_deploy/transformations/ New modular export and transform systems: registry-based export patch framework, transform registry, and library auto-importers. Legacy transformation modules deprecated or removed.
tensorrt_llm/_torch/auto_deploy/export/library/, tensorrt_llm/_torch/auto_deploy/transform/library/ New and refactored export patches (e.g., autocast_noop, sdpa, tensor_meta_device, linear, modelopt_context, sdpa_kernel_noop, torch_modulelist_getitem, torch_where, transformers_sdpa_mask) and transform passes (e.g., build_model, export_to_gm, cleanup, quantize_moe, fuse_rmsnorm). Library __init__.py files auto-discover and import all patches/transforms.
tensorrt_llm/_torch/auto_deploy/llm_args.py, tensorrt_llm/_torch/auto_deploy/utils/_config.py Split LLM config into AutoDeployConfig and LlmArgs with stricter validation and YAML-based defaults. Added dynamic YAML config loading and deep merging via OmegaConf.
tensorrt_llm/_torch/auto_deploy/models/, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py Centralized model and tokenizer defaults, improved checkpoint loading, and support for new config fields (e.g., max_beam_width). Model patching modularized.
tensorrt_llm/_torch/auto_deploy/custom_ops/, tensorrt_llm/_torch/auto_deploy/transformations/library/ Added/extended custom ops for attention (torch/triton/flashinfer), RMSNorm, MoE (FP8/FP4), and support for new features like logit capping, sliding window, and sinks. Pattern matching and fusion logic refactored to support quantized and backend-specific variants.
tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py Refactored sharding detection and application into a two-phase, config-driven system supporting TP, BMM, and EP sharding, with pattern detection and transform execution separated.
tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py, tests/unittest/.../test_attention_matcher.py Unified attention pattern matching using a registry-based matcher, replacing multiple ad-hoc matchers and tests.
tensorrt_llm/_torch/auto_deploy/transformations/library/fused_moe.py, tensorrt_llm/_torch/auto_deploy/transformations/library/quantize_moe.py MoE pattern matching and quantization extended to support FP8 and FP4 quantization, with new test coverage.
tensorrt_llm/_torch/auto_deploy/transformations/library/rms_norm.py, tests/unittest/.../test_fuse_rmsnorm.py Added RMSNorm fusion transform supporting FlashInfer, Triton, and Torch backends, with new tests for correctness and operator presence.
tensorrt_llm/_torch/auto_deploy/transformations/library/rope.py RoPE pattern matching and optimization functions now operate in-place and return match counts, not modified graphs.
tensorrt_llm/_torch/auto_deploy/transformations/library/collectives.py, fusion.py, etc. All graph transforms now operate in-place and return None, reflecting a shift to side-effect-based modification.
tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py, node_utils.py Added quantization skipping, scale extraction, and quantization type detection utilities. Refined node filtering and pattern matching helpers.
examples/auto_deploy/, requirements.txt, setup.py Updated CLI, README, and launch configs to reflect new config fields and usage. Added omegaconf and YAML support to requirements. Included YAML files in package data.
tests/unittest/_torch/auto_deploy/ Extensive new and refactored tests for all new features: config merging, export/transform passes, quantization, MoE, sharding, attention, RMSNorm, and backend ops. Tests updated for in-place graph modification and new APIs.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Script
    participant ConfigLoader
    participant ModelFactory
    participant InferenceOptimizer
    participant TransformRegistry
    participant ExportPatchRegistry
    participant GraphModule

    User->>CLI/Script: Launch with CLI args/YAML configs
    CLI/Script->>ConfigLoader: Load and deep-merge YAML configs
    ConfigLoader->>ModelFactory: Create model factory with config
    CLI/Script->>InferenceOptimizer: Initialize with factory and transform config
    InferenceOptimizer->>TransformRegistry: For each transform stage, get transform class
    InferenceOptimizer->>ExportPatchRegistry: Apply export patches as needed
    InferenceOptimizer->>GraphModule: Apply transforms in order (build, export, cleanup, quantize, fuse, etc.)
    GraphModule-->>InferenceOptimizer: Modified graph after all transforms
    InferenceOptimizer-->>CLI/Script: Return optimized inference model
    CLI/Script-->>User: Ready-to-use inference model
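
As a minimal sketch of the registry lookup shown in the diagram (class and method names here are invented for illustration and do not match the actual TransformRegistry API):

from typing import Callable, Dict, Type

import torch.fx as fx


class TransformRegistry:
    """Toy registry mapping a transform name to the class implementing it."""

    _registry: Dict[str, Type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type], Type]:
        def decorator(transform_cls: Type) -> Type:
            cls._registry[name] = transform_cls
            return transform_cls

        return decorator

    @classmethod
    def get(cls, name: str) -> Type:
        return cls._registry[name]


@TransformRegistry.register("cleanup_noop_slice")
class CleanupNoopSlice:
    def __call__(self, gm: fx.GraphModule) -> None:
        # a real transform would rewrite gm.graph in-place here
        pass

The optimizer can then walk the configured stages and instantiate each transform by name, e.g. TransformRegistry.get("cleanup_noop_slice")()(gm).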

Estimated code review effort: 4 (~90 minutes)

Possibly related PRs

Suggested reviewers

  • chzblych
  • nv-guomingz
  • juney-nvidia

Poem

🐇✨
In the burrows of code, we rabbits delight,
With YAML and patches, our configs take flight.
Transforms and exports, now modular and neat,
Quantized and sharded, our models compete!
Attention and MoE, fused with great care,
Hop forward, dear models—deploy anywhere!
🥕🚀

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@lucaslie moved this from Backlog to In review in the AutoDeploy Board on Jul 18, 2025
@lucaslie self-assigned this on Jul 18, 2025
@lucaslie linked an issue on Jul 18, 2025 that may be closed by this pull request
@lucaslie enabled auto-merge (squash) on July 18, 2025 at 23:54
@lucaslie (Member Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #12356 [ run ] triggered by Bot

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 24

🔭 Outside diff range comments (2)
tensorrt_llm/_torch/auto_deploy/transformations/library/eliminate_redundant_transposes.py (1)

98-104: Fix self-reference bug in graph node replacement.

Line 103 has a critical bug where new_contiguous_node.replace_input_with(new_contiguous_node, original_input) is trying to replace the node's input with itself, which doesn't make sense logically.

-            new_contiguous_node.replace_input_with(new_contiguous_node, original_input)
+            # The new_contiguous_node already uses original_input as its argument,
+            # so no additional input replacement is needed here

Additionally, there's a potential issue with the logic flow: original_input.replace_all_uses_with(new_contiguous_node) on line 102 will replace all uses of original_input, but then we immediately try to modify new_contiguous_node's inputs, which may cause inconsistencies.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_attention_with_kv_cache.py (1)

286-307: Add sliding window mask to reference computation.

The test passes sliding_window parameter to the kernel but doesn't apply the corresponding mask in the reference computation.

The reference computation should apply sliding window masking when sliding_window > 0 to properly validate the kernel behavior. Consider adding sliding window mask similar to what's done in test_context_attention_kv_flattened.
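
A minimal sketch of what such a reference-side mask could look like (purely illustrative; the helper name and variables are not taken from the test file):

import torch

def sliding_window_ref_mask(q_pos: torch.Tensor, k_pos: torch.Tensor, sliding_window: int) -> torch.Tensor:
    """Boolean mask that is True where a query may attend to a key under a causal sliding window."""
    # pos_diff[i, j] = how far key j lies behind query i
    pos_diff = q_pos.unsqueeze(1) - k_pos.unsqueeze(0)
    return (pos_diff >= 0) & (pos_diff < sliding_window)

# In the reference computation, scores outside the window would then be masked out:
# scores = scores.masked_fill(~sliding_window_ref_mask(q_pos, k_pos, sliding_window), float("-inf"))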

♻️ Duplicate comments (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py (1)

79-83: Apply same docstring improvement for consistency.

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py (1)

60-68: Same code duplication issue as in TP sharding test.

🧹 Nitpick comments (46)
tensorrt_llm/_torch/auto_deploy/export/library/__init__.py (1)

1-16: Functional but duplicates code from models/patches init.

This auto-import implementation is correct and well-documented, but it's identical to the pattern in tensorrt_llm/_torch/auto_deploy/models/patches/__init__.py. Consider extracting this common pattern into a reusable utility function to eliminate code duplication.

Example refactor to create a shared utility:

# Create shared utility in a common location
def auto_import_submodules(package_path, package_name):
    """Auto-import all non-private submodules in a package."""
    import importlib
    import pkgutil
    
    all_modules = []
    for _, module_name, is_pkg in pkgutil.iter_modules(package_path):
        if module_name.startswith("_"):
            continue
        all_modules.append(module_name)
        importlib.import_module(f"{package_name}.{module_name}")
    return all_modules

# Then use in both files:
__all__ = auto_import_submodules(__path__, __name__)
tensorrt_llm/_torch/auto_deploy/export/library/sdpa.py (1)

23-23: Consider adding error handling for missing custom operator.

The patch assumes torch.ops.auto_deploy.torch_attention_sdpa exists. Consider adding validation to ensure the custom operator is available before applying the patch.

    def _apply_patch(self):
        """Apply the SDPA patch."""
+        # Validate custom operator exists
+        if not hasattr(torch.ops.auto_deploy, 'torch_attention_sdpa'):
+            raise RuntimeError("Custom operator torch.ops.auto_deploy.torch_attention_sdpa not found")
+        
        # Store original function
        self.original_values["F.scaled_dot_product_attention"] = F.scaled_dot_product_attention

        # Apply patch
        F.scaled_dot_product_attention = torch.ops.auto_deploy.torch_attention_sdpa
tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)

16-21: Consider adding comments for cleanup stages.

The cleanup stages (cleanup_noop_slice, cleanup_noop_add, cleanup_input_constraints) lack explanatory comments unlike the other stages. Adding brief descriptions would improve maintainability.

  cleanup_noop_slice:
    stage: post_export
+    # Remove unnecessary slice operations
  cleanup_noop_add:
    stage: post_export
+    # Remove unnecessary add operations  
  cleanup_input_constraints:
    stage: post_export
+    # Clean up input constraint nodes
tensorrt_llm/_torch/auto_deploy/transform/library/__init__.py (1)

12-16: Consider performance implications of eager loading.

All transform modules are imported immediately when the package is imported, which could impact startup time. Consider if lazy loading would be more appropriate for this use case.

You could implement lazy loading by only populating __all__ during discovery and deferring actual imports until the transforms are needed, or document that eager loading is intentional for registration purposes.
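
If lazy loading were ever desired, one option is the module-level __getattr__ hook from PEP 562. The sketch below is illustrative only and is not the current implementation; note that it would defer the registration side effects until a submodule is first accessed, which is precisely why eager loading may be intentional here.

# Hypothetical lazy __init__.py: discover submodule names eagerly, import on first access.
import importlib
import pkgutil

__all__ = [name for _, name, _ in pkgutil.iter_modules(__path__) if not name.startswith("_")]


def __getattr__(name):  # PEP 562: called only when normal attribute lookup fails
    if name in __all__:
        return importlib.import_module(f"{__name__}.{name}")
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")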

tensorrt_llm/_torch/auto_deploy/export/library/linear.py (1)

31-31: Consider thread safety implications.

The patch modifies the global F.linear function, which could potentially affect concurrent exports if they occur in the same process. Consider documenting this limitation or implementing thread-local patching if needed.

tensorrt_llm/_torch/auto_deploy/export/library/tensor_meta_device.py (2)

24-26: Consider expanding the condition to handle more cases.

The current condition only handles the exact case of data == 0.0. Consider whether other numeric values or data types might also have issues on the meta device.

-            if data == 0.0 and device is not None and torch.device(device) == torch.device("meta"):
-                return torch.zeros((), **kwargs)
+            if (
+                isinstance(data, (int, float)) 
+                and data == 0.0 
+                and device is not None 
+                and torch.device(device) == torch.device("meta")
+            ):
+                return torch.zeros((), **kwargs)

24-24: Add error handling for invalid device specifications.

The torch.device(device) call could raise an exception for invalid device strings. Consider adding error handling or validation.

-            if data == 0.0 and device is not None and torch.device(device) == torch.device("meta"):
+            if data == 0.0 and device is not None:
+                try:
+                    device_obj = torch.device(device)
+                    if device_obj == torch.device("meta"):
+                        return torch.zeros((), **kwargs)
+                except (TypeError, RuntimeError):
+                    pass  # Fall back to original function for invalid device specs
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py (1)

444-460: Address inconsistent reference implementation usage.

This test function still uses torch.nn.functional.scaled_dot_product_attention directly instead of TorchAttentionReference. For consistency and maintainability, consider updating this test to use the centralized reference implementation as well.

Do you want me to update this test to use TorchAttentionReference for consistency with the other test functions?

tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_noop_slice.py (1)

37-37: Consider using a more portable max value check.

Using torch.iinfo(torch.long).max assumes that the slice end uses the same maximum value. Consider also checking for other common "unbounded" representations.

-            if node.args[2] != 0 or node.args[3] != torch.iinfo(torch.long).max:
+            # Check for no-op slice: start=0 and end is effectively unbounded
+            max_vals = [torch.iinfo(torch.long).max, torch.iinfo(torch.int64).max, float('inf')]
+            if node.args[2] != 0 or node.args[3] not in max_vals:
tensorrt_llm/_torch/auto_deploy/custom_ops/_triton_attention_internal.py (1)

354-356: Consistent parameter addition but needs documentation.

Same sliding window and sinks parameters as other calls, maintaining consistency.

Add inline comments for the new parameters for clarity:

-        False,
-        None,
+        False,  # sliding_window_enabled
+        None,   # sinks
tensorrt_llm/_torch/auto_deploy/export/library/transformers_sdpa_mask.py (1)

49-51: Consider logging skipped patches for debugging.

Silent skipping of patches due to ImportError might make debugging difficult in development environments.

         except ImportError:
             # If transformers is not available or doesn't have required modules, skip patch
-            pass
+            # Consider adding debug logging here for development
+            pass
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_attention_op.py (1)

425-426: TODO comment indicates incomplete refactoring.

The TODO comment correctly identifies the intention to replace the manual reference computation with the torch backend reference, consistent with the pattern applied to other tests. However, the refactoring for this test remains incomplete.

Consider completing the refactoring for test_paged_gqa_op to maintain consistency with the other test functions, or create a follow-up task to address this.

tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1)

245-245: Significant behavioral change in BMMDynamicModel.

The model now uses fixed learnable weights instead of input-dependent dynamic weight generation. While this simplifies the model and may improve test determinism, the name "BMMDynamicModel" might be misleading since weights are no longer dynamically generated from inputs.

Consider updating the class name or docstring to reflect that weights are now fixed learnable parameters rather than input-dependent.

 class BMMDynamicModel(nn.Module):
-    """BMM model with dynamic tensor weights for testing."""
+    """BMM model with fixed learnable weights for testing."""

Also applies to: 252-252

examples/auto_deploy/README.md (1)

262-262: Minor markdown style inconsistency.

The linter flags inconsistent emphasis style (underscores vs asterisks). For consistency with markdown best practices, consider updating:

-*exclusively* exposed in the [`AutoDeployConfig` class]
+**exclusively** exposed in the [`AutoDeployConfig` class]
-_ignored_ in AutoDeploy
+**ignored** in AutoDeploy

Also applies to: 267-267

tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)

50-50: Consider improving assertion message for better debugging.

The assertion correctly validates the precondition, but could provide more context for debugging:

-assert len(gm.graph.nodes) == 0, "Expected empty graph module."
+assert len(gm.graph.nodes) == 0, f"Expected empty graph module for export, but found {len(gm.graph.nodes)} nodes. This transform should be applied to a dummy graph module containing only the factory_model submodule."
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)

47-56: Complete the docstring to include all parameters.

The docstring is missing documentation for the gm parameter.

Add the missing parameter:

 Args:
     cm: The cached sequence interface defining the sequence interface.
+    gm: Optional GraphModule to transform. If None, creates an empty one.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py (1)

172-198: Consider extracting common model initialization logic.

There's significant code duplication between _run_pattern_detection_job and _run_job for model and input initialization (lines 178-198 duplicate lines 89-109).

Extract common initialization:

def _init_model_and_input(model_cls, bias=False):
    """Initialize model and input tensor for testing."""
    batch_size = 4
    sequence_len = 8
    num_features = 32
    num_heads = 4
    num_key_value_heads = 1
    
    if model_cls == GQA_Block:
        model = model_cls(
            num_attention_heads=num_heads,
            hidden_size=num_features,
            num_key_value_heads=num_key_value_heads,
        ).to(device="cuda", dtype=torch.float16)
    else:
        model = model_cls(num_features, num_features, bias=bias).to(
            device="cuda", dtype=torch.float16
        )
    x = torch.randn(batch_size, sequence_len, num_features, device="cuda", dtype=torch.float16)
    return model, x, num_features, num_heads, num_key_value_heads
tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py (1)

42-61: Consider improving docstring consistency.

The implementation is correct, but the fake implementation's docstring is less detailed than the FlashInfer version.

Update the docstring for consistency:

 @triton_rmsnorm.register_fake
 def _(input: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
-    """Fake implementation for the custom operator during tracing."""
+    """Fake implementation for the custom operator during tracing.
+
+    Args:
+        input: Input tensor to normalize.
+        weight: Scaling weights for the normalized output.
+        eps: Small constant for numerical stability.
+
+    Returns:
+        Empty tensor with same shape as input.
+    """
     return torch.empty_like(input)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

107-107: Consider tracking the TODO for future improvements

The TODO suggests this test could benefit from a custom InferenceOptimizer config. Consider creating an issue to track this enhancement.

Would you like me to open an issue to track this TODO for implementing a custom InferenceOptimizer config?

tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/attention_with_kv_cache.py (1)

398-402: Sinks mechanism correctly integrated

The implementation properly adds the sinks contribution to the softmax denominator. Consider adding a comment explaining what "sinks" represent in the attention mechanism for better code documentation.
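
For readers unfamiliar with the term, here is a rough PyTorch illustration of the general idea, assuming one learnable sink logit per head that joins the softmax normalization without contributing a value; shapes and names are illustrative and do not describe the Triton kernel:

import torch

def attention_with_sinks(q, k, v, sinks, scale):
    # q, k, v: [batch, heads, seq, head_dim]; sinks: [heads]
    scores = (q @ k.transpose(-2, -1)) * scale                  # [b, h, s_q, s_k]
    sink = sinks.view(1, -1, 1, 1)                              # broadcast one logit per head
    m = torch.maximum(scores.amax(dim=-1, keepdim=True), sink)  # stable max including the sink
    num = torch.exp(scores - m)
    den = num.sum(dim=-1, keepdim=True) + torch.exp(sink - m)   # sink joins the denominator only
    return (num / den) @ v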

tensorrt_llm/_torch/auto_deploy/transformations/library/quantize_moe.py (1)

34-60: Well-structured quantization logic

The nested function quantize_param_list cleanly handles weight quantization, scale registration, and hook setup. Consider documenting the memory implications of quantizing large MoE models with many experts.

tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py (1)

505-513: Consider making the quantization type detection more extensible.

The function currently hardcodes the list of quantization implementations to check. This could be made more maintainable.

Consider using a registry pattern or class attribute to make this more extensible:

 def get_scales_and_type_from_node(node: Node) -> Tuple[Dict[str, Node], str]:
     """Returns a dict of scale args and quantization type string ('fp4', 'fp8', etc)."""
-    for qtype in [FP4QuantizationImpl, FP8QuantizationImpl]:
+    # Consider adding a class method to QuantizationImpl to get all implementations
+    for qtype in QuantizationImpl.get_all_implementations():
         if is_op(node, qtype.target_op()):
             return extract_scales_from_node(
                 node, qtype.scale_names()
             ), qtype.__name__.lower().replace("quantizationimpl", "")
     return None, "simple"
tensorrt_llm/_torch/auto_deploy/utils/_config.py (1)

58-62: Enforce MRO requirement programmatically.

The documentation states this class must come first in the MRO, but this isn't enforced in code.

Consider adding a check in __init_subclass__ to enforce the MRO requirement:

def __init_subclass__(cls, **kwargs):
    super().__init_subclass__(**kwargs)
    # Ensure DynamicYamlMixInForSettings is first in MRO (after the class itself)
    if len(cls.__mro__) > 2 and cls.__mro__[1] != DynamicYamlMixInForSettings:
        raise TypeError(
            f"{cls.__name__} must inherit from DynamicYamlMixInForSettings "
            "as the first base class to ensure proper yaml_configs processing"
        )
tensorrt_llm/_torch/auto_deploy/transformations/library/fused_moe.py (2)

188-189: Address TODO for quantized model fusion.

The TODO indicates that trtllm_moe_fused doesn't currently support quantized models. This could be a performance limitation.

Would you like me to help implement the quantized version of trtllm_moe_fused or create an issue to track this enhancement?


139-143: Update docstring to reflect in-place modification.

The function now performs in-place modifications but the docstring doesn't indicate this change.

Update the docstring to clarify that the function modifies the graph module in-place:

 def fuse_moe(gm: torch.fx.GraphModule) -> None:
     """
     Scan the FX graph and replace all calls to torch.ops.auto_deploy.torch_moe with
     torch.ops.auto_deploy.trtllm_moe_fused.
+    
+    This function modifies the graph module in-place.
     """
tensorrt_llm/_torch/auto_deploy/export/export.py (2)

29-36: Consider checking buffer devices as well

The function only checks parameter devices but applies the device to all nodes in the graph. Buffers might be on different devices than parameters.

def _clean_up_device_info(gm: fx.GraphModule) -> None:
    """Correct device information in the graph."""
-    devices = {t.device for _, t in gm.named_parameters()}
+    devices = {t.device for _, t in gm.named_parameters()}
+    devices.update({t.device for _, t in gm.named_buffers()})
    if len(devices) == 0:
        return
    elif len(devices) > 1:
-        raise AssertionError("All parameters should be on the same device.")
+        raise AssertionError("All parameters and buffers should be on the same device.")

269-272: Address TODO about overlap between deduplication functions

The TODO comment suggests there's overlap between _add_load_hook_for_aliased_params and _deduplicate_params_and_buffers. This should be clarified to avoid maintenance issues.

Would you like me to analyze the overlap between these functions and suggest a refactoring approach to consolidate them?

tests/unittest/_torch/auto_deploy/unit/singlegpu/utils/test_config.py (2)

843-862: Test doesn't actually verify relative path handling

The test is named test_relative_and_absolute_paths and has a comment about testing relative paths, but it only uses absolute paths throughout. Consider either renaming the test or implementing actual relative path testing.


866-866: Add newline at end of file

-        assert settings.simple.name == "config2"  # from config2
+        assert settings.simple.name == "config2"  # from config2
+
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

171-173: Consider optimizing GPU->CPU transfer for new_tokens

The comment indicates a performance concern about the GPU->CPU transfer. This could be a bottleneck in the inference pipeline.

Consider tracking this as a performance optimization opportunity. Would you like me to open an issue to investigate avoiding this transfer by keeping the tensor operations on the GPU?

tensorrt_llm/_torch/auto_deploy/export/interface.py (1)

250-250: Add newline at end of file

-    yield from _apply_patches(patches)
+    yield from _apply_patches(patches)
+
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_torch_attention_op.py (2)

332-339: Document known backend issues with sinks

The code handles backend failures when sinks are enabled, suggesting known issues. Consider adding a comment explaining the specific backend limitations or linking to a tracking issue.

         # For sinks: test that backend runs without crashing (backend has bugs)
         # and validate correct sinks behavior with numpy reference
+        # TODO: Backend currently has issues with sinks implementation - see issue #XXX
         try:

488-488: Add newline at end of file

-        assert result[0].shape[0] == batch_size, "First tensor should have batch_size elements"
+        assert result[0].shape[0] == batch_size, "First tensor should have batch_size elements"
+
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py (5)

44-44: Document the magic numbers for weight scale factors.

The magic numbers 448 and 432 lack explanation. Please add a comment explaining why these specific values are chosen for bfloat16 vs other dtypes.

-        wt_factor = 448 if dtype == torch.bfloat16 else 432
+        # FP8 E4M3 max value is 448 for bfloat16 and 432 for float16/float32
+        wt_factor = 448 if dtype == torch.bfloat16 else 432

46-48: Consider using deterministic weight initialization for tests.

Using torch.randn without a seed in test code can lead to non-deterministic behavior. Consider setting a seed or accepting initialized weights as parameters.

+        torch.manual_seed(42)  # Ensure deterministic initialization
         w1_fp32 = torch.randn(ffn_dim, hidden_dim, device=device)
         w3_fp32 = torch.randn(ffn_dim, hidden_dim, device=device)
         w2_fp32 = torch.randn(hidden_dim, ffn_dim, device=device)

99-101: Document the weight initialization scale factor.

The weights are initialized with a 0.01 scale factor. Please add a comment explaining why this specific value is chosen for FP4 quantization.

         # Prepare full-precision weights
+        # Small initialization scale (0.01) helps with FP4 quantization stability
         w1_fp32 = torch.randn(ffn_dim, hidden_dim, device=device, dtype=dtype) * 0.01

244-245: Optimize input tensor creation.

Creating the input tensor on CPU and then moving it to the device in the forward pass is inefficient. Consider creating it directly on the target device.

-        input_ids = self.get_input(device="cpu")  # or pass as constructor arg
-        input_sample = self.embedding(input_ids)
+        with torch.device(device):
+            input_ids = self.get_input(device=device)
+            input_sample = self.embedding(input_ids)

262-262: Remove duplicate seed setting.

The random seed is set both in get_input (line 262) and in the test function (line 294). Consider removing one to avoid confusion.

Also applies to: 294-294

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_moe.py (1)

274-275: Consider moving empty input check to the template.

The empty input check here is good, but for consistency, consider moving this check to the _template_moe function since it could benefit all MoE variants.

tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)

232-241: Consider more specific exception handling.

The broad exception catch might hide specific errors. Consider catching more specific exceptions or at least logging the full traceback for debugging.

             try:
                 gm, info = self._apply(gm, cm, factory)
-            except Exception as e:
+            except (TransformError, ValueError, TypeError) as e:
                 error_msg = f"Transform {t_name} failed"
                 if self.config.skip_on_error:
                     ad_logger.warning(f"{error_msg}: {e}")
+                    ad_logger.debug(f"Full traceback:", exc_info=True)
                     info = TransformInfo(skipped=True, num_matches=0)
                 else:
                     raise TransformError(error_msg) from e
+            except Exception as e:
+                # Unexpected errors should always be raised
+                ad_logger.error(f"Unexpected error in transform {t_name}: {e}")
+                raise
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

205-206: Address the TODO for Path support.

The TODO comment indicates that Path objects should be supported in the model factory. Consider implementing this support or creating an issue to track it.

Would you like me to help implement Path support in the model factory or create an issue to track this enhancement?

tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py (1)

498-500: Consider validating ShardingConfig is not None.

Add a validation check to ensure sharding_config is provided when world_size >= 2.

     if world_size < 2:
         ad_logger.info("Skipping sharding for single device")
         return
+    
+    assert sharding_config is not None, "sharding_config must be provided for multi-device setup"
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (2)

188-213: Complex but correct sliding window implementation.

The sliding window mask implementation correctly handles position differences between query and key positions. However, consider adding more inline documentation to explain the logic, especially around lines 194-207 where position differences are calculated.

+            # Calculate position differences for sliding window mask
+            # Query positions are offset by input_pos_i, key positions start from 0
             query_positions = torch.arange(
                 input_pos_i, input_pos_i + seq_len_i, device=q.device
             )  # [seq_len_i]
             key_positions = torch.arange(0, kv_seq_len, device=q.device)  # [kv_seq_len]

             # Create position difference matrix: query_pos - key_pos
+            # This represents how far back each query position is looking
             pos_diff = query_positions.unsqueeze(1) - key_positions.unsqueeze(
                 0
             )  # [seq_len_i, kv_seq_len]

475-484: Consider using approximate equality check for scale validation.

When validating the scale parameter, consider allowing for floating-point precision differences if you want to verify it matches the expected 1/sqrt(head_dim) value.

         # Validate scale
         if not isinstance(scale, float):
             ad_logger.warning("Provided scale is not a float. Using default scale instead.")
             scale = None
+        elif scale is not None:
+            # Optionally validate scale is approximately 1/sqrt(head_dim)
+            k_fake: FakeTensor = source_attn_node.args[1].meta["val"]
+            expected_scale = 1.0 / math.sqrt(k_fake.shape[3])
+            if abs(scale - expected_scale) > 1e-6:
+                ad_logger.debug(f"Scale {scale} differs from expected {expected_scale}")
tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py (1)

226-231: Consider using more meaningful scalar values

The scalar values for scale and dropout (e.g., 0.1234743, 0.85849734) appear arbitrary. Consider using more intuitive values or adding a comment explaining why these specific values were chosen.

    configs = [
-        (_sfdp_pattern_1, _sfdp_replacement_1, True, 0.1234743, 0.85849734),
-        (_sfdp_pattern_2, _sfdp_replacement_2, False, 0.234743, 0.5849734),
+        (_sfdp_pattern_1, _sfdp_replacement_1, True, 0.125, 0.1),  # 1/8 scale, 10% dropout
+        (_sfdp_pattern_2, _sfdp_replacement_2, False, 0.25, 0.2),  # 1/4 scale, 20% dropout
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1)

1174-1176: Document transpose node count expectations

The test expects "at least 7" transposes for bsnd but exactly 4 for bnsd. Consider adding a comment explaining why this asymmetry exists (e.g., some transposes may be optimized away in the bsnd case).

             # - 3 for the new input transposes
             # - 1 for the new output transpose
             # Note: Some nodes may be optimized away, so we check for at least 7
+            # The graph optimizer may fuse or eliminate some transposes, hence "at least"
             if len(transpose_nodes) < 7:

Also applies to: 1201-1203

@tensorrt-cicd (Collaborator) commented:

PR_Github #12356 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9178 completed with status: 'FAILURE'

@lucaslie (Member Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #12444 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12444 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9254 completed with status: 'FAILURE'

galagam and others added 10 commits July 21, 2025 07:25
…rmations to return None (#71)

* Refactor the signatures of AD graph transformations to return None (NVIDIA#5249)

Refactor signatures of AD graph transformations from
  gm = transformation(gm)
to
  transformation(gm)

Since the AD graph transformations modify the input GraphModule
in-place. Previous signature style was misleading.

Signed-off-by: Gal Hubara Agam <[email protected]>
…ion (#76)

* Fix trtllm-bench test and enable trtllm-bench integration

Signed-off-by: Neta Zmora <[email protected]>

* Remove unneeded __init__.py

Signed-off-by: Neta Zmora <[email protected]>

---------

Signed-off-by: Neta Zmora <[email protected]>
) (#73)

* yaml config loader for dynamic settings

Signed-off-by: Lucas Liebenwein <[email protected]>

* updates for yaml mixin

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressing reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* [AutoDeploy] Refining AD configurability

Signed-off-by: Lucas Liebenwein <[email protected]>

* addressed reviewer feedback

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the Torch backend and update the test to use the torch backend.

Signed-off-by: nvchenghaoz <[email protected]>

* Add the sinks and fix the test failures

Signed-off-by: nvchenghaoz <[email protected]>

* address reviewer's comments

Signed-off-by: nvchenghaoz <[email protected]>

* use custom op convention

Signed-off-by: nvchenghaoz <[email protected]>

* move the ref to the utils_test

Signed-off-by: nvchenghaoz <[email protected]>

* Add torch backend tests in ad_build_small_single.py

Signed-off-by: nvchenghaoz <[email protected]>

* Address hidden comments...

Signed-off-by: nvchenghaoz <[email protected]>

---------

Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
* add torch_fp8_moe and fp8 linear support in pattern matcher, update unit tests

Signed-off-by: Frida Hou <[email protected]>

* add torch-fp4-moe and fp4 support in pattern matcher, unit test has acc issue and e2e mixtral fp4 has kernel error wo moe matching

Signed-off-by: Frida Hou <[email protected]>

* add pre-commit hook

Signed-off-by: Frida Hou <[email protected]>

* hacky fix for e2e run of mixtral FP4 and fp4 op unit test

Signed-off-by: Frida Hou <[email protected]>

* EP support for torch_fp4_moe and torch_fp8_moe

Signed-off-by: Frida Hou <[email protected]>

* fix rebase: op rename, shard_load_hook bug in FP4

Signed-off-by: Frida Hou <[email protected]>

* fix pre-commit

Signed-off-by: Frida Hou <[email protected]>

* fix weight loading-load_hook issue for FP4, update function to handle exclude_modules in hf_quant_config

Signed-off-by: Frida Hou <[email protected]>

* addressing feedback, add moe op template, update op names,other minor refinements

Signed-off-by: Frida Hou <[email protected]>

* move common functionality to utility

Signed-off-by: Frida Hou <[email protected]>

* fix FP4QuantizationImpl register from rebase

Signed-off-by: Frida Hou <[email protected]>

* add quantize_moe pass for patched torch_moe op

Signed-off-by: Frida Hou <[email protected]>

* add transformation unit tests for FP8 and FP4

Signed-off-by: Frida Hou <[email protected]>

* update should_skip_quantization to fix bmm unit test

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel and utils to extract weight for dynamic BMM case

Signed-off-by: Frida Hou <[email protected]>

* update BMMDynamicModel to drop linear op

Signed-off-by: Frida Hou <[email protected]>

* minor

Signed-off-by: Frida Hou <[email protected]>

---------

Signed-off-by: Frida Hou <[email protected]>
* remove assert, add qwen small to tests

* lint

Signed-off-by: Suyog Gupta <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
@tensorrt-cicd (Collaborator) commented:

PR_Github #12475 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9280 completed with status: 'FAILURE'

@lucaslie (Member Author) commented:

/bot run

@lucaslie (Member Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #12485 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12484 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12485 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator) commented:

PR_Github #12484 [ run ] completed with state FAILURE

@lucaslie (Member Author) commented:

/bot run --disable-fail-fast

1 similar comment
@lucaslie (Member Author) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #12490 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12502 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12490 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator) commented:

PR_Github #12502 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9293 completed with status: 'FAILURE'

@lucaslie (Member Author) commented:

/bot run

@tensorrt-cicd (Collaborator) commented:

PR_Github #12577 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator) commented:

PR_Github #12577 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9355 completed with status: 'SUCCESS'

@lucaslie merged commit 41fb8aa into NVIDIA:main on Jul 22, 2025
3 checks passed
The github-project-automation bot moved this from In review to Done in the AutoDeploy Board on Jul 22, 2025
yali-arch pushed a commit to yali-arch/TensorRT-LLM that referenced this pull request Jul 23, 2025
Signed-off-by: Gal Hubara Agam <[email protected]>
Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Suyog Gupta <[email protected]>
Co-authored-by: Gal Hubara-Agam <[email protected]>
Co-authored-by: Neta Zmora <[email protected]>
Co-authored-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou  <[email protected]>
Co-authored-by: Suyog Gupta <[email protected]>
Co-authored-by: Grzegorz Kwasniewski <[email protected]>
Signed-off-by: Yang Li <[email protected]>
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Signed-off-by: Gal Hubara Agam <[email protected]>
Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Suyog Gupta <[email protected]>
Co-authored-by: Gal Hubara-Agam <[email protected]>
Co-authored-by: Neta Zmora <[email protected]>
Co-authored-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou  <[email protected]>
Co-authored-by: Suyog Gupta <[email protected]>
Co-authored-by: Grzegorz Kwasniewski <[email protected]>
Signed-off-by: Shreyas Misra <[email protected]>
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
Signed-off-by: Gal Hubara Agam <[email protected]>
Signed-off-by: Neta Zmora <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
Signed-off-by: Frida Hou <[email protected]>
Signed-off-by: greg-kwasniewski1 <[email protected]>
Signed-off-by: Suyog Gupta <[email protected]>
Co-authored-by: Gal Hubara-Agam <[email protected]>
Co-authored-by: Neta Zmora <[email protected]>
Co-authored-by: nvchenghaoz <[email protected]>
Co-authored-by: Frida Hou  <[email protected]>
Co-authored-by: Suyog Gupta <[email protected]>
Co-authored-by: Grzegorz Kwasniewski <[email protected]>
Signed-off-by: Ransiki Zhang <[email protected]>
Labels: none yet
Projects: archived in project
Development

Successfully merging this pull request may close these issues.

[AutoDeploy] Create export patch registry
9 participants