[AutoDeploy] merge feat/ad-2025-07-07 #6196
Conversation
Walkthrough

This update introduces a modular, extensible transformation and export pipeline for PyTorch models in the AutoDeploy system. It replaces legacy monolithic transformation code with a registry-based system for export patches and graph transforms, adds deep YAML config merging, and expands quantization, sharding, and MoE support. The update includes new backend-specific custom ops, extensive test coverage, and improved configuration management.
Sequence Diagram(s)

sequenceDiagram
participant User
participant CLI/Script
participant ConfigLoader
participant ModelFactory
participant InferenceOptimizer
participant TransformRegistry
participant ExportPatchRegistry
participant GraphModule
User->>CLI/Script: Launch with CLI args/YAML configs
CLI/Script->>ConfigLoader: Load and deep-merge YAML configs
ConfigLoader->>ModelFactory: Create model factory with config
CLI/Script->>InferenceOptimizer: Initialize with factory and transform config
InferenceOptimizer->>TransformRegistry: For each transform stage, get transform class
InferenceOptimizer->>ExportPatchRegistry: Apply export patches as needed
InferenceOptimizer->>GraphModule: Apply transforms in order (build, export, cleanup, quantize, fuse, etc.)
GraphModule-->>InferenceOptimizer: Modified graph after all transforms
InferenceOptimizer-->>CLI/Script: Return optimized inference model
CLI/Script-->>User: Ready-to-use inference model
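To make the diagram concrete, here is a minimal sketch of how a registry-based transform pipeline of this shape can be wired together. The names below (TransformRegistry.register/get, BaseTransform.apply, run_pipeline) are illustrative assumptions, not the actual AutoDeploy API.

```python
# Illustrative sketch of a registry-driven transform pipeline (not the actual AutoDeploy API).
from typing import Dict, List, Type

import torch.fx as fx


class BaseTransform:
    """Hypothetical base class: a transform rewrites a GraphModule and returns it."""

    def apply(self, gm: fx.GraphModule) -> fx.GraphModule:
        raise NotImplementedError


class TransformRegistry:
    """Maps transform names (as listed in the YAML config) to transform classes."""

    _registry: Dict[str, Type[BaseTransform]] = {}

    @classmethod
    def register(cls, name: str):
        def decorator(transform_cls: Type[BaseTransform]) -> Type[BaseTransform]:
            cls._registry[name] = transform_cls
            return transform_cls

        return decorator

    @classmethod
    def get(cls, name: str) -> Type[BaseTransform]:
        return cls._registry[name]


def run_pipeline(gm: fx.GraphModule, stages: List[str]) -> fx.GraphModule:
    """Apply the configured transforms in the order given by the merged config."""
    for name in stages:
        gm = TransformRegistry.get(name)().apply(gm)
    return gm
```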
Estimated code review effort: 4 (~90 minutes)
/bot run

PR_Github #12356 [ run ] triggered by Bot
Actionable comments posted: 24
🔭 Outside diff range comments (2)
tensorrt_llm/_torch/auto_deploy/transformations/library/eliminate_redundant_transposes.py (1)
98-104
: Fix self-reference bug in graph node replacement. Line 103 has a critical bug where new_contiguous_node.replace_input_with(new_contiguous_node, original_input) is trying to replace the node's input with itself, which doesn't make sense logically.

- new_contiguous_node.replace_input_with(new_contiguous_node, original_input)
+ # The new_contiguous_node already uses original_input as its argument,
+ # so no additional input replacement is needed here

Additionally, there's a potential issue with the logic flow: original_input.replace_all_uses_with(new_contiguous_node) on line 102 will replace all uses of original_input, but then we immediately try to modify new_contiguous_node's inputs, which may cause inconsistencies.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/triton_kernels/test_attention_with_kv_cache.py (1)
286-307
: Add sliding window mask to reference computation. The test passes the sliding_window parameter to the kernel but doesn't apply the corresponding mask in the reference computation. The reference computation should apply sliding window masking when sliding_window > 0 to properly validate the kernel behavior. Consider adding a sliding window mask similar to what's done in test_context_attention_kv_flattened.
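As a rough illustration of the suggestion, a reference computation can mask out keys that fall outside the window before calling SDPA; the shapes and position bookkeeping below are assumptions and would need to be adapted to the test's actual layout.

```python
# Hedged sketch: sliding-window masking for a torch SDPA reference computation.
import torch
import torch.nn.functional as F


def sdpa_reference_with_sliding_window(q, k, v, input_pos, sliding_window):
    """q: [n_heads, q_len, head_dim]; k, v: [n_heads, kv_len, head_dim]."""
    q_len, kv_len = q.shape[-2], k.shape[-2]
    q_pos = torch.arange(input_pos, input_pos + q_len, device=q.device)
    k_pos = torch.arange(kv_len, device=q.device)
    diff = q_pos.unsqueeze(1) - k_pos.unsqueeze(0)  # [q_len, kv_len], query_pos - key_pos
    mask = diff >= 0  # causal: keys must not be in the future
    if sliding_window > 0:
        mask &= diff < sliding_window  # window: keys at most sliding_window-1 positions back
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```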
♻️ Duplicate comments (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py (1)
79-83
: Apply same docstring improvement for consistency.

tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_ep_sharding.py (1)
60-68
: Same code duplication issue as in TP sharding test.
🧹 Nitpick comments (46)
tensorrt_llm/_torch/auto_deploy/export/library/__init__.py (1)
1-16
: Functional but duplicates code from models/patches init. This auto-import implementation is correct and well-documented, but it's identical to the pattern in tensorrt_llm/_torch/auto_deploy/models/patches/__init__.py. Consider extracting this common pattern into a reusable utility function to eliminate code duplication. Example refactor to create a shared utility:

# Create shared utility in a common location
def auto_import_submodules(package_path, package_name):
    """Auto-import all non-private submodules in a package."""
    import importlib
    import pkgutil

    all_modules = []
    for _, module_name, is_pkg in pkgutil.iter_modules(package_path):
        if module_name.startswith("_"):
            continue
        all_modules.append(module_name)
        importlib.import_module(f"{package_name}.{module_name}")
    return all_modules

# Then use in both files:
__all__ = auto_import_submodules(__path__, __name__)

tensorrt_llm/_torch/auto_deploy/export/library/sdpa.py (1)
23-23
: Consider adding error handling for missing custom operator. The patch assumes torch.ops.auto_deploy.torch_attention_sdpa exists. Consider adding validation to ensure the custom operator is available before applying the patch.

  def _apply_patch(self):
      """Apply the SDPA patch."""
+     # Validate custom operator exists
+     if not hasattr(torch.ops.auto_deploy, 'torch_attention_sdpa'):
+         raise RuntimeError("Custom operator torch.ops.auto_deploy.torch_attention_sdpa not found")
+
      # Store original function
      self.original_values["F.scaled_dot_product_attention"] = F.scaled_dot_product_attention

      # Apply patch
      F.scaled_dot_product_attention = torch.ops.auto_deploy.torch_attention_sdpa

tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)
16-21
: Consider adding comments for cleanup stages. The cleanup stages (cleanup_noop_slice, cleanup_noop_add, cleanup_input_constraints) lack explanatory comments unlike the other stages. Adding brief descriptions would improve maintainability.

  cleanup_noop_slice:
    stage: post_export
+   # Remove unnecessary slice operations
  cleanup_noop_add:
    stage: post_export
+   # Remove unnecessary add operations
  cleanup_input_constraints:
    stage: post_export
+   # Clean up input constraint nodes

tensorrt_llm/_torch/auto_deploy/transform/library/__init__.py (1)
12-16
: Consider performance implications of eager loading. All transform modules are imported immediately when the package is imported, which could impact startup time. Consider if lazy loading would be more appropriate for this use case.

You could implement lazy loading by only populating __all__ during discovery and deferring actual imports until the transforms are needed, or document that eager loading is intentional for registration purposes.
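If the lazy variant ever becomes worthwhile, one hedged way to implement it is PEP 562's module-level __getattr__ (sketch below). Note that with registration-by-import, something still has to touch each submodule before its transforms show up in the registry, which is presumably why eager loading was chosen.

```python
# Hypothetical lazy-loading __init__.py: discover submodule names eagerly, import on first access.
import importlib
import pkgutil

__all__ = [name for _, name, _ in pkgutil.iter_modules(__path__) if not name.startswith("_")]


def __getattr__(name):  # PEP 562: only called when `name` is not already in the module dict
    if name in __all__:
        module = importlib.import_module(f"{__name__}.{name}")
        globals()[name] = module  # cache so later lookups bypass this hook
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```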
tensorrt_llm/_torch/auto_deploy/export/library/linear.py (1)

31-31
: Consider thread safety implications. The patch modifies the global F.linear function, which could potentially affect concurrent exports if they occur in the same process. Consider documenting this limitation or implementing thread-local patching if needed.
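Should thread-local patching ever be needed, one possible shape (purely illustrative, not a drop-in for the existing patch class) is to swap in a dispatching wrapper once and gate its behavior on a thread-local flag:

```python
# Sketch: thread-local gating for a patched F.linear; `patched_impl` is a stand-in for the
# replacement op used by the export patch.
import threading
from contextlib import contextmanager

import torch.nn.functional as F

_patch_state = threading.local()
_original_linear = F.linear


def _make_dispatching_linear(patched_impl):
    def _linear(input, weight, bias=None):
        if getattr(_patch_state, "enabled", False):
            return patched_impl(input, weight, bias)
        return _original_linear(input, weight, bias)

    return _linear


@contextmanager
def patched_linear_for_this_thread(patched_impl):
    """Enable the patched F.linear only for the calling thread."""
    F.linear = _make_dispatching_linear(patched_impl)  # global swap; behavior gated per thread
    _patch_state.enabled = True
    try:
        yield
    finally:
        _patch_state.enabled = False
        F.linear = _original_linear
```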
tensorrt_llm/_torch/auto_deploy/export/library/tensor_meta_device.py (2)

24-26

: Consider expanding the condition to handle more cases. The current condition only handles the exact case of data == 0.0. Consider whether other numeric values or data types might also have issues on the meta device.

- if data == 0.0 and device is not None and torch.device(device) == torch.device("meta"):
-     return torch.zeros((), **kwargs)
+ if (
+     isinstance(data, (int, float))
+     and data == 0.0
+     and device is not None
+     and torch.device(device) == torch.device("meta")
+ ):
+     return torch.zeros((), **kwargs)
24-24
: Add error handling for invalid device specifications. The torch.device(device) call could raise an exception for invalid device strings. Consider adding error handling or validation.

- if data == 0.0 and device is not None and torch.device(device) == torch.device("meta"):
+ if data == 0.0 and device is not None:
+     try:
+         device_obj = torch.device(device)
+         if device_obj == torch.device("meta"):
+             return torch.zeros((), **kwargs)
+     except (TypeError, RuntimeError):
+         pass  # Fall back to original function for invalid device specs

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py (1)
444-460
: Address inconsistent reference implementation usage. This test function still uses torch.nn.functional.scaled_dot_product_attention directly instead of TorchAttentionReference. For consistency and maintainability, consider updating this test to use the centralized reference implementation as well. Do you want me to update this test to use TorchAttentionReference for consistency with the other test functions?

tensorrt_llm/_torch/auto_deploy/transform/library/cleanup_noop_slice.py (1)
37-37
: Consider using a more portable max value check. Using torch.iinfo(torch.long).max assumes that the slice end uses the same maximum value. Consider also checking for other common "unbounded" representations.

- if node.args[2] != 0 or node.args[3] != torch.iinfo(torch.long).max:
+ # Check for no-op slice: start=0 and end is effectively unbounded
+ max_vals = [torch.iinfo(torch.long).max, torch.iinfo(torch.int64).max, float('inf')]
+ if node.args[2] != 0 or node.args[3] not in max_vals:

tensorrt_llm/_torch/auto_deploy/custom_ops/_triton_attention_internal.py (1)
354-356
: Consistent parameter addition but needs documentation. Same sliding window and sinks parameters as other calls, maintaining consistency.

Add inline comments for the new parameters for clarity:

- False,
- None,
+ False,  # sliding_window_enabled
+ None,   # sinks

tensorrt_llm/_torch/auto_deploy/export/library/transformers_sdpa_mask.py (1)
49-51
: Consider logging skipped patches for debugging. Silent skipping of patches due to ImportError might make debugging difficult in development environments.

  except ImportError:
      # If transformers is not available or doesn't have required modules, skip patch
-     pass
+     # Consider adding debug logging here for development
+     pass

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_attention_op.py (1)
425-426
: TODO comment indicates incomplete refactoring. The TODO comment correctly identifies the intention to replace the manual reference computation with the torch backend reference, consistent with the pattern applied to other tests. However, the refactoring for this test remains incomplete.

Consider completing the refactoring for test_paged_gqa_op to maintain consistency with the other test functions, or create a follow-up task to address this.

tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1)
245-245
: Significant behavioral change in BMMDynamicModel. The model now uses fixed learnable weights instead of input-dependent dynamic weight generation. While this simplifies the model and may improve test determinism, the name "BMMDynamicModel" might be misleading since weights are no longer dynamically generated from inputs.

Consider updating the class name or docstring to reflect that weights are now fixed learnable parameters rather than input-dependent.

  class BMMDynamicModel(nn.Module):
-     """BMM model with dynamic tensor weights for testing."""
+     """BMM model with fixed learnable weights for testing."""

Also applies to: 252-252
examples/auto_deploy/README.md (1)
262-262
: Minor markdown style inconsistency. The linter flags inconsistent emphasis style (underscores vs asterisks). For consistency with markdown best practices, consider updating:

- *exclusively* exposed in the [`AutoDeployConfig` class]
+ **exclusively** exposed in the [`AutoDeployConfig` class]

- _ignored_ in AutoDeploy
+ **ignored** in AutoDeploy

Also applies to: 267-267
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)
50-50
: Consider improving assertion message for better debugging. The assertion correctly validates the precondition, but could provide more context for debugging:

- assert len(gm.graph.nodes) == 0, "Expected empty graph module."
+ assert len(gm.graph.nodes) == 0, f"Expected empty graph module for export, but found {len(gm.graph.nodes)} nodes. This transform should be applied to a dummy graph module containing only the factory_model submodule."

tensorrt_llm/_torch/auto_deploy/transform/optimizer.py (1)
47-56
: Complete the docstring to include all parameters. The docstring is missing documentation for the gm parameter.

Add the missing parameter:

  Args:
      cm: The cached sequence interface defining the sequence interface.
+     gm: Optional GraphModule to transform. If None, creates an empty one.
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_tp_sharding.py (1)
172-198
: Consider extracting common model initialization logic. There's significant code duplication between _run_pattern_detection_job and _run_job for model and input initialization (lines 178-198 duplicate lines 89-109).

Extract common initialization:

def _init_model_and_input(model_cls, bias=False):
    """Initialize model and input tensor for testing."""
    batch_size = 4
    sequence_len = 8
    num_features = 32
    num_heads = 4
    num_key_value_heads = 1

    if model_cls == GQA_Block:
        model = model_cls(
            num_attention_heads=num_heads,
            hidden_size=num_features,
            num_key_value_heads=num_key_value_heads,
        ).to(device="cuda", dtype=torch.float16)
    else:
        model = model_cls(num_features, num_features, bias=bias).to(
            device="cuda", dtype=torch.float16
        )

    x = torch.randn(batch_size, sequence_len, num_features, device="cuda", dtype=torch.float16)
    return model, x, num_features, num_heads, num_key_value_heads

tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py (1)
42-61
: Consider improving docstring consistency. The implementation is correct, but the fake implementation's docstring is less detailed than the FlashInfer version.

Update the docstring for consistency:

  @triton_rmsnorm.register_fake
  def _(input: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
-     """Fake implementation for the custom operator during tracing."""
+     """Fake implementation for the custom operator during tracing.
+
+     Args:
+         input: Input tensor to normalize.
+         weight: Scaling weights for the normalized output.
+         eps: Small constant for numerical stability.
+
+     Returns:
+         Empty tensor with same shape as input.
+     """
      return torch.empty_like(input)

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)
107-107
: Consider tracking the TODO for future improvements. The TODO suggests this test could benefit from a custom InferenceOptimizer config. Consider creating an issue to track this enhancement.
Would you like me to open an issue to track this TODO for implementing a custom InferenceOptimizer config?
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_kernels/attention_with_kv_cache.py (1)
398-402
: Sinks mechanism correctly integratedThe implementation properly adds the sinks contribution to the softmax denominator. Consider adding a comment explaining what "sinks" represent in the attention mechanism for better code documentation.
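For the suggested comment, "sinks" can be described as extra per-head logits that enlarge the softmax denominator without contributing value vectors; a schematic (with assumed shapes, not the Triton kernel's exact bookkeeping) looks like:

```python
# Schematic of attention sinks: extra logits in the softmax denominator with no value vectors.
import torch


def attention_with_sinks(q, k, v, sinks, scale):
    """q: [n_heads, q_len, d]; k, v: [n_heads, kv_len, d]; sinks: [n_heads] learnable logits."""
    logits = torch.einsum("hqd,hkd->hqk", q, k) * scale           # [n_heads, q_len, kv_len]
    sink_logits = sinks.view(-1, 1, 1).expand(-1, q.shape[1], 1)  # [n_heads, q_len, 1]
    probs = torch.softmax(torch.cat([logits, sink_logits], dim=-1), dim=-1)
    probs = probs[..., :-1]  # drop the sink column; remaining rows no longer sum to 1
    return torch.einsum("hqk,hkd->hqd", probs, v)
```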
tensorrt_llm/_torch/auto_deploy/transformations/library/quantize_moe.py (1)
34-60
: Well-structured quantization logic. The nested function
quantize_param_list
cleanly handles weight quantization, scale registration, and hook setup. Consider documenting the memory implications of quantizing large MoE models with many experts.tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py (1)
505-513
: Consider making the quantization type detection more extensible.The function currently hardcodes the list of quantization implementations to check. This could be made more maintainable.
Consider using a registry pattern or class attribute to make this more extensible:
def get_scales_and_type_from_node(node: Node) -> Tuple[Dict[str, Node], str]: """Returns a dict of scale args and quantization type string ('fp4', 'fp8', etc).""" - for qtype in [FP4QuantizationImpl, FP8QuantizationImpl]: + # Consider adding a class method to QuantizationImpl to get all implementations + for qtype in QuantizationImpl.get_all_implementations(): if is_op(node, qtype.target_op()): return extract_scales_from_node( node, qtype.scale_names() ), qtype.__name__.lower().replace("quantizationimpl", "") return None, "simple"tensorrt_llm/_torch/auto_deploy/utils/_config.py (1)
58-62
: Enforce MRO requirement programmatically.The documentation states this class must come first in the MRO, but this isn't enforced in code.
Consider adding a check in
__init_subclass__
to enforce the MRO requirement:def __init_subclass__(cls, **kwargs): super().__init_subclass__(**kwargs) # Ensure DynamicYamlMixInForSettings is first in MRO (after the class itself) if len(cls.__mro__) > 2 and cls.__mro__[1] != DynamicYamlMixInForSettings: raise TypeError( f"{cls.__name__} must inherit from DynamicYamlMixInForSettings " "as the first base class to ensure proper yaml_configs processing" )tensorrt_llm/_torch/auto_deploy/transformations/library/fused_moe.py (2)
188-189
: Address TODO for quantized model fusion.The TODO indicates that
trtllm_moe_fused
doesn't currently support quantized models. This could be a performance limitation.Would you like me to help implement the quantized version of
trtllm_moe_fused
or create an issue to track this enhancement?
139-143
: Update docstring to reflect in-place modification.The function now performs in-place modifications but the docstring doesn't indicate this change.
Update the docstring to clarify that the function modifies the graph module in-place:
def fuse_moe(gm: torch.fx.GraphModule) -> None: """ Scan the FX graph and replace all calls to torch.ops.auto_deploy.torch_moe with torch.ops.auto_deploy.trtllm_moe_fused. + + This function modifies the graph module in-place. """tensorrt_llm/_torch/auto_deploy/export/export.py (2)
29-36
: Consider checking buffer devices as wellThe function only checks parameter devices but applies the device to all nodes in the graph. Buffers might be on different devices than parameters.
def _clean_up_device_info(gm: fx.GraphModule) -> None: """Correct device information in the graph.""" - devices = {t.device for _, t in gm.named_parameters()} + devices = {t.device for _, t in gm.named_parameters()} + devices.update({t.device for _, t in gm.named_buffers()}) if len(devices) == 0: return elif len(devices) > 1: - raise AssertionError("All parameters should be on the same device.") + raise AssertionError("All parameters and buffers should be on the same device.")
269-272
: Address TODO about overlap between deduplication functionsThe TODO comment suggests there's overlap between
_add_load_hook_for_aliased_params
and_deduplicate_params_and_buffers
. This should be clarified to avoid maintenance issues.Would you like me to analyze the overlap between these functions and suggest a refactoring approach to consolidate them?
tests/unittest/_torch/auto_deploy/unit/singlegpu/utils/test_config.py (2)
843-862
: Test doesn't actually verify relative path handling. The test is named
test_relative_and_absolute_paths
and has a comment about testing relative paths, but it only uses absolute paths throughout. Consider either renaming the test or implementing actual relative path testing.
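A possible shape for genuinely exercising relative paths (hypothetical test body; yaml.safe_load below is a stand-in for the real settings loader) is to chdir into a temp directory and resolve the YAML by a relative filename:

```python
# Hypothetical pytest sketch for relative YAML path handling; the loader here is a stand-in.
from pathlib import Path

import yaml


def test_relative_yaml_path(tmp_path, monkeypatch):
    (tmp_path / "config.yaml").write_text(yaml.safe_dump({"simple": {"name": "relative"}}))

    # chdir so "config.yaml" resolves relative to the current working directory,
    # which is what the test name promises to cover.
    monkeypatch.chdir(tmp_path)
    loaded = yaml.safe_load(Path("config.yaml").read_text())
    assert loaded["simple"]["name"] == "relative"
```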
866-866
: Add newline at end of file- assert settings.simple.name == "config2" # from config2 + assert settings.simple.name == "config2" # from config2 +tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
171-173
: Consider optimizing GPU->CPU transfer for new_tokensThe comment indicates a performance concern about the GPU->CPU transfer. This could be a bottleneck in the inference pipeline.
Consider tracking this as a performance optimization opportunity. Would you like me to open an issue to investigate avoiding this transfer by keeping the tensor operations on the GPU?
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
250-250
: Add newline at end of file- yield from _apply_patches(patches) + yield from _apply_patches(patches) +tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_torch_attention_op.py (2)
332-339
: Document known backend issues with sinksThe code handles backend failures when sinks are enabled, suggesting known issues. Consider adding a comment explaining the specific backend limitations or linking to a tracking issue.
# For sinks: test that backend runs without crashing (backend has bugs) # and validate correct sinks behavior with numpy reference + # TODO: Backend currently has issues with sinks implementation - see issue #XXX try:
488-488
: Add newline at end of file- assert result[0].shape[0] == batch_size, "First tensor should have batch_size elements" + assert result[0].shape[0] == batch_size, "First tensor should have batch_size elements" +tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py (5)
44-44
: Document the magic numbers for weight scale factors.The magic numbers 448 and 432 lack explanation. Please add a comment explaining why these specific values are chosen for bfloat16 vs other dtypes.
- wt_factor = 448 if dtype == torch.bfloat16 else 432 + # FP8 E4M3 max value is 448 for bfloat16 and 432 for float16/float32 + wt_factor = 448 if dtype == torch.bfloat16 else 432
46-48
: Consider using deterministic weight initialization for tests.Using
torch.randn
without a seed in test code can lead to non-deterministic behavior. Consider setting a seed or accepting initialized weights as parameters.+ torch.manual_seed(42) # Ensure deterministic initialization w1_fp32 = torch.randn(ffn_dim, hidden_dim, device=device) w3_fp32 = torch.randn(ffn_dim, hidden_dim, device=device) w2_fp32 = torch.randn(hidden_dim, ffn_dim, device=device)
99-101
: Document the weight initialization scale factor.The weights are initialized with a 0.01 scale factor. Please add a comment explaining why this specific value is chosen for FP4 quantization.
# Prepare full-precision weights + # Small initialization scale (0.01) helps with FP4 quantization stability w1_fp32 = torch.randn(ffn_dim, hidden_dim, device=device, dtype=dtype) * 0.01
244-245
: Optimize input tensor creation.Creating the input tensor on CPU and then moving it to the device in the forward pass is inefficient. Consider creating it directly on the target device.
- input_ids = self.get_input(device="cpu") # or pass as constructor arg - input_sample = self.embedding(input_ids) + with torch.device(device): + input_ids = self.get_input(device=device) + input_sample = self.embedding(input_ids)
262-262
: Remove duplicate seed setting.The random seed is set both in
get_input
(line 262) and in the test function (line 294). Consider removing one to avoid confusion.Also applies to: 294-294
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_moe.py (1)
274-275
: Consider moving empty input check to the template.The empty input check here is good, but for consistency, consider moving this check to the
_template_moe
function since it could benefit all MoE variants.tensorrt_llm/_torch/auto_deploy/transform/interface.py (1)
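One hedged way the shared check could look inside a common template (function and argument names are illustrative, not the actual _template_moe signature):

```python
# Sketch: early exit for empty inputs in a shared MoE template so every variant benefits.
import torch


def _template_moe_sketch(x, selected_experts, routing_weights, expert_fn):
    """x: [num_tokens, hidden]; selected_experts/routing_weights: [num_tokens, top_k]."""
    if x.numel() == 0:
        # Nothing was routed; preserve shape, dtype, and device of the input.
        return torch.zeros_like(x)
    out = torch.zeros_like(x)
    for expert_id in torch.unique(selected_experts):
        token_idx, slot_idx = (selected_experts == expert_id).nonzero(as_tuple=True)
        expert_out = expert_fn(int(expert_id), x[token_idx])
        out.index_add_(0, token_idx, routing_weights[token_idx, slot_idx, None] * expert_out)
    return out
```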
232-241
: Consider more specific exception handling.The broad exception catch might hide specific errors. Consider catching more specific exceptions or at least logging the full traceback for debugging.
try: gm, info = self._apply(gm, cm, factory) - except Exception as e: + except (TransformError, ValueError, TypeError) as e: error_msg = f"Transform {t_name} failed" if self.config.skip_on_error: ad_logger.warning(f"{error_msg}: {e}") + ad_logger.debug(f"Full traceback:", exc_info=True) info = TransformInfo(skipped=True, num_matches=0) else: raise TransformError(error_msg) from e + except Exception as e: + # Unexpected errors should always be raised + ad_logger.error(f"Unexpected error in transform {t_name}: {e}") + raisetensorrt_llm/_torch/auto_deploy/llm_args.py (1)
205-206
: Address the TODO for Path support.The TODO comment indicates that Path objects should be supported in the model factory. Consider implementing this support or creating an issue to track it.
Would you like me to help implement Path support in the model factory or create an issue to track this enhancement?
tensorrt_llm/_torch/auto_deploy/transformations/library/sharding.py (1)
498-500
: Consider validating ShardingConfig is not None.Add a validation check to ensure sharding_config is provided when world_size >= 2.
if world_size < 2: ad_logger.info("Skipping sharding for single device") return + + assert sharding_config is not None, "sharding_config must be provided for multi-device setup"tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (2)
188-213
: Complex but correct sliding window implementation.The sliding window mask implementation correctly handles position differences between query and key positions. However, consider adding more inline documentation to explain the logic, especially around lines 194-207 where position differences are calculated.
+ # Calculate position differences for sliding window mask + # Query positions are offset by input_pos_i, key positions start from 0 query_positions = torch.arange( input_pos_i, input_pos_i + seq_len_i, device=q.device ) # [seq_len_i] key_positions = torch.arange(0, kv_seq_len, device=q.device) # [kv_seq_len] # Create position difference matrix: query_pos - key_pos + # This represents how far back each query position is looking pos_diff = query_positions.unsqueeze(1) - key_positions.unsqueeze( 0 ) # [seq_len_i, kv_seq_len]
475-484
: Consider using approximate equality check for scale validation.When validating the scale parameter, consider allowing for floating-point precision differences if you want to verify it matches the expected 1/sqrt(head_dim) value.
# Validate scale if not isinstance(scale, float): ad_logger.warning("Provided scale is not a float. Using default scale instead.") scale = None + elif scale is not None: + # Optionally validate scale is approximately 1/sqrt(head_dim) + k_fake: FakeTensor = source_attn_node.args[1].meta["val"] + expected_scale = 1.0 / math.sqrt(k_fake.shape[3]) + if abs(scale - expected_scale) > 1e-6: + ad_logger.debug(f"Scale {scale} differs from expected {expected_scale}")tensorrt_llm/_torch/auto_deploy/transformations/library/attention.py (1)
226-231
: Consider using more meaningful scalar valuesThe scalar values for scale and dropout (e.g., 0.1234743, 0.85849734) appear arbitrary. Consider using more intuitive values or adding a comment explaining why these specific values were chosen.
configs = [ - (_sfdp_pattern_1, _sfdp_replacement_1, True, 0.1234743, 0.85849734), - (_sfdp_pattern_2, _sfdp_replacement_2, False, 0.234743, 0.5849734), + (_sfdp_pattern_1, _sfdp_replacement_1, True, 0.125, 0.1), # 1/8 scale, 10% dropout + (_sfdp_pattern_2, _sfdp_replacement_2, False, 0.25, 0.2), # 1/4 scale, 20% dropouttests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher.py (1)
1174-1176
: Document transpose node count expectationsThe test expects "at least 7" transposes for bsnd but exactly 4 for bnsd. Consider adding a comment explaining why this asymmetry exists (e.g., some transposes may be optimized away in the bsnd case).
# - 3 for the new input transposes # - 1 for the new output transpose # Note: Some nodes may be optimized away, so we check for at least 7 + # The graph optimizer may fuse or eliminate some transposes, hence "at least" if len(transpose_nodes) < 7:
Also applies to: 1201-1203
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_flashinfer_attention_op.py
Show resolved
Hide resolved
tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
Show resolved
Hide resolved
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py
Show resolved
Hide resolved
tests/unittest/_torch/auto_deploy/unit/multigpu/transformations/library/test_bmm_sharding.py
Show resolved
Hide resolved
PR_Github #12356 [ run ] completed with state
/bot run
PR_Github #12444 [ run ] triggered by Bot
PR_Github #12444 [ run ] completed with state
…rmations to return None (#71) * Refactor the signatures of AD graph transformations to return None (NVIDIA#5249) Refactor signatures of AD graph transformations from gm = transformation(gm) to transformation(gm) Since the AD graph transformations modify the input GraphModule in-place. Previous signature style was misleading. Signed-off-by: Gal Hubara Agam <[email protected]>
…ion (#76) * Fix trtllm-bench test and enable trtllm-bench integration Signed-off-by: Neta Zmora <[email protected]> * Remove unneeded __init__.py Signed-off-by: Neta Zmora <[email protected]> --------- Signed-off-by: Neta Zmora <[email protected]>
) (#73) * yaml config loader for dynamic settings Signed-off-by: Lucas Liebenwein <[email protected]> * updates for yaml mixin Signed-off-by: Lucas Liebenwein <[email protected]> * addressing reviewer feedback Signed-off-by: Lucas Liebenwein <[email protected]> --------- Signed-off-by: Lucas Liebenwein <[email protected]>
* [AutoDeploy] Refining AD configurability Signed-off-by: Lucas Liebenwein <[email protected]> * addressed reviewer feedback Signed-off-by: Lucas Liebenwein <[email protected]> --------- Signed-off-by: Lucas Liebenwein <[email protected]>
* Add the Torch backend and update the test to use the torch backend. Signed-off-by: nvchenghaoz <[email protected]> * Add the sinks and fix the test failures Signed-off-by: nvchenghaoz <[email protected]> * address reviewer's comments Signed-off-by: nvchenghaoz <[email protected]> * use custom op convention Signed-off-by: nvchenghaoz <[email protected]> * move the ref to the utils_test Signed-off-by: nvchenghaoz <[email protected]> * Add torch backend tests in ad_build_small_single.py Signed-off-by: nvchenghaoz <[email protected]> * Address hidden comments... Signed-off-by: nvchenghaoz <[email protected]> --------- Signed-off-by: nvchenghaoz <[email protected]> Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: nvchenghaoz <[email protected]>
* add torch_fp8_moe and fp8 linear support in pattern matcher, update unit tests Signed-off-by: Frida Hou <[email protected]> * add torch-fp4-moe and fp4 support in pattern matcher, unit test has acc issue and e2e mixtral fp4 has kernel error wo moe matching Signed-off-by: Frida Hou <[email protected]> * add pre-commit hook Signed-off-by: Frida Hou <[email protected]> * hacky fix for e2e run of mixtral FP4 and fp4 op unit test Signed-off-by: Frida Hou <[email protected]> * EP support for torch_fp4_moe and torch_fp8_moe Signed-off-by: Frida Hou <[email protected]> * fix rebase: op rename, shard_load_hook bug in FP4 Signed-off-by: Frida Hou <[email protected]> * fix pre-commit Signed-off-by: Frida Hou <[email protected]> * fix weight loading-load_hook issue for FP4, update function to handle exclude_modules in hf_quant_config Signed-off-by: Frida Hou <[email protected]> * addressing feedback, add moe op template, update op names,other minor refinements Signed-off-by: Frida Hou <[email protected]> * move common functionality to utility Signed-off-by: Frida Hou <[email protected]> * fix FP4QuantizationImpl register from rebase Signed-off-by: Frida Hou <[email protected]> * add quantize_moe pass for patched torch_moe op Signed-off-by: Frida Hou <[email protected]> * add transformation unit tests for FP8 and FP4 Signed-off-by: Frida Hou <[email protected]> * update should_skip_quantization to fix bmm unit test Signed-off-by: Frida Hou <[email protected]> * update BMMDynamicModel and utils to extract weight for dynamic BMM case Signed-off-by: Frida Hou <[email protected]> * update BMMDynamicModel to drop linear op Signed-off-by: Frida Hou <[email protected]> * minor Signed-off-by: Frida Hou <[email protected]> --------- Signed-off-by: Frida Hou <[email protected]>
* remove assert, add qwen small to tests * lint Signed-off-by: Suyog Gupta <[email protected]> --------- Signed-off-by: Suyog Gupta <[email protected]>
PR_Github #12475 [ run ] completed with state
/bot run
/bot run --disable-fail-fast
PR_Github #12485 [ run ] triggered by Bot
PR_Github #12484 [ run ] triggered by Bot
PR_Github #12485 [ run ] completed with state
PR_Github #12484 [ run ] completed with state
/bot run --disable-fail-fast
1 similar comment
/bot run --disable-fail-fast
PR_Github #12490 [ run ] triggered by Bot
PR_Github #12502 [ run ] triggered by Bot
PR_Github #12490 [ run ] completed with state
PR_Github #12502 [ run ] completed with state
/bot run
PR_Github #12577 [ run ] triggered by Bot
PR_Github #12577 [ run ] completed with state
Signed-off-by: Gal Hubara Agam <[email protected]> Signed-off-by: Neta Zmora <[email protected]> Signed-off-by: Lucas Liebenwein <[email protected]> Signed-off-by: nvchenghaoz <[email protected]> Signed-off-by: Frida Hou <[email protected]> Signed-off-by: greg-kwasniewski1 <[email protected]> Signed-off-by: Suyog Gupta <[email protected]> Co-authored-by: Gal Hubara-Agam <[email protected]> Co-authored-by: Neta Zmora <[email protected]> Co-authored-by: nvchenghaoz <[email protected]> Co-authored-by: Frida Hou <[email protected]> Co-authored-by: Suyog Gupta <[email protected]> Co-authored-by: Grzegorz Kwasniewski <[email protected]> Signed-off-by: Yang Li <[email protected]>
Signed-off-by: Gal Hubara Agam <[email protected]> Signed-off-by: Neta Zmora <[email protected]> Signed-off-by: Lucas Liebenwein <[email protected]> Signed-off-by: nvchenghaoz <[email protected]> Signed-off-by: Frida Hou <[email protected]> Signed-off-by: greg-kwasniewski1 <[email protected]> Signed-off-by: Suyog Gupta <[email protected]> Co-authored-by: Gal Hubara-Agam <[email protected]> Co-authored-by: Neta Zmora <[email protected]> Co-authored-by: nvchenghaoz <[email protected]> Co-authored-by: Frida Hou <[email protected]> Co-authored-by: Suyog Gupta <[email protected]> Co-authored-by: Grzegorz Kwasniewski <[email protected]> Signed-off-by: Shreyas Misra <[email protected]>
Signed-off-by: Gal Hubara Agam <[email protected]> Signed-off-by: Neta Zmora <[email protected]> Signed-off-by: Lucas Liebenwein <[email protected]> Signed-off-by: nvchenghaoz <[email protected]> Signed-off-by: Frida Hou <[email protected]> Signed-off-by: greg-kwasniewski1 <[email protected]> Signed-off-by: Suyog Gupta <[email protected]> Co-authored-by: Gal Hubara-Agam <[email protected]> Co-authored-by: Neta Zmora <[email protected]> Co-authored-by: nvchenghaoz <[email protected]> Co-authored-by: Frida Hou <[email protected]> Co-authored-by: Suyog Gupta <[email protected]> Co-authored-by: Grzegorz Kwasniewski <[email protected]> Signed-off-by: Ransiki Zhang <[email protected]>
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.