
Conversation

@nvchenghaoz
Collaborator

@nvchenghaoz nvchenghaoz commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • None
  • Bug Fixes

    • Corrected decode indexing to properly handle generate-only (single-token) cases.
    • Initialized sequence state caches to zeros for clean, deterministic starts.
    • Improved free-slot selection to prevent collisions and avoid spurious out-of-range assertions.
  • Refactor

    • Minor cleanup in decode path variable handling for cached convolutions (no functional change).
  • Tests

    • Added accuracy coverage for Nemotron-H-8B-Base-8K, including MMLU and GSM8K.
    • Re-enabled a previously skipped unit test for generate-only with slot mapping.
  1. Fix the test errors in test_triton_generate_only_with_slot_mapping and remove the test waiver.

  2. Add accuracy testing for Nemotron-H; both MMLU and GSM8K passed.

lucaslie and others added 22 commits October 3, 2025 10:09
* [None][auto_deploy] Bamba

Signed-off-by: William Zhang <[email protected]>

* debugging export accuracy diff for bamba

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
* Fix the bamba unit test

Signed-off-by: Chenghao Zhang <[email protected]>

* none: Add triton backend for ssm_transform and cuda backend for conv

Signed-off-by: Chenghao Zhang <[email protected]>

* Fully Use the TRT LLM kernels

Signed-off-by: Chenghao Zhang <[email protected]>

* Add fake version for ssm transform op

Signed-off-by: Chenghao Zhang <[email protected]>

* Fix the datatype error in fake op

Signed-off-by: Chenghao Zhang <[email protected]>

* Fix the conv test error

Signed-off-by: Chenghao Zhang <[email protected]>

* Fix the triton ssm error

Signed-off-by: Chenghao Zhang <[email protected]>

---------

Signed-off-by: Chenghao Zhang <[email protected]>
…es with better reset/sizing (#140)

Signed-off-by: Lucas Liebenwein <[email protected]>

* Fix the DemoLLM sampler mismatch

Signed-off-by: Chenghao Zhang <[email protected]>

* Update the implementation for triton/cuda kernels

Signed-off-by: Chenghao Zhang <[email protected]>

* Fix the d2d memcpy for decode

Signed-off-by: Chenghao Zhang <[email protected]>

* Revert the generator and remove the redundant code

Signed-off-by: Chenghao Zhang <[email protected]>

---------

Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Suyog Gupta <[email protected]>
Co-authored-by: Suyog Gupta <[email protected]>
* [None][feat] Add patches for NemotronH

Signed-off-by: William Zhang <[email protected]>

* [None][test] unittest for nemotron_h

Signed-off-by: William Zhang <[email protected]>

* nemotron-h support finished

Signed-off-by: Lucas Liebenwein <[email protected]>

* added anticipated path for new models on llm_models trt-llm CI

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Oct 3, 2025

📝 Walkthrough

Walkthrough

Refactors internal value selection in attention interface, adjusts decode-time indexing in CUDA causal conv, updates Triton Mamba decode index logic and zero-initialized cache, and modifies tests by adding a Nemotron-H integration accuracy suite and enabling a previously skipped unit test.

Changes

Cohort / File(s) Summary of changes
AutoDeploy custom ops — attention interface
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Simplified _get_unique_value: removed bounds docstring and check; compute free values via set difference; return smallest free value or 0.
AutoDeploy custom ops — CUDA causal conv (decode path)
tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
In DECODE branch, introduced local slot_idx_decode = slot_idx[num_prefill:].to(torch.int32) and used it for conv_state_indices. No control-flow changes.
Triton backend — Mamba cached op
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
In decode path: for s==1, set decode_idx to range over flattened batch; for s>1, keep seq_start-based indexing. Cache init now zeros via torch.zeros for SSM state with explicit device/dtype.
Integration tests — AutoDeploy LLM API
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
Added TestNemotronH harness for Nemotron-H-8B-Base-8K with sampling params (end_id/pad_id=-1, n=1). Imports updated to include GSM8K.
Unit tests — Triton Mamba cached op
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_triton_mamba_cached_op.py
Removed skip on test_triton_generate_only_with_slot_mapping, enabling the test.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant TritonMamba as Triton Mamba Op
  participant SSMCache as SSM State Cache

  rect rgba(230,240,255,0.5)
    note over TritonMamba: Initialization
    Caller->>TritonMamba: get_cache_initializers()
    TritonMamba->>SSMCache: allocate zeros (device,dtype)
    SSMCache-->>TritonMamba: zero-initialized cache
  end

  alt Prefill or multi-token decode (s > 1)
    Caller->>TritonMamba: _triton_cached_ssm_transform(prefill/decode, seq_start, s>1)
    TritonMamba->>TritonMamba: decode_idx ← seq_start[num_prefill:]
  else Generate-only (s == 1)
    Caller->>TritonMamba: _triton_cached_ssm_transform(decode, s==1)
    TritonMamba->>TritonMamba: decode_idx ← range(flattened batch)
  end

  TritonMamba->>SSMCache: read/update with decode_idx
  SSMCache-->>TritonMamba: updated state
  TritonMamba-->>Caller: outputs
sequenceDiagram
  autonumber
  participant Caller
  participant CudaConv as CUDA Causal Conv

  Caller->>CudaConv: _cuda_cached_causal_conv1d(DECODE, slot_idx, num_prefill)
  CudaConv->>CudaConv: slot_idx_decode ← slot_idx[num_prefill:].to(int32)
  CudaConv->>CudaConv: conv_state_indices ← slot_idx_decode
  CudaConv-->>Caller: outputs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The PR description does not follow the provided template because it only contains a brief AI-generated summary and omits the required “## Description”, “## Test Coverage”, and “## PR Checklist” sections, leaving key details about the issue, solution, and tests unstructured. Please update the PR description to include the “## Description”, “## Test Coverage”, and “## PR Checklist” headings as specified in the repository template and provide information on the changes, test coverage, and checklist items.
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title follows the standard “[None][feat]” convention and succinctly describes the addition of Nemotron-H accuracy testing support, which is the primary change in this pull request.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca82911 and fe22f1c.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py (2 hunks)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_triton_mamba_cached_op.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_triton_mamba_cached_op.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py (1)
tensorrt_llm/_torch/modules/mamba/causal_conv1d.py (1)
  • causal_conv1d_update (77-118)
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py (2)
tensorrt_llm/functional.py (1)
  • arange (1498-1569)
cpp/kernels/fmha_v2/src/fused_multihead_attention_utils.h (1)
  • zeros (878-881)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
tests/integration/defs/accuracy/accuracy_core.py (6)
  • GSM8K (332-347)
  • MMLU (315-329)
  • CnnDailymail (248-265)
  • LlmapiAccuracyTestHarness (844-855)
  • evaluate (184-245)
  • evaluate (763-773)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

642-645: LGTM: Simplified value selection logic.

The refactored implementation using set(range(max_val)) - occupied is cleaner and more Pythonic than the previous approach. Note that set.pop() returns an arbitrary free value, not necessarily the smallest; over such a small range this is acceptable, and the method correctly defaults to 0 when no free values exist.
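The set-difference selection can be sketched in plain Python (hypothetical names; min() is used here to make "smallest free value" explicit, whereas the actual method's set.pop() returns an arbitrary free element):

```python
def get_unique_value(occupied, max_val):
    """Return the smallest value in [0, max_val) not in `occupied`, or 0 if none is free.

    Hypothetical helper illustrating the set-difference approach; the real
    method uses set.pop(), which yields an arbitrary free value.
    """
    free = set(range(max_val)) - set(occupied)
    return min(free) if free else 0


print(get_unique_value({0, 1, 3}, 5))  # → 2
```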

tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py (1)

189-197: LGTM: Improved readability with extracted variable.

Extracting slot_idx_decode as a local variable improves code clarity and aligns with the similar pattern in triton_backend_mamba.py. The logic remains functionally equivalent.
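The extracted-variable pattern can be sketched as follows (hypothetical function name; only the slicing and dtype cast mirror the described change):

```python
import torch


def select_conv_state_indices(slot_idx: torch.Tensor, num_prefill: int) -> torch.Tensor:
    """Sketch of the decode-slot extraction described above.

    Decode requests follow prefill requests in the flattened batch, so their
    conv-state slots are the tail of slot_idx; the update kernel expects
    int32 indices.
    """
    slot_idx_decode = slot_idx[num_prefill:].to(torch.int32)
    return slot_idx_decode
```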

tests/integration/defs/accuracy/test_llm_api_autodeploy.py (3)

24-24: LGTM: Import correctly added for GSM8K test.

The GSM8K import is appropriately added and used in the new test method at line 122.


113-123: LGTM: Test method structure is correct.

The test appropriately instantiates the AutoDeployLLM with the model configuration and evaluates both MMLU and GSM8K tasks, which aligns with the PR objectives of adding accuracy testing for Nemotron-H.


85-111: Confirm the Nemotron-H tokenizer's eos_token_id and pad_token_id
Manually verify that the model's eos_token_id and pad_token_id are not both -1, and adjust SamplingParams to use the actual token IDs if needed.

tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py (2)

127-132: LGTM: Correctly handles generate-only decode indexing.

The conditional logic appropriately handles the generate-only case (s==1) where seq_start entries would all be zeros. Using arange(bs) ensures correct token indexing for the flattened batch. This fix aligns with the PR objective of resolving errors in test_triton_generate_only_with_slot_mapping.
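The branching described above can be sketched in plain Python (hypothetical signature; `s` is tokens per decode sequence, `bs` the number of decode sequences):

```python
def decode_token_indices(seq_start, num_prefill, s, bs):
    """Sketch of the decode-index fix: pick flat token indices for decode sequences."""
    if s == 1:
        # Generate-only: every seq_start entry is 0, so indexing by seq_start
        # would collapse all sequences onto token 0. Enumerate the flattened
        # batch directly instead.
        return list(range(bs))
    # Mixed batch (s > 1): use the recorded start offset of each decode sequence.
    return list(seq_start[num_prefill:])


print(decode_token_indices([0, 0, 0], 0, 1, 3))  # → [0, 1, 2]
```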


244-253: LGTM: Proper zero-initialization for SSM state cache.

Changing from torch.empty to torch.zeros ensures that new sequences start with a clean state, which is the correct initial condition for SSM state caches. This improves correctness and prevents potential issues from uninitialized memory.
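A minimal sketch of the zero-initialized cache allocator (hypothetical name and shape; only the torch.zeros-with-explicit-device/dtype pattern mirrors the described change):

```python
import torch


def init_ssm_state_cache(max_batch, num_heads, head_dim, state_dim,
                         device="cpu", dtype=torch.float32):
    """Allocate a zero-initialized SSM state cache.

    torch.zeros (not torch.empty) guarantees a fresh sequence starts from a
    clean state; uninitialized memory would otherwise leak into the recurrence.
    """
    return torch.zeros(max_batch, num_heads, head_dim, state_dim,
                       device=device, dtype=dtype)
```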



Member

@lucaslie lucaslie left a comment

@nvchenghaoz the accuracy test got merged in #8133

Maybe for this PR, you can focus on just fixing the unit test?

@nvchenghaoz
Collaborator Author

@lucaslie Resolved the merge conflict.

@nvchenghaoz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20701 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20701 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15639 (Blue Ocean) completed with status: ABORTED
