
Conversation

@nvchenghaoz
Collaborator

@nvchenghaoz nvchenghaoz commented Oct 3, 2025

Summary by CodeRabbit

  • New Features

    • None
  • Bug Fixes

    • Corrected decode indexing to properly handle generate-only (single-token) cases.
    • Initialized sequence state caches to zeros for clean, deterministic starts.
    • Improved free-slot selection to prevent collisions and avoid spurious out-of-range assertions.
  • Refactor

    • Minor cleanup in decode path variable handling for cached convolutions (no functional change).
  • Tests

    • Added accuracy coverage for Nemotron-H-8B-Base-8K, including MMLU and GSM8K.
    • Re-enabled a previously skipped unit test for generate-only with slot mapping.
  1. Fix the test errors in test_triton_generate_only_with_slot_mapping and remove the test waiver.

  2. Add accuracy testing for Nemotron-H; both MMLU and GSM8K passed.

lucaslie and others added 22 commits October 3, 2025 10:09
* [None][auto_deploy] Bamba

Signed-off-by: William Zhang <[email protected]>

* debugging export accuracy diff for bamba

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
* Fix the bamba unit test

Signed-off-by: Chenghao Zhang <[email protected]>

* none: Add triton backend for ssm_transform and cuda backend for conv

Signed-off-by: Chenghao Zhang <[email protected]>

* Fully Use the TRT LLM kernels

Signed-off-by: Chenghao Zhang <[email protected]>

* Add fake version for ssm transform op

Signed-off-by: Chenghao Zhang <[email protected]>

* Fix the datatype error in fake op

Signed-off-by: Chenghao Zhang <[email protected]>

* Fix the conv test error

Signed-off-by: Chenghao Zhang <[email protected]>

* Fix the triton ssm error

Signed-off-by: Chenghao Zhang <[email protected]>

---------

Signed-off-by: Chenghao Zhang <[email protected]>
…es with better reset/sizing (#140)

Signed-off-by: Lucas Liebenwein <[email protected]>

* Fix the DemoLLM sampler mismatch

Signed-off-by: Chenghao Zhang <[email protected]>

* Update the implementation for triton/cuda kernels

Signed-off-by: Chenghao Zhang <[email protected]>

* Fix the d2d memcpy for decode

Signed-off-by: Chenghao Zhang <[email protected]>

* Revert the generator and remove the redundant code

Signed-off-by: Chenghao Zhang <[email protected]>

---------

Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Suyog Gupta <[email protected]>
Co-authored-by: Suyog Gupta <[email protected]>
* [None][feat] Add patches for NemotronH

Signed-off-by: William Zhang <[email protected]>

* [None][test] unittest for nemotron_h

Signed-off-by: William Zhang <[email protected]>

* nemotron-h support finished

Signed-off-by: Lucas Liebenwein <[email protected]>

* added anticipated path for new models on llm_models trt-llm CI

Signed-off-by: Lucas Liebenwein <[email protected]>

---------

Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
@coderabbitai
Contributor

coderabbitai bot commented Oct 3, 2025

📝 Walkthrough

Walkthrough

Refactors internal value selection in attention interface, adjusts decode-time indexing in CUDA causal conv, updates Triton Mamba decode index logic and zero-initialized cache, and modifies tests by adding a Nemotron-H integration accuracy suite and enabling a previously skipped unit test.

Changes

Cohort / File(s) Summary of changes
AutoDeploy custom ops — attention interface
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Simplified _get_unique_value: removed bounds docstring and check; compute free values via set difference; return smallest free value or 0.
AutoDeploy custom ops — CUDA causal conv (decode path)
tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
In DECODE branch, introduced local slot_idx_decode = slot_idx[num_prefill:].to(torch.int32) and used it for conv_state_indices. No control-flow changes.
Triton backend — Mamba cached op
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
In decode path: for s==1, set decode_idx to range over flattened batch; for s>1, keep seq_start-based indexing. Cache init now zeros via torch.zeros for SSM state with explicit device/dtype.
Integration tests — AutoDeploy LLM API
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
Added TestNemotronH harness for Nemotron-H-8B-Base-8K with sampling params (end_id/pad_id=-1, n=1). Imports updated to include GSM8K.
Unit tests — Triton Mamba cached op
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_triton_mamba_cached_op.py
Removed skip on test_triton_generate_only_with_slot_mapping, enabling the test.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant TritonMamba as Triton Mamba Op
  participant SSMCache as SSM State Cache

  rect rgba(230,240,255,0.5)
    note over TritonMamba: Initialization
    Caller->>TritonMamba: get_cache_initializers()
    TritonMamba->>SSMCache: allocate zeros (device,dtype)
    SSMCache-->>TritonMamba: zero-initialized cache
  end

  alt Prefill or multi-token decode (s > 1)
    Caller->>TritonMamba: _triton_cached_ssm_transform(prefill/decode, seq_start, s>1)
    TritonMamba->>TritonMamba: decode_idx ← seq_start[num_prefill:]
  else Generate-only (s == 1)
    Caller->>TritonMamba: _triton_cached_ssm_transform(decode, s==1)
    TritonMamba->>TritonMamba: decode_idx ← range(flattened batch)
  end

  TritonMamba->>SSMCache: read/update with decode_idx
  SSMCache-->>TritonMamba: updated state
  TritonMamba-->>Caller: outputs
sequenceDiagram
  autonumber
  participant Caller
  participant CudaConv as CUDA Causal Conv

  Caller->>CudaConv: _cuda_cached_causal_conv1d(DECODE, slot_idx, num_prefill)
  CudaConv->>CudaConv: slot_idx_decode ← slot_idx[num_prefill:].to(int32)
  CudaConv->>CudaConv: conv_state_indices ← slot_idx_decode
  CudaConv-->>Caller: outputs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The PR description does not follow the provided template because it only contains a brief AI-generated summary and omits the required “## Description”, “## Test Coverage”, and “## PR Checklist” sections, leaving key details about the issue, solution, and tests unstructured. Please update the PR description to include the “## Description”, “## Test Coverage”, and “## PR Checklist” headings as specified in the repository template and provide information on the changes, test coverage, and checklist items.
Docstring Coverage ⚠️ Warning Docstring coverage is 27.27% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title follows the standard “[None][feat]” convention and succinctly describes the addition of Nemotron-H accuracy testing support, which is the primary change in this pull request.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca82911 and fe22f1c.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py (2 hunks)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_triton_mamba_cached_op.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_triton_mamba_cached_op.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py (1)
tensorrt_llm/_torch/modules/mamba/causal_conv1d.py (1)
  • causal_conv1d_update (77-118)
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py (2)
tensorrt_llm/functional.py (1)
  • arange (1498-1569)
cpp/kernels/fmha_v2/src/fused_multihead_attention_utils.h (1)
  • zeros (878-881)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
tests/integration/defs/accuracy/accuracy_core.py (6)
  • GSM8K (332-347)
  • MMLU (315-329)
  • CnnDailymail (248-265)
  • LlmapiAccuracyTestHarness (844-855)
  • evaluate (184-245)
  • evaluate (763-773)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

642-645: LGTM: Simplified value selection logic.

The refactored implementation using set(range(max_val)) - occupied is cleaner and more Pythonic than the previous approach. Note that set.pop() returns an arbitrary free value, not necessarily the smallest; over such a small range this is acceptable, and the method correctly defaults to 0 when no free values exist.
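The set-difference selection can be sketched in plain Python (hypothetical names; min() is used here to make "smallest free value" explicit, whereas the actual method's set.pop() returns an arbitrary free element):

```python
def get_unique_value(occupied, max_val):
    """Return the smallest value in [0, max_val) not in `occupied`, or 0 if none is free.

    Hypothetical helper illustrating the set-difference approach; the real
    method uses set.pop(), which yields an arbitrary free value.
    """
    free = set(range(max_val)) - set(occupied)
    return min(free) if free else 0


print(get_unique_value({0, 1, 3}, 5))  # → 2
```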

tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py (1)

189-197: LGTM: Improved readability with extracted variable.

Extracting slot_idx_decode as a local variable improves code clarity and aligns with the similar pattern in triton_backend_mamba.py. The logic remains functionally equivalent.
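The extracted-variable pattern can be sketched as follows (hypothetical function name; only the slicing and dtype cast mirror the described change):

```python
import torch


def select_conv_state_indices(slot_idx: torch.Tensor, num_prefill: int) -> torch.Tensor:
    """Sketch of the decode-slot extraction described above.

    Decode requests follow prefill requests in the flattened batch, so their
    conv-state slots are the tail of slot_idx; the update kernel expects
    int32 indices.
    """
    slot_idx_decode = slot_idx[num_prefill:].to(torch.int32)
    return slot_idx_decode
```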

tests/integration/defs/accuracy/test_llm_api_autodeploy.py (3)

24-24: LGTM: Import correctly added for GSM8K test.

The GSM8K import is appropriately added and used in the new test method at line 122.


113-123: LGTM: Test method structure is correct.

The test appropriately instantiates the AutoDeployLLM with the model configuration and evaluates both MMLU and GSM8K tasks, which aligns with the PR objectives of adding accuracy testing for Nemotron-H.


85-111: Confirm the Nemotron-H tokenizer's eos_token_id and pad_token_id
Manually verify that the model's eos_token_id and pad_token_id are not both -1, and adjust SamplingParams to use the actual token IDs if needed.

tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py (2)

127-132: LGTM: Correctly handles generate-only decode indexing.

The conditional logic appropriately handles the generate-only case (s==1) where seq_start entries would all be zeros. Using arange(bs) ensures correct token indexing for the flattened batch. This fix aligns with the PR objective of resolving errors in test_triton_generate_only_with_slot_mapping.
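The branching described above can be sketched in plain Python (hypothetical signature; `s` is tokens per decode sequence, `bs` the number of decode sequences):

```python
def decode_token_indices(seq_start, num_prefill, s, bs):
    """Sketch of the decode-index fix: pick flat token indices for decode sequences."""
    if s == 1:
        # Generate-only: every seq_start entry is 0, so indexing by seq_start
        # would collapse all sequences onto token 0. Enumerate the flattened
        # batch directly instead.
        return list(range(bs))
    # Mixed batch (s > 1): use the recorded start offset of each decode sequence.
    return list(seq_start[num_prefill:])


print(decode_token_indices([0, 0, 0], 0, 1, 3))  # → [0, 1, 2]
```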


244-253: LGTM: Proper zero-initialization for SSM state cache.

Changing from torch.empty to torch.zeros ensures that new sequences start with a clean state, which is the correct initial condition for SSM state caches. This improves correctness and prevents potential issues from uninitialized memory.
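A minimal sketch of the zero-initialized cache allocator (hypothetical name and shape; only the torch.zeros-with-explicit-device/dtype pattern mirrors the described change):

```python
import torch


def init_ssm_state_cache(max_batch, num_heads, head_dim, state_dim,
                         device="cpu", dtype=torch.float32):
    """Allocate a zero-initialized SSM state cache.

    torch.zeros (not torch.empty) guarantees a fresh sequence starts from a
    clean state; uninitialized memory would otherwise leak into the recurrence.
    """
    return torch.zeros(max_batch, num_heads, head_dim, state_dim,
                       device=device, dtype=dtype)
```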



Member

@lucaslie lucaslie left a comment

@nvchenghaoz the accuracy test got merged in #8133

Maybe for this PR, you can focus on just fixing the unit test?

@nvchenghaoz
Collaborator Author

@lucaslie Resolved the merge conflict.

@nvchenghaoz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20701 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20701 [ run ] completed with state ABORTED
LLM/main/L0_MergeRequest_PR #15639 (Blue Ocean) completed with status: ABORTED
