[None][feat] AutoDeploy: Nemotron-H accuracy testing support #8136
Conversation
Commits:
* [None][auto_deploy] Bamba; debugging export accuracy diff for Bamba
* Fix the Bamba unit test
* Add Triton backend for ssm_transform and CUDA backend for conv
* Fully use the TRT-LLM kernels
* Add fake version for the ssm_transform op
* Fix the datatype error in the fake op
* Fix the conv test error
* Fix the Triton SSM error
* Fix the DemoLLM sampler mismatch
* Update the implementation for Triton/CUDA kernels
* Fix the d2d memcpy for decode
* Revert the generator and remove the redundant code
* …es with better reset/sizing (#140)
* [None][feat] Add patches for NemotronH; unit test for nemotron_h; Nemotron-H support finished; added anticipated path for new models on the llm_models TRT-LLM CI
* Revert commit 67ee3d8

Signed-off-by: William Zhang <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Chenghao Zhang <[email protected]>
Signed-off-by: Suyog Gupta <[email protected]>
Co-authored-by: William Zhang <[email protected]>
Co-authored-by: Suyog Gupta <[email protected]>
📝 Walkthrough

Refactors internal value selection in the attention interface, adjusts decode-time indexing in the CUDA causal conv, updates the Triton Mamba decode index logic with a zero-initialized cache, and modifies tests by adding a Nemotron-H integration accuracy suite and enabling a previously skipped unit test.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant TritonMamba as Triton Mamba Op
    participant SSMCache as SSM State Cache
    rect rgba(230,240,255,0.5)
        note over TritonMamba: Initialization
        Caller->>TritonMamba: get_cache_initializers()
        TritonMamba->>SSMCache: allocate zeros (device, dtype)
        SSMCache-->>TritonMamba: zero-initialized cache
    end
    alt Prefill or multi-token decode (s > 1)
        Caller->>TritonMamba: _triton_cached_ssm_transform(prefill/decode, seq_start, s>1)
        TritonMamba->>TritonMamba: decode_idx ← seq_start[num_prefill:]
    else Generate-only (s == 1)
        Caller->>TritonMamba: _triton_cached_ssm_transform(decode, s==1)
        TritonMamba->>TritonMamba: decode_idx ← range(flattened batch)
    end
    TritonMamba->>SSMCache: read/update with decode_idx
    SSMCache-->>TritonMamba: updated state
    TritonMamba-->>Caller: outputs
```
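The decode-index selection described above can be sketched in plain Python. This is an illustrative sketch only: the function name `select_decode_indices` is hypothetical, and the actual op (`_triton_cached_ssm_transform`) operates on torch tensors rather than lists.

```python
def select_decode_indices(seq_start, num_prefill, s):
    """Pick state-cache indices for the decode portion of a batch.

    In a mixed prefill/decode batch (s > 1), decode tokens begin at the
    offsets stored after the prefill entries of seq_start. In a
    generate-only batch (s == 1), every flattened batch entry is a decode
    token, so a simple range over the batch suffices.
    """
    if s > 1:
        return list(seq_start[num_prefill:])
    return list(range(len(seq_start)))

# Mixed batch: two prefill sequences followed by two decode tokens.
print(select_decode_indices([0, 5, 9, 10], num_prefill=2, s=4))  # [9, 10]

# Generate-only batch of three requests.
print(select_decode_indices([0, 1, 2], num_prefill=0, s=1))  # [0, 1, 2]
```

The generate-only branch matters because a pure decode batch has one token per sequence, so positional offsets degenerate to the batch index itself.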
```mermaid
sequenceDiagram
    autonumber
    participant Caller
    participant CudaConv as CUDA Causal Conv
    Caller->>CudaConv: _cuda_cached_causal_conv1d(DECODE, slot_idx, num_prefill)
    CudaConv->>CudaConv: slot_idx_decode ← slot_idx[num_prefill:].to(int32)
    CudaConv->>CudaConv: conv_state_indices ← slot_idx_decode
    CudaConv-->>Caller: outputs
```
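The slot-index slicing for the causal conv can be illustrated with a small sketch. The helper name `update_decode_conv_states` and the list-based state table are hypothetical stand-ins; the real kernel works on torch tensors and casts the indices to int32 before handing them to CUDA.

```python
def update_decode_conv_states(conv_states, slot_idx, num_prefill, new_states):
    """Scatter fresh decode states into a per-slot conv-state table.

    Mirrors conv_state_indices <- slot_idx[num_prefill:] from the diagram:
    only the decode requests (those after the prefill entries of slot_idx)
    have their cached state updated.
    """
    decode_slots = [int(i) for i in slot_idx[num_prefill:]]
    for slot, state in zip(decode_slots, new_states):
        conv_states[slot] = state
    return conv_states

# Four cache slots; requests 0-1 are prefill, requests 2-3 are decode
# and map to cache slots 0 and 2 respectively.
states = [None, None, None, None]
updated = update_decode_conv_states(states, [3, 1, 0, 2], num_prefill=2,
                                    new_states=["a", "b"])
print(updated)  # ['a', None, 'b', None]
```

Indexing by cache slot rather than batch position is what lets requests keep their conv state across steps even as the batch composition changes.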
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Files selected for processing:
* tensorrt_llm/_torch/auto_deploy/custom_ops/cuda_backend_causal_conv.py
* tensorrt_llm/_torch/auto_deploy/custom_ops/triton_backend_mamba.py
* tests/integration/defs/accuracy/test_llm_api_autodeploy.py
lucaslie left a comment:
@nvchenghaoz the accuracy test got merged in #8133
Maybe for this PR, you can focus on just fixing the unit test?
@lucaslie Resolved the merge conflict.

/bot run

PR_Github #20701 [ run ] triggered by Bot

PR_Github #20701 [ run ] completed with state
Fix the test errors in test_triton_generate_only_with_slot_mapping and remove the waive.
Add accuracy testing for Nemotron-H; both MMLU and GSM8K passed.
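The pass/fail criterion of such an accuracy suite reduces to comparing per-task scores against minimum thresholds. A minimal sketch follows; the function name and the score/threshold values are hypothetical and not taken from the actual test_llm_api_autodeploy.py harness.

```python
def passes_accuracy(results, thresholds):
    """True only if every task meets or beats its minimum score."""
    return all(results[task] >= min_score
               for task, min_score in thresholds.items())

# Hypothetical scores and thresholds, for illustration only.
thresholds = {"MMLU": 0.68, "GSM8K": 0.60}
print(passes_accuracy({"MMLU": 0.71, "GSM8K": 0.64}, thresholds))  # True
print(passes_accuracy({"MMLU": 0.50, "GSM8K": 0.64}, thresholds))  # False
```

Thresholds are typically set slightly below a reference run's scores so that normal run-to-run variance does not flake the CI job.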