[https://nvbugs/5412456][fix] Fix "an illegal instruction was encountered" error #6776
Conversation
Running pytest tests/unittest/trt/attention/test_gpt_attention.py fails with 'RuntimeError: CUDA error: an illegal instruction was encountered'.

Signed-off-by: Zhou Yuxin <[email protected]>
📝 Walkthrough

Adds a guard in FusedMHARunnerV2::setupLaunchParams that sets total_kv_seqlen to total_q_seqlen when total_kv_seqlen computes to 0 (NVBug 5412456 workaround). No other launch-parameter calculations or public interfaces are changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant FusedMHARunnerV2
    Caller->>FusedMHARunnerV2: setupLaunchParams(...)
    activate FusedMHARunnerV2
    FusedMHARunnerV2->>FusedMHARunnerV2: compute total_q_seqlen, total_kv_seqlen
    alt total_kv_seqlen == 0
        FusedMHARunnerV2->>FusedMHARunnerV2: total_kv_seqlen = total_q_seqlen
    end
    FusedMHARunnerV2->>FusedMHARunnerV2: proceed with descriptor/setup using non-zero kv len
    deactivate FusedMHARunnerV2
    FusedMHARunnerV2-->>Caller: launch params ready
```
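To make the described fallback concrete, here is a minimal, self-contained sketch of the logic; LaunchParams and applyKvSeqlenFallback are invented for illustration, while the real guard lives inline in FusedMHARunnerV2::setupLaunchParams with many more fields around it:

```cpp
#include <cstdint>

// Hypothetical, simplified launch-parameter struct; the real class carries
// many more fields. Only the two sequence lengths relevant to the fix appear.
struct LaunchParams
{
    uint32_t total_q_seqlen = 0;
    uint32_t total_kv_seqlen = 0;
};

// Workaround for NVBug 5412456: a zero total_kv_seqlen would later produce a
// zero-sized TMA descriptor, which triggers the illegal-instruction fault.
void applyKvSeqlenFallback(LaunchParams& params)
{
    if (params.total_kv_seqlen == 0)
    {
        params.total_kv_seqlen = params.total_q_seqlen;
    }
}
```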
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
/bot run
Actionable comments posted: 0
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp (1)
480-481: Add non-zero sequence length checks when using TMA

Insert the following guard at the top of FusedMHARunnerV2::setTmaDescriptors in cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp (around line 480):

```diff
 const uint32_t total_q_seqlen = mLaunchParams.total_q_seqlen;
 const uint32_t total_kv_seqlen = mLaunchParams.total_kv_seqlen;
+if (mLaunchParams.use_tma) {
+    TLLM_CHECK_WITH_INFO(total_q_seqlen > 0, "total_q_seqlen must be > 0 when using TMA.");
+    TLLM_CHECK_WITH_INFO(total_kv_seqlen > 0, "total_kv_seqlen must be > 0 when using TMA.");
+}
```

This ensures that neither Q nor KV sequence length is zero before constructing any TMA descriptors.
🧹 Nitpick comments (3)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp (3)
300-304: Log when the NVBug fallback triggers to aid diagnostics

Recommend emitting a warning the first time the fallback is applied. This helps detect unexpected occurrences and simplifies triage without changing behavior.

```diff
 mLaunchParams.total_kv_seqlen
     = mFixedParams.isSPadded ? runnerParams.b * runnerParams.kvSeqLen : runnerParams.totalKvSeqLen;
-// Workaround for nvbug 5412456: total_kv_seqlen fallbacks to total_q_seqlen if it's zero.
-if (mLaunchParams.total_kv_seqlen == 0)
+// Workaround for NVBug 5412456: total_kv_seqlen falls back to total_q_seqlen if it's zero.
+if (mLaunchParams.total_kv_seqlen == 0) {
+    TLLM_LOG_WARNING(
+        "Applying NVBug 5412456 workaround: total_kv_seqlen==0. "
+        "Using total_q_seqlen (%u) instead.",
+        static_cast<unsigned>(mLaunchParams.total_q_seqlen));
     mLaunchParams.total_kv_seqlen = mLaunchParams.total_q_seqlen;
+}
```
25-25: Use C++ headers over C headers per guidelines

Replace <math.h> with <cmath>. The file already uses std::log2; align sqrt/pow usage with std::.

```diff
-#include <math.h>
+#include <cmath>
```

Optionally, change sqrtf to std::sqrt and pow to std::pow for consistency, as sketched below.
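As a small illustration of the suggested switch to the std:: functions (computeScale, headSize, and exponent are hypothetical names, not call sites from fmhaRunner.cpp):

```cpp
#include <cmath>

// Hypothetical helper showing the <cmath> forms; the actual call sites in
// fmhaRunner.cpp may combine these differently.
float computeScale(int headSize, float exponent)
{
    // std::sqrt replaces sqrtf and std::pow replaces pow, matching the
    // file's existing use of std::log2.
    float const scale = std::sqrt(static_cast<float>(headSize));
    float const norm = std::pow(2.0f, exponent);
    return scale / norm;
}
```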
215-216: Constant naming: align with codebase convention

Guidelines suggest uppercase snake case prefixed with 'k' for constants. Consider kLOG2E instead of kLog2e.

```diff
-    constexpr float kLog2e = 1.4426950408889634074; // log_2(e) = M_LOG2E
-    set_alpha(mKernelParams.scale_bmm1, scale_bmm1 * float(kLog2e), DATA_TYPE_FP32);
+    constexpr float kLOG2E = 1.4426950408889634074; // log2(e)
+    set_alpha(mKernelParams.scale_bmm1, scale_bmm1 * float(kLOG2E), DATA_TYPE_FP32);
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp
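A compact sketch pulling several of the conventions above into one file (every name here is invented purely for illustration):

```cpp
#include <mutex>

namespace foo
{

// Function-scope magic-number constant: uppercase snake case with 'k' prefix.
constexpr int kMAX_RETRIES = 3;

// Type: camel case with uppercase first letter.
class FooBarClass
{
public:
    // Method: camel case with lowercase first letter.
    int getFooValue() const
    {
        return mNbFooValues;
    }

private:
    // Member variable: camel case with 'm' prefix.
    int mNbFooValues = 0;
};

void runExample()
{
    // Locally visible static: 's' prefix.
    static std::once_flag sFlag;
    std::call_once(sFlag, [] {});

    // Local not modified after initialization: declared const.
    int const localFooBar = kMAX_RETRIES;

    // If bodies are always brace-delimited, Allman style.
    if (localFooBar > 0)
    {
        FooBarClass instance;
        (void) instance.getFooValue();
    }
}

} // namespace foo
```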
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp
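For reference, a header in the shape this rule describes; the SPDX/Apache-2.0 wording below is the form commonly seen in the repository, but treat the exact text and year as an assumption to verify against existing files:

```cpp
/*
 * SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 * SPDX-License-Identifier: Apache-2.0
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
```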
🧠 Learnings (1)
📚 Learning: 2025-08-08T22:03:40.685Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.685Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp
🔇 Additional comments (1)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp (1)
300-304: Guard looks correct and addresses the illegal-instruction root cause

Setting total_kv_seqlen to total_q_seqlen when it computes to 0 prevents zero-sized TMA descriptors. This is aligned with the PR goal and is the right place to apply the workaround before descriptor setup.
PR_Github #14750 [ run ] triggered by Bot

PR_Github #14750 [ run ] completed with state

/bot run

PR_Github #15050 [ run ] triggered by Bot

PR_Github #15050 [ run ] completed with state

pytest tests/unittest/trt/attention/test_gpt_attention.py occurs 'RuntimeError: CUDA error: an illegal instruction was encountered'
Summary by CodeRabbit