[None][fix] Fix a numerical stability issue for XQA with spec dec #7114

lowsfer · 2025-08-21T06:56:24Z

Summary by CodeRabbit

Bug Fixes
- Improved numerical stability in row-wise computations by adjusting initialization threshold to reduce overflow/underflow risk.
Refactor
- Tuned GPU kernel launch constraints in latency-optimized builds to adjust occupancy without changing external behavior.
Documentation
- Added detailed comments explaining exponentiation optimization and stability considerations.
Debugging
- Added conditional runtime logging of per-head row statistics in debug builds to aid troubleshooting.

coderabbitai · 2025-08-21T06:56:29Z

📝 Walkthrough

Walkthrough

Refines kernel launch bounds under an NDEBUG/OPTIMIZE_FOR_LATENCY path, adds gated debug prints for XV GEMM row statistics, and reduces the magnitude of a numerical sentinel constant (safeInitRowMax, from -1e+30F to -1e+5F) with explanatory comments. No public APIs or signatures changed.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Kernel launch bounds and debug instrumentation `cpp/kernels/xqa/mha_sm90.cu` | Under NDEBUG with OPTIMIZE_FOR_LATENCY, changes `kernel_mha`'s launch bounds to the single-parameter form `__launch_bounds__(128 * 3)`, removing the maxBlocksPerSM constraint in that branch. Adds conditional debug prints of `rowMax` and `rowSum` (gated by block/thread checks) in the XV GEMM path, followed by a barrier. No API/signature changes. |
| Utility constant and documentation `cpp/kernels/xqa/utils.cuh` | Changes `safeInitRowMax` from `-1e+30F` to `-1e+5F` and adds commentary explaining the exp(x-rowMax) optimization and the numerical stability rationale for the smaller-magnitude sentinel. No API/signature changes. |

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

[None][fix] Use safeInitRowMax instead of fp32_lowest to avoid NaN #7087 — Adjusts safeInitRowMax documentation/value and replaces fp32_lowest usages with safeInitRowMax, directly connected to the sentinel change here.

Suggested reviewers

jhaotingc

📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 859eef3 and 28dc2d6.

📒 Files selected for processing (2)

cpp/kernels/xqa/mha_sm90.cu (2 hunks)
cpp/kernels/xqa/utils.cuh (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

cpp/kernels/xqa/utils.cuh
cpp/kernels/xqa/mha_sm90.cu

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (6)

cpp/kernels/xqa/utils.cuh (3)
32-36: Tighten literal type and fix comment typo (‘log2d’ → ‘log2e’).

log2e is declared as float but initialized with a double literal; add the F suffix to avoid implicit narrowing.

The comment uses log2d, which is a typo and can confuse readers.

Apply this diff:
-inline constexpr float log2e = 1.4426950408889634; // std::log2(M_E)
+inline constexpr float log2e = 1.4426950408889634F; // std::log2(M_E)
-/*  bias = rowMax * log2d  // shared for the whole row
+/*  bias = rowMax * log2e  // shared for the whole row
     exp(x-rowMax) = exp(x * log2e - bias)
 */
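The identity in that comment can be checked on the host. The following is a minimal sketch, not the kernel code: `kLog2e`, `expShiftedReference`, and `expShiftedBase2` are illustrative names, and the tolerance used when comparing the two forms is an assumption.

```cpp
#include <cmath>

// Illustrative constant mirroring log2e from the diff above.
constexpr float kLog2e = 1.4426950408889634F;

// Reference form: natural exponential of the shifted logit.
inline float expShiftedReference(float x, float rowMax) {
    return std::exp(x - rowMax);
}

// Rewritten form: exp(x - rowMax) == exp2(x * log2e - rowMax * log2e).
// `bias` depends only on rowMax, so it can be computed once per row,
// leaving one multiply-subtract and one base-2 exponential per element.
inline float expShiftedBase2(float x, float rowMax) {
    float const bias = rowMax * kLog2e;
    return std::exp2(x * kLog2e - bias);
}
```

For moderate inputs the two functions agree to within float rounding, which is why the rewrite is attractive when `exp2f` is cheaper than `expf`.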
37-39: Value change to safeInitRowMax: good intent; suggest naming and documentation tweak.

Lowering the magnitude to -1e5F aligns with the stability note and should reduce FMA-induced issues. To match the guidelines (“Constants use kPREFIXED_UPPER_SNAKE”), introduce a guideline-conforming constant and keep the existing name as an alias for ABI/back-compat.

Apply this diff:
-// this reason, don't set a huge safeInitRowMax.
-inline constexpr float safeInitRowMax = -1e+5F;
+// This optimization can be numerically unstable when (x * log2e - bias) is fused into an FMA for large |x|.
+// To mitigate, avoid an excessively large sentinel for initialization.
+inline constexpr float kSAFE_INIT_ROW_MAX = -1e5F;
+inline constexpr float safeInitRowMax = kSAFE_INIT_ROW_MAX; // Backward-compatible alias.
Optionally, add a one-liner explaining the selection rationale, e.g., “Chosen to be far below plausible QK logits while avoiding extreme products with log2e.”
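One way to see why the sentinel's magnitude matters is to model the fused multiply-subtract explicitly on the host. This is a sketch under assumptions: `std::fma` stands in for the GPU's fused instruction, and the function names are hypothetical. For an element equal to the row maximum, the fused form returns the rounding error that was baked into `bias`, which is bounded by half a unit in the last place of `rowMax * log2e`: at most 2^-7 (about 0.0078) for -1e5F, but potentially as large as 2^76 for -1e30F.

```cpp
#include <cmath>

constexpr float kLog2e = 1.4426950408889634F;

// Exponent that exp2f would receive for an element equal to the row maximum
// when the multiply-subtract is fused. std::fma rounds only once, so the
// result is (exact product) - (rounded product), i.e. the rounding error of bias.
inline float fusedExponentAtMax(float rowMax) {
    float const bias = rowMax * kLog2e;      // rounded to float
    return std::fma(rowMax, kLog2e, -bias);  // exact product minus rounded bias
}

// Pre-subtracted alternative: the difference is exactly zero at the maximum,
// so the exponent is exactly zero regardless of the magnitude of rowMax.
inline float preSubtractedExponent(float x, float rowMax) {
    float const delta = x - rowMax;
    return delta * kLog2e;
}
```

With the smaller sentinel, `exp2f` of the fused exponent stays within roughly 0.5% of 1 for an element at the maximum; with the larger one, the same expression can in principle overflow to Inf or flush to zero.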

2116-2139: Avoid FMA fusion in softmax exponent to further improve stability.

Current form elem = exp2f(elem * log2e - bias) invites FMA. Pre-subtract then multiply to prevent fusion and lower catastrophic cancellation risk, while maintaining perf. Same change applies to both SWAP_AB and non-SWAP_AB variants.

Apply this diff for both warpGrpOnlineSoftmax() overloads:
-            float const maxVal = colMax[n][j];
-            float const bias = maxVal * log2e;
+            float const maxVal = colMax[n][j];
             ...
-                    float& elem = acc(m, n)(i, j);
-                    assert(maxVal >= elem);
-                    elem = exp2f(elem * log2e - bias);
+                    float& elem = acc(m, n)(i, j);
+                    assert(maxVal >= elem);
+                    float const delta = elem - maxVal;
+                    elem = exp2f(delta * log2e);
And similarly in the row-wise variant:
-            float const maxVal = rowMax[m][i];
-            float const bias = maxVal * log2e;
+            float const maxVal = rowMax[m][i];
             ...
-                    float& elem = acc(m, n)(i, j);
-                    assert(maxVal >= elem);
-                    elem = exp2f(elem * log2e - bias);
+                    float& elem = acc(m, n)(i, j);
+                    assert(maxVal >= elem);
+                    float const delta = elem - maxVal;
+                    elem = exp2f(delta * log2e);
Also applies to: 2440-2463
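As a host-side illustration of the pre-subtracted form suggested above, the sketch below applies it to a single row. It is a reference under assumptions, not the warp-group kernel: `softmaxRowPreSubtract` is a hypothetical name and the row is assumed to be non-empty and finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr float kLog2e = 1.4426950408889634F;

// Reference softmax over one row using delta = elem - maxVal before scaling.
inline std::vector<float> softmaxRowPreSubtract(std::vector<float> const& row) {
    assert(!row.empty());
    float const maxVal = *std::max_element(row.begin(), row.end());
    std::vector<float> out(row.size());
    float sum = 0.0F;
    for (std::size_t i = 0; i < row.size(); ++i) {
        assert(maxVal >= row[i]);
        float const delta = row[i] - maxVal;  // <= 0, exactly 0 at the maximum
        out[i] = std::exp2(delta * kLog2e);   // in [0, 1]; exactly 1 at the maximum
        sum += out[i];
    }
    for (float& p : out) {
        p /= sum;                             // sum >= 1 because the maximum maps to 1
    }
    return out;
}
```

Because the maximum always contributes exactly 1 to the sum, the normalization never divides by zero in this form, at the cost of one extra subtraction per element compared with the fused expression.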
cpp/kernels/xqa/mha_sm90.cu (3)
1094-1110: Debug prints for non-SWAP_AB row stats: looks good; minor gate suggestion.

Gating on blockIdx.y == 1 and threadIdx.x == 0 keeps noise down and the barrier maintains ordering. Consider switching to blockIdx.y == 0 to ensure a print even when nbSubSeq == 1; or add a short comment explaining why y == 1 is preferred.

2950-2954: Optional: clamp normalization denominator to avoid division by tiny sums in corner cases.

Extremely adversarial inputs could yield near-zero sums before adding attention sinks. A tiny clamp protects against denorm/Inf without impacting normal runs.

Example:
-    auto const regOutScale = __frcp_rn(regColSum) * xvoScale;
+    auto const regOutScale = __frcp_rn(max(regColSum, Vec<float, decltype(regColSum)::size>::filled(1e-20f))) * xvoScale;
Same consideration applies to the row-wise variant.
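The clamp can be sketched in scalar host code as follows. This is only an illustration: `safeOutScale` and `kMinRowSum` are hypothetical names, plain division stands in for the `__frcp_rn` intrinsic, and the floor of 1e-20F simply echoes the value in the example above.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical floor for the normalization denominator.
constexpr float kMinRowSum = 1e-20F;

// Scale applied to an output row: xvoScale / max(rowSum, floor).
// Without the floor, rowSum == 0 would yield Inf (or NaN once multiplied by 0).
inline float safeOutScale(float rowSum, float xvoScale) {
    return xvoScale / std::max(rowSum, kMinRowSum);
}
```

For ordinary sums the floor is never hit, so the result matches the unclamped reciprocal.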

840-846: Request targeted validation for NaN/Inf eradication on SPEC_DEC.

Given the PR goal, please run a stress A/B on SPEC_DEC with adversarial logits to confirm NaN/Inf are gone and that the safeInitRowMax tweak plus current exponent path holds.

If helpful, I can craft a focused harness that:

- Forces all-masked rows and very negative tiles to check softmax behavior.
- Sweeps large-magnitude pre-softmax values to check for NaN/Inf and overflow in exp2f paths.

Also applies to: 1094-1110
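A host-side version of such a harness might look like the sketch below. Everything in it is an assumption for illustration: masked elements are modeled as equal to the sentinel, `std::fma` stands in for the fused GPU instruction, and the names (`ExpForm`, `countNonFinite`, `sweepNonFinite`) are hypothetical. It does not exercise the actual SPEC_DEC kernel.

```cpp
#include <cmath>
#include <vector>

constexpr float kLog2e = 1.4426950408889634F;
constexpr float kSentinel = -1e5F;  // mirrors safeInitRowMax after this PR

enum class ExpForm { Fused, PreSubtract };

// exp(x - rowMax) computed either with a fused multiply-subtract or pre-subtracted.
inline float shiftedExp(float x, float rowMax, ExpForm form) {
    if (form == ExpForm::Fused) {
        float const bias = rowMax * kLog2e;
        return std::exp2(std::fma(x, kLog2e, -bias));
    }
    return std::exp2((x - rowMax) * kLog2e);
}

// Softmax one row with rowMax seeded from the sentinel; count NaN/Inf outputs.
inline int countNonFinite(std::vector<float> const& row, ExpForm form) {
    float rowMax = kSentinel;
    for (float v : row) rowMax = std::fmax(rowMax, v);
    std::vector<float> e;
    e.reserve(row.size());
    float sum = 0.0F;
    for (float v : row) {
        e.push_back(shiftedExp(v, rowMax, form));
        sum += e.back();
    }
    int bad = 0;
    for (float p : e) {
        if (!std::isfinite(p / sum)) ++bad;
    }
    return bad;
}

// One all-masked row, then rows mixing +m, -m, 0 and a masked slot for growing m.
inline int sweepNonFinite(ExpForm form, float maxMagnitude) {
    int bad = countNonFinite(std::vector<float>(8, kSentinel), form);
    for (float m = 1.0F; m <= maxMagnitude; m *= 10.0F) {
        bad += countNonFinite({m, -m, 0.0F, kSentinel}, form);
    }
    return bad;
}
```

A call such as `sweepNonFinite(ExpForm::PreSubtract, 1e30F)` returns the number of non-finite outputs seen across the sweep, so a nonzero value flags a regression in whichever exponent form is selected.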

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 2d40e87 and 859eef3.

📒 Files selected for processing (2)

cpp/kernels/xqa/mha_sm90.cu (2 hunks)
cpp/kernels/xqa/utils.cuh (1 hunks)

🧰 Additional context used

📓 Path-based instructions (5)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}