
Conversation

@nv-guomingz
Collaborator

@nv-guomingz nv-guomingz commented Sep 29, 2025

This PR addresses the comments from @yuxianq on #7892.

Summary by CodeRabbit

  • Bug Fixes
    • Corrected attention behavior with output gating to ensure proper dimensions and RoPE application.
    • Improved model variant detection in config parsing for more reliable behavior.
  • Performance
    • Tightened conditions for applying fused RoPE, using it only when compatible with gating and norm settings for more predictable performance and stability.
  • Chores
    • Reduced log noise by downgrading a warning to info when output gating is enabled.

@coderabbitai
Contributor

coderabbitai bot commented Sep 29, 2025

📝 Walkthrough

Walkthrough

Updates adjust attention module logging and QKV projection sizing when attn_output_gate is enabled, simplify RoPE application flow, refine RoPE fusion enablement in QK-Norm attention, and change a config utility to explicitly check attribute presence before comparison.

Changes

  • Attention module updates (tensorrt_llm/_torch/modules/attention.py): Change the log level to info when attn_output_gate is enabled; adjust the qkv_proj output dimension when attn_output_gate is true; simplify the apply_rope path by removing None-guards and unconditionally splitting QKV when RoPE isn't fused.
  • QK-Norm RoPE fusion gating (tensorrt_llm/_torch/modules/qk_norm_attention.py): Tighten the rope_fusion condition: disable fusion when attn_output_gate or use_gemma_rms_norm is true, when skip_rope is true, or when fuse_qk_norm_rope is false; update inline comments; no API changes.
  • Config attribute check (tensorrt_llm/_torch/pyexecutor/config_utils.py): is_qwen3_next now uses hasattr to ensure linear_key_head_dim exists before verifying it is > 0, instead of relying on getattr with a default.
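
For reference, the projection sizing described above works out to roughly the following (a minimal sketch using the sizes quoted in the review comments below, not the module's actual code):

def qkv_proj_out_features(q_size: int, kv_size: int, attn_output_gate: bool) -> int:
    # With output gating enabled, the fused projection also produces a gate
    # slice the same width as Q, so the output grows from q_size + 2 * kv_size
    # to 2 * q_size + 2 * kv_size.
    q_out = 2 * q_size if attn_output_gate else q_size
    return q_out + 2 * kv_size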

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Config
  participant A as Attention.__init__
  participant Q as QKNormRoPEAttention.__init__

  Note over A: Initialization decisions
  A->>A: if attn_output_gate\n set info log
  A->>A: compute qkv_proj size\n(adjust when gate=true)
  A->>A: apply_rope path simplified\n(unconditional split when not fused)

  Note over Q: RoPE fusion gating
  Q->>Q: rope_fusion = fuse_qk_norm_rope && !skip_rope && !attn_output_gate && !use_gemma_rms_norm
sequenceDiagram
  autonumber
  participant C as Config
  participant U as is_qwen3_next

  C->>U: query linear_key_head_dim
  alt attribute exists
    U->>U: return linear_key_head_dim > 0
  else attribute missing
    U->>U: return False
  end
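
In plain Python, the fusion gating shown in the first diagram amounts to something like this (a sketch based on the flags named in the walkthrough; the real attribute names and wiring may differ):

def should_fuse_qk_norm_rope(fuse_qk_norm_rope: bool, skip_rope: bool,
                             attn_output_gate: bool,
                             use_gemma_rms_norm: bool) -> bool:
    # Fused QK-Norm + RoPE is used only when fusion is requested, RoPE is not
    # skipped, and neither output gating nor Gemma RMSNorm is active.
    return (fuse_qk_norm_rope and not skip_rope and not attn_output_gate
            and not use_gemma_rms_norm)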

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (3 warnings)
  • Title Check ⚠️ Warning: The title "[None][chore] Refine qwen3-next implementation" focuses narrowly on the qwen3-next detection change and omits the broader updates to attention output gating, RoPE application, and fusion logic in multiple modules, so it does not accurately summarize the main changes. Resolution: revise the title to clearly reflect the key changes, for example "[None][chore] Refine attention gating, RoPE fusion logic, and qwen3-next detection."
  • Description Check ⚠️ Warning: The pull request description is a single informal sentence and does not follow the repository's template; it lacks the required "## Description" section explaining the what and why, the "## Test Coverage" section listing relevant tests, and the checklist items confirming coding guidelines and documentation updates. Resolution: expand the description with a "## Description" section detailing the problem and solution, a "## Test Coverage" section listing new or updated tests, and the completed PR checklist from the template.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit's advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit's agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

226-239: Critical: LoRA tensor size mismatch when attn_output_gate=True.

With gating on, qkv_proj emits 2*q_size + 2*kv_size features, but both splitted_qkv_lora and fused_qkv_lora remain sized for q_size + 2*kv_size, leading to a shape mismatch at runtime. Apply:

-        self.splitted_qkv_lora = LoraLayer([
-            LoraModuleType.ATTENTION_Q, LoraModuleType.ATTENTION_K,
-            LoraModuleType.ATTENTION_V
-        ], [self.q_size, self.kv_size, self.kv_size])
-        self.fused_qkv_lora = LoraLayer([LoraModuleType.ATTENTION_QKV],
-                                        [self.q_size + 2 * self.kv_size])
+        q_out_size_for_lora = (2 * self.q_size) if self.attn_output_gate else self.q_size
+        self.splitted_qkv_lora = LoraLayer(
+            [LoraModuleType.ATTENTION_Q, LoraModuleType.ATTENTION_K, LoraModuleType.ATTENTION_V],
+            [q_out_size_for_lora, self.kv_size, self.kv_size],
+        )
+        self.fused_qkv_lora = LoraLayer(
+            [LoraModuleType.ATTENTION_QKV],
+            [q_out_size_for_lora + 2 * self.kv_size],
+        )

Also verify that LoraLayer supports these fused sizes (or pad the extra gated-Q slice) to avoid runtime errors.
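
As a quick consistency check of the proposed sizes (a hypothetical helper, not part of the codebase):

def expected_lora_sizes(q_size: int, kv_size: int, attn_output_gate: bool):
    # Expected output widths for the split (Q, K, V) and fused (QKV) LoRA
    # layers; with gating, the Q slot also carries the gate, doubling its width.
    q_out = 2 * q_size if attn_output_gate else q_size
    split_sizes = [q_out, kv_size, kv_size]
    fused_size = q_out + 2 * kv_size
    assert fused_size == sum(split_sizes)
    return split_sizes, fused_size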

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/config_utils.py (1)

17-18: Make the check robust to non-numeric/None values to avoid TypeError.

If linear_key_head_dim exists but is None or non-numeric, the > 0 comparison will raise a TypeError. Guard the type before comparing.

Apply:

-def is_qwen3_next(config):
-    return hasattr(config,
-                   'linear_key_head_dim') and config.linear_key_head_dim > 0
+def is_qwen3_next(config):
+    val = getattr(config, 'linear_key_head_dim', None)
+    return isinstance(val, int) and val > 0
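
For illustration, the suggested variant would behave as follows on a few toy configs (SimpleNamespace stands in for a real model config):

from types import SimpleNamespace

def is_qwen3_next(config):
    val = getattr(config, 'linear_key_head_dim', None)
    return isinstance(val, int) and val > 0

assert not is_qwen3_next(SimpleNamespace())                          # attribute missing
assert not is_qwen3_next(SimpleNamespace(linear_key_head_dim=None))  # present but None
assert is_qwen3_next(SimpleNamespace(linear_key_head_dim=128))       # qwen3-next-style config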
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38d6e4e and 550d815.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/modules/attention.py (3 hunks)
  • tensorrt_llm/_torch/modules/qk_norm_attention.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config_utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/config_utils.py
  • tensorrt_llm/_torch/modules/qk_norm_attention.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/config_utils.py
  • tensorrt_llm/_torch/modules/qk_norm_attention.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/config_utils.py
  • tensorrt_llm/_torch/modules/qk_norm_attention.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/modules/attention.py (2)
tensorrt_llm/logger.py (1)
  • info_once (141-142)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_size (53-54)
tensorrt_llm/_torch/pyexecutor/config_utils.py (1)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • config (519-520)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/modules/attention.py (2)

180-181: Log level downgrade to info_once is fine.

Reduces unnecessary warnings while still surfacing a one-time signal.


584-588: Unconditional split_qkv in unfused RoPE path looks good; confirm no extra work in already-split cases.

split_qkv is a no-op when k/v are already provided, so this simplifies the flow without changing behavior.

Please sanity-check a gated configuration with unfused RoPE to make sure shapes flow through correctly: enable attn_output_gate=True, set rope_fusion=False, and run a short forward pass to confirm there are no extra allocations or regressions.
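
A minimal shape-level sketch of that check, using plain torch tensors (the Q/gate/K/V split order is only an assumption for illustration; the module may pack Q and the gate differently):

import torch

num_tokens, q_size, kv_size = 4, 2048, 512
# With attn_output_gate=True the fused QKV projection yields
# 2 * q_size + 2 * kv_size features per token.
qkv = torch.randn(num_tokens, 2 * q_size + 2 * kv_size)
q, gate, k, v = torch.split(qkv, [q_size, q_size, kv_size, kv_size], dim=-1)
assert q.shape == gate.shape == (num_tokens, q_size)
assert k.shape == v.shape == (num_tokens, kv_size)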

tensorrt_llm/_torch/modules/qk_norm_attention.py (1)

169-172: Correctly disable fused RoPE when gating or Gemma RMSNorm is active.

The consolidated condition aligns fused‑path constraints with upstream kernels.

Double‑check that when attn_output_gate=True, Attention.apply_rope handles the unfused path with split‑then‑norm‑then‑RoPE as intended in this subclass.
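
The expected unfused ordering is roughly the following (illustrative pseudocode; the helper callables are hypothetical placeholders, not the module's real API):

def apply_rope_unfused(qkv, split_qkv, q_norm, k_norm, rope):
    # Split first, then per-head QK normalization, then RoPE; the value tensor
    # is passed through untouched.
    q, k, v = split_qkv(qkv)
    q, k = q_norm(q), k_norm(k)
    q, k = rope(q, k)
    return q, k, v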

@yuxianq
Collaborator

yuxianq commented Sep 30, 2025

@nv-guomingz Please also address the following comments: #7892 (comment), #7892 (comment), #7892 (comment), #7892 (comment), thanks~

@nv-guomingz nv-guomingz force-pushed the user/guomingz/refine_qwen3_next_attn branch from 550d815 to a62e5bf Compare September 30, 2025 07:01
@nv-guomingz nv-guomingz requested a review from a team as a code owner September 30, 2025 07:01
@nv-guomingz nv-guomingz force-pushed the user/guomingz/refine_qwen3_next_attn branch from a62e5bf to 939ecdd Compare September 30, 2025 07:31
@nv-guomingz
Collaborator Author

@nv-guomingz Please also address the following comments: #7892 (comment), #7892 (comment), #7892 (comment), #7892 (comment), thanks~

All resolved.

@nv-guomingz nv-guomingz force-pushed the user/guomingz/refine_qwen3_next_attn branch from 939ecdd to 8b06ade Compare September 30, 2025 07:35
@nv-guomingz
Collaborator Author

/bot run

@nv-guomingz nv-guomingz requested a review from syuoni September 30, 2025 07:42
@tensorrt-cicd
Collaborator

PR_Github #20346 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20346 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15350 completed with status: 'FAILURE'

@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20370 [ run ] triggered by Bot

@nv-guomingz nv-guomingz enabled auto-merge (squash) September 30, 2025 12:59
@tensorrt-cicd
Collaborator

PR_Github #20370 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15366 completed with status: 'FAILURE'

@nv-guomingz nv-guomingz force-pushed the user/guomingz/refine_qwen3_next_attn branch from 8b06ade to 429bda9 Compare September 30, 2025 15:43
@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #20395 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #20395 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #15389 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

@nv-guomingz nv-guomingz merged commit b4be0d2 into NVIDIA:main Sep 30, 2025
5 checks passed
faradawn pushed a commit to faradawn/TensorRT-LLM that referenced this pull request Oct 2, 2025
evezhier pushed a commit to evezhier/TensorRT-LLM that referenced this pull request Oct 3, 2025
faradawn pushed a commit to faradawn/TensorRT-LLM that referenced this pull request Oct 3, 2025