
Conversation

@happierpig commented Sep 26, 2025

Purpose

  1. Optimize the sample_recovered_tokens_kernel implementation by unrolling the CTA over the vocab_size dimension.
  2. Add more parallelism by using more num_warps, which can help at small batch sizes (see the sketch below).
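
For context, here is a minimal sketch of both ideas, chunked scanning plus a larger num_warps at launch, using an illustrative Triton kernel; blocked_max_kernel and its arguments are hypothetical stand-ins, not the actual vLLM kernel:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def blocked_max_kernel(x_ptr, out_ptr, n, BLOCK_SIZE: tl.constexpr):
        # Scan the row in BLOCK_SIZE chunks instead of one giant block,
        # carrying a running maximum across iterations (the same unrolling
        # idea this PR applies over vocab_size).
        max_val = float("-inf")
        for off in range(0, n, BLOCK_SIZE):
            offsets = off + tl.arange(0, BLOCK_SIZE)
            x = tl.load(x_ptr + offsets, mask=offsets < n, other=float("-inf"))
            max_val = tl.maximum(max_val, tl.max(x, axis=-1))
        tl.store(out_ptr, max_val)


    x = torch.randn(1 << 20, device="cuda")
    out = torch.empty(1, device="cuda")
    # num_warps raises intra-CTA parallelism; with a small launch grid
    # (e.g. at small batch sizes) it helps keep the SM occupied.
    blocked_max_kernel[(1,)](x, out, x.numel(), BLOCK_SIZE=4096, num_warps=8)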

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request optimizes the sample_recovered_tokens_kernel Triton kernel by unrolling the computation over the vocabulary size. This is a good performance optimization for large vocabularies. The implementation of the online maximum calculation is correct. However, I've identified a performance issue where a value is redundantly loaded within a loop. My review includes a suggestion to fix this.
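
To make the "online maximum" concrete, here is a plain-Python sketch of the blockwise running argmax the unrolled kernel performs; blocked_argmax is an illustrative stand-in, not the vLLM source:

    import math

    def blocked_argmax(scores, block_size):
        # Carry a running (max_val, max_idx) pair and merge one block's
        # local max into it per iteration, as the kernel does with tl.where.
        max_val, max_idx = -math.inf, -1
        for off in range(0, len(scores), block_size):
            block = scores[off:off + block_size]
            local_val = max(block)
            local_idx = block.index(local_val) + off  # convert to global index
            if local_val > max_val:
                max_val, max_idx = local_val, local_idx
        return max_idx

    # Matches a full argmax regardless of the block size chosen.
    assert blocked_argmax([0.1, 0.9, 0.4, 0.7], block_size=2) == 1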

Comment on lines +602 to +640
    for off in range(0, vocab_size, BLOCK_SIZE):
        vocab_offset = off + tl.arange(0, BLOCK_SIZE)
        if NO_DRAFT_PROBS:
            draft_token_id = tl.load(draft_token_ids_ptr + start_idx + pos)
            prob = tl.load(
                target_probs_ptr + (start_idx + pos) * vocab_size +
                vocab_offset,
                mask=((vocab_offset < vocab_size) &
                      (vocab_offset != draft_token_id)),
                other=0,
            )
        else:
            draft_prob = tl.load(
                draft_probs_ptr + (start_idx + pos) * vocab_size +
                vocab_offset,
                mask=vocab_offset < vocab_size,
                other=0,
            )
            target_prob = tl.load(
                target_probs_ptr + (start_idx + pos) * vocab_size +
                vocab_offset,
                mask=vocab_offset < vocab_size,
                other=0,
            )
            prob = tl.maximum(target_prob - draft_prob, 0)

        q = tl.load(
            q_ptr + req_idx * vocab_size + vocab_offset,
            mask=vocab_offset < vocab_size,
            other=float("-inf"),
        )
        scores = prob / q
        local_val = tl.max(scores, axis=-1)
        local_idx = tl.argmax(scores, axis=-1) + off

        # update global max
        better = local_val > max_val
        max_val = tl.where(better, local_val, max_val)
        max_idx = tl.where(better, local_idx, max_idx)
Severity: high

For performance, draft_token_id should be loaded only once before the loop since its value doesn't change across iterations. Loading it inside the loop results in redundant reads from global memory, which can negatively impact kernel performance.

    if NO_DRAFT_PROBS:
        draft_token_id = tl.load(draft_token_ids_ptr + start_idx + pos)
    for off in range(0, vocab_size, BLOCK_SIZE):
        vocab_offset = off + tl.arange(0, BLOCK_SIZE)
        if NO_DRAFT_PROBS:
            prob = tl.load(
                target_probs_ptr + (start_idx + pos) * vocab_size +
                vocab_offset,
                mask=((vocab_offset < vocab_size) &
                      (vocab_offset != draft_token_id)),
                other=0,
            )
        else:
            draft_prob = tl.load(
                draft_probs_ptr + (start_idx + pos) * vocab_size +
                vocab_offset,
                mask=vocab_offset < vocab_size,
                other=0,
            )
            target_prob = tl.load(
                target_probs_ptr + (start_idx + pos) * vocab_size +
                vocab_offset,
                mask=vocab_offset < vocab_size,
                other=0,
            )
            prob = tl.maximum(target_prob - draft_prob, 0)

        q = tl.load(
            q_ptr + req_idx * vocab_size + vocab_offset,
            mask=vocab_offset < vocab_size,
            other=float("-inf"),
        )
        scores = prob / q
        local_val = tl.max(scores, axis=-1)
        local_idx = tl.argmax(scores, axis=-1) + off

        # update global max
        better = local_val > max_val
        max_val = tl.where(better, local_val, max_val)
        max_idx = tl.where(better, local_idx, max_idx)
