
Conversation

@benchislett
Collaborator

@benchislett benchislett commented Oct 15, 2025

Purpose

TRTLLM-gen kernels support full CUDA graphs, but they are only used with FlashInfer on Blackwell under certain conditions.
It might not be safe to change FlashInfer's cudagraph_support to UNIFORM_BATCH unconditionally, but we can still set it when we know the TRTLLM-gen backend will be used.

Also update the docs to reflect the FlashInfer and FlashInferMLA CUDA graph compatibility.

FIX #26856
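
In rough pseudocode, the idea looks like the sketch below. The enum subset mirrors AttentionCGSupport in vllm/v1/attention/backends/utils.py; the builder body, the default value shown, and the can_use_trtllm_attention flag are simplified stand-ins, not the actual FlashInfer backend code.

```python
from enum import Enum


class AttentionCGSupport(Enum):
    # Ordered subset of the real enum; a higher value means broader CUDA graph support.
    NEVER = 0
    UNIFORM_SINGLE_TOKEN_DECODE = 1
    UNIFORM_BATCH = 2
    ALWAYS = 3


class FlashInferMetadataBuilder:
    # Conservative class-level default (illustrative value).
    cudagraph_support = AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE

    def __init__(self, can_use_trtllm_attention: bool) -> None:
        # Opt in to uniform-batch full CUDA graphs only when we already know
        # the TRTLLM-gen kernels will be used; the instance attribute shadows
        # the class-level default for this builder only.
        if can_use_trtllm_attention:
            self.cudagraph_support = AttentionCGSupport.UNIFORM_BATCH
```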

Test Plan

Ran Llama 3.1 8B-Instruct with EAGLE3 and confirmed that the lm_eval gsm8k scores are unchanged compared to main and to a run with TRTLLM attention force-disabled. Confirmed via a torch profile that full graphs are now issued for verification when TRTLLM attention is enabled.

Test Result

TRTLLM on:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

TRTLLM off:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

Benchmarks

MT-Bench at concurrency 1 shows a small speedup (~2%).

vllm serve meta-llama/Llama-3.1-8B-Instruct --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}' &

vllm bench serve --dataset-name hf --dataset-path philschmid/mt-bench --num-prompts 80 --max-concurrency 1 --model meta-llama/Llama-3.1-8B-Instruct --base-url http://0.0.0.0:8049

Before:

============ Serving Benchmark Result ============
Successful requests:                     80        
Maximum request concurrency:             1         
Benchmark duration (s):                  42.58     
Total input tokens:                      8133      
Total generated tokens:                  16955     
Request throughput (req/s):              1.88      
Output token throughput (tok/s):         398.23    
Peak output token throughput (tok/s):    186.00    
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          589.25    
---------------Time to First Token----------------
Mean TTFT (ms):                          12.11     
Median TTFT (ms):                        11.88     
P99 TTFT (ms):                           14.29     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.45      
Median TPOT (ms):                        2.45      
P99 TPOT (ms):                           3.33      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.36      
Median ITL (ms):                         5.36      
P99 ITL (ms):                            5.61      
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     80        
Maximum request concurrency:             1         
Benchmark duration (s):                  41.73     
Total input tokens:                      8133      
Total generated tokens:                  16795     
Request throughput (req/s):              1.92      
Output token throughput (tok/s):         402.47    
Peak output token throughput (tok/s):    190.00    
Peak concurrent requests:                4.00      
Total Token throughput (tok/s):          597.37    
---------------Time to First Token----------------
Mean TTFT (ms):                          11.86     
Median TTFT (ms):                        11.75     
P99 TTFT (ms):                           14.93     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.43      
Median TPOT (ms):                        2.37      
P99 TPOT (ms):                           3.39      
---------------Inter-token Latency----------------
Mean ITL (ms):                           5.26      
Median ITL (ms):                         5.25      
P99 ITL (ms):                            5.48      
==================================================

@benchislett benchislett requested a review from mgoin as a code owner October 15, 2025 19:12
@mergify

mergify bot commented Oct 15, 2025

Documentation preview: https://vllm--26937.org.readthedocs.build/en/26937/

@mergify mergify bot added the documentation and v1 labels Oct 15, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request enables full CUDA graphs for speculative decoding with FlashInfer when TRT-LLM attention kernels are available, which is a valuable performance enhancement. The implementation correctly updates the cudagraph_support attribute in FlashInferMetadataBuilder at runtime based on whether TRT-LLM attention can be used. The change from a class variable to an instance variable for cudagraph_support is appropriate for this dynamic behavior. The documentation has also been updated to reflect these changes. The logic appears sound and the provided test results indicate that correctness is maintained while enabling this optimization.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@vadiklyutiy
Collaborator

Regarding the performance improvement:
I tried this on Qwen3-Next with 2 prediction tokens.
With batch=1 it improves from 92 tok/s to 222 tok/s.

@mgoin
Member

mgoin commented Oct 15, 2025

cc @LucasWilkinson @ProExpertProg regarding updating AttentionCGSupport dynamically

@LucasWilkinson
Collaborator

LucasWilkinson commented Oct 16, 2025

> cc @LucasWilkinson @ProExpertProg regarding updating AttentionCGSupport dynamically

Dynamically updating it should be fine, since the only place we consult it is on builder instances, here:

if builder.cudagraph_support.value < min_cg_support.value:

But if we are going to dynamically update it, I think we should make it an instance property instead of a class variable, just to avoid confusion and future bugs.
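
To make the distinction concrete (plain Python, not vLLM code): assigning through the class would change the value seen by every builder, while assigning on the instance only shadows the default for that one builder.

```python
class Builder:
    cudagraph_support = "UNIFORM_SINGLE_TOKEN_DECODE"  # class-level default


a, b = Builder(), Builder()

# Updating through the class changes what every instance sees.
Builder.cudagraph_support = "UNIFORM_BATCH"
assert a.cudagraph_support == b.cudagraph_support == "UNIFORM_BATCH"

# Restore the default, then shadow it on a single instance instead.
Builder.cudagraph_support = "UNIFORM_SINGLE_TOKEN_DECODE"
a.cudagraph_support = "UNIFORM_BATCH"
assert b.cudagraph_support == "UNIFORM_SINGLE_TOKEN_DECODE"
```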

@mergify mergify bot added the rocm Related to AMD ROCm label Oct 21, 2025
Collaborator

@LucasWilkinson LucasWilkinson left a comment

LGTM; there are some nits that should be addressed (specifically, for the CPU backend I think we should still keep reorder_batch_threshold = 1).

It is a bit harder to see where cudagraph_support is set now :/ I guess the alternative would be to use a function, i.e. add a get_cudagraph_support() function in the base class (I think the current implementation is better, but I'm also flip-flopping haha).


 class TorchSDPAMetadataBuilderV1(AttentionMetadataBuilder[TorchSDPAMetadata]):
-    reorder_batch_threshold: int = 1
+    reorder_batch_threshold: int

Collaborator

nit: I still think this needs to be set?

Collaborator Author

It is set in the constructor: _init_reorder_batch_threshold(1, False)

The type annotation is kept to indicate that it will never be None on this class and its subclasses. This is a common pattern across the changes in this PR.
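
A rough sketch of the pattern being described, with the base class and helper heavily simplified (the real _init_reorder_batch_threshold signature in vLLM may differ):

```python
from typing import Optional


class AttentionMetadataBuilder:
    # In the base class, "no batch reordering" is a valid state.
    reorder_batch_threshold: Optional[int] = None

    def _init_reorder_batch_threshold(
        self, reorder_batch_threshold: int, supports_spec_as_decode: bool
    ) -> None:
        # Simplified: the real helper also accounts for speculative decoding.
        self.reorder_batch_threshold = reorder_batch_threshold


class TorchSDPAMetadataBuilderV1(AttentionMetadataBuilder):
    # Annotation only, no value: documents that on this class (and its
    # subclasses) the attribute is always an int, never None.
    reorder_batch_threshold: int

    def __init__(self) -> None:
        self._init_reorder_batch_threshold(1, False)
```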

     AttentionMetadataBuilder[XFormersAttentionMetadata]
 ):
-    reorder_batch_threshold: int = 1
+    reorder_batch_threshold: int

Collaborator

nit: is this still needed?

Collaborator Author

Yes, as a type annotation; see the previous comment.

 )

-    reorder_batch_threshold: int = 1
+    reorder_batch_threshold: int

Collaborator

nit: is this still needed?

Collaborator Author

Yes, as a type annotation; see the previous comment.

@fhl2000
Contributor

fhl2000 commented Nov 2, 2025

Hi @benchislett, it is no longer safe to dynamically update cudagraph_support inside __init__() since #27427 was merged, because we now resolve the cudagraph mode (which requires cudagraph_support) before actually initializing the builder instance. So making it an instance property is not a good idea. Instead, I would add a class method get_cudagraph_support() for this.

@benchislett
Collaborator Author

@fhl2000 that breaks this PR pretty firmly. The main idea for enabling full CUDA graphs for FlashInfer is to opt in dynamically based on whether the TRTLLM kernels can be used, which depends on a number of parameters, some of which are specific to the actual model architecture. Do you see an easy way around this?

@fhl2000
Contributor

fhl2000 commented Nov 11, 2025

The right logic should be: determine cudagraph_support for each backend (builder class) -> resolve cudagraph mode -> initialize the cudagraph-related state of each backend.

Since the cudagraph_support of a specific backend is fixed after its initialization, can we extract the logic that determines cudagraph_support into a static method? I think passing it the same arguments as builder_class.__init__() is enough.

An alternative would be to delay the "initialize cudagraph-related state of each backend" step until after backend initialization but before the first build() call (perhaps triggered in build_for_cudagraph_capturing), so the flow becomes: initialize each backend (also determining cudagraph support there) -> resolve cudagraph mode -> trigger cudagraph initialization in build_for_cudagraph_capturing.

I think the first option is easier, but let's see if you and @LucasWilkinson have other concerns about it.
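
A rough sketch of the first option, reusing the stand-in AttentionCGSupport enum from the sketch in the PR description; the method name comes from the discussion above, while the argument list and the trtllm_gen_is_usable helper are assumptions, not the actual #28479 API.

```python
def trtllm_gen_is_usable(vllm_config, kv_cache_spec) -> bool:
    # Hypothetical placeholder for the FlashInfer backend's TRTLLM-gen
    # eligibility check (GPU architecture, dtype, head size, ...).
    return False


class FlashInferMetadataBuilder:
    @classmethod
    def get_cudagraph_support(cls, vllm_config, kv_cache_spec) -> "AttentionCGSupport":
        # Decided from configuration alone, before any builder instance
        # exists, so the cudagraph mode can be resolved up front.
        if trtllm_gen_is_usable(vllm_config, kv_cache_spec):
            return AttentionCGSupport.UNIFORM_BATCH
        return AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
```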

@mergify

mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2025
@benchislett
Collaborator Author

Closing the PR for now while I work on a refactor to fix up cudagraph_support.

@benchislett
Collaborator Author

@fhl2000 @LucasWilkinson I took another stab at this in #28479, following @fhl2000's suggestion. I think this will work well.

I omitted the refactoring of reorder_batch_threshold to simplify the diff; it can be added in a follow-up PR if still desired.


Labels

documentation (Improvements or additions to documentation), needs-rebase, nvidia, rocm (Related to AMD ROCm), v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Performance]: FlashInfer attn backend. Use dynamic AttentionCGSupport

5 participants