Conversation

@aurickq aurickq commented Sep 26, 2025

Purpose

This PR adds Suffix Decoding (https://arxiv.org/abs/2411.04975) as a new speculative decoding method in vLLM. Suffix Decoding is a dynamic n-gram matching method that:

  1. Uses suffix trees with branch frequency counts to generate speculative tokens quickly (a toy sketch of this idea follows the list).
  2. Can keep a history of prior model responses, which tends to work very well with repetitive agentic use cases.
  3. Can be dynamically updated with newly generated tokens, with FIFO eviction of older requests.
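
A minimal, self-contained sketch of the core matching idea (illustrative only; this is not the arctic-inference implementation or its API, and all names are hypothetical):

```python
from collections import defaultdict


class ToySuffixSpeculator:
    """Toy illustration: index suffixes of previously seen tokens with
    frequency counts, then greedily draft the most frequent continuation."""

    def __init__(self, max_suffix_len: int = 4):
        self.max_suffix_len = max_suffix_len
        # suffix (tuple of token ids) -> {next_token_id: count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens: list[int]) -> None:
        """Record which token followed each suffix of length 1..max_suffix_len."""
        for i in range(1, len(tokens)):
            for n in range(1, min(self.max_suffix_len, i) + 1):
                suffix = tuple(tokens[i - n:i])
                self.counts[suffix][tokens[i]] += 1

    def propose(self, context: list[int], max_spec: int) -> list[int]:
        """Draft up to max_spec tokens; stop early when no suffix matches."""
        context = list(context)
        draft: list[int] = []
        for _ in range(max_spec):
            next_tok = None
            # Prefer the longest (most specific) matching suffix.
            for n in range(min(self.max_suffix_len, len(context)), 0, -1):
                followers = self.counts.get(tuple(context[-n:]))
                if followers:
                    next_tok = max(followers, key=followers.get)
                    break
            if next_tok is None:
                break  # nothing probable to speculate; draft fewer tokens
            draft.append(next_tok)
            context.append(next_tok)
        return draft


# A repeating pattern becomes speculable after it has been seen.
spec = ToySuffixSpeculator()
spec.update([1, 2, 3, 4, 1, 2, 3, 4, 1, 2])
print(spec.propose([1, 2], max_spec=4))  # -> [3, 4, 1, 2]
```

The actual method additionally shares such a structure across requests (the global cache with FIFO eviction), so repetition from earlier responses can be drafted as well.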

Test Plan

  • Benchmark Suffix Decoding against the current ngram speculator.
  • Write and run unit tests
  • Documentation

Test Result

Benchmarks on SpecBench and Aider-AI/refactor-benchmark are below. Suffix Decoding beats ngram in the majority of cases. In practice, we have seen larger speedups for real user interactions and agentic requests, since they tend to exhibit more output repetition than these benchmark datasets.

refactor-bench (out=1024)

Results are mean TPOT (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 2.15 | 3.68 | 9.02 | 26.64 |
| suffix (w/ cache) | 12 | 1.91 | 3.36 | 8.56 | 26.32 |
| suffix (w/ cache) | 32 | 1.81 | 3.22 | 8.58 | 26.78 |
| suffix (w/o cache) | 5 | 2.35 | 3.92 | 9.2 | 26.78 |
| suffix (w/o cache) | 12 | 2.13 | 3.65 | 8.92 | 26.68 |
| suffix (w/o cache) | 32 | 2.04 | 3.56 | 8.98 | 27.77 |
| ngram | 5 | 2.99 | 4.7 | 10.41 | 28.62 |
| ngram | 12 | 2.68 | 4.41 | 9.85 | 28.66 |
| ngram | 32 | 2.58 | 4.32 | 10.57 | 32.63 |

spec-bench (out=256)

Results are mean TPOT (ms)

| method | spec_len | concurrency 1 | concurrency 4 | concurrency 16 | concurrency 64 |
|---|---|---|---|---|---|
| suffix (w/ cache) | 5 | 4.27 | 4.67 | 6.17 | 12.03 |
| suffix (w/ cache) | 12 | 4.26 | 4.71 | 6.2 | 12.11 |
| suffix (w/ cache) | 32 | 4.28 | 4.73 | 6.17 | 12.27 |
| suffix (w/o cache) | 5 | 4.63 | 5.09 | 6.38 | 11.68 |
| suffix (w/o cache) | 12 | 4.63 | 5.1 | 6.37 | 11.62 |
| suffix (w/o cache) | 32 | 4.62 | 5.06 | 6.35 | 11.66 |
| ngram | 5 | 5.38 | 5.7 | 6.77 | 10.98 |
| ngram | 12 | 5.37 | 5.67 | 6.76 | 10.99 |
| ngram | 32 | 5.37 | 5.73 | 6.87 | 11.76 |

mergify bot commented Sep 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aurickq.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 26, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request integrates Suffix Decoding from Arctic Inference as a new speculative decoding method. The changes are well-structured, adding new configuration options, validation, and the core logic for proposing draft tokens and managing the suffix cache. My review identifies a potential type inconsistency in the token sequences passed to the arctic-inference library, which could lead to runtime errors. I've suggested a fix to ensure consistency.

@simon-mo
Collaborator

@codex review

@simon-mo
Collaborator

note to reviewers:

  • We discussed with the Snowflake team that importing from arctic-inference is an acceptable path forward, and the team is committed to maintaining it as a separate library.
  • Please focus on code quality, interfaces, UX, etc.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@keyboardAnt

keyboardAnt commented Sep 26, 2025

@aurickq, thanks for your awesome contribution, the results look good!

Suffix decoding outperforms n-gram at out=1024, but falls behind at out=256 with concurrency=64 (+5.8% in the best case). Any idea why?

@aurickq
Contributor Author

aurickq commented Sep 28, 2025

> @aurickq, thanks for your awesome contribution, the results look good!
>
> Suffix decoding outperforms n-gram at out=1024, but falls behind at out=256 with concurrency=64 (+5.8% in the best case). Any idea why?

The out=1024 and out=256 runs also use two different datasets, so they might not be directly comparable. Other than that, when the concurrency is high and the number of output tokens is low (e.g. 256), request completion time becomes dominated by mixed-prefill batches that drag up the mean TPOT metric, so it makes sense that in these cases the performance of suffix and ngram approaches each other.
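
A toy illustration of this averaging effect (all numbers are invented for illustration, not measurements):

```python
# Hypothetical numbers only: a fixed per-token wait caused by co-scheduled
# prefill work dilutes the relative TPOT gap between two speculators.
pure_decode_tpot_ms = {"ngram": 5.0, "suffix": 3.0}  # assumed low-concurrency TPOT
prefill_wait_ms = 8.0  # assumed extra per-token wait from mixed-prefill batches

for method, tpot in pure_decode_tpot_ms.items():
    print(f"{method}: {tpot:.1f} ms alone, {tpot + prefill_wait_ms:.1f} ms under load")

# Relative gap: (5 - 3) / 5 = 40% alone, but (13 - 11) / 13 ~= 15% under load.
```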

As for why suffix becomes a little worse than ngram for spec_bench at out=256 and concurrency=64, here is my guess: the SpecBench dataset is more open-ended (higher entropy, less repetition) than refactor-benchmark, so we would already expect suffix/ngram to perform worse on it. The benchmark is also small (400-500 examples), so suffix decoding might not have built a sufficiently large cache to accurately predict the next tokens. In the benchmarks above, suffix decoding actually performs better when the cache is disabled in this setting.

I have some ideas for solving this latter issue when the cached data is sparse, which I might later implement and contribute as a "suffix v2" method, if it works.

@Neo9061

Neo9061 commented Sep 29, 2025

Thanks a lot for the contribution @aurickq ! A few questions.

  1. In your benchmarking, when the cache is enabled, is it referring to the global tree? What training data are you using to construct the global tree?
  2. Can we enable an option to make the global tree static, built from offline training data? As explained in the other thread, this would be very useful for multi-tenant requests: Plan to merge Suffix decoding into vLLM mainline? snowflakedb/ArcticInference#171 (comment)
  3. Can your PR work with the hybrid PR [Spec Decode][Hybrid] Add ngram-eagle SD method #24344, which enables n-gram and EAGLE, so that we can hybridize suffix decoding and EAGLE?
  4. For the comparison between suffix decoding w/o cache and n-gram, what do you think is the reason that suffix decoding w/o cache works better than n-gram? In my understanding, they are almost equivalent when suffix decoding does not use the global cache. One reason I can think of is the dynamic drafting length that suffix decoding has over n-gram.

@aurickq
Contributor Author

aurickq commented Sep 29, 2025

@Neo9061

  1. "w/ cache" means using the global suffix tree, and "w/o cache" means not using the global suffix tree (setting suffix_decoding_max_cached_requests = 0. The per-prompt suffix trees are used in both cases. In these benchmarks, the only requests being cached are the earlier requests in the same benchmark. The performance would probably be much better in a more realistic setting when more requests can be cached over a longer period of time.
  2. I think this is a good idea, but I would like to address this in a follow-up PR once the core suffix speculation is enabled. It could use more input from the community on interface design, like what's the best format to read the "static" cache.
  3. The current PR doesn't consider hybrid speculation yet, would also be good to add in the future.
  4. Yeah they are "almost" equivalent except for suffix decoding's frequency stats and scoring mechanism. For each speculation length, suffix decoding can speculate up to that many tokens but can also speculate less if there is no probable continuation to save on verification costs. It also means that out of several possible continuations, suffix decoding can choose the most "frequent" one to maximize the probability of acceptance.
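
Hypothetical sketch of that frequency-based scoring and early stopping (the per-step counts stand in for the suffix-tree lookups; this is not the arctic-inference code):

```python
def choose_draft(step_counts: list[dict[int, int]],
                 min_freq: float = 0.5,
                 max_spec: int = 12) -> list[int]:
    """Greedy frequency-based drafting: take the most frequent continuation
    at each step, and stop as soon as no continuation is probable enough,
    trading draft length for a higher chance of acceptance."""
    draft: list[int] = []
    for counts in step_counts[:max_spec]:
        if not counts:
            break
        token, count = max(counts.items(), key=lambda kv: kv[1])
        if count / sum(counts.values()) < min_freq:
            break  # ambiguous continuation: save verification cost instead
        draft.append(token)
    return draft


# Step 1: token 7 followed the matched suffix 9 out of 10 times -> draft it.
# Step 2: no continuation dominates (4/10 at best) -> stop after one token.
print(choose_draft([{7: 9, 8: 1}, {3: 4, 4: 4, 5: 2}]))  # -> [7]
```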

@mergify mergify bot added the ci/build label Sep 29, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Sep 30, 2025
@mergify mergify bot removed the needs-rebase label Sep 30, 2025
@aurickq
Contributor Author

aurickq commented Sep 30, 2025

Finished up all the TODOs, ready for reviews.

mergify bot commented Sep 30, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @aurickq.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 30, 2025
@mergify mergify bot removed the needs-rebase label Sep 30, 2025
@aurickq aurickq changed the title [Misc] Integrate Suffix Decoding from Arctic Inference [Spec Decode] Integrate Suffix Decoding from Arctic Inference Sep 30, 2025
@ekagra-ranjan
Contributor

ekagra-ranjan commented Oct 1, 2025

Thanks for the PR @aurickq !

  1. Can you also report the AL of the approaches in the PR description? It gives a sense of how much better an approach is assuming zero SD overhead. Plus, AL as a metric can be compared across different hardware.
  2. Can you also include a benchmark on MT-Bench at BS 1/4, since it's the most widely used in vLLM? That makes it easier to compare benchmarks across the different SD methods being tracked here.
  3. Regarding the w/ cache case, you mentioned that in these benchmarks the only requests being cached are the earlier requests in the same benchmark. Would that mean the generations are already present in the global cache and the results are overly optimistic? In real cases, this won't always be the case, right?
  4. Can you share the commands to run the benchmark, for reproducibility?

@aurickq
Contributor Author

aurickq commented Oct 1, 2025

@ekagra-ranjan

  1. I am unsure how to get the AL stats from vLLM, but we have published detailed breakdowns in Appendix A.1.2 of the paper https://arxiv.org/pdf/2411.04975.
  2. Actually, a large fraction of spec-bench comes from mt-bench, so their performance should be similar. The paper also has a sub-task breakdown of spec-bench so you can see exactly the ones from mt-bench. If there are specific configs you are interested in, I can try to run those.
  3. Do you mean whether the cache is warmed up before benchmarking? If so, the answer is no: the actual benchmark is the first time those requests are ever seen. For the "w/ cache" experiments, the vLLM server is started from scratch for each column before running the benchmark. I'd guess this is more "pessimistic", since in our experience many practical use cases actually have high repetition across requests (agent loops, code editing, RL rollouts, etc.).
  4. Sure, the commands are pretty straightforward, e.g.
vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --no-enable-prefix-caching --speculative-config '{"method": "suffix", "num_speculative_tokens": 12, "suffix_decoding_max_cached_requests": 1000}'

Set "suffix_decoding_max_cached_requests": 0 to disable the global cache.

On the benchmark side:

vllm bench serve --model meta-llama/Llama-3.1-8B-Instruct --dataset-name spec_bench --max-concurrency ... --no-oversample

It's important to include --no-oversample to avoid re-running already-cached requests. Also, restart the server between every w/ cache experiment so that each one starts with an empty cache.
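
For completeness, the equivalent offline usage should look roughly like the sketch below, reusing the same config keys as the serve command above (untested against this PR; the suffix-specific keys may change before merge):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "suffix",
        "num_speculative_tokens": 12,
        # Set to 0 to disable the global (cross-request) suffix cache.
        "suffix_decoding_max_cached_requests": 1000,
    },
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Refactor the following function ..."], params)
print(outputs[0].outputs[0].text)
```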

@Neo9061

Neo9061 commented Oct 1, 2025

> @Neo9061
>
>   1. "w/ cache" means using the global suffix tree, and "w/o cache" means not using the global suffix tree (setting suffix_decoding_max_cached_requests = 0). The per-prompt suffix trees are used in both cases. In these benchmarks, the only requests being cached are the earlier requests in the same benchmark. The performance would probably be much better in a more realistic setting where more requests can be cached over a longer period of time.
>   2. I think this is a good idea, but I would like to address it in a follow-up PR once the core suffix speculation is enabled. It could use more input from the community on interface design, like what the best format is to read the "static" cache from.
>   3. The current PR doesn't consider hybrid speculation yet; it would also be good to add in the future.
>   4. Yeah, they are "almost" equivalent except for suffix decoding's frequency stats and scoring mechanism. For each speculation length, suffix decoding can speculate up to that many tokens but can also speculate fewer if there is no probable continuation, to save on verification costs. It also means that out of several possible continuations, suffix decoding can choose the most "frequent" one to maximize the probability of acceptance.

Thanks @aurickq ! Do you plan to open a follow-up PR to address the static global tree soon after this one is merged? Asking since this is a major bottleneck for multi-tenant serving.
