[Experimental][Spec Decode] Port suffix decoding from ArcticInference to vLLM main #24852

zixi-qi · 2025-09-15T05:58:05Z

Purpose

Port the suffix decoding implementation from ArcticInference (https://github.com/snowflakedb/ArcticInference) to vLLM main, so that suffix decoding can be tested without depending on ArcticInference.

Test Plan

Run the e2e and unit tests for suffix-decoding-based speculative decoding.

Test Result

E2E test

suffix decode

VLLM_USE_V1=1 python examples/offline_inference/spec_decode.py --num_spec_tokens 1 --num_prompts 80 --dataset-name hf --dataset-path philschmid/mt-bench --method suffix

Adding requests: 100%|██████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 9848.96it/s]
Processed prompts: 100%|████████| 80/80 [00:04<00:00, 17.43it/s, est. speed input: 1755.10 toks/s, output: 3703.51 toks/s]
--------------------------------------------------
total_num_output_tokens: 16993
num_drafts: 1986
num_draft_tokens: 1986
num_accepted_tokens: 1226
mean acceptance length: 1.62
--------------------------------------------------
acceptance at token 0: 0.62

ngram, for comparison

VLLM_USE_V1=1 python examples/offline_inference/spec_decode.py --num_spec_tokens 1 --num_prompts 80 --dataset-name hf --dataset-path philschmid/mt-bench --method ngram

Adding requests: 100%|██████████████████████████████████████████████████████████████████| 80/80 [00:00<00:00, 9968.34it/s]
Processed prompts: 100%|████████| 80/80 [00:04<00:00, 18.30it/s, est. speed input: 1842.36 toks/s, output: 3915.31 toks/s]
--------------------------------------------------
total_num_output_tokens: 17114
num_drafts: 3426
num_draft_tokens: 3426
num_accepted_tokens: 1696
mean acceptance length: 1.50
--------------------------------------------------
acceptance at token 0: 0.50
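
For reference when reading the two result blocks: the reported figures are consistent with a mean acceptance length of 1 + num_accepted_tokens / num_drafts and, since --num_spec_tokens is 1, with an acceptance rate at token 0 of num_accepted_tokens / num_drafts. That relationship is inferred from the numbers above, not taken from the script itself, and the helper name below is hypothetical. A minimal sketch that recomputes both runs:

```python
# Recompute the reported speculative-decoding metrics from the raw counters.
# Assumption: mean acceptance length = 1 + accepted / drafts, which matches
# the logged values above when each draft contains exactly one token.

def acceptance_stats(num_drafts: int, num_accepted_tokens: int) -> tuple[float, float]:
    """Return (mean acceptance length, acceptance rate at token 0)."""
    rate = num_accepted_tokens / num_drafts
    return 1 + rate, rate

# Counters copied from the suffix and ngram runs above.
suffix_len, suffix_rate = acceptance_stats(num_drafts=1986, num_accepted_tokens=1226)
ngram_len, ngram_rate = acceptance_stats(num_drafts=3426, num_accepted_tokens=1696)

print(f"suffix: mean acceptance length {suffix_len:.2f}, token-0 acceptance {suffix_rate:.2f}")
print(f"ngram:  mean acceptance length {ngram_len:.2f}, token-0 acceptance {ngram_rate:.2f}")
```

In this run suffix decoding proposes fewer drafts than ngram (1986 vs. 3426) with a higher per-draft acceptance rate, while ngram has more accepted tokens in total (1696 vs. 1226).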

Unit test

pytest tests/v1/e2e/test_spec_decode.py -k suffix -v

tests/v1/e2e/test_spec_decode.py::test_suffix_correctness PASSED [ 25%]
tests/v1/e2e/test_spec_decode.py::test_suffix_with_configs[suffix_config0] PASSED [ 50%]
tests/v1/e2e/test_spec_decode.py::test_suffix_with_configs[suffix_config1] PASSED [ 75%]
tests/v1/e2e/test_spec_decode.py::test_suffix_with_configs[suffix_config2] PASSED

pytest tests/v1/spec_decode/test_suffix_tree_cpp.py -v

tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_basic_operations PASSED [ 11%]
tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_append_operations PASSED [ 22%]
tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_multiple_sequences PASSED [ 33%]
tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_speculation_parameters PASSED [ 44%]
tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_integrity_check PASSED [ 55%]
tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_memory_estimation PASSED [ 66%]
tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_empty_sequences PASSED [ 77%]
tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_large_sequences PASSED [ 88%]
tests/v1/spec_decode/test_suffix_tree_cpp.py::TestSuffixTreeCpp::test_tree_vs_path_speculation PASSED

pytest tests/v1/spec_decode/test_suffix_cache.py -v

tests/v1/spec_decode/test_suffix_cache.py::TestSuffixCache::test_basic_operations PASSED [ 12%]
tests/v1/spec_decode/test_suffix_cache.py::TestSuffixCache::test_multiple_requests PASSED [ 25%]
tests/v1/spec_decode/test_suffix_cache.py::TestSuffixCache::test_cache_eviction PASSED [ 37%]
tests/v1/spec_decode/test_suffix_cache.py::TestSuffixCache::test_pattern_matching PASSED [ 50%]
tests/v1/spec_decode/test_suffix_cache.py::TestSuffixCache::test_empty_patterns PASSED [ 62%]
tests/v1/spec_decode/test_suffix_cache.py::TestSuffixCache::test_invalid_operations PASSED [ 75%]
tests/v1/spec_decode/test_suffix_cache.py::TestSuffixCache::test_max_depth_handling PASSED [ 87%]
tests/v1/spec_decode/test_suffix_cache.py::TestSuffixCache::test_speculation_parameters PASSED [100%]

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

pytorch-bot · 2025-09-15T05:58:44Z

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

mergify · 2025-09-17T04:10:37Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zixi-qi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

zixi-qi · 2025-09-30T19:25:20Z

Official implementation added in #25784

mergify bot added the documentation, ci/build, speculative-decoding, and v1 labels Sep 15, 2025

zixi-qi closed this Sep 15, 2025

zixi-qi reopened this Sep 16, 2025

zixi-qi force-pushed the suffix-decoding branch from 22ada5d to 854b0bc Compare September 16, 2025 05:33

mergify bot added the needs-rebase label Sep 17, 2025

zixi-qi force-pushed the suffix-decoding branch from 854b0bc to 1dd5038 Compare September 29, 2025 22:40

mergify bot removed the needs-rebase label Sep 29, 2025

zixi-qi force-pushed the suffix-decoding branch from 1dd5038 to 911a5a3 Compare September 29, 2025 22:41

zixi-qi added 3 commits September 29, 2025 15:41

port suffix decoding from ArcticInference to vLLM main

6610917

Signed-off-by: zixi-qi <[email protected]> Signed-off-by: qizixi <[email protected]> Signed-off-by: zixi-qi <[email protected]>

use torch custom op instead of pybind

db0a01b

Signed-off-by: qizixi <[email protected]> Signed-off-by: zixi-qi <[email protected]>

Use torch custom op instead of pybind

a8586c2

Signed-off-by: qizixi <[email protected]> Signed-off-by: zixi-qi <[email protected]>

zixi-qi force-pushed the suffix-decoding branch from 911a5a3 to a8586c2 Compare September 29, 2025 22:42

zixi-qi marked this pull request as ready for review September 29, 2025 22:50

zixi-qi requested review from WoosukKwon, alexm-redhat, benchislett, comaniac, luccafong, mgoin, njhill, robertgshaw2-redhat, simon-mo, tlrmchlsmth, youkaichao and ywang96 as code owners September 29, 2025 22:50

zixi-qi requested review from LucasWilkinson, ProExpertProg, hmellor, houseroad and yewentao256 as code owners September 29, 2025 22:50

zixi-qi mentioned this pull request Sep 29, 2025

[RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference #18037

Open

1 task

zixi-qi closed this Sep 30, 2025
