[SpecDecode][Kernel] Use Flashinfer for Rejection Sampling in Speculative Decoding #7244
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these (for example, comment /ready on the PR). 🚀
cadedaniel left a comment:
LGTM. Two questions:
- Can we run the correctness test for both paths? Specifically the convergence test (all of the e2e tests depend on this for temp > 0); see the parametrization sketch below.
- Can we make sure there is no perf regression for the non-FlashInfer path?
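A minimal sketch of how the convergence test could be parametrized to cover both paths. This is only an illustration, not the PR's actual test code; the constructor argument follows the diffs below, and the test body is elided:

```python
import pytest

from vllm.model_executor.layers.rejection_sampler import RejectionSampler


@pytest.mark.parametrize("use_flashinfer", [False, True])
def test_rejection_sampling_approximates_target_distribution(
        use_flashinfer: bool):
    """Run the same convergence check against the Torch-native path and the
    FlashInfer-backed path, so temp > 0 behavior is validated for both."""
    sampler = RejectionSampler(use_flashinfer=use_flashinfer)
    # ... draw many samples through `sampler` and compare the empirical
    # output distribution with the target distribution, as the existing
    # convergence test already does for the default path.
```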
Note: there's a bugfix for a correctness issue in this sampling kernel (flashinfer-ai/flashinfer#425), so we may want to bump FlashInfer to the next release.
Yes, @LiuXiaoxuanPKU's numbers were measured with the FlashInfer main branch, where #425 was already merged. This PR depends on FlashInfer v0.1.4.
@cadedaniel Updates:
This PR is ready. CI tests might fail because we need FlashInfer to cut a release and then add the latest FlashInfer to CI. Tests passed locally.
comaniac left a comment:
Overall LGTM. Just nits.
    rejection_sampler = RejectionSampler(
        disable_bonus_tokens=disable_bonus_tokens)
    rejection_sampler = RejectionSampler(disable_bonus_tokens=False,
I feel you can leave this test untouched, and just rename the following test to "test_flashinfer_backed".
                         strict_mode=strict_mode)
        self.use_flashinfer = use_flashinfer
        if self.use_flashinfer:
            assert not disable_bonus_tokens, \
Could this just be a warning?
Ideally, when disable_bonus_tokens is set, the bonus token should be -1.
However, if we use FlashInfer and set disable_bonus_tokens, the bonus token will still have values (!= -1), which makes the results incorrect. I guess it might be better to just fail here?
We can remove the disable_bonus_tokens path completely now that #4212 is fixed.
But if that's too much work, let's just leave it as an assert; that way "no failure" means the user gets the experience we planned for them, instead of missing a warning and getting subpar performance.
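For illustration, a self-contained sketch of the guard being discussed (a toy stand-in, not vLLM's actual class; the constructor defaults and the assert message are assumptions based on the diff above):

```python
class RejectionSamplerSketch:
    """Toy stand-in for RejectionSampler, showing only the guard at issue."""

    def __init__(self,
                 disable_bonus_tokens: bool = False,
                 strict_mode: bool = False,
                 use_flashinfer: bool = False):
        self.disable_bonus_tokens = disable_bonus_tokens
        self.strict_mode = strict_mode
        self.use_flashinfer = use_flashinfer
        if self.use_flashinfer:
            # The FlashInfer kernel always emits a bonus token, so silently
            # honoring disable_bonus_tokens would return incorrect results
            # (bonus slots would hold real token ids rather than -1).
            # Failing fast was preferred over the softer alternative of a
            # warnings.warn(...) call.
            assert not disable_bonus_tokens, (
                "disable_bonus_tokens is not supported with the FlashInfer "
                "rejection sampling backend")
```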
Will review today.
FYI: flashinfer v0.1.6 wheels are ready: https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.1.6
/ready
                     batch_size: int, device: str):

    def get_seeded_seqs():
        seeded_mask = torch.rand(batch_size, dtype=torch.float32) <= 1.0
I think this needs to go outside the helper function; otherwise the rand result will be different each time the helper is called.
I realize there's an error here -- it should be torch.rand(...) <= 0.5
I think I will just remove the rand; we should fix the generator for each request in the batch, instead of seeding requests with 50% probability.
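A minimal sketch of the direction described above: drop the random mask and give every request in the batch a fixed-seed generator. The name mirrors the diff, while the seed values and return type are illustrative assumptions:

```python
import torch


def get_seeded_seqs(batch_size: int,
                    device: str) -> dict[int, torch.Generator]:
    """Seed every request in the batch deterministically, instead of
    seeding a random ~50% subset via torch.rand(...) <= 0.5."""
    return {
        i: torch.Generator(device=device).manual_seed(i + 1)
        for i in range(batch_size)
    }
```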
    # num_emitted_tokens returned by flashinfer
    # does not include the bonus token
    # Flashinfer stops at the first token that violates
Why not just align flashinfer's behavior with this API's?
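As an illustration of the adjustment the comment above describes (not the PR's actual code), here is a sketch that adds the bonus token back into FlashInfer's per-sequence counts, assuming a [batch, k + 1] output layout where rejected slots hold -1 and the bonus slot is filled only when every draft token is accepted:

```python
import torch


def align_emitted_token_counts(
        flashinfer_num_emitted: torch.Tensor,  # [batch], excludes bonus token
        output_token_ids: torch.Tensor,        # [batch, k + 1], -1 = rejected
) -> torch.Tensor:
    """Add one to each sequence whose bonus slot was actually emitted, so the
    counts match an API whose num_emitted_tokens includes the bonus token."""
    bonus_emitted = (output_token_ids[:, -1] != -1).to(flashinfer_num_emitted)
    return flashinfer_num_emitted + bonus_emitted
```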
End-to-end speculative decoding performance (request latency). Draft: Llama-160M, Target: Vicuna-7B, batch size = 8, input_len = 256, output_len = 512.
Before this PR:
After this PR: