
Conversation

@LiuXiaoxuanPKU (Collaborator)

End-to-end speculative decoding performance (request latency):
Draft: Llama-160M, Target: Vicuna-7B, batch size = 8, input_len = 256, output_len = 512

Before this PR:

Avg latency: 5.9652480507269505 seconds
10% percentile latency: 5.729408229794354 seconds
25% percentile latency: 5.794497653492726 seconds
50% percentile latency: 5.954964595614001 seconds
75% percentile latency: 6.124162045423873 seconds
90% percentile latency: 6.1757167306263 seconds
99% percentile latency: 6.3235187567165125 seconds

After this PR:

Avg latency: 5.717350374627858 seconds
10% percentile latency: 5.423373702447861 seconds
25% percentile latency: 5.50113928236533 seconds
50% percentile latency: 5.768905333476141 seconds
75% percentile latency: 5.950261808000505 seconds
90% percentile latency: 5.9829305820167065 seconds
99% percentile latency: 5.992807335173711 seconds
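
For context, a minimal sketch of how latency numbers like these could be gathered with vLLM's offline API. The model IDs, draft length k, and prompt contents below are assumptions, not the exact benchmark setup used for the measurements above.

```python
# Hedged sketch only: model IDs, num_speculative_tokens, and prompts are
# assumptions, not the exact benchmark configuration reported above.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="lmsys/vicuna-7b-v1.5",             # target model (assumed HF id)
    speculative_model="JackFram/llama-160m",  # draft model (assumed HF id)
    num_speculative_tokens=5,                 # k is not stated above; assumed
)

prompts = ["Tell me a story about a robot."] * 8          # batch size = 8
params = SamplingParams(max_tokens=512, ignore_eos=True)  # fixed output length

start = time.perf_counter()
llm.generate(prompts, params)
print(f"Batch latency: {time.perf_counter() - start:.3f} s")
```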

github-actions bot commented Aug 7, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@cadedaniel (Collaborator) left a comment


LGTM. Two questions:

  • Can we run the correctness test for both paths? Specifically the convergence test (all of the e2e tests depend on this for temp > 0).
  • Can we make sure there is no perf regression for the non-FlashInfer path?

@comaniac (Collaborator) commented Aug 7, 2024

Note: there's a bugfix for a correctness issue in this sampling kernel (flashinfer-ai/flashinfer#425), so we may want to bump FlashInfer to the next release.

@yzh119 commented Aug 7, 2024

> Note: there's a bugfix for a correctness issue in this sampling kernel (flashinfer-ai/flashinfer#425), so we may want to bump FlashInfer to the next release.

Yes, @LiuXiaoxuanPKU's numbers were measured with the flashinfer main branch, where #425 was already merged. This PR depends on flashinfer v0.1.4.

@LiuXiaoxuanPKU (Collaborator Author)

@cadedaniel
The PR is ready for a second review. It should pass the speculative decoding tests, but I do have some questions and concerns.

  1. Currently, we update the metrics in _create_output; however, with the rejection sampling kernel we do not go through that code path. Therefore, we update the metrics outside it (self.num_accepted_tokens += batch_size * k); I'm not sure if that's good practice. See the sketch after this list.
  2. The rejection sampler does not return vLLM speculative decoding's definition of accepted tokens, so the metric is not meaningful for now. I updated it that way just to pass CI.
  3. In the current CI, we install flashinfer by default, so all speculative decoding tests (including rejection sampling) use the kernel. Do we want to keep the test for the old code path? Would that be a bit expensive for CI?
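
A rough, hypothetical sketch of the bookkeeping concern in item 1: with the fused kernel, the per-token path through _create_output is skipped, so counters have to be bumped from the sampler itself. The class and method names below are illustrative, not vLLM's actual code.

```python
# Hypothetical sketch of metric bookkeeping outside _create_output.
# Attribute and method names are illustrative, not vLLM's actual code.
import torch


class SpecDecodeCounters:
    def __init__(self) -> None:
        self.num_accepted_tokens = 0
        self.num_draft_tokens = 0

    def update_from_fused_kernel(self, accepted_mask: torch.Tensor, k: int) -> None:
        # accepted_mask: [batch_size, k] booleans reported by the fused kernel.
        batch_size = accepted_mask.shape[0]
        self.num_draft_tokens += batch_size * k
        self.num_accepted_tokens += int(accepted_mask.sum())


counters = SpecDecodeCounters()
counters.update_from_fused_kernel(torch.tensor([[True, True, False]] * 8), k=3)
print(counters.num_accepted_tokens, counters.num_draft_tokens)  # 16 24
```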

@LiuXiaoxuanPKU (Collaborator Author) commented Aug 18, 2024

@cadedaniel Updates:

  1. Modified the flashinfer kernel to return the metric required by vLLM's spec decode metrics. PR: add accept num, emit num metric for ChainSpeculativeSampling flashinfer-ai/flashinfer#450
  2. Changed the rejection sampler tests to mainly test the FlashInfer backend.
  3. Added tests to compare the FlashInfer and non-FlashInfer backend results; the sampling rule both backends implement is sketched below.

This PR is ready. CI tests might fail because we need a flashinfer release and to add the latest flashinfer to CI. Tests pass locally.
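
For reference, a pure-PyTorch sketch of one step of the rejection rule that both backends are expected to implement (standard speculative-sampling accept/recover logic, not this PR's kernel or test code). CPU tensors are assumed.

```python
# Pure-PyTorch sketch of a single rejection-sampling step; not the PR's code.
import torch


def rejection_sample_step(draft_probs, target_probs, draft_tokens, generator=None):
    # draft_probs/target_probs: [batch, vocab]; draft_tokens: [batch]
    p = target_probs.gather(1, draft_tokens.unsqueeze(1)).squeeze(1)
    q = draft_probs.gather(1, draft_tokens.unsqueeze(1)).squeeze(1)
    u = torch.rand(draft_tokens.shape[0], generator=generator)
    accepted = u <= torch.clamp(p / q, max=1.0)

    # On rejection, resample from the normalized residual max(0, p_target - p_draft).
    residual = torch.clamp(target_probs - draft_probs, min=0)
    residual = residual / torch.clamp(residual.sum(-1, keepdim=True), min=1e-9)
    recovered = torch.multinomial(residual, 1, generator=generator).squeeze(1)

    return torch.where(accepted, draft_tokens, recovered), accepted
```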

@comaniac (Collaborator) left a comment


Overall LGTM. Just nits.


    rejection_sampler = RejectionSampler(
        disable_bonus_tokens=disable_bonus_tokens)
    rejection_sampler = RejectionSampler(disable_bonus_tokens=False,
Collaborator:

I feel you can leave this test untouched and just rename the following test to "test_flashinfer_backed".

            strict_mode=strict_mode)
        self.use_flashinfer = use_flashinfer
        if self.use_flashinfer:
            assert not disable_bonus_tokens, \
Collaborator:

Could this just be a warning?

Collaborator Author:

Ideally, when disable_bonus_tokens is set, the bonus token should be -1.
However, if we use flashinfer and set disable_bonus_tokens, the bonus token will still have values (!= -1), which makes the results incorrect. I guess it might be better to just fail here?
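
A tiny illustration of the expectation described above, with made-up tensors rather than the sampler's real output: when bonus tokens are disabled, the bonus slot should come back as -1 so it is never emitted.

```python
# Hypothetical illustration only: when disable_bonus_tokens is set, the last
# (bonus) slot of each sequence's output should be -1 so it is never emitted.
import torch

output_token_ids = torch.tensor([[11, 12, 13, 99],   # last column = bonus slot
                                 [21, 22, 23, 98]])
disable_bonus_tokens = True
if disable_bonus_tokens:
    output_token_ids[:, -1] = -1
print(output_token_ids)  # tensor([[11, 12, 13, -1], [21, 22, 23, -1]])
```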

Collaborator:

We can remove the disable_bonus_tokens path completely now that #4212 is fixed.

But if it's too much work, let's just leave it as an assert; that way "no failure" means the user gets the experience we planned for them instead of missing a warning and getting subpar perf.

@cadedaniel (Collaborator)
Will review today.

@yzh119 commented Aug 28, 2024

FYI: flashinfer v0.1.6 wheels are ready: https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.1.6

@LiuXiaoxuanPKU (Collaborator Author)

/ready

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Aug 28, 2024
                         batch_size: int, device: str):

    def get_seeded_seqs():
        seeded_mask = torch.rand(batch_size, dtype=torch.float32) <= 1.0
Collaborator:

I think this needs to go outside the helper function; otherwise the rand will be different.

Collaborator:

I realize there's an error here -- it should be torch.rand(...) <= 0.5

Collaborator Author:

I think I will just remove the rand; we should fix the generator for each request in the batch instead of fixing it with 50% probability.
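
A minimal sketch of what fixing the generator per request could look like (assumed shapes and seeds; not the final test code):

```python
# Hedged sketch: seed one torch.Generator per request in the batch instead of
# randomly choosing which requests are seeded. Not the final test code.
import torch

batch_size = 8
seeded_seqs = {
    i: torch.Generator(device="cpu").manual_seed(1000 + i)
    for i in range(batch_size)
}
# Each request then draws from its own deterministic generator.
samples = {i: torch.rand(1, generator=g) for i, g in seeded_seqs.items()}
```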


# num_emitted_tokens returned by flashinfer
# does not include the bonus token
# Flashinfer stops at the first token that violates

Why not just align flashinfer's behavior with this API's?
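
If FlashInfer counts only accepted draft tokens while vLLM's num_emitted_tokens also counts the one bonus/recovery token emitted per sequence, aligning the two would be a per-sequence offset of one. The numbers below are made up purely to illustrate that definitional gap.

```python
# Made-up numbers, purely to illustrate the definitional gap discussed above:
# if FlashInfer reports accepted draft tokens only, vLLM's emitted-token count
# would also include the one bonus/recovery token emitted per sequence.
import torch

flashinfer_emitted = torch.tensor([3, 0, 5, 2])  # hypothetical per-sequence counts
vllm_emitted = flashinfer_emitted + 1            # + bonus/recovery token per sequence
print(vllm_emitted)                              # tensor([4, 1, 6, 3])
```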

@youkaichao youkaichao merged commit e6a26ed into vllm-project:main Sep 2, 2024
@LiuXiaoxuanPKU LiuXiaoxuanPKU deleted the flashinfer-rejection-sampler branch September 17, 2024 04:29