[RFC] Upstream Chunked Prefill #3130

@rkooo567

Description

Progress

Future Extensions (not in the scope of this RFC)

  • Make it work with sliding window attention
  • Use the same mechanism for prefix caching
  • Use a better kernel than context attention forward (ideally FlashInfer)

Chunked prefill splits prefill requests into multiple chunks and batches them with decode requests. Since prefill requests are compute-bound and decode requests are memory-bound, overlapping the two can greatly improve system efficiency. See the linked papers for more details.
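To illustrate with a minimal sketch (the token budget and request sizes below are made-up numbers, not vLLM defaults or actual scheduler code): each step schedules decode requests first and fills the remaining token budget with a chunk of the pending prompt.

    # Minimal sketch of the chunking arithmetic (hypothetical numbers).
    TOKEN_BUDGET = 512   # max tokens the model processes per step

    prompt_len = 2000    # one long prefill request
    num_decodes = 100    # pending decode requests (1 query token each)

    prefilled, step = 0, 0
    while prefilled < prompt_len:
        step += 1
        decode_tokens = min(num_decodes, TOKEN_BUDGET)  # decodes go first
        chunk = min(prompt_len - prefilled, TOKEN_BUDGET - decode_tokens)
        prefilled += chunk
        print(f"step {step}: {decode_tokens} decode + {chunk} prefill tokens")

With these numbers, the 2000-token prompt is prefilled over five steps, each step also serving 100 decode tokens.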

We are planning to upstream the chunked prefill implementation used internally at Anyscale to the OSS vLLM repo. This feature is already enabled on the Anyscale production endpoint.

Benchmark Result

See #3130 (comment)

The following diagram shows benchmark results for Llama 13B on 2x A100 at different QPS levels (measured on Anyscale's forked vLLM). Chunked prefill greatly improves latency when QPS is high, while remaining competitive at low QPS. Detailed benchmark results for different parameters with OSS vLLM are in progress (ETA end of this week).

[Figure: latency vs. QPS for Llama 13B x 2 A100, chunked prefill vs. baseline]

Design

Kernel

For chunked prefill, we internally use flash-attn with paged attention enabled, so we will also upstream the flash-attn integration first. For the initial OSS version, we decided to go with the existing context attention kernel and migrate later. Note that the kernel choice is not finalized; we are actively investigating other kernels such as FlashInfer.
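For illustration, here is a minimal sketch of how a mixed prefill/decode batch maps onto a variable-length attention call, using flash-attn's flash_attn_varlen_func. The batch composition and shapes are assumptions for exposition, not the final upstreamed design; in the real system, keys/values for decode requests come from the paged KV cache rather than the flattened batch.

    import torch
    from flash_attn import flash_attn_varlen_func  # flash-attn 2.x

    # Hypothetical mixed batch, flattened to 1D: a 6-token prefill chunk,
    # a 4-token prefill chunk, and two decode requests (1 query token each).
    seq_lens = [6, 4, 1, 1]
    total_tokens = sum(seq_lens)
    num_heads, head_dim = 8, 64

    q = torch.randn(total_tokens, num_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # cu_seqlens marks the sequence boundaries in the flattened layout:
    # |<-- prefill_0 -->|<- prefill_1 ->|gen_0|gen_1|
    cu_seqlens = torch.tensor([0, 6, 10, 11, 12],
                              device="cuda", dtype=torch.int32)

    # For simplicity this attends only over the tokens in the batch; with a
    # paged KV cache, cu_seqlens_k would reflect the full context lengths.
    out = flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max(seq_lens), max_seqlen_k=max(seq_lens),
        causal=True,
    )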

Scheduler & Batch Layout

  • Since chunked prefill mixes prefill and decode tokens in one batch, a 2D query tensor with a batch dimension is inefficient: the sequences have very different lengths, so each would have to be padded to the longest one. To get around this, we will move the 2D query back to a 1D (flattened) query.
  • We change Sequence so that it keeps track of how many of its prompt tokens have been prefilled so far.
  • In the Scheduler, if chunked prefill is enabled, decode requests are prioritized first, and prefill requests are picked afterwards. If a prefill request is too long, it is chunked.
  • The worker now creates metadata for both decode and prefill requests, including their block tables, context lengths, and the numbers of decode and prefill tokens and requests (a sketch follows this list).
    Chunked prefill support:
    If chunked prefill is enabled, the input includes both prompt tokens
    and generation tokens. The layout is as follows:
    |<---------------------------- num_valid_tokens ---------------------------->|
    |<---------- num_prompt_tokens ---------->|<----- num_generation_tokens ---->|
    |<-prompt_0->|<-prompt_1->|...............|<-gen_0->|<-gen_1->|..............|
    Note that num_valid_tokens, num_prompt_tokens, and num_generation_tokens
    all include padding.
    The actual prompt lengths and offsets are stored in cum_prompt_context_lens.
    The actual number of generation tokens is stored in num_generation_tokens_tensor.
    To support chunked prefill, where the prompt and context may have different
    lengths, we also store the context's length in cum_prompt_context_lens.
  • Once the batch is processed and sampled, sampled results belonging to chunked prefill requests (prompts that are not yet fully prefilled) will be ignored.
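Below is a minimal sketch of how a worker could assemble the flattened 1D batch and the per-batch metadata described above. All class and field names here are illustrative placeholders, not vLLM's actual data structures.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class PrefillRequest:
        token_ids: List[int]                      # full prompt
        num_prefilled: int = 0                    # prompt tokens already in KV cache
        block_table: Optional[List[int]] = None

        def next_chunk(self, chunk_size: int) -> List[int]:
            # Next not-yet-prefilled slice of the prompt (the scheduler
            # would advance num_prefilled after the step completes).
            return self.token_ids[self.num_prefilled:self.num_prefilled + chunk_size]

    @dataclass
    class DecodeRequest:
        last_token_id: int
        seq_len: int                              # current context length
        block_table: Optional[List[int]] = None

    @dataclass
    class BatchMetadata:
        num_prompt_tokens: int                    # total prefill tokens in batch
        num_generation_tokens: int                # total decode tokens (1 per req)
        prompt_lens: List[int]                    # per-request chunk length
        context_lens: List[int]                   # tokens already in KV cache
        block_tables: List[List[int]]

    def build_batch(prefills, decodes, chunk_size=512):
        """Flatten prefill chunks, then decode tokens, into one 1D token list."""
        input_tokens, prompt_lens, context_lens, block_tables = [], [], [], []
        for req in prefills:
            chunk = req.next_chunk(chunk_size)
            input_tokens.extend(chunk)
            prompt_lens.append(len(chunk))
            context_lens.append(req.num_prefilled)
            block_tables.append(req.block_table or [])
        num_prompt_tokens = len(input_tokens)
        for req in decodes:
            input_tokens.append(req.last_token_id)  # one query token per decode
            context_lens.append(req.seq_len)
            block_tables.append(req.block_table or [])
        meta = BatchMetadata(num_prompt_tokens, len(decodes),
                             prompt_lens, context_lens, block_tables)
        return input_tokens, meta

    # Example: one 1000-token prompt being chunked, plus one decode request.
    tokens, meta = build_batch(
        [PrefillRequest(token_ids=list(range(1000)))],
        [DecodeRequest(last_token_id=7, seq_len=30)],
        chunk_size=512,
    )
    assert meta.num_prompt_tokens == 512 and meta.num_generation_tokens == 1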

Cuda Graph

CUDA graph won't be supported for chunked prefill, because the performance benefit of CUDA graph only appears with a small number of tokens.

For reference, the following is how we make everything CUDA-graph-capturable internally. We pad both prompt tokens and generation tokens, and the CUDA graph is cached keyed on the number of prompt tokens, the number of generation tokens, and the prompt batch size. The details are below.

        # The key for caching the CUDA graph.
        # We ensure that with the same total number of tokens,
        # the same number of prompts, and the same number of prompt
        # tokens, the intermediate tensors have the same shapes and
        # the graph can therefore be replayed.
        #
        # Refer to the following layout for the shapes of intermediate
        # tensors during attention calculation.
        #
        # |<------------------------ padded_num_tokens ----------------------->|
        # |<------- num_prompt_tokens ------->|<---- num_generation_tokens --->|
        # |<-prompt_0->|<-prompt_1->|...|<pad>|<-gen_0->|<-gen_1->|......|<pad>|
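For concreteness, here is a minimal sketch of graph caching keyed as in the comment above. This is illustrative only, not vLLM's actual implementation; the function names are hypothetical.

    import torch

    _graph_cache = {}

    def graph_cache_key(padded_num_prompt_tokens: int,
                        padded_num_generation_tokens: int,
                        num_prompts: int) -> tuple:
        # With the same padded token counts and prompt batch size, every
        # intermediate tensor has the same shape, so a captured graph
        # can be replayed safely.
        return (padded_num_prompt_tokens,
                padded_num_generation_tokens,
                num_prompts)

    def run_with_graph(model, static_inputs, key):
        # Capture once per key, then replay. A real implementation keeps
        # static input buffers and copies fresh data into them before
        # each replay (and warms up before capture).
        if key not in _graph_cache:
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                static_out = model(static_inputs)
            _graph_cache[key] = (graph, static_out)
        graph, static_out = _graph_cache[key]
        graph.replay()
        return static_out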
