[RFC] Upstream Chunked Prefill #3130

@rkooo567

Description

Progress

Future Extensions (not in the scope of this RFC)

  • Make it work with sliding window attention
  • Use the same mechanism for prefix caching
  • Use a better kernel than context attention forward (ideally FlashInfer)

Chunked prefill splits prefill requests into multiple chunks and batches them with decode requests. Since prefill requests are compute-bound and decode requests are memory-bound, overlapping the two can greatly improve system efficiency. See the linked papers for more details.
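To illustrate with a minimal sketch (the token budget and request sizes below are made-up numbers, not vLLM defaults or actual scheduler code): each step schedules decode requests first and fills the remaining token budget with a chunk of the pending prompt.

    # Minimal sketch of the chunking arithmetic (hypothetical numbers).
    TOKEN_BUDGET = 512   # max tokens the model processes per step

    prompt_len = 2000    # one long prefill request
    num_decodes = 100    # pending decode requests (1 query token each)

    prefilled, step = 0, 0
    while prefilled < prompt_len:
        step += 1
        decode_tokens = min(num_decodes, TOKEN_BUDGET)  # decodes go first
        chunk = min(prompt_len - prefilled, TOKEN_BUDGET - decode_tokens)
        prefilled += chunk
        print(f"step {step}: {decode_tokens} decode + {chunk} prefill tokens")

With these numbers, the 2000-token prompt is prefilled over five steps, each step also serving 100 decode tokens.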

We are planning to upstream the chunked prefill implementation used internally at Anyscale to the OSS vLLM repo. This feature is already enabled on the Anyscale production endpoint.

Benchmark Result

See #3130 (comment)

The following diagram shows benchmark results for Llama 13B on 2x A100 at different QPS levels (measured on Anyscale's forked vLLM). Chunked prefill greatly improves latency when QPS is high, while remaining competitive at low QPS. Detailed benchmark results for different parameters with OSS vLLM are in progress (ETA end of this week).

[Figure: latency vs. QPS for Llama 13B x 2 A100, chunked prefill vs. baseline]

Design

Kernel

For chunked prefill, we internally use flash-attn with paged attention enabled, so we will also upstream the flash-attn integration first. For the initial OSS version, we decided to go with the existing context attention kernel and migrate later. Note that the kernel choice is not finalized; we are actively investigating other kernels such as FlashInfer.
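For illustration, here is a minimal sketch of how a mixed prefill/decode batch maps onto a variable-length attention call, using flash-attn's flash_attn_varlen_func. The batch composition and shapes are assumptions for exposition, not the final upstreamed design; in the real system, keys/values for decode requests come from the paged KV cache rather than the flattened batch.

    import torch
    from flash_attn import flash_attn_varlen_func  # flash-attn 2.x

    # Hypothetical mixed batch, flattened to 1D: a 6-token prefill chunk,
    # a 4-token prefill chunk, and two decode requests (1 query token each).
    seq_lens = [6, 4, 1, 1]
    total_tokens = sum(seq_lens)
    num_heads, head_dim = 8, 64

    q = torch.randn(total_tokens, num_heads, head_dim,
                    device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # cu_seqlens marks the sequence boundaries in the flattened layout:
    # |<-- prefill_0 -->|<- prefill_1 ->|gen_0|gen_1|
    cu_seqlens = torch.tensor([0, 6, 10, 11, 12],
                              device="cuda", dtype=torch.int32)

    # For simplicity this attends only over the tokens in the batch; with a
    # paged KV cache, cu_seqlens_k would reflect the full context lengths.
    out = flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max(seq_lens), max_seqlen_k=max(seq_lens),
        causal=True,
    )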

Scheduler & Batch Layout

  • Since chunked prefill mixes prefill and decode tokens in one batch, a 2D query tensor with a batch dimension is inefficient: the sequences have very different lengths, so each would have to be padded to the longest one. To get around this, we will move the 2D query back to a 1D (flattened) query.
  • We change Sequence so that it keeps track of how many of its prompt tokens have been prefilled so far.
  • In the Scheduler, if chunked prefill is enabled, decode requests are prioritized first, and prefill requests are picked afterwards. If a prefill request is too long, it is chunked.
  • The worker now creates metadata for both decode and prefill requests, including their block tables, context lengths, and the numbers of decode and prefill tokens and requests (a sketch follows this list).
    Chunked prefill support:
    If chunked prefill is enabled, the input includes both prompt tokens
    and generation tokens. The layout is as follows:
    |<---------------------------- num_valid_tokens ---------------------------->|
    |<---------- num_prompt_tokens ---------->|<----- num_generation_tokens ---->|
    |<-prompt_0->|<-prompt_1->|...............|<-gen_0->|<-gen_1->|..............|
    Note that num_valid_tokens, num_prompt_tokens, and num_generation_tokens
    all include padding.
    The actual prompt lengths and offsets are stored in cum_prompt_context_lens.
    The actual number of generation tokens is stored in num_generation_tokens_tensor.
    To support chunked prefill, where the prompt and context may have different
    lengths, we also store the context's length in cum_prompt_context_lens.
  • Once the batch is processed and sampled, sampled results belonging to chunked prefill requests (prompts that are not yet fully prefilled) will be ignored.
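Below is a minimal sketch of how a worker could assemble the flattened 1D batch and the per-batch metadata described above. All class and field names here are illustrative placeholders, not vLLM's actual data structures.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class PrefillRequest:
        token_ids: List[int]                      # full prompt
        num_prefilled: int = 0                    # prompt tokens already in KV cache
        block_table: Optional[List[int]] = None

        def next_chunk(self, chunk_size: int) -> List[int]:
            # Next not-yet-prefilled slice of the prompt (the scheduler
            # would advance num_prefilled after the step completes).
            return self.token_ids[self.num_prefilled:self.num_prefilled + chunk_size]

    @dataclass
    class DecodeRequest:
        last_token_id: int
        seq_len: int                              # current context length
        block_table: Optional[List[int]] = None

    @dataclass
    class BatchMetadata:
        num_prompt_tokens: int                    # total prefill tokens in batch
        num_generation_tokens: int                # total decode tokens (1 per req)
        prompt_lens: List[int]                    # per-request chunk length
        context_lens: List[int]                   # tokens already in KV cache
        block_tables: List[List[int]]

    def build_batch(prefills, decodes, chunk_size=512):
        """Flatten prefill chunks, then decode tokens, into one 1D token list."""
        input_tokens, prompt_lens, context_lens, block_tables = [], [], [], []
        for req in prefills:
            chunk = req.next_chunk(chunk_size)
            input_tokens.extend(chunk)
            prompt_lens.append(len(chunk))
            context_lens.append(req.num_prefilled)
            block_tables.append(req.block_table or [])
        num_prompt_tokens = len(input_tokens)
        for req in decodes:
            input_tokens.append(req.last_token_id)  # one query token per decode
            context_lens.append(req.seq_len)
            block_tables.append(req.block_table or [])
        meta = BatchMetadata(num_prompt_tokens, len(decodes),
                             prompt_lens, context_lens, block_tables)
        return input_tokens, meta

    # Example: one 1000-token prompt being chunked, plus one decode request.
    tokens, meta = build_batch(
        [PrefillRequest(token_ids=list(range(1000)))],
        [DecodeRequest(last_token_id=7, seq_len=30)],
        chunk_size=512,
    )
    assert meta.num_prompt_tokens == 512 and meta.num_generation_tokens == 1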

Cuda Graph

CUDA graph won't be supported for chunked prefill, because the performance benefit of CUDA graph only appears with a small number of tokens.

For reference, the following is how we make everything CUDA-graph-capturable internally. We pad both prompt tokens and generation tokens, and the CUDA graph is cached keyed on the number of prompt tokens, the number of generation tokens, and the prompt batch size. The details are below.

        # The key for caching the CUDA graph.
        # We ensure that with the same total number of tokens,
        # the same number of prompts, and the same number of prompt
        # tokens, the intermediate tensors have the same shapes and
        # the graph can therefore be replayed.
        #
        # Refer to the following layout for the shapes of intermediate
        # tensors during attention calculation.
        #
        # |<------------------------ padded_num_tokens ----------------------->|
        # |<------- num_prompt_tokens ------->|<---- num_generation_tokens --->|
        # |<-prompt_0->|<-prompt_1->|...|<pad>|<-gen_0->|<-gen_1->|......|<pad>|
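For concreteness, here is a minimal sketch of graph caching keyed as in the comment above. This is illustrative only, not vLLM's actual implementation; the function names are hypothetical.

    import torch

    _graph_cache = {}

    def graph_cache_key(padded_num_prompt_tokens: int,
                        padded_num_generation_tokens: int,
                        num_prompts: int) -> tuple:
        # With the same padded token counts and prompt batch size, every
        # intermediate tensor has the same shape, so a captured graph
        # can be replayed safely.
        return (padded_num_prompt_tokens,
                padded_num_generation_tokens,
                num_prompts)

    def run_with_graph(model, static_inputs, key):
        # Capture once per key, then replay. A real implementation keeps
        # static input buffers and copies fresh data into them before
        # each replay (and warms up before capture).
        if key not in _graph_cache:
            graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(graph):
                static_out = model(static_inputs)
            _graph_cache[key] = (graph, static_out)
        graph, static_out = _graph_cache[key]
        graph.replay()
        return static_out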
