Description
Motivation.
Context parallelism introduces an additional degree of parallelism to LLM inference. While tensor parallelism and pipeline parallelism focus on distributing model weights and layers across devices, context parallelism targets the sequence (context) dimension, splitting the tokens of an input across devices. By combining context parallelism with these other forms of parallelism, systems can achieve more scalable and efficient inference, leveraging all of them to maximize hardware utilization and reduce latency.

Context parallelism improves performance as the context length grows by distributing both the computation and the KV cache across multiple GPUs. This lowers processing latency and can also reduce the memory required per GPU, especially for extremely large KV caches (e.g., sequence lengths on the order of 1 million tokens).
Proposed Change.
Within the model, attention is the only component with a dependency along the sequence dimension, since each token must attend to all previous tokens in the same sequence. In contrast, FFN and element-wise operations are performed independently for each token. For a more in-depth treatment of context parallelism in LLM inference, including partial attention, see the MLSys paper at https://arxiv.org/pdf/2411.01783.
To implement context parallelism in vLLM, the design needs to:
- Be aware of these dependencies to minimize synchronization overhead,
- Remain flexible to support various backends, and
- Avoid major changes to core components to ensure system stability.
Partition Sequence Dimension
Causal attention imposes a varying computational load across token positions, since later tokens attend to more preceding tokens, as shown in the following figure. To ensure an even workload distribution, tokens should be partitioned across the context parallelism (CP) ranks. Specifically, the sequence is divided into 2 × cp_world_size chunks, and each CP rank i is assigned both the i-th chunk and the (2 × cp_world_size - i - 1)-th chunk. Pairing an early (cheap) chunk with a late (expensive) chunk balances the compute load among all CP ranks.
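To make the assignment concrete, here is a minimal sketch of the balanced partitioning described above (illustrative only, not vLLM code): rank i owns chunk i and chunk 2 × cp_world_size - i - 1, so a cheap early chunk is always paired with an expensive late chunk.

```python
def assign_chunks(num_tokens: int, cp_world_size: int) -> list[list[int]]:
    """Return, for each CP rank, the token indices it owns."""
    num_chunks = 2 * cp_world_size
    chunk_size = (num_tokens + num_chunks - 1) // num_chunks
    # Split the token indices into 2 * cp_world_size contiguous chunks.
    chunks = [
        list(range(start, min(start + chunk_size, num_tokens)))
        for start in range(0, num_tokens, chunk_size)
    ]
    chunks += [[] for _ in range(num_chunks - len(chunks))]  # pad short sequences
    # Pair the i-th chunk (few preceding tokens, cheap) with the mirrored
    # (2 * cp_world_size - i - 1)-th chunk (many preceding tokens, expensive).
    return [chunks[i] + chunks[num_chunks - i - 1] for i in range(cp_world_size)]

# 16 tokens, 2 CP ranks: rank 0 gets chunks 0 and 3, rank 1 gets chunks 1 and 2.
print(assign_chunks(16, 2))
# [[0, 1, 2, 3, 12, 13, 14, 15], [4, 5, 6, 7, 8, 9, 10, 11]]
```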
Prefill
During the prefill phase, both the query (Q) and key-value (KV) tensors are sharded across GPUs. To ensure that each Q token can attend to all preceding KV tokens, the relevant Q or KV shards must be exchanged among GPUs. To reduce synchronization overhead, these data transfers are overlapped with partial attention computations, with the goal of fully hiding the transfer latency. The following figure shows an example of prefill with CP=2.
The choice between passing KV or Q shards depends on the relative sizes of the Q and KV tensors. For full prefill, passing KV shards is generally preferred, since the number of query heads per KV head exceeds two in most models. Conversely, for chunked prefill, passing Q shards may be more efficient when the KV cache length is significantly greater than the number of Q tokens.
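As a rough illustration of that trade-off, the sketch below (an assumed back-of-the-envelope heuristic, not vLLM's actual policy) compares the data volume moved when passing KV shards versus Q shards; the return traffic for partial outputs in the Q-passing case is ignored to keep the estimate aligned with the rule of thumb above.

```python
def prefer_kv_passing(
    num_q_tokens: int,
    num_kv_tokens: int,
    num_q_heads: int,
    num_kv_heads: int,
    head_dim: int,
) -> bool:
    """Return True if passing KV shards moves less data than passing Q shards."""
    # Passing KV: both the K and V tensors for the KV tokens must travel.
    kv_elems = 2 * num_kv_tokens * num_kv_heads * head_dim
    # Passing Q: only the Q tensor travels (partial-output return is ignored).
    q_elems = num_q_tokens * num_q_heads * head_dim
    return kv_elems < q_elems

# Full prefill (Q and KV lengths match): with a GQA ratio above 2 (e.g. 32 query
# heads over 8 KV heads), passing KV moves less data.
print(prefer_kv_passing(4096, 4096, 32, 8, 128))   # True  -> pass KV shards
# Chunked prefill against a long KV cache flips the decision toward passing Q.
print(prefer_kv_passing(512, 131072, 32, 8, 128))  # False -> pass Q shards
```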
Decode
During decoding, newly generated KV pairs are distributed across CP ranks in a round-robin fashion. This avoids duplicating KV entries across ranks and lets each rank compute a partial attention over its local shard. Once each CP rank completes its local partial attention, the partial results are all-gathered and merged, so every CP rank ends up with the same attention output it would have produced over the full KV cache. The figure below demonstrates the decoding process with CP=2.
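The merge step can be illustrated with a small self-contained sketch (assumed helper names, not vLLM's implementation): each CP rank computes attention over its local KV shard along with the per-query log-sum-exp (LSE) of the scores, and the partial outputs are then combined with LSE-derived weights so the result equals attention over the full KV.

```python
import torch

def partial_attention(q, k, v):
    """Attention of q over one local KV shard, plus the per-query LSE."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-1, -2)) * scale          # [num_q, num_kv]
    lse = torch.logsumexp(scores, dim=-1)               # [num_q]
    out = torch.softmax(scores, dim=-1) @ v             # [num_q, head_dim]
    return out, lse

def merge_partials(outs, lses):
    """Merge per-rank partial outputs using their LSEs (e.g. after all-gather)."""
    lses = torch.stack(lses)                            # [cp, num_q]
    outs = torch.stack(outs)                            # [cp, num_q, head_dim]
    weights = torch.softmax(lses, dim=0).unsqueeze(-1)  # [cp, num_q, 1]
    return (weights * outs).sum(dim=0)                  # [num_q, head_dim]

# Sanity check: splitting the KV into two shards and merging reproduces
# full attention over the whole KV.
torch.manual_seed(0)
q = torch.randn(4, 64)
k, v = torch.randn(10, 64), torch.randn(10, 64)
full_out, _ = partial_attention(q, k, v)
out0, lse0 = partial_attention(q, k[:5], v[:5])
out1, lse1 = partial_attention(q, k[5:], v[5:])
assert torch.allclose(full_out, merge_partials([out0, out1], [lse0, lse1]), atol=1e-5)
```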
Block Table
When tokens are distributed across context parallel (CP) ranks, gaps may appear in each rank's block table. After compaction, tokens stored physically next to each other may no longer be logically consecutive in the original sequence. This is still correct for CP, because we only need to preserve the relative order of tokens for mapping purposes, not their absolute positions in the block table. The figure below shows how KVs are stored in the CP case.
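For illustration only (a hypothetical helper, not vLLM's block-table code), the sketch below shows how round-robin distribution during decode leaves each rank with globally non-consecutive tokens packed into dense local slots, which is sufficient because only the relative order matters.

```python
def local_kv_layout(num_decode_tokens: int, cp_world_size: int, cp_rank: int):
    """Return (global_position, local_slot) pairs for one CP rank's KV cache."""
    # Decode tokens are handed out round-robin across CP ranks.
    owned = [t for t in range(num_decode_tokens) if t % cp_world_size == cp_rank]
    # After compaction the local slots are dense, but the global positions they
    # hold are not consecutive; only their relative order is preserved.
    return [(global_pos, local_slot) for local_slot, global_pos in enumerate(owned)]

# With cp_world_size=2: rank 0 stores global tokens 0, 2, 4, ... in local slots
# 0, 1, 2, ..., and rank 1 stores the odd global positions.
print(local_kv_layout(8, 2, 0))  # [(0, 0), (2, 1), (4, 2), (6, 3)]
print(local_kv_layout(8, 2, 1))  # [(1, 0), (3, 1), (5, 2), (7, 3)]
```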
PRs
Feedback Period.
No response
CC List.
@luccafong @houseroad @minosfuture
Any Other Things.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.