## Description
Prefill/decode disaggregation is a critical feature for large-scale LLM deployments to achieve good performance and meet SLAs. Currently, Ray Serve does not support it well.
This breaks down into a few core requirements:
- The ability to designate prefill and decode instances and handle KV cache transfer between them
- The ability to scale prefill and decode replicas independently (see the sketch after this list)
- Integration with NIXL for flexible, high-performance communication
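As a rough illustration of the second requirement, each `LLMConfig` could carry its own `deployment_config` with autoscaling settings, so prefill and decode replicas scale independently. This is a minimal sketch assuming the existing Ray Serve autoscaling options carry over unchanged, not a finalized API:

```python
# Sketch only: assumes LLMConfig's existing deployment_config / autoscaling
# options apply as-is to the disaggregated deployments.
from ray.serve.llm import LLMConfig

prefill_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    # Prefill is compute-bound, so it can scale on its own schedule...
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
    },
)
decode_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    # ...independently of decode, which batches many concurrent streams
    # and is typically memory-bandwidth-bound.
    deployment_config={
        "autoscaling_config": {"min_replicas": 2, "max_replicas": 16},
    },
)
```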
## Design Sketch
This design offers a Ray Serve solution built on top of vllm-project/vllm#17751 (design doc).

To support this in Ray Serve, we plan to create a proxy deployment that manages request routing between prefill and decode (passing requests with the correct metadata to each). We will also create separate deployments for prefill and decode using the existing LLMServer deployment. Everything else is handled internally via NIXL.
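In pseudocode, the proxy could look roughly like the following. This is a hypothetical sketch, not the final interface: the `LLMPDProxyServer` name comes from the API example below, while the `generate` method and the `kv_transfer_params` handshake are assumptions borrowed from the vLLM NIXL connector design.

```python
# Hypothetical sketch of the proposed proxy deployment; the method name and
# the exact KV-transfer handshake are illustrative, not the final API.
from ray import serve


@serve.deployment
class LLMPDProxyServer:
    def __init__(self, prefill_server, decode_server):
        self._prefill = prefill_server
        self._decode = decode_server

    async def generate(self, request: dict):
        # 1) Run only the prefill phase (a single output token is enough to
        #    force KV cache computation), tagging the request so the engine
        #    stages the KV cache for a remote consumer.
        prefill_response = await self._prefill.generate.remote(
            {
                **request,
                "max_tokens": 1,
                "kv_transfer_params": {"do_remote_decode": True},
            }
        )
        # 2) Forward to decode with the transfer metadata returned by
        #    prefill, so NIXL can pull the KV cache directly between GPUs.
        return await self._decode.generate.remote(
            {
                **request,
                "kv_transfer_params": prefill_response["kv_transfer_params"],
            }
        )
```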
*Architecture diagram*
A prototype is up in #53092, which we hope to merge soon.
## API
We want the end-to-end experience to look something like this:
```python
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

# Prefill and decode get their own configs, so each phase can choose its
# own parallelism strategy and replica count.
prefill_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    engine_kwargs=dict(
        data_parallel_size=4,
        tensor_parallel_size=1,
        enable_expert_parallel=True,
    ),
)
decode_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    engine_kwargs=dict(
        data_parallel_size=16,
        tensor_parallel_size=1,
        enable_expert_parallel=True,
    ),
)

prefill_server = LLMServer.bind(prefill_config)
decode_server = LLMServer.bind(decode_config)

# LLMPDProxyServer is the new deployment proposed here: it routes each
# request through prefill, then decode, coordinating the KV transfer.
pd_proxy = LLMPDProxyServer.bind(prefill_server, decode_server)

app = LLMRouter.bind([pd_proxy])
serve.run(app)
```
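Since `LLMRouter` exposes an OpenAI-compatible HTTP API, the disaggregation should be transparent to clients. A query might look like this (the endpoint path is the Serve default and the model id is taken from the config above; both are assumptions for illustration):

```python
from openai import OpenAI

# Serve listens on localhost:8000 by default; any API key works unless one
# is configured on the deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```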