## Description
Prefill/decode disaggregation is a critical feature for large-scale LLM deployments to achieve good performance and meet SLAs. Currently, Ray Serve does not support it well.
This breaks down into a few core requirements:
- The ability to designate prefill and decode instances and handle KV cache transfer between them
- The ability to scale prefill and decode replicas independently (see the sketch after this list)
- Integration with NIXL for flexible, high-performance communication
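As a rough illustration of the second requirement, each `LLMConfig` could carry its own `deployment_config` with autoscaling settings, so prefill and decode replicas scale independently. This is a minimal sketch assuming the existing Ray Serve autoscaling options carry over unchanged, not a finalized API:

```python
# Sketch only: assumes LLMConfig's existing deployment_config / autoscaling
# options apply as-is to the disaggregated deployments.
from ray.serve.llm import LLMConfig

prefill_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    # Prefill is compute-bound, so it can scale on its own schedule...
    deployment_config={
        "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
    },
)
decode_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    # ...independently of decode, which batches many concurrent streams
    # and is typically memory-bandwidth-bound.
    deployment_config={
        "autoscaling_config": {"min_replicas": 2, "max_replicas": 16},
    },
)
```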
## Design Sketch
This design offers a Ray Serve solution built on top of vllm-project/vllm#17751 (design doc).

To support this in Ray Serve, we plan to create a proxy deployment that manages request routing between prefill and decode (passing requests with the correct metadata to each). We will also create separate deployments for prefill and decode using the existing LLMServer deployment. Everything else is handled internally via NIXL.
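In pseudocode, the proxy could look roughly like the following. This is a hypothetical sketch, not the final interface: the `LLMPDProxyServer` name comes from the API example below, while the `generate` method and the `kv_transfer_params` handshake are assumptions borrowed from the vLLM NIXL connector design.

```python
# Hypothetical sketch of the proposed proxy deployment; the method name and
# the exact KV-transfer handshake are illustrative, not the final API.
from ray import serve


@serve.deployment
class LLMPDProxyServer:
    def __init__(self, prefill_server, decode_server):
        self._prefill = prefill_server
        self._decode = decode_server

    async def generate(self, request: dict):
        # 1) Run only the prefill phase (a single output token is enough to
        #    force KV cache computation), tagging the request so the engine
        #    stages the KV cache for a remote consumer.
        prefill_response = await self._prefill.generate.remote(
            {
                **request,
                "max_tokens": 1,
                "kv_transfer_params": {"do_remote_decode": True},
            }
        )
        # 2) Forward to decode with the transfer metadata returned by
        #    prefill, so NIXL can pull the KV cache directly between GPUs.
        return await self._decode.generate.remote(
            {
                **request,
                "kv_transfer_params": prefill_response["kv_transfer_params"],
            }
        )
```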
*Architecture diagram*
A prototype is up in #53092, which we hope to merge soon.
## API
We want the end-to-end experience to look something like this:
```python
from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter

# Prefill and decode get their own configs, so each phase can choose its
# own parallelism strategy and replica count.
prefill_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    engine_kwargs=dict(
        data_parallel_size=4,
        tensor_parallel_size=1,
        enable_expert_parallel=True,
    ),
)
decode_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    engine_kwargs=dict(
        data_parallel_size=16,
        tensor_parallel_size=1,
        enable_expert_parallel=True,
    ),
)

prefill_server = LLMServer.bind(prefill_config)
decode_server = LLMServer.bind(decode_config)

# LLMPDProxyServer is the new deployment proposed here: it routes each
# request through prefill, then decode, coordinating the KV transfer.
pd_proxy = LLMPDProxyServer.bind(prefill_server, decode_server)

app = LLMRouter.bind([pd_proxy])
serve.run(app)
```
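Since `LLMRouter` exposes an OpenAI-compatible HTTP API, the disaggregation should be transparent to clients. A query might look like this (the endpoint path is the Serve default and the model id is taken from the config above; both are assumptions for illustration):

```python
from openai import OpenAI

# Serve listens on localhost:8000 by default; any API key works unless one
# is configured on the deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```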