[RFC][llm] Prefill/Decode disaggregation with Ray Serve #53257

@kouroshHakha

Description

Prefill/Decode disaggregation is a critical feature for large-scale LLM deployments to meet performance targets and SLAs. Currently, it is not well supported in Ray Serve.

This breaks down into a few core requirements:

  • Ability to specify separate prefill and decode deployments and handle KV cache transfer between them
  • Ability to scale prefill and decode replicas independently (see the scaling sketch after this list)
  • Integration with NIXL for flexible, high-performance communication
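As a rough illustration of the independent-scaling requirement, here is a minimal sketch assuming the existing LLMConfig deployment_config field is used to pass standard Ray Serve autoscaling options to each deployment; the replica counts are illustrative, not recommendations:

    from ray.serve.llm import LLMConfig

    # Prefill and decode each get their own autoscaling policy because they
    # are separate Serve deployments.
    prefill_config = LLMConfig(
        model_loading_config={"model_id": "deepseek-r1"},
        deployment_config={
            "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
        },
    )

    decode_config = LLMConfig(
        model_loading_config={"model_id": "deepseek-r1"},
        deployment_config={
            "autoscaling_config": {"min_replicas": 2, "max_replicas": 16},
        },
    )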

Design Sketch

This design offers a Ray Serve solution built on top of vllm-project/vllm#17751 (design doc).

To support this in Ray Serve, we plan to create a proxy deployment that manages request routing between prefill and decode (passing requests with the correct metadata to each). We will also create separate deployments for prefill and decode using the existing LLMServer deployment. Everything else is handled internally via NIXL. A sketch of the proxy's request flow is shown below.
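The following is a minimal sketch of what the proxy's request flow could look like, assuming the prefill and decode deployments expose a chat-style method on their handles. The method name `chat` and the exact shape of `kv_transfer_params` are illustrative assumptions, not the final API:

    from ray import serve


    @serve.deployment
    class LLMPDProxyServer:
        def __init__(self, prefill_handle, decode_handle):
            self.prefill = prefill_handle
            self.decode = decode_handle

        async def chat(self, request: dict) -> dict:
            # 1. Run prefill only: request a single token and mark the KV cache
            #    for remote transfer (performed by the NIXL connector inside vLLM).
            prefill_request = {
                **request,
                "max_tokens": 1,
                "kv_transfer_params": {"do_remote_decode": True},
            }
            prefill_response = await self.prefill.chat.remote(prefill_request)

            # 2. Forward the original request to decode, attaching the transfer
            #    metadata returned by prefill so decode can pull the KV blocks.
            decode_request = {
                **request,
                "kv_transfer_params": prefill_response["kv_transfer_params"],
            }
            return await self.decode.chat.remote(decode_request)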

Architecture diagram


We have a prototype in #53092 that we hope to merge soon.

API

We want the end-to-end experience to look something like this:

from ray import serve
from ray.serve.llm import LLMConfig, LLMServer, LLMRouter
# LLMPDProxyServer is the new proxy deployment proposed in this RFC.

prefill_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    engine_kwargs=dict(
        data_parallel_size=4,
        tensor_parallel_size=1,
        enable_expert_parallel=True,
    ),
)

decode_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-r1"},
    engine_kwargs=dict(
        data_parallel_size=16,
        tensor_parallel_size=1,
        enable_expert_parallel=True,
    ),
)

prefill_server = LLMServer.bind(prefill_config)
decode_server = LLMServer.bind(decode_config)
pd_proxy = LLMPDProxyServer.bind(prefill_server, decode_server)

app = LLMRouter.bind([pd_proxy])
serve.run(app)
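As a usage sketch, and assuming the LLMRouter continues to expose an OpenAI-compatible HTTP endpoint, the deployed app could be queried with the standard OpenAI client; the base URL and API key below are illustrative:

    from openai import OpenAI

    # Point the client at the locally running Serve application.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake-key")

    response = client.chat.completions.create(
        model="deepseek-r1",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)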

cc @lk-chen @richardliaw
