feat: add KV Event Publishing to vLLM v1 #1181

alec-flowers · 2025-05-23T06:15:21Z

Overview:

This PR enables KV Routing in vLLM v1 by introducing a new KVEventPublisherFromZmq that listens for KV events over ZeroMQ and forwards them to the Indexer.
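
For orientation, the listen-and-forward flow described above can be sketched roughly as follows. This is an illustration in Python, not the Rust implementation in this PR. The names `forward_events`, `recv`, and `sink` are hypothetical, and a `queue.Queue` stands in for the ZeroMQ subscriber socket so that the sketch runs without extra dependencies.

```python
import queue
from typing import Callable, List, Optional


def forward_events(
    recv: Callable[[], Optional[bytes]],
    sink: Callable[[bytes], None],
) -> int:
    """Pull raw event payloads from `recv` and hand each one to `sink`.

    `recv` stands in for a blocking read on a ZeroMQ SUB socket and `sink`
    for the call that passes events to the Indexer (both hypothetical).
    A payload of None is used here as a shutdown signal.
    Returns the number of payloads forwarded.
    """
    forwarded = 0
    while True:
        payload = recv()
        if payload is None:  # shutdown signal, an assumption of this sketch
            return forwarded
        sink(payload)
        forwarded += 1


# Usage: an in-process queue plays the role of the socket.
inbox: "queue.Queue[Optional[bytes]]" = queue.Queue()
for item in (b"event-1", b"event-2", None):
    inbox.put(item)

received: List[bytes] = []
count = forward_events(inbox.get, received.append)
```

Keeping the receive and forward sides behind plain callables is only a convenience for the sketch. It says nothing about how the Rust publisher is structured.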

Details:

New Publisher: Added KVEventPublisherFromZmq (and config) to dynamo/lib/llm/src/kv_router/publisher.rs and exposed it to Python bindings.
Integration: Added a new vLLM v1 subprocess integration (vllm_v1_inc.py), based on the v0 one from @grahamking, that uses the new ZMQ-based publisher for KV events.
Protocol Support: Added Rust-side support for deserializing and converting ZMQ-published events into internal router events.
Python Bindings: Updated Python bindings and type hints to expose the new publisher and config.
Dependency Updates: Added zeromq and rmp-serde as dependencies for ZMQ and msgpack support.
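
The conversion step under "Protocol Support" can be illustrated with a small sketch. It is written in Python for brevity, whereas the PR implements this in Rust, and every name in it (`EngineBlockStored`, `RouterStoredBlock`, `to_router_blocks`) is hypothetical. It assumes, without confirmation from this PR, that a stored-blocks event carries a list of block hashes, a flat list of token IDs, and a block size, and that the router wants one record per block.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EngineBlockStored:
    """Hypothetical shape of a 'blocks stored' event received from the engine."""
    block_hashes: List[int]
    parent_block_hash: Optional[int]
    token_ids: List[int]
    block_size: int


@dataclass
class RouterStoredBlock:
    """Hypothetical per-block record on the router side."""
    block_hash: int
    parent_hash: Optional[int]
    token_ids: List[int]


def to_router_blocks(event: EngineBlockStored) -> List[RouterStoredBlock]:
    """Split the flat token list into one record per block, chaining parents."""
    expected = len(event.block_hashes) * event.block_size
    if len(event.token_ids) != expected:
        raise ValueError(
            f"expected {expected} token ids, got {len(event.token_ids)}"
        )

    blocks = []
    parent = event.parent_block_hash
    for i, block_hash in enumerate(event.block_hashes):
        start = i * event.block_size
        tokens = event.token_ids[start:start + event.block_size]
        blocks.append(RouterStoredBlock(block_hash, parent, tokens))
        parent = block_hash  # the next block in this event hangs off this one
    return blocks


# Usage: two blocks of four tokens each, with no parent for the first block.
event = EngineBlockStored(
    block_hashes=[101, 102],
    parent_block_hash=None,
    token_ids=[1, 2, 3, 4, 5, 6, 7, 8],
    block_size=4,
)
blocks = to_router_blocks(event)
```

The length check rejects events whose token count does not match the number of hashes times the block size. Whether the real converter rejects or tolerates such events is not stated in this PR.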

Where should the reviewer start?

Start with dynamo/lib/llm/src/kv_router/publisher.rs to review the new ZMQ publisher implementation and its integration.

This PR builds on vllm-project/vllm#16750: it reads the events published over ZMQ and forwards them to the Indexer.

Notes

Builds on: vllm-project/vllm#16750

Known Issue To-Fix:

Follow-ups (future PRs):

This same Publisher should work out of the box with SGLang, with no changes (@ishandhanani).
Add CLI flags for all vLLM args.
Revisit metrics publishing to align with new vLLM metrics.

Summary by CodeRabbit

New Features
- Introduced a fully asynchronous, distributed language model server with vLLM integration, supporting advanced configuration and streaming generation.
- Added ZeroMQ-based KV event publishing and ingestion for multi-process engine support.
- Exposed new Python classes for ZeroMQ KV event publishing and configuration.
- Expanded metrics reporting to include GPU cache usage, prefix cache hit rate, and data parallel rank.
Improvements
- Enhanced model registration with optional context length and KV cache block size parameters.
- Improved thread safety and flexibility in KV event handling and metrics publishing.
- Refined GPU cache usage calculation for worker selection logic.
Bug Fixes
- Ensured backwards compatibility by defaulting new fields where required.
Documentation
- Updated Python type hints and module exports to reflect new features and interfaces.

launch/dynamo-run/src/subprocess/vllm_v1_inc.py

lib/bindings/python/rust/llm/kv.rs

lib/llm/src/kv_router/publisher.rs

lib/llm/src/kv_router/protocols.rs

david-pureal · 2025-06-04T13:53:39Z

@alec-flowers Hi, could you clarify whether Dynamo uses the block_hash from vLLM? If it does, how does Dynamo ensure consistency between vLLM’s calculated block_hash (from BlockStore events) and its own block_hash calculated from request tokens?

pull-request-size bot added the size/XXL label May 23, 2025

copy-pr-bot bot temporarily deployed to GITLAB May 23, 2025 06:15 Inactive

github-actions bot added the feat label May 23, 2025

alec-flowers force-pushed the aflowers/vllm_v1_publish branch from cff6e3a to b2d0847 Compare May 23, 2025 06:29

alec-flowers force-pushed the aflowers/vllm_v1_publish branch from b2d0847 to 00a0687 Compare May 23, 2025 06:43

alec-flowers force-pushed the aflowers/vllm_v1_publish branch from 00a0687 to c9a4330 Compare May 23, 2025 06:57
