@alec-flowers alec-flowers commented May 23, 2025

Overview:

This PR enables KV Routing in vLLM v1 by introducing a new KVEventPublisherFromZmq that listens for KV events over ZeroMQ and forwards them to the Indexer.

Details:

  • New Publisher: Added KVEventPublisherFromZmq (and config) to dynamo/lib/llm/src/kv_router/publisher.rs and exposed it to Python bindings.
  • Integration: Wrote a new vLLM v1 subprocess integration based on the v0 one from @grahamking (vllm_v1_inc.py) that uses the new ZMQ-based publisher for KV events.
  • Protocol Support: Added Rust-side support for deserializing and converting ZMQ-published events into internal router events.
  • Python Bindings: Updated Python bindings and type hints to expose the new publisher and config.
  • Dependency Updates: Added zeromq and rmp-serde as dependencies for ZMQ and msgpack support.
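To illustrate the "Protocol Support" item, here is a minimal Python sketch of the conversion step only: taking an already-decoded event batch and turning it into internal router events. It is not the Rust implementation in publisher.rs. The dict layout of the batch, the RouterEvent type, and the convert_batch helper are all assumptions made for illustration; the ZeroMQ receive and msgpack decode steps are left out so the sketch runs with the standard library alone.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class RouterEvent:
    """Hypothetical internal event handed to the indexer (not the real type)."""
    worker_id: int
    event_id: int
    kind: str  # "stored", "removed", or "cleared"
    block_hashes: List[int]
    parent_hash: Optional[int] = None


def convert_batch(worker_id: int, batch: Dict[str, Any],
                  next_event_id: int = 0) -> List[RouterEvent]:
    """Convert one decoded batch into router events.

    The batch shape ({"events": [{"type": ..., ...}, ...]}) is an assumed
    stand-in for whatever the msgpack payload decodes to.
    """
    out: List[RouterEvent] = []
    for raw in batch.get("events", []):
        kind = raw.get("type")
        if kind == "BlockStored":
            event = RouterEvent(worker_id, next_event_id, "stored",
                                list(raw["block_hashes"]),
                                raw.get("parent_block_hash"))
        elif kind == "BlockRemoved":
            event = RouterEvent(worker_id, next_event_id, "removed",
                                list(raw["block_hashes"]))
        elif kind == "AllBlocksCleared":
            event = RouterEvent(worker_id, next_event_id, "cleared", [])
        else:
            # Unknown event types are skipped and do not consume an event id.
            continue
        out.append(event)
        next_event_id += 1
    return out


if __name__ == "__main__":
    sample = {"events": [
        {"type": "BlockStored", "block_hashes": [11, 12], "parent_block_hash": 10},
        {"type": "BlockRemoved", "block_hashes": [11]},
    ]}
    for ev in convert_batch(worker_id=7, batch=sample):
        print(ev)
```

Keeping the conversion a pure function, separate from the socket loop, means it can be tested without a running engine; the actual listener would call it once per received message.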

Where should the reviewer start?

Start with dynamo/lib/llm/src/kv_router/publisher.rs to review the new ZMQ publisher implementation and its integration.

Builds on vllm-project/vllm#16750: this PR reads the events that vLLM publishes over ZMQ and forwards them to the Indexer.

Notes

Builds on: vllm-project/vllm#16750

Known Issue To-Fix:

Follow-ups (future PRs):

  • The same Publisher should work with SGLang out of the box, with no changes (@ishandhanani).
  • Add CLI flags for all vLLM args.
  • Revisit metrics publishing to align with new vLLM metrics.

Summary by CodeRabbit

  • New Features

    • Introduced a fully asynchronous, distributed language model server with vLLM integration, supporting advanced configuration and streaming generation.
    • Added ZeroMQ-based KV event publishing and ingestion for multi-process engine support.
    • Exposed new Python classes for ZeroMQ KV event publishing and configuration.
    • Expanded metrics reporting to include GPU cache usage, prefix cache hit rate, and data parallel rank.
  • Improvements

    • Enhanced model registration with optional context length and KV cache block size parameters.
    • Improved thread safety and flexibility in KV event handling and metrics publishing.
    • Refined GPU cache usage calculation for worker selection logic.
  • Bug Fixes

    • Ensured backwards compatibility by defaulting new fields where required.
  • Documentation

    • Updated Python type hints and module exports to reflect new features and interfaces.

@alec-flowers alec-flowers merged commit 0df6d46 into main May 29, 2025
13 checks passed
@alec-flowers alec-flowers deleted the aflowers/vllm_v1_publish branch May 29, 2025 14:56
@david-pureal

@alec-flowers Hi, could you clarify whether Dynamo uses the block_hash from vLLM? If it does, how does Dynamo ensure consistency between vLLM’s calculated block_hash (from BlockStore events) and its own block_hash calculated from request tokens?
