
Conversation

@faradawn
Contributor

@faradawn faradawn commented Jun 2, 2025

Overview:

When SGLang is launched without an explicit page_size, it silently falls back to 1, producing variable-sized “blocks” that Dynamo discards.

This patch guarantees the two systems stay aligned.

Nvidia Dynamo

Details:

First, in sglang_inc.py: when building ServerArgs, we now set page_size = config.kv_block_size if it is positive; otherwise we inject a safe default of 16 tokens. This prevents SGLang from defaulting to page_size=1.
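For illustration only, a minimal sketch of that guard assuming the 16-token default; the helper name resolve_page_size is hypothetical, and in the actual change the logic is inlined where ServerArgs is constructed:

def resolve_page_size(kv_block_size, default: int = 16) -> int:
    # Use the configured KV block size only when it is a positive integer;
    # otherwise fall back to the default so SGLang never silently gets page_size=1.
    if isinstance(kv_block_size, int) and kv_block_size > 0:
        return kv_block_size
    return default

# Example: resolve_page_size(0) -> 16, resolve_page_size(32) -> 32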

Second, on the SGLang side, a companion change makes _record_store_event() slice prompts into page_size chunks before emitting BlockStored events, so every published block is exactly kv_block_size tokens. Link to PR: sgl-project/sglang#6824
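Roughly, the slicing looks like the sketch below; the function name and the decision to drop a trailing partial chunk are illustrative assumptions, not taken verbatim from the SGLang PR:

def slice_into_pages(token_ids: list[int], page_size: int) -> list[list[int]]:
    # One fixed-size chunk per BlockStored event; the trailing partial chunk
    # is skipped here so every published block is exactly page_size tokens.
    full = len(token_ids) - (len(token_ids) % page_size)
    return [token_ids[i : i + page_size] for i in range(0, full, page_size)]

# Example: slice_into_pages(list(range(40)), 16) yields two 16-token pages.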

  • Eliminates dropped KV-cache events caused by size mismatches.
  • Ensures cache-aware routing works out of the box even when the user forgets --kv-block-size.
  • Keeps behaviour consistent across the vLLM, TRT-LLM, and SGLang back-ends.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added a KV-based router for efficient request routing among SGLang workers using cache overlap and runtime metrics.
    • Introduced a processing pipeline linking frontend, router, and worker components for streamlined request handling.
    • Added chat completion endpoint to frontend service with improved logging and resource configuration.
    • Defined new structured message formats to support request and response communication for language model interactions.
    • Added configuration and example for aggregated KV routing setup.
  • Bug Fixes

    • Ensured page_size is properly set with a default value when configuration is missing or invalid, preventing unintended behavior.
  • Chores

    • Enhanced logging during worker initialization and argument parsing for better observability.
    • Updated Dockerfile to use latest commit with performance improvements and fixes.
    • Improved documentation for Python wheel build process.

@copy-pr-bot

copy-pr-bot bot commented Jun 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions bot commented Jun 2, 2025

👋 Hi faradawn! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Jun 2, 2025
@coderabbitai
Contributor

coderabbitai bot commented Jun 2, 2025

Walkthrough

The condition verifying the kv_block_size parameter in the initialization was refined to ensure it is greater than zero before assigning it to page_size; otherwise, a default of 16 is used. A new KV-based router service was added to route requests to SGLang workers using cache overlap and runtime metrics. A new script sets up a pipeline linking Frontend, Router, and SGLangWorker components. The Frontend component was updated to depend on the Router alongside the worker, with added chat completion endpoint and lifecycle hooks. The worker component was enhanced to initialize a ZeroMQ key-value event publisher bridging events to NATS. New Pydantic data models for request/response protocols were introduced. The argument parser was updated to set a default page_size of 64 with logging if not provided. The Dockerfile was updated to use a newer sglang commit with performance fixes and the KV event publisher fix. Documentation and configuration files were added or updated to support the new routing architecture.

Changes

File(s) — Change Summary

  • launch/dynamo-run/src/subprocess/sglang_inc.py — Modified init to check kv_block_size > 0 before assigning page_size; defaults to 16 if invalid or unset.
  • examples/sglang/components/kv_router.py — Added new Router service class implementing KV-based routing logic using cache overlap scores and runtime metrics.
  • examples/sglang/graphs/agg_router.py — Added new script linking Frontend → Router → SGLangWorker to form a processing pipeline.
  • examples/sglang/components/frontend.py — Changed dependency injection in Frontend to include Router; added chat completion async endpoint, resource specs, logging, and shutdown cleanup.
  • examples/sglang/components/worker.py — Added initialization of ZmqKvEventPublisher in SGLangWorker to bridge ZMQ events to NATS with error handling and logging; enhanced logging in generation method.
  • examples/sglang/utils/protocol.py — Added new Pydantic models: Request, Tokens, PrefillRequest, Response, and PrefillResponse for structured messaging.
  • examples/sglang/utils/sglang.py — Added logging and default page_size=64 assignment with info log if missing in parse_sglang_args.
  • container/Dockerfile.sglang — Updated SGLANG_COMMIT ARG to a newer commit with performance fixes and the KV event publisher fix; updated related comments.
  • examples/sglang/README.md — Added "Aggregated with KV Routing" example with command to run Dynamo server using graphs.agg_router:Frontend.
  • examples/sglang/configs/agg_router.yaml — Added new YAML config defining model, router, frontend, and worker parameters for KV routing architecture.
  • lib/bindings/python/README.md — Updated build instructions to include changing directory before running maturin develop.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Frontend
    participant Router
    participant SGLangWorker

    Client->>Frontend: Send request
    Frontend->>Router: Forward request (generate)
    Router->>SGLangWorker: Route request based on cache overlap and metrics
    SGLangWorker-->>Router: Response with worker ID and prefix hit rate
    Router-->>Frontend: Yield best worker info
    Frontend-->>Client: Return response

Possibly related PRs

Poem

A check for block size, clear and bright,
Now guards the page size day and night.
No more sneaky ones will slip through,
Sixteen's the fallback, sturdy and true.
With logic neat and errors slight,
The code now hops along just right! 🐇

New routes to workers, clever and keen,
Cache hits and metrics guide the scene.
Frontend, Router, Worker in line,
Together they work, efficient and fine! 🚀🐰



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch (3)

43-49: Consider using SGLang-specific environment variable names.

The code uses VLLM_* environment variables in an SGLang context, which could cause confusion for users. Consider using SGLANG_* prefixed variables or making this configurable.

-        namespace = namespace or os.getenv("VLLM_KV_NAMESPACE") or "default"
-        component = component or os.getenv("VLLM_KV_COMPONENT") or "sglang"
-        worker_id = worker_id if worker_id is not None else int(
-            os.getenv("VLLM_WORKER_ID", "0"))
-        lib_path = lib_path or os.getenv("VLLM_KV_CAPI_PATH")
+        namespace = namespace or os.getenv("SGLANG_KV_NAMESPACE") or "default"
+        component = component or os.getenv("SGLANG_KV_COMPONENT") or "sglang"
+        worker_id = worker_id if worker_id is not None else int(
+            os.getenv("SGLANG_WORKER_ID", "0"))
+        lib_path = lib_path or os.getenv("SGLANG_KV_CAPI_PATH")

156-159: Consider logging ignored events for debugging.

Silently ignoring non-KVEventBatch events could mask configuration errors or unexpected usage patterns. Consider adding debug logging.

     def publish(self, events: EventBatch) -> None:  # type: ignore[override]
         # Only handle KVEventBatch; ignore others quietly
         if not isinstance(events, KVEventBatch):
+            logger.debug("Ignoring non-KVEventBatch event: %s", type(events).__name__)
             return

171-175: Document the design decision for AllBlocksCleared handling.

The comment explains why AllBlocksCleared events are ignored, but this could be a source of confusion. Consider making this behavior more explicit.

             elif isinstance(ev, AllBlocksCleared):
-                # We could broadcast each removed block, but Radix clears by
-                # resetting cache; upstream workers will clear too. For now
-                # no-op.
-                continue
+                # AllBlocksCleared events are intentionally ignored because:
+                # 1. RadixCache clears by resetting the entire cache
+                # 2. Upstream workers will clear their caches independently
+                # 3. Broadcasting individual block removals would be inefficient
+                logger.debug("Ignoring AllBlocksCleared event (intentional no-op)")
+                continue
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 859944f and 5a21312.

📒 Files selected for processing (1)
  • container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch (1)

182-182:

❓ Verification inconclusive

Verify registration key consistency across the codebase.

Ensure that the registration key "dynamo" is consistently used and documented throughout the codebase for proper event publisher discovery.


🏁 Script executed:

#!/bin/bash
# Description: Search for references to "dynamo" event publisher key to ensure consistency

# Search for any configuration or documentation mentioning "dynamo" publisher
rg -i "dynamo.*publisher|publisher.*dynamo" --type py

# Search for event publisher factory usage to understand expected keys
rg "EventPublisherFactory" --type py -A 3 -B 3

Length of output: 1322


🏁 Script executed:

#!/usr/bin/env bash
# Locate the patch file
patch_file=$(fd sglang_v0.4.6-dynamo-kv-disagg-patch.patch -t f)
echo "Found patch file: $patch_file"
# Show context around the inserted publisher registration (lines 160–220)
sed -n '160,220p' "$patch_file"

Length of output: 1348


🏁 Script executed:

#!/usr/bin/env bash
# Find definition of EventPublisherFactory
rg -n "class EventPublisherFactory" --type py

# Find any usage of EventPublisherFactory.get or lookup in registry
rg -n "EventPublisherFactory" --type py -A3 -B3

Length of output: 95


Confirm new publisher registration in sglang’s EventPublisherFactory

The patch adds the following at the end of container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch:

+# Register new publisher
+EventPublisherFactory._registry["dynamo"] = DynamoEventPublisher

Please manually verify that:

  • The EventPublisherFactory class in the sglang codebase defines an internal _registry and uses it for publisher lookup.
  • The key "dynamo" exactly matches the lookup name used in methods like get or create.
  • Any related documentation (README, code comments, configs) includes and explains the "dynamo" registration key.

Comment on lines 48 to 51
+        if lib_path is None:
+            raise RuntimeError("VLLM_KV_CAPI_PATH environment variable not set")
+
+        self._lib = CDLL(lib_path)
Contributor

⚠️ Potential issue

Add security validation for library path and improve error handling.

Loading arbitrary C libraries from environment variables poses security risks. Additionally, the CDLL loading should have proper error handling.

         if lib_path is None:
             raise RuntimeError("VLLM_KV_CAPI_PATH environment variable not set")

+        # Validate library path exists and is readable
+        if not os.path.isfile(lib_path):
+            raise RuntimeError(f"C-API library not found: {lib_path}")
+        if not os.access(lib_path, os.R_OK):
+            raise RuntimeError(f"C-API library not readable: {lib_path}")
+
-        self._lib = CDLL(lib_path)
+        try:
+            self._lib = CDLL(lib_path)
+        except OSError as e:
+            raise RuntimeError(f"Failed to load C-API library {lib_path}: {e}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+ if lib_path is None:
+ raise RuntimeError("VLLM_KV_CAPI_PATH environment variable not set")
+
+ self._lib = CDLL(lib_path)
if lib_path is None:
raise RuntimeError("VLLM_KV_CAPI_PATH environment variable not set")
# Validate library path exists and is readable
if not os.path.isfile(lib_path):
raise RuntimeError(f"C-API library not found: {lib_path}")
if not os.access(lib_path, os.R_OK):
raise RuntimeError(f"C-API library not readable: {lib_path}")
try:
self._lib = CDLL(lib_path)
except OSError as e:
raise RuntimeError(f"Failed to load C-API library {lib_path}: {e}")
🤖 Prompt for AI Agents
In container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch around lines
48 to 51, add validation to ensure the library path from the environment
variable is safe and expected before loading it with CDLL. Also, wrap the CDLL
loading in a try-except block to catch and handle any exceptions, raising a
clear error if loading fails. This prevents security risks from arbitrary paths
and improves error handling.

Comment on lines 107 to 111
+        num_blocks = 1  # SGLang RadixCache emits one block at a time
+
+        token_arr = (c_uint32 * len(token_ids))(*token_ids)
+        num_block_tokens_arr = (c_size_t * 1)(len(token_ids))
+        block_hash_arr = (c_uint64 * 1)(block_hashes[0])
Contributor

⚠️ Potential issue

Fix potential index error and validate assumptions.

The code assumes block_hashes has at least one element without validation, and hardcodes num_blocks = 1 based on a comment about SGLang RadixCache behavior.

     def publish_stored(
         self,
         token_ids: List[int],
         block_hashes: List[int],
         parent_hash: Optional[int],
         block_size: int,
         lora_id: Optional[int] = None,
     ) -> None:
-        num_blocks = 1  # SGLang RadixCache emits one block at a time
+        if not block_hashes:
+            raise ValueError("block_hashes cannot be empty")
+        
+        num_blocks = 1  # SGLang RadixCache emits one block at a time
+        if len(block_hashes) != 1:
+            logger.warning("Expected exactly 1 block hash, got %d. Using first hash only.", len(block_hashes))

         token_arr = (c_uint32 * len(token_ids))(*token_ids)
         num_block_tokens_arr = (c_size_t * 1)(len(token_ids))
         block_hash_arr = (c_uint64 * 1)(block_hashes[0])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+ num_blocks = 1 # SGLang RadixCache emits one block at a time
+
+ token_arr = (c_uint32 * len(token_ids))(*token_ids)
+ num_block_tokens_arr = (c_size_t * 1)(len(token_ids))
+ block_hash_arr = (c_uint64 * 1)(block_hashes[0])
def publish_stored(
self,
token_ids: List[int],
block_hashes: List[int],
parent_hash: Optional[int],
block_size: int,
lora_id: Optional[int] = None,
) -> None:
if not block_hashes:
raise ValueError("block_hashes cannot be empty")
num_blocks = 1 # SGLang RadixCache emits one block at a time
if len(block_hashes) != 1:
logger.warning(
"Expected exactly 1 block hash, got %d. Using first hash only.",
len(block_hashes),
)
token_arr = (c_uint32 * len(token_ids))(*token_ids)
num_block_tokens_arr = (c_size_t * 1)(len(token_ids))
block_hash_arr = (c_uint64 * 1)(block_hashes[0])
🤖 Prompt for AI Agents
In container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch around lines
107 to 111, the code assumes block_hashes has at least one element and hardcodes
num_blocks to 1 without validation. Add a check to ensure block_hashes is not
empty before accessing block_hashes[0], and validate or dynamically determine
num_blocks instead of hardcoding it to 1 to prevent potential index errors and
incorrect assumptions.

Comment on lines 152 to 154
+        # block_size (tokens per block) must match allocator; default 16.
+        from .dynamo_event_manager import KVCacheEventManager
+
Contributor

🛠️ Refactor suggestion

Move import to module level to avoid potential issues.

Importing modules inside __init__ methods is generally not recommended as it can cause circular import issues and makes dependencies less clear.

+from .dynamo_event_manager import KVCacheEventManager
+
 # Add DynamoEventPublisher
 class DynamoEventPublisher(EventPublisher):
     """Publisher that forwards events to Dynamo via the C-API bindings."""

     def __init__(self, kv_block_size: int = 16):
         # block_size (tokens per block) must match allocator; default 16.
-        from .dynamo_event_manager import KVCacheEventManager

         self._mgr = KVCacheEventManager(kv_block_size=kv_block_size)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+ # block_size (tokens per block) must match allocator; default 16.
+ from .dynamo_event_manager import KVCacheEventManager
+
from .dynamo_event_manager import KVCacheEventManager
# Add DynamoEventPublisher
class DynamoEventPublisher(EventPublisher):
"""Publisher that forwards events to Dynamo via the C-API bindings."""
def __init__(self, kv_block_size: int = 16):
# block_size (tokens per block) must match allocator; default 16.
self._mgr = KVCacheEventManager(kv_block_size=kv_block_size)
🤖 Prompt for AI Agents
In container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch around lines
152 to 154, the import of KVCacheEventManager is done inside a method or
function. Move this import statement to the top of the module file, outside of
any functions or classes, to avoid potential circular import issues and improve
clarity of dependencies.

Comment on lines 161 to 175
+        for ev in events.events:
+            if isinstance(ev, BlockStored):
+                self._mgr.publish_stored(
+                    token_ids=ev.token_ids,
+                    block_hashes=ev.block_hashes,
+                    parent_hash=ev.parent_block_hash,
+                    block_size=ev.block_size,
+                    lora_id=ev.lora_id,
+                )
+            elif isinstance(ev, BlockRemoved):
+                self._mgr.publish_removed(ev.block_hashes)
+            elif isinstance(ev, AllBlocksCleared):
+                # We could broadcast each removed block, but Radix clears by
+                # resetting cache; upstream workers will clear too. For now
+                # no-op.
Contributor

🛠️ Refactor suggestion

Add validation for event data integrity.

The event handling forwards data to the C-API without validation. Consider adding basic checks to prevent potential crashes or invalid data.

         for ev in events.events:
             if isinstance(ev, BlockStored):
+                if not ev.token_ids:
+                    logger.warning("Skipping BlockStored event with empty token_ids")
+                    continue
+                if not ev.block_hashes:
+                    logger.warning("Skipping BlockStored event with empty block_hashes")
+                    continue
                 self._mgr.publish_stored(
                     token_ids=ev.token_ids,
                     block_hashes=ev.block_hashes,
                     parent_hash=ev.parent_block_hash,
                     block_size=ev.block_size,
                     lora_id=ev.lora_id,
                 )
             elif isinstance(ev, BlockRemoved):
+                if not ev.block_hashes:
+                    logger.warning("Skipping BlockRemoved event with empty block_hashes")
+                    continue
                 self._mgr.publish_removed(ev.block_hashes)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+ for ev in events.events:
+ if isinstance(ev, BlockStored):
+ self._mgr.publish_stored(
+ token_ids=ev.token_ids,
+ block_hashes=ev.block_hashes,
+ parent_hash=ev.parent_block_hash,
+ block_size=ev.block_size,
+ lora_id=ev.lora_id,
+ )
+ elif isinstance(ev, BlockRemoved):
+ self._mgr.publish_removed(ev.block_hashes)
+ elif isinstance(ev, AllBlocksCleared):
+ # We could broadcast each removed block, but Radix clears by
+ # resetting cache; upstream workers will clear too. For now
+ # no-op.
for ev in events.events:
if isinstance(ev, BlockStored):
if not ev.token_ids:
logger.warning("Skipping BlockStored event with empty token_ids")
continue
if not ev.block_hashes:
logger.warning("Skipping BlockStored event with empty block_hashes")
continue
self._mgr.publish_stored(
token_ids=ev.token_ids,
block_hashes=ev.block_hashes,
parent_hash=ev.parent_block_hash,
block_size=ev.block_size,
lora_id=ev.lora_id,
)
elif isinstance(ev, BlockRemoved):
if not ev.block_hashes:
logger.warning("Skipping BlockRemoved event with empty block_hashes")
continue
self._mgr.publish_removed(ev.block_hashes)
elif isinstance(ev, AllBlocksCleared):
# We could broadcast each removed block, but Radix clears by
# resetting cache; upstream workers will clear too. For now
# no-op.
🤖 Prompt for AI Agents
In container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch around lines
161 to 175, the event handling code forwards event data directly to the C-API
without validating the integrity of the data. To fix this, add basic validation
checks for required fields in each event type before calling the publish
methods. For example, verify that token_ids and block_hashes are not empty or
None, and that block_size is a positive integer. If validation fails, handle the
error gracefully, such as logging a warning and skipping the publish call to
prevent crashes or invalid data propagation.

@ishandhanani
Contributor

ishandhanani commented Jun 2, 2025

Hi @faradawn - thanks for this PR.

Have you taken a look at the following?

  1. [Metrics] Add KV events publishing sgl-project/sglang#6098 - this allows you to publish KV events from SGL via ZMQ.
  2. https://github.com/ai-dynamo/dynamo/pull/1181/files - this added the ability for Dynamo to listen to events via ZMQ. It should work for SGL as well

I see what you're trying to do but I don't think we need to maintain a patch given those 2 PRs above.


Given this info, however, the current KV event emitting logic in SGL has a bug:

  • In the radix tree impl, SGL compresses multiple blocks (of size page_size) into a single node for space optimization. So the current event-emitting logic in the radix_cache is not quite right, because we emit events with block sizes that are variable (something Dynamo does not currently support).

Let me know if you have any thoughts here. It's on my TODO to figure out a solution for this. Happy to discuss and see if we can parallelize work here!

@faradawn
Contributor Author

faradawn commented Jun 2, 2025

Hi @ishandhanani , here are two ideas:

Idea 1. Fix SGLang event publisher. Just before an event is pushed onto kv_event_queue, slice the node's value into fixed-size chunks of page_size and emit 1 BlockStored event per chunk.

Idea 2: Fix Dynamo listener. Break a large block into chunks of page_size and simulate the arrival of each page.

You have more experience and I'd love to discuss and execute any plan.

@alec-flowers
Contributor

We don't want to continue using the C path; we should add a note that it's deprecated for now. We have a Python KVEventPublisher that is easier to use, and we should use that. Also, we want to avoid patches and instead commit to mainline. That is what @ishandhanani is pointing out: there is work already committed so that we can pull events without needing patches.

@ishandhanani
Contributor

ishandhanani commented Jun 3, 2025

Hi @ishandhanani , here are two ideas:

Idea 1. Fix SGLang event publisher. Just before an event is pushed onto kv_event_queue, slice the node's value into fixed-size chunks of page_size and emit 1 BlockStored event per chunk.

Idea 2: Fix Dynamo listener. Break a large block into chunks of page_size and simulate the arrival of each page.

You have more experience and I'd love to discuss and execute any plan.

I think option 1 is better. I think we want events emitted from SGLang to have a consistent block size. This allows us to maintain approximately the same tree, just with our prefix tree having fixed block sizes. Do you want to take a stab at the PR to SGLang for this @faradawn?

@faradawn
Contributor Author

faradawn commented Jun 3, 2025

@ishandhanani sounds good! Is the ZMQ for kv event only used by Dynamo? I can take a look and create a PR to SGLang.

@ishandhanani
Contributor

ishandhanani commented Jun 3, 2025

@ishandhanani sounds good! Is the ZMQ for kv event only used by Dynamo? I can take a look and create a PR to SGLang.

Great! I pinged you in the discord in case we wanted to discuss a bit more synchronously.

At the moment yes - it is only used by Dynamo. However - any other framework that wants to build their own router based on ZMQ events could in theory use it.

@faradawn
Contributor Author

faradawn commented Jun 3, 2025

PR created for SGLang: sgl-project/sglang#6824. Will keep you updated!

@alec-flowers
Contributor

If SGLang would accept it we could embed our own Publisher directly. Then we can skip the whole zmq stuff and publish directly from each process to NATS.

For data parallel we have made a few small changes to vLLM that we should try to mirror in SGLang
vllm-project/vllm#18925

@ishandhanani
Contributor

If SGLang would accept it we could embed our own Publisher directly. Then we can skip the whole zmq stuff and publish directly from each process to NATS.

For data parallel we have made a few small changes to vLLM that we should try to mirror in SGLang vllm-project/vllm#18925

Nice - let me take a look

@faradawn faradawn requested a review from richardhuo-nv as a code owner June 5, 2025 20:12
@faradawn faradawn requested a review from piotrm-nvidia as a code owner June 5, 2025 20:18
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (2)
examples/sglang/components/frontend.py (1)

20-29: ⚠️ Potential issue

Remove now-unused SGLangWorker import to unblock CI

SGLangWorker is no longer referenced after switching the dependency to Router, and the Ruff step already failed once for the same issue. Deleting the import keeps the import section clean and prevents the next lint/format cycle from rewriting this file again.

-from components.worker import SGLangWorker
-from fastapi import FastAPI
+from fastapi import FastAPI
🧰 Tools
🪛 Ruff (0.11.9)

20-20: components.worker.SGLangWorker imported but unused

Remove unused import: components.worker.SGLangWorker

(F401)

🪛 Pylint (3.3.7)

[refactor] 24-24: Use 'from dynamo import sdk' instead

(R0402)

examples/sglang/components/worker.py (1)

71-99: ⚠️ Potential issue

KV-publisher is initialised against the decode worker, not this worker

comp_ns, comp_name is reassigned to SGLangDecodeWorker.dynamo_address() (lines 81-83).
Lines 94-97 then create component_ref = runtime.namespace(comp_ns).component(comp_name), which therefore points to the decode worker, while the KV-events you want originate from the prefill/aggregate worker (this component).

Create a separate pair of variables or call self.__class__.dynamo_address() again when wiring up the publisher:

-            comp_ns, comp_name = SGLangDecodeWorker.dynamo_address()
+            comp_ns_dec, comp_name_dec = SGLangDecodeWorker.dynamo_address()
             self.decode_client = (
-                await runtime.namespace(comp_ns)
-                .component(comp_name)
+                await runtime.namespace(comp_ns_dec)
+                .component(comp_name_dec)
                 .endpoint("generate")
                 .client()
             )
...
-            component_ref = runtime.namespace(comp_ns).component(comp_name)
+            comp_ns_self, comp_name_self = self.__class__.dynamo_address()
+            component_ref = runtime.namespace(comp_ns_self).component(comp_name_self)
🧰 Tools
🪛 Pylint (3.3.7)

[error] 71-71: Class 'SGLangWorker' has no 'dynamo_address' member

(E1101)

🧹 Nitpick comments (4)
examples/sglang/utils/protocol.py (2)

22-24: Narrow the type of sampling_params

sampling_params is currently typed as a plain dict, which loses all static-type information and IDE assistance. If the structure is not fixed yet, at least annotate it as dict[str, Any] (or introduce a dedicated SamplingParams model) so downstream code can rely on proper typing.

-from pydantic import BaseModel, Field
+from typing import Any
+from pydantic import BaseModel, Field
...
-    sampling_params: dict
+    sampling_params: dict[str, Any]
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 22-22: Too few public methods (0/2)

(R0903)


27-28: Prefer TokenIdType for consistency

tokens: list[int] diverges from the alias declared above (TokenIdType = int). Using the alias keeps future refactors (e.g., moving to numpy.uint32) mechanical:

-class Tokens(BaseModel):
-    tokens: list[int]
+class Tokens(BaseModel):
+    tokens: list[TokenIdType]
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 27-27: Too few public methods (0/2)

(R0903)

examples/sglang/components/kv_router.py (2)

168-226: _cost_function is doing too much – extract helpers

18 local variables and mixed concerns (score conversion, normalisation, logging, final selection) make this block hard to reason about and unit-test. Consider:

  1. _convert_block_scores() – blocks ➜ token hit-rate
  2. _merge_metrics() – build worker_metrics & max_waiting
  3. _compute_logits() – linear combination & tie-breaking

This keeps each piece ≤ 30 LOC and removes the Pylint R0914 hit.
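A free-function sketch of that split, with plain dicts standing in for the OverlapScores/AggregatedMetrics objects; the names follow the bullets above and the formula mirrors the existing 2 * score - gpu_cache_usage - normalized_waiting combination, but none of this is the actual implementation:

from typing import Dict, Tuple

def _convert_block_scores(scores: Dict[str, float], block_size: int, token_length: int) -> Dict[str, float]:
    # Overlap counted in blocks -> per-worker token hit-rate
    return {wid: s * block_size / token_length for wid, s in scores.items()}

def _merge_metrics(endpoint_metrics: Dict[str, Dict[str, float]]) -> Tuple[Dict[str, Dict[str, float]], float]:
    # Per-worker metrics plus the largest queue depth, used for normalisation
    max_waiting = max((m["num_requests_waiting"] for m in endpoint_metrics.values()), default=0.0)
    return endpoint_metrics, max_waiting

def _compute_logits(worker_scores: Dict[str, float], worker_metrics: Dict[str, Dict[str, float]], max_waiting: float) -> Dict[str, float]:
    # Same linear combination as the current code; higher is better
    logits = {}
    for wid, m in worker_metrics.items():
        waiting = m["num_requests_waiting"] / max_waiting if max_waiting > 0 else 0.0
        logits[wid] = 2 * worker_scores.get(wid, 0.0) - m["gpu_cache_usage_perc"] - waiting
    return logits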

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 168-168: Too many local variables (18/15)

(R0914)


228-235: Empty logits fall back to "" – propagate a valid worker instead

Returning an empty string forces the caller to add fallback logic. A safer default is a random available worker (keeps throughput) or raising so the caller can retry.

-            return "", 0.0
+            return random.choice(worker_ids), 0.0
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53ea353 and 077ead1.

📒 Files selected for processing (5)
  • examples/sglang/components/frontend.py (2 hunks)
  • examples/sglang/components/kv_router.py (1 hunks)
  • examples/sglang/components/worker.py (3 hunks)
  • examples/sglang/utils/protocol.py (1 hunks)
  • examples/sglang/utils/sglang.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/sglang/components/kv_router.py (6)
examples/sglang/components/worker.py (3)
  • SGLangWorker (56-216)
  • async_init (68-119)
  • generate (154-195)
examples/sglang/utils/protocol.py (1)
  • Tokens (27-28)
deploy/sdk/src/dynamo/sdk/lib/decorators.py (1)
  • async_on_start (95-99)
deploy/sdk/src/dynamo/sdk/core/lib.py (2)
  • depends (121-146)
  • service (88-118)
deploy/sdk/src/dynamo/sdk/lib/config.py (1)
  • as_args (81-132)
lib/bindings/python/rust/llm/kv.rs (1)
  • get_metrics (405-428)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.
examples/sglang/components/frontend.py

[error] 1-1: isort formatting failed and modified this file to fix import order.


[error] 17-25: ruff linting failed and modified this file to fix an unused import error.

examples/sglang/components/worker.py

[error] 1-1: isort formatting failed and modified this file to fix import order.


[error] 1-1: black formatting failed and reformatted this file to fix code style issues.

examples/sglang/utils/protocol.py

[error] 1-1: black formatting failed and reformatted this file to fix code style issues.

examples/sglang/utils/sglang.py

[error] 1-1: black formatting failed and reformatted this file to fix code style issues.

examples/sglang/components/kv_router.py

[error] 1-1: isort formatting failed and modified this file to fix import order.


[error] 1-1: black formatting failed and reformatted this file to fix code style issues.

🪛 Pylint (3.3.7)
examples/sglang/utils/protocol.py

[refactor] 22-22: Too few public methods (0/2)

(R0903)


[refactor] 27-27: Too few public methods (0/2)

(R0903)


[refactor] 31-31: Too few public methods (0/2)

(R0903)


[refactor] 35-35: Too few public methods (0/2)

(R0903)


[refactor] 39-39: Too few public methods (0/2)

(R0903)

examples/sglang/utils/sglang.py

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

examples/sglang/components/kv_router.py

[refactor] 55-55: Too few public methods (0/2)

(R0903)


[refactor] 168-168: Too many local variables (18/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
examples/sglang/utils/sglang.py (1)

20-21: Verify import path sglang.srt.server_args

Static analysis flags No name 'srt' in module 'sglang'. If srt is an internal sub-package introduced downstream this is fine, but it is easy to mis-spell (srt vs srv / rst). Please re-check the actual package path in the SGLang wheel used in production.

🧰 Tools
🪛 Pylint (3.3.7)

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

Comment on lines 98 to 103
            # Derive KV block size from engine args, falling back to 16
            kv_block_size = getattr(
                self.engine_args, "page_size", getattr(self.engine_args, "block_size", 16)
            )

            zmq_config = ZmqKvEventPublisherConfig(
Contributor

🛠️ Refactor suggestion

page_size/block_size may arrive as string

argparse leaves CLI values as str when they come from the service config.
A manual int() cast guards against accidental string propagation to Rust bindings:

-            kv_block_size = getattr(
-                self.engine_args, "page_size", getattr(self.engine_args, "block_size", 16)
-            )
+            kv_block_size = int(
+                getattr(
+                    self.engine_args,
+                    "page_size",
+                    getattr(self.engine_args, "block_size", 16),
+                )
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Derive KV block size from engine args, falling back to 16
kv_block_size = getattr(
self.engine_args, "page_size", getattr(self.engine_args, "block_size", 16)
)
zmq_config = ZmqKvEventPublisherConfig(
# Derive KV block size from engine args, falling back to 16
kv_block_size = int(
getattr(
self.engine_args,
"page_size",
getattr(self.engine_args, "block_size", 16),
)
)
zmq_config = ZmqKvEventPublisherConfig(
🤖 Prompt for AI Agents
In examples/sglang/components/worker.py around lines 98 to 103, the
kv_block_size is derived from engine_args.page_size or engine_args.block_size,
but these values may be strings due to argparse behavior. To fix this,
explicitly cast the retrieved value to int to ensure kv_block_size is always an
integer before passing it to Rust bindings.

Comment on lines 90 to 97
"--custom-router",
type=bool,
default=False,
help="Whether to use custom router or not",
)
parser.add_argument(
"--router",
type=str,
Contributor

⚠️ Potential issue

Boolean CLI flag should use action="store_true"

type=bool expects the user to pass an explicit value (--custom-router true), which is error-prone and breaks argparse’s “flag” semantics. Replace with action="store_true" (and drop the default):

-    parser.add_argument(
-        "--custom-router",
-        type=bool,
-        default=False,
-        help="Whether to use custom router or not",
-    )
+    parser.add_argument(
+        "--custom-router",
+        action="store_true",
+        help="Enable custom router",
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"--custom-router",
type=bool,
default=False,
help="Whether to use custom router or not",
)
parser.add_argument(
"--router",
type=str,
parser.add_argument(
"--custom-router",
action="store_true",
help="Enable custom router",
)
parser.add_argument(
"--router",
type=str,
🤖 Prompt for AI Agents
In examples/sglang/components/kv_router.py around lines 90 to 97, the boolean
CLI flag for --custom-router incorrectly uses type=bool, requiring an explicit
value and breaking flag semantics. Change the argument to use
action="store_true" instead and remove the default value to make it a proper
flag that sets the value to True when present.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
examples/sglang/components/kv_router.py (1)

92-96: Boolean flag should use action="store_true"
Same issue flagged previously; using type=bool forces the caller to type --custom-router true. Switch to the canonical argparse style:

-    parser.add_argument(
-        "--custom-router",
-        type=bool,
-        default=False,
-        help="Whether to use custom router or not",
-    )
+    parser.add_argument(
+        "--custom-router",
+        action="store_true",
+        help="Enable custom router",
+    )
🧹 Nitpick comments (4)
examples/sglang/utils/sglang.py (1)

16-25: Minor: initialise logging once

logging.getLogger(__name__) is fine, but without basicConfig nothing is emitted when this module is imported directly (common in unit tests). Consider adding a one-liner:

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")

or require the embedding application to configure logging explicitly.

🧰 Tools
🪛 Pylint (3.3.7)

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

examples/sglang/utils/protocol.py (1)

22-42: Tokens model name is plural but field is singular

For consistency with other request/response models consider renaming either the class or the field:

-class Tokens(BaseModel):
-    tokens: list[int]
+class Tokens(BaseModel):
+    values: list[int]

(or rename class to TokenList). This avoids .tokens.tokens-style attribute chains in downstream code.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 22-22: Too few public methods (0/2)

(R0903)


[refactor] 27-27: Too few public methods (0/2)

(R0903)


[refactor] 31-31: Too few public methods (0/2)

(R0903)


[refactor] 35-35: Too few public methods (0/2)

(R0903)


[refactor] 39-39: Too few public methods (0/2)

(R0903)

examples/sglang/components/kv_router.py (2)

35-35: Left-over debug comment

# faradawn hello appears to be an accidental artefact – remove to keep the codebase clean.


171-232: Method exceeds reasonable complexity (18 local vars)

Splitting _cost_function into helpers (_translate_scores, _normalise_metrics, _combine_logits) will improve readability and ease unit-testing.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 171-171: Too many local variables (18/15)

(R0914)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 077ead1 and e5b2cdc.

📒 Files selected for processing (10)
  • container/Dockerfile.sglang (1 hunks)
  • examples/sglang/README.md (1 hunks)
  • examples/sglang/components/frontend.py (3 hunks)
  • examples/sglang/components/kv_router.py (1 hunks)
  • examples/sglang/components/worker.py (3 hunks)
  • examples/sglang/configs/agg_router.yaml (1 hunks)
  • examples/sglang/graphs/agg_router.py (1 hunks)
  • examples/sglang/utils/protocol.py (1 hunks)
  • examples/sglang/utils/sglang.py (2 hunks)
  • lib/bindings/python/README.md (1 hunks)
✅ Files skipped from review due to trivial changes (5)
  • examples/sglang/README.md
  • lib/bindings/python/README.md
  • examples/sglang/graphs/agg_router.py
  • container/Dockerfile.sglang
  • examples/sglang/configs/agg_router.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/sglang/components/frontend.py
  • examples/sglang/components/worker.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/sglang/utils/sglang.py (1)
deploy/sdk/src/dynamo/sdk/cli/utils.py (1)
  • reserve_free_port (63-102)
🪛 Pylint (3.3.7)
examples/sglang/components/kv_router.py

[refactor] 57-57: Too few public methods (0/2)

(R0903)


[refactor] 171-171: Too many local variables (18/15)

(R0914)

examples/sglang/utils/protocol.py

[refactor] 22-22: Too few public methods (0/2)

(R0903)


[refactor] 27-27: Too few public methods (0/2)

(R0903)


[refactor] 31-31: Too few public methods (0/2)

(R0903)


[refactor] 35-35: Too few public methods (0/2)

(R0903)


[refactor] 39-39: Too few public methods (0/2)

(R0903)

examples/sglang/utils/sglang.py

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.
examples/sglang/components/kv_router.py

[error] 1-1: isort failed and modified this file to fix import sorting issues.


[error] 1-1: black formatting check failed and reformatted this file.

examples/sglang/utils/protocol.py

[error] 1-1: black formatting check failed and reformatted this file.

examples/sglang/utils/sglang.py

[error] 1-1: black formatting check failed and reformatted this file.

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (2)
examples/sglang/utils/protocol.py (1)

1-84: Black re-formatted this file – please commit the result

Pre-merge pipeline indicates Black auto-formatted the file. Commit those changes (or run make fmt) to unblock CI.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 22-22: Too few public methods (0/2)

(R0903)


[refactor] 27-27: Too few public methods (0/2)

(R0903)


[refactor] 31-31: Too few public methods (0/2)

(R0903)


[refactor] 35-35: Too few public methods (0/2)

(R0903)


[refactor] 39-39: Too few public methods (0/2)

(R0903)


[refactor] 45-45: Too few public methods (0/2)

(R0903)


[refactor] 53-53: Too few public methods (0/2)

(R0903)


[refactor] 68-68: Too few public methods (0/2)

(R0903)


[refactor] 77-77: Too few public methods (0/2)

(R0903)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.

[error] 1-1: black formatting check failed and reformatted this file.

examples/sglang/components/kv_router.py (1)

1-314: Formatting & import-order failures block CI

CI ran isort and black and rewrote this file. Please commit the formatted version (or run:

isort examples/sglang/components/kv_router.py
black examples/sglang/components/kv_router.py

).

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 57-57: Too few public methods (0/2)

(R0903)


[refactor] 171-171: Too many local variables (18/15)

(R0914)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.

[error] 1-1: isort failed and modified this file to fix import sorting issues.


[error] 1-1: black formatting check failed and reformatted this file.

import logging
from argparse import Namespace

from sglang.srt.server_args import ServerArgs
Contributor

⚠️ Potential issue

Broken import – sglang.srt.server_args does not exist

Static-analysis already flags this; in the published sglang wheel the object lives at sglang.server_args. Loading this file will therefore crash on worker start-up.

-from sglang.srt.server_args import ServerArgs
+from sglang.server_args import ServerArgs
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from sglang.srt.server_args import ServerArgs
-from sglang.srt.server_args import ServerArgs
+from sglang.server_args import ServerArgs
🧰 Tools
🪛 Pylint (3.3.7)

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

🤖 Prompt for AI Agents
In examples/sglang/utils/sglang.py at line 20, the import statement incorrectly
references sglang.srt.server_args, which does not exist. Update the import to
use sglang.server_args instead to match the actual module location and prevent
import errors on worker start-up.

Comment on lines 45 to 47
from components.worker import SGLangWorker
from utils.protocol import Tokens

Contributor

⚠️ Potential issue

Import path may break outside examples package

from components.worker import SGLangWorker relies on the current working directory being examples/sglang. Use an absolute import rooted at the project package (e.g. examples.sglang.components.worker) or add __init__.py files to turn folders into packages.

🤖 Prompt for AI Agents
In examples/sglang/components/kv_router.py around lines 45 to 47, the import
statement uses a relative path that depends on the current working directory,
which can break outside the examples package. Change the import to use an
absolute import rooted at the project package, such as importing SGLangWorker
from examples.sglang.components.worker. Alternatively, ensure all relevant
directories have __init__.py files to make them proper packages and support
relative imports.

Comment on lines 171 to 232
def _cost_function(
self,
scores: OverlapScores | None,
metrics: AggregatedMetrics | None,
token_length: int,
) -> Tuple[WorkerId, float]:
"""Compute the best worker given scores/metrics.
Returns a tuple of ``(worker_id, estimated_prefix_hit_rate)``.
"""
logger.info("=== Cost function called")

worker_scores: dict[WorkerId, float] = {}
if scores:
for worker_id, score in scores.scores.items():
# score is in *blocks*; convert to *tokens* hit-rate
worker_scores[worker_id] = (
score * self.indexer.block_size() / token_length
)
else:
logger.warning("Cannot get KV scores")

worker_metrics: dict[WorkerId, dict[str, float]] = {}
max_waiting = 0.0
if metrics:
for endpoint in metrics.endpoints:
worker_id = endpoint.worker_id
worker_metrics[worker_id] = {
key: getattr(endpoint, key, self.default_metrics[key])
for key in self.default_metrics.keys()
}
max_waiting = max(
max_waiting, worker_metrics[worker_id]["num_requests_waiting"]
)
else:
logger.warning("Cannot get metrics")

# Consider *all* workers, even if metrics/scores missing
worker_ids = self.workers_client.instance_ids()

worker_logits: dict[WorkerId, float] = {}
for worker_id in worker_ids:
score = worker_scores.get(worker_id, 0.0)
metrics_dict = worker_metrics.get(worker_id, self.default_metrics)
gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
normalized_waiting = (
metrics_dict["num_requests_waiting"] / max_waiting
if max_waiting > 0 else 0.0
)

# Simple linear combination (higher is better)
worker_logits[worker_id] = 2 * score - gpu_cache_usage - normalized_waiting
logger.info(
"Formula for %s: %.3f = 2.0 * %.3f - %.3f - %.3f",
worker_id,
worker_logits[worker_id],
score,
gpu_cache_usage,
normalized_waiting,
)

if not worker_logits or not any(worker_logits.values()):
Contributor

🛠️ Refactor suggestion

Cost function mixes incomparable scales – results may be biased

gpu_cache_usage_perc appears to be a value in [0,100] while score is a fraction 0-1 (after block → token normalisation). Subtracting the raw percentage heavily outweighs the 2×cache-hit bonus, effectively turning the formula into “pick whoever uses 0 % GPU cache”.

Either:

  1. Convert gpu_cache_usage_perc to [0,1] by dividing by 100, or
  2. Rescale all terms/weights based on empirical ranges (e.g. weight hit-rate by 100 as well).

Example fix:

-            gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
+            gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"] / 100.0

and revisit the coefficients.
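With concrete numbers (assuming the percentage really is on a 0-100 scale): a worker with a token hit-rate of 0.5 and gpu_cache_usage_perc = 50 gets a logit of 2 × 0.5 − 50 = −49 under the current formula, so the percentage term dwarfs the cache-hit bonus; after dividing by 100 the same inputs give 2 × 0.5 − 0.5 = 0.5 (waiting term omitted for simplicity).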

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _cost_function(
self,
scores: OverlapScores | None,
metrics: AggregatedMetrics | None,
token_length: int,
) -> Tuple[WorkerId, float]:
"""Compute the best worker given scores/metrics.
Returns a tuple of ``(worker_id, estimated_prefix_hit_rate)``.
"""
logger.info("=== Cost function called")
worker_scores: dict[WorkerId, float] = {}
if scores:
for worker_id, score in scores.scores.items():
# score is in *blocks*; convert to *tokens* hit-rate
worker_scores[worker_id] = (
score * self.indexer.block_size() / token_length
)
else:
logger.warning("Cannot get KV scores")
worker_metrics: dict[WorkerId, dict[str, float]] = {}
max_waiting = 0.0
if metrics:
for endpoint in metrics.endpoints:
worker_id = endpoint.worker_id
worker_metrics[worker_id] = {
key: getattr(endpoint, key, self.default_metrics[key])
for key in self.default_metrics.keys()
}
max_waiting = max(
max_waiting, worker_metrics[worker_id]["num_requests_waiting"]
)
else:
logger.warning("Cannot get metrics")
# Consider *all* workers, even if metrics/scores missing
worker_ids = self.workers_client.instance_ids()
worker_logits: dict[WorkerId, float] = {}
for worker_id in worker_ids:
score = worker_scores.get(worker_id, 0.0)
metrics_dict = worker_metrics.get(worker_id, self.default_metrics)
gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
normalized_waiting = (
metrics_dict["num_requests_waiting"] / max_waiting
if max_waiting > 0 else 0.0
)
# Simple linear combination (higher is better)
worker_logits[worker_id] = 2 * score - gpu_cache_usage - normalized_waiting
logger.info(
"Formula for %s: %.3f = 2.0 * %.3f - %.3f - %.3f",
worker_id,
worker_logits[worker_id],
score,
gpu_cache_usage,
normalized_waiting,
)
if not worker_logits or not any(worker_logits.values()):
for worker_id in worker_ids:
score = worker_scores.get(worker_id, 0.0)
metrics_dict = worker_metrics.get(worker_id, self.default_metrics)
- gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
+ gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"] / 100.0
normalized_waiting = (
metrics_dict["num_requests_waiting"] / max_waiting
if max_waiting > 0 else 0.0
)
# Simple linear combination (higher is better)
worker_logits[worker_id] = 2 * score - gpu_cache_usage - normalized_waiting
logger.info(
"Formula for %s: %.3f = 2.0 * %.3f - %.3f - %.3f",
worker_id,
worker_logits[worker_id],
score,
gpu_cache_usage,
normalized_waiting,
)
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 171-171: Too many local variables (18/15)

(R0914)

🤖 Prompt for AI Agents
In examples/sglang/components/kv_router.py between lines 171 and 232, the cost
function subtracts gpu_cache_usage_perc directly, which is on a 0-100 scale,
from score values on a 0-1 scale, causing bias. Fix this by normalizing
gpu_cache_usage_perc to a 0-1 scale by dividing it by 100 before using it in the
formula. After this, revisit and adjust the coefficients in the linear
combination to ensure balanced weighting of all terms.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (5)
examples/sglang/components/worker.py (1)

99-101: page_size/block_size may arrive as a string

argparse leaves CLI values as str when they come from the service config.
A manual int() cast guards against accidental string propagation to Rust bindings:

-            kv_block_size = getattr(
-                self.engine_args, "page_size", getattr(self.engine_args, "block_size", 64)
-            )
+            kv_block_size = int(
+                getattr(
+                    self.engine_args,
+                    "page_size", 
+                    getattr(self.engine_args, "block_size", 64),
+                )
+            )
examples/sglang/components/kv_router.py (4)

37-56: Fix import formatting to resolve pipeline failure.

The pipeline indicates that import sorting failed. Please run isort to fix the import order:

isort examples/sglang/components/kv_router.py
black examples/sglang/components/kv_router.py

45-47: Import path may break outside examples package

from components.worker import SGLangWorker relies on the current working directory being examples/sglang. Use an absolute import rooted at the project package (e.g. examples.sglang.components.worker) or add __init__.py files to turn folders into packages.
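
For instance, once __init__.py files are in place, the import could read (assuming the repository root is on PYTHONPATH; the path shown is illustrative):

from examples.sglang.components.worker import SGLangWorker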


92-96: Boolean CLI flag should use action="store_true"

type=bool expects the user to pass an explicit value (--custom-router true), which is error-prone and breaks argparse's "flag" semantics. Replace with action="store_true":

-    parser.add_argument(
-        "--custom-router",
-        type=bool,
-        default=False,
-        help="Whether to use custom router or not",
-    )
+    parser.add_argument(
+        "--custom-router",
+        action="store_true",
+        help="Enable custom router",
+    )
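
A standalone argparse sketch of why type=bool misbehaves (flag names here are illustrative):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--custom-router-bool", type=bool, default=False)
parser.add_argument("--custom-router", action="store_true")

# bool("false") is True, so any explicit value enables the type=bool flag:
print(parser.parse_args(["--custom-router-bool", "false"]).custom_router_bool)  # True
# store_true flips the flag only when it is present on the command line:
print(parser.parse_args([]).custom_router)                                      # False
print(parser.parse_args(["--custom-router"]).custom_router)                     # True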

169-230: Fix code formatting and consider refactoring complex method.

The pipeline indicates that Black formatting failed. Please run black to fix code style issues throughout the file.

Additionally, the _cost_function method has too many local variables (18/15 per Pylint), making it difficult to read and maintain.

black examples/sglang/components/kv_router.py

Consider refactoring _cost_function by extracting helper methods (a sketch follows this list):

  • _process_overlap_scores() to handle score conversion
  • _process_worker_metrics() to handle metrics aggregation
  • _compute_worker_logits() to handle the scoring logic
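
A hypothetical decomposition along those lines (helper signatures and bodies are assumptions, not existing code):

def _cost_function(
    self,
    scores: OverlapScores | None,
    metrics: AggregatedMetrics | None,
    token_length: int,
) -> Tuple[WorkerId, float]:
    worker_scores = self._process_overlap_scores(scores, token_length)
    worker_metrics, max_waiting = self._process_worker_metrics(metrics)
    # _compute_worker_logits would fetch self.workers_client.instance_ids() internally
    worker_logits = self._compute_worker_logits(worker_scores, worker_metrics, max_waiting)
    if not worker_logits:
        raise RuntimeError("No workers available")
    best_worker = max(worker_logits, key=worker_logits.get)
    return best_worker, worker_scores.get(best_worker, 0.0)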

Cost function mixes incomparable scales – results may be biased

gpu_cache_usage_perc appears to be a value in [0,100] while score is a fraction 0-1 (after block → token normalisation). Subtracting the raw percentage heavily outweighs the 2×cache-hit bonus, effectively turning the formula into "pick whoever uses 0 % GPU cache".

Either:

  1. Convert gpu_cache_usage_perc to [0,1] by dividing by 100, or
  2. Rescale all terms/weights based on empirical ranges

Example fix:

-            gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
+            gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"] / 100.0
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 169-169: Too many local variables (18/15)

(R0914)

🧹 Nitpick comments (2)
examples/sglang/components/frontend.py (1)

49-57: Simplify by removing unnecessary else after return.

 def get_http_binary_path():
     """Find the HTTP binary path in SDK or fallback to 'http' command."""
     sdk_path = Path(sdk.__file__)
     binary_path = sdk_path.parent / "cli/bin/http"
     if not binary_path.exists():
         return "http"
-    else:
-        return str(binary_path)
+    return str(binary_path)
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 53-56: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

examples/sglang/components/kv_router.py (1)

35-35: Remove unnecessary comment.

-# faradawn hello
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5b2cdc and aac3b8f.

📒 Files selected for processing (6)
  • examples/sglang/components/frontend.py (5 hunks)
  • examples/sglang/components/kv_router.py (1 hunks)
  • examples/sglang/components/worker.py (5 hunks)
  • examples/sglang/configs/agg_router.yaml (1 hunks)
  • launch/dynamo-run/src/subprocess/sglang_inc.py (1 hunks)
  • lib/bindings/python/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • lib/bindings/python/README.md
  • launch/dynamo-run/src/subprocess/sglang_inc.py
  • examples/sglang/configs/agg_router.yaml
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/sglang/components/frontend.py (4)
examples/sglang/components/worker.py (2)
  • SGLangWorker (56-218)
  • generate (154-197)
examples/sglang/components/kv_router.py (2)
  • Router (118-313)
  • generate (279-313)
examples/sglang/utils/protocol.py (4)
  • Tokens (27-28)
  • PreprocessedRequest (68-74)
  • StopConditions (45-50)
  • SamplingOptions (53-65)
deploy/sdk/src/dynamo/sdk/lib/decorators.py (3)
  • async_on_start (95-99)
  • endpoint (68-92)
  • on_shutdown (102-106)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.
examples/sglang/components/worker.py

[error] 1-1: isort formatting failed and modified the file. Please run isort to fix import sorting.


[error] 1-1: Black formatting failed and reformatted the file. Please run 'black' to fix code style issues.

examples/sglang/components/frontend.py

[error] 1-1: isort formatting failed and modified the file. Please run isort to fix import sorting.


[error] 1-1: Black formatting failed and reformatted the file. Please run 'black' to fix code style issues.


[error] 19-37: Ruff linting errors found and fixed in this file. Please run 'ruff --fix' to apply fixes.

examples/sglang/components/kv_router.py

[error] 1-1: isort formatting failed and modified the file. Please run isort to fix import sorting.


[error] 1-1: Black formatting failed and reformatted the file. Please run 'black' to fix code style issues.

🪛 Ruff (0.11.9)
examples/sglang/components/frontend.py

25-25: utils.protocol.Tokens imported but unused

Remove unused import

(F401)


25-25: utils.protocol.StopConditions imported but unused

Remove unused import

(F401)


25-25: utils.protocol.SamplingOptions imported but unused

Remove unused import

(F401)


26-26: transformers.AutoTokenizer imported but unused

Remove unused import: transformers.AutoTokenizer

(F401)


27-27: json imported but unused

Remove unused import: json

(F401)


30-30: dynamo.sdk.async_on_start imported but unused

Remove unused import

(F401)


30-30: dynamo.sdk.dynamo_context imported but unused

Remove unused import

(F401)

🪛 Pylint (3.3.7)
examples/sglang/components/frontend.py

[refactor] 29-29: Use 'from dynamo import sdk' instead

(R0402)


[refactor] 53-56: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

examples/sglang/components/kv_router.py

[refactor] 57-57: Too few public methods (0/2)

(R0903)


[refactor] 169-169: Too many local variables (18/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)

Comment on lines 38 to 43
from dynamo.llm import (
    ModelType,
    register_llm,
    ZmqKvEventPublisher,
    ZmqKvEventPublisherConfig,
)
Copy link
Contributor


⚠️ Potential issue

Fix import formatting issues.

The pipeline indicates that isort and Black formatting failed. Please run these commands to fix the formatting:

isort examples/sglang/components/worker.py
black examples/sglang/components/worker.py
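
For reference, isort/Black would produce roughly the following for this block (the exact ordering depends on the project's isort configuration):

from dynamo.llm import (
    ModelType,
    ZmqKvEventPublisher,
    ZmqKvEventPublisherConfig,
    register_llm,
)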
🤖 Prompt for AI Agents
In examples/sglang/components/worker.py around lines 38 to 43, the import
statements have formatting issues flagged by isort and Black. To fix this, run
the commands "isort examples/sglang/components/worker.py" followed by "black
examples/sglang/components/worker.py" to automatically sort and format the
imports according to the project's style guidelines.

@faradawn
Copy link
Contributor Author

faradawn commented Jun 16, 2025

Progress update:

Router Workflow

dynamo-run in=http out=worker.generate router=kv -> worker.

Problem: SGLang needs a "PreprocessedRequest" (token IDs, etc.), but dynamo-run did not seem to perform that translation.

Solution: dynamo-run is able to preprocess the request; the fix was simply to reset etcd so it routes to the correct endpoint.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Jun 21, 2025
@faradawn
Copy link
Contributor Author

Dynamo router is working

cd /workspace/examples/sglang              
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
2025-06-21T03:08:46.103Z  INFO utils.resolve_service_config: Running dynamo serve with config: {'Frontend': {'served_model_name': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', 'endpoint': 'dynamo.SGLangWorker.generate', 'port': 8000}, 'SGLangWorker': {'model-path': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', 'served-model-name': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', 'tp': 1, 'page-size': 16, 'trust-remote-code': True, 'skip-tokenizer-init': True, 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}}}   
╭──────────────── Dynamo Serve ────────────────╮
│ Starting Dynamo service: graphs.agg:Frontend │
╰──────────────────────────────────────────────╯
2025-06-21T03:08:57.406Z  INFO resource._discover_gpus: Discovered 1 GPUs   
2025-06-21T03:08:57.406Z  INFO resource._discover_gpus: Discovered 1 GPUs   
2025-06-21T03:08:57.406Z  INFO allocator.get_resource_envs: Getting resource envs for service Frontend   
2025-06-21T03:08:57.407Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-21T03:08:57.407Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: []   
2025-06-21T03:08:57.407Z  INFO serving.serve_dynamo_graph: Serving dynamo graph with namespace dynamo   
2025-06-21T03:08:57.407Z  INFO serving.serve_dynamo_graph: Clearing namespace dynamo before serving   
2025-06-21T03:08:57.407Z  INFO allocator.get_resource_envs: Getting resource envs for service SGLangWorker   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: GPU requirement found: 1   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: GPU allocation enabled   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'SGLangWorker'   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: GPU 0 (NVIDIA L4): Memory: 21.9GB free / 22.5GB total, Utilization: 0%    
2025-06-21T03:08:57.409Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '0'}]   
2025-06-21T03:08:57.409Z  INFO serving.create_dynamo_watcher: Created watcher for SGLangWorker's in the dynamo namespace   
2025-06-21T03:08:57.415Z  INFO serving.serve_dynamo_graph: Created watcher for Frontend with 1 workers in the dynamo namespace   
2025-06-21T03:08:57.424Z  INFO arbiter._ensure_ioloop: Installing handle_callback_exception to loop   
2025-06-21T03:08:57.424Z  INFO sighandler.__init__: Registering signals...   
2025-06-21T03:08:57.425Z  INFO arbiter.start: Starting master on pid 17661   
2025-06-21T03:08:57.426Z  INFO arbiter.initialize: sockets started   
2025-06-21T03:08:57.472Z  INFO arbiter.start: Arbiter now waiting for commands   
2025-06-21T03:08:57.472Z  INFO watcher._start: dynamo_SGLangWorker started   
2025-06-21T03:08:57.509Z  INFO watcher._start: dynamo_Frontend started   
2025-06-21T03:08:57.510Z  INFO serving.<lambda>: Starting graphs.agg:Frontend (Press CTRL+C to quit)   
2025-06-21T03:09:05.506Z  INFO serve_dynamo.dyn_worker: [SGLangWorker:1] Registering component dynamo/SGLangWorker   
2025-06-21T03:09:05.513Z  INFO serve_dynamo.dyn_worker: [SGLangWorker:1] Created SGLangWorker component   
2025-06-21T03:09:05.513Z  INFO config.as_args: [SGLangWorker:1] Running SGLangWorker with args=['--model-path', 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', '--served-model-name', 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', '--tp', '1', '--page-size', '16', '--trust-remote-code', '--skip-tokenizer-init']   
2025-06-21T03:09:05.544Z  INFO server_args.__post_init__: [SGLangWorker:1] EPLB is enabled or init_expert_location is provided. ep_dispatch_algorithm is configured.   
2025-06-21T03:09:05.635Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Registering component dynamo/Frontend   
2025-06-21T03:09:05.641Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Created Frontend component   
2025-06-21T03:09:05.642Z  INFO frontend.start_ingress_and_processor: [Frontend:1] Starting HTTP server and processor on port 8000   
2025-06-21T03:09:05.643Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Starting Frontend instance with all registered endpoints   
2025-06-21T03:09:05.643Z  INFO serve_dynamo.dyn_worker: [Frontend:1] No Dynamo endpoints found in service Frontend but keeping service alive   
2025-06-21T03:09:05.666Z  WARN dynamo_run::opt: out=dyn://<path> is deprecated, the path is not used. Please use 'out=dyn'
2025-06-21T03:09:05.678Z  INFO dynamo_run::input::http: Watching for remote model at models
2025-06-21T03:09:05.680Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
2025-06-21T03:09:07.700Z  INFO engine.__init__: [SGLangWorker:1] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-R1-Distill-Llama-8B', tokenizer_path='deepseek-ai/DeepSeek-R1-Distill-Llama-8B', tokenizer_mode='auto', skip_tokenizer_init=True, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='deepseek-ai/DeepSeek-R1-Distill-Llama-8B', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, impl='auto', host='127.0.0.1', port=30000, mem_fraction_static=0.871, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=16, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=42943220, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=8, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, 
allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=43657, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None)   
2025-06-21T03:09:18.381Z  INFO model_runner.model_specific_adjustment: Attention backend not set. Use flashinfer backend by default.   
2025-06-21T03:09:18.384Z  INFO model_runner.init_torch_distributed: Init torch distributed begin.   
2025-06-21T03:09:18.633Z  INFO model_runner.init_torch_distributed: Init torch distributed ends. mem usage=0.00 GB   
2025-06-21T03:09:19.682Z  INFO model_runner.load_model: Load weight begin. avail mem=21.73 GB   
2025-06-21T03:09:20.120Z  INFO weight_utils.download_weights_from_hf: Using model weights format ['*.safetensors']   
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:42<00:42, 42.10s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 38.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 39.14s/it]

2025-06-21T03:10:38.846Z  INFO model_runner.load_model: Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=6.71 GB, mem usage=15.02 GB.   
2025-06-21T03:10:38.983Z  INFO memory_pool.__init__: KV Cache is allocated. #tokens: 31984, K size: 1.95 GB, V size: 1.95 GB   
2025-06-21T03:10:38.992Z  INFO model_runner.init_memory_pool: Memory pool end. avail mem=1.69 GB   
2025-06-21T03:10:39.481Z  INFO model_runner.init_cuda_graphs: Capture cuda graph begin. This can take up to several minutes. avail mem=1.18 GB   
2025-06-21T03:10:39.573Z  INFO utils.rank0_log: Capture cuda graph bs [1, 2, 4, 8]   
Capturing batches (avail_mem=1.10 GB): 100%|██████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.03s/it]
2025-06-21T03:10:43.700Z  INFO model_runner.init_cuda_graphs: Capture cuda graph end. Time elapsed: 4.22 s. mem usage=0.11 GB. avail mem=1.07 GB.   
2025-06-21T03:10:43.720Z  INFO scheduler.__init__: max_total_num_tokens=31984, chunked_prefill_size=2048, max_prefill_tokens=16384, max_running_requests=2048, context_len=131072, available_gpu_mem=1.07 GB   
2025-06-21T03:10:43.741Z  INFO worker.__init__: [SGLangWorker:1] === new file SGLangWorker initialized with kv publisher   
2025-06-21T03:10:43.741Z  INFO worker.async_init: [SGLangWorker:1] === worker async_initRegistering LLM for discovery   
2025-06-21T03:10:43.756Z  INFO worker.async_init: [SGLangWorker:1] === new file register llm with kv block size 16, endpoint=<builtins.Endpoint object at 0x7e9d0ac20350>, model_path=deepseek-ai/DeepSeek-R1-Distill-Llama-8B, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Llama-8B   
2025-06-21T03:10:44.583Z  INFO worker.async_init: [SGLangWorker:1] === worker publishing initial metrics   
2025-06-21T03:10:44.585Z  INFO worker.async_init: [SGLangWorker:1] === worker donepublishing initial metrics   
2025-06-21T03:10:44.586Z  INFO worker.create_metrics_publisher_endpoint: [SGLangWorker:1] === me Creating metrics publisher endpoint with primary lease   
2025-06-21T03:10:44.621Z  INFO _core::llm::kv: Starting ZmqKvEventPublisher worker_id=7587887592773332273 kv_block_size=16 zmq_endpoint=tcp://127.0.0.1:5557 zmq_topic=
2025-06-21T03:10:44.622Z  INFO worker.async_init: [SGLangWorker:1] === worker ZMQ publisher created   
2025-06-21T03:10:44.622Z  INFO serve_dynamo.dyn_worker: [SGLangWorker:1] Starting SGLangWorker instance with all registered endpoints   
2025-06-21T03:10:45.753Z  INFO dynamo_llm::kv_router: KV Routing initialized
2025-06-21T03:10:46.601Z  INFO dynamo_llm::discovery::watcher: added model model_name="deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
2025-06-21T03:16:34.719Z  INFO worker.generate: [SGLangWorker:1] === worker generate request=token_ids=[128000, 128011, 12840, 374, 832, 5636, 832, 128012, 128013, 198] stop_conditions=StopConditions(max_tokens=30, stop=None, stop_token_ids_hidden=[128001], min_tokens=None, ignore_eos=None) sampling_options=SamplingOptions(n=None, best_of=None, presence_penalty=None, frequency_penalty=None, repetition_penalty=None, temperature=None, top_p=None, top_k=None, min_p=None, use_beam_search=None, length_penalty=None, seed=None) eos_token_ids=[128001] mdc_sum='70269cc374de5521ce82473ed8ed88e768a1e66d9215e7f295905db4e97d5c23' annotations=[]   
2025-06-21T03:16:34.719Z  INFO engine.async_generate: data_parallel_rank: None   
2025-06-21T03:16:34.721Z  INFO scheduler.log_prefill_stats: Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0   
2025-06-21T03:20:14.346Z  INFO worker.generate: [SGLangWorker:1] === worker generate request=token_ids=[128000, 128011, 12840, 374, 832, 5636, 1403, 128012, 128013, 198] stop_conditions=StopConditions(max_tokens=30, stop=None, stop_token_ids_hidden=[128001], min_tokens=None, ignore_eos=None) sampling_options=SamplingOptions(n=None, best_of=None, presence_penalty=None, frequency_penalty=None, repetition_penalty=None, temperature=None, top_p=None, top_k=None, min_p=None, use_beam_search=None, length_penalty=None, seed=None) eos_token_ids=[128001] mdc_sum='70269cc374de5521ce82473ed8ed88e768a1e66d9215e7f295905db4e97d5c23' annotations=[]   
2025-06-21T03:20:14.347Z  INFO engine.async_generate: data_parallel_rank: None   
2025-06-21T03:20:14.348Z  INFO scheduler.log_prefill_stats: Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0   
2025-06-21T03:20:15.012Z  INFO scheduler.log_decode_stats: Decode batch. #running-req: 1, #token: 32, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.07, #queue-req: 0   

Client side:

(venv) root@l4:/workspace/examples/sglang# curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
    {
        "role": "user",
        "content": "what is one plus one"
    }
    ],
    "stream":false,
    "max_tokens": 30
  }'
{"id":"chatcmpl-bfc6afa5-769e-4ea4-b43f-9d1cb81cf0fe","choices":[{"index":0,"message":{"content":"I need to determine the answer to the question, which is \"what is one plus one.\"\n\nFirst, I identify the numbers involved in the problem","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":null,"logprobs":null}],"created":1750475794,"model":"deepseek-ai/DeepSeek-R1-Distill-Llama-8B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":29,"total_tokens":0,"prompt_tokens_details":null,"completion_tokens_details":null}}(venv) root@l4:/workspace/examples/sglang# 
(venv) root@l4:/workspace/examples/sglang# 
(venv) root@l4:/workspace/examples/sglang# 
(venv) root@l4:/workspace/examples/sglang# 
(venv) root@l4:/workspace/examples/sglang# curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
    {
        "role": "user",
        "content": "what is one plus two"
    }
    ],
    "stream":false,
    "max_tokens": 30
  }'
{"id":"chatcmpl-3edbaddb-baa5-4d06-bc85-6449af93020b","choices":[{"index":0,"message":{"content":"First, I need to understand the user's question, which is \"What is one plus two?\"\n\nTo solve this, I'll start by identifying","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":null,"logprobs":null}],"created":1750476014,"model":"deepseek-ai/DeepSeek-R1-Distill-Llama-8B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":29,"total_tokens":0,"prompt_tokens_details":null,"completion_tokens_details":null}}(venv) root@l4:/workspace/examples/sglang# 

@ishandhanani
Copy link
Contributor

Closing as this is now a duplicate of #1605


Labels

external-contribution Pull request is from an external contributor feat size/L
