
Conversation

@faradawn
Contributor

@faradawn faradawn commented Jun 2, 2025

Overview:

When SGLang is launched without an explicit page_size, it silently falls back to 1, producing variable-sized “blocks” that Dynamo discards.

This patch guarantees the two systems stay aligned.

Nvidia Dynamo

Details:

First, in sglang_inc.py: when building ServerArgs, we now set page_size = config.kv_block_size if it is positive; otherwise we inject a safe default of 16 tokens. This prevents SGLang from defaulting to page_size=1.
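For illustration only, a minimal sketch of that guard assuming the 16-token default; the helper name resolve_page_size is hypothetical, and in the actual change the logic is inlined where ServerArgs is constructed:

def resolve_page_size(kv_block_size, default: int = 16) -> int:
    # Use the configured KV block size only when it is a positive integer;
    # otherwise fall back to the default so SGLang never silently gets page_size=1.
    if isinstance(kv_block_size, int) and kv_block_size > 0:
        return kv_block_size
    return default

# Example: resolve_page_size(0) -> 16, resolve_page_size(32) -> 32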

Second, on the SGLang side, a companion change makes _record_store_event() slice prompts into page_size chunks before emitting BlockStored events, so every published block is exactly kv_block_size tokens. Link to PR: sgl-project/sglang#6824
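Roughly, the slicing looks like the sketch below; the function name and the decision to drop a trailing partial chunk are illustrative assumptions, not taken verbatim from the SGLang PR:

def slice_into_pages(token_ids: list[int], page_size: int) -> list[list[int]]:
    # One fixed-size chunk per BlockStored event; the trailing partial chunk
    # is skipped here so every published block is exactly page_size tokens.
    full = len(token_ids) - (len(token_ids) % page_size)
    return [token_ids[i : i + page_size] for i in range(0, full, page_size)]

# Example: slice_into_pages(list(range(40)), 16) yields two 16-token pages.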

  • Eliminates dropped KV-cache events caused by size mismatches.
  • Ensures cache-aware routing works out of the box even when the user forgets --kv-block-size.
  • Keeps behaviour consistent across the vLLM, TRT-LLM, and SGLang back-ends.

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • New Features

    • Added a KV-based router for efficient request routing among SGLang workers using cache overlap and runtime metrics.
    • Introduced a processing pipeline linking frontend, router, and worker components for streamlined request handling.
    • Added chat completion endpoint to frontend service with improved logging and resource configuration.
    • Defined new structured message formats to support request and response communication for language model interactions.
    • Added configuration and example for aggregated KV routing setup.
  • Bug Fixes

    • Ensured page_size is properly set with a default value when configuration is missing or invalid, preventing unintended behavior.
  • Chores

    • Enhanced logging during worker initialization and argument parsing for better observability.
    • Updated Dockerfile to use latest commit with performance improvements and fixes.
    • Improved documentation for Python wheel build process.

@copy-pr-bot

copy-pr-bot bot commented Jun 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions bot commented Jun 2, 2025

👋 Hi faradawn! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving them.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Jun 2, 2025
@coderabbitai
Contributor

coderabbitai bot commented Jun 2, 2025

Walkthrough

The condition verifying the kv_block_size parameter in the initialization was refined to ensure it is greater than zero before assigning it to page_size; otherwise, a default of 16 is used. A new KV-based router service was added to route requests to SGLang workers using cache overlap and runtime metrics. A new script sets up a pipeline linking Frontend, Router, and SGLangWorker components. The Frontend component was updated to depend on the Router alongside the worker, with added chat completion endpoint and lifecycle hooks. The worker component was enhanced to initialize a ZeroMQ key-value event publisher bridging events to NATS. New Pydantic data models for request/response protocols were introduced. The argument parser was updated to set a default page_size of 64 with logging if not provided. The Dockerfile was updated to use a newer sglang commit with performance fixes and the KV event publisher fix. Documentation and configuration files were added or updated to support the new routing architecture.

Changes

File(s) — Change Summary

  • launch/dynamo-run/src/subprocess/sglang_inc.py — Modified init to check kv_block_size > 0 before assigning page_size; defaults to 16 if invalid or unset.
  • examples/sglang/components/kv_router.py — Added new Router service class implementing KV-based routing logic using cache overlap scores and runtime metrics.
  • examples/sglang/graphs/agg_router.py — Added new script linking Frontend → Router → SGLangWorker to form a processing pipeline.
  • examples/sglang/components/frontend.py — Changed dependency injection in Frontend to include Router; added chat completion async endpoint, resource specs, logging, and shutdown cleanup.
  • examples/sglang/components/worker.py — Added initialization of ZmqKvEventPublisher in SGLangWorker to bridge ZMQ events to NATS with error handling and logging; enhanced logging in generation method.
  • examples/sglang/utils/protocol.py — Added new Pydantic models: Request, Tokens, PrefillRequest, Response, and PrefillResponse for structured messaging.
  • examples/sglang/utils/sglang.py — Added logging and default page_size=64 assignment with info log if missing in parse_sglang_args.
  • container/Dockerfile.sglang — Updated SGLANG_COMMIT ARG to a newer commit with performance fixes and the KV event publisher fix; updated related comments.
  • examples/sglang/README.md — Added "Aggregated with KV Routing" example with command to run Dynamo server using graphs.agg_router:Frontend.
  • examples/sglang/configs/agg_router.yaml — Added new YAML config defining model, router, frontend, and worker parameters for KV routing architecture.
  • lib/bindings/python/README.md — Updated build instructions to include changing directory before running maturin develop.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Frontend
    participant Router
    participant SGLangWorker

    Client->>Frontend: Send request
    Frontend->>Router: Forward request (generate)
    Router->>SGLangWorker: Route request based on cache overlap and metrics
    SGLangWorker-->>Router: Response with worker ID and prefix hit rate
    Router-->>Frontend: Yield best worker info
    Frontend-->>Client: Return response

Possibly related PRs

Poem

A check for block size, clear and bright,
Now guards the page size day and night.
No more sneaky ones will slip through,
Sixteen's the fallback, sturdy and true.
With logic neat and errors slight,
The code now hops along just right! 🐇

New routes to workers, clever and keen,
Cache hits and metrics guide the scene.
Frontend, Router, Worker in line,
Together they work, efficient and fine! 🚀🐰



Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch (3)

43-49: Consider using SGLang-specific environment variable names.

The code uses VLLM_* environment variables in an SGLang context, which could cause confusion for users. Consider using SGLANG_* prefixed variables or making this configurable.

-        namespace = namespace or os.getenv("VLLM_KV_NAMESPACE") or "default"
-        component = component or os.getenv("VLLM_KV_COMPONENT") or "sglang"
-        worker_id = worker_id if worker_id is not None else int(
-            os.getenv("VLLM_WORKER_ID", "0"))
-        lib_path = lib_path or os.getenv("VLLM_KV_CAPI_PATH")
+        namespace = namespace or os.getenv("SGLANG_KV_NAMESPACE") or "default"
+        component = component or os.getenv("SGLANG_KV_COMPONENT") or "sglang"
+        worker_id = worker_id if worker_id is not None else int(
+            os.getenv("SGLANG_WORKER_ID", "0"))
+        lib_path = lib_path or os.getenv("SGLANG_KV_CAPI_PATH")

156-159: Consider logging ignored events for debugging.

Silently ignoring non-KVEventBatch events could mask configuration errors or unexpected usage patterns. Consider adding debug logging.

     def publish(self, events: EventBatch) -> None:  # type: ignore[override]
         # Only handle KVEventBatch; ignore others quietly
         if not isinstance(events, KVEventBatch):
+            logger.debug("Ignoring non-KVEventBatch event: %s", type(events).__name__)
             return

171-175: Document the design decision for AllBlocksCleared handling.

The comment explains why AllBlocksCleared events are ignored, but this could be a source of confusion. Consider making this behavior more explicit.

             elif isinstance(ev, AllBlocksCleared):
-                # We could broadcast each removed block, but Radix clears by
-                # resetting cache; upstream workers will clear too. For now
-                # no-op.
-                continue
+                # AllBlocksCleared events are intentionally ignored because:
+                # 1. RadixCache clears by resetting the entire cache
+                # 2. Upstream workers will clear their caches independently
+                # 3. Broadcasting individual block removals would be inefficient
+                logger.debug("Ignoring AllBlocksCleared event (intentional no-op)")
+                continue
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 859944f and 5a21312.

📒 Files selected for processing (1)
  • container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch (1)

182-182:

❓ Verification inconclusive

Verify registration key consistency across the codebase.

Ensure that the registration key "dynamo" is consistently used and documented throughout the codebase for proper event publisher discovery.


🏁 Script executed:

#!/bin/bash
# Description: Search for references to "dynamo" event publisher key to ensure consistency

# Search for any configuration or documentation mentioning "dynamo" publisher
rg -i "dynamo.*publisher|publisher.*dynamo" --type py

# Search for event publisher factory usage to understand expected keys
rg "EventPublisherFactory" --type py -A 3 -B 3

Length of output: 1322


🏁 Script executed:

#!/usr/bin/env bash
# Locate the patch file
patch_file=$(fd sglang_v0.4.6-dynamo-kv-disagg-patch.patch -t f)
echo "Found patch file: $patch_file"
# Show context around the inserted publisher registration (lines 160–220)
sed -n '160,220p' "$patch_file"

Length of output: 1348


🏁 Script executed:

#!/usr/bin/env bash
# Find definition of EventPublisherFactory
rg -n "class EventPublisherFactory" --type py

# Find any usage of EventPublisherFactory.get or lookup in registry
rg -n "EventPublisherFactory" --type py -A3 -B3

Length of output: 95


Confirm new publisher registration in sglang’s EventPublisherFactory

The patch adds the following at the end of container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch:

+# Register new publisher
+EventPublisherFactory._registry["dynamo"] = DynamoEventPublisher

Please manually verify that:

  • The EventPublisherFactory class in the sglang codebase defines an internal _registry and uses it for publisher lookup.
  • The key "dynamo" exactly matches the lookup name used in methods like get or create.
  • Any related documentation (README, code comments, configs) includes and explains the "dynamo" registration key.

Comment on lines 48 to 51
+        if lib_path is None:
+            raise RuntimeError("VLLM_KV_CAPI_PATH environment variable not set")
+
+        self._lib = CDLL(lib_path)
Contributor

⚠️ Potential issue

Add security validation for library path and improve error handling.

Loading arbitrary C libraries from environment variables poses security risks. Additionally, the CDLL loading should have proper error handling.

         if lib_path is None:
             raise RuntimeError("VLLM_KV_CAPI_PATH environment variable not set")

+        # Validate library path exists and is readable
+        if not os.path.isfile(lib_path):
+            raise RuntimeError(f"C-API library not found: {lib_path}")
+        if not os.access(lib_path, os.R_OK):
+            raise RuntimeError(f"C-API library not readable: {lib_path}")
+
-        self._lib = CDLL(lib_path)
+        try:
+            self._lib = CDLL(lib_path)
+        except OSError as e:
+            raise RuntimeError(f"Failed to load C-API library {lib_path}: {e}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+ if lib_path is None:
+ raise RuntimeError("VLLM_KV_CAPI_PATH environment variable not set")
+
+ self._lib = CDLL(lib_path)
if lib_path is None:
raise RuntimeError("VLLM_KV_CAPI_PATH environment variable not set")
# Validate library path exists and is readable
if not os.path.isfile(lib_path):
raise RuntimeError(f"C-API library not found: {lib_path}")
if not os.access(lib_path, os.R_OK):
raise RuntimeError(f"C-API library not readable: {lib_path}")
try:
self._lib = CDLL(lib_path)
except OSError as e:
raise RuntimeError(f"Failed to load C-API library {lib_path}: {e}")
🤖 Prompt for AI Agents
In container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch around lines
48 to 51, add validation to ensure the library path from the environment
variable is safe and expected before loading it with CDLL. Also, wrap the CDLL
loading in a try-except block to catch and handle any exceptions, raising a
clear error if loading fails. This prevents security risks from arbitrary paths
and improves error handling.

Comment on lines 107 to 111
+        num_blocks = 1  # SGLang RadixCache emits one block at a time
+
+        token_arr = (c_uint32 * len(token_ids))(*token_ids)
+        num_block_tokens_arr = (c_size_t * 1)(len(token_ids))
+        block_hash_arr = (c_uint64 * 1)(block_hashes[0])
Contributor

⚠️ Potential issue

Fix potential index error and validate assumptions.

The code assumes block_hashes has at least one element without validation, and hardcodes num_blocks = 1 based on a comment about SGLang RadixCache behavior.

     def publish_stored(
         self,
         token_ids: List[int],
         block_hashes: List[int],
         parent_hash: Optional[int],
         block_size: int,
         lora_id: Optional[int] = None,
     ) -> None:
-        num_blocks = 1  # SGLang RadixCache emits one block at a time
+        if not block_hashes:
+            raise ValueError("block_hashes cannot be empty")
+        
+        num_blocks = 1  # SGLang RadixCache emits one block at a time
+        if len(block_hashes) != 1:
+            logger.warning("Expected exactly 1 block hash, got %d. Using first hash only.", len(block_hashes))

         token_arr = (c_uint32 * len(token_ids))(*token_ids)
         num_block_tokens_arr = (c_size_t * 1)(len(token_ids))
         block_hash_arr = (c_uint64 * 1)(block_hashes[0])
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+ num_blocks = 1 # SGLang RadixCache emits one block at a time
+
+ token_arr = (c_uint32 * len(token_ids))(*token_ids)
+ num_block_tokens_arr = (c_size_t * 1)(len(token_ids))
+ block_hash_arr = (c_uint64 * 1)(block_hashes[0])
def publish_stored(
self,
token_ids: List[int],
block_hashes: List[int],
parent_hash: Optional[int],
block_size: int,
lora_id: Optional[int] = None,
) -> None:
if not block_hashes:
raise ValueError("block_hashes cannot be empty")
num_blocks = 1 # SGLang RadixCache emits one block at a time
if len(block_hashes) != 1:
logger.warning(
"Expected exactly 1 block hash, got %d. Using first hash only.",
len(block_hashes),
)
token_arr = (c_uint32 * len(token_ids))(*token_ids)
num_block_tokens_arr = (c_size_t * 1)(len(token_ids))
block_hash_arr = (c_uint64 * 1)(block_hashes[0])
🤖 Prompt for AI Agents
In container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch around lines
107 to 111, the code assumes block_hashes has at least one element and hardcodes
num_blocks to 1 without validation. Add a check to ensure block_hashes is not
empty before accessing block_hashes[0], and validate or dynamically determine
num_blocks instead of hardcoding it to 1 to prevent potential index errors and
incorrect assumptions.

Comment on lines 152 to 154
+        # block_size (tokens per block) must match allocator; default 16.
+        from .dynamo_event_manager import KVCacheEventManager
+
Contributor

🛠️ Refactor suggestion

Move import to module level to avoid potential issues.

Importing modules inside __init__ methods is generally not recommended as it can cause circular import issues and makes dependencies less clear.

+from .dynamo_event_manager import KVCacheEventManager
+
 # Add DynamoEventPublisher
 class DynamoEventPublisher(EventPublisher):
     """Publisher that forwards events to Dynamo via the C-API bindings."""

     def __init__(self, kv_block_size: int = 16):
         # block_size (tokens per block) must match allocator; default 16.
-        from .dynamo_event_manager import KVCacheEventManager

         self._mgr = KVCacheEventManager(kv_block_size=kv_block_size)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+ # block_size (tokens per block) must match allocator; default 16.
+ from .dynamo_event_manager import KVCacheEventManager
+
from .dynamo_event_manager import KVCacheEventManager
# Add DynamoEventPublisher
class DynamoEventPublisher(EventPublisher):
"""Publisher that forwards events to Dynamo via the C-API bindings."""
def __init__(self, kv_block_size: int = 16):
# block_size (tokens per block) must match allocator; default 16.
self._mgr = KVCacheEventManager(kv_block_size=kv_block_size)
🤖 Prompt for AI Agents
In container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch around lines
152 to 154, the import of KVCacheEventManager is done inside a method or
function. Move this import statement to the top of the module file, outside of
any functions or classes, to avoid potential circular import issues and improve
clarity of dependencies.

Comment on lines 161 to 175
+        for ev in events.events:
+            if isinstance(ev, BlockStored):
+                self._mgr.publish_stored(
+                    token_ids=ev.token_ids,
+                    block_hashes=ev.block_hashes,
+                    parent_hash=ev.parent_block_hash,
+                    block_size=ev.block_size,
+                    lora_id=ev.lora_id,
+                )
+            elif isinstance(ev, BlockRemoved):
+                self._mgr.publish_removed(ev.block_hashes)
+            elif isinstance(ev, AllBlocksCleared):
+                # We could broadcast each removed block, but Radix clears by
+                # resetting cache; upstream workers will clear too. For now
+                # no-op.
Contributor

🛠️ Refactor suggestion

Add validation for event data integrity.

The event handling forwards data to the C-API without validation. Consider adding basic checks to prevent potential crashes or invalid data.

         for ev in events.events:
             if isinstance(ev, BlockStored):
+                if not ev.token_ids:
+                    logger.warning("Skipping BlockStored event with empty token_ids")
+                    continue
+                if not ev.block_hashes:
+                    logger.warning("Skipping BlockStored event with empty block_hashes")
+                    continue
                 self._mgr.publish_stored(
                     token_ids=ev.token_ids,
                     block_hashes=ev.block_hashes,
                     parent_hash=ev.parent_block_hash,
                     block_size=ev.block_size,
                     lora_id=ev.lora_id,
                 )
             elif isinstance(ev, BlockRemoved):
+                if not ev.block_hashes:
+                    logger.warning("Skipping BlockRemoved event with empty block_hashes")
+                    continue
                 self._mgr.publish_removed(ev.block_hashes)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
+ for ev in events.events:
+ if isinstance(ev, BlockStored):
+ self._mgr.publish_stored(
+ token_ids=ev.token_ids,
+ block_hashes=ev.block_hashes,
+ parent_hash=ev.parent_block_hash,
+ block_size=ev.block_size,
+ lora_id=ev.lora_id,
+ )
+ elif isinstance(ev, BlockRemoved):
+ self._mgr.publish_removed(ev.block_hashes)
+ elif isinstance(ev, AllBlocksCleared):
+ # We could broadcast each removed block, but Radix clears by
+ # resetting cache; upstream workers will clear too. For now
+ # no-op.
for ev in events.events:
if isinstance(ev, BlockStored):
if not ev.token_ids:
logger.warning("Skipping BlockStored event with empty token_ids")
continue
if not ev.block_hashes:
logger.warning("Skipping BlockStored event with empty block_hashes")
continue
self._mgr.publish_stored(
token_ids=ev.token_ids,
block_hashes=ev.block_hashes,
parent_hash=ev.parent_block_hash,
block_size=ev.block_size,
lora_id=ev.lora_id,
)
elif isinstance(ev, BlockRemoved):
if not ev.block_hashes:
logger.warning("Skipping BlockRemoved event with empty block_hashes")
continue
self._mgr.publish_removed(ev.block_hashes)
elif isinstance(ev, AllBlocksCleared):
# We could broadcast each removed block, but Radix clears by
# resetting cache; upstream workers will clear too. For now
# no-op.
🤖 Prompt for AI Agents
In container/deps/sglang/sglang_v0.4.6-dynamo-kv-disagg-patch.patch around lines
161 to 175, the event handling code forwards event data directly to the C-API
without validating the integrity of the data. To fix this, add basic validation
checks for required fields in each event type before calling the publish
methods. For example, verify that token_ids and block_hashes are not empty or
None, and that block_size is a positive integer. If validation fails, handle the
error gracefully, such as logging a warning and skipping the publish call to
prevent crashes or invalid data propagation.

@ishandhanani
Contributor

ishandhanani commented Jun 2, 2025

Hi @faradawn - thanks for this PR.

Have you taken a look at the following?

  1. [Metrics] Add KV events publishing sgl-project/sglang#6098 - this allows you to publish KV events from SGL via ZMQ.
  2. https://github.com/ai-dynamo/dynamo/pull/1181/files - this added the ability for Dynamo to listen to events via ZMQ. It should work for SGL as well

I see what you're trying to do but I don't think we need to maintain a patch given those 2 PRs above.


Given this info, however, the current KV event emitting logic in SGL has a bug:

  • In the radix tree impl, SGL compresses multiple blocks (of size page_size) into a single node for space optimization. So the current event-emitting logic in the radix_cache is not quite right, because we emit events with block sizes that are variable (something Dynamo does not currently support).

Let me know if you have any thoughts here. It's on my TODO to figure out a solution for this. Happy to discuss and see if we can parallelize work here!

@faradawn
Contributor Author

faradawn commented Jun 2, 2025

Hi @ishandhanani , here are two ideas:

Idea 1. Fix SGLang event publisher. Just before an event is pushed onto kv_event_queue, slice the node's value into fixed-size chunks of page_size and emit 1 BlockStored event per chunk.

Idea 2: Fix Dynamo listener. Break a large block into chunks of page_size and simulate the arrival of each page.

You have more experience and I'd love to discuss and execute any plan.

@alec-flowers
Contributor

We don't want to continue using the C path; we should add a note that it's deprecated for now. We have a Python KVEventPublisher that is easier to use, and we should use that. Also, we want to avoid patches and instead commit to mainline. That is what @ishandhanani is pointing out: there is work already committed so that we can pull events without needing patches.

@ishandhanani
Contributor

ishandhanani commented Jun 3, 2025

Hi @ishandhanani , here are two ideas:

Idea 1. Fix SGLang event publisher. Just before an event is pushed onto kv_event_queue, slice the node's value into fixed-size chunks of page_size and emit 1 BlockStored event per chunk.

Idea 2: Fix Dynamo listener. Break a large block into chunks of page_size and simulate the arrival of each page.

You have more experience and I'd love to discuss and execute any plan.

I think option 1 is better. I think we want events emitted from SGLang to have a consistent block size. This allows us to maintain approximately the same tree, just with our prefix tree having fixed block sizes. Do you want to take a stab at the PR to SGLang for this @faradawn?

@faradawn
Contributor Author

faradawn commented Jun 3, 2025

@ishandhanani sounds good! Is the ZMQ for kv event only used by Dynamo? I can take a look and create a PR to SGLang.

@ishandhanani
Contributor

ishandhanani commented Jun 3, 2025

@ishandhanani sounds good! Is the ZMQ for kv event only used by Dynamo? I can take a look and create a PR to SGLang.

Great! I pinged you in the discord in case we wanted to discuss a bit more synchronously.

At the moment yes - it is only used by Dynamo. However - any other framework that wants to build their own router based on ZMQ events could in theory use it.

@faradawn
Contributor Author

faradawn commented Jun 3, 2025

PR created for SGLang: sgl-project/sglang#6824. Will keep you updated!

@alec-flowers
Contributor

If SGLang would accept it we could embed our own Publisher directly. Then we can skip the whole zmq stuff and publish directly from each process to NATS.

For data parallel we have made a few small changes to vLLM that we should try to mirror in SGLang
vllm-project/vllm#18925

@ishandhanani
Contributor

If SGLang would accept it we could embed our own Publisher directly. Then we can skip the whole zmq stuff and publish directly from each process to NATS.

For data parallel we have made a few small changes to vLLM that we should try to mirror in SGLang vllm-project/vllm#18925

Nice - let me take a look

@faradawn faradawn requested a review from richardhuo-nv as a code owner June 5, 2025 20:12
@faradawn faradawn requested a review from piotrm-nvidia as a code owner June 5, 2025 20:18
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (2)
examples/sglang/components/frontend.py (1)

20-29: ⚠️ Potential issue

Remove now-unused SGLangWorker import to unblock CI

SGLangWorker is no longer referenced after switching the dependency to Router, and the Ruff step already failed once for the same issue. Deleting the import keeps the import section clean and prevents the next lint/format cycle from rewriting this file again.

-from components.worker import SGLangWorker
-from fastapi import FastAPI
+from fastapi import FastAPI
🧰 Tools
🪛 Ruff (0.11.9)

20-20: components.worker.SGLangWorker imported but unused

Remove unused import: components.worker.SGLangWorker

(F401)

🪛 Pylint (3.3.7)

[refactor] 24-24: Use 'from dynamo import sdk' instead

(R0402)

examples/sglang/components/worker.py (1)

71-99: ⚠️ Potential issue

KV-publisher is initialised against the decode worker, not this worker

comp_ns, comp_name is reassigned to SGLangDecodeWorker.dynamo_address() (lines 81-83).
Lines 94-97 then create component_ref = runtime.namespace(comp_ns).component(comp_name), which therefore points to the decode worker, while the KV-events you want originate from the prefill/aggregate worker (this component).

Create a separate pair of variables or call self.__class__.dynamo_address() again when wiring up the publisher:

-            comp_ns, comp_name = SGLangDecodeWorker.dynamo_address()
+            comp_ns_dec, comp_name_dec = SGLangDecodeWorker.dynamo_address()
             self.decode_client = (
-                await runtime.namespace(comp_ns)
-                .component(comp_name)
+                await runtime.namespace(comp_ns_dec)
+                .component(comp_name_dec)
                 .endpoint("generate")
                 .client()
             )
...
-            component_ref = runtime.namespace(comp_ns).component(comp_name)
+            comp_ns_self, comp_name_self = self.__class__.dynamo_address()
+            component_ref = runtime.namespace(comp_ns_self).component(comp_name_self)
🧰 Tools
🪛 Pylint (3.3.7)

[error] 71-71: Class 'SGLangWorker' has no 'dynamo_address' member

(E1101)

🧹 Nitpick comments (4)
examples/sglang/utils/protocol.py (2)

22-24: Narrow the type of sampling_params

sampling_params is currently typed as a plain dict, which loses all static-type information and IDE assistance. If the structure is not fixed yet, at least annotate it as dict[str, Any] (or introduce a dedicated SamplingParams model) so downstream code can rely on proper typing.

-from pydantic import BaseModel, Field
+from typing import Any
+from pydantic import BaseModel, Field
...
-    sampling_params: dict
+    sampling_params: dict[str, Any]
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 22-22: Too few public methods (0/2)

(R0903)


27-28: Prefer TokenIdType for consistency

tokens: list[int] diverges from the alias declared above (TokenIdType = int). Using the alias keeps future refactors (e.g., moving to numpy.uint32) mechanical:

-class Tokens(BaseModel):
-    tokens: list[int]
+class Tokens(BaseModel):
+    tokens: list[TokenIdType]
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 27-27: Too few public methods (0/2)

(R0903)

examples/sglang/components/kv_router.py (2)

168-226: _cost_function is doing too much – extract helpers

18 local variables and mixed concerns (score conversion, normalisation, logging, final selection) make this block hard to reason about and unit-test. Consider:

  1. _convert_block_scores() – blocks ➜ token hit-rate
  2. _merge_metrics() – build worker_metrics & max_waiting
  3. _compute_logits() – linear combination & tie-breaking

This keeps each piece ≤ 30 LOC and removes the Pylint R0914 hit.
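A free-function sketch of that split, with plain dicts standing in for the OverlapScores/AggregatedMetrics objects; the names follow the bullets above and the formula mirrors the existing 2 * score - gpu_cache_usage - normalized_waiting combination, but none of this is the actual implementation:

from typing import Dict, Tuple

def _convert_block_scores(scores: Dict[str, float], block_size: int, token_length: int) -> Dict[str, float]:
    # Overlap counted in blocks -> per-worker token hit-rate
    return {wid: s * block_size / token_length for wid, s in scores.items()}

def _merge_metrics(endpoint_metrics: Dict[str, Dict[str, float]]) -> Tuple[Dict[str, Dict[str, float]], float]:
    # Per-worker metrics plus the largest queue depth, used for normalisation
    max_waiting = max((m["num_requests_waiting"] for m in endpoint_metrics.values()), default=0.0)
    return endpoint_metrics, max_waiting

def _compute_logits(worker_scores: Dict[str, float], worker_metrics: Dict[str, Dict[str, float]], max_waiting: float) -> Dict[str, float]:
    # Same linear combination as the current code; higher is better
    logits = {}
    for wid, m in worker_metrics.items():
        waiting = m["num_requests_waiting"] / max_waiting if max_waiting > 0 else 0.0
        logits[wid] = 2 * worker_scores.get(wid, 0.0) - m["gpu_cache_usage_perc"] - waiting
    return logits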

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 168-168: Too many local variables (18/15)

(R0914)


228-235: Empty logits fall back to "" – propagate a valid worker instead

Returning an empty string forces the caller to add fallback logic. A safer default is a random available worker (keeps throughput) or raising so the caller can retry.

-            return "", 0.0
+            return random.choice(worker_ids), 0.0
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53ea353 and 077ead1.

📒 Files selected for processing (5)
  • examples/sglang/components/frontend.py (2 hunks)
  • examples/sglang/components/kv_router.py (1 hunks)
  • examples/sglang/components/worker.py (3 hunks)
  • examples/sglang/utils/protocol.py (1 hunks)
  • examples/sglang/utils/sglang.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/sglang/components/kv_router.py (6)
examples/sglang/components/worker.py (3)
  • SGLangWorker (56-216)
  • async_init (68-119)
  • generate (154-195)
examples/sglang/utils/protocol.py (1)
  • Tokens (27-28)
deploy/sdk/src/dynamo/sdk/lib/decorators.py (1)
  • async_on_start (95-99)
deploy/sdk/src/dynamo/sdk/core/lib.py (2)
  • depends (121-146)
  • service (88-118)
deploy/sdk/src/dynamo/sdk/lib/config.py (1)
  • as_args (81-132)
lib/bindings/python/rust/llm/kv.rs (1)
  • get_metrics (405-428)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.
examples/sglang/components/frontend.py

[error] 1-1: isort formatting failed and modified this file to fix import order.


[error] 17-25: ruff linting failed and modified this file to fix an unused import error.

examples/sglang/components/worker.py

[error] 1-1: isort formatting failed and modified this file to fix import order.


[error] 1-1: black formatting failed and reformatted this file to fix code style issues.

examples/sglang/utils/protocol.py

[error] 1-1: black formatting failed and reformatted this file to fix code style issues.

examples/sglang/utils/sglang.py

[error] 1-1: black formatting failed and reformatted this file to fix code style issues.

examples/sglang/components/kv_router.py

[error] 1-1: isort formatting failed and modified this file to fix import order.


[error] 1-1: black formatting failed and reformatted this file to fix code style issues.

🪛 Pylint (3.3.7)
examples/sglang/utils/protocol.py

[refactor] 22-22: Too few public methods (0/2)

(R0903)


[refactor] 27-27: Too few public methods (0/2)

(R0903)


[refactor] 31-31: Too few public methods (0/2)

(R0903)


[refactor] 35-35: Too few public methods (0/2)

(R0903)


[refactor] 39-39: Too few public methods (0/2)

(R0903)

examples/sglang/utils/sglang.py

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

examples/sglang/components/kv_router.py

[refactor] 55-55: Too few public methods (0/2)

(R0903)


[refactor] 168-168: Too many local variables (18/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: Build and Test - vllm
🔇 Additional comments (1)
examples/sglang/utils/sglang.py (1)

20-21: Verify import path sglang.srt.server_args

Static analysis flags No name 'srt' in module 'sglang'. If srt is an internal sub-package introduced downstream this is fine, but it is easy to mis-spell (srt vs srv / rst). Please re-check the actual package path in the SGLang wheel used in production.

🧰 Tools
🪛 Pylint (3.3.7)

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

Comment on lines 98 to 103
            # Derive KV block size from engine args, falling back to 16
            kv_block_size = getattr(
                self.engine_args, "page_size", getattr(self.engine_args, "block_size", 16)
            )

            zmq_config = ZmqKvEventPublisherConfig(
Contributor

🛠️ Refactor suggestion

page_size/block_size may arrive as string

argparse leaves CLI values as str when they come from the service config.
A manual int() cast guards against accidental string propagation to Rust bindings:

-            kv_block_size = getattr(
-                self.engine_args, "page_size", getattr(self.engine_args, "block_size", 16)
-            )
+            kv_block_size = int(
+                getattr(
+                    self.engine_args,
+                    "page_size",
+                    getattr(self.engine_args, "block_size", 16),
+                )
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Derive KV block size from engine args, falling back to 16
kv_block_size = getattr(
self.engine_args, "page_size", getattr(self.engine_args, "block_size", 16)
)
zmq_config = ZmqKvEventPublisherConfig(
# Derive KV block size from engine args, falling back to 16
kv_block_size = int(
getattr(
self.engine_args,
"page_size",
getattr(self.engine_args, "block_size", 16),
)
)
zmq_config = ZmqKvEventPublisherConfig(
🤖 Prompt for AI Agents
In examples/sglang/components/worker.py around lines 98 to 103, the
kv_block_size is derived from engine_args.page_size or engine_args.block_size,
but these values may be strings due to argparse behavior. To fix this,
explicitly cast the retrieved value to int to ensure kv_block_size is always an
integer before passing it to Rust bindings.

Comment on lines 90 to 97
"--custom-router",
type=bool,
default=False,
help="Whether to use custom router or not",
)
parser.add_argument(
"--router",
type=str,
Contributor

⚠️ Potential issue

Boolean CLI flag should use action="store_true"

type=bool expects the user to pass an explicit value (--custom-router true), which is error-prone and breaks argparse’s “flag” semantics. Replace with action="store_true" (and drop the default):

-    parser.add_argument(
-        "--custom-router",
-        type=bool,
-        default=False,
-        help="Whether to use custom router or not",
-    )
+    parser.add_argument(
+        "--custom-router",
+        action="store_true",
+        help="Enable custom router",
+    )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"--custom-router",
type=bool,
default=False,
help="Whether to use custom router or not",
)
parser.add_argument(
"--router",
type=str,
parser.add_argument(
"--custom-router",
action="store_true",
help="Enable custom router",
)
parser.add_argument(
"--router",
type=str,
🤖 Prompt for AI Agents
In examples/sglang/components/kv_router.py around lines 90 to 97, the boolean
CLI flag for --custom-router incorrectly uses type=bool, requiring an explicit
value and breaking flag semantics. Change the argument to use
action="store_true" instead and remove the default value to make it a proper
flag that sets the value to True when present.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
examples/sglang/components/kv_router.py (1)

92-96: Boolean flag should use action="store_true"
Same issue flagged previously; using type=bool forces the caller to type --custom-router true. Switch to the canonical argparse style:

-    parser.add_argument(
-        "--custom-router",
-        type=bool,
-        default=False,
-        help="Whether to use custom router or not",
-    )
+    parser.add_argument(
+        "--custom-router",
+        action="store_true",
+        help="Enable custom router",
+    )
🧹 Nitpick comments (4)
examples/sglang/utils/sglang.py (1)

16-25: Minor: initialise logging once

logging.getLogger(__name__) is fine, but without basicConfig nothing is emitted when this module is imported directly (common in unit tests). Consider adding a one-liner:

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")

or require the embedding application to configure logging explicitly.

🧰 Tools
🪛 Pylint (3.3.7)

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

examples/sglang/utils/protocol.py (1)

22-42: Tokens model name is plural but field is singular

For consistency with other request/response models consider renaming either the class or the field:

-class Tokens(BaseModel):
-    tokens: list[int]
+class Tokens(BaseModel):
+    values: list[int]

(or rename class to TokenList). This avoids .tokens.tokens-style attribute chains in downstream code.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 22-22: Too few public methods (0/2)

(R0903)


[refactor] 27-27: Too few public methods (0/2)

(R0903)


[refactor] 31-31: Too few public methods (0/2)

(R0903)


[refactor] 35-35: Too few public methods (0/2)

(R0903)


[refactor] 39-39: Too few public methods (0/2)

(R0903)

examples/sglang/components/kv_router.py (2)

35-35: Left-over debug comment

# faradawn hello appears to be an accidental artefact – remove to keep the codebase clean.


171-232: Method exceeds reasonable complexity (18 local vars)

Splitting _cost_function into helpers (_translate_scores, _normalise_metrics, _combine_logits) will improve readability and ease unit-testing.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 171-171: Too many local variables (18/15)

(R0914)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 077ead1 and e5b2cdc.

📒 Files selected for processing (10)
  • container/Dockerfile.sglang (1 hunks)
  • examples/sglang/README.md (1 hunks)
  • examples/sglang/components/frontend.py (3 hunks)
  • examples/sglang/components/kv_router.py (1 hunks)
  • examples/sglang/components/worker.py (3 hunks)
  • examples/sglang/configs/agg_router.yaml (1 hunks)
  • examples/sglang/graphs/agg_router.py (1 hunks)
  • examples/sglang/utils/protocol.py (1 hunks)
  • examples/sglang/utils/sglang.py (2 hunks)
  • lib/bindings/python/README.md (1 hunks)
✅ Files skipped from review due to trivial changes (5)
  • examples/sglang/README.md
  • lib/bindings/python/README.md
  • examples/sglang/graphs/agg_router.py
  • container/Dockerfile.sglang
  • examples/sglang/configs/agg_router.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • examples/sglang/components/frontend.py
  • examples/sglang/components/worker.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/sglang/utils/sglang.py (1)
deploy/sdk/src/dynamo/sdk/cli/utils.py (1)
  • reserve_free_port (63-102)
🪛 Pylint (3.3.7)
examples/sglang/components/kv_router.py

[refactor] 57-57: Too few public methods (0/2)

(R0903)


[refactor] 171-171: Too many local variables (18/15)

(R0914)

examples/sglang/utils/protocol.py

[refactor] 22-22: Too few public methods (0/2)

(R0903)


[refactor] 27-27: Too few public methods (0/2)

(R0903)


[refactor] 31-31: Too few public methods (0/2)

(R0903)


[refactor] 35-35: Too few public methods (0/2)

(R0903)


[refactor] 39-39: Too few public methods (0/2)

(R0903)

examples/sglang/utils/sglang.py

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.
examples/sglang/components/kv_router.py

[error] 1-1: isort failed and modified this file to fix import sorting issues.


[error] 1-1: black formatting check failed and reformatted this file.

examples/sglang/utils/protocol.py

[error] 1-1: black formatting check failed and reformatted this file.

examples/sglang/utils/sglang.py

[error] 1-1: black formatting check failed and reformatted this file.

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
🔇 Additional comments (2)
examples/sglang/utils/protocol.py (1)

1-84: Black re-formatted this file – please commit the result

Pre-merge pipeline indicates Black auto-formatted the file. Commit those changes (or run make fmt) to unblock CI.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 22-22: Too few public methods (0/2)

(R0903)


[refactor] 27-27: Too few public methods (0/2)

(R0903)


[refactor] 31-31: Too few public methods (0/2)

(R0903)


[refactor] 35-35: Too few public methods (0/2)

(R0903)


[refactor] 39-39: Too few public methods (0/2)

(R0903)


[refactor] 45-45: Too few public methods (0/2)

(R0903)


[refactor] 53-53: Too few public methods (0/2)

(R0903)


[refactor] 68-68: Too few public methods (0/2)

(R0903)


[refactor] 77-77: Too few public methods (0/2)

(R0903)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.

[error] 1-1: black formatting check failed and reformatted this file.

examples/sglang/components/kv_router.py (1)

1-314: Formatting & import-order failures block CI

CI ran isort and black and rewrote this file. Please commit the formatted version (or run:

isort examples/sglang/components/kv_router.py
black examples/sglang/components/kv_router.py

).

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 57-57: Too few public methods (0/2)

(R0903)


[refactor] 171-171: Too many local variables (18/15)

(R0914)

🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.

[error] 1-1: isort failed and modified this file to fix import sorting issues.


[error] 1-1: black formatting check failed and reformatted this file.

import logging
from argparse import Namespace

from sglang.srt.server_args import ServerArgs
Contributor

⚠️ Potential issue

Broken import – sglang.srt.server_args does not exist

Static-analysis already flags this; in the published sglang wheel the object lives at sglang.server_args. Loading this file will therefore crash on worker start-up.

-from sglang.srt.server_args import ServerArgs
+from sglang.server_args import ServerArgs
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from sglang.srt.server_args import ServerArgs
-from sglang.srt.server_args import ServerArgs
+from sglang.server_args import ServerArgs
🧰 Tools
🪛 Pylint (3.3.7)

[error] 20-20: No name 'srt' in module 'sglang'

(E0611)

🤖 Prompt for AI Agents
In examples/sglang/utils/sglang.py at line 20, the import statement incorrectly
references sglang.srt.server_args, which does not exist. Update the import to
use sglang.server_args instead to match the actual module location and prevent
import errors on worker start-up.

Comment on lines 45 to 47
from components.worker import SGLangWorker
from utils.protocol import Tokens

Contributor

⚠️ Potential issue

Import path may break outside examples package

from components.worker import SGLangWorker relies on the current working directory being examples/sglang. Use an absolute import rooted at the project package (e.g. examples.sglang.components.worker) or add __init__.py files to turn folders into packages.

🤖 Prompt for AI Agents
In examples/sglang/components/kv_router.py around lines 45 to 47, the import
statement uses a relative path that depends on the current working directory,
which can break outside the examples package. Change the import to use an
absolute import rooted at the project package, such as importing SGLangWorker
from examples.sglang.components.worker. Alternatively, ensure all relevant
directories have __init__.py files to make them proper packages and support
relative imports.

Comment on lines 171 to 232
def _cost_function(
self,
scores: OverlapScores | None,
metrics: AggregatedMetrics | None,
token_length: int,
) -> Tuple[WorkerId, float]:
"""Compute the best worker given scores/metrics.
Returns a tuple of ``(worker_id, estimated_prefix_hit_rate)``.
"""
logger.info("=== Cost function called")

worker_scores: dict[WorkerId, float] = {}
if scores:
for worker_id, score in scores.scores.items():
# score is in *blocks*; convert to *tokens* hit-rate
worker_scores[worker_id] = (
score * self.indexer.block_size() / token_length
)
else:
logger.warning("Cannot get KV scores")

worker_metrics: dict[WorkerId, dict[str, float]] = {}
max_waiting = 0.0
if metrics:
for endpoint in metrics.endpoints:
worker_id = endpoint.worker_id
worker_metrics[worker_id] = {
key: getattr(endpoint, key, self.default_metrics[key])
for key in self.default_metrics.keys()
}
max_waiting = max(
max_waiting, worker_metrics[worker_id]["num_requests_waiting"]
)
else:
logger.warning("Cannot get metrics")

# Consider *all* workers, even if metrics/scores missing
worker_ids = self.workers_client.instance_ids()

worker_logits: dict[WorkerId, float] = {}
for worker_id in worker_ids:
score = worker_scores.get(worker_id, 0.0)
metrics_dict = worker_metrics.get(worker_id, self.default_metrics)
gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
normalized_waiting = (
metrics_dict["num_requests_waiting"] / max_waiting
if max_waiting > 0 else 0.0
)

# Simple linear combination (higher is better)
worker_logits[worker_id] = 2 * score - gpu_cache_usage - normalized_waiting
logger.info(
"Formula for %s: %.3f = 2.0 * %.3f - %.3f - %.3f",
worker_id,
worker_logits[worker_id],
score,
gpu_cache_usage,
normalized_waiting,
)

if not worker_logits or not any(worker_logits.values()):
Contributor

🛠️ Refactor suggestion

Cost function mixes incomparable scales – results may be biased

gpu_cache_usage_perc appears to be a value in [0,100] while score is a fraction 0-1 (after block → token normalisation). Subtracting the raw percentage heavily outweighs the 2×cache-hit bonus, effectively turning the formula into “pick whoever uses 0 % GPU cache”.

Either:

  1. Convert gpu_cache_usage_perc to [0,1] by dividing by 100, or
  2. Rescale all terms/weights based on empirical ranges (e.g. weight hit-rate by 100 as well).

Example fix:

-            gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
+            gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"] / 100.0

and revisit the coefficients.
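With concrete numbers (assuming the percentage really is on a 0-100 scale): a worker with a token hit-rate of 0.5 and gpu_cache_usage_perc = 50 gets a logit of 2 × 0.5 − 50 = −49 under the current formula, so the percentage term dwarfs the cache-hit bonus; after dividing by 100 the same inputs give 2 × 0.5 − 0.5 = 0.5 (waiting term omitted for simplicity).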

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _cost_function(
self,
scores: OverlapScores | None,
metrics: AggregatedMetrics | None,
token_length: int,
) -> Tuple[WorkerId, float]:
"""Compute the best worker given scores/metrics.
Returns a tuple of ``(worker_id, estimated_prefix_hit_rate)``.
"""
logger.info("=== Cost function called")
worker_scores: dict[WorkerId, float] = {}
if scores:
for worker_id, score in scores.scores.items():
# score is in *blocks*; convert to *tokens* hit-rate
worker_scores[worker_id] = (
score * self.indexer.block_size() / token_length
)
else:
logger.warning("Cannot get KV scores")
worker_metrics: dict[WorkerId, dict[str, float]] = {}
max_waiting = 0.0
if metrics:
for endpoint in metrics.endpoints:
worker_id = endpoint.worker_id
worker_metrics[worker_id] = {
key: getattr(endpoint, key, self.default_metrics[key])
for key in self.default_metrics.keys()
}
max_waiting = max(
max_waiting, worker_metrics[worker_id]["num_requests_waiting"]
)
else:
logger.warning("Cannot get metrics")
# Consider *all* workers, even if metrics/scores missing
worker_ids = self.workers_client.instance_ids()
worker_logits: dict[WorkerId, float] = {}
for worker_id in worker_ids:
score = worker_scores.get(worker_id, 0.0)
metrics_dict = worker_metrics.get(worker_id, self.default_metrics)
gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
normalized_waiting = (
metrics_dict["num_requests_waiting"] / max_waiting
if max_waiting > 0 else 0.0
)
# Simple linear combination (higher is better)
worker_logits[worker_id] = 2 * score - gpu_cache_usage - normalized_waiting
logger.info(
"Formula for %s: %.3f = 2.0 * %.3f - %.3f - %.3f",
worker_id,
worker_logits[worker_id],
score,
gpu_cache_usage,
normalized_waiting,
)
if not worker_logits or not any(worker_logits.values()):
for worker_id in worker_ids:
score = worker_scores.get(worker_id, 0.0)
metrics_dict = worker_metrics.get(worker_id, self.default_metrics)
- gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
+ gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"] / 100.0
normalized_waiting = (
metrics_dict["num_requests_waiting"] / max_waiting
if max_waiting > 0 else 0.0
)
# Simple linear combination (higher is better)
worker_logits[worker_id] = 2 * score - gpu_cache_usage - normalized_waiting
logger.info(
"Formula for %s: %.3f = 2.0 * %.3f - %.3f - %.3f",
worker_id,
worker_logits[worker_id],
score,
gpu_cache_usage,
normalized_waiting,
)
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 171-171: Too many local variables (18/15)

(R0914)

🤖 Prompt for AI Agents
In examples/sglang/components/kv_router.py between lines 171 and 232, the cost
function subtracts gpu_cache_usage_perc directly, which is on a 0-100 scale,
from score values on a 0-1 scale, causing bias. Fix this by normalizing
gpu_cache_usage_perc to a 0-1 scale by dividing it by 100 before using it in the
formula. After this, revisit and adjust the coefficients in the linear
combination to ensure balanced weighting of all terms.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (5)
examples/sglang/components/worker.py (1)

99-101: page_size/block_size may arrive as a string

argparse leaves CLI values as str when they come from the service config.
A manual int() cast guards against accidental string propagation to Rust bindings:

-            kv_block_size = getattr(
-                self.engine_args, "page_size", getattr(self.engine_args, "block_size", 64)
-            )
+            kv_block_size = int(
+                getattr(
+                    self.engine_args,
+                    "page_size", 
+                    getattr(self.engine_args, "block_size", 64),
+                )
+            )
examples/sglang/components/kv_router.py (4)

37-56: Fix import formatting to resolve pipeline failure.

The pipeline indicates that import sorting failed. Please run isort to fix the import order:

isort examples/sglang/components/kv_router.py
black examples/sglang/components/kv_router.py

45-47: Import path may break outside examples package

from components.worker import SGLangWorker relies on the current working directory being examples/sglang. Use an absolute import rooted at the project package (e.g. examples.sglang.components.worker) or add __init__.py files to turn folders into packages.
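
For instance, once __init__.py files are in place, the import could read (assuming the repository root is on PYTHONPATH; the path shown is illustrative):

from examples.sglang.components.worker import SGLangWorker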


92-96: Boolean CLI flag should use action="store_true"

type=bool expects the user to pass an explicit value (--custom-router true), which is error-prone and breaks argparse's "flag" semantics. Replace with action="store_true":

-    parser.add_argument(
-        "--custom-router",
-        type=bool,
-        default=False,
-        help="Whether to use custom router or not",
-    )
+    parser.add_argument(
+        "--custom-router",
+        action="store_true",
+        help="Enable custom router",
+    )
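
A standalone argparse sketch of why type=bool misbehaves (flag names here are illustrative):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--custom-router-bool", type=bool, default=False)
parser.add_argument("--custom-router", action="store_true")

# bool("false") is True, so any explicit value enables the type=bool flag:
print(parser.parse_args(["--custom-router-bool", "false"]).custom_router_bool)  # True
# store_true flips the flag only when it is present on the command line:
print(parser.parse_args([]).custom_router)                                      # False
print(parser.parse_args(["--custom-router"]).custom_router)                     # True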

169-230: Fix code formatting and consider refactoring complex method.

The pipeline indicates that Black formatting failed. Please run black to fix code style issues throughout the file.

Additionally, the _cost_function method has too many local variables (18/15 per Pylint), making it difficult to read and maintain.

black examples/sglang/components/kv_router.py

Consider refactoring _cost_function by extracting helper methods (a sketch follows this list):

  • _process_overlap_scores() to handle score conversion
  • _process_worker_metrics() to handle metrics aggregation
  • _compute_worker_logits() to handle the scoring logic
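
A hypothetical decomposition along those lines (helper signatures and bodies are assumptions, not existing code):

def _cost_function(
    self,
    scores: OverlapScores | None,
    metrics: AggregatedMetrics | None,
    token_length: int,
) -> Tuple[WorkerId, float]:
    worker_scores = self._process_overlap_scores(scores, token_length)
    worker_metrics, max_waiting = self._process_worker_metrics(metrics)
    # _compute_worker_logits would fetch self.workers_client.instance_ids() internally
    worker_logits = self._compute_worker_logits(worker_scores, worker_metrics, max_waiting)
    if not worker_logits:
        raise RuntimeError("No workers available")
    best_worker = max(worker_logits, key=worker_logits.get)
    return best_worker, worker_scores.get(best_worker, 0.0)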

Cost function mixes incomparable scales – results may be biased

gpu_cache_usage_perc appears to be a value in [0,100] while score is a fraction 0-1 (after block → token normalisation). Subtracting the raw percentage heavily outweighs the 2×cache-hit bonus, effectively turning the formula into "pick whoever uses 0 % GPU cache".

Either:

  1. Convert gpu_cache_usage_perc to [0,1] by dividing by 100, or
  2. Rescale all terms/weights based on empirical ranges

Example fix:

-            gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"]
+            gpu_cache_usage = metrics_dict["gpu_cache_usage_perc"] / 100.0
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 169-169: Too many local variables (18/15)

(R0914)

🧹 Nitpick comments (2)
examples/sglang/components/frontend.py (1)

49-57: Simplify by removing unnecessary else after return.

 def get_http_binary_path():
     """Find the HTTP binary path in SDK or fallback to 'http' command."""
     sdk_path = Path(sdk.__file__)
     binary_path = sdk_path.parent / "cli/bin/http"
     if not binary_path.exists():
         return "http"
-    else:
-        return str(binary_path)
+    return str(binary_path)
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 53-56: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

examples/sglang/components/kv_router.py (1)

35-35: Remove unnecessary comment.

-# faradawn hello
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5b2cdc and aac3b8f.

📒 Files selected for processing (6)
  • examples/sglang/components/frontend.py (5 hunks)
  • examples/sglang/components/kv_router.py (1 hunks)
  • examples/sglang/components/worker.py (5 hunks)
  • examples/sglang/configs/agg_router.yaml (1 hunks)
  • launch/dynamo-run/src/subprocess/sglang_inc.py (1 hunks)
  • lib/bindings/python/README.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • lib/bindings/python/README.md
  • launch/dynamo-run/src/subprocess/sglang_inc.py
  • examples/sglang/configs/agg_router.yaml
🧰 Additional context used
🧬 Code Graph Analysis (1)
examples/sglang/components/frontend.py (4)
examples/sglang/components/worker.py (2)
  • SGLangWorker (56-218)
  • generate (154-197)
examples/sglang/components/kv_router.py (2)
  • Router (118-313)
  • generate (279-313)
examples/sglang/utils/protocol.py (4)
  • Tokens (27-28)
  • PreprocessedRequest (68-74)
  • StopConditions (45-50)
  • SamplingOptions (53-65)
deploy/sdk/src/dynamo/sdk/lib/decorators.py (3)
  • async_on_start (95-99)
  • endpoint (68-92)
  • on_shutdown (102-106)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/1323/merge) by faradawn.
examples/sglang/components/worker.py

[error] 1-1: isort formatting failed and modified the file. Please run isort to fix import sorting.


[error] 1-1: Black formatting failed and reformatted the file. Please run 'black' to fix code style issues.

examples/sglang/components/frontend.py

[error] 1-1: isort formatting failed and modified the file. Please run isort to fix import sorting.


[error] 1-1: Black formatting failed and reformatted the file. Please run 'black' to fix code style issues.


[error] 19-37: Ruff linting errors found and fixed in this file. Please run 'ruff --fix' to apply fixes.

examples/sglang/components/kv_router.py

[error] 1-1: isort formatting failed and modified the file. Please run isort to fix import sorting.


[error] 1-1: Black formatting failed and reformatted the file. Please run 'black' to fix code style issues.

🪛 Ruff (0.11.9)
examples/sglang/components/frontend.py

25-25: utils.protocol.Tokens imported but unused

Remove unused import

(F401)


25-25: utils.protocol.StopConditions imported but unused

Remove unused import

(F401)


25-25: utils.protocol.SamplingOptions imported but unused

Remove unused import

(F401)


26-26: transformers.AutoTokenizer imported but unused

Remove unused import: transformers.AutoTokenizer

(F401)


27-27: json imported but unused

Remove unused import: json

(F401)


30-30: dynamo.sdk.async_on_start imported but unused

Remove unused import

(F401)


30-30: dynamo.sdk.dynamo_context imported but unused

Remove unused import

(F401)

🪛 Pylint (3.3.7)
examples/sglang/components/frontend.py

[refactor] 29-29: Use 'from dynamo import sdk' instead

(R0402)


[refactor] 53-56: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

examples/sglang/components/kv_router.py

[refactor] 57-57: Too few public methods (0/2)

(R0903)


[refactor] 169-169: Too many local variables (18/15)

(R0914)

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)

Comment on lines 38 to 43
from dynamo.llm import (
    ModelType,
    register_llm,
    ZmqKvEventPublisher,
    ZmqKvEventPublisherConfig,
)
Copy link
Contributor


⚠️ Potential issue

Fix import formatting issues.

The pipeline indicates that isort and Black formatting failed. Please run these commands to fix the formatting:

isort examples/sglang/components/worker.py
black examples/sglang/components/worker.py
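
For reference, isort/Black would produce roughly the following for this block (the exact ordering depends on the project's isort configuration):

from dynamo.llm import (
    ModelType,
    ZmqKvEventPublisher,
    ZmqKvEventPublisherConfig,
    register_llm,
)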
🤖 Prompt for AI Agents
In examples/sglang/components/worker.py around lines 38 to 43, the import
statements have formatting issues flagged by isort and Black. To fix this, run
the commands "isort examples/sglang/components/worker.py" followed by "black
examples/sglang/components/worker.py" to automatically sort and format the
imports according to the project's style guidelines.

@faradawn
Copy link
Contributor Author

faradawn commented Jun 16, 2025

Progress update:

Router Workflow

dynamo-run in=http out=worker.generate router=kv -> worker.

Problem: SGLang needs a "PreprocessedRequest" (token IDs, etc.), but dynamo-run did not seem to perform that translation.

Solution: dynamo-run is able to preprocess the request; the fix was simply to reset etcd so it routes to the correct endpoint.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Jun 21, 2025
@faradawn
Copy link
Contributor Author

Dynamo router is working

cd /workspace/examples/sglang              
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
2025-06-21T03:08:46.103Z  INFO utils.resolve_service_config: Running dynamo serve with config: {'Frontend': {'served_model_name': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', 'endpoint': 'dynamo.SGLangWorker.generate', 'port': 8000}, 'SGLangWorker': {'model-path': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', 'served-model-name': 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', 'tp': 1, 'page-size': 16, 'trust-remote-code': True, 'skip-tokenizer-init': True, 'ServiceArgs': {'workers': 1, 'resources': {'gpu': 1}}}}   
╭──────────────── Dynamo Serve ────────────────╮
│ Starting Dynamo service: graphs.agg:Frontend │
╰──────────────────────────────────────────────╯
2025-06-21T03:08:57.406Z  INFO resource._discover_gpus: Discovered 1 GPUs   
2025-06-21T03:08:57.406Z  INFO resource._discover_gpus: Discovered 1 GPUs   
2025-06-21T03:08:57.406Z  INFO allocator.get_resource_envs: Getting resource envs for service Frontend   
2025-06-21T03:08:57.407Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-21T03:08:57.407Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: []   
2025-06-21T03:08:57.407Z  INFO serving.serve_dynamo_graph: Serving dynamo graph with namespace dynamo   
2025-06-21T03:08:57.407Z  INFO serving.serve_dynamo_graph: Clearing namespace dynamo before serving   
2025-06-21T03:08:57.407Z  INFO allocator.get_resource_envs: Getting resource envs for service SGLangWorker   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: GPU requirement found: 1   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: Using configured worker count: 1   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: GPU allocation enabled   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: Local deployment detected. Allocating GPUs for 1 workers of 'SGLangWorker'   
2025-06-21T03:08:57.408Z  INFO allocator.get_resource_envs: GPU 0 (NVIDIA L4): Memory: 21.9GB free / 22.5GB total, Utilization: 0%    
2025-06-21T03:08:57.409Z  INFO allocator.get_resource_envs: Final resource allocation - workers: 1, envs: [{'CUDA_VISIBLE_DEVICES': '0'}]   
2025-06-21T03:08:57.409Z  INFO serving.create_dynamo_watcher: Created watcher for SGLangWorker's in the dynamo namespace   
2025-06-21T03:08:57.415Z  INFO serving.serve_dynamo_graph: Created watcher for Frontend with 1 workers in the dynamo namespace   
2025-06-21T03:08:57.424Z  INFO arbiter._ensure_ioloop: Installing handle_callback_exception to loop   
2025-06-21T03:08:57.424Z  INFO sighandler.__init__: Registering signals...   
2025-06-21T03:08:57.425Z  INFO arbiter.start: Starting master on pid 17661   
2025-06-21T03:08:57.426Z  INFO arbiter.initialize: sockets started   
2025-06-21T03:08:57.472Z  INFO arbiter.start: Arbiter now waiting for commands   
2025-06-21T03:08:57.472Z  INFO watcher._start: dynamo_SGLangWorker started   
2025-06-21T03:08:57.509Z  INFO watcher._start: dynamo_Frontend started   
2025-06-21T03:08:57.510Z  INFO serving.<lambda>: Starting graphs.agg:Frontend (Press CTRL+C to quit)   
2025-06-21T03:09:05.506Z  INFO serve_dynamo.dyn_worker: [SGLangWorker:1] Registering component dynamo/SGLangWorker   
2025-06-21T03:09:05.513Z  INFO serve_dynamo.dyn_worker: [SGLangWorker:1] Created SGLangWorker component   
2025-06-21T03:09:05.513Z  INFO config.as_args: [SGLangWorker:1] Running SGLangWorker with args=['--model-path', 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', '--served-model-name', 'deepseek-ai/DeepSeek-R1-Distill-Llama-8B', '--tp', '1', '--page-size', '16', '--trust-remote-code', '--skip-tokenizer-init']   
2025-06-21T03:09:05.544Z  INFO server_args.__post_init__: [SGLangWorker:1] EPLB is enabled or init_expert_location is provided. ep_dispatch_algorithm is configured.   
2025-06-21T03:09:05.635Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Registering component dynamo/Frontend   
2025-06-21T03:09:05.641Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Created Frontend component   
2025-06-21T03:09:05.642Z  INFO frontend.start_ingress_and_processor: [Frontend:1] Starting HTTP server and processor on port 8000   
2025-06-21T03:09:05.643Z  INFO serve_dynamo.dyn_worker: [Frontend:1] Starting Frontend instance with all registered endpoints   
2025-06-21T03:09:05.643Z  INFO serve_dynamo.dyn_worker: [Frontend:1] No Dynamo endpoints found in service Frontend but keeping service alive   
2025-06-21T03:09:05.666Z  WARN dynamo_run::opt: out=dyn://<path> is deprecated, the path is not used. Please use 'out=dyn'
2025-06-21T03:09:05.678Z  INFO dynamo_run::input::http: Watching for remote model at models
2025-06-21T03:09:05.680Z  INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
2025-06-21T03:09:07.700Z  INFO engine.__init__: [SGLangWorker:1] server_args=ServerArgs(model_path='deepseek-ai/DeepSeek-R1-Distill-Llama-8B', tokenizer_path='deepseek-ai/DeepSeek-R1-Distill-Llama-8B', tokenizer_mode='auto', skip_tokenizer_init=True, load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='deepseek-ai/DeepSeek-R1-Distill-Llama-8B', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, impl='auto', host='127.0.0.1', port=30000, mem_fraction_static=0.871, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=2048, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=16, tp_size=1, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=42943220, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=8, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, 
allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=43657, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None)   
2025-06-21T03:09:18.381Z  INFO model_runner.model_specific_adjustment: Attention backend not set. Use flashinfer backend by default.   
2025-06-21T03:09:18.384Z  INFO model_runner.init_torch_distributed: Init torch distributed begin.   
2025-06-21T03:09:18.633Z  INFO model_runner.init_torch_distributed: Init torch distributed ends. mem usage=0.00 GB   
2025-06-21T03:09:19.682Z  INFO model_runner.load_model: Load weight begin. avail mem=21.73 GB   
2025-06-21T03:09:20.120Z  INFO weight_utils.download_weights_from_hf: Using model weights format ['*.safetensors']   
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:42<00:42, 42.10s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 38.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 39.14s/it]

2025-06-21T03:10:38.846Z  INFO model_runner.load_model: Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=6.71 GB, mem usage=15.02 GB.   
2025-06-21T03:10:38.983Z  INFO memory_pool.__init__: KV Cache is allocated. #tokens: 31984, K size: 1.95 GB, V size: 1.95 GB   
2025-06-21T03:10:38.992Z  INFO model_runner.init_memory_pool: Memory pool end. avail mem=1.69 GB   
2025-06-21T03:10:39.481Z  INFO model_runner.init_cuda_graphs: Capture cuda graph begin. This can take up to several minutes. avail mem=1.18 GB   
2025-06-21T03:10:39.573Z  INFO utils.rank0_log: Capture cuda graph bs [1, 2, 4, 8]   
Capturing batches (avail_mem=1.10 GB): 100%|██████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.03s/it]
2025-06-21T03:10:43.700Z  INFO model_runner.init_cuda_graphs: Capture cuda graph end. Time elapsed: 4.22 s. mem usage=0.11 GB. avail mem=1.07 GB.   
2025-06-21T03:10:43.720Z  INFO scheduler.__init__: max_total_num_tokens=31984, chunked_prefill_size=2048, max_prefill_tokens=16384, max_running_requests=2048, context_len=131072, available_gpu_mem=1.07 GB   
2025-06-21T03:10:43.741Z  INFO worker.__init__: [SGLangWorker:1] === new file SGLangWorker initialized with kv publisher   
2025-06-21T03:10:43.741Z  INFO worker.async_init: [SGLangWorker:1] === worker async_initRegistering LLM for discovery   
2025-06-21T03:10:43.756Z  INFO worker.async_init: [SGLangWorker:1] === new file register llm with kv block size 16, endpoint=<builtins.Endpoint object at 0x7e9d0ac20350>, model_path=deepseek-ai/DeepSeek-R1-Distill-Llama-8B, served_model_name=deepseek-ai/DeepSeek-R1-Distill-Llama-8B   
2025-06-21T03:10:44.583Z  INFO worker.async_init: [SGLangWorker:1] === worker publishing initial metrics   
2025-06-21T03:10:44.585Z  INFO worker.async_init: [SGLangWorker:1] === worker donepublishing initial metrics   
2025-06-21T03:10:44.586Z  INFO worker.create_metrics_publisher_endpoint: [SGLangWorker:1] === me Creating metrics publisher endpoint with primary lease   
2025-06-21T03:10:44.621Z  INFO _core::llm::kv: Starting ZmqKvEventPublisher worker_id=7587887592773332273 kv_block_size=16 zmq_endpoint=tcp://127.0.0.1:5557 zmq_topic=
2025-06-21T03:10:44.622Z  INFO worker.async_init: [SGLangWorker:1] === worker ZMQ publisher created   
2025-06-21T03:10:44.622Z  INFO serve_dynamo.dyn_worker: [SGLangWorker:1] Starting SGLangWorker instance with all registered endpoints   
2025-06-21T03:10:45.753Z  INFO dynamo_llm::kv_router: KV Routing initialized
2025-06-21T03:10:46.601Z  INFO dynamo_llm::discovery::watcher: added model model_name="deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
2025-06-21T03:16:34.719Z  INFO worker.generate: [SGLangWorker:1] === worker generate request=token_ids=[128000, 128011, 12840, 374, 832, 5636, 832, 128012, 128013, 198] stop_conditions=StopConditions(max_tokens=30, stop=None, stop_token_ids_hidden=[128001], min_tokens=None, ignore_eos=None) sampling_options=SamplingOptions(n=None, best_of=None, presence_penalty=None, frequency_penalty=None, repetition_penalty=None, temperature=None, top_p=None, top_k=None, min_p=None, use_beam_search=None, length_penalty=None, seed=None) eos_token_ids=[128001] mdc_sum='70269cc374de5521ce82473ed8ed88e768a1e66d9215e7f295905db4e97d5c23' annotations=[]   
2025-06-21T03:16:34.719Z  INFO engine.async_generate: data_parallel_rank: None   
2025-06-21T03:16:34.721Z  INFO scheduler.log_prefill_stats: Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0   
2025-06-21T03:20:14.346Z  INFO worker.generate: [SGLangWorker:1] === worker generate request=token_ids=[128000, 128011, 12840, 374, 832, 5636, 1403, 128012, 128013, 198] stop_conditions=StopConditions(max_tokens=30, stop=None, stop_token_ids_hidden=[128001], min_tokens=None, ignore_eos=None) sampling_options=SamplingOptions(n=None, best_of=None, presence_penalty=None, frequency_penalty=None, repetition_penalty=None, temperature=None, top_p=None, top_k=None, min_p=None, use_beam_search=None, length_penalty=None, seed=None) eos_token_ids=[128001] mdc_sum='70269cc374de5521ce82473ed8ed88e768a1e66d9215e7f295905db4e97d5c23' annotations=[]   
2025-06-21T03:20:14.347Z  INFO engine.async_generate: data_parallel_rank: None   
2025-06-21T03:20:14.348Z  INFO scheduler.log_prefill_stats: Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0   
2025-06-21T03:20:15.012Z  INFO scheduler.log_decode_stats: Decode batch. #running-req: 1, #token: 32, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.07, #queue-req: 0   

Client side:

(venv) root@l4:/workspace/examples/sglang# curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
    {
        "role": "user",
        "content": "what is one plus one"
    }
    ],
    "stream":false,
    "max_tokens": 30
  }'
{"id":"chatcmpl-bfc6afa5-769e-4ea4-b43f-9d1cb81cf0fe","choices":[{"index":0,"message":{"content":"I need to determine the answer to the question, which is \"what is one plus one.\"\n\nFirst, I identify the numbers involved in the problem","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":null,"logprobs":null}],"created":1750475794,"model":"deepseek-ai/DeepSeek-R1-Distill-Llama-8B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":29,"total_tokens":0,"prompt_tokens_details":null,"completion_tokens_details":null}}(venv) root@l4:/workspace/examples/sglang# 
(venv) root@l4:/workspace/examples/sglang# 
(venv) root@l4:/workspace/examples/sglang# 
(venv) root@l4:/workspace/examples/sglang# 
(venv) root@l4:/workspace/examples/sglang# curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "messages": [
    {
        "role": "user",
        "content": "what is one plus two"
    }
    ],
    "stream":false,
    "max_tokens": 30
  }'
{"id":"chatcmpl-3edbaddb-baa5-4d06-bc85-6449af93020b","choices":[{"index":0,"message":{"content":"First, I need to understand the user's question, which is \"What is one plus two?\"\n\nTo solve this, I'll start by identifying","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":null,"logprobs":null}],"created":1750476014,"model":"deepseek-ai/DeepSeek-R1-Distill-Llama-8B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":10,"completion_tokens":29,"total_tokens":0,"prompt_tokens_details":null,"completion_tokens_details":null}}(venv) root@l4:/workspace/examples/sglang# 

@ishandhanani
Copy link
Contributor

Closing as this is now a duplicate of #1605


Labels

external-contribution Pull request is from an external contributor feat size/L
