Conversation

Contributor

@kthui kthui commented Jun 7, 2025

Overview:

Track unreachable instances at the router level, and skip the downed instances for future requests before ETCD can discover the failure.

Details:

The goal of this change is to drastically reduce the number of request failures caused by downed instance(s) before the availability can be updated via the ETCD lease.
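The mechanism can be sketched as follows. This is a minimal, hypothetical illustration using only the standard library (the real `client.rs` uses an async mutex and the actual `Instance` type; the names `report_instance_down` and `instances_avail` mirror the methods described later in this PR, but the rest is simplified):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Roughly how long an ETCD lease takes to expire; after this window
/// ETCD itself reports the instance as gone, so the local entry is stale.
const ETCD_LEASE_TTL: Duration = Duration::from_secs(10);

struct Client {
    /// instance_id -> when it was last reported down
    inhibited: Mutex<HashMap<u64, Instant>>,
}

impl Client {
    fn new() -> Self {
        Self { inhibited: Mutex::new(HashMap::new()) }
    }

    /// Called by the router when a request to `instance_id` fails.
    fn report_instance_down(&self, instance_id: u64) {
        self.inhibited.lock().unwrap().insert(instance_id, Instant::now());
    }

    /// Filter the full instance list down to those not recently reported down.
    fn instances_avail(&self, all: &[u64]) -> Vec<u64> {
        let now = Instant::now();
        let mut inhibited = self.inhibited.lock().unwrap();
        // Drop entries older than the lease TTL: by then ETCD has caught up
        // and the ETCD-sourced instance list is authoritative again.
        inhibited.retain(|_, t| now.duration_since(*t) <= ETCD_LEASE_TTL);
        all.iter().copied().filter(|id| !inhibited.contains_key(id)).collect()
    }
}
```

After one failed request marks an instance down, every later call to `instances_avail` excludes it until the TTL elapses, which is what turns 32 failures into 1 in the test below.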

Test setup:

The Client sends a request to the Processor every 0.1 seconds, for a total of 100 requests over 10 seconds. The Processor round-robins the received requests to the 2 Workers. Each Worker returns a response to the Processor immediately after receiving a request, and the Processor likewise returns responses to the Client.

To demonstrate the reduced number of request failures, 1 of the 2 Workers is stopped while the Client is sending requests.

Before the change, the number of failed requests from the Processor back to the Client is 32 / 100, because the router does not know one of the Workers has stopped before ETCD is able to update via its lease.

2025-06-13T21:01:20.515Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:20.621Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
...   
2025-06-13T21:01:23.630Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:23.734Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:23.735Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
2025-06-13T21:01:23.836Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:23.940Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:23.940Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
2025-06-13T21:01:24.042Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:24.146Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:24.146Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
...   
2025-06-13T21:01:30.006Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.110Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.110Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
2025-06-13T21:01:30.211Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.315Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.419Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.523Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.627Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.730Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9

After the change, the number of failed requests is reduced to 1 / 100, because the failed Worker is temporarily removed from the round-robin rotation after its first failed request.

2025-06-13T21:09:06.804Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
...
2025-06-13T21:09:08.875Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9
2025-06-13T21:09:08.875Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:09:08.875Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
2025-06-13T21:09:08.976Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
...   
2025-06-13T21:09:17.054Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9

The Processor Worker Client example is from this commit.

Note: The actual number of failures before the change may vary depending on when the Worker is stopped during the 10-second test and how quickly ETCD discovers the stopped Worker through its lease. The number of failures after the change should always be 1, because only 1 failure is needed to stop new requests from routing to the stopped Worker until ETCD can update.
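The single-failure behavior described in the note can be sketched as a synchronous toy version of the router's selection loop (hypothetical code; the real `push_router.rs` is async and applies the same fault-tolerant selection to round robin, random, and direct modes):

```rust
use std::collections::HashSet;

/// Pick instances round-robin, skipping those in `down`, and mark an
/// instance down on its first failure so later requests avoid it.
fn send_round_robin<F>(
    instances: &[u64],
    down: &mut HashSet<u64>,
    counter: &mut usize,
    mut send: F,
) -> Result<String, String>
where
    F: FnMut(u64) -> Result<String, String>,
{
    let avail: Vec<u64> = instances
        .iter()
        .copied()
        .filter(|id| !down.contains(id))
        .collect();
    if avail.is_empty() {
        return Err("no instances available".to_string());
    }
    let id = avail[*counter % avail.len()];
    *counter += 1;
    match send(id) {
        Ok(resp) => Ok(resp),
        Err(e) => {
            // First failure inhibits the instance; subsequent requests skip it.
            down.insert(id);
            Err(e)
        }
    }
}
```

Running this with one of two workers unreachable produces exactly one failure, regardless of how many requests follow, matching the 1 / 100 result above.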

Where should the reviewer start?

Start with client.rs to see how the downed instances are tracked in a hash map, then move to the updates in push_router.rs for down detection and skipping in the routing algorithms.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

  • New Features

    • Added automatic instance inhibition: instances reported as down are temporarily excluded from routing for a short period.
    • Introduced methods to view currently available instances and to manually report an instance as down.
  • Refactor

    • Improved routing logic with unified fault-tolerant handling, automatically excluding down instances and retrying as needed for round robin, random, and direct routing modes.
  • Bug Fixes

    • Enhanced error handling to better manage unavailable or invalid instances during routing.

@kthui kthui self-assigned this Jun 7, 2025

copy-pr-bot bot commented Jun 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@github-actions github-actions bot added the feat label Jun 7, 2025
@kthui kthui force-pushed the jacky-ft-instance-inhibit branch from c76b633 to e939769 Compare June 9, 2025 19:31
@kthui kthui changed the title feat: FT downed worker instance tracking and request migration feat: FT downed worker instance tracking and skipping Jun 13, 2025
kthui added 5 commits June 12, 2025 18:16
* Add instance inhibition to Client

* [PoC] Retry a failed request if the worker is unreachable

* Remove request data copy

* Refactor instance inhibit and restart algorithm
@kthui kthui force-pushed the jacky-ft-instance-inhibit branch from 1ef7325 to e4d5b0a Compare June 13, 2025 01:28
Contributor

@ryanolson ryanolson left a comment


added comments. let's discuss and we can work toward approval. we don't have to address everything in this PR.

@kthui kthui force-pushed the jacky-ft-instance-inhibit branch from 12bab91 to 54f3b07 Compare June 13, 2025 21:25
@kthui kthui marked this pull request as ready for review June 13, 2025 21:28
Contributor

coderabbitai bot commented Jun 13, 2025

Walkthrough

The changes introduce fault-tolerant instance selection to the client and push router components. The Client now tracks and inhibits instances reported as down, exposing new async methods to report and filter available instances. The PushRouter routing logic is refactored to use these filtered instances and to automatically report failed instances.

Changes

File(s) Change Summary
lib/runtime/src/component/client.rs Added async instance inhibition tracking and filtering, with new methods for reporting and retrieving available instances.
lib/runtime/src/pipeline/network/egress/push_router.rs Refactored routing methods to use async instance filtering and a unified, fault-tolerant instance selection method.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant PushRouter
    participant Client
    participant Etcd

    Caller->>PushRouter: send_request(request)
    PushRouter->>Client: instances_avail()
    Client->>Etcd: fetch_instances()
    Etcd-->>Client: [instances]
    Client-->>PushRouter: [filtered_instances]
    PushRouter->>PushRouter: select_instance()
    PushRouter->>Instance: send(request)
    alt No responders error
        PushRouter->>Client: report_instance_down(instance_id)
    end
    Instance-->>PushRouter: response
    PushRouter-->>Caller: response

Poem

In the warren, instances hop with care,
Some go missing—now we’re aware!
With mutex and map, we keep track of the down,
And route requests smartly, all over town.
Faults don’t scare us, we bounce and we bound—
The system’s more robust, with less run-around!
🐇✨


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54f3b07 and 4be44f8.

📒 Files selected for processing (1)
  • lib/runtime/src/pipeline/network/egress/push_router.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/runtime/src/pipeline/network/egress/push_router.rs
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
lib/runtime/src/component/client.rs (2)

137-169: Consider Instant + configurable TTL & separate cleanup pass

instances_avail:

  1. Uses SystemTime (wall-clock). Clock jumps can incorrectly expire or extend inhibitions. std::time::Instant is monotonic and better suited.
  2. Hard-codes ETCD_LEASE_TTL = 10 s. Making this configurable (field or const generic) eases tuning per deployment.
  3. Mutates inhibited while iterating with filter_map, which is fine but slightly obscures intent. Doing a short upfront retain of fresh entries, then filtering, is clearer and keeps mutation separate from read path.

Optional but recommended:

use std::time::{Duration, Instant};

const DEFAULT_TTL: Duration = Duration::from_secs(10);

pub async fn instances_avail(&self) -> Vec<Instance> {
    let now = Instant::now();
    let mut inhibited = self.instance_inhibited.lock().await;

    // purge stale entries first
    inhibited.retain(|_, ts| now.duration_since(*ts) <= DEFAULT_TTL);

    self.instances()
        .into_iter()
        .filter(|inst| !inhibited.contains_key(&inst.id()))
        .collect()
}

You’d store Instant timestamps in the map:

inhibited.insert(instance_id, Instant::now());

This tightens correctness and slightly improves readability.


178-182: Minor: Avoid silent overwrite when reporting the same instance rapidly

inhibited.insert(instance_id, now) overwrites any existing timestamp, potentially resetting the TTL indefinitely if the same failing instance keeps returning quickly.

If the goal is to keep the earliest failure time, prefer entry().or_insert(now); if the goal is to extend the TTL, keep as-is. Clarify intent with a comment or adjust accordingly.
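The two options the reviewer contrasts differ as follows (a toy illustration with plain integer timestamps standing in for `Instant`s; the helper names are invented for this sketch):

```rust
use std::collections::HashMap;

/// `insert` semantics: every report overwrites the timestamp,
/// so the TTL clock restarts on each failure.
fn report_extend(map: &mut HashMap<u64, u64>, id: u64, now: u64) {
    map.insert(id, now);
}

/// `entry().or_insert` semantics: the first timestamp is kept,
/// so the TTL runs from the earliest failure and a repeatedly
/// failing instance cannot keep resetting its own timer.
fn report_keep_earliest(map: &mut HashMap<u64, u64>, id: u64, now: u64) {
    map.entry(id).or_insert(now);
}
```

Which behavior is correct depends on intent: restarting the TTL keeps a still-failing instance inhibited longer, while keeping the earliest timestamp guarantees the inhibition expires within one TTL of the first failure.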

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99e67e6 and 54f3b07.

📒 Files selected for processing (2)
  • lib/runtime/src/component/client.rs (5 hunks)
  • lib/runtime/src/pipeline/network/egress/push_router.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: Build and Test - vllm

@kthui kthui requested a review from ryanolson June 13, 2025 23:05
@kthui kthui enabled auto-merge (squash) June 13, 2025 23:41
@kthui kthui requested a review from oandreeva-nv June 13, 2025 23:41
@oandreeva-nv
Contributor

Generally, looks good. Couple questions:

  • What if 2 requests are sent in parallel to the same instance, one marks it as down and the other is lucky enough to progress. Is this situation feasible?
  • Do we clean up stale entries in instance_inhibited?

Contributor

@oandreeva-nv oandreeva-nv left a comment


synced offline. I think we can merge now and continue working on suggested optimizations

@kthui kthui merged commit a09ca3e into main Jun 13, 2025
11 checks passed
@kthui kthui deleted the jacky-ft-instance-inhibit branch June 13, 2025 23:58
grahamking added a commit that referenced this pull request Jul 16, 2025
Update the list of active instances in a background thread, and use
[arc-swap](https://crates.io/crates/arc-swap) to make the new list
active. This is broadly similar to updating an atomic pointer.

This makes the router's `client.random(request)` and
`client.round_robin(request)` lock-free and usually wait-free.

Lock was introduced here: #1424
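The follow-up commit's pattern can be approximated with only the standard library (the real change uses the arc-swap crate, which makes the load itself lock-free; here a brief `RwLock` protects only the pointer swap, and routing never holds a lock while using the list — all names are illustrative):

```rust
use std::sync::{Arc, RwLock};

/// Holds the current instance list behind a swappable Arc.
/// Readers take a cheap snapshot; a background thread publishes new lists.
struct InstanceList {
    current: RwLock<Arc<Vec<u64>>>,
}

impl InstanceList {
    fn new(initial: Vec<u64>) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Snapshot: clone the Arc, then route without holding any lock.
    fn snapshot(&self) -> Arc<Vec<u64>> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Called from the background refresh thread with a freshly built list
    /// (e.g. the ETCD instances minus the currently inhibited ones).
    fn publish(&self, new_list: Vec<u64>) {
        *self.current.write().unwrap() = Arc::new(new_list);
    }
}
```

With arc-swap, the `RwLock<Arc<_>>` becomes an `ArcSwap` and `snapshot` becomes a `load_full()` call, so `client.random(request)` and `client.round_robin(request)` never block on the list update.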