Conversation

Contributor

@kthui kthui commented Jun 7, 2025

Overview:

Track unreachable instances at the router level, and skip the downed instances for future requests before ETCD can discover the failure.

Details:

The goal of this change is to drastically reduce the number of request failures caused by downed instance(s) before the availability can be updated via the ETCD lease.
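The mechanism can be sketched as follows. This is a minimal, hypothetical illustration using only the standard library (the real `client.rs` uses an async mutex and the actual `Instance` type; the names `report_instance_down` and `instances_avail` mirror the methods described later in this PR, but the rest is simplified):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Roughly how long an ETCD lease takes to expire; after this window
/// ETCD itself reports the instance as gone, so the local entry is stale.
const ETCD_LEASE_TTL: Duration = Duration::from_secs(10);

struct Client {
    /// instance_id -> when it was last reported down
    inhibited: Mutex<HashMap<u64, Instant>>,
}

impl Client {
    fn new() -> Self {
        Self { inhibited: Mutex::new(HashMap::new()) }
    }

    /// Called by the router when a request to `instance_id` fails.
    fn report_instance_down(&self, instance_id: u64) {
        self.inhibited.lock().unwrap().insert(instance_id, Instant::now());
    }

    /// Filter the full instance list down to those not recently reported down.
    fn instances_avail(&self, all: &[u64]) -> Vec<u64> {
        let now = Instant::now();
        let mut inhibited = self.inhibited.lock().unwrap();
        // Drop entries older than the lease TTL: by then ETCD has caught up
        // and the ETCD-sourced instance list is authoritative again.
        inhibited.retain(|_, t| now.duration_since(*t) <= ETCD_LEASE_TTL);
        all.iter().copied().filter(|id| !inhibited.contains_key(id)).collect()
    }
}
```

After one failed request marks an instance down, every later call to `instances_avail` excludes it until the TTL elapses, which is what turns 32 failures into 1 in the test below.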

Test setup:

The Client sends a request to the Processor every 0.1 seconds, for a total of 100 requests over 10 seconds. The Processor round-robins the received requests to the 2 Workers. Each Worker returns a response to the Processor immediately after receiving a request, and the Processor likewise returns responses to the Client.

To demonstrate the reduced number of request failures, 1 of the 2 Workers is stopped while the Client is sending requests.

Before the change, the number of failed requests from the Processor back to the Client is 32 / 100, because the router does not know one of the Workers has stopped before ETCD is able to update via its lease.

2025-06-13T21:01:20.515Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:20.621Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
...   
2025-06-13T21:01:23.630Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:23.734Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:23.735Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
2025-06-13T21:01:23.836Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:23.940Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:23.940Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
2025-06-13T21:01:24.042Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:24.146Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:24.146Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
...   
2025-06-13T21:01:30.006Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.110Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.110Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
2025-06-13T21:01:30.211Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.315Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.419Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.523Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.627Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:01:30.730Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9

After the change, the number of failed requests is reduced to 1 / 100, because the failed Worker is temporarily removed from the round-robin rotation after its first failed request.

2025-06-13T21:09:06.804Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
...
2025-06-13T21:09:08.875Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9
2025-06-13T21:09:08.875Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
2025-06-13T21:09:08.875Z ERROR processor.generate: [Processor:1] Processor error: no responders: no responders   
2025-06-13T21:09:08.976Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9   
...   
2025-06-13T21:09:17.054Z  INFO processor.generate: [Processor:1] Processor received: 1 2 3 4 5 6 7 8 9

The Processor Worker Client example is from this commit.

Note: The actual number of failures before the change may vary depending on when the Worker is stopped during the 10-second test and how quickly ETCD discovers the stopped Worker through its lease. The number of failures after the change should always be 1, because only 1 failure is needed to stop new requests from routing to the stopped Worker until ETCD can update.
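The single-failure behavior described in the note can be sketched as a synchronous toy version of the router's selection loop (hypothetical code; the real `push_router.rs` is async and applies the same fault-tolerant selection to round robin, random, and direct modes):

```rust
use std::collections::HashSet;

/// Pick instances round-robin, skipping those in `down`, and mark an
/// instance down on its first failure so later requests avoid it.
fn send_round_robin<F>(
    instances: &[u64],
    down: &mut HashSet<u64>,
    counter: &mut usize,
    mut send: F,
) -> Result<String, String>
where
    F: FnMut(u64) -> Result<String, String>,
{
    let avail: Vec<u64> = instances
        .iter()
        .copied()
        .filter(|id| !down.contains(id))
        .collect();
    if avail.is_empty() {
        return Err("no instances available".to_string());
    }
    let id = avail[*counter % avail.len()];
    *counter += 1;
    match send(id) {
        Ok(resp) => Ok(resp),
        Err(e) => {
            // First failure inhibits the instance; subsequent requests skip it.
            down.insert(id);
            Err(e)
        }
    }
}
```

Running this with one of two workers unreachable produces exactly one failure, regardless of how many requests follow, matching the 1 / 100 result above.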

Where should the reviewer start?

Start with client.rs to see how the downed instances are tracked in a hash map, then move to the updates in push_router.rs for down detection and skipping in the routing algorithms.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

  • New Features

    • Added automatic instance inhibition: instances reported as down are temporarily excluded from routing for a short period.
    • Introduced methods to view currently available instances and to manually report an instance as down.
  • Refactor

    • Improved routing logic with unified fault-tolerant handling, automatically excluding down instances and retrying as needed for round robin, random, and direct routing modes.
  • Bug Fixes

    • Enhanced error handling to better manage unavailable or invalid instances during routing.

@kthui kthui self-assigned this Jun 7, 2025

copy-pr-bot bot commented Jun 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@github-actions github-actions bot added the feat label Jun 7, 2025
@kthui kthui force-pushed the jacky-ft-instance-inhibit branch from c76b633 to e939769 Compare June 9, 2025 19:31
@kthui kthui changed the title feat: FT downed worker instance tracking and request migration feat: FT downed worker instance tracking and skipping Jun 13, 2025
kthui added 5 commits June 12, 2025 18:16
* Add instance inhibition to Client

* [PoC] Retry a failed request if the worker is unreachable

* Remove request data copy

* Refactor instance inhibit and restart algorithm
@kthui kthui force-pushed the jacky-ft-instance-inhibit branch from 1ef7325 to e4d5b0a Compare June 13, 2025 01:28
Contributor

@ryanolson ryanolson left a comment


added comments. let's discuss and we can work toward approval. we don't have to address everything in this PR.

@kthui kthui force-pushed the jacky-ft-instance-inhibit branch from 12bab91 to 54f3b07 Compare June 13, 2025 21:25
@kthui kthui marked this pull request as ready for review June 13, 2025 21:28
Contributor

coderabbitai bot commented Jun 13, 2025

Walkthrough

The changes introduce fault-tolerant instance selection to the client and push router components. The Client now tracks and inhibits instances reported as down, exposing new async methods to report and filter available instances. The PushRouter routing logic is refactored to use these filtered instances and to automatically report failed instances.

Changes

File(s) Change Summary
lib/runtime/src/component/client.rs Added async instance inhibition tracking and filtering, with new methods for reporting and retrieving available instances.
lib/runtime/src/pipeline/network/egress/push_router.rs Refactored routing methods to use async instance filtering and a unified, fault-tolerant instance selection method.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant PushRouter
    participant Client
    participant Etcd

    Caller->>PushRouter: send_request(request)
    PushRouter->>Client: instances_avail()
    Client->>Etcd: fetch_instances()
    Etcd-->>Client: [instances]
    Client-->>PushRouter: [filtered_instances]
    PushRouter->>PushRouter: select_instance()
    PushRouter->>Instance: send(request)
    alt No responders error
        PushRouter->>Client: report_instance_down(instance_id)
    end
    Instance-->>PushRouter: response
    PushRouter-->>Caller: response

Poem

In the warren, instances hop with care,
Some go missing—now we’re aware!
With mutex and map, we keep track of the down,
And route requests smartly, all over town.
Faults don’t scare us, we bounce and we bound—
The system’s more robust, with less run-around!
🐇✨


📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54f3b07 and 4be44f8.

📒 Files selected for processing (1)
  • lib/runtime/src/pipeline/network/egress/push_router.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/runtime/src/pipeline/network/egress/push_router.rs
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
lib/runtime/src/component/client.rs (2)

137-169: Consider Instant + configurable TTL & separate cleanup pass

instances_avail:

  1. Uses SystemTime (wall-clock). Clock jumps can incorrectly expire or extend inhibitions. std::time::Instant is monotonic and better suited.
  2. Hard-codes ETCD_LEASE_TTL = 10 s. Making this configurable (field or const generic) eases tuning per deployment.
  3. Mutates inhibited while iterating with filter_map, which is fine but slightly obscures intent. Doing a short upfront retain of fresh entries, then filtering, is clearer and keeps mutation separate from read path.

Optional but recommended:

use std::time::{Duration, Instant};

const DEFAULT_TTL: Duration = Duration::from_secs(10);

pub async fn instances_avail(&self) -> Vec<Instance> {
    let now = Instant::now();
    let mut inhibited = self.instance_inhibited.lock().await;

    // purge stale entries first
    inhibited.retain(|_, ts| now.duration_since(*ts) <= DEFAULT_TTL);

    self.instances()
        .into_iter()
        .filter(|inst| !inhibited.contains_key(&inst.id()))
        .collect()
}

You’d store Instant timestamps in the map:

inhibited.insert(instance_id, Instant::now());

This tightens correctness and slightly improves readability.


178-182: Minor: Avoid silent overwrite when reporting the same instance rapidly

inhibited.insert(instance_id, now) overwrites any existing timestamp, potentially resetting the TTL indefinitely if the same failing instance keeps returning quickly.

If the goal is to keep the earliest failure time, prefer entry().or_insert(now); if the goal is to extend the TTL, keep as-is. Clarify intent with a comment or adjust accordingly.
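The two options the reviewer contrasts differ as follows (a toy illustration with plain integer timestamps standing in for `Instant`s; the helper names are invented for this sketch):

```rust
use std::collections::HashMap;

/// `insert` semantics: every report overwrites the timestamp,
/// so the TTL clock restarts on each failure.
fn report_extend(map: &mut HashMap<u64, u64>, id: u64, now: u64) {
    map.insert(id, now);
}

/// `entry().or_insert` semantics: the first timestamp is kept,
/// so the TTL runs from the earliest failure and a repeatedly
/// failing instance cannot keep resetting its own timer.
fn report_keep_earliest(map: &mut HashMap<u64, u64>, id: u64, now: u64) {
    map.entry(id).or_insert(now);
}
```

Which behavior is correct depends on intent: restarting the TTL keeps a still-failing instance inhibited longer, while keeping the earliest timestamp guarantees the inhibition expires within one TTL of the first failure.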

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99e67e6 and 54f3b07.

📒 Files selected for processing (2)
  • lib/runtime/src/component/client.rs (5 hunks)
  • lib/runtime/src/pipeline/network/egress/push_router.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: Build and Test - vllm

@kthui kthui requested a review from ryanolson June 13, 2025 23:05
@kthui kthui enabled auto-merge (squash) June 13, 2025 23:41
@kthui kthui requested a review from oandreeva-nv June 13, 2025 23:41
@oandreeva-nv
Contributor

Generally, looks good. Couple questions:

  • What if 2 requests are sent in parallel to the same instance, one marks it as down and the other is lucky enough to progress. Is this situation feasible?
  • Do we clean up stale entries in instance_inhibited?

Contributor

@oandreeva-nv oandreeva-nv left a comment


synced offline. I think we can merge now and continue working on suggested optimizations

@kthui kthui merged commit a09ca3e into main Jun 13, 2025
11 checks passed
@kthui kthui deleted the jacky-ft-instance-inhibit branch June 13, 2025 23:58
grahamking added a commit that referenced this pull request Jul 16, 2025
Update the list of active instances in a background thread, and use
[arc-swap](https://crates.io/crates/arc-swap) to make the new list
active. This is broadly similar to updating an atomic pointer.

This makes the router's `client.random(request)` and
`client.round_robin(request)` lock-free and usually wait-free.

Lock was introduced here: #1424
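The follow-up commit's pattern can be approximated with only the standard library (the real change uses the arc-swap crate, which makes the load itself lock-free; here a brief `RwLock` protects only the pointer swap, and routing never holds a lock while using the list — all names are illustrative):

```rust
use std::sync::{Arc, RwLock};

/// Holds the current instance list behind a swappable Arc.
/// Readers take a cheap snapshot; a background thread publishes new lists.
struct InstanceList {
    current: RwLock<Arc<Vec<u64>>>,
}

impl InstanceList {
    fn new(initial: Vec<u64>) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Snapshot: clone the Arc, then route without holding any lock.
    fn snapshot(&self) -> Arc<Vec<u64>> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Called from the background refresh thread with a freshly built list
    /// (e.g. the ETCD instances minus the currently inhibited ones).
    fn publish(&self, new_list: Vec<u64>) {
        *self.current.write().unwrap() = Arc::new(new_list);
    }
}
```

With arc-swap, the `RwLock<Arc<_>>` becomes an `ArcSwap` and `snapshot` becomes a `load_full()` call, so `client.random(request)` and `client.round_robin(request)` never block on the list update.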