Skip to content

Conversation

@anuragagarwal561994
Copy link

No description provided.

@ejona86
Copy link
Member

ejona86 commented Jun 2, 2025

CC @dfawley

@anuragagarwal561994
Copy link
Author

@dfawley did we get a chance to check this

Copy link
Contributor

@atollena atollena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a nice addition (it's something we've had requests for internally at Datadog). Generally the approach sounds good to me. Ideally we would be able to support this mechanism for other load balancers supported by gRPC and Envoy (lr & rr). But given that lr & rr do not implement endpoint weights, it seems difficult to fit the feature in those balancers.

@anuragagarwal561994
Copy link
Author

@atollena I have addressed your comments regarding the proposal, let me know if I need to make any changes in it

@anuragagarwal561994
Copy link
Author

@ejona86 @atollena I have addressed the new comments, requesting review for any futher modifications needed

@anuragagarwal561994
Copy link
Author

@ejona86 can I start with the implementation of the feature?

@anuragagarwal561994
Copy link
Author

@ejona86 @atollena will it be possible to review the proposal this week so that I can begin with the implementation of the feature

Copy link
Contributor

@atollena atollena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be useful to see an implementation of this in one of the 3 languages. I wouldn't wait for the gRFC to be approved before starting, especially since the result can be used as a custom, private LB policy before it is merged to upstream-supported gRPC implementations.


## Implementation

This will be implemented in all languages C++, Java, and Go.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good place to add the actual implementation of your choice when you have it ready (it can be a draft or closed PR that you can re-open once the gRFC has been approved).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the implementation for Java, would love if you could review the changes :)

grpc/grpc-java#12200

@anuragagarwal561994 anuragagarwal561994 changed the title Client-side WRR slow start configuration A100: Client-side WRR slow start configuration Jul 1, 2025
@anuragagarwal561994 anuragagarwal561994 force-pushed the wrrslowstart branch 2 times, most recently from 10be215 to 43b016a Compare July 1, 2025 12:39
@anuragagarwal561994
Copy link
Author

@atollena I have opened the respective MR at the envoy side as well, they will merge it once the proposal is approved at our end. For this proposal I just need to handle the case for the timing part and make the respective changes in the proposal once done I will assign for re-review.

@anuragagarwal561994
Copy link
Author

@atollena @ejona86 so I have created a rough implementation, not updated the test cases as of now.

I have also updated the proto on the envoy side, but to merge it we require to approve this proposal, then sync the proto may be in a separate MR.

How should we go about this?

Copy link
Contributor

@atollena atollena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mostly seems fine to me, I just have some nits. Also, since I don't see a slow start config in the envoy WRR policy proto, I'm curious if you've (or someone else has?) sent a proposal to add it there already.


## Metrics

The following metric will be exposed to help monitor the slow start behavior:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How important is this metric to include? Should we consider not adding it until it's needed?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally in most of the cases the slow_start_window will be applicable only for a short duration like 1-2 mins and if everything works well, it will not be called again.

I just thought that we can implement it in disabled mode, if someone wants to keeps a track of it.

We can decide this and I will make the respective changes in the proposal

@anuragagarwal561994
Copy link
Author

@dfawley envoyproxy/envoy#40090 this is the MR I have created at envoy's end to include the slow_start_config in proto.

However it requires this grpc proposal to be approved first before it is merged to the master.

Then we will need to sync across repo and then we are good to implement this feature. I have added sample MR with the java implementation as well, please do check it as well.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this up!

@anuragagarwal561994
Copy link
Author

@markdroth I have made the respective changes, let me know if there are more changes required

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good! Only a few comments remaining.

For future reference, please don't force-push to a PR after reviews have started, because that makes it very difficult for a reviewer to tell what's changed since their last review pass.

Please let me know if you have any questions. Thanks!

…rmatting, and changes `aggression` type to `double`.

Signed-off-by: anurag.ag <[email protected]>
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@anuragagarwal561994
Copy link
Author

@ejona86 @atollena @dfawley can you help me review and approve this proposal?

@anuragagarwal561994
Copy link
Author

@ejona86 @dfawley can you help review and merge the proposal?

Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern (and a minor change will be needed if we can't keep the same envoy PR number):

@anuragagarwal561994
Copy link
Author

@ejona86 all the reviewers has accepted the review, how do we go from here?

@anuragagarwal561994
Copy link
Author

@ejona86 @atollena @markdroth the envoy MR is merged can we now close this?

@markdroth
Copy link
Member

Doug and I have approved, but Eric hasn't yet. We need to get his approval before merging.

@anuragagarwal561994
Copy link
Author

@ejona86 can you have a look into this?

@anuragagarwal561994
Copy link
Author

@ejona86 a quick follow up on the proposal, as the envoy side MR is also merged. I will be starting the implementations in the upcoming month now

davinci26 added a commit to davinci26/grpc-go that referenced this pull request Nov 1, 2025
Implements slow start functionality as described in
[gRFC A100](grpc/proposal#498)

I still need to do some work on tests and do an e2e experiment with xDS to
ensure that things are working as expected, but the core implementation is
in a state that can be reviewed.

At this phase I'm mainly looking for feedback on the overall approach and design and
in the test code because I am worried that I might be missing some test utility that would
make testing with `time` easier.

Pending Items before this is ready to be merged:
- [ ] Add more unit tests for slow start config cases
- [ ] Manually verify with an xDS experiment that slow start is working as expected

Release Notes
* weightedroundrobin: Implements slow start functionality as described in
  [gRFC A100](grpc/proposal#498)

Signed-off-by: sotiris <[email protected]>
davinci26 added a commit to davinci26/grpc-go that referenced this pull request Nov 1, 2025
Implements slow start functionality as described in
[gRFC A100](grpc/proposal#498)

I still need to do some work on tests and do an e2e experiment with xDS to
ensure that things are working as expected, but the core implementation is
in a state that can be reviewed.

At this phase I'm mainly looking for feedback on the overall approach and design and
in the test code because I am worried that I might be missing some test utility that would
make testing with `time` easier.

Pending Items before this is ready to be merged:
- [ ] Add more unit tests for slow start config cases
- [ ] Manually verify with an xDS experiment that slow start is working as expected

Release Notes
* weightedroundrobin: Implements slow start functionality as described in
  [gRFC A100](grpc/proposal#498)

Signed-off-by: sotiris <[email protected]>
Copy link
Member

@ejona86 ejona86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely written. My READY comment will probably cause some discussion. I see this as very close to ready.

introduced to track when an endpoint transitioned to ready state. This timestamp is separate from the existing
`non_empty_since` timestamp used for the blackout period.

The `ready_since` timestamp is set when an endpoint transitions from a non-ready state (e.g., CONNECTING,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-starting slow start each time an endpoint becomes READY seems much more aggressive than Envoy, and doesn't seem to serve much purpose. I'd expect options more like:

  1. When an endpoint is added (since UNHEALTHY endpoints are filtered, this will re-trigger slow start when it becomes HEALTHY; dunno if there's interactions with stateful session affinity plumbing). This seems very close to Envoy ("If no active healthcheck is configured per cluster, immediately upon joining the cluster.")
  2. The first time an endpoint is READY, and the first time it goes READY after having been TRANSIENT_FAILURE. This seems pretty normal for gRPC

It would seem bad for MAX_CONNECTION_AGE to keep triggering slow-start, and MAX_CONNECTION_AGE will trigger CONNECTING.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that makes sense, so should I add the second one? That slow start should trigger when an endpoint becomes ready for the first time or transitioning from TRANSIENT_FAILURE.

Will also have to make changes with respect to the variable name, something like added_since?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need input from @markdroth and @dfawley. We discussed it (on the 7th?) and realized it was in weak conflict with the blackout period behavior. I'm fine with only "fixing" this one, or changing both this and blackout period behavior. I also have a sneaky suspicion that Envoy implemented blackout period differently than we did. But overall there's several different views on the subject and we need to align.


Please note that this metric will start as experimental and off by default, and we will consider stabilizing it in the
future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no "Temporary environment variable protection" section. How will the implementation process go here? Seems the changes should be behind a variable until at least someone reports it is working.

Copy link
Author

@anuragagarwal561994 anuragagarwal561994 Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to the current metrics implementations, they are turned off by default and if someone enables them via lets say opentelemetry they get enabled. There is no environment vairable, although this is not very well documented yet for how a user can enable these metrics. I also struggled and got to know about them when I was going through the code base in java

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not concerned about the metrics, but the feature in general. The missing section would normally have been here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants