-
Notifications
You must be signed in to change notification settings - Fork 248
A100: Client-side WRR slow start configuration #498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
A100: Client-side WRR slow start configuration #498
Conversation
|
CC @dfawley |
|
@dfawley did we get a chance to check this |
atollena
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be a nice addition (it's something we've had requests for internally at Datadog). Generally the approach sounds good to me. Ideally we would be able to support this mechanism for other load balancers supported by gRPC and Envoy (lr & rr). But given that lr & rr do not implement endpoint weights, it seems difficult to fit the feature in those balancers.
|
@atollena I have addressed your comments regarding the proposal, let me know if I need to make any changes in it |
|
@ejona86 can I start with the implementation of the feature? |
atollena
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to see an implementation of this in one of the 3 languages. I wouldn't wait for the gRFC to be approved before starting, especially since the result can be used as a custom, private LB policy before it is merged to upstream-supported gRPC implementations.
|
|
||
| ## Implementation | ||
|
|
||
| This will be implemented in all languages C++, Java, and Go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good place to add the actual implementation of your choice when you have it ready (it can be a draft or closed PR that you can re-open once the gRFC has been approved).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the implementation for Java, would love if you could review the changes :)
10be215 to
43b016a
Compare
|
@atollena I have opened the respective MR at the envoy side as well, they will merge it once the proposal is approved at our end. For this proposal I just need to handle the case for the timing part and make the respective changes in the proposal once done I will assign for re-review. |
0a3ea16 to
e490283
Compare
atollena
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
dfawley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly seems fine to me, I just have some nits. Also, since I don't see a slow start config in the envoy WRR policy proto, I'm curious if you've (or someone else has?) sent a proposal to add it there already.
|
|
||
| ## Metrics | ||
|
|
||
| The following metric will be exposed to help monitor the slow start behavior: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How important is this metric to include? Should we consider not adding it until it's needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally in most of the cases the slow_start_window will be applicable only for a short duration like 1-2 mins and if everything works well, it will not be called again.
I just thought that we can implement it in disabled mode, if someone wants to keeps a track of it.
We can decide this and I will make the respective changes in the proposal
|
@dfawley envoyproxy/envoy#40090 this is the MR I have created at envoy's end to include the slow_start_config in proto. However it requires this grpc proposal to be approved first before it is merged to the master. Then we will need to sync across repo and then we are good to implement this feature. I have added sample MR with the java implementation as well, please do check it as well. |
markdroth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing this up!
Signed-off-by: anurag.ag <[email protected]>
Signed-off-by: anurag.ag <[email protected]>
Co-authored-by: Antoine Tollenaere <[email protected]>
Co-authored-by: Doug Fawley <[email protected]>
Signed-off-by: anurag.ag <[email protected]>
Signed-off-by: anurag.ag <[email protected]>
…osal. Adds links, clarifies slow start implementation details, and aligns with linked A24 proposal. Signed-off-by: anurag.ag <[email protected]>
e495afb to
c346e9a
Compare
|
@markdroth I have made the respective changes, let me know if there are more changes required |
markdroth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good! Only a few comments remaining.
For future reference, please don't force-push to a PR after reviews have started, because that makes it very difficult for a reviewer to tell what's changed since their last review pass.
Please let me know if you have any questions. Thanks!
…rmatting, and changes `aggression` type to `double`. Signed-off-by: anurag.ag <[email protected]>
… formatting fixes. Signed-off-by: anurag.ag <[email protected]>
markdroth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern (and a minor change will be needed if we can't keep the same envoy PR number):
|
@ejona86 all the reviewers has accepted the review, how do we go from here? |
|
@ejona86 @atollena @markdroth the envoy MR is merged can we now close this? |
|
Doug and I have approved, but Eric hasn't yet. We need to get his approval before merging. |
|
@ejona86 can you have a look into this? |
|
@ejona86 a quick follow up on the proposal, as the envoy side MR is also merged. I will be starting the implementations in the upcoming month now |
Implements slow start functionality as described in [gRFC A100](grpc/proposal#498) I still need to do some work on tests and do an e2e experiment with xDS to ensure that things are working as expected, but the core implementation is in a state that can be reviewed. At this phase I'm mainly looking for feedback on the overall approach and design and in the test code because I am worried that I might be missing some test utility that would make testing with `time` easier. Pending Items before this is ready to be merged: - [ ] Add more unit tests for slow start config cases - [ ] Manually verify with an xDS experiment that slow start is working as expected Release Notes * weightedroundrobin: Implements slow start functionality as described in [gRFC A100](grpc/proposal#498) Signed-off-by: sotiris <[email protected]>
Implements slow start functionality as described in [gRFC A100](grpc/proposal#498) I still need to do some work on tests and do an e2e experiment with xDS to ensure that things are working as expected, but the core implementation is in a state that can be reviewed. At this phase I'm mainly looking for feedback on the overall approach and design and in the test code because I am worried that I might be missing some test utility that would make testing with `time` easier. Pending Items before this is ready to be merged: - [ ] Add more unit tests for slow start config cases - [ ] Manually verify with an xDS experiment that slow start is working as expected Release Notes * weightedroundrobin: Implements slow start functionality as described in [gRFC A100](grpc/proposal#498) Signed-off-by: sotiris <[email protected]>
ejona86
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nicely written. My READY comment will probably cause some discussion. I see this as very close to ready.
| introduced to track when an endpoint transitioned to ready state. This timestamp is separate from the existing | ||
| `non_empty_since` timestamp used for the blackout period. | ||
|
|
||
| The `ready_since` timestamp is set when an endpoint transitions from a non-ready state (e.g., CONNECTING, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-starting slow start each time an endpoint becomes READY seems much more aggressive than Envoy, and doesn't seem to serve much purpose. I'd expect options more like:
- When an endpoint is added (since UNHEALTHY endpoints are filtered, this will re-trigger slow start when it becomes HEALTHY; dunno if there's interactions with stateful session affinity plumbing). This seems very close to Envoy ("If no active healthcheck is configured per cluster, immediately upon joining the cluster.")
- The first time an endpoint is READY, and the first time it goes READY after having been TRANSIENT_FAILURE. This seems pretty normal for gRPC
It would seem bad for MAX_CONNECTION_AGE to keep triggering slow-start, and MAX_CONNECTION_AGE will trigger CONNECTING.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that makes sense, so should I add the second one? That slow start should trigger when an endpoint becomes ready for the first time or transitioning from TRANSIENT_FAILURE.
Will also have to make changes with respect to the variable name, something like added_since?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need input from @markdroth and @dfawley. We discussed it (on the 7th?) and realized it was in weak conflict with the blackout period behavior. I'm fine with only "fixing" this one, or changing both this and blackout period behavior. I also have a sneaky suspicion that Envoy implemented blackout period differently than we did. But overall there's several different views on the subject and we need to align.
|
|
||
| Please note that this metric will start as experimental and off by default, and we will consider stabilizing it in the | ||
| future. | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no "Temporary environment variable protection" section. How will the implementation process go here? Seems the changes should be behind a variable until at least someone reports it is working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was referring to the current metrics implementations, they are turned off by default and if someone enables them via lets say opentelemetry they get enabled. There is no environment vairable, although this is not very well documented yet for how a user can enable these metrics. I also struggled and got to know about them when I was going through the code base in java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not concerned about the metrics, but the feature in general. The missing section would normally have been here.
No description provided.