-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Updating Topology Aware Hints KEP with GA Graduation Criteria #3572
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
27b5fd9
to
77a9ffb
Compare
77a9ffb
to
f858900
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robscott - two comments, other than that it LGTM
- Autoscaling and Scheduling SIGs have a plan to provide zone aware autoscaling | ||
(and scheduling) that allows users to proportionally distribute endpoints | ||
across zones. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commenting here, since can't comment on unchanged line:
L563: Two things that are missing to me in tests are:
(a) the strategy-level test (like this: https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 ) that is ensuring that dropping fields is working as expected
(b) Disablement test in endpoint-slice controller, sth like:
- EPS with hints exist, EPS-controller is started with FG disabled, ensure that all EPS are rewritten without hints now
L581: Please update (I hope that some manual upgrade test were run, so please briefly describe it in the text - what setup and what you did)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wojtek-t! Addressed both of these with latest push (and added to graduation criteria).
|
||
To make this feature easier to understand and use, the EndpointSlice controller | ||
will publish events for a Service to describe if the feature has been enabled, | ||
and if not, why not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not a ServiceStatus
field? (The samples below really sound like a Condition.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Today the endpoint-slice controller doesn't have RBAC access to Service, and ServiceStatus doesn't have a Conditions
field. I am not against either, per se, but it's worth considering. It DOES feel like a condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree this feels like a condition. I wanted to add that, but I also didn't want this to be dependent on a change to Service API.
this feature. This would happen if a user created custom EndpointSlices _and_ | ||
enabled Topology Aware Hints _and_ failed to set Hints on their custom | ||
EndpointSlices. This seems very unlikely, but is mentioned here for the sake of | ||
completeness. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The EndpointSlice controller sees events even for EndpointSlices that it doesn't manage. It could check for this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that we could, but I'm worried about loosening our filtering here: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpointslice/endpointslice_controller.go#L357-L361. It seems like a place that we could easily break things if we were not careful. I'm not sure that this particular edge case justifies that risk.
- KEP Merged: February 2021 | ||
- Alpha release: Kubernetes 1.21 | ||
- Beta Release: Kubernetes 1.23 | ||
- Feature Gate on-by default, feature available by default: 1.24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally beta implies "feature gate on by default"... did we just declare beta in 1.23 but leave it disabled? Was that an accident or by design? Might be worth explaining just for clarity...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely an accident, and realized a few days too late :(. Will add a note.
- Autoscaling and Scheduling SIGs have a plan to provide zone aware autoscaling | ||
(and scheduling) that allows users to proportionally distribute endpoints | ||
across zones. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that you're claiming we're going to make GA in 1.26, that would imply that these plans already exist, in which case can you link to them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how this process works. I'd thought that graduation criteria had to be met prior to making the change in upstream k8s OSS, not before the KEP updates including the graduation criteria merged. I will say that it will be quite a stretch to get this to GA in 1.26, I just want to make sure we're making real progress towards these goals in the 1.26 release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, true, the criteria don't have to be met yet, but it seems a little sketchy to claim we're planning to make GA for 1.26 when that depends on other teams doing work which isn't started yet...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danwinship I'm not tied to this getting to GA in 1.26, that does seem fairly unlikely at our current pace. Would you feel better if we set the GA target date as 1.27 in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if that needs to block this - Dan, you want to make the call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if we don't claim 1.26 now then we're blocked from doing it later? So if there's any chance it might happen then I guess we want to say that...
For GA, what is the behavior when the |
I'm not confident enough in this feature to default it to on quite yet. I think I'd rather hide that behind a separate feature gate and see it go through a separate graduation process. |
/lgtm |
The SIG says: /approve |
f858900
to
937a742
Compare
937a742
to
a30d99d
Compare
/lgtm |
a30d99d
to
723a7d9
Compare
723a7d9
to
1cd2e5e
Compare
Per-Service enablement/disablement is covered in depth and feature gate | ||
enablement and disablement will be covered before the feature graduates to GA. | ||
In addition, manual testing covering combinations of | ||
upgrade->downgrade->upgrade cycles will be completed prior to GA graduation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too late now, but we should have done that before Beta (which enabled that by default).
Anyway - let's that that testing before graduating to GA and update the KEP with the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, will do. I'm pretty sure I did this kind of manual testing, but it was more than a year ago and I clearly failed to record any results in the GEP.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: robscott, thockin, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Updates Topology Aware Hints KEP with GA Graduation Criteria
Issue link: Topology Aware Routing #2433
This is also closely related to #3293.
/sig network
/cc @andrewsykim @bowei @dcbw @danwinship
/assign @thockin @wojtek-t