Skip to content

Conversation

robscott
Copy link
Member

@robscott robscott commented Oct 1, 2022

This is also closely related to #3293.

/sig network
/cc @andrewsykim @bowei @dcbw @danwinship
/assign @thockin @wojtek-t

@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Oct 1, 2022
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 1, 2022
@robscott robscott force-pushed the topology-ga-update branch from 27b5fd9 to 77a9ffb Compare October 1, 2022 01:22
@robscott robscott mentioned this pull request Oct 1, 2022
20 tasks
@robscott robscott force-pushed the topology-ga-update branch from 77a9ffb to f858900 Compare October 1, 2022 01:24
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robscott - two comments, other than that it LGTM

- Autoscaling and Scheduling SIGs have a plan to provide zone aware autoscaling
(and scheduling) that allows users to proportionally distribute endpoints
across zones.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commenting here, since can't comment on unchanged line:

L563: Two things that are missing to me in tests are:

(a) the strategy-level test (like this: https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282 ) that is ensuring that dropping fields is working as expected

(b) Disablement test in endpoint-slice controller, sth like:

  • EPS with hints exist, EPS-controller is started with FG disabled, ensure that all EPS are rewritten without hints now

L581: Please update (I hope that some manual upgrade test were run, so please briefly describe it in the text - what setup and what you did)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wojtek-t! Addressed both of these with latest push (and added to graduation criteria).


To make this feature easier to understand and use, the EndpointSlice controller
will publish events for a Service to describe if the feature has been enabled,
and if not, why not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a ServiceStatus field? (The samples below really sound like a Condition.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today the endpoint-slice controller doesn't have RBAC access to Service, and ServiceStatus doesn't have a Conditions field. I am not against either, per se, but it's worth considering. It DOES feel like a condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree this feels like a condition. I wanted to add that, but I also didn't want this to be dependent on a change to Service API.

this feature. This would happen if a user created custom EndpointSlices _and_
enabled Topology Aware Hints _and_ failed to set Hints on their custom
EndpointSlices. This seems very unlikely, but is mentioned here for the sake of
completeness.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EndpointSlice controller sees events even for EndpointSlices that it doesn't manage. It could check for this situation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that we could, but I'm worried about loosening our filtering here: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/endpointslice/endpointslice_controller.go#L357-L361. It seems like a place that we could easily break things if we were not careful. I'm not sure that this particular edge case justifies that risk.

- KEP Merged: February 2021
- Alpha release: Kubernetes 1.21
- Beta Release: Kubernetes 1.23
- Feature Gate on-by default, feature available by default: 1.24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally beta implies "feature gate on by default"... did we just declare beta in 1.23 but leave it disabled? Was that an accident or by design? Might be worth explaining just for clarity...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely an accident, and realized a few days too late :(. Will add a note.

Comment on lines +521 to +534
- Autoscaling and Scheduling SIGs have a plan to provide zone aware autoscaling
(and scheduling) that allows users to proportionally distribute endpoints
across zones.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you're claiming we're going to make GA in 1.26, that would imply that these plans already exist, in which case can you link to them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this process works. I'd thought that graduation criteria had to be met prior to making the change in upstream k8s OSS, not before the KEP updates including the graduation criteria merged. I will say that it will be quite a stretch to get this to GA in 1.26, I just want to make sure we're making real progress towards these goals in the 1.26 release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, true, the criteria don't have to be met yet, but it seems a little sketchy to claim we're planning to make GA for 1.26 when that depends on other teams doing work which isn't started yet...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship I'm not tied to this getting to GA in 1.26, that does seem fairly unlikely at our current pace. Would you feel better if we set the GA target date as 1.27 in this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if that needs to block this - Dan, you want to make the call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if we don't claim 1.26 now then we're blocked from doing it later? So if there's any chance it might happen then I guess we want to say that...

@thockin
Copy link
Member

thockin commented Oct 3, 2022

For GA, what is the behavior when the service.kubernetes.io/topology-aware-routing annotation is not specified - will we assume "Default" or do users still need to opt-in on every Service?

@robscott
Copy link
Member Author

robscott commented Oct 4, 2022

For GA, what is the behavior when the service.kubernetes.io/topology-aware-routing annotation is not specified - will we assume "Default" or do users still need to opt-in on every Service?

I'm not confident enough in this feature to default it to on quite yet. I think I'd rather hide that behind a separate feature gate and see it go through a separate graduation process.

@danwinship
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2022
@thockin
Copy link
Member

thockin commented Oct 5, 2022

The SIG says:

/approve

@robscott robscott force-pushed the topology-ga-update branch from f858900 to 937a742 Compare October 5, 2022 18:56
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2022
@robscott robscott force-pushed the topology-ga-update branch from 937a742 to a30d99d Compare October 5, 2022 18:58
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 5, 2022
@danwinship
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2022
@robscott robscott force-pushed the topology-ga-update branch from a30d99d to 723a7d9 Compare October 6, 2022 05:00
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2022
@robscott robscott force-pushed the topology-ga-update branch from 723a7d9 to 1cd2e5e Compare October 6, 2022 05:03
Per-Service enablement/disablement is covered in depth and feature gate
enablement and disablement will be covered before the feature graduates to GA.
In addition, manual testing covering combinations of
upgrade->downgrade->upgrade cycles will be completed prior to GA graduation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too late now, but we should have done that before Beta (which enabled that by default).
Anyway - let's that that testing before graduating to GA and update the KEP with the results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, will do. I'm pretty sure I did this kind of manual testing, but it was more than a year ago and I clearly failed to record any results in the GEP.

@wojtek-t
Copy link
Member

wojtek-t commented Oct 6, 2022

/lgtm
/approve PRR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: robscott, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 6, 2022
@k8s-ci-robot k8s-ci-robot merged commit 1264d3d into kubernetes:master Oct 6, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/network Categorizes an issue or PR as relevant to SIG Network. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants