From 95c6f590d352cbbc6228accbe3217856e597195c Mon Sep 17 00:00:00 2001 From: Andrew Sy Kim Date: Wed, 12 May 2021 10:58:49 -0400 Subject: [PATCH 1/2] kep-1669: update alpha milestones for v1.22 Signed-off-by: Andrew Sy Kim --- keps/prod-readiness/sig-network/1669.yaml | 3 + .../README.md | 281 +++++++++++++++++- .../kep.yaml | 23 +- 3 files changed, 298 insertions(+), 9 deletions(-) create mode 100644 keps/prod-readiness/sig-network/1669.yaml diff --git a/keps/prod-readiness/sig-network/1669.yaml b/keps/prod-readiness/sig-network/1669.yaml new file mode 100644 index 00000000000..72ee6b4058b --- /dev/null +++ b/keps/prod-readiness/sig-network/1669.yaml @@ -0,0 +1,3 @@ +kep-number: 1669 +alpha: + approver: "@wojtek-t" diff --git a/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/README.md b/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/README.md index 4a6b4dd4c25..ea2eadc8563 100644 --- a/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/README.md +++ b/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/README.md @@ -20,6 +20,13 @@ - [Alpha](#alpha) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) @@ -28,10 +35,10 @@ ## Release Signoff Checklist - [X] Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] KEP approvers have approved 
the KEP status as `implementable` -- [ ] Design details are appropriately documented -- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input -- [ ] Graduation criteria is in place +- [X] KEP approvers have approved the KEP status as `implementable` +- [X] Design details are appropriately documented +- [X] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input +- [X] Graduation criteria is in place - [ ] "Implementation History" section is up-to-date for milestone - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] - [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes @@ -117,7 +124,6 @@ kube-proxy unit tests: #### E2E Tests E2E tests will be added to validate that no traffic is dropped during a rolling update for a Service with ExternalTrafficPolicy=Local. -This test may be marked "Flaky" as the behavior is largely also dependant on the cloud provider's loadbalancer. All existing E2E tests for Services should continue to pass. @@ -125,8 +131,9 @@ All existing E2E tests for Services should continue to pass. #### Alpha -* kube-proxy internally tracks the terminating condition of an endpoint. -* feature is only enabled if the feature gate `EndpointSliceTerminatingCondition` is on. +* kube-proxy internally tracks the `terminating` and `serving` condition from EndpointSlice +* kube-proxy falls back to terminating endpoints if and only if they are the only available endpoints. +* feature is only enabled if the feature gate `ProxyTerminatingEndpoints` is on. * unit tests in kube-proxy. ### Upgrade / Downgrade Strategy @@ -141,6 +148,266 @@ This would either happen if a version of the control plane was not aware of the There's not much risk involved as the worse case scenario is falling back to existing behavior. 
+## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: ProxyTerminatingEndpoints + - Components depending on the feature gate: kube-proxy + +###### Does enabling the feature change any default behavior? + +Yes, when externalTrafficPolicy=Local and there are only terminating endpoints, +kube-proxy will route traffic to those endpoints. Before this change, kube-proxy +dropped this traffic instead. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. + +###### What happens if we reenable the feature if it was previously rolled back? + +kube-proxy will no longer drop traffic if only terminating endpoints are available. + +###### Are there any tests for feature enablement/disablement? + +Yes, there will be unit tests in kube-proxy with the feature gate enabled and disabled. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout fail? Can it impact already running workloads? + + + +TBD for beta. + +###### What specific metrics should inform a rollback? + + + +TBD for beta. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +TBD for beta. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +TBD for beta. + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +TBD for beta. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +TBD for beta. 
+ +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs? + + + +TBD for beta. + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +TBD for beta. + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +TBD for beta. + +### Scalability + + + +TBD for beta. + +###### Will enabling / using this feature result in any new API calls? + + + +TBD for beta. + +###### Will enabling / using this feature result in introducing new API types? + + + +TBD for beta. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? 
+ ## Implementation History - [x] 2020-04-23: KEP accepted as implementable for v1.19 diff --git a/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/kep.yaml b/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/kep.yaml index 6868fc9da53..2e586daa551 100644 --- a/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/kep.yaml +++ b/keps/sig-network/1669-graceful-termination-local-external-traffic-policy/kep.yaml @@ -11,6 +11,8 @@ reviewers: - "@smarterclayton" approvers: - "@thockin" +prr-approvers: + - "@johnbelamaric" creation-date: 2020-04-07 last-updated: 2020-04-07 status: implementable @@ -18,5 +20,22 @@ see-also: - "/keps/sig-network/1672-tracking-terminating-endpoints/README.md" - https://github.com/kubernetes/kubernetes/issues/85643 -latest-milestone: "0.0" -stage: "alpha" +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.22" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.22" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: ProxyTerminatingEndpoints + components: + - kube-proxy +disable-supported: true From d248614d4d69ffa6f59e26251f7c9e10a1d08f70 Mon Sep 17 00:00:00 2001 From: Andrew Sy Kim Date: Wed, 12 May 2021 10:59:08 -0400 Subject: [PATCH 2/2] kep-1672: update beta milestones for v1.22 Signed-off-by: Andrew Sy Kim --- keps/prod-readiness/sig-network/1672.yaml | 3 + .../README.md | 158 +++++++++++++++++- .../kep.yaml | 22 ++- 3 files changed, 173 insertions(+), 10 deletions(-) create mode 100644 keps/prod-readiness/sig-network/1672.yaml diff --git a/keps/prod-readiness/sig-network/1672.yaml b/keps/prod-readiness/sig-network/1672.yaml new file mode 100644 index 00000000000..c6ded615224 --- /dev/null +++ b/keps/prod-readiness/sig-network/1672.yaml @@ -0,0 +1,3 @@ +kep-number: 1672 +alpha: + approver: "@wojtek-t" diff --git a/keps/sig-network/1672-tracking-terminating-endpoints/README.md b/keps/sig-network/1672-tracking-terminating-endpoints/README.md index a879d95f04c..913eff7cd29 100644 --- a/keps/sig-network/1672-tracking-terminating-endpoints/README.md +++ b/keps/sig-network/1672-tracking-terminating-endpoints/README.md @@ -15,8 +15,16 @@ - [Test Plan](#test-plan) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) + - [Beta](#beta) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) - [Implementation 
History](#implementation-history) - [Drawbacks](#drawbacks) @@ -91,7 +99,7 @@ and possibly more depending on how many times readiness changes during terminati ## Design Details -To track whether an endpoint is terminating, a `terminating` field would be added as part of +To track whether an endpoint is terminating, a `terminating` and `serving` field would be added as part of the `EndpointCondition` type in the EndpointSlice API. ```go @@ -100,14 +108,25 @@ type EndpointConditions struct { // ready indicates that this endpoint is prepared to receive traffic, // according to whatever system is managing the endpoint. A nil value // indicates an unknown state. In most cases consumers should interpret this - // unknown state as ready. + // unknown state as ready. For compatibility reasons, ready should never be + // "true" for terminating endpoints. // +optional Ready *bool `json:"ready,omitempty" protobuf:"bytes,1,name=ready"` - // terminating indicates if this endpoint is terminating. Consumers should assume a - // nil value indicates the endpoint is not terminating. + // serving is identical to ready except that it is set regardless of the + // terminating state of endpoints. This condition should be set to true for + // a ready endpoint that is terminating. If nil, consumers should defer to + // the ready condition. This field can be enabled with the + // EndpointSliceTerminatingCondition feature gate. // +optional - Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,2,name=terminating"` + Serving *bool `json:"serving,omitempty" protobuf:"bytes,2,name=serving"` + + // terminating indicates that this endpoint is terminating. A nil value + // indicates an unknown state. Consumers should interpret this unknown state + // to mean that the endpoint is not terminating. This field can be enabled + // with the EndpointSliceTerminatingCondition feature gate. 
+ // +optional + Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,3,name=terminating"` } ``` @@ -116,7 +135,8 @@ NOTE: A nil value for `Terminating` indicates that the endpoint is not terminati Updates to endpointslice controller: * include pods with a deletion timestamp in endpointslice * any pod with a deletion timestamp will have condition.terminating = true -* allow endpoint ready condition to change during termination +* any terminating pod must have condition.ready = false. +* the new `serving` condition is set based on pod readiness regardless of terminating state. ### Test Plan @@ -134,10 +154,16 @@ E2E tests: #### Alpha -* EndpointSlice API includes `Terminating` condition. -* `Terminating` condition can only be set if feature gate `EndpointSliceTerminatingCondition` is enabled. +* EndpointSlice API includes `Terminating` and `Serving` condition. +* `Terminating` and `Serving` condition can only be set if feature gate `EndpointSliceTerminatingCondition` is enabled. * Unit tests in endpointslice controller and API validation/strategy. +#### Beta + +* Integration API tests exercising the `terminating` and `serving` conditions. +* `EndpointSliceTerminatingCondition` is enabled by default. +* Consensus on scalability implications resulting from additional EndpointSlice writes with approval from sig-scalability. + ### Upgrade / Downgrade Strategy Since this is an addition to the EndpointSlice API, the upgrade/downgrade strategy will follow that @@ -148,9 +174,125 @@ of the [EndpointSlice API work](/keps/sig-network/20190603-endpointslices/README Since this is an addition to the EndpointSlice API, the version skew strategy will follow that of the [EndpointSlice API work](/keps/sig-network/20190603-endpointslices/README.md). +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? 
+ +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: EndpointSliceTerminatingCondition + - Components depending on the feature gate: kube-apiserver and kube-controller-manager + +###### Does enabling the feature change any default behavior? + +Yes, terminating endpoints are now included as part of the EndpointSlice API. The `ready` condition of a terminating endpoint will always be `false` to ensure consumers do not send traffic to terminating endpoints unless the new conditions `serving` and `terminating` are checked. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes. On rollback, terminating endpoints will no longer be included in EndpointSlice and the `terminating` and `serving` conditions will not be set. + +###### What happens if we reenable the feature if it was previously rolled back? + +EndpointSlice will continue to have the `terminating` and `serving` conditions set and terminating endpoints will be added to the EndpointSlice in its next sync. + +###### Are there any tests for feature enablement/disablement? + +Yes, there will be strategy API unit tests validating whether the new API fields are allowed based on the feature gate. + +### Rollout, Upgrade and Rollback Planning + +###### How can a rollout fail? Can it impact already running workloads? + +If there are consumers of EndpointSlice that do not check the `ready` condition, then they may unexpectedly start sending traffic to terminating endpoints. +It is assumed that almost all consumers of EndpointSlice check the `ready` condition prior to allowing traffic to a pod. + +###### What specific metrics should inform a rollback? + +Application-level traffic metrics indicating packet loss or elevated error rates. + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + +Not yet, but manual upgrade and rollback testing will be done prior to graduating the feature to Beta. 
+ +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No. + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? + +The conditions will always be set for terminating pods, but consumers may choose to ignore them. It is up to consumers of the API to provide metrics +on how the new conditions are being used. + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +Metrics will be added for total endpoints with the `serving` and `terminating` conditions set. + +###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs? + +N/A + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + +N/A + +### Dependencies + +###### Does this feature depend on any specific services running in the cluster? + +N/A + +### Scalability + +###### Will enabling / using this feature result in any new API calls? + +Yes, there will be more writes to EndpointSlice when: +* a pod starts termination +* a pod's readiness changes during termination + +###### Will enabling / using this feature result in introducing new API types? + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +Yes, it will increase the size of EndpointSlice by adding two boolean fields for each endpoint. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +The network programming latency SLO might be impacted due to additional writes to EndpointSlice. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? 
+ +More writes to EndpointSlice could result in more resource usage from etcd disk IO and network bandwidth for all watchers. + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +EndpointSlice conditions will get stale. + +###### What are other known failure modes? + +* Consumers of EndpointSlice that do not check the `ready` condition may unexpectedly use terminating endpoints. + +###### What steps should be taken if SLOs are not being met to determine the problem? + +* Disable the feature gate +* Check if consumers of EndpointSlice are using the `serving` or `terminating` condition +* Check etcd disk usage + ## Implementation History - [x] 2020-04-23: KEP accepted as implementable for v1.19 +- [x] 2020-07-01: initial PR with alpha implementation merged for v1.20 +- [x] 2021-05-12: KEP accepted as implementable for v1.22 ## Drawbacks diff --git a/keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml b/keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml index 68632764f0f..71ff59ca18c 100644 --- a/keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml +++ b/keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml @@ -19,5 +19,23 @@ see-also: - /kep/sig-network/20190603-EndpointSlice-API.md replaces: [] -latest-milestone: "0.0" -stage: "alpha" +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.22" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.20" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: EndpointSliceTerminatingCondition + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true
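
The condition semantics described in the two README hunks of this series can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the endpointslice controller or kube-proxy code: the `conditions` struct, `conditionsForPod`, and `usableEndpoints` are hypothetical names, the real API uses `*bool` fields, and choosing only endpoints that are both `serving` and `terminating` for the fallback is an assumption about how "falls back to terminating endpoints" would be applied.

```go
package main

import "fmt"

// conditions is a simplified, hypothetical stand-in for EndpointConditions.
// The real API type uses *bool for each field; plain bools keep the sketch short.
type conditions struct {
	Ready       bool
	Serving     bool
	Terminating bool
}

// conditionsForPod derives the three conditions from pod state as the KEP-1672
// text describes: terminating is set when the pod has a deletion timestamp,
// serving follows pod readiness regardless of termination, and ready is never
// true for a terminating endpoint.
func conditionsForPod(podReady, hasDeletionTimestamp bool) conditions {
	return conditions{
		Ready:       podReady && !hasDeletionTimestamp,
		Serving:     podReady,
		Terminating: hasDeletionTimestamp,
	}
}

// usableEndpoints sketches the KEP-1669 fallback: ready endpoints are always
// preferred, and terminating endpoints are used only when no ready endpoint
// exists. Assumption: only terminating endpoints that are still serving
// qualify for the fallback.
func usableEndpoints(eps []conditions) []conditions {
	var ready, fallback []conditions
	for _, ep := range eps {
		switch {
		case ep.Ready:
			ready = append(ready, ep)
		case ep.Serving && ep.Terminating:
			fallback = append(fallback, ep)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	return fallback
}

func main() {
	// Two terminating pods: one still passing readiness checks, one not.
	eps := []conditions{
		conditionsForPod(true, true),
		conditionsForPod(false, true),
	}
	fmt.Println("usable endpoints:", len(usableEndpoints(eps)))
}
```

Keeping `ready` false for terminating endpoints means consumers that only look at `ready` behave as before, while consumers that opt in can read `serving` and `terminating` to implement the fallback.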