kep-1672: update beta milestones for v1.22

andrewsykim · andrewsykim · commit 9387262b08f6 · 2021-05-13T17:49:37.000-04:00
Signed-off-by: Andrew Sy Kim <kim.andrewsy@gmail.com>
diff --git a/keps/prod-readiness/sig-network/1672.yaml b/keps/prod-readiness/sig-network/1672.yaml
@@ -0,0 +1,5 @@
+kep-number: 1672
+alpha:
+  approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"
diff --git a/keps/sig-network/1672-tracking-terminating-endpoints/README.md b/keps/sig-network/1672-tracking-terminating-endpoints/README.md
@@ -15,8 +15,16 @@
   - [Test Plan](#test-plan)
   - [Graduation Criteria](#graduation-criteria)
     - [Alpha](#alpha)
+    - [Beta](#beta)
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
   - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Drawbacks](#drawbacks)
 <!-- /toc -->
@@ -91,7 +99,7 @@ and possibly more depending on how many times readiness changes during terminati
 
 ## Design Details
 
-To track whether an endpoint is terminating, a `terminating` field would be added as part of
+To track whether an endpoint is terminating, `terminating` and `serving` fields would be added as part of
 the `EndpointCondition` type in the EndpointSlice API.
 
 ```go
@@ -100,14 +108,25 @@ type EndpointConditions struct {
     // ready indicates that this endpoint is prepared to receive traffic,
     // according to whatever system is managing the endpoint. A nil value
     // indicates an unknown state. In most cases consumers should interpret this
-    // unknown state as ready.
+    // unknown state as ready. For compatibility reasons, ready should never be
+    // "true" for terminating endpoints.
     // +optional
     Ready *bool `json:"ready,omitempty" protobuf:"bytes,1,name=ready"`
 
-    // terminating indicates if this endpoint is terminating. Consumers should assume a
-    // nil value indicates the endpoint  is not terminating.
+    // serving is identical to ready except that it is set regardless of the
+    // terminating state of endpoints. This condition should be set to true for
+    // a ready endpoint that is terminating. If nil, consumers should defer to
+    // the ready condition. This field can be enabled with the
+    // EndpointSliceTerminatingCondition feature gate.
     // +optional
-    Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,2,name=terminating"`
+    Serving *bool `json:"serving,omitempty" protobuf:"bytes,2,name=serving"`
+
+    // terminating indicates that this endpoint is terminating. A nil value
+    // indicates an unknown state. Consumers should interpret this unknown state
+    // to mean that the endpoint is not terminating. This field can be enabled
+    // with the EndpointSliceTerminatingCondition feature gate.
+    // +optional
+    Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,3,name=terminating"`
 }
 ```
 
@@ -116,7 +135,8 @@ NOTE: A nil value for `Terminating` indicates that the endpoint is not terminati
 Updates to endpointslice controller:
 * include pods with a deletion timestamp in endpointslice
 * any pod with a deletion timestamp will have condition.terminating = true
-* allow endpoint ready condition to change during termination
+* any terminating pod must have condition.ready = false.
+* the new `serving` condition is set based on pod readiness regardless of terminating state.
 
 ### Test Plan
 
@@ -134,10 +154,16 @@ E2E tests:
 
 #### Alpha
 
-* EndpointSlice API includes `Terminating` condition.
-* `Terminating` condition can only be set if feature gate `EndpointSliceTerminatingCondition` is enabled.
+* EndpointSlice API includes `Terminating` and `Serving` conditions.
+* `Terminating` and `Serving` conditions can only be set if the feature gate `EndpointSliceTerminatingCondition` is enabled.
 * Unit tests in endpointslice controller and API validation/strategy.
 
+#### Beta
+
+* Integration API tests exercising the `terminating` and `serving` conditions.
+* `EndpointSliceTerminatingCondition` is enabled by default.
+* Consensus on scalability implications resulting from additional EndpointSlice writes with approval from sig-scalability.
+
 ### Upgrade / Downgrade Strategy
 
 Since this is an addition to the EndpointSlice API, the upgrade/downgrade strategy will follow that
@@ -148,9 +174,123 @@ of the [EndpointSlice API work](/keps/sig-network/20190603-endpointslices/README
 Since this is an addition to the EndpointSlice API, the version skew strategy will follow that
 of the [EndpointSlice API work](/keps/sig-network/20190603-endpointslices/README.md).
 
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: EndpointSliceTerminatingCondition
+  - Components depending on the feature gate: kube-apiserver and kube-controller-manager
+
+###### Does enabling the feature change any default behavior?
+
+Yes, terminating endpoints are now included in the EndpointSlice API. The `ready` condition of a terminating endpoint will always be `false`, so consumers do not send traffic to terminating endpoints unless they explicitly check the new `serving` and `terminating` conditions.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes. On rollback, terminating endpoints will no longer be included in EndpointSlice and the `terminating` and `serving` conditions will not be set.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+EndpointSlice will continue to have the `terminating` and `serving` conditions set.
+
+###### Are there any tests for feature enablement/disablement?
+
+Yes, there will be integration and e2e tests validating whether EndpointSlice contains endpoints for pods that are terminating.
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout fail? Can it impact already running workloads?
+
+If there are consumers of EndpointSlice that do not check the `ready` condition, then they may unexpectedly start sending traffic to terminating endpoints.
+It is assumed that almost all consumers of EndpointSlice check the `ready` condition prior to allowing traffic to a pod.
+
+###### What specific metrics should inform a rollback?
+
+Application-level metrics indicating increased packet loss or error rates.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+Not yet, but manual upgrade and rollback testing will be done prior to graduating the feature to Beta.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+No.
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+The conditions will always be set for terminating pods, but consumers may choose to ignore them. It is up to consumers of the API to provide metrics
+on how the new conditions are being used.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+N/A
+
+###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
+
+N/A
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+N/A
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+N/A
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+Yes, there will be more writes to EndpointSlice for every pod when it begins terminating.
+
+###### Will enabling / using this feature result in introducing new API types?
+
+No.
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+No.
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+Yes, it will increase the size of EndpointSlice by adding two boolean fields for each endpoint.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+No.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+More writes to EndpointSlice could result in more resource usage from etcd disk IO and network bandwidth for all watchers.
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+EndpointSlice conditions will get stale.
+
+###### What are other known failure modes?
+
+* Consumers of EndpointSlice that do not check the `ready` condition may unexpectedly use terminating endpoints.
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+* Disable the feature gate
+* Check if consumers of EndpointSlice are using the `serving` or `terminating` condition
+* Check etcd disk usage
+
 ## Implementation History
 
 - [x] 2020-04-23: KEP accepted as implementable for v1.19
+- [x] 2020-07-01: initial PR with alpha implementation merged for v1.20
+- [x] 2021-05-12: KEP accepted as implementable for v1.22
 
 ## Drawbacks
 
diff --git a/keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml b/keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml
@@ -19,5 +19,24 @@ see-also:
   - /kep/sig-network/20190603-EndpointSlice-API.md
 replaces: []
 
-latest-milestone: "0.0"
-stage: "alpha"
+# The target maturity stage in the current dev cycle for this KEP.
+stage: beta
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.22"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.20"
+  beta: "v1.22"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: EndpointSliceTerminatingCondition
+    components:
+      - kube-apiserver
+      - kube-controller-manager
+disable-supported: true
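
The condition semantics described in the README changes above can be illustrated with a minimal Go sketch. It is not the endpointslice controller's code: the struct is a trimmed copy of `EndpointConditions` without serialization tags, and `conditionsForPod`, `isReady`, `isServing`, `isTerminating`, and `boolPtr` are hypothetical helpers introduced only for this example. The sketch assumes the `EndpointSliceTerminatingCondition` feature gate is enabled.

```go
package main

import "fmt"

// EndpointConditions is a trimmed copy of the API type from the KEP
// (JSON/protobuf tags omitted for brevity).
type EndpointConditions struct {
	Ready       *bool
	Serving     *bool
	Terminating *bool
}

// boolPtr is a hypothetical helper returning a pointer to b.
func boolPtr(b bool) *bool { return &b }

// conditionsForPod is a hypothetical helper that derives conditions the way
// the KEP describes the controller behaviour with the feature gate enabled:
//   - terminating is true for any pod with a deletion timestamp
//   - ready is never true for a terminating pod
//   - serving follows pod readiness regardless of terminating state
func conditionsForPod(podReady, hasDeletionTimestamp bool) EndpointConditions {
	return EndpointConditions{
		Ready:       boolPtr(podReady && !hasDeletionTimestamp),
		Serving:     boolPtr(podReady),
		Terminating: boolPtr(hasDeletionTimestamp),
	}
}

// isReady treats a nil ready condition as ready, as the field comment suggests.
func isReady(c EndpointConditions) bool {
	return c.Ready == nil || *c.Ready
}

// isServing defers to the ready condition when serving is nil.
func isServing(c EndpointConditions) bool {
	if c.Serving == nil {
		return isReady(c)
	}
	return *c.Serving
}

// isTerminating treats a nil terminating condition as "not terminating".
func isTerminating(c EndpointConditions) bool {
	return c.Terminating != nil && *c.Terminating
}

func main() {
	// A ready pod that has started terminating.
	c := conditionsForPod(true, true)
	fmt.Println(isReady(c), isServing(c), isTerminating(c)) // false true true

	// An endpoint written without the new fields (all conditions nil).
	legacy := EndpointConditions{}
	fmt.Println(isReady(legacy), isServing(legacy), isTerminating(legacy)) // true true false
}
```

A consumer that only checks `ready` keeps skipping terminating endpoints, which is the compatibility property the KEP relies on. A consumer that wants to keep sending traffic during graceful termination would check `serving` and `terminating` instead.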