Commit d248614

kep-1672: update beta milestones for v1.22
Signed-off-by: Andrew Sy Kim <[email protected]>
1 parent 95c6f59 commit d248614

File tree

3 files changed

+173
-10
lines changed

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
kep-number: 1672
2+
alpha:
3+
approver: "@wojtek-t"

keps/sig-network/1672-tracking-terminating-endpoints/README.md

Lines changed: 150 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,16 @@
1515
- [Test Plan](#test-plan)
1616
- [Graduation Criteria](#graduation-criteria)
1717
- [Alpha](#alpha)
18+
- [Beta](#beta)
1819
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
1920
- [Version Skew Strategy](#version-skew-strategy)
21+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
22+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
23+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
24+
- [Monitoring Requirements](#monitoring-requirements)
25+
- [Dependencies](#dependencies)
26+
- [Scalability](#scalability)
27+
- [Troubleshooting](#troubleshooting)
2028
- [Implementation History](#implementation-history)
2129
- [Drawbacks](#drawbacks)
2230
<!-- /toc -->
@@ -91,7 +99,7 @@ and possibly more depending on how many times readiness changes during terminati
9199

92100
## Design Details
93101

94-
To track whether an endpoint is terminating, a `terminating` field would be added as part of
102+
To track whether an endpoint is terminating, `terminating` and `serving` fields would be added as part of
95103
the `EndpointConditions` type in the EndpointSlice API.
96104

97105
```go
@@ -100,14 +108,25 @@ type EndpointConditions struct {
100108
// ready indicates that this endpoint is prepared to receive traffic,
101109
// according to whatever system is managing the endpoint. A nil value
102110
// indicates an unknown state. In most cases consumers should interpret this
103-
// unknown state as ready.
111+
// unknown state as ready. For compatibility reasons, ready should never be
112+
// "true" for terminating endpoints.
104113
// +optional
105114
Ready *bool `json:"ready,omitempty" protobuf:"bytes,1,name=ready"`
106115

107-
// terminating indicates if this endpoint is terminating. Consumers should assume a
108-
// nil value indicates the endpoint is not terminating.
116+
// serving is identical to ready except that it is set regardless of the
117+
// terminating state of endpoints. This condition should be set to true for
118+
// a ready endpoint that is terminating. If nil, consumers should defer to
119+
// the ready condition. This field can be enabled with the
120+
// EndpointSliceTerminatingCondition feature gate.
109121
// +optional
110-
Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,2,name=terminating"`
122+
Serving *bool `json:"serving,omitempty" protobuf:"bytes,2,name=serving"`
123+
124+
// terminating indicates that this endpoint is terminating. A nil value
125+
// indicates an unknown state. Consumers should interpret this unknown state
126+
// to mean that the endpoint is not terminating. This field can be enabled
127+
// with the EndpointSliceTerminatingCondition feature gate.
128+
// +optional
129+
Terminating *bool `json:"terminating,omitempty" protobuf:"bytes,3,name=terminating"`
111130
}
112131
```
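The nil-handling rules in the field comments above can be sketched from a consumer's point of view. This is a minimal illustration, not code from the KEP or from Kubernetes: the local `EndpointConditions` type drops the serialization tags, and the helpers `IsReady`, `IsServing`, `IsTerminating`, and `boolPtr` are hypothetical names.

```go
package main

import "fmt"

// EndpointConditions is a local stand-in for the API type shown above,
// reduced to the three condition fields.
type EndpointConditions struct {
	Ready       *bool
	Serving     *bool
	Terminating *bool
}

// IsReady treats a nil ready condition as ready, as the field comment suggests.
func IsReady(c EndpointConditions) bool {
	return c.Ready == nil || *c.Ready
}

// IsServing defers to the ready condition when serving is nil.
func IsServing(c EndpointConditions) bool {
	if c.Serving == nil {
		return IsReady(c)
	}
	return *c.Serving
}

// IsTerminating treats a nil terminating condition as not terminating.
func IsTerminating(c EndpointConditions) bool {
	return c.Terminating != nil && *c.Terminating
}

// boolPtr is a hypothetical helper for building pointer fields.
func boolPtr(b bool) *bool { return &b }

func main() {
	// A pod that is shutting down but still passes its readiness probe.
	draining := EndpointConditions{
		Ready:       boolPtr(false),
		Serving:     boolPtr(true),
		Terminating: boolPtr(true),
	}
	fmt.Println(IsReady(draining), IsServing(draining), IsTerminating(draining))

	// An endpoint written without any of the conditions set.
	unset := EndpointConditions{}
	fmt.Println(IsReady(unset), IsServing(unset), IsTerminating(unset))
}
```

A consumer that only checks `IsReady` keeps its existing behavior, while one that wants to drain connections gracefully can combine `IsServing` and `IsTerminating`.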
113132

@@ -116,7 +135,8 @@ NOTE: A nil value for `Terminating` indicates that the endpoint is not terminati
116135
Updates to endpointslice controller:
117136
* include pods with a deletion timestamp in endpointslice
118137
* any pod with a deletion timestamp will have condition.terminating = true
119-
* allow endpoint ready condition to change during termination
138+
* any terminating pod must have condition.ready = false.
139+
* the new `serving` condition is set based on pod readiness regardless of terminating state.
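The controller rules listed above can be sketched as a single mapping from pod state to endpoint conditions. This is an illustrative sketch under stated assumptions, not the endpointslice controller's code: `PodState` and `ConditionsForPod` are hypothetical names, and the pod is reduced to two booleans.

```go
package main

import "fmt"

// PodState is a hypothetical, reduced view of a pod: whether it passes
// readiness and whether it has a deletion timestamp.
type PodState struct {
	Ready                bool
	HasDeletionTimestamp bool
}

// EndpointConditions is a local stand-in for the API type.
type EndpointConditions struct {
	Ready, Serving, Terminating *bool
}

// ConditionsForPod applies the three rules from the list above.
func ConditionsForPod(p PodState) EndpointConditions {
	// Any pod with a deletion timestamp is terminating.
	terminating := p.HasDeletionTimestamp
	// serving follows pod readiness regardless of terminating state.
	serving := p.Ready
	// A terminating pod must never be reported as ready.
	ready := p.Ready && !terminating
	return EndpointConditions{Ready: &ready, Serving: &serving, Terminating: &terminating}
}

func main() {
	pods := []PodState{
		{Ready: true, HasDeletionTimestamp: false},
		{Ready: true, HasDeletionTimestamp: true},
		{Ready: false, HasDeletionTimestamp: true},
	}
	for _, p := range pods {
		c := ConditionsForPod(p)
		fmt.Printf("%+v -> ready=%t serving=%t terminating=%t\n",
			p, *c.Ready, *c.Serving, *c.Terminating)
	}
}
```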
120140

121141
### Test Plan
122142

@@ -134,10 +154,16 @@ E2E tests:
134154

135155
#### Alpha
136156

137-
* EndpointSlice API includes `Terminating` condition.
138-
* `Terminating` condition can only be set if feature gate `EndpointSliceTerminatingCondition` is enabled.
157+
* EndpointSlice API includes `Terminating` and `Serving` conditions.
158+
* `Terminating` and `Serving` conditions can only be set if feature gate `EndpointSliceTerminatingCondition` is enabled.
139159
* Unit tests in endpointslice controller and API validation/strategy.
140160

161+
#### Beta
162+
163+
* Integration API tests exercising the `terminating` and `serving` conditions.
164+
* `EndpointSliceTerminatingCondition` is enabled by default.
165+
* Consensus on scalability implications resulting from additional EndpointSlice writes with approval from sig-scalability.
166+
141167
### Upgrade / Downgrade Strategy
142168

143169
Since this is an addition to the EndpointSlice API, the upgrade/downgrade strategy will follow that
@@ -148,9 +174,125 @@ of the [EndpointSlice API work](/keps/sig-network/20190603-endpointslices/README
148174
Since this is an addition to the EndpointSlice API, the version skew strategy will follow that
149175
of the [EndpointSlice API work](/keps/sig-network/20190603-endpointslices/README.md).
150176

177+
## Production Readiness Review Questionnaire
178+
179+
### Feature Enablement and Rollback
180+
181+
###### How can this feature be enabled / disabled in a live cluster?
182+
183+
- [X] Feature gate (also fill in values in `kep.yaml`)
184+
- Feature gate name: EndpointSliceTerminatingCondition
185+
- Components depending on the feature gate: kube-apiserver and kube-controller-manager
186+
187+
###### Does enabling the feature change any default behavior?
188+
189+
Yes, terminating endpoints are now included in the EndpointSlice API. The `ready` condition of a terminating endpoint will always be `false`, so consumers do not send traffic to terminating endpoints unless they check the new `serving` and `terminating` conditions.
190+
191+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
192+
193+
Yes. On rollback, terminating endpoints will no longer be included in EndpointSlice and the `terminating` and `serving` conditions will not be set.
194+
195+
###### What happens if we reenable the feature if it was previously rolled back?
196+
197+
EndpointSlice will continue to have the `terminating` and `serving` conditions set, and terminating endpoints will be added to the EndpointSlice on its next sync.
198+
199+
###### Are there any tests for feature enablement/disablement?
200+
201+
Yes, there will be strategy API unit tests validating whether the new API fields are allowed based on the feature gate.
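One way to picture feature-gate handling in an API strategy is a helper that clears the gated fields when the gate is off. The sketch below is a simplified illustration and not the actual Kubernetes strategy code: `DropDisabledConditions` and `ptr` are hypothetical names, the gate is passed as a plain boolean, and it assumes disallowed fields are cleared rather than rejected.

```go
package main

import "fmt"

// EndpointConditions is a local stand-in for the API type.
type EndpointConditions struct {
	Ready, Serving, Terminating *bool
}

// DropDisabledConditions clears the gated conditions when the
// EndpointSliceTerminatingCondition gate is disabled (assumed behavior).
// The ready condition is never touched.
func DropDisabledConditions(c *EndpointConditions, gateEnabled bool) {
	if gateEnabled {
		return
	}
	c.Serving = nil
	c.Terminating = nil
}

// ptr is a hypothetical helper for building pointer fields.
func ptr(b bool) *bool { return &b }

func main() {
	c := EndpointConditions{Ready: ptr(false), Serving: ptr(true), Terminating: ptr(true)}
	DropDisabledConditions(&c, false)
	fmt.Println(c.Ready != nil, c.Serving == nil, c.Terminating == nil)
}
```

A unit test for such a helper would check both gate states: the fields survive when the gate is enabled and are cleared when it is disabled.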
202+
203+
### Rollout, Upgrade and Rollback Planning
204+
205+
###### How can a rollout fail? Can it impact already running workloads?
206+
207+
If there are consumers of EndpointSlice that do not check the `ready` condition, then they may unexpectedly start sending traffic to terminating endpoints.
208+
It is assumed that almost all consumers of EndpointSlice check the `ready` condition prior to allowing traffic to a pod.
209+
210+
###### What specific metrics should inform a rollback?
211+
212+
Application-level traffic indicating packet-loss or error rates.
213+
214+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
215+
216+
Not yet, but manual upgrade and rollback testing will be done prior to graduating the feature to Beta.
217+
218+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
219+
220+
No.
221+
222+
### Monitoring Requirements
223+
224+
###### How can an operator determine if the feature is in use by workloads?
225+
226+
The conditions will always be set for terminating pods, but consumers may choose to ignore them. It is up to consumers of the API to provide metrics
227+
on how the new conditions are being used.
228+
229+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
230+
231+
Metrics will be added for total endpoints with the `serving` and `terminating` conditions set.
232+
233+
###### What are the reasonable SLOs (Service Level Objectives) for the above SLIs?
234+
235+
N/A
236+
237+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
238+
239+
N/A
240+
241+
### Dependencies
242+
243+
###### Does this feature depend on any specific services running in the cluster?
244+
245+
N/A
246+
247+
### Scalability
248+
249+
###### Will enabling / using this feature result in any new API calls?
250+
251+
Yes, there will be more writes to EndpointSlice when:
252+
* a pod starts termination
253+
* a pod's readiness changes during termination
254+
255+
###### Will enabling / using this feature result in introducing new API types?
256+
257+
No.
258+
259+
###### Will enabling / using this feature result in any new calls to the cloud provider?
260+
261+
No.
262+
263+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
264+
265+
Yes, it will increase the size of EndpointSlice by adding two boolean fields for each endpoint.
266+
267+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
268+
269+
The network programming latency SLO might be impacted due to additional writes to EndpointSlice.
270+
271+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
272+
273+
More writes to EndpointSlice could result in more resource usage from etcd disk IO and network bandwidth for all watchers.
274+
275+
### Troubleshooting
276+
277+
###### How does this feature react if the API server and/or etcd is unavailable?
278+
279+
EndpointSlice conditions will get stale.
280+
281+
###### What are other known failure modes?
282+
283+
* Consumers of EndpointSlice that do not check the `ready` condition may unexpectedly use terminating endpoints.
284+
285+
###### What steps should be taken if SLOs are not being met to determine the problem?
286+
287+
* Disable the feature gate
288+
* Check if consumers of EndpointSlice are using the `serving` or `terminating` condition
289+
* Check etcd disk usage
290+
151291
## Implementation History
152292

153293
- [x] 2020-04-23: KEP accepted as implementable for v1.19
294+
- [x] 2020-07-01: initial PR with alpha implementation merged for v1.20
295+
- [x] 2020-05-12: KEP accepted as implementable for v1.22
154296

155297
## Drawbacks
156298

keps/sig-network/1672-tracking-terminating-endpoints/kep.yaml

Lines changed: 20 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,5 +19,23 @@ see-also:
1919
- /kep/sig-network/20190603-EndpointSlice-API.md
2020
replaces: []
2121

22-
latest-milestone: "0.0"
23-
stage: "alpha"
22+
# The target maturity stage in the current dev cycle for this KEP.
23+
stage: alpha
24+
25+
# The most recent milestone for which work toward delivery of this KEP has been
26+
# done. This can be the current (upcoming) milestone, if it is being actively
27+
# worked on.
28+
latest-milestone: "v1.22"
29+
30+
# The milestone at which this feature was, or is targeted to be, at each stage.
31+
milestone:
32+
alpha: "v1.20"
33+
34+
# The following PRR answers are required at alpha release
35+
# List the feature gate name and the components for which it must be enabled
36+
feature-gates:
37+
- name: EndpointSliceTerminatingCondition
38+
components:
39+
- kube-apiserver
40+
- kube-controller-manager
41+
disable-supported: true
