From f58c046fa1bc6f48c1865d3d306990f36d86b0bd Mon Sep 17 00:00:00 2001 From: kerthcet Date: Thu, 11 May 2023 11:28:12 +0800 Subject: [PATCH 1/5] Adapt to the latest template Signed-off-by: kerthcet --- .../README.md | 441 +++++++++++++++--- 1 file changed, 365 insertions(+), 76 deletions(-) diff --git a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md index fbbef782646..7183961cfe4 100644 --- a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md +++ b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md @@ -1,4 +1,97 @@ -# Implement maxUnavailable in StatefulSet +# KEP-961: Implement maxUnavailable in StatefulSet + + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [x] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [x] "Implementation History" section is up-to-date for milestone +- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website ## Table of Contents @@ -32,56 +125,153 @@ ## Summary -The purpose of this enhancement is to implement maxUnavailable for StatefulSet during RollingUpdate. -When a StatefulSet’s `.spec.updateStrategy.type` is set to `RollingUpdate`, the StatefulSet controller -will delete and recreate each Pod in the StatefulSet. The updating of each Pod currently happens one at a time. With support for `maxUnavailable`, the updating will proceed `maxUnavailable` number of pods at a time. + + +The purpose of this enhancement is to implement maxUnavailable for StatefulSet during RollingUpdate. +When a StatefulSet’s `.spec.updateStrategy.type` is set to `RollingUpdate`, the StatefulSet controller +will delete and recreate each Pod in the StatefulSet. The updating of each Pod currently happens one at a time. +With support for `maxUnavailable`, the updating will proceed `maxUnavailable` number of pods at a time. ## Motivation + + Consider the following scenarios:- -1. My containers publish metrics to a time series system. If I am using a Deployment, each rolling -update creates a new pod name and hence the metrics published by this new pod starts a new time series -which makes tracking metrics for the application difficult. While this could be mitigated, it requires -some tricks on the time series collection side. It would be so much better, If we could use a +1. My containers publish metrics to a time series system. If I am using a Deployment, each rolling +update creates a new pod name and hence the metrics published by this new pod starts a new time series +which makes tracking metrics for the application difficult. While this could be mitigated, it requires +some tricks on the time series collection side. It would be so much better, If we could use a StatefulSet object so my object names doesnt change and hence all metrics goes to a single time series. 
This will be easier if StatefulSet is at feature parity with Deployments. -2. My Container does some initial startup tasks like loading up cache or something that takes a lot of -time. If we used StatefulSet, we can only go one pod at a time which would result in a slow rolling -update. If StatefulSet supported maxUnavailable with value greater than 1, it would allow for a faster +2. My Container does some initial startup tasks like loading up cache or something that takes a lot of +time. If we used StatefulSet, we can only go one pod at a time which would result in a slow rolling +update. If StatefulSet supported maxUnavailable with value greater than 1, it would allow for a faster rollout since a total of maxUnavailable number of pods could be loading up the cache at the same time. -3. My Stateful clustered application, has followers and leaders, with followers being many more than 1. My application can tolerate many followers going down at the same time. I want to be able to do faster -rollouts by bringing down 2 or more followers at the same time. This is only possible if StatefulSet +3. My Stateful clustered application, has followers and leaders, with followers being many more than 1. My application can tolerate many followers going down at the same time. I want to be able to do faster +rollouts by bringing down 2 or more followers at the same time. This is only possible if StatefulSet supports maxUnavailable in Rolling Updates. -4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through -ReplicaSets and has its own nuances. Understanding that requires diving into the complicacy of hashing -and how ReplicaSets are named. Over and above that, there are some issues with hash collisions which -further complicate the situation(I know they were solved). StatefulSet introduced ControllerRevisions -in 1.7 which are much easier to think and reason about. They are used by DaemonSet and StatefulSet for -tracking revisions. It would be so much nicer if all the use cases of Deployments can be met in -StatefulSet's and additionally we could track the revisions by ControllerRevisions. Another way of +4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through +ReplicaSets and has its own nuances. Understanding that requires diving into the complicacy of hashing +and how ReplicaSets are named. Over and above that, there are some issues with hash collisions which +further complicate the situation(I know they were solved). StatefulSet introduced ControllerRevisions +in 1.7 which are much easier to think and reason about. They are used by DaemonSet and StatefulSet for +tracking revisions. It would be so much nicer if all the use cases of Deployments can be met in +StatefulSet's and additionally we could track the revisions by ControllerRevisions. Another way of saying this is, all my Deployment use cases are easily met by StatefulSet, and additionally I can enjoy easier revision tracking only if StatefulSet supported `maxUnavailable`. -With this feature in place, when using StatefulSet with maxUnavailable >1, the user is making a -conscious choice that more than one pod going down at the same time during rolling update, would not -cause issues with their Stateful Application which have per pod state and identity. 
Other Stateful +With this feature in place, when using StatefulSet with maxUnavailable >1, the user is making a +conscious choice that more than one pod going down at the same time during rolling update, would not +cause issues with their Stateful Application which have per pod state and identity. Other Stateful Applications which cannot tolerate more than one pod going down, will resort to the current behavior of one pod at a time Rolling Updates. ### Goals -StatefulSet RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to + + + +StatefulSet RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to control how many Pods will be brought down at a time, during Rolling Update. ### Non-Goals -NA + + + +N/A ## Proposal + + ### User Stories + + #### Story 1 -As a User of Kubernetes, I should be able to update my StatefulSet, more than one Pod at a time, in a -RollingUpdate manner, if my Stateful app can tolerate more than one pod being down, thus allowing my -update to finish much faster. +As a User of Kubernetes, I should be able to update my StatefulSet, more than one Pod at a time, in a +RollingUpdate manner, if my Stateful app can tolerate more than one pod being down, thus allowing my +update to finish much faster. + +### Notes/Constraints/Caveats (Optional) + + + +No. + +### Risks and Mitigations + + + +We are proposing a new field called `maxUnavailable` whose default value will be 1. In this mode, StatefulSet will behave exactly like its current behavior. +Its possible we introduce a bug in the implementation. The mitigation currently is that is disabled by default in Alpha phase for people to try out and give +feedback. +In Beta phase when its enabled by default, people will only see issues or bugs when `maxUnavailable` is set to something greater than 1. Since people have +tried this feature in Alpha, we would have time to fix issues. + +## Design Details + + ### Implementation Details @@ -106,62 +296,62 @@ type RollingUpdateStatefulSetStrategy struct { // Defaults to 1. // +optional MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,2,opt,name=maxUnavailable"` - + ... } ``` -- By Default, if maxUnavailable is not specified, its value will be assumed to be 1 and StatefulSets +- By Default, if maxUnavailable is not specified, its value will be assumed to be 1 and StatefulSets will follow their old behavior. This will also help while upgrading from a release which doesnt support maxUnavailable to a release which supports this field. - If maxUnavailable is specified, it cannot be greater than total number of replicas. - If maxUnavailable is specified and partition is also specified, MaxUnavailable cannot be greater than `replicas-partition` -- If a partition is specified, maxUnavailable will only apply to all the pods which are staged by the -partition. Which means all Pods with an ordinal that is greater than or equal to the partition will be +- If a partition is specified, maxUnavailable will only apply to all the pods which are staged by the +partition. Which means all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet’s .spec.template is updated. Lets say total replicas is 5 and partition is set to 2 and maxUnavailable is set to 2. If the image is changed in this scenario, following are the possible behavior choices we have:- - 1. Pods with ordinal 4 and 3 will start Terminating at the same time(because of maxUnavailable). 
Once they are both running and ready, pods with ordinal 2 will start Terminating. Pods with ordinal 0 and 1 -will remain untouched due the partition. In this choice, the number of pods terminating is not always -maxUnavailable, but sometimes less than that. For e.g. if pod with ordinal 3 is running and ready but 4 is not, we still wait for 4 to be running and ready before moving on to 2. This implementation avoids + 1. Pods with ordinal 4 and 3 will start Terminating at the same time(because of maxUnavailable). Once they are both running and ready, pods with ordinal 2 will start Terminating. Pods with ordinal 0 and 1 +will remain untouched due the partition. In this choice, the number of pods terminating is not always +maxUnavailable, but sometimes less than that. For e.g. if pod with ordinal 3 is running and ready but 4 is not, we still wait for 4 to be running and ready before moving on to 2. This implementation avoids out of order Terminations of pods. - 2. Pods with ordinal 4 and 3 will start Terminating at the same time(because of maxUnavailable). When any of 4 or 3 are running and ready, pods with ordinal 2 will start Terminating. This could violate -ordering guarantees, since if 3 is running and ready, then both 4 and 2 are terminating at the same + 2. Pods with ordinal 4 and 3 will start Terminating at the same time(because of maxUnavailable). When any of 4 or 3 are running and ready, pods with ordinal 2 will start Terminating. This could violate +ordering guarantees, since if 3 is running and ready, then both 4 and 2 are terminating at the same time out of order. If 4 is running and ready, then both 3 and 2 are Terminating at the same time and no ordering guarantees are violated. This implementation, guarantees, that always there are maxUnavailable number of Pods Terminating except the last batch. - 3. Pod with ordinal 4 and 3 will start Terminating at the same time(because of maxUnavailable). When 4 is running and ready, 2 will start Terminating. At this time both 2 and 3 are terminating. If 3 is -running and ready before 4, 2 wont start Terminating to preserve ordering semantics. So at this time, + 3. Pod with ordinal 4 and 3 will start Terminating at the same time(because of maxUnavailable). When 4 is running and ready, 2 will start Terminating. At this time both 2 and 3 are terminating. If 3 is +running and ready before 4, 2 wont start Terminating to preserve ordering semantics. So at this time, only 1 is unavailable although we requested 2. 4. Introduce a field in Rolling Update, which decides whether we want maxUnavailable with ordering or without ordering guarantees. Depending on what the user wants, this Choice can either choose behavior 1 or 3 if ordering guarantees are needed or choose behavior 2 if they dont care. To simplify this further PodManagementPolicy today supports `OrderedReady` or `Parallel`. The `Parallel` mode only supports scale up and tear down of StatefulSets and currently doesnt apply to Rolling Updates. So instead of coming up with a new field, we could use the PodManagementPolicy to choose the behavior the User wants. 1. PMP=Parallel will now apply to RollingUpdate. This will choose behavior described in 2 above. - This means always maxUnavailable number of Pods are terminating at the same time except in + This means always maxUnavailable number of Pods are terminating at the same time except in the last case and no ordering guarantees are provided. - 2. PMP=OrderedReady with maxUnavailable can choose one of behavior 1 or 3. + 2. 
PMP=OrderedReady with maxUnavailable can choose one of behavior 1 or 3. -NOTE: The goal is faster updates of an application. In some cases , people would need both ordering -and faster updates. In other cases they just need faster updates and they dont care about ordering as +NOTE: The goal is faster updates of an application. In some cases , people would need both ordering +and faster updates. In other cases they just need faster updates and they dont care about ordering as long as they get identity. -Choice 1 is simpler to reason about. It does not always have maxUnavailable number of Pods in -Terminating state. It does not guarantee ordering within the batch of maxUnavailable Pods. The maximum +Choice 1 is simpler to reason about. It does not always have maxUnavailable number of Pods in +Terminating state. It does not guarantee ordering within the batch of maxUnavailable Pods. The maximum difference in the ordinals which are Terminating out of Order, cannot be more than maxUnavailable. -Choice 2 always offers maxUnavailable number of Pods in Terminating state. This can sometime lead to +Choice 2 always offers maxUnavailable number of Pods in Terminating state. This can sometime lead to pods terminating out of order. This will always lead to the fastest rollouts. The maximum difference in the ordinals which are Terminating out of Order, can be more than maxUnavailable. -Choice 3 always guarantees than no two pods are ever Terminating out of order. It sometimes does that, -at the cost of not being able to Terminate maxUnavailable pods. The implementationg for this might be +Choice 3 always guarantees than no two pods are ever Terminating out of order. It sometimes does that, +at the cost of not being able to Terminate maxUnavailable pods. The implementationg for this might be complicated. -Choice 4 provides a choice to the users and hence takes the guessing out of the picture on what they +Choice 4 provides a choice to the users and hence takes the guessing out of the picture on what they will expect. Implementing Choice 4 using PMP would be the easiest. #### Implementation The alpha release we are going with Choice 4 with support for both PMP=Parallel and PMP=OrderedReady. For PMP=Parallel, we will use Choice 2 -For PMP=OrderedReady, we will use Choice 3 to ensure we can support ordering guarantees while also +For PMP=OrderedReady, we will use Choice 3 to ensure we can support ordering guarantees while also making sure the rolling updates are fast. @@ -202,7 +392,7 @@ https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset // wait for unhealthy Pods on update if !isHealthy(replicas[target]) { - // If this Pod is unhealthy regardless of revision, count it in + // If this Pod is unhealthy regardless of revision, count it in // unavailable pods unavailablePods = append(unavailablePods, replicas[target].Name) klog.V(4).Infof( @@ -211,7 +401,7 @@ https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset set.Name, replicas[target].Name) } - + // NEW CODE HERE // If at anytime, total number of unavailable Pods exceeds maxUnavailable, // we stop deleting more Pods for Update @@ -229,29 +419,33 @@ https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset ... ``` -### Risks and Mitigations -We are proposing a new field called `maxUnavailable` whose default value will be 1. In this mode, StatefulSet will behave exactly like its current behavior. -Its possible we introduce a bug in the implementation. 
The mitigation currently is that is disabled by default in Alpha phase for people to try out and give -feedback. -In Beta phase when its enabled by default, people will only see issues or bugs when `maxUnavailable` is set to something greater than 1. Since people have -tried this feature in Alpha, we would have time to fix issues. +### Test Plan + -Downgrades +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. -When downgrading from a release with this feature, to a release without maxUnavailable, there are two cases - - If maxUnavailable is greater than 1, there are two more cases:- - - If you're rolling back to a release that doesn't have this field - then there is even no way to discover it - - If you're just disabling the feature (either together with downgrade to a release that has a field or without downgrade),the field should remain set - (unless someone will explicitly delete it later), but controller should ignore its behavior (and there shouldn't be a way to set it if the feature gate - is switched off). - - If maxUnavailable is less than equal to 1 -- in this case user wont see any difference in behavior +##### Prerequisite testing updates + + -### Tests +No. + +#### Tests - maxUnavailable =1, Same behavior as today with PodManagementPolicy as `OrderedReady` or `Parallel` - Each of these Tests can be run in PodManagementPolicy = `OrderedReady` or `Parallel` and the Update @@ -263,17 +457,107 @@ When downgrading from a release with this feature, to a release without maxUnava - maxUnavailable greater than 1 with partition and staged pods greater than maxUnavailable - maxUnavailable greater than 1 with partition and maxUnavailable greater than replicas -## Test Plan +#### Test Plan For `Alpha`, unit tests and e2e tests will be added to test functionality at both with feature flag enabled and disabled. Defaults will be verified so that users who donot set this flag are not surprised at all. - ## Graduation Criteria + + - Alpha: Initial support for maxUnavailable in StatefulSets added. Disabled by default with default value of 1. -- Beta: Enabled by default with default value of 1 with upgrade downgrade testedd at least manually. +- Beta: Enabled by default with default value of 1 with upgrade downgrade tested at least manually. +### Upgrade / Downgrade Strategy + + + +We will default to 1 for maxUnavailable field in StatefulSet for backward compatibility + +Downgrades + +When downgrading from a release with this feature, to a release without maxUnavailable, there are two cases +- If maxUnavailable is greater than 1, there are two more cases:- + - If you're rolling back to a release that doesn't have this field - then there is even no way to discover it + - If you're just disabling the feature (either together with downgrade to a release that has a field or without downgrade),the field should remain set + (unless someone will explicitly delete it later), but controller should ignore its behavior (and there shouldn't be a way to set it if the feature gate + is switched off). + - If maxUnavailable is less than equal to 1 -- in this case user wont see any difference in behavior --> + +### Version Skew Strategy + +No. ## Production Readiness Review Questionnaire @@ -296,18 +580,18 @@ revert to the old behavior where rolling update will proceed one pod at a time. 
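For illustration, a sketch of how the gate would be turned off (the `MaxUnavailableStatefulSet` gate name and the two components come from this KEP's `kep.yaml`; the exact invocation depends on how the control plane is deployed):

```shell
# Turn the gate off on both components that act on the new field.
kube-apiserver --feature-gates=MaxUnavailableStatefulSet=false ...
kube-controller-manager --feature-gates=MaxUnavailableStatefulSet=false ...
```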
###### What happens if we reenable the feature if it was previously rolled back? -We will restore the desired behavior for StatefulSets for which the maxunavailable field wasn't deleted after +We will restore the desired behavior for StatefulSets for which the maxunavailable field wasn't deleted after the feature gate was disabled. ###### Are there any tests for feature enablement/disablement? -yes, there are unit tests which make sure the field is correctly dropped +yes, there are unit tests which make sure the field is correctly dropped on feature enable and disabled ### Rollout, Upgrade and Rollback Planning ###### How can a rollout or rollback fail? Can it impact already running workloads? -A rollout or rollback of this feature can fail if there is a bug which causes the kube-apiserver or +A rollout or rollback of this feature can fail if there is a bug which causes the kube-apiserver or the kube-controller-manager to start crashing when the feature flag is enabled. @@ -375,6 +659,8 @@ No ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? The controller-manager will see very negligible and almost un-notoceable increase in cpu usage. +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + ### Troubleshooting ###### How does this feature react if the API server and/or etcd is unavailable? @@ -384,6 +670,8 @@ hence this feature will also be not be able to be used. ###### What are other known failure modes? NA +###### What steps should be taken if SLOs are not being met to determine the problem? + ## Implementation History - KEP Started on 1/1/2019 @@ -398,6 +686,7 @@ NA - Users who need StatefulSets stable identity and are ok with getting a slow rolling update will continue to use StatefulSets. Users who are not ok with a slow rolling update, will continue to use Deployments with workarounds for the scenarios mentioned in the Motivations section. -- Another alternative would be to use OnDelete and deploy your own Custom Controller on top of StatefulSet Pods. There you can implement +- Another alternative would be to use OnDelete and deploy your own Custom Controller on top of StatefulSet Pods. There you can implement your own logic for deleting more than one pods in a specific order. This requires more work on the user but give them ultimate flexibility. 
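As a sketch of this alternative (an illustrative snippet, not a prescribed setup), the StatefulSet would opt out of controller-driven updates entirely, leaving pod deletion to the custom controller:

```yaml
spec:
  updateStrategy:
    # With OnDelete, the StatefulSet controller does not delete pods when
    # .spec.template changes; the custom controller decides how many pods
    # to delete at once and in which order.
    type: OnDelete
```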
+## Infrastructure Needed (Optional) \ No newline at end of file From 21a765b0090bfcaf70ab21107f8834952d62ec93 Mon Sep 17 00:00:00 2001 From: kerthcet Date: Thu, 11 May 2023 15:32:10 +0800 Subject: [PATCH 2/5] Bump to beta Signed-off-by: kerthcet --- keps/prod-readiness/sig-apps/961.yaml | 2 + .../README.md | 84 ++++++++++++++----- .../kep.yaml | 45 ++++++---- 3 files changed, 96 insertions(+), 35 deletions(-) diff --git a/keps/prod-readiness/sig-apps/961.yaml b/keps/prod-readiness/sig-apps/961.yaml index de2d414190d..3531e9cdaf0 100644 --- a/keps/prod-readiness/sig-apps/961.yaml +++ b/keps/prod-readiness/sig-apps/961.yaml @@ -1,3 +1,5 @@ kep-number: 961 alpha: approver: "@wojtek-t" +beta: + approver: "@wojtek-t" diff --git a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md index 7183961cfe4..13fdeb49e9a 100644 --- a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md +++ b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md @@ -1,6 +1,5 @@ # KEP-961: Implement maxUnavailable in StatefulSet - - [Release Signoff Checklist](#release-signoff-checklist) +- [Table of Contents](#table-of-contents) - [Summary](#summary) - [Motivation](#motivation) - [Goals](#goals) - [Non-Goals](#non-goals) - [Proposal](#proposal) - - [User Stories (Optional)](#user-stories-optional) + - [User Stories](#user-stories) - [Story 1](#story-1) - - [Story 2](#story-2) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) + - [Implementation Details](#implementation-details) + - [API Changes](#api-changes) + - [Implementation](#implementation) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - - [Unit tests](#unit-tests) - - [Integration tests](#integration-tests) - - [e2e tests](#e2e-tests) - - [Graduation Criteria](#graduation-criteria) + - [Tests](#tests) + - [Test Plan](#test-plan-1) +- [Graduation Criteria](#graduation-criteria) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) @@ -206,7 +207,7 @@ What is out of scope for this KEP? Listing non-goals helps to focus discussion and make progress. --> -N/A +None. ## Proposal @@ -458,9 +459,12 @@ No. - maxUnavailable greater than 1 with partition and maxUnavailable greater than replicas #### Test Plan -For `Alpha`, unit tests and e2e tests will be added to test functionality at both + +For `Alpha`, unit tests and integration tests will be added to test functionality at both with feature flag enabled and disabled. Defaults will be verified so that users -who donot set this flag are not surprised at all. +who do not set this flag are not surprised at all. + +For `Beta`, add e2e tests. ## Graduation Criteria @@ -604,11 +608,16 @@ maxUnavailable to a number greater than 1, but the invariants and the logic wil maxUnavailable pods with the same identity and never more than maxUnavailable being deleted. ###### What specific metrics should inform a rollback? -TODO when we reach Beta + +When feature enabled but rolling update in a unexpected phenomenon like the update pods at a time is not equal to the +`maxUnavailable` value or rolling update in a unexpected order. 
+
+Or we can refer to the `rolling-update-duration` metric for observation: if it didn't decrease after setting `maxUnavailable`
+greater than 1, or the duration increased abnormally, then we should roll back.

 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

-Will be tested when graduating to Beta.
+No, but it will be tested manually before merging the PR.

 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
 No

@@ -623,24 +632,36 @@ The below command should show maxUnavailable value:
 kubectl get statefulsets -o yaml | grep maxUnavailable
 ```

+Or refer to the new metric `rolling-update-duration`; it should exist.
+
 ###### How can someone using this feature know that it is working for their instance?
-TODO when we reach Beta
+
+With the feature enabled, set `maxUnavailable` greater than 1 and pay attention to how many pods are updated at a time;
+it should equal the `maxUnavailable` value.
+Or, when setting `maxUnavailable` greater than 1, the `rolling-update-duration` should decrease.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

+I think it has little relevance to SLOs, though a rolling update proceeding at a very low speed impacts the running services.
+
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

+None.
+
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

+None.
+
 ### Dependencies

 ###### Does this feature depend on any specific services running in the cluster?
-NA
+No.

 ### Scalability

 ###### Will enabling / using this feature result in any new API calls?
-It doesnt make any extra API calls.
+
+It doesn't make any extra API calls.

 ###### Will enabling / using this feature result in introducing new API types?
 No

@@ -648,7 +669,6 @@ No
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
 No
-
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
 A struct gets added to every StatefulSet object which has three fields, one 32-bit integer and two fields of type string.
 The struct in question is IntOrString.
@@ -661,6 +681,8 @@ The controller-manager will see very negligible and almost un-notoceable increas
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

+No.
+
 ### Troubleshooting

 ###### How does this feature react if the API server and/or etcd is unavailable?
@@ -668,7 +690,28 @@ The RollingUpdate will fail or will not be able to proceed if etcd or apiserver
 hence this feature will also not be able to be used.

 ###### What are other known failure modes?
-NA
+
+<!--
+For each of them, fill in the following information by copying the below template:
+- [Failure mode brief description]
+  - Detection: How can it be detected via metrics? Stated another way:
+    how can an operator troubleshoot without logging into a master or worker node?
+  - Mitigations: What can be done to stop the bleeding, especially for already
+    running user workloads?
+  - Diagnostics: What are the useful log messages and their required logging
+    levels that could help debug the issue?
+    Not required until feature graduated to beta.
+  - Testing: Are there any tests for failure mode? If not, describe why.
+-->
+
+In a multi-master setup, when the cluster has skewed kube-controller-manager, the behaviors may differ.
+
+- [Failure mode brief description]
+  - Detection: the `rolling-update-duration` didn't decrease after setting `maxUnavailable` greater than 1, or it increased abnormally.
+  - Mitigations: Disable the feature.
+  - Diagnostics: Set the logger level greater than 4.
+  - Testing: No tests, because the rolling update duration is hard to measure; it can be impacted by many things,
+    like the master performance.

 ###### What steps should be taken if SLOs are not being met to determine the problem?
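As a first pass (a sketch based on the mitigations listed above), confirm what the controller is being asked to do, then raise verbosity and, if needed, disable the feature:

```shell
# Check the configured value on the affected StatefulSets.
kubectl get statefulsets -o yaml | grep maxUnavailable

# Then raise the kube-controller-manager log level above 4 to trace the
# StatefulSet controller's update decisions, and disable the feature gate
# if the rollout still does not match the configured maxUnavailable.
```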
@@ -676,10 +719,11 @@ NA - KEP Started on 1/1/2019 - Implementation PR and UT by 8/30 +- Bump to beta at 2023-05-11 ## Drawbacks -NA +Slow rolling update. ## Alternatives @@ -689,4 +733,6 @@ section. - Another alternative would be to use OnDelete and deploy your own Custom Controller on top of StatefulSet Pods. There you can implement your own logic for deleting more than one pods in a specific order. This requires more work on the user but give them ultimate flexibility. -## Infrastructure Needed (Optional) \ No newline at end of file +## Infrastructure Needed (Optional) + +No. diff --git a/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml b/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml index d0acec5ea22..5ea9291dcc8 100644 --- a/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml +++ b/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml @@ -2,30 +2,43 @@ title: Implement maxUnavailable for StatefulSets kep-number: 961 authors: - "@krmayankk" + - "@kerthcet" owning-sig: sig-apps -participating-sigs: - - sig-apps +participating-sigs: [] +status: implementable +creation-date: 2018-12-29 reviewers: - "@janetkuo" - "@kow3ns" approvers: - "@janetkuo" - "@kow3ns" -editor: TBD -creation-date: 2018-12-29 -last-updated: 2022-01-01 -status: implementable -see-also: - - n/a -replaces: - - n/a -superseded-by: - - n/a + +see-also: [] +replaces: [] + +# The target maturity stage in the current dev cycle for this KEP. +stage: beta + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.28" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: alpha: "v1.24" - beta: "v1.25" - stable: "v1.26" -latest-milestone: "v1.24" -stage: "alpha" + beta: "v1.28" + stable: TBD + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: MaxUnavailableStatefulSet + components: + - kube-controller-manager + - kube-apiserver + +# The following PRR answers are required at beta release +metrics: + - rolling-update-duration From 301e2d81b98dcb3d39667a16e5b4afc200e9b08a Mon Sep 17 00:00:00 2001 From: kerthcet Date: Tue, 16 May 2023 17:17:22 +0800 Subject: [PATCH 3/5] address comments Signed-off-by: kerthcet --- keps/sig-apps/961-maxunavailable-for-statefulset/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md index 13fdeb49e9a..4748c7772ad 100644 --- a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md +++ b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md @@ -598,7 +598,6 @@ on feature enable and disabled A rollout or rollback of this feature can fail if there is a bug which causes the kube-apiserver or the kube-controller-manager to start crashing when the feature flag is enabled. - Yes, it can impact already running workloads. If a rolling update is in progress for a StatefulSet, while this feature is being enabled in kube-apiserver @@ -610,7 +609,8 @@ maxUnavailable pods with the same identity and never more than maxUnavailable be ###### What specific metrics should inform a rollback? 
When feature enabled but rolling update in a unexpected phenomenon like the update pods at a time is not equal to the
-`maxUnavailable` value or rolling update in a unexpected order.
+`maxUnavailable` value (other factors may influence this, but it should hold most of the time) or
+the rolling update proceeds in an unexpected order.

 Or we can refer to the `rolling-update-duration` metric for observation: if it didn't decrease after setting `maxUnavailable`
 greater than 1, or the duration increased abnormally, then we should roll back.

From ddd8c0cd636eef2dadb5e7a9417b69c9106c125c Mon Sep 17 00:00:00 2001
From: kerthcet
Date: Wed, 31 May 2023 19:50:21 +0800
Subject: [PATCH 4/5] Address comments

Signed-off-by: kerthcet
---
 .../sig-apps/3335-statefulset-slice/README.md |   6 +-
 .../README.md | 208 ++++++++++++++----
 .../kep.yaml |   2 +
 3 files changed, 171 insertions(+), 45 deletions(-)

diff --git a/keps/sig-apps/3335-statefulset-slice/README.md b/keps/sig-apps/3335-statefulset-slice/README.md
index 490a3277c9e..eb39a0c4ec9 100644
--- a/keps/sig-apps/3335-statefulset-slice/README.md
+++ b/keps/sig-apps/3335-statefulset-slice/README.md
@@ -139,10 +139,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [X] (R) Design details are appropriately documented
 - [X] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
   - [ ] e2e Tests for all Beta API Operations (endpoints)
   - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [X] (R) Graduation criteria is in place
   - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [X] "Implementation History" section is up-to-date for milestone
@@ -390,7 +390,7 @@ In the main control loop, StatefulSet will attempt to create pod replicas `[k, N
 * When scaling down pods: If ordinal `j` exists but is not in range `[k, N+k-1]`, pod `j` will be terminated.
 
 **RollingUpdate Partition Changes**
Since `ordinals.start` changes the offset of the replica ordinals, this affects the `partition` field used for RollingUpdate. As `partition` specifies an ordinal index, the `partition` field must be in the range `[k, N+k-1]` to be valid.
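For instance (an illustrative manifest, not part of this change), with `replicas: 3` and `ordinals.start: 5` the pods are numbered 5 through 7, so a valid `partition` must fall within `[5, 7]`:

```yaml
spec:
  replicas: 3
  ordinals:
    start: 5        # pods receive ordinals 5, 6 and 7
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 6  # valid: within [5, 7]; pods 6 and 7 update, pod 5 does not
```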
### Test Plan diff --git a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md index 4748c7772ad..a766d05671c 100644 --- a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md +++ b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md @@ -34,10 +34,15 @@ tags, and then generate with `hack/update-toc.sh`. - [Implementation](#implementation) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - - [Tests](#tests) - - [Test Plan](#test-plan-1) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Upgrade](#upgrade) + - [Downgrade](#downgrade) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - [Feature Enablement and Rollback](#feature-enablement-and-rollback) @@ -444,9 +449,28 @@ Based on reviewers feedback describe what additional tests need to be added prio implementing this enhancement to ensure the enhancements have also solid foundations. --> -No. +##### Unit tests + + + + -#### Tests +Testcases: - maxUnavailable =1, Same behavior as today with PodManagementPolicy as `OrderedReady` or `Parallel` - Each of these Tests can be run in PodManagementPolicy = `OrderedReady` or `Parallel` and the Update @@ -458,13 +482,42 @@ No. - maxUnavailable greater than 1 with partition and staged pods greater than maxUnavailable - maxUnavailable greater than 1 with partition and maxUnavailable greater than replicas -#### Test Plan +Coverage: -For `Alpha`, unit tests and integration tests will be added to test functionality at both -with feature flag enabled and disabled. Defaults will be verified so that users -who do not set this flag are not surprised at all. +- `pkg/apis/apps/v1`: `2023-05-26` - `71.7%` +- `pkg/apis/apps/v1beta2`: `2023-05-26` - `76.7%` +- `pkg/apis/apps/validation`: `2023-05-26` - `92.3%` +- `pkg/controller/statefulset`: `2023-05-26` - `85.7%` +- `pkg/registry/apps/statefulset`: `2023-05-26` - `63.9%` -For `Beta`, add e2e tests. +##### Integration tests + + + +We have controller tests in `pkg/controller/statefulset` cover all cases. What's more, +rolling update is hard to simulate in integration tests because we want to verify the pod +status but actually they're faked. But we'll have e2e tests instead. + +##### e2e tests + + + +- `test/e2e/apps: we'll add a test to verify that the number of pods brought down each +time is equal to configured maxUnavailable. ## Graduation Criteria @@ -530,8 +583,13 @@ in back-to-back releases. - Deprecate the flag --> -- Alpha: Initial support for maxUnavailable in StatefulSets added. Disabled by default with default value of 1. -- Beta: Enabled by default with default value of 1 with upgrade downgrade tested at least manually. +#### Alpha + +- Initial support for maxUnavailable in StatefulSets added. Disabled by default with default value of 1. + +#### Beta + +- Enabled by default with default value of 1 with upgrade downgrade tested at least manually. ### Upgrade / Downgrade Strategy @@ -547,21 +605,43 @@ enhancement: cluster required to make on upgrade, in order to make use of the enhancement? 
--> -We will default to 1 for maxUnavailable field in StatefulSet for backward compatibility +#### Upgrade -Downgrades +- What changes (in invocations, configurations, API use, etc.) is an existing + cluster required to make on upgrade, in order to maintain previous behavior? + - Nothing special, it's backward compatibility +- What changes (in invocations, configurations, API use, etc.) is an existing + cluster required to make on upgrade, in order to make use of the enhancement? + - This feature is enabled by default in beta so you can configure the maxUnavailable to meet your requirements + +#### Downgrade + +when downgrade from a release `maxUnavailable` enabled to a release it's disabled: + +- If we downgrade to a release even doesn't have this field, then there's no way to discover it. +- Else, if we create statefulset with maxUnavailable configured, it will be dropped to nil automatically. -When downgrading from a release with this feature, to a release without maxUnavailable, there are two cases -- If maxUnavailable is greater than 1, there are two more cases:- - - If you're rolling back to a release that doesn't have this field - then there is even no way to discover it - - If you're just disabling the feature (either together with downgrade to a release that has a field or without downgrade),the field should remain set - (unless someone will explicitly delete it later), but controller should ignore its behavior (and there shouldn't be a way to set it if the feature gate - is switched off). - - If maxUnavailable is less than equal to 1 -- in this case user wont see any difference in behavior --> +Anyway, we'll back to that time rolling update with a single unavailable pod. ### Version Skew Strategy -No. + + +- If both api-server and kube-controller-manager enabled this feature, then everything works as expected +- If only api-server enabled this feature, then we can configure the `maxUnavailable` field but it will not work. +- If only kube-controller-manager enabled this feature, then for newly created statefulSet, the `maxUnavailable` will default to 1. +- If both api-server and kube-controller-manager disabled this feature, then everything works as expected. ## Production Readiness Review Questionnaire @@ -584,36 +664,56 @@ revert to the old behavior where rolling update will proceed one pod at a time. ###### What happens if we reenable the feature if it was previously rolled back? -We will restore the desired behavior for StatefulSets for which the maxunavailable field wasn't deleted after +We will restore the desired behavior for StatefulSets for which the maxUnavailable field wasn't deleted after the feature gate was disabled. ###### Are there any tests for feature enablement/disablement? -yes, there are unit tests which make sure the field is correctly dropped -on feature enable and disabled + + + +Yes, there are unit tests which make sure the field is correctly dropped +on feature enable and disabled, see [strategy tests](https://github.com/kubernetes/kubernetes/blob/23698d3e9f4f3b9738ba5a6fcefd17894a00624f/pkg/registry/apps/statefulset/strategy_test.go#L391-L417). ### Rollout, Upgrade and Rollback Planning ###### How can a rollout or rollback fail? Can it impact already running workloads? -A rollout or rollback of this feature can fail if there is a bug which causes the kube-apiserver or -the kube-controller-manager to start crashing when the feature flag is enabled. + -Yes, it can impact already running workloads. 
+This could happen when we downgrade the kube-controller-manager from maxUnavailable-enabled release to a disabled one, +but it will not impact running workloads, because `maxUnavailable` only works in rolling update. -If a rolling update is in progress for a StatefulSet, while this feature is being enabled in kube-apiserver -and kube-controller-manager, the StatefulSet controller can run into corner cases where it will take longer -for the controller to converge. This will only happen if after enabling the feature, the customer also sets -maxUnavailable to a number greater than 1, but the invariants and the logic will ensure that there are never more than -maxUnavailable pods with the same identity and never more than maxUnavailable being deleted. +But if a statefulset has a plenty of replicas, when rolling update, it will take more time comparing to +`maxUnavailable` enabled with a number greater than 1. ###### What specific metrics should inform a rollback? -When feature enabled but rolling update in a unexpected phenomenon like the update pods at a time is not equal to the -`maxUnavailable` value (we may have some other factors impacting this, but this should apply for most of the time) or -rolling update in a unexpected order. + -Or we can refer to the `rolling-update-duration` metric for observation, if it didn't decrease when setting the `maxUnavailable` -great than 1 or the duration increased abnormally, then we should rollback. +When `maxUnavailable` enabled, and we're rolling update a statefulset, the number of pods brought down should +equal to the `maxUnavailable`, if not, we should rollback. ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? @@ -642,10 +742,34 @@ Or when setting the `maxUnavailable` great than 1, the `rolling-update-duration` ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + I think it has little relevance with SLOs, but rolling update at a very low speed which impacts the running services. ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + +- [x] Metrics + - Component exposing the metric: kube-controller-manager + - Metric name: `rolling-update-duration{plugin="PodTopologySpread"}` + - Metric name: `schedule_attempts_total{result="error|unschedulable"}` + None. ###### Are there any missing metrics that would be useful to have to improve observability of this feature? @@ -704,7 +828,7 @@ For each of them, fill in the following information by copying the below templat - Testing: Are there any tests for failure mode? If not, describe why. --> -In a multi-master setup, when the cluster has skewed CCM, the behaviors may different. +In a multi-master setup, when the cluster has skewed kube-controller-manager, the behaviors may differ. - [Failure mode brief description] - Detection: the `rolling-update-duration` didn't decrease when setting the `maxUnavailable` great than 1 or increased abnormally. @@ -717,13 +841,13 @@ In a multi-master setup, when the cluster has skewed CCM, the behaviors may diff ## Implementation History -- KEP Started on 1/1/2019 -- Implementation PR and UT by 8/30 -- Bump to beta at 2023-05-11 +- 2019-01-01: KEP created. +- 2019-08-30: PR Implemented with tests covered. +- 2023-05-11: Bump to Beta. ## Drawbacks -Slow rolling update. +It's an opt-in feature so you can choose to use it or not, then no drawbacks I can think of. 
## Alternatives diff --git a/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml b/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml index 5ea9291dcc8..97317de0602 100644 --- a/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml +++ b/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml @@ -10,9 +10,11 @@ creation-date: 2018-12-29 reviewers: - "@janetkuo" - "@kow3ns" + - "@soltysh" approvers: - "@janetkuo" - "@kow3ns" + - "@soltysh" see-also: [] replaces: [] From c9a8b37765c807b1b9d3fc9abdfa8d6a2d33270d Mon Sep 17 00:00:00 2001 From: kerthcet Date: Thu, 15 Jun 2023 20:41:00 +0800 Subject: [PATCH 5/5] Add new metric rollingUpdateDurationSeconds to statefulset Signed-off-by: kerthcet --- .../README.md | 163 +++++++++++++----- .../kep.yaml | 3 +- 2 files changed, 121 insertions(+), 45 deletions(-) diff --git a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md index a766d05671c..e9cdd2953e4 100644 --- a/keps/sig-apps/961-maxunavailable-for-statefulset/README.md +++ b/keps/sig-apps/961-maxunavailable-for-statefulset/README.md @@ -32,6 +32,7 @@ tags, and then generate with `hack/update-toc.sh`. - [Implementation Details](#implementation-details) - [API Changes](#api-changes) - [Implementation](#implementation) + - [Metrics](#metrics) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -79,16 +80,16 @@ Items marked with (R) are required *prior to targeting to a milestone / release* - [x] (R) KEP approvers have approved the KEP status as `implementable` - [x] (R) Design details are appropriately documented - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - - [ ] e2e Tests for all Beta API Operations (endpoints) + - [x] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free - [ ] (R) Graduation criteria is in place - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) -- [ ] (R) Production readiness review completed +- [x] (R) Production readiness review completed - [ ] (R) Production readiness review approved - [x] "Implementation History" section is up-to-date for milestone -- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] -- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes #### Story 1 + As a User of Kubernetes, I should be able to update my StatefulSet, more than one Pod at a time, in a RollingUpdate manner, if my Stateful app can tolerate more than one pod being down, thus allowing my update to finish much faster. @@ -248,7 +250,9 @@ Go in to as much detail as necessary here. 
This might be a good place to talk about core concepts and how they relate. --> -No. +With `maxUnavailable` feature enabled, we'll bring down more than one pod at a time, if your application can't tolerate +this behavior, you should absolutely disable this feature or leave the field unconfigured, which will behave as the default +behavior. ### Risks and Mitigations @@ -425,6 +429,53 @@ https://github.com/kubernetes/kubernetes/blob/v1.13.0/pkg/controller/statefulset ... ``` +#### Metrics + +We'll add a new metric named `rolling-update-duration-seconds`, it tracks how long a statefulset takes to finish a rolling-update, +the general logic is: + +when [performing update](https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/controller/statefulset/stateful_set_control.go#L97-L138), +if the `currentRevision` != `updateRevision`, then we take it as starting to rolling update, but because rolling update can not +finish in one reconciling loop, then we may have to track it, we have two approaches: + +- One is add a new field in the `defaultStatefulSetControl`, like below: + + ```golang + type defaultStatefulSetControl struct { + podControl *StatefulPodControl + statusUpdater StatefulSetStatusUpdaterInterface + controllerHistory history.Interface + recorder record.EventRecorder + + // + // rollingUpdateStartTimes records when each statefulset starts to perform rolling update. + // key is the statefulset name, value is the start time. + rollingUpdateStartTimes map[string]time.Time + } + ``` + + Then when `currentRevision` != `updateRevision`, we'll check whether the statefulset name + exists in the `rollingUpdateStartTimes`, if exists, skip, if not, write the + start time. + + If we found rolling update completed in [updateStatefulSetStatus](https://github.com/kubernetes/kubernetes/blob/c984d53b31655924b87a57bfd4d8ff90aaeab9f8/pkg/controller/statefulset/stateful_set_control.go#L682-L701), + then we'll report the metrics here and clear the statefulset from the `rollingUpdateStartTimes`. + +- Another approach is add a new field to `StatefulSetStatus` as below: + + ```golang + type StatefulSetStatus struct { + + // rollingUpdateStartTime will record when the statefulSet starts to rolling update. + rollingUpdateStartTime *time.Time + } + ``` + + The logic general looks the same, we'll update the time together with status update, so there's no + extra api calls. + + Prefer Opt-2 as we already record the rolling update messages in the status. + ### Test Plan -We have controller tests in `pkg/controller/statefulset` cover all cases. What's more, -rolling update is hard to simulate in integration tests because we want to verify the pod -status but actually they're faked. But we'll have e2e tests instead. +Controller tests in `pkg/controller/statefulset` cover all cases. Since pod statuses +are faked in integration tests, simulating rolling update is difficult. We're opting +for e2e tests instead. ##### e2e tests @@ -590,6 +646,8 @@ in back-to-back releases. #### Beta - Enabled by default with default value of 1 with upgrade downgrade tested at least manually. +- Additional e2e tests listed in the test plan should be added. +- No critical bugs reported and no bad feedbacks collected. ### Upgrade / Downgrade Strategy @@ -609,19 +667,19 @@ enhancement: - What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade, in order to maintain previous behavior? 
- - Nothing special, it's backward compatibility + - This is a new field, it will only be enabled after an upgrade. To maintain the + previous behavior, you can disable the feature gate manually in Beta or leave + the field unconfigured. - What changes (in invocations, configurations, API use, etc.) is an existing cluster required to make on upgrade, in order to make use of the enhancement? - - This feature is enabled by default in beta so you can configure the maxUnavailable to meet your requirements + - This feature is enabled by default in Beta so you can configure the maxUnavailable + to meet your requirements #### Downgrade -when downgrade from a release `maxUnavailable` enabled to a release it's disabled: - -- If we downgrade to a release even doesn't have this field, then there's no way to discover it. -- Else, if we create statefulset with maxUnavailable configured, it will be dropped to nil automatically. - -Anyway, we'll back to that time rolling update with a single unavailable pod. +- If we try to set this field, it will be silently dropped. +- If the object with this field already exists, we maintain is as it is, but the controller + will just ignore this field. ### Version Skew Strategy @@ -638,10 +696,13 @@ enhancement: CRI or CNI may require updating that component before the kubelet. --> -- If both api-server and kube-controller-manager enabled this feature, then everything works as expected -- If only api-server enabled this feature, then we can configure the `maxUnavailable` field but it will not work. -- If only kube-controller-manager enabled this feature, then for newly created statefulSet, the `maxUnavailable` will default to 1. -- If both api-server and kube-controller-manager disabled this feature, then everything works as expected. +- If both api-server and kube-controller-manager enabled this feature, then `maxUnavailable` will take effect. +- If only api-server enabled this feature, because statefulset controller is unaware of this feature, then we'll + fall into the default rolling-update behavior, one pod at a time. +- If only kube-controller-manager enabled this feature, then this field is invisible to users, we'll keep the default + rolling-update behavior, one pod at a time. +- If both api-server and kube-controller-manager disabled this feature, then only one Pod will be updated + at a time as the default behavior. ## Production Readiness Review Questionnaire @@ -659,13 +720,12 @@ No, the default behavior remains the same. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes this feature can be disabled. Once disabled, all existing StatefulSet will -revert to the old behavior where rolling update will proceed one pod at a time. +Yes, you can disable the feature-gate manually once it's in Beta. ###### What happens if we reenable the feature if it was previously rolled back? -We will restore the desired behavior for StatefulSets for which the maxUnavailable field wasn't deleted after -the feature gate was disabled. +If we disable the feature-gate and reenable it again, then the `maxUnavailable` will take effect again, +that's say we'll restart to respect the value of `maxUnavailable` when rolling update. ###### Are there any tests for feature enablement/disablement? 
### Rollout, Upgrade and Rollback Planning

###### How can a rollout or rollback fail? Can it impact already running workloads?

<!--
Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout.

Similarly, consider large clusters and how enablement/disablement will
rollout across nodes.
-->

-This could happen when we downgrade the kube-controller-manager from maxUnavailable-enabled release to a disabled one,
-but it will not impact running workloads, because `maxUnavailable` only works in rolling update.
+It depends on the situation, for example:
+
+- In an HA cluster, if some of the API servers have this feature gate enabled but others do not,
+  already running workloads will not be impacted during a rolling update, because the process is
+  controlled by the statefulset controller. But for newly created statefulsets, the `maxUnavailable`
+  field will be dropped whenever the request lands on an api-server with the feature gate disabled.

-But if a statefulset has a plenty of replicas, when rolling update, it will take more time comparing to
-`maxUnavailable` enabled with a number greater than 1.
+- If a statefulset is in the middle of a rolling update with this feature enabled and we then
+  disable the feature gate, that statefulset will be impacted: it falls back to the default
+  behavior of updating one pod at a time.

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

-When `maxUnavailable` enabled, and we're rolling update a statefulset, the number of pods brought down should
+- As a normal user, with `maxUnavailable` enabled and a statefulset rolling update in progress, the
+  number of pods brought down at a time should
equal the `maxUnavailable`; if not, we should roll back.
+- As an administrator (and also as a normal user), I can check the new metric
+  `rolling-update-duration-seconds`: if we enabled the feature with `maxUnavailable` greater than 1
+  for our statefulsets but didn't see the duration fall, we may have a problem. This is not
+  precise, because the rolling-update duration can be impacted by several factors, e.g. pulling a
+  big new image, but it can serve as an indicator.
+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

No, but it will be tested manually before merging the PR.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
No

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?
+
If their StatefulSet rollingUpdate section has the field maxUnavailable specified with a value different than 1.
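+For example, a StatefulSet using the feature would carry the following (trimmed to the relevant
+fields; the numbers are illustrative):
+
+```yaml
+apiVersion: apps/v1
+kind: StatefulSet
+spec:
+  replicas: 5
+  updateStrategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxUnavailable: 2   # up to 2 pods may be unavailable at once during the update
+```
+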
The below command should show maxUnavailable value:

```
kubectl get statefulsets -o yaml | grep maxUnavailable
```

@@ -732,13 +805,14 @@
-Or refer to the new metric `rolling-update-duration`, it should exist.
+Or refer to the new metric `rolling-update-duration-seconds`; it should exist.

###### How can someone using this feature know that it is working for their instance?

With the feature enabled, set `maxUnavailable` greater than 1 and pay attention to how many pods are
updated at a time during a rolling update; it should equal the `maxUnavailable`.
-Or when setting the `maxUnavailable` great than 1, the `rolling-update-duration` should decrease.
+Or refer to `rolling-update-duration-seconds`, which can give a general indication: when setting
+`maxUnavailable` greater than 1, the duration should decrease.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

@@ -757,7 +831,9 @@
These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

-I think it has little relevance with SLOs, but rolling update at a very low speed which impacts the running services.
+Rolling update duration is impacted by many factors, such as container hooks, image size, network,
+and container logic, so we can only reason roughly about each individual statefulset. Generally, if
+we set `maxUnavailable` to X, the rolling update duration should shrink by roughly a factor of X.
+Keep in mind that this is not precise.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
-->

- [x] Metrics
-  - Component exposing the metric: kube-controller-manager
-  - Metric name: `rolling-update-duration{plugin="PodTopologySpread"}`
-  - Metric name: `schedule_attempts_total{result="error|unschedulable"}`
-
-None.
+  - Metric name: `rolling-update-duration-seconds`
+  - Components exposing the metric: kube-controller-manager

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -797,11 +870,14 @@
No

A struct gets added to every StatefulSet object which has three fields: an enum-typed discriminator,
a 32 bit integer, and a string. The struct in question is `IntOrString`.
+If we introduce the new metric with Opt-2, then we'll also have a new field `rollingUpdateStartTime`
+of type `*Time` in the `StatefulSetStatus`.
+
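+For reference, a sketch of the shapes involved (field tags omitted; comments are illustrative):
+
+```golang
+package apps
+
+import "k8s.io/apimachinery/pkg/util/intstr"
+
+// IntOrString, defined in k8s.io/apimachinery/pkg/util/intstr, is the struct
+// in question; Type is an integer-backed enum discriminating the two variants:
+//
+//	type IntOrString struct {
+//		Type   Type   // Int or String
+//		IntVal int32  // e.g. maxUnavailable: 2
+//		StrVal string // e.g. maxUnavailable: "10%"
+//	}
+
+// RollingUpdateStatefulSetStrategy gains the new optional field; a nil value
+// keeps the default of one unavailable pod at a time.
+type RollingUpdateStatefulSetStrategy struct {
+	Partition      *int32
+	MaxUnavailable *intstr.IntOrString
+}
+```
+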
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

-The controller-manager will see very negligible and almost un-notoceable increase in cpu usage.
+The controller-manager will see a negligible, almost unnoticeable increase in CPU usage.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

@@ -828,14 +904,13 @@ For each of them, fill in the following information by copying the below templat
- Testing: Are there any tests for failure mode? If not, describe why.
-->

-In a multi-master setup, when the cluster has skewed kube-controller-manager, the behaviors may differ.
+The feature gate is disabled while a rolling update is in progress.
- [Failure mode brief description]
-  - Detection: the `rolling-update-duration` didn't decrease when setting the `maxUnavailable` great than 1 or increased abnormally.
-  - Mitigations: Disable the feature.
+  - Detection: The `maxUnavailable` is not respected.
+  - Mitigations: If updating pods in parallel is tolerable for the application, set
+    `podManagementPolicy=Parallel`.
  - Diagnostics: Set the logger level greater than 4.
-  - Testing: No testing, because the rolling update duration is hard to measure, it can be impact by a lot of things,
-    like the master performance.
+  - Testing: We have e2e tests covering the expected behavior.

###### What steps should be taken if SLOs are not being met to determine the problem?

diff --git a/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml b/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml
index 97317de0602..dfa21492280 100644
--- a/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml
+++ b/keps/sig-apps/961-maxunavailable-for-statefulset/kep.yaml
@@ -40,7 +40,8 @@ feature-gates:
components:
- kube-controller-manager
- kube-apiserver
+disable-supported: true

# The following PRR answers are required at beta release
metrics:
-- rolling-update-duration
+- rolling-update-duration-seconds