-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Add PRR information for maxUnavailable #2688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| - "@janetkuo" | ||
| - "@kow3ns" | ||
| prr-approvers: | ||
| - "@johnbelamaric" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove of it (we replaced that with prod-readiness/ files)
[alternatively, please change to me]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
|
||
| - [x] Feature gate (also fill in values in `kep.yaml`) | ||
| - Feature gate name: MaxUnavailableStatefulSet | ||
| - Components depending on the feature gate: StatefulSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kube-apiserver and kube-controller-manager
[in kube-apiserver you need a handling for the new field - the field is accepted only if feature gate is enabled]
| ###### What happens if we reenable the feature if it was previously rolled back? | ||
|
|
||
| If the feature is enabled and then later disabled, we will revert the value of | ||
| maxUnavailable to the default value so that customers are not surprised. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not what we want (unless I misunderstood you) - see the comment above also.
If the field is set in the API, it shouldn't silently disappeared - it's just the controller who should ignore its existence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link above is about controller code - which is useful to know. But i would also like to explicitly write that the field in the object itself is not modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text is misleading, because we won't be reverting the value itself.
Also - it's about disabling - not reenabling.
Can you pleas change to something along the lines with: we will restore the desired behavior for StatefulSets for which the maxunavailable field wasn't deleted after the feature gate was disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you actually delete a field after a feature gate was disabled ?
|
|
||
| ###### What specific metrics should inform a rollback? | ||
| The customer should look at the behavior of the maxUnavailable and if the rollout | ||
| is not respecting the maxUnavailable, the feature should be rolledback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can they look at it? metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added more specific data, they are not really metrics
| is not respecting the maxUnavailable, the feature should be rolledback. | ||
|
|
||
| ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? | ||
| No |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add that it will be tested when graduating to Beta (and add this to graduation criteria).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| ``` | ||
|
|
||
| ###### How can someone using this feature know that it is working for their instance? | ||
| Ensure the number of replicas in your StatefulSet is more than 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't really what we want. We don't ask how to test the behavior.
We're rather asking how to check if the feature is generally working.
[The question is not strictly required for Alpha - but basically I would expect some metric to be exposed from statefule-set controller that will be reporting e.g. if during any reconciliation more pods than expected were not ready, or sth like that]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some more data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - let me be more explicit. This isn't actionable for cluster operators at all.
An operator cannot track upgrades of individual statefulsets and check what is happening during those.
The operator usually also cannot trigger actions on existing StatefulSets or created their own.
Metrics are the mechanism for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to TODO for when we target beta. @kow3ns if you have advice on this or adding something to this effect in alpha
| ### Dependencies | ||
|
|
||
| ###### Does this feature depend on any specific services running in the cluster? | ||
| controller-manager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove - this is somewhat obvious.
There aren't any additional dependencies (we more want to ensure that you don't depend on things like dns, additonal agents, cloud-provider, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| A new field gets added to every StatefulSet object which occupies 32bits. | ||
|
|
||
| ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? | ||
| The increasing time per StatefulSet for rolling updates should be negligible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have any SLOs that cover statefulsets.
So the answer is basically "no"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
| #### Implementation | ||
|
|
||
| TBD: Will be updated after we have agreed on the semantics being discussed above. | ||
| The alpha release we are going with Choice 4 with support for both PMP=Parallel and PMP=OrderedReady. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parallel is clear to use behavior 2. What is the behavior for OrderedReady? 3 would make the most sense to me, but was there consensus?
Better remove the paragraph Recommended Choice and just go with the details in Implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| ## Graduation Criteria | ||
|
|
||
| - Alpha: Initial support for maxUnavailable in StatefulSets added. Disabled by default. | ||
| - Beta: Enabled by default with default value of 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to imply that default value is not 1 in alpha when the feature is enabled. Please clarify.
| ### Scalability | ||
|
|
||
| ###### Will enabling / using this feature result in any new API calls? | ||
| This feature when set, makes an extra loop over the total pods that belong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
first sentence seems unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
| to the StatefulSet. It doesnt make any extra API calls. | ||
|
|
||
| ###### Will enabling / using this feature result in introducing new API types? | ||
| This feature adds a new field in the StatefulSet RollingUpdate, called maxUnavailable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
answer here should be no
|
|
||
|
|
||
| ###### Will enabling / using this feature result in increasing size or count of the existing API objects? | ||
| A new field gets added to every StatefulSet object which occupies 32bits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it? doesn't percentage require more bits to represent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
| ###### What happens if we reenable the feature if it was previously rolled back? | ||
|
|
||
| If the feature is enabled and then later disabled, we will revert the value of | ||
| maxUnavailable to the default value so that customers are not surprised. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link above is about controller code - which is useful to know. But i would also like to explicitly write that the field in the object itself is not modified.
|
|
||
| ###### How can a rollout or rollback fail? Can it impact already running workloads? | ||
|
|
||
| No, a rollout or rollback cannot fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rollout can always fail. One can imagine a bug in strategy that is causing kube-apiserver to crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when you say bug in strategy what do you mean ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the text
| to a very high number and due to some bug, the StatefulSet pods are not coming up. | ||
|
|
||
| ###### What specific metrics should inform a rollback? | ||
| The customer should look at StatefulSet.Status.Replicas to ensure if that number is always |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a metric. An operator cannot track that for all statefulsets potentially across many different clusters.
We really need a metric for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[This question is not strictly required for Beta - so you can change to TODO or sth and figure it out for beta.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to TODO
| ``` | ||
|
|
||
| ###### How can someone using this feature know that it is working for their instance? | ||
| Ensure the number of replicas in your StatefulSet is more than 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - let me be more explicit. This isn't actionable for cluster operators at all.
An operator cannot track upgrades of individual statefulsets and check what is happening during those.
The operator usually also cannot trigger actions on existing StatefulSets or created their own.
Metrics are the mechanism for this.
|
|
||
| - Upgrades | ||
| When upgrading from a release without this feature, to a release with maxUnavailable, we will set maxUnavailable to 1. This would give users the same default | ||
| When upgrading from a release without this feature, to a release with maxUnavailable, we will default maxUnavailable to 1. This would give users the same default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not about upgrades per-se. Let's state that "we will default to 1 for backward compatibility".
| - Downgrades | ||
| When downgrading from a release with this feature, to a release without maxUnavailable, there are two cases | ||
| -- if maxUnavailable is greater than 1 -- in this case user can see unexpected behavior(Find out what is the recommendation here(Warning, disable upgrade, drop field, etc? ) | ||
| -- if maxUnavailable is greater than 1, there are two more cases:- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you make it look like list in the "view" format?
| ###### What happens if we reenable the feature if it was previously rolled back? | ||
|
|
||
| If the feature is enabled and then later disabled, we will revert the value of | ||
| maxUnavailable to the default value so that customers are not surprised. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text is misleading, because we won't be reverting the value itself.
Also - it's about disabling - not reenabling.
Can you pleas change to something along the lines with: we will restore the desired behavior for StatefulSets for which the maxunavailable field wasn't deleted after the feature gate was disabled
| maxUnavailable to a number greater than 1, but the invariants and the logic will ensure that there are never more than | ||
| maxUnavailable pods with the same identity and never more than maxUnavailable being deleted. | ||
|
|
||
| It can impact already running workloads if service owner accidentally sets maxUnavailable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this one - if someone "accidentally sets a feature" - many features cna be harmful :)
|
I'm fine with this for Alpha /approve PRR |
soltysh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
this is good starting point for alpha, it will have to go through more careful scrutiny for beta
| ## Test Plan | ||
| For `Alpha`, unit tests and e2e tests will be added to test functionality at both | ||
| with feature flag enabled and disabled. Defaults will be verified so that users | ||
| who donot set this flag are not surprised at all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do not
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: krmayankk, soltysh, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Hello @krmayankk 👋, 1.22 Docs Shadow here. Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT. Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release. Thank you! 🙏 |
Ref #961