Commit 6980284

Add KEP for volume scheduling limits
1 parent dee70c1 commit 6980284

Lines changed: 247 additions & 0 deletions
@@ -0,0 +1,247 @@
1+
---
2+
title: Volume Scheduling Limits
3+
authors:
4+
- "@jsafrane"
5+
owning-sig: sig-storage
6+
participating-sigs:
7+
- sig-scheduling
8+
reviewers:
9+
- "@bsalamat"
10+
- "@gnufied"
11+
- "@davidz627"
12+
approvers:
13+
- "@bsalamat"
14+
- "@davidz627"
15+
editor: TBD
16+
creation-date: 2019-04-08
17+
last-updated: 2019-04-08
18+
status: implementable
19+
see-also:
20+
- https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md
21+
replaces: https://github.com/kubernetes/enhancements/pull/730
22+
superseded-by:
23+
---
24+
25+
# Volume Scheduling Limits
26+
27+
## Table of Contents
28+
29+
- [Title](#title)
30+
- [Table of Contents](#table-of-contents)
31+
- [Release Signoff Checklist](#release-signoff-checklist)
32+
- [Summary](#summary)
33+
- [Motivation](#motivation)
34+
- [Goals](#goals)
35+
- [Non-Goals](#non-goals)
36+
- [Proposal](#proposal)
37+
- [User Stories [optional]](#user-stories-optional)
38+
- [Story 1](#story-1)
39+
- [Story 2](#story-2)
40+
- [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional)
41+
- [Risks and Mitigations](#risks-and-mitigations)
42+
- [Design Details](#design-details)
43+
- [Test Plan](#test-plan)
44+
- [Graduation Criteria](#graduation-criteria)
45+
- [Examples](#examples)
46+
- [Alpha -> Beta Graduation](#alpha---beta-graduation)
47+
- [Beta -> GA Graduation](#beta---ga-graduation)
48+
- [Removing a deprecated flag](#removing-a-deprecated-flag)
49+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
50+
- [Version Skew Strategy](#version-skew-strategy)
51+
- [Implementation History](#implementation-history)
52+
- [Drawbacks [optional]](#drawbacks-optional)
53+
- [Alternatives [optional]](#alternatives-optional)
54+
- [Infrastructure Needed [optional]](#infrastructure-needed-optional)
55+
56+
## Release Signoff Checklist
57+
58+
- [ ] kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
59+
- [ ] KEP approvers have set the KEP status to `implementable`
60+
- [ ] Design details are appropriately documented
61+
- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
62+
- [ ] Graduation criteria is in place
63+
- [ ] "Implementation History" section is up-to-date for milestone
64+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
65+
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
66+
67+
**Note:** Any PRs to move a KEP to `implementable`, or significant changes once it is marked `implementable`, should be approved by each of the KEP approvers. If any of those approvers is no longer appropriate, then changes to that list should be approved by the remaining approvers and/or the owning SIG (or SIG-arch for cross-cutting KEPs).
68+
69+
## Summary
70+
71+
The number of volumes of a given type that can be attached to a node should be easy to configure and should depend on the node type. This proposal implements dynamic attachable volume limits on a per-node basis, replacing the cluster-wide defaults that exist today. It also implements a way of configuring volume limits for CSI volumes.
72+
73+
This proposal replaces [#730](https://github.com/kubernetes/enhancements/pull/730) and integrates volume limits for in-tree volumes (AWS EBS, GCE PD, Azure DD, OpenStack Cinder) and CSI into one predicate. As a result, an in-tree volume plugin and its corresponding CSI driver can share the same volume limit.
74+
75+
## Motivation
76+
77+
The current scheduler predicates for scheduling pods with volumes are based on `node.status.capacity` and `node.status.allocatable`. This works well for the hardcoded volume limit predicates on AWS (`MaxEBSVolumeCount`), GCE (`MaxGCEPDVolumeCount`), Azure (`MaxAzureDiskVolumeCount`) and OpenStack (`MaxCinderVolumeCount`).
78+
79+
It is problematic for CSI (`MaxCSIVolumeCountPred`), as outlined in [#730](https://github.com/kubernetes/enhancements/pull/730):
80+
81+
- `ResourceName` is limited to 63 characters. We must prefix `ResourceName` with a unique string (such as `attachable-volumes-csi-<driver name>`) so it cannot collide with existing resources like `cpu` or `memory`. But `<driver name>` itself can be up to 63 characters long, so we ended up using SHA sums of the driver name to keep the `ResourceName` unique, which is not human-readable.
82+
- A CSI driver cannot share its limits with an in-tree volume plugin, e.g. when running pods with AWS EBS in-tree volumes and the `ebs.csi.aws.com` CSI driver on the same node.
83+
- The size of `node.status` increases with each installed CSI driver. Node objects are big enough already.
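
The naming problem in the first bullet can be illustrated with a minimal sketch. The function name `csi_limit_key` and the exact hashing scheme are assumptions made for illustration, not the scheme Kubernetes actually uses; only the 63-character limit and the `attachable-volumes-csi-` prefix come from the text above.

```python
import hashlib

RESOURCE_NAME_LIMIT = 63                 # limit stated in the text above
CSI_PREFIX = "attachable-volumes-csi-"   # prefix mentioned in the text above


def csi_limit_key(driver_name: str) -> str:
    """Hypothetical helper: build a ResourceName for a CSI driver's limit.

    If prefix + driver name fits into the limit, the readable form is used.
    Otherwise the driver name is replaced by its SHA-1 hex digest
    (40 characters), which fits but is no longer human-readable.
    """
    key = CSI_PREFIX + driver_name
    if len(key) <= RESOURCE_NAME_LIMIT:
        return key
    digest = hashlib.sha1(driver_name.encode("utf-8")).hexdigest()
    return CSI_PREFIX + digest
```

A short driver name such as `ebs.csi.aws.com` stays readable, while a driver name near the maximum length collapses into an opaque digest. That loss of readability is the drawback this KEP wants to avoid.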
84+
85+
### Goals
86+
87+
- Users can use PVs with both in-tree volume plugins and CSI, and the two will share their limits. There is only one scheduler predicate that handles both kinds of volumes.
88+
89+
- Existing predicates for in-tree volumes `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` are removed (after a deprecation period).
90+
- When both a deprecated in-tree predicate and the CSI predicate are enabled, only one of them does useful work and the other is a no-op, to save CPU.
91+
92+
- Any increase in CPU consumption, as measured by the [scheduler benchmark](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scheduling/scheduler_benchmarking.md), is approved by sig-scheduling. There should be no performance regression in the ideal case.
93+
94+
### Non-Goals
95+
96+
97+
## Proposal
98+
99+
* Track volume limits for both in-tree volume plugins and CSI drivers in `CSINode` objects instead of `Node`.
100+
* The `Node` object is already big enough.
101+
* Get rid of prefix + SHA for `ResourceName` of CSI volumes.
102+
* In-tree volume plugin can share limits with CSI driver that uses the same storage backend.
103+
104+
* Kubelet will create a `CSINode` instance during initial node registration, together with the `Node` object.
105+
* Limits of each in-tree volume plugin will be added to `CSINode.status.capacity` and `CSINode.status.allocatable`.
106+
* The name of the CSI driver corresponding to an in-tree volume plugin will be used as the `ResourceName` for that plugin, so the limits are shared between the CSI driver and the in-tree volume plugin when both are running on the same node.
107+
* If a CSI driver is registered for an in-tree volume plugin and it reports a different volume limit than the in-tree volume plugin, the limit reported by the CSI driver is used. (TODO: should kubelet `exit()`?)
108+
* Kubelet will reconcile `CSINode.status.capacity`, overwriting any user changes.
109+
* Users may change `CSINode.status.allocatable` to override volume plugin / CSI driver values, e.g. to "reserve" some attachments for the operating system.
110+
* Kubelet will continue filling `Node.status.allocatable` and `Node.status.capacity` for both in-tree and CSI volumes during the deprecation period. After the deprecation period, it will stop using them completely.
111+
* User changes to `Node.status.allocatable` and `Node.status.capacity` will be ignored if `CSINode.status.allocatable` or `CSINode.status.capacity` is present.
112+
* To support old kubelets, `MaxCSIVolumeCountPred` (or any deprecated volume limit predicate) falls back to `Node.status.allocatable` / `Node.status.capacity` when `CSINode` does not contain any limits for a volume plugin.
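
The kubelet-side rules above (shared key, CSI-reported limit wins, capacity always overwritten, allocatable may carry a user override) can be sketched as follows. This is an illustrative model only: the function name `reconcile_csinode_status`, the plain-dict representation and the use of integers instead of quantity strings are all assumptions, and the open TODO about kubelet exiting on a mismatch is not modelled.

```python
from typing import Dict, Optional


def reconcile_csinode_status(
    in_tree_limits: Dict[str, int],
    csi_limits: Dict[str, int],
    user_allocatable: Optional[Dict[str, int]] = None,
) -> Dict[str, Dict[str, int]]:
    """Hypothetical model of how kubelet could fill the CSINode status.

    Both limit maps are keyed by CSI driver name; for an in-tree plugin
    this is the name of its corresponding CSI driver.
    """
    # Start from in-tree limits, then let CSI-reported limits win on conflict.
    capacity = dict(in_tree_limits)
    capacity.update(csi_limits)

    # Capacity is always rewritten from plugin reports. Allocatable defaults
    # to capacity but keeps a user override for drivers that still exist.
    allocatable = dict(capacity)
    for driver, value in (user_allocatable or {}).items():
        if driver in capacity:
            allocatable[driver] = value
    return {"capacity": capacity, "allocatable": allocatable}
```

For instance, with an in-tree limit of 39 and a CSI-reported limit of 25 for `ebs.csi.aws.com`, the sketch yields a capacity of 25, and a user-set allocatable of 20 is preserved.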
113+
114+
CSINode example:
115+
116+
```
apiVersion: storage.k8s.io/v1beta1
kind: CSINode
metadata:
  name: ip-172-18-4-112.ec2.internal
spec:
status:
  capacity:
    "ebs.csi.aws.com": "39"
  allocatable:
    "ebs.csi.aws.com": "39"
```
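
On the scheduler side, reading a limit from such an object and falling back to `Node.status.allocatable` for old kubelets might look like the sketch below. The function name `effective_volume_limit` and the `legacy_key` parameter are hypothetical, and the legacy key `attachable-volumes-aws-ebs` used in the usage note is an assumption about how the in-tree limit is named in `Node.status`.

```python
from typing import Dict, Optional


def effective_volume_limit(
    driver: str,
    csinode_allocatable: Optional[Dict[str, str]],
    node_allocatable: Dict[str, str],
    legacy_key: str,
) -> Optional[int]:
    """Hypothetical lookup of the attach limit for one CSI driver name.

    Prefers the CSINode allocatable value; if CSINode carries no limit for
    the driver (for example with an old kubelet), falls back to the legacy
    key in Node.status.allocatable. Returns None if no limit is known.
    """
    if csinode_allocatable and driver in csinode_allocatable:
        return int(csinode_allocatable[driver])
    if legacy_key in node_allocatable:
        return int(node_allocatable[legacy_key])
    return None
```

With the example object above, `effective_volume_limit("ebs.csi.aws.com", {"ebs.csi.aws.com": "39"}, {}, "attachable-volumes-aws-ebs")` returns `39`. With no CSINode limits, the value stored under the legacy key in the node's allocatable map is used instead.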
128+
129+
* Existing scheduler predicates `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` are already deprecated.
130+
* If any of them is enabled together with `MaxCSIVolumeCountPred`, the deprecated predicate will do nothing (`MaxCSIVolumeCountPred` does the job of counting both in-tree and CSI volumes).
131+
* The deprecated predicates will filter pods only when `MaxCSIVolumeCountPred` predicate is disabled.
132+
* This way, we save CPU by running only one volume limit predicate during the deprecation period.
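
The interaction between the deprecated predicates and `MaxCSIVolumeCountPred` can be summarized in a small sketch. The predicate names are taken from the text; the helper names `predicate_is_active` and `pod_fits`, and the rule that a volume already attached to the node is counted only once, are assumptions for illustration.

```python
from typing import Iterable, Set

CSI_PREDICATE = "MaxCSIVolumeCountPred"
DEPRECATED_PREDICATES = {
    "MaxEBSVolumeCount",
    "MaxGCEPDVolumeCount",
    "MaxAzureDiskVolumeCount",
    "MaxCinderVolumeCount",
}


def predicate_is_active(name: str, enabled: Set[str]) -> bool:
    """Return True if the named predicate should actually filter pods."""
    if name not in enabled:
        return False
    if name in DEPRECATED_PREDICATES:
        # Deprecated predicates do nothing while the CSI predicate is enabled.
        return CSI_PREDICATE not in enabled
    return True


def pod_fits(attached: Iterable[str], requested: Iterable[str], limit: int) -> bool:
    """Check one driver's limit; volume IDs may be in-tree or CSI volumes.

    Assumption: a volume already attached to the node is not counted twice.
    """
    attached_set = set(attached)
    new_volumes = set(requested) - attached_set
    return len(attached_set) + len(new_volumes) <= limit
```

When both `MaxEBSVolumeCount` and `MaxCSIVolumeCountPred` are enabled, only the latter reports as active, so the volume count for a driver is evaluated once per node.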
133+
134+
135+
### New API
136+
137+
TODO
138+
139+
### User Stories
140+
141+
TODO: fill? Is there any interesting user story?
142+
143+
#### Story 1
144+
145+
#### Story 2
146+
147+
### Implementation Details/Notes/Constraints
148+
149+
The [CSI migration library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib) is used to find the CSI driver name for each in-tree volume plugin. This CSI driver name is used as the key in `CSINode.status.capacity` and `CSINode.status.allocatable`.
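
Conceptually, the translation is a lookup from in-tree plugin name to CSI driver name. The sketch below is not the library's API: the table `IN_TREE_TO_CSI` and the function `limit_key` are hypothetical, and apart from `ebs.csi.aws.com`, which appears in this KEP, the plugin and driver names are assumptions that should be checked against the library itself.

```python
# Hypothetical translation table; only the EBS driver name comes from this KEP.
IN_TREE_TO_CSI = {
    "kubernetes.io/aws-ebs": "ebs.csi.aws.com",
    "kubernetes.io/gce-pd": "pd.csi.storage.gke.io",        # assumed name
    "kubernetes.io/azure-disk": "disk.csi.azure.com",       # assumed name
    "kubernetes.io/cinder": "cinder.csi.openstack.org",     # assumed name
}


def limit_key(plugin_or_driver: str) -> str:
    """Return the key under which a volume's limit is tracked in CSINode.

    In-tree plugin names are translated to their CSI driver name; anything
    else is treated as a CSI driver name and used unchanged.
    """
    return IN_TREE_TO_CSI.get(plugin_or_driver, plugin_or_driver)
```

Because an in-tree EBS volume and an `ebs.csi.aws.com` volume resolve to the same key, both count against one shared limit on the node.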
150+
151+
### Risks and Mitigations
152+
153+
* This KEP depends on the [CSI migration library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib). CSI migration may still be redesigned or cancelled.
154+
* Countermeasure: [CSI migration](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md) and this KEP should graduate together.
155+
156+
## Design Details
157+
158+
The existing feature gate `AttachVolumeLimit` will be reused for the implementation of this KEP. The feature is already beta and is enabled by default.
159+
160+
### Test Plan
161+
162+
**Note:** *Section not required until targeted at a release.*
163+
164+
Consider the following in developing a test plan for this enhancement:
165+
- Will there be e2e and integration tests, in addition to unit tests?
166+
- How will it be tested in isolation vs with other components?
167+
168+
No need to outline all of the test cases, just the general strategy.
169+
Anything that would count as tricky in the implementation and anything particularly challenging to test should be called out.
170+
171+
All code is expected to have adequate tests (eventually with coverage expectations).
172+
Please adhere to the [Kubernetes testing guidelines][testing-guidelines] when drafting this test plan.
173+
174+
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
175+
176+
### Graduation Criteria
177+
178+
**Note:** *Section not required until targeted at a release.*
179+
180+
Define graduation milestones.
181+
182+
These may be defined in terms of API maturity, or as something else. Initial KEP should keep
183+
this high-level with a focus on what signals will be looked at to determine graduation.
184+
185+
Consider the following in developing the graduation criteria for this enhancement:
186+
- [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
187+
- [Deprecation policy][deprecation-policy]
188+
189+
Clearly define what graduation means by either linking to the [API doc definition](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning),
190+
or by redefining what graduation means.
191+
192+
In general, we try to use the same stages (alpha, beta, GA), regardless of how the functionality is accessed.
193+
194+
[maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
195+
[deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
196+
197+
#### Examples
198+
199+
TODO
200+
201+
##### Alpha -> Beta Graduation
202+
203+
N/A (`AttachVolumeLimit` feature is already beta).
204+
205+
##### Beta -> GA Graduation
206+
207+
TODO
208+
209+
##### Removing a deprecated flag
210+
211+
- Announce deprecation and support policy of the existing flag
212+
- Two versions passed since introducing the functionality which deprecates the flag (to address version skew)
213+
- Address feedback on usage/changed behavior, provided on GitHub issues
214+
- Deprecate the flag
215+
216+
**For non-optional features moving to GA, the graduation criteria must include [conformance tests].**
217+
218+
[conformance tests]: https://github.com/kubernetes/community/blob/master/contributors/devel/conformance-tests.md
219+
220+
### Upgrade / Downgrade Strategy
221+
222+
TODO!
223+
224+
* New scheduler and old kubelet / kubelet with `AttachVolumeLimit` feature disabled:
225+
* `CSINode` does not contain `CSINode.status`: scheduler must fall back to `Node.status`.
226+
227+
228+
### Version Skew Strategy
229+
230+
If applicable, how will the component handle version skew with other components? What are the guarantees? Make sure
231+
this is in the test plan.
232+
233+
Consider the following in developing a version skew strategy for this enhancement:
234+
- Does this enhancement involve coordinating behavior in the control plane and in the kubelet? How does an n-2 kubelet without this feature available behave when this feature is used?
235+
- Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet.
236+
237+
## Implementation History
238+
239+
Major milestones in the life cycle of a KEP should be tracked in `Implementation History`.
240+
Major milestones might include
241+
242+
- the `Summary` and `Motivation` sections being merged signaling SIG acceptance
243+
- the `Proposal` section being merged signaling agreement on a proposed design
244+
- the date implementation started
245+
- the first Kubernetes release where an initial version of the KEP was available
246+
- the version of Kubernetes where the KEP graduated to general availability
247+
- when the KEP was retired or superseded
