keps/sig-storage/20190408-volume-scheduling-limits.md

@@ -75,7 +75,6 @@ It is problematic for CSI (`MaxCSIVolumeCountPred`) outlined in [#730](https://g

- `ResourceName` is limited to 63 characters. We must prefix `ResourceName` with a unique string (such as `attachable-volumes-csi-<driver name>`) so it cannot collide with existing resources like `cpu` or `memory`. But `<driver name>` itself can be up to 63 characters long, so we ended up using SHA sums of driver names to keep `ResourceName` unique, which is not user readable.
- CSI driver cannot share its limits with an in-tree volume plugin, e.g. when running pods with AWS EBS in-tree volumes and the `ebs.csi.aws.com` CSI driver on the same node.

### Goals

@@ -90,6 +89,14 @@ It is problematic for CSI (`MaxCSIVolumeCountPred`) outlined in [#730](https://g

- Heterogeneous clusters, i.e. clusters where access to storage is limited to only some nodes. Existing `PV.spec.nodeAffinity` handling, not modified by this KEP, will filter out nodes that don't have access to the storage, so the predicates changed in this KEP don't need to worry about storage topology and can be simpler.
- Scheduling based on availability / health of CSI drivers on nodes. `CSINode` can be used to check which nodes have a driver installed, and it could be extended to report health so that the scheduler puts pods only on nodes with an installed and healthy driver. This is out of scope of this KEP.
- Multiple plugins sharing the same volume limits. We expect that every CSI driver will have its own limits, not shared with other CSI drivers. In this KEP we support only in-tree volume plugins sharing their limits with one hard-coded CSI driver each.
- Multiple "units" per single volume. Each volume used on a node takes exactly 1 unit from `allocatable.volumes`, regardless of the volume size, its replica count, number of connections to remote servers or other underlying resources needed to use the volume. For example, a multipath iSCSI volume with three paths (and thus three iSCSI connections to three different servers) still takes 1 unit from `CSINode.status.allocatable.volumes`.
- Maximum capacity per node. Some cloud environments limit both the number of attached volumes (covered in this KEP) and the total capacity of attached volumes (not covered in this KEP). For example, this KEP will ensure that the scheduler puts at most 128 volumes on a [typical GCE node](https://cloud.google.com/compute/docs/machine-types#predefined_machine_types), but it won't ensure that the total capacity of the volumes is less than 64 TB.

## Proposal

* Track volume limits for both in-tree volume plugins and CSI drivers in `CSINode` objects instead of `Node`.

@@ -101,26 +108,25 @@ It is problematic for CSI (`MaxCSIVolumeCountPred`) outlined in [#730](https://g

* Limit for in-tree volumes will be added by kubelet during `CSINode` creation. The name of the corresponding CSI driver will be used as the key in `CSINode.status.allocatable` and will be discovered using the [CSI translation library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib). If the library does not support migration of an in-tree volume plugin, the volume plugin has no limit.
* If a CSI driver is registered for an in-tree volume plugin and it reports a different volume limit than the in-tree volume plugin, the limit reported by the CSI driver is used and kubelet logs a warning.
* User may NOT change `CSINode.status.allocatable` to override volume plugin / CSI driver values, e.g. to "reserve" some attachments for the operating system. Kubelet will periodically reconcile `CSINode` and overwrite the value.
* In particular, `kubelet --kube-reserved` or `--system-reserved` cannot be used to "reserve" volumes for kubelet or the OS. This is not possible with the current kubelet and this KEP does not change it. We expect that CSI drivers will have configuration options / cmdline arguments to reserve some volumes, and they will report their limit already reduced by that reserved amount (see the sketch after this list).
* Kubelet will continue filling `Node.status.allocatable` and `Node.status.capacity` for both in-tree and CSI volumes during the deprecation period. After the deprecation period, it will stop using them completely.
* Scheduler (all its storage predicates) will ignore `Node.status.allocatable` and `Node.status.capacity` if `CSINode.status.allocatable` is present.
* If `CSINode.status.allocatable` (or the whole `CSINode`) is missing, the scheduler falls back to `Node.status.allocatable`. This solves version skew between an old kubelet (using `Node.status`) and a new scheduler.
* After the deprecation period, the scheduler won't schedule any pods that use volumes to a node with a missing `CSINode` instance. It is expected that this happens only during node registration, when `Node` exists and `CSINode` doesn't, and that it self-heals quickly.
* `CSINode.status.allocatable` is an array of limits. The following combinations are possible:

| `Volumes` | Description |
| --------- | ----------- |
| 0 | plugin / CSI driver exists and has zero limit, i.e. can attach no volumes |
| X>0 | plugin / CSI driver exists and can attach X volumes (where X > 0) |
| X<0 | negative values are blocked by validation |
| Driver is missing in `CSINode.status.allocatable` | there is no limit of volumes on the node* |

*) This way we are not able to distinguish between a volume plugin / CSI driver that is not installed on a node and one that has been installed and has no limits. A lookup sketch after the `CSINode` example below illustrates these semantics.

* Predicates modified in this KEP assume that storage provided by an **in-tree** volume plugin is available on all nodes in the cluster. Other predicate(s) will evaluate `PV.spec.nodeAffinity` and filter out nodes that don't have access to the storage.
* For CSI drivers, availability of a CSI driver on a node can be checked in `CSINode.spec`. Its handling is out of scope of this KEP, see non-goals.
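
The following is a minimal sketch of how a CSI driver could implement such a reservation, assuming a hypothetical driver-side `reserved` configuration value (this KEP does not prescribe any driver flags). It uses the standard CSI `NodeGetInfo` call, whose `max_volumes_per_node` field is what kubelet reads:

```go
import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type nodeServer struct {
	nodeID   string
	maxSlots int64 // attachment slots the backend supports, e.g. 40
	reserved int64 // slots reserved for the OS, e.g. 1 for the root disk
}

// NodeGetInfo reports the volume limit already reduced by the reserved
// amount; kubelet copies MaxVolumesPerNode into CSINode.status.allocatable.
func (ns *nodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId:            ns.nodeID,
		MaxVolumesPerNode: ns.maxSlots - ns.reserved, // e.g. 40 - 1 = 39
	}, nil
}
```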

CSINode example:

@@ -133,15 +139,15 @@ spec:

```yaml
status:
  allocatable:
    # AWS node can attach max. 40 volumes, 1 is reserved for the system
    - name: ebs.csi.aws.com
      volumes: 39
```
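
To make the table semantics above concrete, here is a hypothetical scheduler-side helper (not code from this KEP; the `Name` and `Volumes` fields follow the type sketch under "New API" below):

```go
// driverVolumeLimit returns the node's limit for the given CSI driver and
// whether any limit exists. Per the footnote above, a missing entry means
// "no limit": an uninstalled driver and a driver without limits look the same.
func driverVolumeLimit(csiNode *CSINode, driverName string) (limit int64, limited bool) {
	for _, entry := range csiNode.Status.Allocatable {
		if entry.Name == driverName {
			// 0 means the driver can attach no volumes; negative values
			// are rejected by validation and never appear here.
			return entry.Volumes, true
		}
	}
	return 0, false // missing entry: unlimited
}
```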

* Existing scheduler predicates `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` are already deprecated.
* If any of them is enabled together with `MaxCSIVolumeCountPred`, the deprecated predicate will do nothing (`MaxCSIVolumeCountPred` does the job of counting both in-tree and CSI volumes).
* The deprecated predicates will do useful work only when the `MaxCSIVolumeCountPred` predicate is disabled or the `CSINode` object does not have limits for a particular driver.
* This way, we save CPU by running only one volume limit predicate during the deprecation period.

### New API

CSINode gets a `Status` struct with `Allocatable`, holding the limit of volumes for each volume plugin and CSI driver that can be scheduled to the node.

@@ -154,20 +160,27 @@ type CSINode struct {

```go
	Status CSINodeStatus `json:"status" protobuf:"bytes,3,opt,name=status"`
}

// CSINodeStatus holds information about the status of all CSI drivers installed on a node
type CSINodeStatus struct {
	// allocatable is a list of volume limits for each volume plugin and CSI driver on the node.
	// Future proof: max. total size of volumes on the node can be added later
}
```
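
The `Allocatable` field declarations themselves are elided in the diff view above. A minimal sketch of what they could look like, inferred from the `CSINode` YAML example earlier (the type and field names here are assumptions, not the final API):

```go
// CSINodeDriverAllocatable is a hypothetical entry of CSINodeStatus.Allocatable.
type CSINodeDriverAllocatable struct {
	// Name is the CSI driver name (or the CSI driver name an in-tree
	// plugin translates to), e.g. "ebs.csi.aws.com".
	Name string `json:"name" protobuf:"bytes,1,opt,name=name"`

	// Volumes is the maximum number of volumes of this driver that can be
	// used on the node. Must be non-negative; a missing entry means "no limit".
	Volumes int64 `json:"volumes" protobuf:"varint,2,opt,name=volumes"`
}
```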

@@ -183,7 +196,7 @@ type CSINodeStatus struct {

* This KEP depends on the [CSI migration library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib). It can happen that CSI migration is redesigned / cancelled.
  * Countermeasure: [CSI migration](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md) and this KEP should graduate together.
* This KEP depends on the [CSI migration library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib)'s ability to handle in-line in-tree volumes. The scheduler will need to get the CSI driver name + `VolumeHandle` from them to count them towards the limit.

## Design Details

@@ -217,20 +230,59 @@ N/A (`AttachVolumeLimit` feature is already beta).

It must graduate together with CSI migration.

### Upgrade / Downgrade / Version Skew Strategy

During upgrade, downgrade or version skew, kubelet may be older than the scheduler. An older kubelet will not fill `CSINode.status` with volume limits and will fill volume limits into `Node.status` instead. The scheduler must fall back to `Node.status` when `CSINode` is not available or its `status` does not contain a volume plugin / CSI driver.
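
A sketch of this fallback, assuming older kubelets publish limits under the prefixed resource names described in the Motivation section (`driverVolumeLimit` is the hypothetical helper from the Proposal section; none of these names come from the actual scheduler code):

```go
import (
	v1 "k8s.io/api/core/v1"
)

// effectiveVolumeLimit prefers CSINode.status.allocatable and falls back
// to Node.status.allocatable filled by an older kubelet.
func effectiveVolumeLimit(node *v1.Node, csiNode *CSINode, driverName string) (int64, bool) {
	if csiNode != nil {
		if limit, ok := driverVolumeLimit(csiNode, driverName); ok {
			return limit, true
		}
	}
	// Older kubelets publish limits as a prefixed (and, for long driver
	// names, SHA-hashed) resource name in Node.status.allocatable.
	rn := v1.ResourceName("attachable-volumes-csi-" + driverName)
	if q, ok := node.Status.Allocatable[rn]; ok {
		return q.Value(), true
	}
	return 0, false // no limit known for this driver
}
```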

#### Interaction with CSI migration

In-tree volume plugins are being migrated to CSI in a [separate KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md). This section covers the possible situations:

* CSI migration feature is `off` for a volume plugin and the CSI driver is not installed on a node:
  * In-tree volume plugin is used for in-tree PVs.
  * Kubelet gets volume limits from the plugin and creates `CSINode` during node registration. Scheduler and kubelet still need to translate the in-tree plugin name to the CSI driver name to get the right `name` for `CSINode.status.allocatable` (a sketch of this translation follows the list).

* CSI migration feature is `off` for a volume plugin and a CSI driver (for the same storage backend) is installed on a node:
  * In-tree volume plugin is used for in-tree PVs.
  * CSI driver is used for CSI PVs.
  * Kubelet gets volume limits from the plugin and creates `CSINode` during node registration.
  * When the CSI driver is registered, kubelet gets its volume limit through CSI and updates `CSINode`, potentially overwriting the in-tree plugin limit.

* CSI migration feature is `on` for a volume plugin and there is no CSI driver installed:
  * In-tree volume plugin is "off".
  * CSI driver is used both for in-tree and CSI PVs (if the driver is installed).
  * Kubelet creates `CSINode` during node registration with no limit for the volume plugin / CSI driver.
  * When the CSI driver is registered, kubelet gets its volume limit through CSI and updates `CSINode` with the new limit.
  * During the period when no CSI driver is registered and `CSINode` exists, there is "no limit" for the CSI driver and the scheduler can put pods there! See non-goals!

* CSI migration feature is `on` for a volume plugin and the in-tree plugin is removed:
  * Same as above, kubelet creates `CSINode` during node registration with no limit for the volume plugin / CSI driver.
  * When the CSI driver is registered, kubelet gets its volume limit through CSI and updates `CSINode` with the new limit.
  * During the period when no CSI driver is registered and `CSINode` exists, there is "no limit" for the CSI driver and the scheduler can put pods there! See non-goals!

For brevity, in-line volumes in pods are handled the same way as PVs in all cases above.
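
As referenced in the first case above, here is a sketch of the in-tree name translation using the CSI translation library; the exact shape of this API has changed between releases, so treat the signature as an assumption rather than the final kubelet/scheduler code:

```go
import (
	csitranslation "k8s.io/csi-translation-lib"
)

// csiNameForInTreePlugin maps an in-tree plugin name to the CSI driver name
// used as the key (`name`) in CSINode.status.allocatable.
func csiNameForInTreePlugin(inTreePluginName string) (string, error) {
	// e.g. "kubernetes.io/aws-ebs" -> "ebs.csi.aws.com"
	return csitranslation.GetCSINameFromInTreeName(inTreePluginName)
}
```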

#### Interaction with old `AttachVolumeLimit` implementation

Due to version skew, the following situations are possible (the scheduler always has `AttachVolumeLimit` enabled and this KEP implemented):

* Kubelet has `AttachVolumeLimit` off:
  * Scheduler does not see any volume limits in `CSINode` or `Node`.
  * Since `CSINode` is missing, the scheduler falls back to the `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` predicates and schedules in-tree volumes the old way, with hardcoded limits.
  * From the scheduler's point of view, the node can handle any number of CSI volumes.

* Kubelet has the old implementation of `AttachVolumeLimit` and the feature is on (kubelet fills `Node.status.allocatable`):
  * Scheduler does not see any volume limits in `CSINode`.
  * Since `CSINode` limits are missing, the scheduler falls back to the `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` predicates and schedules in-tree volumes the old way.
  * Scheduler falls back to the old implementation of `MaxCSIVolumeCountPred` for CSI volumes and uses limits from `Node.status`.

* Kubelet has the new implementation of `AttachVolumeLimit` and the feature is on (kubelet fills `CSINode`):
  * No issue here, see this KEP.
  * Since `CSINode` is available, the scheduler uses the new implementation of `MaxCSIVolumeCountPred`.

As implied by the above, the scheduler needs to have both the old and the new implementation of `MaxCSIVolumeCountPred` and switch between them based on `CSINode` availability for a particular node, until the old implementation is deprecated and removed (2 releases).