Commit 82cccb8: First review round
1 parent eeeb7b1 commit 82cccb8

1 file changed: +82 -30 lines

keps/sig-storage/20190408-volume-scheduling-limits.md (82 additions & 30 deletions)
- `ResourceName` is limited to 63 characters. We must prefix `ResourceName` with a unique string (such as `attachable-volumes-csi-<driver name>`) so it cannot collide with existing resources like `cpu` or `memory`. But `<driver name>` itself can be up to 63 characters long, so we ended up using SHA sums of the driver name to keep the `ResourceName` unique, which is not user readable.
- A CSI driver cannot share its limits with an in-tree volume plugin, e.g. when running pods with AWS EBS in-tree volumes and the `ebs.csi.aws.com` CSI driver on the same node.
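The 63-character problem can be illustrated with a short sketch. The prefix and the SHA fallback follow the description above; the helper itself is illustrative, not the actual kubelet code.

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// legacyResourceName sketches why the old Node.status-based scheme needed
// SHA sums: resource names are capped at 63 characters, and the prefix
// plus a long CSI driver name easily exceeds that.
func legacyResourceName(driverName string) string {
	name := "attachable-volumes-csi-" + driverName
	if len(name) <= 63 {
		return name
	}
	// Fall back to an unreadable hash to stay unique within 63 characters:
	// 23-character prefix + 40 hex characters of SHA-1 = exactly 63.
	return fmt.Sprintf("attachable-volumes-csi-%x", sha1.Sum([]byte(driverName)))
}

func main() {
	fmt.Println(legacyResourceName("ebs.csi.aws.com"))
	// A long driver name forces the hash fallback.
	long := "very-long-csi-driver-name-that-uses-all-sixty-three-characters"
	fmt.Println(len(legacyResourceName(long)) <= 63)
}
```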

### Goals

- Heterogeneous clusters, i.e. clusters where access to storage is limited only to some nodes. Existing `PV.spec.nodeAffinity` handling, not modified by this KEP, will filter out nodes that don't have access to the storage, so predicates changed in this KEP don't need to worry about storage topology and can be simpler.

- Scheduling based on availability / health of CSI drivers on nodes. `CSINode` can be used to check which nodes have a driver installed, and it could be extended to report health, so the scheduler puts pods only on nodes with an installed and healthy driver. This is out of scope of this KEP.

- Multiple plugins sharing the same volume limits. We expect that every CSI driver will have its own limits, not shared with other CSI drivers. In this KEP we support only in-tree volume plugins, each sharing its limits with one hard-coded CSI driver.

- Multiple "units" per single volume. Each volume used on a node takes exactly 1 unit from `allocatable.volumes`, regardless of the volume size, its replica count, number of connections to remote servers or other underlying resources needed to use the volume. For example, a multipath iSCSI volume with three paths (and thus three iSCSI connections to three different servers) still takes 1 unit from `CSINode.status.allocatable.volumes`.

- Maximum capacity per node. Some cloud environments limit both the number of attached volumes (covered in this KEP) and the total capacity of attached volumes (not covered in this KEP). For example, this KEP will ensure that the scheduler puts max. 128 volumes on a [typical GCE node](https://cloud.google.com/compute/docs/machine-types#predefined_machine_types), but it won't ensure that the total capacity of the volumes is less than 64 TB.
## Proposal

* Track volume limits for both in-tree volume plugins and CSI drivers in `CSINode` objects instead of `Node`.
* Limit for in-tree volumes will be added by kubelet during `CSINode` creation. The name of the corresponding CSI driver will be used as the key in `CSINode.status.allocatable`, and it will be discovered using the [CSI translation library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib). If the library does not support migration of an in-tree volume plugin, the volume plugin has no limit.
* If a CSI driver is registered for an in-tree volume plugin and it reports a different volume limit than the in-tree volume plugin, the limit reported by the CSI driver is used and kubelet logs a warning.
* User may NOT change `CSINode.status.allocatable` to override volume plugin / CSI driver values, e.g. to "reserve" some attachments for the operating system. Kubelet will periodically reconcile `CSINode` and overwrite the value.
* Especially, `kubelet --kube-reserved` or `--system-reserved` cannot be used to "reserve" volumes for kubelet or the OS. It is not possible with the current kubelet and this KEP does not change it. We expect that CSI drivers will have configuration options / cmdline arguments to reserve some volumes and they will report their limit already reduced by that reserved amount.
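The reconciliation above can be sketched as follows. All names here are hypothetical, not kubelet's actual code; the point is that the allocatable list is always recomputed from driver-reported limits, so user edits do not survive.

```go
package main

import "fmt"

// volumeLimit is a hypothetical mirror of one CSINode.status.allocatable entry.
type volumeLimit struct {
	Name    string
	Volumes int64
}

// reconcileAllocatable returns the desired allocatable list, recomputed from
// plugin/driver-reported limits. Any user edits to the current list are
// simply overwritten, so users cannot "reserve" attachments by editing CSINode.
func reconcileAllocatable(reported map[string]int64) []volumeLimit {
	limits := make([]volumeLimit, 0, len(reported))
	for name, v := range reported {
		limits = append(limits, volumeLimit{Name: name, Volumes: v})
	}
	return limits
}

func main() {
	// The driver reports 39 because it already subtracted 1 volume reserved
	// for the system via its own configuration.
	got := reconcileAllocatable(map[string]int64{"ebs.csi.aws.com": 39})
	fmt.Println(got[0].Name, got[0].Volumes)
}
```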

* Kubelet will continue filling `Node.status.allocatable` and `Node.status.capacity` for both in-tree and CSI volumes during the deprecation period. After the deprecation period, it will stop using them completely.
* Scheduler (all its storage predicates) will ignore `Node.status.allocatable` and `Node.status.capacity` if `CSINode.status.allocatable` is present.
* If `CSINode.status.allocatable` (or the whole `CSINode`) is missing, scheduler falls back to `Node.status.allocatable`. This solves version skew between an old kubelet (using `Node.status`) and a new scheduler.
* After the deprecation period, scheduler won't schedule any pods that use volumes to a node with a missing `CSINode` instance. It is expected that this happens only during node registration, when `Node` exists and `CSINode` doesn't, and that it self-heals quickly.

* `CSINode.status.allocatable` is an array of limits. The following combinations are possible:

| `Volumes` | Description |
| --------- | ----------- |
| 0 | plugin / CSI driver exists and has a zero limit, i.e. can attach no volumes |
| X>0 | plugin / CSI driver exists and can attach X volumes (where X > 0) |
| X<0 | negative values are blocked by validation |
| Driver is missing in `CSINode.status.allocatable` | there is no limit of volumes on the node* |

*) This way we cannot distinguish between a volume plugin / CSI driver that is not installed on a node and one that is installed but has no limit.
* Predicates modified in this KEP assume that storage provided by an **in-tree** volume plugin is available on all nodes in the cluster. Other predicate(s) will evaluate `PV.spec.nodeAffinity` and filter out nodes that don't have access to the storage.
* For CSI drivers, availability of a CSI driver on a node can be checked in `CSINode.spec`. Its handling is out of scope of this KEP, see non-goals.

CSINode example:

@@ -133,15 +139,15 @@ spec:
133139
status:
134140
allocatable:
135141
# AWS node can attach max. 40 volumes, 1 is reserved for the system
136-
ebs.csi.aws.com: 39
142+
- name: ebs.csi.aws.com
143+
volumes: 39
137144
```
* Existing scheduler predicates `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` are already deprecated.
* If any of them is enabled together with `MaxCSIVolumeCountPred`, the deprecated predicate will do nothing (`MaxCSIVolumeCountPred` does the job of counting both in-tree and CSI volumes).
* The deprecated predicates will do useful work only when the `MaxCSIVolumeCountPred` predicate is disabled or the `CSINode` object does not have limits for a particular driver.
* This way, we save CPU by running only one volume limit predicate during the deprecation period.

### New API

CSINode gets a `Status` struct with `Allocatable`, holding the limit of volumes for each volume plugin and CSI driver that can be scheduled to the node.
```go
type CSINode struct {
	Status CSINodeStatus `json:"status" protobuf:"bytes,3,opt,name=status"`
}

// CSINodeStatus holds information about the status of all CSI drivers installed on a node
type CSINodeStatus struct {
	// allocatable is a list of volume limits for each volume plugin and CSI driver on the node.
	// +patchMergeKey=name
	// +patchStrategy=merge
	Allocatable []VolumeLimits `json:"allocatable" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,1,rep,name=allocatable"`
}

// VolumeLimits holds the maximum count of volumes for one volume plugin / CSI driver on the node.
// For in-tree volume plugins, the name of the corresponding CSI driver is used.
// Volumes can be either:
// - Positive integer: that's the volume limit.
// - Zero: such volumes cannot be used on the node.
// - An entry missing in Allocatable: there is no volume limit, i.e. any number of volumes can be used on the node.
type VolumeLimits struct {
	// name of the CSI driver. For in-tree volume plugins, the name of the corresponding CSI driver is used.
	Name string `json:"name" protobuf:"bytes,1,opt,name=name"`
	// volumes is the maximum number of volumes provided by the CSI driver that can be used by the node.
	Volumes int64 `json:"volumes,omitempty" protobuf:"varint,2,opt,name=volumes"`

	// Future proof: max. total size of volumes on the node can be added later
}
```
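The core of the volume-counting predicate over this API can be sketched as follows. This is a deliberate simplification, not the scheduler's actual `MaxCSIVolumeCountPred` implementation; the `fits` helper and its parameters are illustrative.

```go
package main

import "fmt"

// VolumeLimits mirrors the proposed API type above (standalone copy for this sketch).
type VolumeLimits struct {
	Name    string
	Volumes int64
}

// fits reports whether a pod needing podVolumes additional volumes of the
// given driver fits on a node that already uses usedVolumes of that driver,
// following the semantics table: a present entry is a hard limit, a missing
// entry means the node has no limit for that driver.
func fits(allocatable []VolumeLimits, driver string, usedVolumes, podVolumes int64) bool {
	for _, l := range allocatable {
		if l.Name == driver {
			return usedVolumes+podVolumes <= l.Volumes
		}
	}
	return true // driver missing in allocatable: no limit on this node
}

func main() {
	alloc := []VolumeLimits{{Name: "ebs.csi.aws.com", Volumes: 39}}
	fmt.Println(fits(alloc, "ebs.csi.aws.com", 38, 1)) // exactly at the limit
	fmt.Println(fits(alloc, "ebs.csi.aws.com", 39, 1)) // over the limit
}
```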

* This KEP depends on the [CSI migration library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib). It can happen that CSI migration is redesigned / cancelled.
* Countermeasure: [CSI migration](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md) and this KEP should graduate together.

* This KEP depends on the [CSI migration library](https://github.com/kubernetes/kubernetes/tree/master/staging/src/k8s.io/csi-translation-lib)'s ability to handle in-line in-tree volumes. Scheduler will need to get the CSI driver name + `VolumeHandle` from them to count them towards the limit.

## Design Details

N/A (`AttachVolumeLimit` feature is already beta).

It must graduate together with CSI migration.

### Upgrade / Downgrade / Version Skew Strategy

During upgrade, downgrade or version skew, kubelet may be older than the scheduler. Such a kubelet will not fill `CSINode.status` with volume limits; it will fill the volume limits into `Node.status`. Scheduler must fall back to `Node.status` when `CSINode` is not available or its `status` does not contain a volume plugin / CSI driver.
#### Interaction with CSI migration

In-tree volume plugins are being migrated to CSI in a [separate KEP](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20190129-csi-migration.md). This section covers the possible situations:

* CSI migration feature is `off` for a volume plugin and the CSI driver is not installed on a node:
  * In-tree volume plugin is used for in-tree PVs.
  * Kubelet gets volume limits from the plugin and creates `CSINode` during node registration. Scheduler and kubelet still need to translate the in-tree plugin name to the CSI driver name to get the right `name` for `CSINode.status.allocatable`.

* CSI migration feature is `off` for a volume plugin and a CSI driver (for the same storage backend) is installed on a node:
  * In-tree volume plugin is used for in-tree PVs.
  * CSI driver is used for CSI PVs.
  * Kubelet gets volume limits from the plugin and creates `CSINode` during node registration.
  * When the CSI driver is registered, kubelet gets its volume limit through CSI and updates `CSINode`, potentially overwriting the in-tree plugin limit.

* CSI migration feature is `on` for a volume plugin and there is no CSI driver installed:
  * In-tree volume plugin is "off".
  * CSI driver is used both for in-tree and CSI PVs (if the driver is installed).
  * Kubelet creates `CSINode` during node registration with no limit for the volume plugin / CSI driver.
  * When the CSI driver is registered, kubelet gets its volume limit through CSI and updates `CSINode` with the new limit.
  * During the period when no CSI driver is registered and `CSINode` exists, there is "no limit" for the CSI driver and the scheduler can put pods there! See non-goals.

* CSI migration feature is `on` for a volume plugin and the in-tree plugin is removed:
  * Same as above, kubelet creates `CSINode` during node registration with no limit for the volume plugin / CSI driver.
  * When the CSI driver is registered, kubelet gets its volume limit through CSI and updates `CSINode` with the new limit.
  * During the period when no CSI driver is registered and `CSINode` exists, there is "no limit" for the CSI driver and the scheduler can put pods there! See non-goals.

For brevity, in-line volumes in pods are handled the same as PVs in all cases above.
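The in-tree-to-CSI name translation used in the scenarios above can be sketched with a small mapping. The mapping entries mirror well-known in-tree plugin / CSI driver pairs, but the function and variable names here are illustrative; the real mapping lives in `k8s.io/csi-translation-lib`.

```go
package main

import "fmt"

// inTreeToCSI maps in-tree volume plugin names to the names of their
// corresponding CSI drivers, as used as keys in CSINode.status.allocatable.
// Illustrative subset only.
var inTreeToCSI = map[string]string{
	"kubernetes.io/aws-ebs":    "ebs.csi.aws.com",
	"kubernetes.io/gce-pd":     "pd.csi.storage.gke.io",
	"kubernetes.io/azure-disk": "disk.csi.azure.com",
	"kubernetes.io/cinder":     "cinder.csi.openstack.org",
}

// csiDriverNameFor returns the CSI driver name for an in-tree plugin.
// If the plugin is not supported by the translation library, the second
// return value is false and the plugin has no limit in CSINode.
func csiDriverNameFor(inTreePlugin string) (string, bool) {
	name, ok := inTreeToCSI[inTreePlugin]
	return name, ok
}

func main() {
	if name, ok := csiDriverNameFor("kubernetes.io/aws-ebs"); ok {
		fmt.Println(name)
	}
}
```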
#### Interaction with old `AttachVolumeLimit` implementation

Due to version skew, the following situations are possible (the scheduler always has `AttachVolumeLimit` enabled and this KEP implemented):

* Kubelet has `AttachVolumeLimit` off:
  * Scheduler does not see any volume limits in `CSINode` nor `Node`.
  * Since `CSINode` is missing, scheduler falls back to the `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` predicates and schedules in-tree volumes the old way, with hardcoded limits.
  * From the scheduler's point of view, the node can handle any number of CSI volumes.

* Kubelet has the old implementation of `AttachVolumeLimit` and the feature is on (kubelet fills `Node.status.allocatable`):
  * Scheduler does not see any volume limits in `CSINode`.
  * Since `CSINode` is missing, scheduler falls back to the `MaxEBSVolumeCount`, `MaxGCEPDVolumeCount`, `MaxAzureDiskVolumeCount` and `MaxCinderVolumeCount` predicates and schedules in-tree volumes the old way.
  * Scheduler falls back to the old implementation of `MaxCSIVolumeCountPred` for CSI volumes and uses limits from `Node.status`.

* Kubelet has the new implementation of `AttachVolumeLimit` and the feature is on (kubelet fills `CSINode`):
  * No issue here, see this KEP.
  * Since `CSINode` is available, scheduler uses the new implementation of `MaxCSIVolumeCountPred`.

As implied by the above, the scheduler needs to have both the old and the new implementation of `MaxCSIVolumeCountPred` and switch between them based on `CSINode` availability for a particular node, until the old implementation is deprecated and removed (2 releases).
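The switch between the two implementations can be sketched as a per-node lookup with fallback; the `nodeInfo` type and `volumeLimit` function are hypothetical simplifications, not the scheduler's actual code.

```go
package main

import "fmt"

// nodeInfo is a hypothetical, simplified view of the data the scheduler
// has for one node.
type nodeInfo struct {
	csiNodeLimits   map[string]int64 // from CSINode.status.allocatable; nil if CSINode is missing
	nodeStatusLimit int64            // old-style limit from Node.status.allocatable; 0 if unset
}

// volumeLimit returns the limit the scheduler should use for a driver on
// this node: the new CSINode-based limit when CSINode is present, otherwise
// the old Node.status-based limit (the version-skew fallback described above).
func volumeLimit(n nodeInfo, driver string) (int64, bool) {
	if n.csiNodeLimits != nil {
		l, ok := n.csiNodeLimits[driver]
		return l, ok // missing entry: no limit for this driver on this node
	}
	if n.nodeStatusLimit > 0 {
		return n.nodeStatusLimit, true // old implementation's data
	}
	return 0, false // no limit known at all
}

func main() {
	newNode := nodeInfo{csiNodeLimits: map[string]int64{"ebs.csi.aws.com": 39}}
	oldNode := nodeInfo{nodeStatusLimit: 40}
	l, _ := volumeLimit(newNode, "ebs.csi.aws.com")
	fmt.Println(l) // taken from CSINode
	l, _ = volumeLimit(oldNode, "ebs.csi.aws.com")
	fmt.Println(l) // fallback to Node.status
}
```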
## Implementation History

# Alternatives
