Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
16 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,9 @@ authors:
- "@bskiba"
- "@schylek"
- "@vinaykul"
owning-sig: sig-autoscaling
owning-sig: sig-node
participating-sigs:
- sig-node
- sig-autoscaling
- sig-scheduling
reviewers:
- "@bsalamat"
Expand All @@ -23,9 +23,10 @@ approvers:
- "@mwielgus"
editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
last-updated: 2020-01-14
status: implementable
see-also:
- "/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md"
replaces:
superseded-by:
---
Expand All @@ -48,11 +49,21 @@ superseded-by:
- [Scheduler and API Server Interaction](#scheduler-and-api-server-interaction)
- [Flow Control](#flow-control)
- [Container resource limit update ordering](#container-resource-limit-update-ordering)
- [Container resource limit update failure handling](#container-resource-limit-update-failure-handling)
- [Notes](#notes)
- [Affected Components](#affected-components)
- [Future Enhancements](#future-enhancements)
- [Risks and Mitigations](#risks-and-mitigations)
- [Test Plan](#test-plan)
- [Unit Tests](#unit-tests)
- [Pod Resize E2E Tests](#pod-resize-e2e-tests)
- [Resource Quota and Limit Ranges](#resource-quota-and-limit-ranges)
- [Resize Policy Tests](#resize-policy-tests)
- [Backward Compatibility and Negative Tests](#backward-compatibility-and-negative-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Stable](#stable)
- [Implementation History](#implementation-history)
<!-- /toc -->

Expand Down Expand Up @@ -134,15 +145,19 @@ Thanks to the above:
v1.ResourceRequirements) shows the **actual** resources held by the Pod and
its Containers.

A new Pod subresource named 'resourceallocation' is introduced to allow
fine-grained access control that enables Kubelet to set or update resources
allocated to a Pod, and prevents the user or any other component from changing
the allocated resources.
A new admission controller named 'PodResourceAllocation' is introduced in order
to limit access to ResourcesAllocated field such that only Kubelet can update
this field.

Additionally, Kubelet is authorized to update PodSpec, and NodeRestriction
admission plugin is extended to limit Kubelet's update access only to Pod's
ResourcesAllocated field for CPU and memory resources.

#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with
ResizePolicy map (new object) for each resource type (CPU, memory):
ResizePolicy - a list of named subobjects (new object) that supports 'cpu'
and 'memory' as names. It supports the following policy values:
* NoRestart - the default value; resize Container without restarting it,
* RestartContainer - restart the Container in-place to apply new resource
values. (e.g. Java process needs to change its Xmx flag)
Expand All @@ -167,6 +182,13 @@ Kubelet calls UpdateContainerResources CRI API which currently takes
but not for Windows. This parameter changes to *runtimeapi.ContainerResources*,
that is runtime agnostic, and will contain platform-specific information.

Additionally, ContainerStatus CRI API is extended to hold
*runtimeapi.ContainerResources* so that it allows Kubelet to query Container's
CPU and memory limit configurations from runtime.

These CRI changes are a separate effort that does not affect the design
proposed in this KEP.

### Kubelet and API Server Interaction

When a new Pod is created, Scheduler is responsible for selecting a suitable
Expand All @@ -185,11 +207,10 @@ resources allocated (Pod.Spec.Containers[i].ResourcesAllocated) for all Pods in
the Node, except the Pod being resized. For the Pod being resized, it adds the
new desired resources (i.e Spec.Containers[i].Resources.Requests) to the sum.
* If new desired resources fit, Kubelet accepts the resize by updating
Pod.Spec.Containers[i].ResourcesAllocated via pods/resourceallocation
subresource, and then proceeds to invoke UpdateContainerResources CRI API
to update the Container resource limits. Once all Containers are successfully
updated, it updates Pod.Status.ContainerStatuses[i].Resources to reflect the
new resource values.
Pod.Spec.Containers[i].ResourcesAllocated, and then proceeds to invoke
UpdateContainerResources CRI API to update Container resource limits. Once
all Containers are successfully updated, it updates
Pod.Status.ContainerStatuses[i].Resources to reflect new resource values.
* If new desired resources don't fit, Kubelet rejects the resize, and no
further action is taken.
- Kubelet retries the Pod resize at a later time.
Expand Down Expand Up @@ -234,10 +255,9 @@ Pod with ResizePolicy set to NoRestart for all its Containers.
resources to determine if the new desired Resources fit the Node.
* _Case 1_: Kubelet finds new desired Resources fit. It accepts the resize
and sets Spec.Containers[i].ResourcesAllocated equal to the values of
Spec.Containers[i].Resources.Requests by invoking resourceallocation
subresource. It then applies the new cgroup limits to the Pod and its
Containers, and once successfully done, sets Pod's
Status.ContainerStatuses[i].Resources to reflect the desired resources.
Spec.Containers[i].Resources.Requests. It then applies the new cgroup
limits to the Pod and its Containers, and once successfully done, sets
Pod's Status.ContainerStatuses[i].Resources to reflect desired resources.
- If at the same time, a new Pod was assigned to this Node against the
capacity taken up by this resource resize, that new Pod is rejected by
Kubelet during admission if Node has no more room.
Expand Down Expand Up @@ -283,6 +303,16 @@ updates resource limit for the Pod and its Containers in the following manner:
In all the above cases, Kubelet applies Container resource limit decreases
before applying limit increases.

#### Container resource limit update failure handling

If multiple Containers in a Pod are being updated, and UpdateContainerResources
CRI API fails for any of the containers, Kubelet will backoff and retry at a
later time. Kubelet does not attempt to update limits for containers that are
lined up for update after the failing container. This ensures that sum of the
container limits does not exceed Pod-level cgroup limit at any point. Once all
the container limits have been successfully updated, Kubelet updates the Pod's
Status.ContainerStatuses[i].Resources to match the desired limit values.

#### Notes

* If CPU Manager policy for a Node is set to 'static', then only integral
Expand All @@ -309,13 +339,20 @@ before applying limit increases.

Pod v1 core API:
* extended model,
* new subresource,
* added validation.
* modify RBAC bootstrap policy authorizing Node to update PodSpec,
* extend NodeRestriction plugin limiting Node's update access to PodSpec only
to the ResourcesAllocated field,
* new admission controller to limit update access to ResourcesAllocated field
only to Node, and mutates any updates to ResourcesAllocated & ResizePolicy
fields to maintain compatibility with older versions of clients,
* added validation allowing only CPU and memory resource changes,
* setting defaults for ResourcesAllocated and ResizePolicy fields.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flow described above requires kubelets to update pod spec, which they do not have permission to do today.

That involves changing:

  • the node authorizer to permit kubelets to patch/update the pods resource (not just the pods/status subresource)
  • the NodeRestriction admission plugin to understand what types of updates a kubelet is allowed to make to a pod (we would not want to allow arbitrary label/image updates, for example)

cc @tallclair

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

From what I can see, the simplest way is to introduce admitPodUpdate method to NodeRestriction plugin that verifies only ResourcesAllocated field is being touched, and that node updating the pod owns the pod. I'll try it out and see if that covers it without leaving any holes.

For authorization, I have modified NodeRules() in plugin/pkg/auth/authorizer/rbac/bootstrappolicy/policy.go to allow nodes to update the pod resource. (They are allowed to create and delete pods at this time).

The above approach is consistent with how pod creates and deletes by node are handled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the proposal is scoped to support only cpu and memory resources, is kubelet only authorized to change those values? I am assuming that we would want the kubelet to report all resources allocated and enforced (not just cpu and memory), but we would not want to let a user change the pod spec in validation for anything other than cpu and memory? Is that an accurate understanding?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the alternative is that the pod admission plugin sets allocated for all resources other than cpu/memory, but that would make extending this support to other future resource types challenging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kubelet can continue to report status on all resources. I don't see a need to restrict status changes for the new Resources field. However, for ResourcesAllocated field, it is best to start with allowing Node to change what's actually supported now. As we add support for other resource types, we can just add to the list of supported resource types in the admission plugin.

And yes, for the user, we definitely want to lock it down to just what's supported - cpu and memory

* for ResourceQuota, podEvaluator.Handler implementation is modified to allow
Pod updates, and verify that sum of Pod.Spec.Containers[i].Resources for all
Pods in the Namespace don't exceed quota,
* PodResourceAllocation admission plugin is ordered before ResourceQuota.
* for LimitRanger we check that a resize request does not violate the min and
max limits specified in LimitRange for the Pod's namespace.

Expand All @@ -328,9 +365,6 @@ Kubelet:
Scheduler:
* compute resource allocations using Pod.Spec.Containers[i].ResourcesAllocated.

Controllers:
* propagate Template resources update to running Pod instances.

Other components:
* check how the change of meaning of resource requests influence other
Kubernetes components.
Expand All @@ -347,6 +381,8 @@ Other components:
1. Extend Node Information API to report the CPU Manager policy for the Node,
and enable validation of integral CPU resize for nodes with 'static' CPU
Manager policy.
1. Extend controllers (Job, Deployment, etc) to propagate Template resources
update to running Pods.
1. Allow resizing local ephemeral storage.
1. Allow resource limits to be updated (VPA feature).

Expand All @@ -362,14 +398,148 @@ Other components:
1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.
1. Older client versions: Previous versions of clients that are unaware of the
new ResourcesAllocated and ResizePolicy fields would set them to nil. To
keep compatibility, PodResourceAllocation admission controller mutates such
an update by copying non-nil values from the old Pod to current Pod.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this proposed adding a new field to pod spec, we need to consider the following cases:

  • updates by clients unaware of the new field, which preserve it and send existing the existing value (e.g. dynamic clients using unstructured json requests/responses, or clients using patch)
    • since those clients would not currently be successfully changing resources, there's probably nothing special that needs to be done for these clients
  • updated by clients unaware of the new field, which drop it on update (e.g. old versions of client-go)
    • an update request from a client like this would set the new field to nil. The server must not treat that as an attempt by the client to clear the field (and forbid based on an authorization check, etc), but must maintain compatibility with existing clients by copying the value from the existing pod

Since the ResourcesAllocated field is in pod spec, and pod spec is also used inside pod templates, are we intending to allow/disallow this field to be set inside workload API types (e.g. daemonset, deployment)? Unless we actively prevent it, values can be set for that field in those types, and we have to think through how to handle updates from old clients for those types as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Controllers: propagate Template resources update to running Pod instances, has that been investigated and proven feasible? There are multiple mechanisms controllers use to match up particular child resources with particular generations of the parent resource, and it would be good to know if some (like hashing of the pod template to determine a label for the child resource's selector) are incompatible with in-place update of the pod template without rolling out a new instance of the child.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Pardon my ignorance with admission controllers, I've just started playing with it a few weeks ago. But I believe I should be able to mutate it with the new PodResourceAllocation controller - I'll look deeper into this. Is there a wiki that I can use to experiment with upgrade?

About controllers, we had the propagation working with Job and Deployment controllers in our old design prototype code. But I'll remove this from the scope of the current KEP - VPA cares about updating running pods, and I don't want to commit to it as I need to budget for a few surprises as I do a thorough implementation of the admission control changes and handle upgrade scenario. So we will disallow updating template nested pods. This can always be added as a subsequent enhancement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt I dug a bit more into updating controller templates. Currently, we cannot update Resources field for Job controllers, but allowed to do so for Deployment controllers - it results in Pods being recreated with the new desired resources.

I want to keep the same behavior - if we attempted to disallow it because of this feature, it would be a breaking change.

In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.

Copy link
Member

@liggitt liggitt Jan 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.

If vertical scaling is only done on individual pod instances, that means a new rollout of a deployment will reset all resource use back to the original levels? Is that acceptable? That seems likely to cause problems if current pods were scaled up in response to load, then a rollout drops capacity back down significantly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?

Yes. Current VPA behavior is to make resource recommendations based on historical measurements and current usage, and optionally apply those recommendations during admission control if the user chooses to allow VPA to control the resources. New recommendations are currently applied by evicting the current pod so that it hits the admission controller.

At this time, we want to keep the current behavior aside from the added ability for VPA to request a pod to be resized without restart.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pod admission mutation makes sense as long as that happens prior to quota evaluation.

btw, i appreciate this additional detail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt I'm able to take care of updates from older client-go versions by setting default values on create, and copying old object values on update by handling it in admission controller mutating phase rather than defaults.go. Doing this in defaults.go would attempt to set the values that were dropped by older client-go to default values and this we would lose data.

I was able to test this out by writing a little tool similar to staging/src/k8s.io/client-go/examples/create-update-delete-deployment, but one that calls Pods(ns).Update()

Validation allows Resources and ResourcesAllocated fields to be mutable only for PodSpec, and podresourceallocation and noderestriction plugins handle what user can do and what node can update.

Please review PR vinaykul/kubernetes#1

## Test Plan

### Unit Tests

Unit tests will cover the sanity of code changes that implements the feature,
and the policy controls that are introduced as part of this feature.

### Pod Resize E2E Tests

End-to-End tests resize a Pod via PATCH to Pod's Spec.Containers[i].Resources.
The e2e tests use docker as container runtime.
- Resizing of Requests are verified by querying the values in Pod's
Spec.Containers[i].ResourcesAllocated field.
- Resizing of Limits are verified by querying the cgroup limits of the Pod's
containers.

E2E test cases for Guaranteed class Pod with one container:
1. Increase, decrease Requests & Limits for CPU only.
1. Increase, decrease Requests & Limits for memory only.
1. Increase, decrease Requests & Limits for CPU and memory.
1. Increase CPU and decrease memory.
1. Decrease CPU and increase memory.

E2E test cases for Burstable class single container Pod that specifies
both CPU & memory:
1. Increase, decrease Requests - CPU only.
1. Increase, decrease Requests - memory only.
1. Increase, decrease Requests - both CPU & memory.
1. Increase, decrease Limits - CPU only.
1. Increase, decrease Limits - memory only.
1. Increase, decrease Limits - both CPU & memory.
1. Increase, decrease Requests & Limits - CPU only.
1. Increase, decrease Requests & Limits - memory only.
1. Increase, decrease Requests & Limits - both CPU and memory.
1. Increase CPU (Requests+Limits) & decrease memory(Requests+Limits).
1. Decrease CPU (Requests+Limits) & increase memory(Requests+Limits).
1. Increase CPU Requests while decreasing CPU Limits.
1. Decrease CPU Requests while increasing CPU Limits.
1. Increase memory Requests while decreasing memory Limits.
1. Decrease memory Requests while increasing memory Limits.
1. CPU: increase Requests, decrease Limits, Memory: increase Requests, decrease Limits.
1. CPU: decrease Requests, increase Limits, Memory: decrease Requests, increase Limits.

E2E tests for Burstable class single container Pod that specifies CPU only:
1. Increase, decrease CPU - Requests only.
1. Increase, decrease CPU - Limits only.
1. Increase, decrease CPU - both Requests & Limits.

E2E tests for Burstable class single container Pod that specifies memory only:
1. Increase, decrease memory - Requests only.
1. Increase, decrease memory - Limits only.
1. Increase, decrease memory - both Requests & Limits.

E2E tests for Guaranteed class Pod with three containers (c1, c2, c3):
1. Increase CPU & memory for all three containers.
1. Decrease CPU & memory for all three containers.
1. Increase CPU, decrease memory for all three containers.
1. Decrease CPU, increase memory for all three containers.
1. Increase CPU for c1, decrease c2, c3 unchanged - no net CPU change.
1. Increase memory for c1, decrease c2, c3 unchanged - no net memory change.
1. Increase CPU for c1, decrease c2 & c3 - net CPU decrease for Pod.
1. Increase memory for c1, decrease c2 & c3 - net memory decrease for Pod.
1. Increase CPU for c1 & c3, decrease c2 - net CPU increase for Pod.
1. Increase memory for c1 & c3, decrease c2 - net memory increase for Pod.

### Resource Quota and Limit Ranges

Setup a namespace with ResourceQuota and a single, valid Pod.
1. Resize the Pod within resource quota - CPU only.
1. Resize the Pod within resource quota - memory only.
1. Resize the Pod within resource quota - both CPU and memory.
1. Resize the Pod to exceed resource quota - CPU only.
1. Resize the Pod to exceed resource quota - memory only.
1. Resize the Pod to exceed resource quota - both CPU and memory.

Setup a namespace with min and max LimitRange and create a single, valid Pod.
1. Increase, decrease CPU within min/max bounds.
1. Increase CPU to exceed max value.
1. Decrease CPU to go below min value.
1. Increase memory to exceed max value.
1. Decrease memory to go below min value.

### Resize Policy Tests

Setup a guaranteed class Pod with two containers (c1 & c2).
1. No resize policy specified, defaults to NoRestart. Verify that CPU and
memory are resized without restarting containers.
1. NoRestart (cpu, memory) policy for c1, RestartContainer (cpu, memory) for c2.
Verify that c1 is resized without restart, c2 is restarted on resize.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU only,
verify container is resized without restart.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 memory only,
verify container is resized with restart.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU & memory,
verify container is resized with restart.

### Backward Compatibility and Negative Tests

1. Verify that Node is allowed to update only a Pod's ResourcesAllocated field.
1. Verify that only Node account is allowed to udate ResourcesAllocated field.
1. Verify that updating Pod Resources in workload template spec retains current
behavior:
- Updating Pod Resources in Job template is not allowed.
- Updating Pod Resources in Deployment template continues to result in Pod
being restarted with updated resources.
1. Verify Pod updates by older version of client-go doesn't result in current
values of ResourcesAllocated and ResizePolicy fields being dropped.
1. Verify that only CPU and memory resources are mutable by user.

TODO: Identify more cases

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this touches a field involved in pod spec, pod template spec, and workload controllers, we need tests to make sure introduction of this does not cause workloads to redeploy on API server upgrade (e.g. kubernetes/kubernetes#78633); tests that look something like what is described in kubernetes/kubernetes#78904, and which are actually run

## Graduation Criteria

TODO
### Alpha
- In-Place Pod Resouces Update functionality is implemented for running Pods,
- LimitRanger and ResourceQuota handling are added,
- Resize Policies functionality is implemented,
- Unit tests and E2E tests covering basic functionality are added,
- E2E tests covering multiple containers are added.

### Beta
- VPA alpha integration of feature completed and any bugs addressed,
- E2E tests covering Resize Policy, LimitRanger, and ResourceQuota are added,
- Negative tests are identified and added.

### Stable
- VPA integration of feature moved to beta,
- User feedback (ideally from atleast two distinct users) is green,
- No major bugs reported for three months.

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended
- 2019-03-07 - changes to flow control, updates per review feedback
- 2019-08-29 - updated design proposal
- 2019-10-25 - update key open items and move KEP to implementable
- 2020-01-06 - API review suggested changes incorporated
- 2020-01-13 - Test plan and graduation criteria added
- 2020-01-21 - Graduation criteria updated per review feedback
Loading