diff --git a/keps/sig-autoscaling/20181106-in-place-update-of-pod-resources.md b/keps/sig-node/20181106-in-place-update-of-pod-resources.md similarity index 62% rename from keps/sig-autoscaling/20181106-in-place-update-of-pod-resources.md rename to keps/sig-node/20181106-in-place-update-of-pod-resources.md index 663e38fe368..b3016943a22 100644 --- a/keps/sig-autoscaling/20181106-in-place-update-of-pod-resources.md +++ b/keps/sig-node/20181106-in-place-update-of-pod-resources.md @@ -5,9 +5,9 @@ authors: - "@bskiba" - "@schylek" - "@vinaykul" -owning-sig: sig-autoscaling +owning-sig: sig-node participating-sigs: - - sig-node + - sig-autoscaling - sig-scheduling reviewers: - "@bsalamat" @@ -23,9 +23,10 @@ approvers: - "@mwielgus" editor: TBD creation-date: 2018-11-06 -last-updated: 2018-11-06 -status: provisional +last-updated: 2020-01-14 +status: implementable see-also: + - "/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md" replaces: superseded-by: --- @@ -48,11 +49,21 @@ superseded-by: - [Scheduler and API Server Interaction](#scheduler-and-api-server-interaction) - [Flow Control](#flow-control) - [Container resource limit update ordering](#container-resource-limit-update-ordering) + - [Container resource limit update failure handling](#container-resource-limit-update-failure-handling) - [Notes](#notes) - [Affected Components](#affected-components) - [Future Enhancements](#future-enhancements) - [Risks and Mitigations](#risks-and-mitigations) +- [Test Plan](#test-plan) + - [Unit Tests](#unit-tests) + - [Pod Resize E2E Tests](#pod-resize-e2e-tests) + - [Resource Quota and Limit Ranges](#resource-quota-and-limit-ranges) + - [Resize Policy Tests](#resize-policy-tests) + - [Backward Compatibility and Negative Tests](#backward-compatibility-and-negative-tests) - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [Stable](#stable) - [Implementation History](#implementation-history) @@ -134,15 +145,19 @@ Thanks to 
the above: v1.ResourceRequirements) shows the **actual** resources held by the Pod and its Containers. -A new Pod subresource named 'resourceallocation' is introduced to allow -fine-grained access control that enables Kubelet to set or update resources -allocated to a Pod, and prevents the user or any other component from changing -the allocated resources. +A new admission controller named 'PodResourceAllocation' is introduced in order +to limit access to ResourcesAllocated field such that only Kubelet can update +this field. + +Additionally, Kubelet is authorized to update PodSpec, and NodeRestriction +admission plugin is extended to limit Kubelet's update access only to Pod's +ResourcesAllocated field for CPU and memory resources. #### Container Resize Policy To provide fine-grained user control, PodSpec.Containers is extended with -ResizePolicy map (new object) for each resource type (CPU, memory): +ResizePolicy - a list of named subobjects (new object) that supports 'cpu' +and 'memory' as names. It supports the following policy values: * NoRestart - the default value; resize Container without restarting it, * RestartContainer - restart the Container in-place to apply new resource values. (e.g. Java process needs to change its Xmx flag) @@ -167,6 +182,13 @@ Kubelet calls UpdateContainerResources CRI API which currently takes but not for Windows. This parameter changes to *runtimeapi.ContainerResources*, that is runtime agnostic, and will contain platform-specific information. +Additionally, ContainerStatus CRI API is extended to hold +*runtimeapi.ContainerResources* so that it allows Kubelet to query Container's +CPU and memory limit configurations from runtime. + +These CRI changes are a separate effort that does not affect the design +proposed in this KEP. 
+ ### Kubelet and API Server Interaction When a new Pod is created, Scheduler is responsible for selecting a suitable @@ -185,11 +207,10 @@ resources allocated (Pod.Spec.Containers[i].ResourcesAllocated) for all Pods in the Node, except the Pod being resized. For the Pod being resized, it adds the new desired resources (i.e Spec.Containers[i].Resources.Requests) to the sum. * If new desired resources fit, Kubelet accepts the resize by updating - Pod.Spec.Containers[i].ResourcesAllocated via pods/resourceallocation - subresource, and then proceeds to invoke UpdateContainerResources CRI API - to update the Container resource limits. Once all Containers are successfully - updated, it updates Pod.Status.ContainerStatuses[i].Resources to reflect the - new resource values. + Pod.Spec.Containers[i].ResourcesAllocated, and then proceeds to invoke + UpdateContainerResources CRI API to update Container resource limits. Once + all Containers are successfully updated, it updates + Pod.Status.ContainerStatuses[i].Resources to reflect new resource values. * If new desired resources don't fit, Kubelet rejects the resize, and no further action is taken. - Kubelet retries the Pod resize at a later time. @@ -234,10 +255,9 @@ Pod with ResizePolicy set to NoRestart for all its Containers. resources to determine if the new desired Resources fit the Node. * _Case 1_: Kubelet finds new desired Resources fit. It accepts the resize and sets Spec.Containers[i].ResourcesAllocated equal to the values of - Spec.Containers[i].Resources.Requests by invoking resourceallocation - subresource. It then applies the new cgroup limits to the Pod and its - Containers, and once successfully done, sets Pod's - Status.ContainerStatuses[i].Resources to reflect the desired resources. + Spec.Containers[i].Resources.Requests. It then applies the new cgroup + limits to the Pod and its Containers, and once successfully done, sets + Pod's Status.ContainerStatuses[i].Resources to reflect desired resources. 
  - If at the same time, a new Pod was assigned to this Node against the
    capacity taken up by this resource resize, that new Pod is rejected by
    Kubelet during admission if Node has no more room.
@@ -283,6 +303,16 @@ updates resource limit for the Pod and its Containers in the following manner:
 In all the above cases, Kubelet applies Container resource limit decreases
 before applying limit increases.

+#### Container resource limit update failure handling
+
+If multiple Containers in a Pod are being updated, and the
+UpdateContainerResources CRI API fails for any of the containers, Kubelet will
+back off and retry at a later time. Kubelet does not attempt to update limits
+for containers that are lined up for update after the failing container. This
+ensures that the sum of the container limits does not exceed the Pod-level
+cgroup limit at any point. Once all the container limits have been
+successfully updated, Kubelet updates the Pod's
+Status.ContainerStatuses[i].Resources to match the desired limit values.
+
 #### Notes

 * If CPU Manager policy for a Node is set to 'static', then only integral
@@ -309,13 +339,20 @@ before applying limit increases.

 Pod v1 core API:
 * extended model,
-* new subresource,
-* added validation.
+* modify RBAC bootstrap policy authorizing Node to update PodSpec,
+* extend NodeRestriction plugin limiting Node's update access to PodSpec only
+  to the ResourcesAllocated field,
+* new admission controller that limits update access to the ResourcesAllocated
+  field to Node only, and mutates any updates to the ResourcesAllocated &
+  ResizePolicy fields to maintain compatibility with older versions of clients,
+* added validation allowing only CPU and memory resource changes,
+* setting defaults for ResourcesAllocated and ResizePolicy fields.
 Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
 * for ResourceQuota, podEvaluator.Handler implementation is modified to allow
   Pod updates, and verify that the sum of Pod.Spec.Containers[i].Resources for
   all Pods in the Namespace doesn't exceed quota,
+* PodResourceAllocation admission plugin is ordered before ResourceQuota,
 * for LimitRanger we check that a resize request does not violate the min and
   max limits specified in LimitRange for the Pod's namespace.
@@ -328,9 +365,6 @@ Kubelet:

 Scheduler:
 * compute resource allocations using Pod.Spec.Containers[i].ResourcesAllocated.

-Controllers:
-* propagate Template resources update to running Pod instances.
-
 Other components:
 * check how the change of meaning of resource requests influences other
   Kubernetes components.
@@ -347,6 +381,8 @@ Other components:
 1. Extend Node Information API to report the CPU Manager policy for the Node,
    and enable validation of integral CPU resize for nodes with 'static' CPU
    Manager policy.
+1. Extend controllers (Job, Deployment, etc.) to propagate Template resources
+   update to running Pods.
 1. Allow resizing local ephemeral storage.
 1. Allow resource limits to be updated (VPA feature).
@@ -362,10 +398,140 @@ Other components:
 1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
    could be in use, and approaches such as setting limit near current usage may
    be required. This issue needs further investigation.
+1. Older client versions: Previous versions of clients that are unaware of the
+   new ResourcesAllocated and ResizePolicy fields would set them to nil. To
+   keep compatibility, the PodResourceAllocation admission controller mutates
+   such an update by copying non-nil values from the old Pod to the current Pod.
+
+## Test Plan
+
+### Unit Tests
+
+Unit tests will cover the sanity of the code changes that implement the
+feature, and the policy controls that are introduced as part of this feature.
+
+### Pod Resize E2E Tests
+
+End-to-End tests resize a Pod via PATCH to Pod's Spec.Containers[i].Resources.
+The e2e tests use docker as container runtime.
+  - Resizing of Requests is verified by querying the values in Pod's
+    Spec.Containers[i].ResourcesAllocated field.
+  - Resizing of Limits is verified by querying the cgroup limits of the Pod's
+    containers.
+
+E2E test cases for Guaranteed class Pod with one container:
+1. Increase, decrease Requests & Limits for CPU only.
+1. Increase, decrease Requests & Limits for memory only.
+1. Increase, decrease Requests & Limits for CPU and memory.
+1. Increase CPU and decrease memory.
+1. Decrease CPU and increase memory.
+
+E2E test cases for Burstable class single container Pod that specifies
+both CPU & memory:
+1. Increase, decrease Requests - CPU only.
+1. Increase, decrease Requests - memory only.
+1. Increase, decrease Requests - both CPU & memory.
+1. Increase, decrease Limits - CPU only.
+1. Increase, decrease Limits - memory only.
+1. Increase, decrease Limits - both CPU & memory.
+1. Increase, decrease Requests & Limits - CPU only.
+1. Increase, decrease Requests & Limits - memory only.
+1. Increase, decrease Requests & Limits - both CPU and memory.
+1. Increase CPU (Requests+Limits) & decrease memory (Requests+Limits).
+1. Decrease CPU (Requests+Limits) & increase memory (Requests+Limits).
+1. Increase CPU Requests while decreasing CPU Limits.
+1. Decrease CPU Requests while increasing CPU Limits.
+1. Increase memory Requests while decreasing memory Limits.
+1. Decrease memory Requests while increasing memory Limits.
+1. CPU: increase Requests, decrease Limits; Memory: increase Requests, decrease Limits.
+1. CPU: decrease Requests, increase Limits; Memory: decrease Requests, increase Limits.
+
+E2E tests for Burstable class single container Pod that specifies CPU only:
+1. Increase, decrease CPU - Requests only.
+1. Increase, decrease CPU - Limits only.
+1. Increase, decrease CPU - both Requests & Limits.
+
+E2E tests for Burstable class single container Pod that specifies memory only:
+1. Increase, decrease memory - Requests only.
+1. Increase, decrease memory - Limits only.
+1. Increase, decrease memory - both Requests & Limits.
+
+E2E tests for Guaranteed class Pod with three containers (c1, c2, c3):
+1. Increase CPU & memory for all three containers.
+1. Decrease CPU & memory for all three containers.
+1. Increase CPU, decrease memory for all three containers.
+1. Decrease CPU, increase memory for all three containers.
+1. Increase CPU for c1, decrease c2, c3 unchanged - no net CPU change.
+1. Increase memory for c1, decrease c2, c3 unchanged - no net memory change.
+1. Increase CPU for c1, decrease c2 & c3 - net CPU decrease for Pod.
+1. Increase memory for c1, decrease c2 & c3 - net memory decrease for Pod.
+1. Increase CPU for c1 & c3, decrease c2 - net CPU increase for Pod.
+1. Increase memory for c1 & c3, decrease c2 - net memory increase for Pod.
+
+### Resource Quota and Limit Ranges
+
+Set up a namespace with ResourceQuota and a single, valid Pod.
+1. Resize the Pod within resource quota - CPU only.
+1. Resize the Pod within resource quota - memory only.
+1. Resize the Pod within resource quota - both CPU and memory.
+1. Resize the Pod to exceed resource quota - CPU only.
+1. Resize the Pod to exceed resource quota - memory only.
+1. Resize the Pod to exceed resource quota - both CPU and memory.
+
+Set up a namespace with min and max LimitRange and create a single, valid Pod.
+1. Increase, decrease CPU within min/max bounds.
+1. Increase CPU to exceed max value.
+1. Decrease CPU to go below min value.
+1. Increase memory to exceed max value.
+1. Decrease memory to go below min value.
+
+### Resize Policy Tests
+
+Set up a guaranteed class Pod with two containers (c1 & c2).
+1. No resize policy specified, defaults to NoRestart. Verify that CPU and
+   memory are resized without restarting containers.
+1.
NoRestart (cpu, memory) policy for c1, RestartContainer (cpu, memory) for c2.
+   Verify that c1 is resized without restart, c2 is restarted on resize.
+1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU only,
+   verify container is resized without restart.
+1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 memory only,
+   verify container is resized with restart.
+1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU & memory,
+   verify container is resized with restart.
+
+### Backward Compatibility and Negative Tests
+
+1. Verify that Node is allowed to update only a Pod's ResourcesAllocated field.
+1. Verify that only the Node account is allowed to update the
+   ResourcesAllocated field.
+1. Verify that updating Pod Resources in workload template spec retains current
+   behavior:
+   - Updating Pod Resources in Job template is not allowed.
+   - Updating Pod Resources in Deployment template continues to result in Pod
+     being restarted with updated resources.
+1. Verify that Pod updates by older versions of client-go don't result in
+   current values of the ResourcesAllocated and ResizePolicy fields being
+   dropped.
+1. Verify that only CPU and memory resources are mutable by user.
+
+TODO: Identify more cases

 ## Graduation Criteria

-TODO
+### Alpha
+- In-Place Pod Resources Update functionality is implemented for running Pods,
+- LimitRanger and ResourceQuota handling are added,
+- Resize Policies functionality is implemented,
+- Unit tests and E2E tests covering basic functionality are added,
+- E2E tests covering multiple containers are added.
+
+### Beta
+- VPA alpha integration of feature completed and any bugs addressed,
+- E2E tests covering Resize Policy, LimitRanger, and ResourceQuota are added,
+- Negative tests are identified and added.
+
+### Stable
+- VPA integration of feature moved to beta,
+- User feedback (ideally from at least two distinct users) is green,
+- No major bugs reported for three months.
## Implementation History @@ -373,3 +539,7 @@ TODO - 2019-01-18 - implementation proposal extended - 2019-03-07 - changes to flow control, updates per review feedback - 2019-08-29 - updated design proposal +- 2019-10-25 - update key open items and move KEP to implementable +- 2020-01-06 - API review suggested changes incorporated +- 2020-01-13 - Test plan and graduation criteria added +- 2020-01-21 - Graduation criteria updated per review feedback diff --git a/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md b/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md new file mode 100644 index 00000000000..c437a2ccbc3 --- /dev/null +++ b/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md @@ -0,0 +1,249 @@ +--- +title: Container Resources CRI API Changes for Pod Vertical Scaling +authors: + - "@vinaykul" + - "@quinton-hoole" +owning-sig: sig-node +participating-sigs: +reviewers: + - "@Random-Liu" + - "@yujuhong" + - "@PatrickLang" +approvers: + - "@dchen1107" + - "@derekwaynecarr" +editor: TBD +creation-date: 2019-10-25 +last-updated: 2020-01-14 +status: implementable +see-also: + - "/keps/sig-node/20181106-in-place-update-of-pod-resources.md" +replaces: +superseded-by: +--- + +# Container Resources CRI API Changes for Pod Vertical Scaling + +## Table of Contents + + +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) +- [Design Details](#design-details) + - [Expected Behavior of CRI Runtime](#expected-behavior-of-cri-runtime) + - [Test Plan](#test-plan) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [Stable](#stable) +- [Implementation History](#implementation-history) + + +## Summary + +This proposal aims to improve the Container Runtime Interface (CRI) APIs for +managing a Container's CPU and memory resource configurations on the runtime. 
+It seeks to extend UpdateContainerResources CRI API such that it works for +Windows, and other future runtimes besides Linux. It also seeks to extend +ContainerStatus CRI API to allow Kubelet to discover the current resources +configured on a Container. + +## Motivation + +In-Place Pod Vertical Scaling feature relies on Container Runtime Interface +(CRI) to update the CPU and/or memory limits for Container(s) in a Pod. + +The current CRI API set has a few drawbacks that need to be addressed: +1. UpdateContainerResources CRI API takes a parameter that describes Container + resources to update for Linux Containers, and this may not work for Windows + Containers or other potential non-Linux runtimes in the future. +1. There is no CRI mechanism that lets Kubelet query and discover the CPU and + memory limits configured on a Container from the Container runtime. +1. The expected behavior from a runtime that handles UpdateContainerResources + CRI API is not very well defined or documented. + +### Goals + +This proposal has two primary goals: + - Modify UpdateContainerResources to allow it to work for Windows Containers, + as well as Containers managed by other runtimes besides Linux, + - Provide CRI API mechanism to query the Container runtime for CPU and memory + resource configurations that are currently applied to a Container. + +An additional goal of this proposal is to better define and document the +expected behavior of a Container runtime when handling resource updates. + +### Non-Goals + +Definition of expected behavior of a Container runtime when it handles CRI APIs +related to a Container's resources is intended to be a high level guide. It is +a non-goal of this proposal to define a detailed or specific way to implement +these functions. Implementation specifics are left to the runtime, within the +bounds of expected behavior. 
+
+## Proposal
+
+One key change is to make the UpdateContainerResources API work for Windows,
+and any other future runtimes besides Linux, by making the resources parameter
+passed in the API specific to the target runtime.
+
+Another change in this proposal is to extend the ContainerStatus CRI API such
+that Kubelet can query and discover the CPU and memory resources that are
+presently applied to a Container.
+
+To accomplish the aforementioned goals:
+
+* A new protobuf message object named *ContainerResources* that encapsulates
+  LinuxContainerResources and WindowsContainerResources is introduced as below.
+  - This message can easily be extended for future runtimes by simply adding a
+    new runtime-specific resources struct to the ContainerResources message.
+```
+// ContainerResources holds resource configuration for a container.
+message ContainerResources {
+    // Resource configuration specific to Linux container.
+    LinuxContainerResources linux = 1;
+    // Resource configuration specific to Windows container.
+    WindowsContainerResources windows = 2;
+}
+```
+
+* UpdateContainerResourcesRequest message is extended to carry a
+  ContainerResources field as below.
+  - For Linux runtimes, Kubelet fills UpdateContainerResourcesRequest.Linux in
+    addition to the UpdateContainerResourcesRequest.Resources.Linux field.
+  - This keeps backward compatibility by letting runtimes that rely on the
+    current LinuxContainerResources continue to work, while enabling newer
+    runtime versions to use UpdateContainerResourcesRequest.Resources.Linux,
+  - It enables deprecation of the UpdateContainerResourcesRequest.Linux field.
+```
+message UpdateContainerResourcesRequest {
+    // ID of the container to update.
+    string container_id = 1;
+    // Resource configuration specific to Linux container.
+    LinuxContainerResources linux = 2;
+    // Resource configuration for the container.
+    ContainerResources resources = 3;
+}
+```
+
+* ContainerStatus message is extended to return ContainerResources as below.
+ - This enables Kubelet to query the runtime and discover resources currently + applied to a Container using ContainerStatus CRI API. +``` +@@ -914,6 +912,8 @@ message ContainerStatus { + repeated Mount mounts = 14; + // Log path of container. + string log_path = 15; ++ // Resource configuration of the container. ++ ContainerResources resources = 16; + } +``` + +* ContainerManager CRI API service interface is modified as below. + - UpdateContainerResources takes ContainerResources parameter instead of + LinuxContainerResources. +``` +--- a/staging/src/k8s.io/cri-api/pkg/apis/services.go ++++ b/staging/src/k8s.io/cri-api/pkg/apis/services.go +@@ -43,8 +43,10 @@ type ContainerManager interface { + ListContainers(filter *runtimeapi.ContainerFilter) ([]*runtimeapi.Container, error) + // ContainerStatus returns the status of the container. + ContainerStatus(containerID string) (*runtimeapi.ContainerStatus, error) +- // UpdateContainerResources updates the cgroup resources for the container. +- UpdateContainerResources(containerID string, resources *runtimeapi.LinuxContainerResources) error ++ // UpdateContainerResources updates resource configuration for the container. ++ UpdateContainerResources(containerID string, resources *runtimeapi.ContainerResources) error + // ExecSync executes a command in the container, and returns the stdout output. + // If command exits with a non-zero exit code, an error is returned. + ExecSync(containerID string, cmd []string, timeout time.Duration) (stdout []byte, stderr []byte, err error) +``` + +* Kubelet code is modified to leverage these changes. + +## Design Details + +Below diagram is an overview of Kubelet using UpdateContainerResources and +ContainerStatus CRI APIs to set new container resource limits, and update the +Pod Status in response to user changing the desired resources in Pod Spec. 
+
+```
+  +-----------+          +-----------+          +-----------+
+  |           |          |           |          |           |
+  | apiserver |          |  kubelet  |          |  runtime  |
+  |           |          |           |          |           |
+  +-----+-----+          +-----+-----+          +-----+-----+
+        |                      |                      |
+        |  watch (pod update)  |                      |
+        |--------------------->|                      |
+        |[Containers.Resources]|                      |
+        |                      |                      |
+        |                   (admit)                   |
+        |                      |                      |
+        |                      | UpdateContainerResources()
+        |                      |--------------------->|
+        |                      |         (set limits) |
+        |                      |<- - - - - - - - - - -|
+        |                      |                      |
+        |                      |  ContainerStatus()   |
+        |                      |--------------------->|
+        |                      |                      |
+        |                      | [ContainerResources] |
+        |                      |<- - - - - - - - - - -|
+        |                      |                      |
+        |  update (pod status) |                      |
+        |<---------------------|                      |
+        |[ContainerStatuses.Resources]                |
+        |                      |                      |
+```
+
+* Kubelet invokes UpdateContainerResources() CRI API in the ContainerManager
+  interface to configure new CPU and memory limits for a Container by
+  specifying those values in the ContainerResources parameter to the API.
+  Kubelet sets the ContainerResources parameter specific to the target runtime
+  platform when calling this CRI API.
+
+* Kubelet calls ContainerStatus() CRI API in the ContainerManager interface to
+  get the CPU and memory limits applied to a Container. It uses the values
+  returned in ContainerStatus.Resources to update
+  ContainerStatuses[i].Resources.Limits for that Container in the Pod's Status.
+
+### Expected Behavior of CRI Runtime
+
+TBD
+
+### Test Plan
+
+* Unit tests are updated to reflect the use of the ContainerResources object in
+  the UpdateContainerResources and ContainerStatus APIs.
+
+* E2E test is added to verify the UpdateContainerResources API with docker
+  runtime.
+
+* E2E test is added to verify the ContainerStatus API using docker runtime.
+
+* E2E test is added to verify backward compatibility using docker runtime.
+
+### Graduation Criteria
+
+#### Alpha
+
+* UpdateContainerResources and ContainerStatus API changes are done and tested
+  with dockershim and docker runtime; backward compatibility is maintained.
+ +#### Beta + +* UpdateContainerResources and ContainerStatus API changes are completed and + tested for Windows runtime. + +#### Stable + +* No major bugs reported for three months. + +## Implementation History + +- 2019-10-25 - Initial KEP draft created +- 2020-01-14 - Test plan and graduation criteria added +