From bd3a3f31bd241db2c941329a109f683b728cae5a Mon Sep 17 00:00:00 2001 From: Mike Spreitzer Date: Mon, 2 May 2022 01:31:30 -0400 Subject: [PATCH 1/5] Draft borrowing between priority levels in APF, high borrows from low --- .../1040-priority-and-fairness/README.md | 202 ++++++++++++++++-- 1 file changed, 185 insertions(+), 17 deletions(-) diff --git a/keps/sig-api-machinery/1040-priority-and-fairness/README.md b/keps/sig-api-machinery/1040-priority-and-fairness/README.md index c9dc58c64f4..b7b19bfc5b8 100644 --- a/keps/sig-api-machinery/1040-priority-and-fairness/README.md +++ b/keps/sig-api-machinery/1040-priority-and-fairness/README.md @@ -283,8 +283,10 @@ In short, this proposal is about generalizing the existing max-in-flight request handler in apiservers to add more discriminating handling of requests. The overall approach is that each request is categorized to a priority level and a queue within that priority -level; each priority level dispatches to its own isolated concurrency -pool; within each priority level queues compete with even fairness. +level; each priority level dispatches to its own concurrency pool and, +according to a configured limit, unused concurrency borrowed from +lower priority levels; within each priority level queues compete with +even fairness. ### Request Categorization @@ -638,24 +640,190 @@ always dispatched immediately. Following is how the other requests are dispatched at a given apiserver. The concurrency limit of an apiserver is divided among the non-exempt -priority levels in proportion to their assured concurrency shares. -This produces the assured concurrency value (ACV) for each non-exempt -priority level: +priority levels, and higher ones can do a limited amount of borrowing +from lower ones. -``` -ACV(l) = ceil( SCL * ACS(l) / ( sum[priority levels k] ACS(k) ) ) -``` +Two fields of `LimitedPriorityLevelConfiguration`, introduced in the +midst of the `v1beta2` lifetime, configure the borrowing. 
The fields +are added in all the versions (`v1alpha1`, `v1beta1`, and `v1beta2`). +The following display shows the two new fields along with the updated +description for the `AssuredConcurrencyShares` field, in `v1beta2`. -where SCL is the apiserver's concurrency limit and ACS(l) is the -AssuredConcurrencyShares for priority level l. +```go +type LimitedPriorityLevelConfiguration struct { + ... + // `assuredConcurrencyShares` (ACS) contributes to the computation of the + // NominalConcurrencyLimit (NCL) of this level. + // This is the number of execution seats available at this priority level. + // This is used both for requests dispatched from + // this priority level as well as requests dispatched from higher priority + // levels borrowing seats from this level. This does not limit dispatching from + // this priority level that borrows seats from lower priority levels (those lower + // levels do that). The server's concurrency limit (SCL) is divided among the + // Limited priority levels in proportion to their ACS values: + // + // NCL(i) = ceil( SCL * ACS(i) / sum_acs ) + // sum_acs = sum[limited priority level k] ACS(k) + // + // Bigger numbers mean a larger nominal concurrency limit, at the expense + // of every other Limited priority level. + // This field has a default value of 30. + // +optional + AssuredConcurrencyShares int32 + + // `borrowablePercent` prescribes the fraction of the level's NCL that + // can be borrowed by higher priority levels. The value of this + // field must be between 0 and 100, inclusive, and it defaults to 0. + // The number of seats that higher levels can borrow from this level, known + // as this level's BorrowableConcurrencyLimit (BCL), is defined as follows. 
+ // + // BCL(i) = round( NCL(i) * borrowablePercent(i)/100.0 ) + // + // +optional + BorrowablePercent int32 + + // `priority` determines where this priority level appears in the total order + // of Limited priority levels used to configure borrowing between those levels. + // A numerically higher value means a logically lower priority. + // Do not create ties; they will be broken arbitrarily. + // `priority` MUST BE between 0 and 10000, inclusive, and + // SHOULD BE greater than zero. + // If it is zero then, for the sake of a smooth transition from the time + // before this field existed, this level will be treated as if its `priority` + // is the average of the `matchingPrecedence` of the FlowSchema objects + // that reference this level. + // +optional + Priority int32 +} +``` -Dispatching is done independently for each priority level. Whenever -(1) a non-exempt priority level's number of running requests is zero -or below the level's assured concurrency value and (2) that priority -level has a non-empty queue, it is time to dispatch another request -for service. The Fair Queuing for Server Requests algorithm below is -used to pick a non-empty queue at that priority level. Then the -request at the head of that queue is dispatched. +Prior to the introduction of borrowing, the `assuredConcurrencyShares` +field had two meanings that amounted to the same thing: the total +shares of the level, and the non-borrowable shares of the level. +While it is somewhat unnatural to keep the meaning of "total shares" +for a field named "assured" shares, rolling out the new behavior into +existing systems will be more continuous if we keep the meaning of +"total shares" for the existing field. In the next version we should +rename the `AssuredConcurrencyShares` to `NominalConcurrencyShares`. + +Consider rolling the borrowing behavior into a pre-existing cluster +that did not have borrowing. 
In particular, suppose that the +administrators of the cluster have scripts/clients that maintain some +out-of-tree PriorityLevelConfiguration objects; these, naturally, do +not specify a value for the `priority` field. The default behavior +for `priority` is designed to do something more natural and convenient +than have them all collide at some fixed number. + +The following table shows the current default non-exempt priority +levels and a proposal for their new configuration. For the sake of +continuity with out-of-tree configuration objects, the proposed +priority values follow the rule given above for the effective value +when the priority field holds zero. + +| Name | Assured Shares | Proposed Borrowable Percent | Proposed Priority | +| ---- | -------------: | --------------------------: | ----------------: | +| leader-election | 10 | 0 | 150 | +| node-high | 40 | 25 | 400 | +| system | 30 | 33 | 500 | +| workload-high | 40 | 50 | 833 | +| workload-low | 100 | 90 | 9000 | +| global-default | 20 | 50 | 9900 | +| catch-all | 5 | 0 | 10000 | + + +Borrowing is done on the basis of the current situation, with no +consideration of opportunity cost, no further rationing according to +shares (just obeying the concurrency limits as outlined above), and no +pre-emption when the situation changes. + +Whenever a request is dispatched, it takes all its seats from one +priority level --- either the one referenced by the request's +FlowSchema or a logically lower priority level. + +Note: We *could* take a complementary approach, in which a request can +borrow from a logically higher priority level. That would go together +with a bigger change in the default configuration and greater +challenges in incremental rollout. That is not the approach described +here. + +In the implementation, there is an important consideration regarding +locking. 
We do not want one global mutex to be held for any work on +dispatching; we want to allow concurrent work on dispatching, where +possible. Before the introduction of borrowing, the priority levels +operated completely independently and the only global thing was a +`sync.RWMutex` that is locked for reading to do dispatching work and +locked for writing only when digesting changes to the configuration +API objects. Each priority level has its own private mutex. +Borrowing introduces an interaction between priority levels, requiring +multiple of those private locks to be held at once. We must avoid +deadlock. This is done by insisting that whenever two locks are to be +held at once, they are acquired in some total order. In particular, +the lock of a logically higher priority level is acquired before the +lock of a logically lower priority level. The locking order does not +have to be the same as the priority order, but we make it the same for +the sake of simplicity. + +A request can be dispatched from a queue of priority level X to seats +at non-exempt priority level Y when either there are no requests +executing at level Y or the number of seats needed by that request is +no greater than the number of unused seats at priority level Y. The +number of unused seats at a given priority level is that level's +NominalConcurrencyLimit minus the number of seats used by requests +executing at that priority level (both requests dispatched from that +priority level and requests dispatched from higher priority levels +that are borrowing these lower level seats). + +There are two sorts of times when dispatching from/to a non-exempt +priority level is considered: when a request arrives, and when a +request releases the seats it was occupying (which is not the same as +when the request finishes from the client's point of view, due to the +special considerations for WATCH requests). 
+ +At each of these sorts of moments, as many requests as possible are +dispatched from the priority level involved to the same priority +level. The next request to consider in this process is chosen by +using the Fair Queuing for Server Requests algorithm below to choose +one of the non-empty queues (if indeed there are any) at that priority +level. This was the entirety of the reaction before the introduction +of borrowing. + +Borrowing extends the reaction as follows. There are two cases. The +simpler case is when reacting to a request arrival, let us say at +priority level X. In this case, and if the baseline reaction --- +dispatching as many requests as possible from X to X --- ends with a +request to dispatch but not enough seats available, then the reaction +continues with trying to dispatch that request to the logically lower +priority levels. With the lock of X still held, the logically lower +levels Y are enumerated in logically decreasing order and dispatch to +each one of them is considered. This consideration starts by +acquiring Y's lock, then dispatches as many requests from X to Y as +are allowed by the above rule, and finally releases Y's lock. +Naturally, if/when X runs out of requests to dispatch, this reaction +stops. + +The more complicated case is reacting to seats being freed up. Let us +say this is at priority level X. As in the other case, the reaction +starts with the baseline, dispatching as much as possible from X to X. +If this leaves some seats at X still available, then the reaction +continues by trying to dispatch from higher priority levels to X. The +lock of X is not held over this loop, that would violate locking +order. This iteration starts by releasing the lock of X and then +iterating over the higher priority levels Y (in logically decreasing +priority order) and considering each one in turn. 
That consideration +starts by acquiring the locks of Y and X (in that order), then does as +much dispatching from Y to X as is allowed, and finally releases the +two locks. Note that dispatching one wide request from Y to X may +unblock immediate dispatching of additional requests from Y (whether +to X, Y, or another priority level). The dispatching of all possible +from Y to X does not cover all of that. Thus, the reaction to seats +being freed up has to include keeping a list of priority levels Y to +consider general dispatching from. An element is added to that list +whenever the dispatching from Y to X concludes in a way that warrants +more general consideration of Y. After the primary iteration above is +done, the reaction to seats being freed continues with iterating over +this list of Y to reconsider and reacting as for an arrival at Y: +dispatch as much as possible from Y to Y and logically lower priority +levels. ### Fair Queuing for Server Requests From a3fd87c77a591b788002a0e666144ea1f88744f2 Mon Sep 17 00:00:00 2001 From: Mike Spreitzer Date: Mon, 13 Jun 2022 15:58:45 -0400 Subject: [PATCH 2/5] Do borrowing by periodic adjustment of concurrency limits --- .../1040-priority-and-fairness/README.md | 147 +++++++----------- 1 file changed, 53 insertions(+), 94 deletions(-) diff --git a/keps/sig-api-machinery/1040-priority-and-fairness/README.md b/keps/sig-api-machinery/1040-priority-and-fairness/README.md index b7b19bfc5b8..58fcd802229 100644 --- a/keps/sig-api-machinery/1040-priority-and-fairness/README.md +++ b/keps/sig-api-machinery/1040-priority-and-fairness/README.md @@ -730,100 +730,59 @@ when the priority field holds zero. | global-default | 20 | 50 | 9900 | | catch-all | 5 | 0 | 10000 | - -Borrowing is done on the basis of the current situation, with no -consideration of opportunity cost, no further rationing according to -shares (just obeying the concurrency limits as outlined above), and no -pre-emption when the situation changes. 
- -Whenever a request is dispatched, it takes all its seats from one -priority level --- either the one referenced by the request's -FlowSchema or a logically lower priority level. - -Note: We *could* take a complementary approach, in which a request can -borrow from a logically higher priority level. That would go together -with a bigger change in the default configuration and greater -challenges in incremental rollout. That is not the approach described -here. - -In the implementation, there is an important consideration regarding -locking. We do not want one global mutex to be held for any work on -dispatching; we want to allow concurrent work on dispatching, where -possible. Before the introduction of borrowing, the priority levels -operated completely independently and the only global thing was a -`sync.RWMutex` that is locked for reading to do dispatching work and -locked for writing only when digesting changes to the configuration -API objects. Each priority level has its own private mutex. -Borrowing introduces an interaction between priority levels, requiring -multiple of those private locks to be held at once. We must avoid -deadlock. This is done by insisting that whenever two locks are to be -held at once, they are acquired in some total order. In particular, -the lock of a logically higher priority level is acquired before the -lock of a logically lower priority level. The locking order does not -have to be the same as the priority order, but we make it the same for -the sake of simplicity. - -A request can be dispatched from a queue of priority level X to seats -at non-exempt priority level Y when either there are no requests -executing at level Y or the number of seats needed by that request is -no greater than the number of unused seats at priority level Y. 
The -number of unused seats at a given priority level is that level's -NominalConcurrencyLimit minus the number of seats used by requests -executing at that priority level (both requests dispatched from that -priority level and requests dispatched from higher priority levels -that are borrowing these lower level seats). - -There are two sorts of times when dispatching from/to a non-exempt -priority level is considered: when a request arrives, and when a -request releases the seats it was occupying (which is not the same as -when the request finishes from the client's point of view, due to the -special considerations for WATCH requests). - -At each of these sorts of moments, as many requests as possible are -dispatched from the priority level involved to the same priority -level. The next request to consider in this process is chosen by -using the Fair Queuing for Server Requests algorithm below to choose -one of the non-empty queues (if indeed there are any) at that priority -level. This was the entirety of the reaction before the introduction -of borrowing. - -Borrowing extends the reaction as follows. There are two cases. The -simpler case is when reacting to a request arrival, let us say at -priority level X. In this case, and if the baseline reaction --- -dispatching as many requests as possible from X to X --- ends with a -request to dispatch but not enough seats available, then the reaction -continues with trying to dispatch that request to the logically lower -priority levels. With the lock of X still held, the logically lower -levels Y are enumerated in logically decreasing order and dispatch to -each one of them is considered. This consideration starts by -acquiring Y's lock, then dispatches as many requests from X to Y as -are allowed by the above rule, and finally releases Y's lock. -Naturally, if/when X runs out of requests to dispatch, this reaction -stops. - -The more complicated case is reacting to seats being freed up. 
Let us -say this is at priority level X. As in the other case, the reaction -starts with the baseline, dispatching as much as possible from X to X. -If this leaves some seats at X still available, then the reaction -continues by trying to dispatch from higher priority levels to X. The -lock of X is not held over this loop, that would violate locking -order. This iteration starts by releasing the lock of X and then -iterating over the higher priority levels Y (in logically decreasing -priority order) and considering each one in turn. That consideration -starts by acquiring the locks of Y and X (in that order), then does as -much dispatching from Y to X as is allowed, and finally releases the -two locks. Note that dispatching one wide request from Y to X may -unblock immediate dispatching of additional requests from Y (whether -to X, Y, or another priority level). The dispatching of all possible -from Y to X does not cover all of that. Thus, the reaction to seats -being freed up has to include keeping a list of priority levels Y to -consider general dispatching from. An element is added to that list -whenever the dispatching from Y to X concludes in a way that warrants -more general consideration of Y. After the primary iteration above is -done, the reaction to seats being freed continues with iterating over -this list of Y to reconsider and reacting as for an arrival at Y: -dispatch as much as possible from Y to Y and logically lower priority -levels. +Each priority level has two concurrency limits: its +NominalConcurrencyLimit (NCL) as defined above by configuration, and a +CurrentConcurrencyLimit (CCL) that is used in dispatching requests. +The CCLs are adjusted periodically, based on configuration, the +current situation at adjustment time, and recent observations. The +"borrowing" resides in the differences between CCL and NCL. 
A +priority level's CCL can go as low as NCL-BCL; the upper limit is +imposed only by how many seats are available for borrowing from other +priority levels. The sum of the CCLs, like the sum of the NCLs, is +always equal to the server's concurrency limit (SCL). These CCLs are +floating-point values, because the adjustment logic below is +incremental. The actual limits used in dispatching are the result of +rounding these floating-point numbers to their nearest integer. + +Dispatching is done independently for each priority level. Whenever +(1) a non-exempt priority level's number of occupied seats is zero or +below the level's rounded CCL and (2) that priority level has a +non-empty queue, it is time to dispatch another request for service. +The Fair Queuing for Server Requests algorithm below is used to pick a +non-empty queue at that priority level. Then the request at the head +of that queue is dispatched if possible. + +Every 10 seconds, all the CCLs are adjusted. The adjustments take +into account high watermarks of seat demand. A priority level's seat +demand is the sum of its occupied seats and the number of seats in the +queued requests. Each priority level has two high watermarks: a +short-term one M1 and a long-term one M2. During an adjustment +period, M1 is updated to track the maximum seat demand seen during +that adjustment period. At the end of every adjustment period, M2 is +set to `max(M1, A*M2 + (1-A)*M1)` and M1 is set to the current seat +demand. That is, M2 jumps up to M1 if that is higher (so a spike in +demand gets an immediate response at adjustment time), otherwise +exponentially drifts down toward M1 with a parameter A; 0.9 might be a +good value for A. + +The adjustment logic takes the M2 values as desired targets to aim +toward and adjusts the CCL values in two steps. The first step aims +to equalize the "pressure" on the priority levels. Define each +priority level's pressure `P = max(NCL-BCL, M2) - CCL`. 
Let `PAvg` be +the result of averaging P over the priority levels. The first step +adjusts each CCL by adding `B * (P - PAvg)`. We use a coefficient B +--- for which 0.25 might be a good value --- so that this step goes +only part way toward its target. Such damping is commonly done in +controllers. + +The second step corrects for any lower bounds violations. There are +two lower bounds: one imposed by the limit on borrowable seats (BCL), +and one imposed by priority levels that wish to reclaim borrowed seats +due to recent load. For every priority level where `CCL < +max(NCL-BCL, min(NCL, M2))`, CCL gets increased to that lower bound. +Whenever there are such increases, there must also be priority levels +for which `CCL - NCL > 0`. The seats for the former are taken from +the latter, in proportion to the latter difference. ### Fair Queuing for Server Requests From fbedda19b10b2d257b31608130b6f9e4e3e44a52 Mon Sep 17 00:00:00 2001 From: Mike Spreitzer Date: Wed, 15 Jun 2022 15:17:14 -0400 Subject: [PATCH 3/5] Update borrowing adjustment: smooth input, max-min problem --- .../1040-priority-and-fairness/README.md | 165 +++++++++++------- 1 file changed, 106 insertions(+), 59 deletions(-) diff --git a/keps/sig-api-machinery/1040-priority-and-fairness/README.md b/keps/sig-api-machinery/1040-priority-and-fairness/README.md index 58fcd802229..2d225e6af52 100644 --- a/keps/sig-api-machinery/1040-priority-and-fairness/README.md +++ b/keps/sig-api-machinery/1040-priority-and-fairness/README.md @@ -640,8 +640,8 @@ always dispatched immediately. Following is how the other requests are dispatched at a given apiserver. The concurrency limit of an apiserver is divided among the non-exempt -priority levels, and higher ones can do a limited amount of borrowing -from lower ones. +priority levels, and they can do a limited amount of borrowing from +each other. 
Two fields of `LimitedPriorityLevelConfiguration`, introduced in the midst of the `v1beta2` lifetime, configure the borrowing. The fields @@ -649,20 +649,24 @@ are added in all the versions (`v1alpha1`, `v1beta1`, and `v1beta2`). The following display shows the two new fields along with the updated description for the `AssuredConcurrencyShares` field, in `v1beta2`. +**Note**: currently this design does not use the `Priority` field for +anything. We should either use it for something or take it out of the +design. + ```go type LimitedPriorityLevelConfiguration struct { ... // `assuredConcurrencyShares` (ACS) contributes to the computation of the - // NominalConcurrencyLimit (NCL) of this level. + // NominalConcurrencyLimit (NominalCL) of this level. // This is the number of execution seats available at this priority level. // This is used both for requests dispatched from // this priority level as well as requests dispatched from higher priority // levels borrowing seats from this level. This does not limit dispatching from // this priority level that borrows seats from lower priority levels (those lower - // levels do that). The server's concurrency limit (SCL) is divided among the + // levels do that). The server's concurrency limit (ServerCL) is divided among the // Limited priority levels in proportion to their ACS values: // - // NCL(i) = ceil( SCL * ACS(i) / sum_acs ) + // NominalCL(i) = ceil( ServerCL * ACS(i) / sum_acs ) // sum_acs = sum[limited priority level k] ACS(k) // // Bigger numbers mean a larger nominal concurrency limit, at the expense @@ -671,13 +675,13 @@ type LimitedPriorityLevelConfiguration struct { // +optional AssuredConcurrencyShares int32 - // `borrowablePercent` prescribes the fraction of the level's NCL that + // `borrowablePercent` prescribes the fraction of the level's NominalCL that // can be borrowed by higher priority levels. This value of this // field must be between 0 and 100, inclusive, and it defaults to 0. 
// The number of seats that higher levels can borrow from this level, known - // as this level's BorrowableConcurrencyLimit (BCL), is defined as follows. + // as this level's BorrowableConcurrencyLimit (BorrowableCL), is defined as follows. // - // BCL(i) = round( NCL(i) * borrowablePercent(i)/100.0 ) + // BorrowableCL(i) = round( NominalCL(i) * borrowablePercent(i)/100.0 ) // // +optional BorrowablePercent int32 @@ -730,59 +734,91 @@ when the priority field holds zero. | global-default | 20 | 50 | 9900 | | catch-all | 5 | 0 | 10000 | -Each priority level has two concurrency limits: its -NominalConcurrencyLimit (NCL) as defined above by configuration, and a -CurrentConcurrencyLimit (CCL) that is used in dispatching requests. -The CCLs are adjusted periodically, based on configuration, the -current situation at adjustment time, and recent observations. The -"borrowing" resides in the differences between CCL and NCL. A -priority level's CCL can go as low as NCL-BCL; the upper limit is -imposed only by how many seats are available for borrowing from other -priority levels. The sum of the CCLs, like the sum of the NCLs, is -always equal to the server's concurrency limit (SCL). These CCLs are -floating-point values, because the adjustment logic below is -incremental. The actual limits used in dispatching are the result of -rounding these floating-point numbers to their nearest integer. +Each non-exempt priority level `i` has two concurrency limits: its +NominalConcurrencyLimit (`NominalCL(i)`) as defined above by +configuration, and a CurrentConcurrencyLimit (`CurrentCL(i)`) that is +used in dispatching requests. The CurrentCLs are adjusted +periodically, based on configuration, the current situation at +adjustment time, and recent observations. The "borrowing" resides in +the differences between CurrentCL and NominalCL. 
There is a lower +bound on each non-exempt priority level's CurrentCL: `MinCL(i) = +NominalCL(i) - BorrowableCL(i)`; the upper limit is imposed only by +how many seats are available for borrowing from other priority levels. +The sum of the CurrentCLs is always equal to the server's concurrency +limit (ServerCL) plus or minus a little for rounding in the adjustment +algorithm below. Dispatching is done independently for each priority level. Whenever (1) a non-exempt priority level's number of occupied seats is zero or -below the level's rounded CCL and (2) that priority level has a -non-empty queue, it is time to dispatch another request for service. -The Fair Queuing for Server Requests algorithm below is used to pick a -non-empty queue at that priority level. Then the request at the head -of that queue is dispatched if possible. - -Every 10 seconds, all the CCLs are adjusted. The adjustments take -into account high watermarks of seat demand. A priority level's seat -demand is the sum of its occupied seats and the number of seats in the -queued requests. Each priority level has two high watermarks: a -short-term one M1 and a long-term one M2. During an adjustment -period, M1 is updated to track the maximum seat demand seen during -that adjustment period. At the end of every adjustment period, M2 is -set to `max(M1, A*M2 + (1-A)*M1)` and M1 is set to the current seat -demand. That is, M2 jumps up to M1 if that is higher (so a spike in -demand gets an immediate response at adjustment time), otherwise -exponentially drifts down toward M1 with a parameter A; 0.9 might be a -good value for A. - -The adjustment logic takes the M2 values as desired targets to aim -toward and adjusts the CCL values in two steps. The first step aims -to equalize the "pressure" on the priority levels. Define each -priority level's pressure `P = max(NCL-BCL, M2) - CCL`. Let `PAvg` be -the result of averaging P over the priority levels. 
The first step -adjusts each CCL by adding `B * (P - PAvg)`. We use a coefficient B ---- for which 0.25 might be a good value --- so that this step goes -only part way toward its target. Such damping is commonly done in -controllers. - -The second step corrects for any lower bounds violations. There are -two lower bounds: one imposed by the limit on borrowable seats (BCL), -and one imposed by priority levels that wish to reclaim borrowed seats -due to recent load. For every priority level where `CCL < -max(NCL-BCL, min(NCL, M2))`, CCL gets increased to that lower bound. -Whenever there are such increases, there must also be priority levels -for which `CCL - NCL > 0`. The seats for the former are taken from -the latter, in proportion to the latter difference. +below the level's CurrentCL and (2) that priority level has a +non-empty queue, it is time to consider dispatching another request +for service. The Fair Queuing for Server Requests algorithm below is +used to pick a non-empty queue at that priority level. Then the +request at the head of that queue is dispatched if possible. + +Every 10 seconds, all the CurrentCLs are adjusted. We do smoothing on +the inputs to the adjustment logic in order to dampen control +gyrations, in a way that lets a priority level reclaim lent seats at +the nearest adjustment time. The adjustments take into account the +high watermark `HighSD(i)`, time-weighted average `AvgSD(i)`, and +time-weighted population standard deviation `StDevSD(i)` of each +priority level `i`'s seat demand over the just-concluded adjustment +period. A priority level's seat demand at any given moment is the sum +of its occupied seats and the number of seats in the queued requests. +We also define `EnvelopeSD(i) = AvgSD(i) + StDevSD(i)`. 
The +adjustment logic is driven by a quantity called smoothed seat demand +(`SmoothSD(i)`), which does an exponential averaging of EnvelopeSD +values using a coefficient A in the range (0,1) and immediately tracks +EnvelopeSD when it exceeds SmoothSD. The rule for updating priority +level `i`'s SmoothSD at the end of an adjustment period is +`SmoothSD(i) := max( EnvelopeSD(i), A*SmoothSD(i) + (1-A)*EnvelopeSD(i) +)`. The command line flag `--seat-demand-history-fraction` with a +default value of 0.9 configures A. + +Adjustment is also done on configuration change, when a priority level +is introduced or removed or its NominalCL or BorrowableCL changes. At +such a time, the current adjustment period comes to an early end and +the regular adjustment logic runs; the adjustment timer is reset to +next fire 10 seconds later. For a newly introduced priority level, we +set HighSD, AvgSD, and SmoothSD to NominalCL - BorrowableCL/2 and +StDevSD to zero. + +For adjusting the CurrentCL values, each non-exempt priority level `i` +has a lower bound (`MinCurrentCL(i)`) for the new value. It is simply +HighSD clipped by the configured concurrency limits: `MinCurrentCL(i) += max( MinCL(i), min( NominalCL(i), HighSD(i) ) )`. + +If MinCurrentCL(i) = NominalCL(i) for every non-exempt priority level +i then there is no wiggle room. No priority level is willing to lend +any seats. The new CurrentCL values must equal the NominalCL values. +Otherwise there is wiggle room and the adjustment proceeds as follows. + +The priority levels would all be fairly happy if we set CurrentCL = +SmoothSD for each. We clip that by the lower bound just shown, taking +`Target(i) = max(SmoothSD(i), MinCurrentCL(i))` as a first-order +target for each non-exempt priority level `i`. + +Sadly, the sum of the Target values --- let's name that TargetSum --- +is not necessarily equal to ServerCL, and scaling the Target values to +correct for that can violate the corresponding MinCurrentCL bounds. 
If +we had only the first of those two problems then we could set each +CurrentCL(i) to FairFrac * Target(i) where FairFrac = ServerCL / +TargetSum. This would share the gain or pain in equal proportion +among the priority levels. Taking the lower bounds into account means +finding the one FairFrac value that solves the following conditions, +for all the non-exempt priority levels `i`, and also makes the +CurrentCL values sum to ServerCL. For this step we let the CurrentCL +values be floating-point numbers, not necessarily integers. + +``` +CurrentCL(i) = FairFrac * Target(i) if FairFrac * Target(i) >= MinCurrentCL(i) +CurrentCL(i) = MinCurrentCL(i) if FairFrac * Target(i) <= MinCurrentCL(i) +``` + +This is the mirror image of the max-min fairness problem and can be +solved with the same sort of algorithm, taking O(N log N) time and +O(N) space. After finding the floating point CurrentCL solutions, +each one is rounded to the nearest integer to use in dispatching. ### Fair Queuing for Server Requests @@ -1917,7 +1953,7 @@ others, at any given time this may compute for some priority level(s) an assured concurrency value that is lower than the number currently executing. In these situations the total number allowed to execute will temporarily exceed the apiserver's configured concurrency limit -(`SCL`) and will settle down to the configured limit as requests +(`ServerCL`) and will settle down to the configured limit as requests complete their service. ### Default Behavior @@ -1991,6 +2027,17 @@ This KEP adds the following metrics. 
- apiserver_dispatched_requests (count, broken down by priority, FlowSchema)
- apiserver_wait_duration (histogram, broken down by priority, FlowSchema)
- apiserver_service_duration (histogram, broken down by priority, FlowSchema)
+- `apiserver_flowcontrol_request_concurrency_limit` (gauge of NominalCL, broken down by priority)
+- `apiserver_flowcontrol_request_min_concurrency_limit` (gauge of MinCL, broken down by priority)
+- `apiserver_flowcontrol_request_current_concurrency_limit` (gauge of CurrentCL, broken down by priority)
+- `apiserver_flowcontrol_demand_seats` (timing ratio histogram of seat demand / NominalCL, broken down by priority)
+- `apiserver_flowcontrol_demand_seats_high_water_mark` (gauge of HighSD, broken down by priority)
+- `apiserver_flowcontrol_demand_seats_average` (gauge of AvgSD, broken down by priority)
+- `apiserver_flowcontrol_demand_seats_stdev` (gauge of StDevSD, broken down by priority)
+- `apiserver_flowcontrol_envelope_seats` (gauge of EnvelopeSD, broken down by priority)
+- `apiserver_flowcontrol_smoothed_demand_seats` (gauge of SmoothSD, broken down by priority)
+- `apiserver_flowcontrol_target_seats` (gauge of Target, broken down by priority)
+- `apiserver_flowcontrol_seat_fair_frac` (gauge of FairFrac)

### Testing

From b2017df6647d365785b5b71fecde1762829bad50 Mon Sep 17 00:00:00 2001
From: Mike Spreitzer
Date: Fri, 17 Jun 2022 00:24:48 -0400
Subject: [PATCH 4/5] Revise borrowing adjustment write-up in response to review

---
 .../1040-priority-and-fairness/README.md | 108 ++++++++++--------
 1 file changed, 62 insertions(+), 46 deletions(-)

diff --git a/keps/sig-api-machinery/1040-priority-and-fairness/README.md b/keps/sig-api-machinery/1040-priority-and-fairness/README.md
index 2d225e6af52..9afb3c4898b 100644
--- a/keps/sig-api-machinery/1040-priority-and-fairness/README.md
+++ b/keps/sig-api-machinery/1040-priority-and-fairness/README.md
@@ -760,65 +760,81 @@ Every 10 seconds, all the CurrentCLs are adjusted.
We do smoothing on
the inputs to the adjustment logic in order to dampen control
gyrations, in a way that lets a priority level reclaim lent seats at
the nearest adjustment time. The adjustments take into account the
-high watermark `HighSD(i)`, time-weighted average `AvgSD(i)`, and
-time-weighted population standard deviation `StDevSD(i)` of each
-priority level `i`'s seat demand over the just-concluded adjustment
-period. A priority level's seat demand at any given moment is the sum
-of its occupied seats and the number of seats in the queued requests.
-We also define `EnvelopeSD(i) = AvgSD(i) + StDevSD(i)`. The
-adjustment logic is driven by a quantity called smoothed seat demand
-(`SmoothSD(i)`), which does an exponential averaging of EnvelopeSD
-values using a coefficient A in the range (0,1) and immediately tracks
-EnvelopeSD when it exceeds SmoothSD. The rule for updating priority
-level `i`'s SmoothSD at the end of an adjustment period is
-`SmoothSD(i) := max( EnvelopeSD(i), A*SmoothSD(i) + (1-A)*EnvelopeSD(i)
-)`. The command line flag `--seat-demand-history-fraction` with a
-default value of 0.9 configures A.
+high watermark `HighSeatDemand(i)`, time-weighted average
+`AvgSeatDemand(i)`, and time-weighted population standard deviation
+`StDevSeatDemand(i)` of each priority level `i`'s seat demand over the
+just-concluded adjustment period. A priority level's seat demand at
+any given moment is the sum of its occupied seats and the number of
+seats in the queued requests. We also define `EnvelopeSeatDemand(i) =
+AvgSeatDemand(i) + StDevSeatDemand(i)`. The adjustment logic is
+driven by a quantity called smoothed seat demand
+(`SmoothSeatDemand(i)`), which does an exponential averaging of
+EnvelopeSeatDemand values using a coefficient A in the range (0,1) and
+immediately tracks EnvelopeSeatDemand when it exceeds
+SmoothSeatDemand.
The rule for updating priority level `i`'s
+SmoothSeatDemand at the end of an adjustment period is
+`SmoothSeatDemand(i) := max( EnvelopeSeatDemand(i),
+A*SmoothSeatDemand(i) + (1-A)*EnvelopeSeatDemand(i) )`. The command
+line flag `--seat-demand-history-fraction` with a default value of 0.9
+configures A.

Adjustment is also done on configuration change, when a priority level
is introduced or removed or its NominalCL or BorrowableCL changes. At
such a time, the current adjustment period comes to an early end and
the regular adjustment logic runs; the adjustment timer is reset to
next fire 10 seconds later. For a newly introduced priority level, we
-set HighSD, AvgSD, and SmoothSD to NominalCL-BorrowableCL/2 and
-StDevSD to zero.
+set HighSeatDemand, AvgSeatDemand, and SmoothSeatDemand to
+NominalCL-BorrowableCL/2 and StDevSeatDemand to zero.

For adjusting the CurrentCL values, each non-exempt priority level `i`
has a lower bound (`MinCurrentCL(i)`) for the new value. It is simply
-HighSD clipped by the configured concurrency limits: `MinCurrentCL(i)
-= max( MinCL(i), min( NominalCL(i), HighSD(i) ) )`.
-
-If MinCurrentCL(i) = NominalCL(i) for every non-exempt priority level
-i then there is no wiggle room. No priority level is willing to lend
-any seats. The new CurrentCL values must equal the NominalCL values.
-Otherwise there is wiggle room and the adjustment proceeds as follows.
+HighSeatDemand clipped by the configured concurrency limits:
+`MinCurrentCL(i) = max( MinCL(i), min( NominalCL(i), HighSeatDemand(i)
+) )`.
+
+If `MinCurrentCL(i) = NominalCL(i)` for every non-exempt priority
+level `i` then there is no wiggle room. In this situation, no
+priority level is willing to lend any seats. The new CurrentCL values
+must equal the NominalCL values. Otherwise there is wiggle room and
+the adjustment proceeds as follows. For the following logic we let
+the CurrentCL values be floating-point numbers, not necessarily
+integers.
The priority levels would all be fairly happy if we set CurrentCL =
-SmoothSD for each. We clip that by the lower bound just shown, taking
-`Target(i) = max(SmoothSD(i), MinCurrentCL(i))` as a first-order
-target for each non-exempt priority level `i`.
+SmoothSeatDemand for each. We clip that by the lower bound just
+shown, taking `Target(i) = max(SmoothSeatDemand(i), MinCurrentCL(i))`
+as a first-order target for each non-exempt priority level `i`.

Sadly, the sum of the Target values --- let's name that TargetSum ---
-is not necessarily equal to ServerCL and the individual Target values
-do not necessarily respect the corresponding MinCurrentCL bound. If
-we had only the first of those two problems then we could set each
-CurrentCL(i) to FairFrac * Target(i) where FairFrac = ServerCL /
-TargetSum. This would share the gain or pain in equal proportion
-among the priority levels. Taking the lower bounds into account means
-finding the one FairFrac value that solves the following conditions,
-for all the non-exempt priority levels `i`, and also makes the
-CurrentCL values sum to ServerCL. For this step we let the CurrentCL
-values be floating-point numbers, not necessarily integers.
+is not necessarily equal to ServerCL. However, if `TargetSum <=
+ServerCL` then all the Targets can be scaled up in the same proportion
+`FairProp = ServerCL / TargetSum` to get the new concurrency limits.
+That is, `CurrentCL(i) := FairProp * Target(i)` for each non-exempt
+priority level `i`. This shares the wealth proportionally among the
+priority levels. Note also that the computation below produces the
+same result in this case.
+
+If `TargetSum > ServerCL` then we cannot necessarily scale all the
+Targets down by the same factor --- because that might violate some
+lower bounds.
The problem is to find a proportion `FairProp`, which
+we know must lie somewhere in the range (0,1) when `TargetSum >
+ServerCL`, that can be applied uniformly to all the priority levels
+except those whose lower bound forbids it. That means finding the one
+value of `FairProp` that satisfies the following conditions, for all
+the non-exempt priority levels `i`, and also makes the CurrentCL
+values sum to ServerCL.

```
-CurrentCL(i) = FairFrac * Target(i) if FairFrac * Target(i) >= MinCurrentCL(i)
-CurrentCL(i) = MinCurrentCL(i) if FairFrac * Target(i) <= MinCurrentCL(i)
+CurrentCL(i) = FairProp * Target(i) if FairProp * Target(i) >= MinCurrentCL(i)
+CurrentCL(i) = MinCurrentCL(i) if FairProp * Target(i) <= MinCurrentCL(i)
```

This is the mirror image of the max-min fairness problem and can be
solved with the same sort of algorithm, taking O(N log N) time and
-O(N) space. After finding the floating point CurrentCL solutions,
-each one is rounded to the nearest integer to use in dispatching.
+O(N) space.
+
+After finding the floating point CurrentCL solutions, each one is
+rounded to the nearest integer to use in subsequent dispatching.

### Fair Queuing for Server Requests

@@ -2031,13 +2047,13 @@ This KEP adds the following metrics.
- `apiserver_flowcontrol_request_min_concurrency_limit` (gauge of MinCL, broken down by priority)
- `apiserver_flowcontrol_request_current_concurrency_limit` (gauge of CurrentCL, broken down by priority)
- `apiserver_flowcontrol_demand_seats` (timing ratio histogram of seat demand / NominalCL, broken down by priority)
-- `apiserver_flowcontrol_demand_seats_high_water_mark` (gauge of HighSD, broken down by priority)
-- `apiserver_flowcontrol_demand_seats_average` (gauge of AvgSD, broken down by priority)
-- `apiserver_flowcontrol_demand_seats_stdev` (gauge of StDevSD, broken down by priority)
-- `apiserver_flowcontrol_envelope_seats` (gauge of EnvelopeSD, broken down by priority)
-- `apiserver_flowcontrol_smoothed_demand_seats` (gauge of SmoothSD, broken down by priority)
+- `apiserver_flowcontrol_demand_seats_high_water_mark` (gauge of HighSeatDemand, broken down by priority)
+- `apiserver_flowcontrol_demand_seats_average` (gauge of AvgSeatDemand, broken down by priority)
+- `apiserver_flowcontrol_demand_seats_stdev` (gauge of StDevSeatDemand, broken down by priority)
+- `apiserver_flowcontrol_envelope_seats` (gauge of EnvelopeSeatDemand, broken down by priority)
+- `apiserver_flowcontrol_smoothed_demand_seats` (gauge of SmoothSeatDemand, broken down by priority)
- `apiserver_flowcontrol_target_seats` (gauge of Target, broken down by priority)
-- `apiserver_flowcontrol_seat_fair_frac` (gauge of FairFrac)
+- `apiserver_flowcontrol_seat_fair_frac` (gauge of FairProp)

### Testing

From 2998b407a071c4c6e1549b87a2519de7fb98d3a3 Mon Sep 17 00:00:00 2001
From: Mike Spreitzer
Date: Sun, 19 Jun 2022 22:58:38 -0400
Subject: [PATCH 5/5] Removed unused Priority field

Also updated outdated text about directed borrowing.
---
 .../1040-priority-and-fairness/README.md | 66 ++++++-------
 1 file changed, 19 insertions(+), 47 deletions(-)

diff --git a/keps/sig-api-machinery/1040-priority-and-fairness/README.md b/keps/sig-api-machinery/1040-priority-and-fairness/README.md
index 9afb3c4898b..31c98c14069 100644
--- a/keps/sig-api-machinery/1040-priority-and-fairness/README.md
+++ b/keps/sig-api-machinery/1040-priority-and-fairness/README.md
@@ -643,16 +643,12 @@ The concurrency limit of an apiserver is divided among the non-exempt
priority levels, and they can do a limited amount of borrowing from
each other.

-Two fields of `LimitedPriorityLevelConfiguration`, introduced in the
-midst of the `v1beta2` lifetime, configure the borrowing. The fields
-are added in all the versions (`v1alpha1`, `v1beta1`, and `v1beta2`).
-The following display shows the two new fields along with the updated
+One field of `LimitedPriorityLevelConfiguration`, introduced in the
+midst of the `v1beta2` lifetime, limits the borrowing. The field is
+added in all the versions (`v1alpha1`, `v1beta1`, and `v1beta2`). The
+following display shows the new field along with the updated
description for the `AssuredConcurrencyShares` field, in `v1beta2`.

-**Note**: currently this design does not use the `Priority` field for
-anything. We should either use it for something or take it out of the
-design.
-
```go
type LimitedPriorityLevelConfiguration struct {
	...
@@ -660,9 +656,9 @@ type LimitedPriorityLevelConfiguration struct {
	// NominalConcurrencyLimit (NominalCL) of this level.
	// This is the number of execution seats available at this priority level.
	// This is used both for requests dispatched from
-	// this priority level as well as requests dispatched from higher priority
+	// this priority level as well as requests dispatched from other priority
	// levels borrowing seats from this level.
This does not limit dispatching from
-	// this priority level that borrows seats from lower priority levels (those lower
+	// this priority level that borrows seats from other priority levels (those other
	// levels do that). The server's concurrency limit (ServerCL) is divided among the
	// Limited priority levels in proportion to their ACS values:
	//
@@ -676,28 +672,15 @@ type LimitedPriorityLevelConfiguration struct {
	AssuredConcurrencyShares int32

	// `borrowablePercent` prescribes the fraction of the level's NominalCL that
-	// can be borrowed by higher priority levels. This value of this
+	// can be borrowed by other priority levels. The value of this
	// field must be between 0 and 100, inclusive, and it defaults to 0.
-	// The number of seats that higher levels can borrow from this level, known
+	// The number of seats that other levels can borrow from this level, known
	// as this level's BorrowableConcurrencyLimit (BorrowableCL), is defined as follows.
	//
	//     BorrowableCL(i) = round( NominalCL(i) * borrowablePercent(i)/100.0 )
	//
	// +optional
	BorrowablePercent int32
-
-	// `priority` determines where this priority level appears in the total order
-	// of Limited priority levels used to configure borrowing between those levels.
-	// A numerically higher value means a logically lower priority.
-	// Do not create ties; they will be broken arbitrarily.
-	// `priority` MUST BE between 0 and 10000, inclusive, and
-	// SHOULD BE greater than zero.
-	// If it is zero then, for the sake of a smooth transition from the time
-	// before this field existed, this level will be treated as if its `priority`
-	// is the average of the `matchingPrecedence` of the FlowSchema objects
-	// that reference this level.
-	// +optional
-	Priority int32
}
```

@@ -710,29 +693,18 @@ existing systems will be more continuous if we keep the meaning of
"total shares" for the existing field. In the next version we should
rename `AssuredConcurrencyShares` to `NominalConcurrencyShares`.
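To make the NominalCL formula in the field documentation concrete, here is a sketch under stated assumptions: the `nominalCLs` helper is an invented name, and the ServerCL of 600 is purely an example; the shares match the default-configuration table.

```go
package main

import "fmt"

// nominalCLs divides a server concurrency limit among the Limited
// priority levels in proportion to their AssuredConcurrencyShares:
// NominalCL(i) = ceil( ServerCL * ACS(i) / sum_acs ).
func nominalCLs(serverCL int, acs map[string]int) map[string]int {
	sum := 0
	for _, shares := range acs {
		sum += shares
	}
	ncl := make(map[string]int, len(acs))
	for name, shares := range acs {
		// (x + sum - 1) / sum is the integer ceiling of x/sum.
		ncl[name] = (serverCL*shares + sum - 1) / sum
	}
	return ncl
}

func main() {
	// Shares from the default-configuration table; sum = 245.
	shares := map[string]int{
		"leader-election": 10, "node-high": 40, "system": 30,
		"workload-high": 40, "workload-low": 100,
		"global-default": 20, "catch-all": 5,
	}
	ncl := nominalCLs(600, shares)
	fmt.Println(ncl["workload-low"], ncl["catch-all"]) // 245 13
}
```

Because of the ceiling, the NominalCL values can sum to slightly more than ServerCL (602 versus 600 in this example).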
-Consider rolling the borrowing behavior into a pre-existing cluster -that did not have borrowing. In particular, suppose that the -administrators of the cluster have scripts/clients that maintain some -out-of-tree PriorityLevelConfiguration objects; these, naturally, do -not specify a value for the `priority` field. The default behavior -for `priority` is designed to do something more natural and convenient -than have them all collide at some fixed number. - The following table shows the current default non-exempt priority -levels and a proposal for their new configuration. For the sake of -continuity with out-of-tree configuration objects, the proposed -priority values follow the rule given above for the effective value -when the priority field holds zero. - -| Name | Assured Shares | Proposed Borrowable Percent | Proposed Priority | -| ---- | -------------: | --------------------------: | ----------------: | -| leader-election | 10 | 0 | 150 | -| node-high | 40 | 25 | 400 | -| system | 30 | 33 | 500 | -| workload-high | 40 | 50 | 833 | -| workload-low | 100 | 90 | 9000 | -| global-default | 20 | 50 | 9900 | -| catch-all | 5 | 0 | 10000 | +levels and a proposal for their new configuration. + +| Name | Assured Shares | Proposed Borrowable Percent | +| ---- | -------------: | --------------------------: | +| leader-election | 10 | 0 | +| node-high | 40 | 25 | +| system | 30 | 33 | +| workload-high | 40 | 50 | +| workload-low | 100 | 90 | +| global-default | 20 | 50 | +| catch-all | 5 | 0 | Each non-exempt priority level `i` has two concurrency limits: its NominalConcurrencyLimit (`NominalCL(i)`) as defined above by