@@ -714,75 +714,6 @@ not specify a value for the `priority` field. The default behavior
714714for `priority` is designed to do something more natural and convenient
715715than have them all collide at some fixed number.
716716
717- Borrowing is done on the basis of the current situation, with no
718- consideration of opportunity cost, no further rationing according to
719- shares (just obeying the concurrency limits as outlined above), and no
720- pre-emption when the situation changes.
721-
722- Whenever a request is dispatched, it takes all its seats from one
723- priority level --- either the one referenced by the request's
724- FlowSchema or a lower priority level.
725-
726- Borrowing is further limited by a practical consideration: we do not
727- want mutual exclusion among all priority levels in the dispatching
728- code. Aside from borrowing, dispatching from one priority level is
729- done independently from dispatching at another --- using a distinct
730- ` sync.Mutex ` for each priority level. There _ is_ a global
731- ` sync.RWMutex ` , but queuing and dispatching only requires locking that
732- for reading (which is not mutually exclusive); write locking is done
733- only while reconfiguring. Borrowing is allowed in just one direction:
734- higher may borrow from lower. A higher priority level may actively
735- borrow from a lower one but a lower level does not actively lend to a
736- higher one (see below). In this design, working on dispatching one
737- request requires holding at most two priority level mutexes at any
738- given moment. To avoid deadlock, there is a strict ordering on
739- acquisition of those mutexes (namely, decreasing logical priority).
740-
741- We could take a complementary approach, in which lower actively lends
742- to higher but higher does not actively borrow from lower. The
743- direction was chosen based on metrics from scalability tests
744- showing that the `workload-low` priority level has a
745- significantly larger NominalConcurrencyLimit than the other levels and
746- is usually very under-utilized (so it would consider lending less
747- often than higher levels would consider borrowing).
748-
749- A request can be dispatched exactly at a non-exempt priority level
750- when either there are no requests executing at that priority level or
751- the number of seats needed by that request is no greater than the
752- number of unused seats at that priority level. The number of unused
753- seats at a given priority level is that level's
754- NominalConcurrencyLimit minus the number of seats used by requests
755- executing at that priority level (both requests dispatched from that
756- priority level and requests dispatched from higher priority levels
757- that are borrowing these lower level seats).
758-
759- There are two sorts of times when dispatching to a non-exempt priority
760- level is considered: when a request arrives, and when a request
761- releases the seats it was occupying (which is not the same as when the
762- request finishes from the client's point of view, due to the special
763- considerations for WATCH requests).
764-
765- At each of these sorts of moments, as many requests are dispatched
766- exactly at the same priority level as possible. The next request to
767- consider in this process is chosen by using the Fair Queuing for
768- Server Requests algorithm below to choose one of the non-empty queues
769- at that priority level, and if indeed there is a non-empty queue then
770- the request at the head of the chosen queue is considered for
771- dispatching. If the level has non-empty queues but the chosen request
772- can not be dispatched exactly at this level at the moment then the
773- logically lower non-exempt priority levels are considered, one at a
774- time, in decreasing logical priority order. As soon as one is found
775- at which the request can be dispatched at the moment according to the
776- rule above then the search stops and the request is dispatched to
777- execute using some of the lower priority level's seats. If no
778- suitable priority level is found then the request is not dispatched at
779- the moment.
780-
781- As can be seen from this logic, when seats are freed up at a given
782- priority level they are _not_ actively lent to logically higher
783- priority levels. We avoid that in order to have a total order in
784- which priority level mutexes are acquired (thus avoiding deadlock).
785-
786717The following table shows the current default non-exempt priority
787718levels and a proposal for their new configuration. For the sake of
788719continuity with out-of-tree configuration objects, the proposed
@@ -800,6 +731,99 @@ when the priority field holds zero.
800731| catch-all | 5 | 0 | 10000 |
801732
802733
734+ Borrowing is done on the basis of the current situation, with no
735+ consideration of opportunity cost, no further rationing according to
736+ shares (just obeying the concurrency limits as outlined above), and no
737+ pre-emption when the situation changes.
738+
739+ Whenever a request is dispatched, it takes all its seats from one
740+ priority level --- either the one referenced by the request's
741+ FlowSchema or a logically lower priority level.
742+
743+ Note: We *could* take a complementary approach, in which a request can
744+ borrow from a logically higher priority level. That would go together
745+ with a bigger change in the default configuration and greater
746+ challenges in incremental rollout. That is not the approach described
747+ here.
748+
749+ In the implementation, there is an important consideration regarding
750+ locking. We do not want one global mutex to be held for any work on
751+ dispatching; we want to allow concurrent work on dispatching, where
752+ possible. Before the introduction of borrowing, the priority levels
753+ operated completely independently and the only global thing was a
754+ `sync.RWMutex` that is locked for reading to do dispatching work and
755+ locked for writing only when digesting changes to the configuration
756+ API objects. Each priority level has its own private mutex.
757+ Borrowing introduces an interaction between priority levels, requiring
758+ multiple of those private locks to be held at once. We must avoid
759+ deadlock. This is done by insisting that whenever two locks are to be
760+ held at once, they are acquired in some total order. In particular,
761+ the lock of a logically higher priority level is acquired before the
762+ lock of a logically lower priority level. The locking order does not
763+ have to be the same as the priority order, but we make it the same for
764+ the sake of simplicity.
765+
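To make that locking discipline concrete, here is a minimal Go sketch (not the actual apiserver code; the `priorityLevel` type, its fields, and the assumption that a smaller `priority` number means logically higher priority are purely illustrative) of acquiring two priority-level locks in the required total order:

```go
package main

import (
	"fmt"
	"sync"
)

// priorityLevel is a hypothetical stand-in for a non-exempt priority
// level's dispatching state; the real types in k8s.io/apiserver differ.
type priorityLevel struct {
	name     string
	priority int // illustrative assumption: smaller number = logically higher
	lock     sync.Mutex
}

// lockPair acquires the locks of two priority levels in a fixed total
// order (the logically higher level's lock first) and returns a
// function that releases both.  Always acquiring in this order is what
// prevents deadlock when borrowing requires holding two level locks.
func lockPair(a, b *priorityLevel) func() {
	first, second := a, b
	if b.priority < a.priority { // b is logically higher, so lock it first
		first, second = b, a
	}
	first.lock.Lock()
	second.lock.Lock()
	return func() {
		second.lock.Unlock()
		first.lock.Unlock()
	}
}

func main() {
	high := &priorityLevel{name: "workload-high", priority: 3}
	low := &priorityLevel{name: "workload-low", priority: 4}
	unlock := lockPair(low, high) // argument order does not matter
	fmt.Println("holding locks of", high.name, "and", low.name)
	unlock()
}
```
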
766+ A request can be dispatched from a queue of priority level X to seats
767+ at non-exempt priority level Y when either there are no requests
768+ executing at level Y or the number of seats needed by that request is
769+ no greater than the number of unused seats at priority level Y. The
770+ number of unused seats at a given priority level is that level's
771+ NominalConcurrencyLimit minus the number of seats used by requests
772+ executing at that priority level (both requests dispatched from that
773+ priority level and requests dispatched from higher priority levels
774+ that are borrowing these lower level seats).
775+
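A minimal sketch of that admission rule follows, under the simplifying assumption that a level's relevant state is just two integers (the real bookkeeping is more involved, and these function names are invented):

```go
package main

import "fmt"

// unusedSeats returns how many of a level's nominal seats are not
// occupied by executing requests (including seats currently borrowed
// by logically higher levels).
func unusedSeats(nominalCL, seatsInUse int) int {
	return nominalCL - seatsInUse
}

// canDispatchTo reports whether a request needing seatsNeeded seats may
// take seats of a non-exempt level right now: either nothing is
// executing there, or the request fits within the unused seats.
func canDispatchTo(nominalCL, seatsInUse, seatsNeeded int) bool {
	return seatsInUse == 0 || seatsNeeded <= unusedSeats(nominalCL, seatsInUse)
}

func main() {
	fmt.Println(canDispatchTo(10, 8, 3)) // false: only 2 seats unused
	fmt.Println(canDispatchTo(10, 8, 2)) // true: fits in the 2 unused seats
	fmt.Println(canDispatchTo(1, 0, 5))  // true: nothing is executing there
}
```
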
776+ There are two sorts of times when dispatching from/to a non-exempt
777+ priority level is considered: when a request arrives, and when a
778+ request releases the seats it was occupying (which is not the same as
779+ when the request finishes from the client's point of view, due to the
780+ special considerations for WATCH requests).
781+
782+ At each of these sorts of moments, as many requests as possible are
783+ dispatched from the priority level involved to the same priority
784+ level. The next request to consider in this process is chosen by
785+ using the Fair Queuing for Server Requests algorithm below to choose
786+ one of the non-empty queues (if indeed there are any) at that priority
787+ level. This was the entirety of the reaction before the introduction
788+ of borrowing.
789+
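As a rough illustration of that baseline reaction (the queue-selection algorithm itself is the subject of the next section, so `chooseQueue` below is only a hypothetical stand-in for it, and the other names are invented too):

```go
package main

// fqRequest and fqQueue are illustrative stand-ins for a queued request
// and for one of a priority level's queues.
type fqRequest struct{ seats int }

type fqQueue struct{ requests []fqRequest }

// dispatchAsManyAsPossible repeatedly lets chooseQueue (standing in for
// the Fair Queuing for Server Requests algorithm) pick a non-empty
// queue and dispatches the request at its head, for as long as that
// request fits within the level's unused seats.
func dispatchAsManyAsPossible(nominalCL int, seatsInUse *int,
	queues []*fqQueue, chooseQueue func([]*fqQueue) *fqQueue) {
	for {
		q := chooseQueue(queues) // returns nil when every queue is empty
		if q == nil {
			return
		}
		r := q.requests[0]
		if *seatsInUse != 0 && r.seats > nominalCL-*seatsInUse {
			return // the chosen head does not fit right now
		}
		q.requests = q.requests[1:]
		*seatsInUse += r.seats
	}
}

func main() {
	q := &fqQueue{requests: []fqRequest{{seats: 1}, {seats: 1}}}
	used := 0
	// A trivial chooser that returns the first non-empty queue.
	first := func(qs []*fqQueue) *fqQueue {
		for _, c := range qs {
			if len(c.requests) > 0 {
				return c
			}
		}
		return nil
	}
	dispatchAsManyAsPossible(2, &used, []*fqQueue{q}, first)
	// used is now 2 and the queue is empty.
}
```
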
790+ Borrowing extends the reaction as follows. There are two cases. The
791+ simpler case is when reacting to a request arrival, let us say at
792+ priority level X. In this case, if the baseline reaction ---
793+ dispatching as many requests as possible from X to X --- ends with a
794+ request to dispatch but not enough seats available, then the reaction
795+ continues with trying to dispatch that request to the logically lower
796+ priority levels. With the lock of X still held, the logically lower
797+ levels Y are enumerated in logically decreasing order and dispatch to
798+ each one of them is considered. This consideration starts by
799+ acquiring Y's lock, then dispatches as many requests as are allowed by
800+ the above rule, and finally releases Y's lock. Naturally, if/when X
801+ runs out of requests to dispatch, this iteration stops.
802+
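The following is a rough, self-contained sketch of that arrival-case reaction, using drastically simplified types (a single FIFO per level stands in for the per-level fair queues, and names like `dispatchLocally` and `onArrival` are invented); it illustrates only the control flow and lock ordering, not the real implementation:

```go
package main

import "sync"

// request and level are hypothetical, simplified stand-ins for the
// real per-request and per-priority-level state.
type request struct{ seats int }

type level struct {
	lock       sync.Mutex
	nominalCL  int
	seatsInUse int
	queued     []request // a single FIFO standing in for the fair queues
}

// fits applies the dispatch rule above: the level is idle, or the
// request fits within its unused seats.
func (l *level) fits(r request) bool {
	return l.seatsInUse == 0 || r.seats <= l.nominalCL-l.seatsInUse
}

// dispatchLocally dispatches as many queued requests of l to l's own
// seats as possible; it reports whether a queued request remains that
// does not fit right now.
func (l *level) dispatchLocally() bool {
	for len(l.queued) > 0 {
		r := l.queued[0]
		if !l.fits(r) {
			return true
		}
		l.queued = l.queued[1:]
		l.seatsInUse += r.seats
	}
	return false
}

// onArrival sketches the reaction to a request arriving at x: first the
// baseline (x to x), then, still holding x's lock, borrowing from the
// logically lower levels, given in decreasing logical priority order.
func onArrival(x *level, lower []*level) {
	x.lock.Lock()
	defer x.lock.Unlock()
	if !x.dispatchLocally() {
		return // everything queued at x was dispatched at x itself
	}
	for _, y := range lower {
		y.lock.Lock() // y is logically lower, so its lock comes second
		for len(x.queued) > 0 && y.fits(x.queued[0]) {
			y.seatsInUse += x.queued[0].seats // x's request runs on y's seats
			x.queued = x.queued[1:]
		}
		y.lock.Unlock()
		if len(x.queued) == 0 {
			return // x ran out of requests to dispatch
		}
	}
}

func main() {
	low := &level{nominalCL: 10}
	high := &level{nominalCL: 2, queued: []request{{seats: 1}, {seats: 3}}}
	onArrival(high, []*level{low})
	// high dispatched the 1-seat request on its own seats and borrowed
	// low's seats for the 3-seat request.
}
```
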
803+ The more complicated case is reacting to seats being freed up. Let us
804+ say this is at priority level X. As in the other case, the reaction
805+ starts with the baseline, dispatching as much as possible from X to X.
806+ If this leaves some seats at X still available, then the reaction
807+ continues by trying to dispatch from higher priority levels to X. The
808+ lock of X is not held over this loop, because that would violate the
809+ locking order. This iteration starts by releasing the lock of X and
810+ then iterating over the higher priority levels Y (in logically decreasing
811+ priority order) and considering each one in turn. That consideration
812+ starts by acquiring the locks of Y and X (in that order), then does as
813+ much dispatching from Y to X as is allowed, and finally releases the
814+ two locks. Note that dispatching one wide request from Y to X may
815+ unblock immediate dispatching of additional requests from Y (whether
816+ to X, Y, or another priority level). So dispatching as much as
817+ possible from Y to X does not cover all of the dispatching that may
818+ now be possible. The reaction to seats being freed up therefore also
819+ keeps a list of priority levels Y to reconsider for general
820+ dispatching. An element is added to that list whenever the primary
821+ iteration here dispatches a request with width greater than 1 while
822+ leaving Y with some queued requests. After the primary iteration is
823+ done, the reaction to seats being freed continues by iterating over
824+ this list and reacting as for an arrival at each such Y: dispatching
825+ as much as possible from Y to Y and to logically lower priority levels.
826+
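Continuing with the same illustrative types and helpers as the previous sketch, the freed-seats reaction might look roughly like this (again only a sketch of the control flow; `onSeatsFreed`, `higher`, and `lowerOf` are invented names):

```go
// onSeatsFreed sketches the reaction to seats being released at x.
// higher lists the logically higher levels in decreasing logical
// priority order; lowerOf yields, for any level, the levels logically
// below it (also in decreasing order), for the follow-up reaction.
func onSeatsFreed(x *level, higher []*level, lowerOf func(*level) []*level) {
	// Baseline: dispatch as much as possible from x to x.
	x.lock.Lock()
	x.dispatchLocally()
	spare := x.nominalCL - x.seatsInUse
	x.lock.Unlock() // not held over the next loop, to respect the lock order
	if spare <= 0 {
		return
	}
	// Primary iteration: let logically higher levels borrow x's seats.
	var revisit []*level
	for _, y := range higher {
		y.lock.Lock() // y is logically higher, so its lock comes first
		x.lock.Lock()
		dispatchedWide := false
		for len(y.queued) > 0 && x.fits(y.queued[0]) {
			r := y.queued[0]
			y.queued = y.queued[1:]
			x.seatsInUse += r.seats // y's request runs on x's seats
			if r.seats > 1 {
				dispatchedWide = true
			}
		}
		if dispatchedWide && len(y.queued) > 0 {
			// A wide dispatch may have unblocked more dispatching from y
			// (to y itself or elsewhere); remember y for reconsideration.
			revisit = append(revisit, y)
		}
		x.lock.Unlock()
		y.lock.Unlock()
	}
	// Follow-up: treat each remembered level as if a request had just
	// arrived there, so it dispatches to itself and to lower levels.
	for _, y := range revisit {
		onArrival(y, lowerOf(y))
	}
}
```
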
803827### Fair Queuing for Server Requests
804828
805829The following subsections cover the problem statements and the current