Commit 1113369

Add active lending, secondary dispatching
1 parent 0d1246c commit 1113369

1 file changed

+93
-69
lines changed
  • keps/sig-api-machinery/1040-priority-and-fairness

keps/sig-api-machinery/1040-priority-and-fairness/README.md

@@ -714,75 +714,6 @@ not specify a value for the `priority` field. The default behavior
714714
for `priority` is designed to do something more natural and convenient
715715
than have them all collide at some fixed number.
716716

717-
Borrowing is done on the basis of the current situation, with no
718-
consideration of opportunity cost, no further rationing according to
719-
shares (just obeying the concurrency limits as outlined above), and no
720-
pre-emption when the situation changes.
721-
722-
Whenever a request is dispatched, it takes all its seats from one
723-
priority level --- either the one referenced by the request's
724-
FlowSchema or a lower priority level.
725-
726-
Borrowing is further limited by a practical consideration: we do not
727-
want mutual exclusion among all priority levels in the dispatching
728-
code. Aside from borrowing, dispatching from one priority level is
729-
done independently from dispatching at another --- using a distinct
730-
`sync.Mutex` for each priority level. There _is_ a global
731-
`sync.RWMutex`, but queuing and dispatching only requires locking that
732-
for reading (which is not mutually exclusive); write locking is done
733-
only while reconfiguring. Borrowing is allowed in just one direction:
734-
higher may borrow from lower. A higher priority level may actively
735-
borrow from a lower one but a lower level does not actively lend to a
736-
higher one (see below). In this design, working on dispatching one
737-
request requires holding at most two priority level mutexes at any
738-
given moment. To avoid deadlock, there is a strict ordering on
739-
acquisition of those mutexes (namely, decreasing logical priority).
740-
741-
We could take a complementary approach, in which lower actively lends
742-
to higher but higher does not actively borrow from lower. The chosen
743-
direction was chosen based on looking at metrics from scalabiility
744-
tests showing that the `workload-low` priority level has a
745-
significantly larger NominalConcurrencyLimit than the other levels and
746-
is usually very under-utilized (so it would consider lending less
747-
often than higher levels would consider borrowing).
748-
749-
A request can be dispatched exactly at a non-exempt priority level
750-
when either there are no requests executing at that priority level or
751-
the number of seats needed by that request is no greater than the
752-
number of unused seats at that priority level. The number of unused
753-
seats at a given priority level is that level's
754-
NominalConcurrencyLimit minus the number of seats used by requests
755-
executing at that priority level (both requests dispatched from that
756-
priority level and requests dispatched from higher priority levels
757-
that are borrowing these lower level seats).
758-
759-
There are two sorts of times when dispatching to a non-exempt priority
760-
level is considered: when a request arrives, and when a request
761-
releases the seats it was occupying (which is not the same as when the
762-
request finishes from the client's point of view, due to the special
763-
considerations for WATCH requests).
764-
765-
At each of these sorts of moments, as many requests are dispatched
766-
exactly at the same priority level as possible. The next request to
767-
consider in this process is chosen by using the Fair Queuing for
768-
Server Requests algorithm below to choose one of the non-empty queues
769-
at that priority level, and if indeed there is a non-empty queue then
770-
the request at the head of the chosen queue is considered for
771-
dispatching. If the level has non-empty queues but the chosen request
772-
can not be dispatched exactly at this level at the moment then the
773-
logically lower non-exempt priority levels are considered, one at a
774-
time, in decreasing logical priority order. As soon as one is found
775-
at which the request can be dispatched at the moment according to the
776-
rule above then the search stops and the request is dispatched to
777-
execute using some of the lower priority level's seats. If no
778-
suitable priority level is found then the request is not dispatched at
779-
the moment.
780-
781-
As can be seen from this logic, when seats are freed up at a given
782-
priority level they are _not_ actively lent to logically higher
783-
priority levels. We avoid that in order to have a total order in
784-
which priority level mutexes are acquired (thus avoiding deadlock).
785-
786717
The following table shows the current default non-exempt priority
787718
levels and a proposal for their new configuration. For the sake of
788719
continuity with out-of-tree configuration objects, the proposed
@@ -800,6 +731,99 @@ when the priority field holds zero.
800731
| catch-all | 5 | 0 | 10000 |
801732

802733

734+
Borrowing is done on the basis of the current situation, with no
735+
consideration of opportunity cost, no further rationing according to
736+
shares (just obeying the concurrency limits as outlined above), and no
737+
pre-emption when the situation changes.
738+
739+
Whenever a request is dispatched, it takes all its seats from one
740+
priority level --- either the one referenced by the request's
741+
FlowSchema or a logically lower priority level.
742+
743+
Note: We *could* take a complementary approach, in which a request can
744+
borrow from a logically higher priority level. That would go together
745+
with a bigger change in the default configuration and greater
746+
challenges in incremental rollout. That is not the approach described
747+
here.
748+
749+
In the implementation, there is an important consideration regarding
750+
locking. We do not want one global mutex to be held for any work on
751+
dispatching; we want to allow concurrent work on dispatching, where
752+
possible. Before the introduction of borrowing, the priority levels
753+
operated completely independently and the only global thing was a
754+
`sync.RWMutex` that is locked for reading to do dispatching work and
755+
locked for writing only when digesting changes to the configuration
756+
API objects. Each priority level has its own private mutex.
757+
Borrowing introduces an interaction between priority levels, requiring
758+
multiple of those private locks to be held at once. We must avoid
759+
deadlock. This is done by insisting that whenever two locks are to be
760+
held at once, they are acquired in some total order. In particular,
761+
the lock of a logically higher priority level is acquired before the
762+
lock of a logically lower priority level. The locking order does not
763+
have to be the same as the priority order, but we make it the same for
764+
the sake of simplicity.
765+
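The lock-ordering rule above can be illustrated with a minimal Go sketch. Everything named here (`level`, `rank`, `configLock`, `withBoth`) is hypothetical and not taken from the actual implementation; the sketch assumes a smaller `rank` means a logically higher priority and that the two levels passed in are distinct.

```go
package main

import "sync"

// level is a hypothetical stand-in for one priority level.
type level struct {
	rank       int        // smaller rank = logically higher priority (assumed encoding)
	mu         sync.Mutex // the level's private lock
	seatsInUse int
}

// configLock stands in for the global sync.RWMutex: read-locked for
// dispatching work, write-locked only while digesting configuration changes.
var configLock sync.RWMutex

// withBoth runs fn while holding the private locks of two distinct levels.
// The lock of the logically higher level is always acquired first, so two
// concurrent callers can never wait on each other in a cycle.
func withBoth(a, b *level, fn func(hi, lo *level)) {
	hi, lo := a, b
	if lo.rank < hi.rank {
		hi, lo = lo, hi
	}
	configLock.RLock()
	defer configLock.RUnlock()
	hi.mu.Lock()
	defer hi.mu.Unlock()
	lo.mu.Lock()
	defer lo.mu.Unlock()
	fn(hi, lo)
}
```

Because the order depends only on `rank`, callers may pass the two levels in either order and still acquire the locks consistently.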
766+
A request can be dispatched from a queue of priority level X to seats
767+
at non-exempt priority level Y when either there are no requests
768+
executing at level Y or the number of seats needed by that request is
769+
no greater than the number of unused seats at priority level Y. The
770+
number of unused seats at a given priority level is that level's
771+
NominalConcurrencyLimit minus the number of seats used by requests
772+
executing at that priority level (both requests dispatched from that
773+
priority level and requests dispatched from higher priority levels
774+
that are borrowing these lower level seats).
775+
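As a rough illustration of the dispatch rule just stated, here is a small Go sketch. The type and field names (`priorityLevel`, `nominalCL`, `seatsInUse`, `executingReqs`) are assumptions made for this example, not names from the real code.

```go
package main

// priorityLevel is a hypothetical, simplified model of a non-exempt
// priority level.
type priorityLevel struct {
	nominalCL     int // the level's NominalConcurrencyLimit, in seats
	seatsInUse    int // seats held by requests executing here, borrowers included
	executingReqs int // number of requests currently executing at this level
}

// unusedSeats is the limit minus the seats used by executing requests.
func (pl *priorityLevel) unusedSeats() int {
	return pl.nominalCL - pl.seatsInUse
}

// canDispatchTo reports whether a request needing width seats may be
// dispatched to seats at level y: either nothing is executing at y, or the
// request fits within y's unused seats.
func canDispatchTo(y *priorityLevel, width int) bool {
	return y.executingReqs == 0 || width <= y.unusedSeats()
}
```

Note that the first clause lets a request wider than the level's whole limit run when the level is idle, so a wide request cannot be starved forever.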
776+
There are two sorts of times when dispatching from/to a non-exempt
777+
priority level is considered: when a request arrives, and when a
778+
request releases the seats it was occupying (which is not the same as
779+
when the request finishes from the client's point of view, due to the
780+
special considerations for WATCH requests).
781+
782+
At each of these sorts of moments, as many requests as possible are
783+
dispatched from the priority level involved to the same priority
784+
level. The next request to consider in this process is chosen by
785+
using the Fair Queuing for Server Requests algorithm below to choose
786+
one of the non-empty queues (if indeed there are any) at that priority
787+
level. This was the entirety of the reaction before the introduction
788+
of borrowing.
789+
790+
Borrowing extends the reaction as follows. There are two cases. The
791+
simpler case is when reacting to a request arrival, let us say at
792+
priority level X.  In this case, if the baseline reaction ---
793+
dispatching as many requests as possible from X to X --- ends with a
794+
request to dispatch but not enough seats available, then the reaction
795+
continues with trying to dispatch that request to the logically lower
796+
priority levels. With the lock of X still held, the logically lower
797+
levels Y are enumerated in logically decreasing order and dispatch to
798+
each one of them is considered. This consideration starts by
799+
acquiring Y's lock, then dispatches as many requests as are allowed by
800+
the above rule, and finally releases Y's lock. Naturally, if/when X
801+
runs out of requests to dispatch, this iteration stops.
802+
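The arrival case might be sketched in Go as follows. This is an illustration under simplifying assumptions, not the actual implementation: each level has a single FIFO of request widths instead of the fair-queuing queues, the global `sync.RWMutex` is omitted, `levels` is assumed to be sorted in decreasing logical priority, and all identifiers are hypothetical.

```go
package main

import "sync"

// level is a hypothetical, simplified priority level.
type level struct {
	mu        sync.Mutex
	limit     int   // NominalConcurrencyLimit
	inUse     int   // seats in use at this level
	executing int   // requests executing on this level's seats
	queue     []int // widths of waiting requests (single FIFO: a simplification)
}

// fits applies the dispatch rule to target level y. The caller holds y.mu.
func fits(y *level, width int) bool {
	return y.executing == 0 || width <= y.limit-y.inUse
}

// drain dispatches requests from the head of x's queue onto y's seats for as
// long as the rule allows. The caller holds the locks of both x and y.
func drain(x, y *level) {
	for len(x.queue) > 0 && fits(y, x.queue[0]) {
		y.inUse += x.queue[0]
		y.executing++
		x.queue = x.queue[1:]
	}
}

// onArrival reacts to a request of the given width arriving at levels[xi].
func onArrival(levels []*level, xi, width int) {
	x := levels[xi]
	x.mu.Lock()
	defer x.mu.Unlock()
	x.queue = append(x.queue, width)
	drain(x, x) // baseline: dispatch from X to X
	// Borrowing: with X's lock still held, visit the logically lower levels
	// in decreasing priority order, stopping once X has nothing left queued.
	for yi := xi + 1; yi < len(levels) && len(x.queue) > 0; yi++ {
		y := levels[yi]
		y.mu.Lock()
		drain(x, y)
		y.mu.Unlock()
	}
}
```

Each lower level's lock is taken only after X's lock, which matches the higher-before-lower acquisition order.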
803+
The more complicated case is reacting to seats being freed up. Let us
804+
say this is at priority level X. As in the other case, the reaction
805+
starts with the baseline, dispatching as much as possible from X to X.
806+
If this leaves some seats at X still available, then the reaction
807+
continues by trying to dispatch from higher priority levels to X. The
808+
lock of X is not held over this loop, because that would violate locking
809+
order. This iteration starts by releasing the lock of X and then
810+
iterating over the higher priority levels Y (in logically decreasing
811+
priority order) and considering each one in turn. That consideration
812+
starts by acquiring the locks of Y and X (in that order), then does as
813+
much dispatching from Y to X as is allowed, and finally releases the
814+
two locks. Note that dispatching one wide request from Y to X may
815+
unblock immediate dispatching of additional requests from Y (whether
816+
to X, Y, or another priority level).  So dispatching all that is
816+
possible from Y to X does not cover all of that.  The reaction to
818+
seats being freed up has to include keeping a list of priority levels
819+
Y to consider general dispatching from. An element is added to that
820+
list whenever the primary considerations here cause a dispatch of a
821+
request with width greater than 1 and leaving Y with some queued
822+
requests. After the primary iteration above is done, the reaction to
823+
seats being freed continues with iterating over this list of Y to
824+
reconsider and reacting as for an arrival at Y: dispatch as much as
825+
possible from Y to Y and logically lower priority levels.
826+
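The reaction to freed seats could look roughly like the Go sketch below. It rests on the same simplifying assumptions as before and is not the actual implementation: one FIFO of widths per level instead of fair queuing, no global `sync.RWMutex`, `levels` sorted in decreasing logical priority, and hypothetical identifiers throughout.

```go
package main

import "sync"

type level struct {
	mu        sync.Mutex
	limit     int   // NominalConcurrencyLimit
	inUse     int   // seats in use at this level
	executing int   // requests executing on this level's seats
	queue     []int // widths of waiting requests (single FIFO: a simplification)
}

// fits applies the dispatch rule to target level y. The caller holds y.mu.
func fits(y *level, width int) bool {
	return y.executing == 0 || width <= y.limit-y.inUse
}

// drain dispatches from x's queue onto y's seats while the rule allows and
// reports whether any dispatched request was wider than one seat.
func drain(x, y *level) (wide bool) {
	for len(x.queue) > 0 && fits(y, x.queue[0]) {
		w := x.queue[0]
		x.queue = x.queue[1:]
		y.inUse += w
		y.executing++
		wide = wide || w > 1
	}
	return wide
}

// dispatchDownward reacts as for an arrival: from X to X, then to lower levels.
func dispatchDownward(levels []*level, xi int) {
	x := levels[xi]
	x.mu.Lock()
	defer x.mu.Unlock()
	drain(x, x)
	for yi := xi + 1; yi < len(levels) && len(x.queue) > 0; yi++ {
		y := levels[yi]
		y.mu.Lock()
		drain(x, y)
		y.mu.Unlock()
	}
}

// onSeatsFreed reacts to a request releasing width seats at levels[xi].
func onSeatsFreed(levels []*level, xi, width int) {
	x := levels[xi]
	x.mu.Lock()
	x.inUse -= width
	x.executing--
	drain(x, x) // baseline: dispatch from X to X
	spare := x.limit-x.inUse > 0
	x.mu.Unlock() // X must not be held while a higher level's lock is acquired
	if !spare {
		return
	}
	var revisit []int // higher levels that need general dispatching afterwards
	for yi := 0; yi < xi; yi++ {
		y := levels[yi]
		y.mu.Lock() // higher level first, then X
		x.mu.Lock()
		if drain(y, x) && len(y.queue) > 0 {
			revisit = append(revisit, yi)
		}
		x.mu.Unlock()
		y.mu.Unlock()
	}
	for _, yi := range revisit {
		dispatchDownward(levels, yi)
	}
}
```

The `revisit` list plays the role of the secondary dispatching pass: a wide request leaving Y's queue may have been blocking narrower requests that can now run at Y or at a level below it.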
803827
### Fair Queuing for Server Requests
804828

805829
The following subsections cover the problem statements and the current
