Commit b7443a8

Revised according to review
1 parent 33183fc commit b7443a8

File tree

1 file changed

+115
-69
lines changed
  • keps/sig-api-machinery/1040-priority-and-fairness

keps/sig-api-machinery/1040-priority-and-fairness/README.md

Lines changed: 115 additions & 69 deletions
@@ -643,59 +643,97 @@ The concurrency limit of an apiserver is divided among the non-exempt
643643
priority levels, and higher ones can do a limited amount of borrowing
644644
from lower ones.
645645

646-
Non-exempt priority levels are ordered in a total order for the
647-
purpose of borrowing currently-unused concurrency. The ordering is
648-
based (first) on increasing value of a spec field whose name is
649-
`priority` and whose value is constrained to lie in the range 0
650-
through 10000 inclusive and (second) on increasing name of the
651-
PriorityLevelConfiguration object. Priority levels that appear later
652-
in the order are considered to have lower priority.
646+
Two fields of `LimitedPriorityLevelConfiguration`, introduced in the
647+
midst of the `v1beta2` lifetime, configure the borrowing. The following
648+
display shows the two new fields along with the updated description for
649+
the `AssuredConcurrencyShares` field.
650+
651+
```go
652+
type LimitedPriorityLevelConfiguration struct {
653+
...
654+
// `assuredConcurrencyShares` (ACS) contributes to the computation of the
655+
// NominalConcurrencyLimit (NCL) of this level.
656+
// This is the number of execution seats available at this priority level.
657+
// This is used both for requests dispatched from
658+
// this priority level as well as requests dispatched from higher priority
659+
// levels borrowing seats from this level. This does not limit dispatching from
660+
// this priority level that borrows seats from lower priority levels (those lower
661+
// levels do that). The server's concurrency limit (SCL) is divided among the
662+
// Limited priority levels in proportion to their ACS values:
663+
//
664+
// NCL(i) = ceil( SCL * ACS(i) / sum_acs )
665+
// sum_acs = sum[limited priority level k] ACS(k)
666+
//
667+
// Bigger numbers mean a larger nominal concurrency limit, at the expense
668+
// of every other Limited priority level.
669+
// This field has a default value of 30.
670+
// +optional
671+
AssuredConcurrencyShares int32
672+
673+
// `lendableConcurrencyShares` (LCS) contributes to the computation of the
674+
// LendableConcurrencyLimit (LCL) for this level. This is the number of
675+
// execution seats of this level that can be borrowed by higher priority
676+
// Limited levels.
677+
// This may not be negative, and may not be greater than
678+
// `assuredConcurrencyShares`.
679+
//
680+
// LCL(i) = ceil( SCL * LCS(i) / sum_acs )
681+
//
682+
// This field has a default value of zero.
683+
// +optional
684+
LendableConcurrencyShares int32
685+
686+
// `priority` determines where this priority level appears in the total order
687+
// of Limited priority levels used to configure borrowing between those levels.
688+
// A numerically higher value means a logically lower priority.
689+
// Do not create ties; they will be broken arbitrarily.
690+
// `priority` SHOULD be a positive number no greater than 10000.
691+
// If it is zero then, for the sake of a smooth transition from the time
692+
// before this field existed, this level will be treated as if its `priority`
693+
// is the average of the `matchingPrecedence` of the FlowSchema objects
694+
// that reference this level.
695+
// +optional
696+
Priority int32
697+
}
698+
```
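
As a rough illustration of the two formulas in the field comments above, the following sketch computes NCL and LCL for a set of Limited priority levels. It is not the apiserver's implementation. The `limitedLevel` type, the function names, and the SCL value of 600 are assumptions made for this example; the share values are the ones listed in this KEP's table of default non-exempt priority levels.

```go
package main

import "fmt"

// limitedLevel is a hypothetical stand-in for the relevant fields of
// LimitedPriorityLevelConfiguration; it is not the real API type.
type limitedLevel struct {
	Name string
	ACS  int64 // assuredConcurrencyShares
	LCS  int64 // lendableConcurrencyShares
}

// exampleLevels uses the assured and proposed lendable shares from the
// KEP's table of default non-exempt priority levels.
var exampleLevels = []limitedLevel{
	{"leader-election", 10, 0},
	{"node-high", 40, 10},
	{"system", 30, 10},
	{"workload-high", 40, 20},
	{"workload-low", 100, 90},
	{"global-default", 20, 10},
	{"catch-all", 5, 0},
}

// ceilDiv returns ceil(a/b) for a >= 0 and b > 0 using integer arithmetic.
func ceilDiv(a, b int64) int64 {
	return (a + b - 1) / b
}

// concurrencyLimits applies
//
//	NCL(i) = ceil( SCL * ACS(i) / sum_acs )
//	LCL(i) = ceil( SCL * LCS(i) / sum_acs )
//
// to every level. It returns empty maps if the shares sum to zero.
func concurrencyLimits(scl int64, levels []limitedLevel) (ncl, lcl map[string]int64) {
	ncl = make(map[string]int64, len(levels))
	lcl = make(map[string]int64, len(levels))
	var sumACS int64
	for _, l := range levels {
		sumACS += l.ACS
	}
	if sumACS <= 0 {
		return ncl, lcl
	}
	for _, l := range levels {
		ncl[l.Name] = ceilDiv(scl*l.ACS, sumACS)
		lcl[l.Name] = ceilDiv(scl*l.LCS, sumACS)
	}
	return ncl, lcl
}

func main() {
	// 600 is an assumed server concurrency limit, chosen only for illustration.
	ncl, lcl := concurrencyLimits(600, exampleLevels)
	for _, l := range exampleLevels {
		fmt.Printf("%-16s NCL=%d LCL=%d\n", l.Name, ncl[l.Name], lcl[l.Name])
	}
}
```

Because each limit is rounded up, the NCL values can sum to slightly more than SCL. Because `lendableConcurrencyShares` may not exceed `assuredConcurrencyShares`, LCL(i) never exceeds NCL(i).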
699+
700+
This is a somewhat tortured meaning for "assured", but it is the
701+
meaning we need for introduction of the new field to the existing type
702+
while having a smooth transition in behavior. In the next version we
703+
should rename the `AssuredConcurrencyShares` to
704+
`NominalConcurrencyShares`.
653705

654706
Borrowing is done on the basis of the current situation, with no
655-
consideration of opportunity cost and no pre-emption when the
656-
situation changes.
657-
658-
An apiserver assigns three concurrency limits to each non-exempt
659-
priority level, with a constraint that means there are only two
660-
degrees of freedom.
661-
662-
- The ***LendableConcurrencyLimit*** is the number of seats that are
663-
statically (i.e., before any borrowing takes place) assigned to this
664-
level and can be dynamically borrowed by higher levels.
665-
- The ***NonLendableConcurrencyLimit*** is the number of seats that
666-
are statically assigned to this level and can _not_ be borrowed by
667-
higher levels.
668-
- The ***NominalConcurrencyLimit*** is the number of seats statically
669-
assigned to this level and is the sum of the
670-
LendableConcurrencyLimit and the NonLendableConcurrencyLimit.
671-
672-
Each non-exempt PriorityLevelConfiguration's spec has an
673-
`assuredConcurrencyShares`, which has existed since APF was introduced
674-
and may not be zero, and a `lendableConcurrencyShares` field, which is
675-
being added in the midst of the lifetime of the `v1beta2` version of
676-
the API and may be any value between zero and
677-
`assuredConcurrencyShares` inclusive (default is zero). Each
678-
apiserver allocates NominalConcurrencyLimits in proportion to
679-
`assuredConcurrencyShares` and LendableConcurrencyLimit in
680-
corresponding proportion:
681-
682-
```
683-
NominalConcurrencyLimit(i) = ceil( SCL * assuredConcurrencyShares(i) / sum_assured )
684-
LendableConcurrencyLimit(i) = ceil( SCL * lendableConcurrencyShares(i) / sum_assured )
685-
NonLendableConcurrencyLimit(i) = NominalConcurrencyLimit(i) - LendableConcurrencyLimit(i)
686-
sum_assured = sum[priority levels k] assuredConcurrencyShares(k)
687-
```
688-
689-
where SCL is the apiserver's concurrency limit.
707+
consideration of opportunity cost, no further rationing according to
708+
shares (just obeying the concurrency limits as outlined above), and no
709+
pre-emption when the situation changes.
710+
711+
Whenever a request is dispatched, it takes all its seats from one
712+
priority level --- either the one referenced by the request's
713+
FlowSchema or a lower priority level.
690714

691715
Borrowing is further limited by a practical consideration: we do not
692716
want a global mutex covering all dispatching. Aside from borrowing,
693717
dispatching from one priority level is done independently from
694-
dispatching at another. Borrowing is allowed in just one direction
695-
(higher may borrow from lower, and lower does not actively lend to
696-
higher) and in a very limited quantity: an attempt to dispatch for one
697-
priority level will consider borrowing from just one other priority
698-
level (if there is any lower priority level at all).
718+
dispatching at another. There _is_ a global mutex held by the logic
719+
that digests configuration objects, but it produces an immutable
720+
object that is passed through `sync/atomic.Store` and `.Load` to the
721+
logic that does queuing and dispatching. Borrowing is allowed in just
722+
one direction: higher may borrow from lower. Furthermore, a higher
723+
priority level may actively borrow from a lower one but a lower level
724+
does not actively lend to a higher one (see below). In this design,
725+
working on one request requires holding at most two priority level
726+
mutexes at any given moment. To avoid deadlock, there is a strict
727+
ordering on acquisition of those mutexes, namely decreasing logical
728+
priority.
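
The hand-off of digested configuration described in this paragraph could look roughly like the sketch below. It is a minimal illustration using `sync/atomic.Value`, not the apiserver's code: the `digestedConfig` and `configHolder` types, their fields, and the method names are hypothetical. Only the digesting side takes the global mutex, and readers obtain the latest immutable snapshot without locking.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// digestedConfig is a hypothetical immutable snapshot produced by the
// configuration-digesting logic. It must never be mutated once published.
type digestedConfig struct {
	generation    int
	nominalLimits map[string]int
}

// configHolder is a hypothetical holder for the current snapshot.
type configHolder struct {
	mu      sync.Mutex   // global mutex, held only while digesting configuration
	gen     int          // guarded by mu
	current atomic.Value // holds *digestedConfig
}

// digest builds a fresh snapshot under the global mutex and publishes it
// with an atomic Store. The input map is copied so that later changes by
// the caller cannot affect the published snapshot.
func (h *configHolder) digest(limits map[string]int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.gen++
	copied := make(map[string]int, len(limits))
	for name, limit := range limits {
		copied[name] = limit
	}
	h.current.Store(&digestedConfig{generation: h.gen, nominalLimits: copied})
}

// load returns the most recently published snapshot, or nil if none has
// been published yet. It does not take the global mutex.
func (h *configHolder) load() *digestedConfig {
	cfg, _ := h.current.Load().(*digestedConfig)
	return cfg
}

func main() {
	h := &configHolder{}
	h.digest(map[string]int{"workload-low": 245})
	cfg := h.load()
	fmt.Println(cfg.generation, cfg.nominalLimits["workload-low"])
}
```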
729+
730+
We could take a complementary approach, in which lower actively lends
731+
to higher but higher does not actively borrow from lower. The chosen
732+
direction was chosen based on looking at metrics from scalability
733+
tests showing that the `workload-low` priority level has a
734+
significantly larger NominalConcurrencyLimit than the other levels and
735+
is usually very under-utilized (so it would consider lending less
736+
often than higher levels would consider borrowing).
699737

700738
A request can be dispatched exactly at a non-exempt priority level
701739
when either there are no requests executing at that priority level or
@@ -706,39 +744,47 @@ NominalConcurrencyLimit minus the number of seats used by requests
706744
executing at that priority level (dispatched from that priority level
707745
and higher ones).
708746

709-
There are two sorts of times when dispatching to a non-empty priority
747+
There are two sorts of times when dispatching to a non-exempt priority
710748
level is considered: when a request arrives, and when a request
711749
releases the seats it was occupying (which is not the same as when the
712-
request finishes from the client's point of view, see below about
713-
WATCH requests).
750+
request finishes from the client's point of view, due to the special
751+
considerations for WATCH requests).
714752

715753
At each of these sorts of moments, as many requests are dispatched
716754
exactly at the same priority level as possible. The next request to
717-
consider dispatching is chosen by using the Fair Queuing for Server
718-
Requests algorithm below to choose a queue at that priority level, and
719-
the request at the head of that queue is considered. If (a) no
720-
requests can be dispatched exactly at that priority level at that
721-
moment, (b) there are non-empty queues at that level, and (c) there
722-
are lower non-exempt priority levels, then the request at the head of
723-
the chosen queue is considered for dispatch at one of the lower
724-
priority levels. The particular lower priority level considered is
725-
drawn at random from the lower ones, in proportion to their
726-
LendableConcurrencyLimit (we use a static value so that the drawing
727-
can be done without acquiring mutexes). The request is executed at
728-
the chosen lower level (occupying some of its seats) if the request
729-
can be dispatched exactly at that level according to the rule above.
755+
consider in this process is chosen by using the Fair Queuing for
756+
Server Requests algorithm below to choose one of the non-empty queues
757+
at that priority level, and if indeed there is a non-empty queue then
758+
the request at the head of the chosen queue is considered for
759+
dispatching. If the level has non-empty queues but the chosen request
760+
can not be dispatched exactly at this level at the moment then the
761+
logically lower non-exempt priority levels are considered, one at a
762+
time, in decreasing logical priority order. As soon as one is found
763+
at which the request can be dispatched at the moment according to the
764+
rule above then the search stops and the request is dispatched to
765+
execute using some of the lower priority level's seats. If no
766+
suitable priority level is found then the request is not dispatched at
767+
the moment.
768+
769+
As can be seen from this logic, when seats are freed up at a given
770+
priority level they are _not_ actively lent to logically higher
771+
priority levels. We avoid that in order to have a total order in
772+
which priority level mutexes are acquired.
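
The search and the lock ordering described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the proposed implementation. The `level` type and `tryDispatch` function are hypothetical, queuing and the Fair Queuing algorithm are omitted, and any role of LendableConcurrencyLimit in capping borrowing is not modeled. The exact "can be dispatched exactly" condition is partly elided in this diff, so `canDispatchLocked` encodes one reading of it: the level is idle, or the request's seats fit within NominalConcurrencyLimit minus the seats in use.

```go
package main

import (
	"fmt"
	"sync"
)

// level is a hypothetical, simplified priority level that tracks only seats.
type level struct {
	mu      sync.Mutex
	name    string
	nominal int // NominalConcurrencyLimit
	inUse   int // seats used by requests executing at this level
}

// canDispatchLocked reports whether a request needing `width` seats can be
// dispatched exactly at l. The caller must hold l.mu. The condition is an
// assumption based on the surrounding text.
func (l *level) canDispatchLocked(width int) bool {
	return l.inUse == 0 || width <= l.nominal-l.inUse
}

// tryDispatch tries to dispatch one request whose home level is
// levels[home]. The slice must be sorted by decreasing logical priority.
// It returns the index of the level whose seats were taken.
//
// The home mutex is held throughout, and at most one lower level's mutex is
// held at a time, so at most two mutexes are held at any moment and they are
// always acquired in decreasing logical priority order.
func tryDispatch(levels []*level, home, width int) (int, bool) {
	h := levels[home]
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.canDispatchLocked(width) {
		h.inUse += width
		return home, true
	}
	// Consider the logically lower levels one at a time, highest first.
	for j := home + 1; j < len(levels); j++ {
		l := levels[j]
		l.mu.Lock()
		ok := l.canDispatchLocked(width)
		if ok {
			l.inUse += width // the request takes all its seats from this level
		}
		l.mu.Unlock()
		if ok {
			return j, true
		}
	}
	return -1, false // no suitable level; the request is not dispatched now
}

func main() {
	levels := []*level{
		{name: "high", nominal: 2, inUse: 2},
		{name: "mid", nominal: 3, inUse: 3},
		{name: "low", nominal: 10, inUse: 1},
	}
	at, ok := tryDispatch(levels, 0, 1)
	fmt.Println(at, ok, levels[2].inUse)
}
```

Releasing seats at a lower level never reaches upward for a higher level's mutex in this sketch, which mirrors the point that lower levels do not actively lend.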
730773

731774
The following table shows the current default non-exempt priority
732-
levels and a proposal for their new configuration.
775+
levels and a proposal for their new configuration. For the sake of
776+
continuity with out-of-tree configuration objects, the proposed
777+
priority values follow the rule given above for the effective value
778+
when the priority field holds zero.
733779

734780
| Name | Assured Shares | Proposed Lendable Shares | Proposed Priority |
735781
| ---- | -------------: | -----------------------: | ----------------: |
736-
| leader-election | 10 | 0 | 200 |
782+
| leader-election | 10 | 0 | 150 |
737783
| node-high | 40 | 10 | 400 |
738-
| system | 30 | 10 | 600 |
739-
| workload-high | 40 | 20 | 1000 |
740-
| workload-low | 100 | 90 | 8000 |
741-
| global-default | 20 | 10 | 9000 |
784+
| system | 30 | 10 | 500 |
785+
| workload-high | 40 | 20 | 833 |
786+
| workload-low | 100 | 90 | 9000 |
787+
| global-default | 20 | 10 | 9900 |
742788
| catch-all | 5 | 0 | 10000 |
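
The rule for the effective value of a zero `priority` field can be illustrated with the sketch below. The function name and signature are hypothetical. Truncating integer division is an assumption, since the text does not say how a fractional average is rounded, and the matching precedence inputs in `main` are illustrative values rather than a statement of the actual default FlowSchema configuration.

```go
package main

import "fmt"

// effectivePriority returns the priority used for ordering a Limited
// priority level. A non-zero `priority` is used as is. A zero value falls
// back to the average `matchingPrecedence` of the FlowSchema objects that
// reference the level. The boolean is false when the field is zero and no
// FlowSchema references the level, a case the text leaves unspecified.
func effectivePriority(priority int32, matchingPrecedences []int32) (int32, bool) {
	if priority != 0 {
		return priority, true
	}
	if len(matchingPrecedences) == 0 {
		return 0, false
	}
	var sum int64
	for _, p := range matchingPrecedences {
		sum += int64(p)
	}
	// Assumption: the average is truncated toward zero.
	return int32(sum / int64(len(matchingPrecedences))), true
}

func main() {
	p, _ := effectivePriority(0, []int32{800, 800, 900})
	fmt.Println(p)
	p, _ = effectivePriority(400, []int32{800, 800, 900})
	fmt.Println(p)
}
```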
743789

744790
