[Draft] KEP-1040: directed borrowing between priority levels in APF #3290
Conversation
> purpose of borrowing currently-unused concurrency. The ordering is
> based (first) on increasing value of a spec field whose name is
> `priority` and whose value is constrained to lie in the range 0
> through 10000 inclusive and (second) on increasing name of the
What is the reason for constraining it to 0..10000?
Same reason we constrain the matchingPrecedence that way.
> based (first) on increasing value of a spec field whose name is
> `priority` and whose value is constrained to lie in the range 0
> through 10000 inclusive and (second) on increasing name of the
> PriorityLevelConfiguration object. Priority levels that appear later
I know what you want to achieve (the total order), but I don't like ordering by name - that will lead to confusion, IMHO.
I would much rather say that no two PLs can share the same priority. The drawback of that is that such cross-object validation is non-trivial to add and can potentially cause issues (e.g. IPs/ports in Services).
@deads2k - thoughts?
If you have to have a total order, instead of attempting to force humans to use unique numbers for everything (a nightmare if you ever want to incorporate configs from multiple places), please order first by priority, and then break ties alphabetically by name. If you don't want it to be predictable, you can hash one or both of those.
(To keep the actual comparisons numeric & easy, I'd expect to convert whatever the tie break is to a number and treat it as a fractional part of the priority number, but that's just implementation details)
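For concreteness, here is a minimal Go sketch of that total order with a tie-break by name; the `plc` type and its field names are hypothetical stand-ins, not the KEP's API:

```go
package main

import (
	"fmt"
	"sort"
)

// plc is a hypothetical stand-in for PriorityLevelConfiguration;
// field names here are illustrative, not the KEP's final API.
type plc struct {
	Name     string
	Priority int32 // constrained to the range 0..10000
}

// less orders first by increasing Priority and then breaks ties
// alphabetically by object name, yielding a deterministic total order
// without forcing humans to pick globally unique numbers.
func less(a, b plc) bool {
	if a.Priority != b.Priority {
		return a.Priority < b.Priority
	}
	return a.Name < b.Name
}

func main() {
	levels := []plc{
		{Name: "workload-low", Priority: 8000},
		{Name: "system", Priority: 500},
		{Name: "node-high", Priority: 400},
		{Name: "other", Priority: 500}, // ties with "system"; name decides
	}
	sort.Slice(levels, func(i, j int) bool { return less(levels[i], levels[j]) })
	fmt.Println(levels) // smallest Priority value sorts first
}
```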
Since kube api-machinery has no multi-object transactions there is no direct way to prevent 100% of collisions. We could do a validation check that catches violations that are well enough separated in time. We could add a controller that complains somehow or deletes violating objects, but I think that would be more code to maintain than it is worth. I have revised to more strongly say "don't do that" and not promise how ties are broken.
> consideration of opportunity cost and no pre-emption when the
> situation changes.
>
> An apiserver assigns three concurrency limits to each non-exempt
Can you please start with explicitly defining how we will change the API?
IIUC, what you want to do is to add a new field to LimitedPriorityLevelConfiguration:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/flowcontrol/v1beta2/types.go#L452-L473
by setting what percentage of shares will be lendable?
OK - I see that described below now; please maybe start by saying something like:
The existing LimitedPriorityLevelConfiguration type will be extended as follows:

```go
type LimitedPriorityLevelConfiguration struct {
	// ...

	// FIXME: comment
	LendableConcurrencyShares int32
}
```
Revised to display the new/updated type fields.
> are lower non-exempt priority levels, then the request at the head of
> the chosen queue is considered for dispatch at one of the lower
> priority levels. The particular lower priority level considered is
> drawn at random from the lower ones, in proportion to their
I'm not a fan of this. And thinking about it, I'm wondering if we really need that.
The critical property you're introducing is effectively:
"A given request can borrow from at most one other priority level"
That is a nice simplification, and I think I can buy its consequence: a request may occasionally not be dispatched even though it would in theory be dispatchable if we could borrow from multiple PLs. The additional cognitive benefit of that simplification is fairly huge.
Now - I don't really buy the fact that we will be choosing the PL to borrow from at random. Any randomization here will make it fairly hard to understand/debug what was really happening in my production cluster. And I don't think I really see why we need that.
Here is how I think about it:
1. we have a set of (ordered) PLs [the set itself is protected by its own lock, but the individual QueueSets per PL are protected by their own locks]
2. when we can't dispatch a request R for a PL (due to lack of seats), then we:
   - grab a reader lock from (1) [this seems cheap, because PLs are added/removed really rarely, so that's fine]
   - iterate over all PLs with lower `priority` and for each of them:
     (a) take a lock on that PL
     (b) check if it has enough seats so that we can borrow to satisfy our request
     (c) if so - borrow and return
     (d) release the lock on that PL

In the above approach, we don't really need a global lock - we still lock exactly one PL at a time, which seems to be what we need.
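A minimal Go sketch of the iteration described above, assuming hypothetical `plSet`/`priorityLevel` types and simple seat counters; it illustrates the locking pattern only, not the actual APF implementation:

```go
package main

import "sync"

// priorityLevel is a hypothetical stand-in for one PL's QueueSet
// state; field names are illustrative.
type priorityLevel struct {
	mu        sync.Mutex
	name      string
	freeSeats int
}

// plSet holds the priority levels in order; the slice itself is
// guarded by its own RWMutex, matching point (1) above.
type plSet struct {
	mu     sync.RWMutex
	levels []*priorityLevel // sorted by increasing `priority` field value
}

// tryBorrow walks the PLs strictly below index `from` and takes
// `seats` from the first one that can lend them. At most one PL mutex
// is held at any moment (plus the read lock on the set), so no global
// write lock is needed on the dispatch path.
func (s *plSet) tryBorrow(from, seats int) *priorityLevel {
	s.mu.RLock() // cheap: PLs are added/removed rarely
	defer s.mu.RUnlock()
	for _, pl := range s.levels[from+1:] { // lower-priority PLs only
		pl.mu.Lock()                // (a) take a lock on that PL
		if pl.freeSeats >= seats { // (b) enough seats to lend?
			pl.freeSeats -= seats // (c) borrow and return
			pl.mu.Unlock()
			return pl
		}
		pl.mu.Unlock() // (d) release and try the next lower PL
	}
	return nil // no lower PL could lend; the request stays queued
}

func main() {
	s := &plSet{levels: []*priorityLevel{
		{name: "system", freeSeats: 0},
		{name: "workload-low", freeSeats: 5},
	}}
	if pl := s.tryBorrow(0, 3); pl != nil {
		println("borrowed from", pl.name)
	}
}
```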
Based on our discussion in the meeting today, I changed the proposal to have the higher priority level sequentially consider borrowing from each lower priority level. And I also added an explicit statement that all the seats are taken from one priority level. I also added some words about how little the existing global lock is used.
> (higher may borrow from lower, and lower does not actively lend to
> higher) and in a very limited quantity: an attempt to dispatch for one
> priority level will consider borrowing from just one other priority
> level (if there is any lower priority level at all).
Let me clarify this one - are you saying that to dispatch a request at a PL it can:
(a) use the unoccupied seats from its own PL and additionally borrow from 1 more PL?
(b) if it's borrowing from some other PL, it cannot use unoccupied seats from its own PL?
I hope it's (a), but it's unclear to me. And we need to clarify it here.
OK - it seems that it's actually (b) based on line 729.
I'm not entirely sure I like it yet, but maybe that's acceptable. I need to think about it more.
Revised to answer this explicitly.
> can be done without acquiring mutexes). The request is executed at
> the chosen lower level (occupying some of its seats) if the request
> can be dispatched exactly at that level according to the rule above.
One thing that I'm missing is how the trigger for dispatch changes:

before:
(1) when a request releases its seats after processing, we look for requests to dispatch only at its PL

after:
(2) when a request releases its seats after processing, it may not only mean that some other request at the same PL is now dispatchable; it may also mean that requests from some higher PL become dispatchable now

So I think that we need to decide (see the sketch after this list):
(a) what do we prioritize? [I would say that if there is a request from the given PL that we can now dispatch, we dispatch that one]
(b) if we can't dispatch a request from the same PL, do we try to dispatch requests from higher PLs? And if so, in what order? [I would say that we should do that if and only if our PL is empty (no waiting requests), starting from the highest-priority PL]
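The sketch referenced in the list above: a hedged Go illustration of preferences (a) and (b), with hypothetical `pl` fields and counters standing in for the real queue state:

```go
package main

import (
	"fmt"
	"sync"
)

// pl is a hypothetical stand-in for one priority level's queue state;
// names and fields are illustrative, not the actual APF code.
type pl struct {
	mu        sync.Mutex
	name      string
	waiting   int // requests queued at this level
	freeSeats int // seats currently unused at this level
}

// onSeatsReleased runs after a request at levels[idx] finishes and
// returns `seats`. levels is ordered highest logical priority first.
// It encodes the suggested preference: (a) if the same PL has waiting
// requests, dispatch there first; (b) only if that PL is empty, scan
// higher-priority PLs from the top and let the first one with a
// waiting request borrow the freed seats.
func onSeatsReleased(levels []*pl, idx, seats int) {
	own := levels[idx]
	own.mu.Lock()
	own.freeSeats += seats
	if own.waiting > 0 { // (a) same-PL requests win
		own.waiting--
		own.freeSeats -= seats
		own.mu.Unlock()
		fmt.Printf("dispatched locally at %s\n", own.name)
		return
	}
	own.mu.Unlock() // our queues are empty; consider lending upward

	for _, h := range levels[:idx] { // (b) highest priority first
		h.mu.Lock()
		hasWaiter := h.waiting > 0
		if hasWaiter {
			h.waiting--
		}
		h.mu.Unlock()
		if hasWaiter {
			// Optimistically re-take our own lock to lend the seats;
			// they might have been taken meanwhile, which real code
			// would have to re-check.
			own.mu.Lock()
			own.freeSeats -= seats
			own.mu.Unlock()
			fmt.Printf("%s borrowed %d seat(s) from %s\n", h.name, seats, own.name)
			return
		}
	}
}

func main() {
	levels := []*pl{
		{name: "system", waiting: 1}, // higher-priority PL with a queued request
		{name: "workload-low"},       // the PL whose request just finished
	}
	onSeatsReleased(levels, 1, 1)
}
```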
Revised to explicitly address this.
I'm not convinced - as I mentioned above in (b), if our PL becomes empty (all queues are empty), then what we could in theory do is:
(1) release the lock on it
(2) iterate from the highest priority level and, when we find a PL that has a non-empty queue, try to borrow from lower levels on its behalf
That won't cause inversion and could potentially let us dispatch more.
The only question is how often that happens and, as a result, whether it is worth it...
Remember that the queue being dispatched from is in a priority level, so I think the lock on that priority level has to be held until a dispatching decision is made.
So what I'm suggesting is:
- let's say we released all seats on a given PL and we have an empty queue there
- we forget about the queue
- but we start from the highest priority level and try to dispatch from those queues (just optimistically assuming that our queue will still have free seats) - yes, you will have to take a lock on our initial queue again, but that sounds acceptable if there are requests to dispatch in higher PLs
> yes, you will have to take a lock on our initial queue again, but that sounds acceptable if there are requests to dispatch in higher PLs
This matches what I would expect.
> | node-high     | 40  | 10 | 400  |
> | system        | 30  | 10 | 600  |
> | workload-high | 40  | 20 | 1000 |
> | workload-low  | 100 | 90 | 8000 |
Workload low is relatively widely used by different controllers. So I think it should have higher priority. More like 2000 in my opinion. And potentially also lower lendable shares?
Revised and added an explanation. A new thought occurred during revision: what happens when reading a pre-existing v1beta2.PriorityLevelConfiguration object that was stored before the new fields were introduced?
> in the order are considered to have lower priority.
>
> Borrowing is done on the basis of the current situation, with no
> consideration of opportunity cost and no pre-emption when the
I can imagine situations where this could be problematic. We may want to (let humans) disallow PLs with many slow requests from borrowing. (still reading so maybe you already solve this)
No, the proposal does not address this issue. I am not yet sure what I think about it.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MikeSpreitzer. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Following up on a comment in-line: we would be less constrained if we introduced a

That doesn't help much, because we still have to be able to round-trip via v1beta2 and get the same thing. I'm not an API expert, so I would defer to @liggitt or @lavalamp but:

@wojtek-t is correct, that is not easier. The apiserver has to be able to round-trip all fields in all versions to all other versions. We usually do that by adding the fields to older version definitions as normal fields. The few places we've tried to round-trip via serialized JSON in annotations to older versions have really not gone well (lots of bugs with invalid data being allowed into annotations and breaking conversion).
> | leader-election | 10  | 0  | 150 |
> | node-high       | 40  | 10 | 400 |
> | system          | 30  | 10 | 500 |
> | workload-high   | 40  | 20 | 833 |
I wouldn't pick this number - why not a rounded one, like 800?
@wojtek-t : regarding both of your comments about priority values: the proposal here is to set exactly the same values as prescribed by the rule for behavior when the value is zero. Let's first finish the discussion of the overall way to get the new fields added; the concern here is really a corollary to that.
I'm talking here about priority itself (not the shares). The priority field value is orthogonal to the discussion we're having above imho.
> | node-high     | 40  | 10 | 400  |
> | system        | 30  | 10 | 500  |
> | workload-high | 40  | 20 | 833  |
> | workload-low  | 100 | 90 | 9000 |
I think I mentioned it somewhere, but my comment got lost - workload-low is used by different components in different setups, so I would make it much higher priority, maybe 1000?
Still not solved.
Also - I would like to avoid numbers like 833 - let's try to use more rounded numbers.
We need to define what happens for a PriorityLevelConfiguration that does not have its Priority field set. This PR has a proposal for that. The values you see in this table equal the values that come from that proposal. I think it is good to have consistency between the designed Priority values for the in-tree PLCs and the defaults (which will initially apply to out-of-tree PLCs).
I don't understand your comment.
What matters for us is the order between PLs, not the absolute numbers. So what I'm saying is that, given that, let's avoid numbers like 833 and let's make workload-low higher priority, because I would expect user-originating PLs to come after workload-low.
Let me make sure I understand the considerations around adding new fields. We want good behavior from a system that:

Have I got that right? The part about old clients means we drag a really heavy boat anchor. I am thinking that what I wrote above does not exclude old clients that read, modify locally, and PUT a whole replacement value.
You won't get full protection from an old client making an Update call and overwriting something written by the new client. And we don't have a goal of preventing that. Well-behaving clients should be using Patch(), which would prevent the above. Also - we shouldn't try to reinvent the wheel - we've been adding fields to different APIs gazillions of times in the past, and API approvers already know what works best; we should follow that path.
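To illustrate the Patch-vs-Update point with client-go (the clientset value `cs` and the particular field being changed are assumptions for the example; this assumes `Spec.Limited` is non-nil, i.e. a Limited level):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// updateShares shows the read-modify-write (Update) pattern: an old
// client's decode drops fields its types don't know about, and the
// subsequent PUT then overwrites them on the server.
func updateShares(ctx context.Context, cs kubernetes.Interface) error {
	plc, err := cs.FlowcontrolV1beta2().PriorityLevelConfigurations().
		Get(ctx, "workload-low", metav1.GetOptions{})
	if err != nil {
		return err
	}
	plc.Spec.Limited.AssuredConcurrencyShares = 50
	_, err = cs.FlowcontrolV1beta2().PriorityLevelConfigurations().
		Update(ctx, plc, metav1.UpdateOptions{})
	return err
}

// patchShares shows the well-behaving pattern: the patch names only
// the field being changed, so fields this client doesn't know about
// survive untouched.
func patchShares(ctx context.Context, cs kubernetes.Interface) error {
	patch := []byte(`{"spec":{"limited":{"assuredConcurrencyShares":50}}}`)
	_, err := cs.FlowcontrolV1beta2().PriorityLevelConfigurations().
		Patch(ctx, "workload-low", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```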
@kikisdeliveryservice: why the title change here?
Replaced lendable concurrency shares with borrowable percent. Explained a motivation for the rule for default priority behavior.
> // of every other Limited priority level.
> // This field has a default value of 30.
> // +optional
> AssuredConcurrencyShares int32
Do we want to say that when graduating to v1 (or v1beta3, but hopefully that won't be needed), we will rename it to X (TotalConcurrencyShares - but I think you had a better name)?
OK - I see that you're mentioning it below with: NominalConcurrencyShares
Would it make sense to extend the comment above and say that this field will be renamed to NominalConcurrencyShares in the v1 version?
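For concreteness, a sketch of how that extended comment might read; the wording is illustrative, not the KEP's final text, and the elided description comes from the quoted diff above:

```go
// LimitedPriorityLevelConfiguration sketch: only the relevant field is shown.
type LimitedPriorityLevelConfiguration struct {
	// ... (existing description) ... of every other Limited priority level.
	// This field has a default value of 30.
	// Note: this field is expected to be renamed to
	// NominalConcurrencyShares when this API graduates to v1.
	// +optional
	AssuredConcurrencyShares int32
}
```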
friendly ping
KEP number in the PR title is standard for PRs in this repo :)
> given moment. To avoid deadlock, there is a strict ordering on
> acquisition of those mutexes (namely, decreasing logical priority).
>
> We could take a complementary approach, in which lower actively lends
This is a good observation of the current state. I'm wondering how we would choose to configure priority levels after this. Would we choose to reserve large portions for the cluster-critical services, give workload-low a low concurrency, and simply let it borrow from the higher levels when there is space available?
That's actually a good point; I was thinking about that at some point but then forgot.
This is actually a problem in the current design, where we're explicitly saying in lines 733-734 that we're not considering borrowing from higher levels to lower levels.
So David - what you're suggesting here will not happen in the current design.
The way I would like it to work is indeed how you're saying: "we reserve the large portion for cluster stuff that we may potentially need, and then let other to borrow from that if those are underutilized". But that implies that cluster stuff is lowest level, which is counter-intuitive...
OTOH, I really don't want to allocate a small fraction from cluster and let it borrow from other levels in case cluster is hugely utilized - that would be counter-productive.
Strawman (although I just came up with this idea and have no idea if it makes sense yet): what if the priority only defines the ordering (then maybe we don't even need it?) but in fact every level can borrow from any other?
What we would need then, though, is more like the rule:
"if there are requests at a given PL that could consume capacity from it, then these have precedence over any other requests that would like to borrow the capacity"...
With the above, by giving a good chunk of capacity to cluster stuff we ensure that it will have capacity when we need it, and it's also borrowable.
And FWIW - I think we don't need the "priority" field at all (we can define ordering based on namespace/name or sth) and from that perspective it would be closer to what Daniel was proposing (i.e. we conceptually give capacity to others; it's just that we still dispatch only from one level, not from the whole bag?)
@MikeSpreitzer: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Closing in favor of #3391