
Conversation

@MikeSpreitzer
Member

@MikeSpreitzer MikeSpreitzer commented May 2, 2022

  • One-line PR description: describe how to add borrowing between priority levels in API Priority and Fairness in a directed way

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels May 2, 2022
@k8s-ci-robot k8s-ci-robot requested review from apelisse and deads2k May 2, 2022 06:45
purpose of borrowing currently-unused concurrency. The ordering is
based (first) on increasing value of a spec field whose name is
`priority` and whose value is constrained to lie in the range 0
through 10000 inclusive and (second) on increasing name of the
Member

What is the reason for constraining it to 0..10000?

Member Author

Same reason we constrain the matchingPrecedence that way.

based (first) on increasing value of a spec field whose name is
`priority` and whose value is constrained to lie in the range 0
through 10000 inclusive and (second) on increasing name of the
PriorityLevelConfiguration object. Priority levels that appear later
Member

I know what you want to achieve (the total order), but I don't like the order by name - that will lead to confusion imho.

I would much rather say that no two PLs can share the same priority. The drawback of that is that such cross-object validation is non-trivial to add and can potentially cause issues (e.g. IPs/ports in Services).

@deads2k - thoughts?

Contributor

If you have to have a total order, instead of attempting to force humans to use unique numbers for everything (a nightmare if you ever want to incorporate configs from multiple places), please order first by priority, and then break ties alphabetically by name. If you don't want it to be predictable, you can hash one or both of those.

(To keep the actual comparisons numeric & easy, I'd expect to convert whatever the tie-break is to a number and treat it as a fractional part of the priority number, but that's just an implementation detail.)
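
A minimal sketch of the total order described above, purely for illustration — the type and helper names here are hypothetical, not from the KEP:

type plc struct {
	Name     string // object name, used as the tie-break
	Priority int32  // the proposed spec field, constrained to 0..10000
}

// less orders priority levels first by ascending Priority, then
// alphabetically by name, yielding a deterministic total order.
func less(a, b plc) bool {
	if a.Priority != b.Priority {
		return a.Priority < b.Priority
	}
	return a.Name < b.Name
}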

Member Author

@MikeSpreitzer MikeSpreitzer May 3, 2022

Since kube api-machinery has no multi-object transactions, there is no direct way to prevent 100% of collisions. We could do a validation check that catches violations when the conflicting writes are well enough separated in time. We could add a controller that complains somehow, or deletes violating objects, but I think that would be more code to maintain than it is worth. I have revised to say "don't do that" more strongly and not to promise how ties are broken.

consideration of opportunity cost and no pre-emption when the
situation changes.

An apiserver assigns three concurrency limits to each non-exempt
Member

Can you please start with explicitly defining how we will change the API?

IIUC, what you want to do is to add a new field to LimitedPriorityLevelConfiguration:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/flowcontrol/v1beta2/types.go#L452-L473

by setting what percentage of shares will be lendable?

Member

OK - I see that described below now, please maybe start with saying something like

The existing LimitedPriorityLevelConfiguration type will be extended as follows:

type LimitedPriorityLevelConfiguration struct {
  ...

  // FIXME: comment
  LendableConcurrencyShares int32
}
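
For orientation, a hedged sketch of what the extended type might look like, combining names floated later in this thread (NominalConcurrencyShares as the eventual rename of AssuredConcurrencyShares, plus a lendable percentage) — the exact field names, comments, and optionality are illustrative, not the final API:

type LimitedPriorityLevelConfiguration struct {
	// NominalConcurrencyShares contributes to computing this level's
	// nominal concurrency limit. (Sketch only: the thread below discusses
	// renaming AssuredConcurrencyShares to this in a later version.)
	NominalConcurrencyShares int32

	// LendablePercent is the percentage of this level's nominal
	// concurrency that other priority levels may borrow.
	// +optional
	LendablePercent *int32

	// ... existing fields such as LimitResponse unchanged ...
}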

Member Author

Revised to display the new/updated type fields.

are lower non-exempt priority levels, then the request at the head of
the chosen queue is considered for dispatch at one of the lower
priority levels. The particular lower priority level considered is
drawn at random from the lower ones, in proportion to their
Member

I'm not a fan of this. And thinking about it, I'm wondering if we really need that.

The critical property you're introducing is effectively:

"A given request can borrow from at most one other priority level"

That is a nice simplification, and I think I can accept its consequence that a request may sometimes not be dispatched even though it would, in theory, be dispatchable if we could borrow from multiple PLs. The additional cognitive benefit from that simplification is fairly huge.

Now - I don't really buy the fact that we will be choosing the PL to borrow from at random. Any randomization here will make it fairly hard to understand/debug what was really happening in my production cluster. And I don't think I really see why we need that.

Here is how I think about it:

  1. we have a set of (ordered) PLs [the set itself is protected by its own lock, but individual QueueSet per PL are protected on their own by their own locks]
  2. when we can't dispatch a request R for a PL (due to lack of seats), then we:
  • grab a reader lock from (1) [this seems cheap, because PLs are added/removed really rarely, so that's fine]
  • iterate over all PLs with lower priority and for each of them:
    (a) take a lock on that PL
    (b) check if it has enough seats so that we can borrow to satisfy our request
    (c) if so - borrow and return
    (d) release the lock from that PL

In the above approach, we don't really need a global lock - we still lock exactly one PL at a time, which seems what we need.
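
A rough Go sketch of that locking pattern; the types and names here (registry, level, freeSeats, and so on) are hypothetical stand-ins, not the real QueueSet code:

import "sync"

type level struct {
	mu            sync.Mutex
	priorityValue int32 // the spec's priority field; larger value = lower logical priority
	capacity      int
	inUse         int
	borrowedSeats int
}

func (l *level) freeSeats() int { return l.capacity - l.inUse - l.borrowedSeats }

type registry struct {
	mu            sync.RWMutex
	orderedLevels []*level // sorted by ascending priorityValue
}

// tryBorrow scans the lower-priority levels for one that can lend all
// the requested seats, locking exactly one level at a time.
func tryBorrow(reg *registry, seats int, borrowerPrio int32) bool {
	reg.mu.RLock() // cheap: levels are added/removed rarely
	defer reg.mu.RUnlock()
	for _, pl := range reg.orderedLevels {
		if pl.priorityValue <= borrowerPrio {
			continue // only strictly lower logical priority may lend
		}
		pl.mu.Lock()
		if pl.freeSeats() >= seats {
			pl.borrowedSeats += seats // all seats come from this one level
			pl.mu.Unlock()
			return true
		}
		pl.mu.Unlock()
	}
	return false
}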

Member Author

Based on our discussion in the meeting today, I changed the proposal to have the higher priority level sequentially consider borrowing from each lower priority level. And I also added an explicit statement that all the seats are taken from one priority level. I also added some words about how little the existing global lock is used.

(higher may borrow from lower, and lower does not actively lend to
higher) and in a very limited quantity: an attempt to dispatch for one
priority level will consider borrowing from just one other priority
level (if there is any lower priority level at all).
Member

Let me clarify this one - are you saying that to dispatch a request at PL it can:
(a) use the unoccupied seats from its PL and additionally borrow from 1 more PL?
(b) if it's borrowing from some other PL, it cannot use unoccupied seats from its own PL?

I hope it's (a), but it's unclear to me. And we need to clarify it here.

Member

OK - it seems that it's actually (b) based on line 729.

I'm not entirely sure I like it yet, but maybe that's acceptable. I need to think about it more.

Member Author

Revised to answer this explicitly.

can be done without acquiring mutexes). The request is executed at
the chosen lower level (occupying some of its seats) if the request
can be dispatched exactly at that level according to the rule above.

Member

One thing that I'm missing is a description of what changes now:

before:
(1) when a request releases its seats after processing, we look for requests to dispatch only at its PL

after:
(2) when a request releases its seats after processing, it may not only mean that some other request at the same PL is now dispatchable, but it may also mean that requests from some higher PL become dispatchable now

So I think that we need to decide:
(a) what do we prioritize [I would say that if there is a request from a given PL that we can now dispatch, we dispatch this one]
(b) if we can't dispatch request from the same PL, do we try to dispatch requests from higher PLs? And if so, in what order? [I would say that we should do that, if and only if our PL is empty (no waiting requests) and do that starting from the highest priority PL]
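
Reusing the hypothetical types from the earlier sketch, the release-time flow under (a) and (b) might look like this; hasWaiting, dispatchOne, and tryDispatchBorrowingFrom are illustrative placeholders:

// onSeatsReleased first dispatches waiting requests at the same level;
// only when that level is idle does it offer the freed capacity to
// higher-priority levels, starting from the highest.
func onSeatsReleased(reg *registry, pl *level) {
	pl.mu.Lock()
	for pl.hasWaiting() && pl.freeSeats() > 0 {
		pl.dispatchOne() // (a) same-PL requests take precedence
	}
	idle := !pl.hasWaiting()
	pl.mu.Unlock()
	if !idle {
		return
	}
	// (b) scan from the highest logical priority (lowest priority value).
	reg.mu.RLock()
	defer reg.mu.RUnlock()
	for _, other := range reg.orderedLevels {
		if other.priorityValue >= pl.priorityValue {
			break // only higher logical priority may borrow from pl
		}
		other.mu.Lock()
		// tryDispatchBorrowingFrom may re-take pl.mu; taking other.mu first
		// follows the decreasing-logical-priority lock order the KEP describes.
		other.tryDispatchBorrowingFrom(pl)
		other.mu.Unlock()
	}
}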

Member Author

Revised to explicitly address this.

Member

I'm not convinced - as I mentioned above in (b), if our PL becomes empty (all queues are empty), then what we could in theory do is to:
(1) release the lock on it
(2) iterate from the highest priority level, and when we find a PL that has a non-empty queue, try to borrow from lower levels

That won't cause inversion and could potentially let us dispatch more.
The only question is how often that happens and as a result is it worth it...

Member Author

Remember that the queue being dispatched from is in a priority level, so I think the lock on that priority level has to be held until a dispatching decision is made.

Member

So what I'm suggesting is:

  • let's say we released all seats on a given PL and we have an empty queue there
  • we forget about the queue
  • but we start from the highest priority level and try to dispatch from those queues (just optimistically assuming that our queue will still have free seats) - yes, you will have to take a lock on our initial queue again, but that sounds acceptable if there are requests to dispatch in higher PLs

Contributor

yes you will have to take a lock on our initial queue again, but that sounds acceptable if there are requests to dispatch in higher PLs

This matches what I would expect.

| node-high | 40 | 10 | 400 |
| system | 30 | 10 | 600 |
| workload-high | 40 | 20 | 1000 |
| workload-low | 100 | 90 | 8000 |
Member

Workload-low is relatively widely used by different controllers, so I think it should have higher priority - more like 2000 in my opinion. And potentially also lower lendable shares?

Member Author

Revised and added an explanation. A new thought occurred during revision: what happens when reading a pre-existing v1beta2.PriorityLevelConfiguration object that was stored before the new fields were introduced?

in the order are considered to have lower priority.

Borrowing is done on the basis of the current situation, with no
consideration of opportunity cost and no pre-emption when the
Contributor

I can imagine situations where this could be problematic. We may want to (let humans) disallow PLs with many slow requests from borrowing. (Still reading, so maybe you already solve this.)

Member Author

No, the proposal does not address this issue. I am not yet sure what I think about it.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MikeSpreitzer
To complete the pull request process, please ask for approval from deads2k after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MikeSpreitzer MikeSpreitzer force-pushed the apf-add-borrowing branch 2 times, most recently from b7443a8 to fba5899 on May 3, 2022 03:31
@MikeSpreitzer
Member Author

Following up on a comment in-line: we would be less constrained if we introduced a v1beta3 version and put the new fields only in the new version.
Indeed, why make our lives harder than necessary by not going to a new version?

@wojtek-t
Member

wojtek-t commented May 5, 2022

Following up on a comment in-line: we would be less constrained if we introduced a v1beta3 version and put the new fields only in the new version.

That doesn't help much, because we still have to be able to round-trip via v1beta2 and get the same thing.

I'm not an API expert, so I would defer to @liggitt or @lavalamp, but:
(1) we usually do that via encoding the newly added fields as JSON in the older types
But they would point you to the right rules/practices.

@liggitt
Member

liggitt commented May 5, 2022

we would be less constrained if we introduced a v1beta3 version and put the new fields only in the new version.

@wojtek-t is correct, that is not easier. The apiserver has to be able to round-trip all fields in all versions to all other versions.

we usually do that via encoding the newly added fields as JSON in the older types

We usually do that by adding the fields to older version definitions as normal fields. The few places we've tried to round-trip via serialized JSON in annotations to older versions has really not gone well (lots of bugs with invalid data being allowed into annotations and breaking conversion)

| leader-election | 10 | 0 | 150 |
| node-high | 40 | 10 | 400 |
| system | 30 | 10 | 500 |
| workload-high | 40 | 20 | 833 |
Member

I wouldn't pick this number - why not some rounded one, like 800?

Member Author

@wojtek-t : regarding both of your comments about priority values: the proposal here is to set exactly the same values as prescribed by the rule for behavior when the value is zero. Let's first finish the discussion of the overall way to get the new fields added; the concern here is really a corollary to that.

Member

I'm talking here about priority itself (not the shares). The priority field value is orthogonal to the discussion we're having above imho.

| node-high | 40 | 10 | 400 |
| system | 30 | 10 | 500 |
| workload-high | 40 | 20 | 833 |
| workload-low | 100 | 90 | 9000 |
Member

I think I mentioned it somewhere, but my comment got lost - workload-low is used by different components in different setups, so I would make it much higher priority, maybe 1000?

Member

Still not solved.

Also - I would like to avoid numbers like 833 - let's try to use more rounded numbers.

Member Author

@MikeSpreitzer MikeSpreitzer Jun 13, 2022

We need to define what happens for a PriorityLevelConfiguration that does not have its Priority field set. This PR has a proposal for that. The values you see in this table equal the values that come from that proposal. I think it is good to have consistency between the designed Priority values for the in-tree PLCs and the defaults (which will initially apply to out-of-tree PLCs).

Member

I don't understand your comment.
What matters for us is the order between PLs, not the absolute numbers. So what I'm saying is that, given that, let's avoid numbers like 833 and let's make workload-low higher priority, because I would expect user-originating PLs to come after workload-low.

@MikeSpreitzer
Member Author

Let me make sure I understand the considerations around adding new fields. We want good behavior from a system that:

  • has a mixture of servers from all the point releases (so far) of the latest minor release and the previous one;
  • has a mixture of in-tree clients from all the point releases (so far) of the latest minor release and the three previous ones;
  • has a mixture of out-of-tree client binaries built from out-of-tree sources of uncontrolled vintage and importing vintages of the in-tree code of any (existing) point release of the latest minor release or any of the three previous ones; and
  • starts with storage contents written by the previous release.

Have I got that right?

The part about old clients means we drag a really heavy boat anchor. I am thinking that what I wrote above does not exclude old clients that read, modify locally, and PUT a whole replacement value.

@wojtek-t
Member

wojtek-t commented May 9, 2022

Have I got that right?

The part about old clients means we drag a really heavy boat anchor. I am thinking that what I wrote above does not exclude old clients that read, modify locally, and PUT a whole replacement value.

You won't get full protection from an old client making an Update call and overwriting something written by a new client. And we don't have a goal of preventing that. Well-behaved clients should be using Patch(), which would prevent the above.

Also - we shouldn't try to reinvent the wheel - we've been adding fields to different APIs in the past gazillion times and API approvers already know what works best and we should follow that path.
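
For illustration, a minimal client-go sketch of the Patch-based update mentioned above (the field value and object name are arbitrary examples). A merge patch only touches the fields it names, so fields unknown to an older client are left intact:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

func bumpShares(ctx context.Context, cs kubernetes.Interface) error {
	// Only spec.limited.assuredConcurrencyShares is modified; a PUT of a
	// whole stale object would instead drop fields the client doesn't know.
	patch := []byte(`{"spec":{"limited":{"assuredConcurrencyShares":50}}}`)
	_, err := cs.FlowcontrolV1beta2().PriorityLevelConfigurations().
		Patch(ctx, "workload-low", types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}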

@kikisdeliveryservice kikisdeliveryservice changed the title Draft borrowing between priority levels in APF [Draft] KEP-1040: borrowing between priority levels in APF May 10, 2022
@MikeSpreitzer
Member Author

@kikisdeliveryservice : why the title change here?

Replaced lendable concurrency shares with borrowable percent.

Explained a motivation for the rule for default priority behavior.
// of every other Limited priority level.
// This field has a default value of 30.
// +optional
AssuredConcurrencyShares int32
Member

Do we want to say that when graduating to v1 (or v1beta3, but hopefully that won't be needed), we will rename it to X (TotalConcurrencyShares - but I think you had a better name)?

Member

OK - I see that you're mentioning it below with: NominalConcurrencyShares

Would it make sense to extend the comment above and say that this field will be renamed to NominalConcurrencyShares in the v1 version?

Member

friendly ping

@kikisdeliveryservice
Member

@kikisdeliveryservice : why the title change here?

KEP number in the PR title is standard for PRs in this repo :)

given moment. To avoid deadlock, there is a strict ordering on
acquisition of those mutexes (namely, decreasing logical priority).

We could take a complementary approach, in which lower actively lends
Contributor

@deads2k deads2k Jun 3, 2022

This is a good observation of the current state. I'm wondering how we would choose to configure priority levels after this. Would we choose to reserve large portions for the cluster-critical services, give workload-low a low concurrency, and simply let it borrow from the higher levels when there was space available?

cc @wojtek-t @tkashem for comment as well

Member

That's actually a good point - I was thinking about that at some point but then forgot.

This is actually a problem in the current design, where we're explicitly saying in lines 733-734 that we're not considering borrowing from higher levels to lower levels.
So David - what you're suggesting here will not happen in the current design.

The way I would like it to work is indeed how you're saying: "we reserve the large portion for cluster stuff that we may potentially need, and then let other to borrow from that if those are underutilized". But that implies that cluster stuff is lowest level, which is counter-intuitive...
OTOH, I really don't want to allocate a small fraction to the cluster stuff and let it borrow from other levels in case the cluster is heavily utilized - that would be counter-productive.

Strawman (although I just came up with this idea and have no idea if it makes sense yet): what if the priority only defines the ordering (then maybe we don't even need it?) but in fact every level can borrow from any other?

Member

What we would need then, though, is more like a rule:
"if there are requests at a given PL that could consume capacity from it, then these have precedence over any other requests that would like to borrow that capacity"...

With the above, by giving a good chunk of capacity to the cluster stuff, we ensure that it will have capacity when we need it, and that it's also borrowable.

And FWIW - I think we don't need the "priority" field at all (we can define ordering based on namespace/name or something) and from that perspective it would be closer to what Daniel was proposing (i.e. we conceptually give capacity to others, it's just that we still dispatch only from one level, not from the whole bag?)

Member Author

@MikeSpreitzer MikeSpreitzer Jun 13, 2022

We have four proposals on the table: this PR, its complement (#3394), #3389, and #3391.

@MikeSpreitzer MikeSpreitzer force-pushed the apf-add-borrowing branch 2 times, most recently from 1113369 to 8494831 on June 13, 2022 05:36
@MikeSpreitzer MikeSpreitzer changed the title [Draft] KEP-1040: borrowing between priority levels in APF [Draft] KEP-1040: directed borrowing between priority levels in APF Jun 13, 2022
@k8s-ci-robot
Contributor

@MikeSpreitzer: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 21, 2022
@wojtek-t
Member

Closing in favor of #3391

@wojtek-t wojtek-t closed this Jul 11, 2022