[Draft] KEP-1040: directed borrowing between priority levels in APF #3290
Conversation
> purpose of borrowing currently-unused concurrency. The ordering is
> based (first) on increasing value of a spec field whose name is
> `priority` and whose value is constrained to lie in the range 0
> through 10000 inclusive and (second) on increasing name of the
What is the reason for constraining it to 0..10000?
Same reason we constrain the matchingPrecedence that way.
> based (first) on increasing value of a spec field whose name is
> `priority` and whose value is constrained to lie in the range 0
> through 10000 inclusive and (second) on increasing name of the
> PriorityLevelConfiguration object. Priority levels that appear later
I know what you want to achieve (the total order), but I don't like ordering by name - that will lead to confusion, IMHO.
I would much rather say that no two PLs can share the same priority. The drawback of that is that such cross-object validation is non-trivial to add and can potentially cause issues (e.g. IPs/ports in Services).
@deads2k - thoughts?
If you have to have a total order, instead of attempting to force humans to use unique numbers for everything (a nightmare if you ever want to incorporate configs from multiple places), please order first by priority, and then break ties alphabetically by name. If you don't want it to be predictable, you can hash one or both of those.
(To keep the actual comparisons numeric & easy, I'd expect to convert whatever the tie break is to a number and treat it as a fractional part of the priority number, but that's just implementation details)
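For concreteness, here is a minimal Go sketch of that total order with a tie-break by name; the `plc` type and its field names are hypothetical stand-ins, not the KEP's API:

```go
package main

import (
	"fmt"
	"sort"
)

// plc is a hypothetical stand-in for PriorityLevelConfiguration;
// field names here are illustrative, not the KEP's final API.
type plc struct {
	Name     string
	Priority int32 // constrained to the range 0..10000
}

// less orders first by increasing Priority and then breaks ties
// alphabetically by object name, yielding a deterministic total order
// without forcing humans to pick globally unique numbers.
func less(a, b plc) bool {
	if a.Priority != b.Priority {
		return a.Priority < b.Priority
	}
	return a.Name < b.Name
}

func main() {
	levels := []plc{
		{Name: "workload-low", Priority: 8000},
		{Name: "system", Priority: 500},
		{Name: "node-high", Priority: 400},
		{Name: "other", Priority: 500}, // ties with "system"; name decides
	}
	sort.Slice(levels, func(i, j int) bool { return less(levels[i], levels[j]) })
	fmt.Println(levels) // smallest Priority value sorts first
}
```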
Since kube api-machinery has no multi-object transactions there is no direct way to prevent 100% of collisions. We could do a validation check that catches violations that are well enough separated in time. We could add a controller that complains somehow or deletes violating objects, but I think that would be more code to maintain than it is worth. I have revised to more strongly say "don't do that" and not promise how ties are broken.
> consideration of opportunity cost and no pre-emption when the
> situation changes.
>
> An apiserver assigns three concurrency limits to each non-exempt
Can you please start with explicitly defining how we will change the API?
IIUC, what you want to do is to add a new field to LimitedPriorityLevelConfiguration:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/flowcontrol/v1beta2/types.go#L452-L473
by setting what percentage of shares will be lendable?
OK - I see that described below now; please maybe start by saying something like:
The existing LimitedPriorityLevelConfiguration type will be extended as follows:

```go
type LimitedPriorityLevelConfiguration struct {
	// ...

	// FIXME: comment
	LendableConcurrencyShares int32
}
```
Revised to display the new/updated type fields.
> are lower non-exempt priority levels, then the request at the head of
> the chosen queue is considered for dispatch at one of the lower
> priority levels. The particular lower priority level considered is
> drawn at random from the lower ones, in proportion to their
I'm not a fan of this. And thinking about it, I'm wondering if we really need that.
The critical property you're introducing is effectively:
"A given request can borrow from at most one other priority level"
That is a nice simplification, and I think I can buy its consequence: a request may occasionally not be dispatched even though it would in theory be dispatchable if we could borrow from multiple PLs. The additional cognitive benefit of that simplification is fairly huge.
Now - I don't really buy the fact that we will be choosing the PL to borrow from at random. Any randomization here will make it fairly hard to understand/debug what was really happening in my production cluster. And I don't think I really see why we need that.
Here is how I think about it:
1. we have a set of (ordered) PLs [the set itself is protected by its own lock, but the individual QueueSets per PL are protected by their own locks]
2. when we can't dispatch a request R for a PL (due to lack of seats), then we:
   - grab a reader lock from (1) [this seems cheap, because PLs are added/removed really rarely, so that's fine]
   - iterate over all PLs with lower `priority` and for each of them:
     (a) take a lock on that PL
     (b) check if it has enough seats so that we can borrow to satisfy our request
     (c) if so - borrow and return
     (d) release the lock on that PL

In the above approach, we don't really need a global lock - we still lock exactly one PL at a time, which seems to be what we need.
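A minimal Go sketch of the iteration described above, assuming hypothetical `plSet`/`priorityLevel` types and simple seat counters; it illustrates the locking pattern only, not the actual APF implementation:

```go
package main

import "sync"

// priorityLevel is a hypothetical stand-in for one PL's QueueSet
// state; field names are illustrative.
type priorityLevel struct {
	mu        sync.Mutex
	name      string
	freeSeats int
}

// plSet holds the priority levels in order; the slice itself is
// guarded by its own RWMutex, matching point (1) above.
type plSet struct {
	mu     sync.RWMutex
	levels []*priorityLevel // sorted by increasing `priority` field value
}

// tryBorrow walks the PLs strictly below index `from` and takes
// `seats` from the first one that can lend them. At most one PL mutex
// is held at any moment (plus the read lock on the set), so no global
// write lock is needed on the dispatch path.
func (s *plSet) tryBorrow(from, seats int) *priorityLevel {
	s.mu.RLock() // cheap: PLs are added/removed rarely
	defer s.mu.RUnlock()
	for _, pl := range s.levels[from+1:] { // lower-priority PLs only
		pl.mu.Lock()                // (a) take a lock on that PL
		if pl.freeSeats >= seats { // (b) enough seats to lend?
			pl.freeSeats -= seats // (c) borrow and return
			pl.mu.Unlock()
			return pl
		}
		pl.mu.Unlock() // (d) release and try the next lower PL
	}
	return nil // no lower PL could lend; the request stays queued
}

func main() {
	s := &plSet{levels: []*priorityLevel{
		{name: "system", freeSeats: 0},
		{name: "workload-low", freeSeats: 5},
	}}
	if pl := s.tryBorrow(0, 3); pl != nil {
		println("borrowed from", pl.name)
	}
}
```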
Based on our discussion in the meeting today, I changed the proposal to have the higher priority level sequentially consider borrowing from each lower priority level. And I also added an explicit statement that all the seats are taken from one priority level. I also added some words about how little the existing global lock is used.
> (higher may borrow from lower, and lower does not actively lend to
> higher) and in a very limited quantity: an attempt to dispatch for one
> priority level will consider borrowing from just one other priority
> level (if there is any lower priority level at all).
Let me clarify this one - are you saying that to dispatch a request at a PL it can:
(a) use the unoccupied seats from its own PL and additionally borrow from 1 more PL?
(b) if it's borrowing from some other PL, it cannot use unoccupied seats from its own PL?
I hope it's (a), but it's unclear to me. And we need to clarify it here.
OK - it seems that it's actually (b) based on line 729.
I'm not entirely sure I like it yet, but maybe that's acceptable. I need to think about it more.
Revised to answer this explicitly.
> can be done without acquiring mutexes). The request is executed at
> the chosen lower level (occupying some of its seats) if the request
> can be dispatched exactly at that level according to the rule above.
One thing that I'm missing is how the trigger for dispatch changes:

before:
(1) when a request releases its seats after processing, we look for requests to dispatch only at its PL

after:
(2) when a request releases its seats after processing, it may not only mean that some other request at the same PL is now dispatchable; it may also mean that requests from some higher PL become dispatchable now

So I think that we need to decide (see the sketch after this list):
(a) what do we prioritize? [I would say that if there is a request from the given PL that we can now dispatch, we dispatch that one]
(b) if we can't dispatch a request from the same PL, do we try to dispatch requests from higher PLs? And if so, in what order? [I would say that we should do that if and only if our PL is empty (no waiting requests), starting from the highest-priority PL]
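The sketch referenced in the list above: a hedged Go illustration of preferences (a) and (b), with hypothetical `pl` fields and counters standing in for the real queue state:

```go
package main

import (
	"fmt"
	"sync"
)

// pl is a hypothetical stand-in for one priority level's queue state;
// names and fields are illustrative, not the actual APF code.
type pl struct {
	mu        sync.Mutex
	name      string
	waiting   int // requests queued at this level
	freeSeats int // seats currently unused at this level
}

// onSeatsReleased runs after a request at levels[idx] finishes and
// returns `seats`. levels is ordered highest logical priority first.
// It encodes the suggested preference: (a) if the same PL has waiting
// requests, dispatch there first; (b) only if that PL is empty, scan
// higher-priority PLs from the top and let the first one with a
// waiting request borrow the freed seats.
func onSeatsReleased(levels []*pl, idx, seats int) {
	own := levels[idx]
	own.mu.Lock()
	own.freeSeats += seats
	if own.waiting > 0 { // (a) same-PL requests win
		own.waiting--
		own.freeSeats -= seats
		own.mu.Unlock()
		fmt.Printf("dispatched locally at %s\n", own.name)
		return
	}
	own.mu.Unlock() // our queues are empty; consider lending upward

	for _, h := range levels[:idx] { // (b) highest priority first
		h.mu.Lock()
		hasWaiter := h.waiting > 0
		if hasWaiter {
			h.waiting--
		}
		h.mu.Unlock()
		if hasWaiter {
			// Optimistically re-take our own lock to lend the seats;
			// they might have been taken meanwhile, which real code
			// would have to re-check.
			own.mu.Lock()
			own.freeSeats -= seats
			own.mu.Unlock()
			fmt.Printf("%s borrowed %d seat(s) from %s\n", h.name, seats, own.name)
			return
		}
	}
}

func main() {
	levels := []*pl{
		{name: "system", waiting: 1}, // higher-priority PL with a queued request
		{name: "workload-low"},       // the PL whose request just finished
	}
	onSeatsReleased(levels, 1, 1)
}
```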
Revised to explicitly address this.
I'm not convinced - as I mentioned above in (b), if our PL becomes empty (all queues are empty), then what we could in theory do is:
(1) release the lock on it
(2) iterate from the highest priority level and, when we find a PL that has a non-empty queue, try to borrow from lower levels on its behalf
That won't cause inversion and could potentially let us dispatch more.
The only question is how often that happens and, as a result, whether it is worth it...
Remember that the queue being dispatched from is in a priority level, so I think the lock on that priority level has to be held until a dispatching decision is made.
So what I'm suggesting is:
- let's say we released all seats on a given PL and we have an empty queue there
- we forget about the queue
- but we start from the highest priority level and try to dispatch from those queues (just optimistically assuming that our queue will still have free seats) - yes, you will have to take a lock on our initial queue again, but that sounds acceptable if there are requests to dispatch in higher PLs
> yes, you will have to take a lock on our initial queue again, but that sounds acceptable if there are requests to dispatch in higher PLs
This matches what I would expect.
> | node-high     | 40  | 10 | 400  |
> | system        | 30  | 10 | 600  |
> | workload-high | 40  | 20 | 1000 |
> | workload-low  | 100 | 90 | 8000 |
Workload low is relatively widely used by different controllers. So I think it should have higher priority. More like 2000 in my opinion. And potentially also lower lendable shares?
Revised and added an explanation. A new thought occurred during revision: what happens when reading a pre-existing v1beta2.PriorityLevelConfiguration object that was stored before the new fields were introduced?
> in the order are considered to have lower priority.
>
> Borrowing is done on the basis of the current situation, with no
> consideration of opportunity cost and no pre-emption when the
I can imagine situations where this could be problematic. We may want to (let humans) disallow PLs with many slow requests from borrowing. (still reading so maybe you already solve this)
No, the proposal does not address this issue. I am not yet sure what I think about it.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MikeSpreitzer. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Following up on a comment in-line: we would be less constrained if we introduced a

That doesn't help much, because we still have to be able to round-trip via v1beta2 and get the same thing. I'm not an API expert, so I would defer to @liggitt or @lavalamp but:

@wojtek-t is correct, that is not easier. The apiserver has to be able to round-trip all fields in all versions to all other versions. We usually do that by adding the fields to older version definitions as normal fields. The few places we've tried to round-trip via serialized JSON in annotations to older versions have really not gone well (lots of bugs with invalid data being allowed into annotations and breaking conversion).
> | leader-election | 10  | 0  | 150 |
> | node-high       | 40  | 10 | 400 |
> | system          | 30  | 10 | 500 |
> | workload-high   | 40  | 20 | 833 |
I wouldn't pick this number - why not a rounded one, like 800?
@wojtek-t : regarding both of your comments about priority values: the proposal here is to set exactly the same values as prescribed by the rule for behavior when the value is zero. Let's first finish the discussion of the overall way to get the new fields added; the concern here is really a corollary to that.
I'm talking here about priority itself (not the shares). The priority field value is orthogonal to the discussion we're having above imho.
> | node-high     | 40  | 10 | 400  |
> | system        | 30  | 10 | 500  |
> | workload-high | 40  | 20 | 833  |
> | workload-low  | 100 | 90 | 9000 |
I think I mentioned it somewhere, but my comment got lost - workload-low is used by different components in different setups, so I would make it much higher priority, maybe 1000?
Still not solved.
Also - I would like to avoid numbers like 833 - let's try to use more rounded numbers.
We need to define what happens for a PriorityLevelConfiguration that does not have its Priority field set. This PR has a proposal for that. The values you see in this table equal the values that come from that proposal. I think it is good to have consistency between the designed Priority values for the in-tree PLCs and the defaults (which will initially apply to out-of-tree PLCs).
I don't understand your comment.
What matters for us is the order between PLs, not the absolute numbers. So what I'm saying is that, given that, let's avoid numbers like 833 and let's make workload-low higher priority, because I would expect user-originating PLs to come after workload-low.
Let me make sure I understand the considerations around adding new fields. We want good behavior from a system that:

Have I got that right? The part about old clients means we drag a really heavy boat anchor. I am thinking that what I wrote above does not exclude old clients that read, modify locally, and PUT a whole replacement value.
You won't get full protection from an old client making an Update call and overwriting something written by the new client. And we don't have a goal of preventing that. Well-behaving clients should be using Patch(), which would prevent the above. Also - we shouldn't try to reinvent the wheel - we've been adding fields to different APIs gazillions of times in the past, and API approvers already know what works best; we should follow that path.
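To illustrate the Patch-vs-Update point with client-go (the clientset value `cs` and the particular field being changed are assumptions for the example; this assumes `Spec.Limited` is non-nil, i.e. a Limited level):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// updateShares shows the read-modify-write (Update) pattern: an old
// client's decode drops fields its types don't know about, and the
// subsequent PUT then overwrites them on the server.
func updateShares(ctx context.Context, cs kubernetes.Interface) error {
	plc, err := cs.FlowcontrolV1beta2().PriorityLevelConfigurations().
		Get(ctx, "workload-low", metav1.GetOptions{})
	if err != nil {
		return err
	}
	plc.Spec.Limited.AssuredConcurrencyShares = 50
	_, err = cs.FlowcontrolV1beta2().PriorityLevelConfigurations().
		Update(ctx, plc, metav1.UpdateOptions{})
	return err
}

// patchShares shows the well-behaving pattern: the patch names only
// the field being changed, so fields this client doesn't know about
// survive untouched.
func patchShares(ctx context.Context, cs kubernetes.Interface) error {
	patch := []byte(`{"spec":{"limited":{"assuredConcurrencyShares":50}}}`)
	_, err := cs.FlowcontrolV1beta2().PriorityLevelConfigurations().
		Patch(ctx, "workload-low", types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```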
@kikisdeliveryservice: why the title change here?
Replaced lendable concurrency shares with borrowable percent. Explained a motivation for the rule for default priority behavior.
> // of every other Limited priority level.
> // This field has a default value of 30.
> // +optional
> AssuredConcurrencyShares int32
Do we want to say that when graduating to v1 (or v1beta3, but hopefully that won't be needed), we will rename it to X (TotalConcurrencyShares - but I think you had a better name)?
OK - I see that you're mentioning it below with: NominalConcurrencyShares
Would it make sense to extend the comment above and say that this field will be renamed to NominalConcurrencyShares in the v1 version?
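For concreteness, a sketch of how that extended comment might read; the wording is illustrative, not the KEP's final text, and the elided description comes from the quoted diff above:

```go
// LimitedPriorityLevelConfiguration sketch: only the relevant field is shown.
type LimitedPriorityLevelConfiguration struct {
	// ... (existing description) ... of every other Limited priority level.
	// This field has a default value of 30.
	// Note: this field is expected to be renamed to
	// NominalConcurrencyShares when this API graduates to v1.
	// +optional
	AssuredConcurrencyShares int32
}
```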
friendly ping
KEP number in the PR title is standard for PRs in this repo :)
> given moment. To avoid deadlock, there is a strict ordering on
> acquisition of those mutexes (namely, decreasing logical priority).
>
> We could take a complementary approach, in which lower actively lends
This is a good observation of the current state. I'm wondering how we would choose to configure priority levels after this. Would we choose to reserve large portions for the cluster-critical services, give workload-low a low concurrency, and simply let it borrow from the higher levels when there is space available?
That's actually a good point; I was thinking about that at some point but then forgot.
This is actually a problem in the current design, where we're explicitly saying in lines 733-734 that we're not considering borrowing from higher levels to lower levels.
So David - what you're suggesting here will not happen in the current design.
The way I would like it to work is indeed how you're saying: "we reserve the large portion for cluster stuff that we may potentially need, and then let other to borrow from that if those are underutilized". But that implies that cluster stuff is lowest level, which is counter-intuitive...
OTOH, I really don't want to allocate a small fraction from cluster and let it borrow from other levels in case cluster is hugely utilized - that would be counter-productive.
Strawman (although I just came up with this idea and have no idea if it makes sense yet): what if the priority only defines the ordering (then maybe we don't even need it?) but in fact every level can borrow from any other?
What we would need then, though, is more like the rule:
"if there are requests at a given PL that could consume capacity from it, then these have precedence over any other requests that would like to borrow the capacity"...
With the above, by giving a good chunk of capacity to cluster stuff we ensure that it will have capacity when we need it, and it's also borrowable.
And FWIW - I think we don't need the "priority" field at all (we can define ordering based on namespace/name or sth) and from that perspective it would be closer to what Daniel was proposing (i.e. we conceptually give capacity to others; it's just that we still dispatch only from one level, not from the whole bag?)
@MikeSpreitzer: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Closing in favor of #3391