-
Notifications
You must be signed in to change notification settings - Fork 2k
scheduler: perform feasibility checks for system canaries before computing placements #26953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
pkazmierczak
wants to merge
17
commits into
main
Choose a base branch
from
f-system-deployments-canaries-evict-refactor
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+1,396
−834
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
tgross
reviewed
Oct 17, 2025
b057f84 to
88f1fb4
Compare
tgross
reviewed
Oct 23, 2025
tgross
reviewed
Oct 23, 2025
838bcd8 to
e1234c1
Compare
e1234c1 to
a6cd581
Compare
tgross
added a commit
that referenced
this pull request
Oct 28, 2025
Two groups on the same job cannot both have a static port assignment, but this ends up getting configured in the update block test for system deployments. This test setup bug has complicated landing the fix in #26953.
tgross
added a commit
that referenced
this pull request
Oct 28, 2025
Two groups on the same job cannot both have a static port assignment, but this ends up getting configured in the update block test for system deployments. This test setup bug has complicated landing the fix in #26953.
697d0ff to
0c70ac8
Compare
Fix some previously broken counts in the system deployments test and add comments to most of the placement counts to make it a little easier to read the test and verify it's correct.
If we leave empty keys in that map, the plan will not be considered a no-op, which it otherwise should be.
54b7110 to
e51cce3
Compare
jrasell
reviewed
Oct 31, 2025
… state Previously we copied the behavior found in the generic scheduler, where we rely on reconciler results to decide if there's enough placements made. In the system scheduler we always know exactly how many placements there should be based on the DesiredTotal field of the deployment state, so a better way to check completeness of the deployment is to simplify it and base on dstate alone.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Canaries for system jobs are placed on a
tg.update.canarypercent of eligible nodes. Some of these nodes may not be feasible, and until now we removed infeasible nodes during placement computation. However, if it happens to be that the first eligible node we picked to place a canary on is infeasible, this will lead to the scheduler halting deployment.The solution presented here simplifies canary deployments: initially, system jobs that use canary updates get allocations placed on all eligible nodes, but before we start computing actual placements, a method called
evictUnneededCanariesis called (much likeevictAndPlaceis for honoring MaxParallel) which removes those canary placements that are not needed. We also change the behavior ofcomputePlacementswhich no longer performs node feasibility checks, as these are performed earlier for every allocation and node. This way we get accurate counts of all feasible nodes that let us correctly set deployment state fields.Fixes: #26885
Fixes: #26886