Skip to content

Conversation

@pkazmierczak
Copy link
Contributor

@pkazmierczak pkazmierczak commented Oct 15, 2025

Canaries for system jobs are placed on a tg.update.canary percent of eligible nodes. Some of these nodes may not be feasible, and until now we removed infeasible nodes during placement computation. However, if it happens to be that the first eligible node we picked to place a canary on is infeasible, this will lead to the scheduler halting deployment.

The solution presented here simplifies canary deployments: initially, system jobs that use canary updates get allocations placed on all eligible nodes, but before we start computing actual placements, a method called evictUnneededCanaries is called (much like evictAndPlace is for honoring MaxParallel) which removes those canary placements that are not needed. We also change the behavior of computePlacements which no longer performs node feasibility checks, as these are performed earlier for every allocation and node. This way we get accurate counts of all feasible nodes that let us correctly set deployment state fields.

Fixes: #26885
Fixes: #26886

@pkazmierczak pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from 838bcd8 to e1234c1 Compare October 28, 2025 08:30
@pkazmierczak pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from e1234c1 to a6cd581 Compare October 28, 2025 17:33
tgross added a commit that referenced this pull request Oct 28, 2025
Two groups on the same job cannot both have a static port assignment, but this
ends up getting configured in the update block test for system deployments. This
test setup bug has complicated landing the fix in #26953.
tgross added a commit that referenced this pull request Oct 28, 2025
Two groups on the same job cannot both have a static port assignment, but this
ends up getting configured in the update block test for system deployments. This
test setup bug has complicated landing the fix in #26953.
@pkazmierczak pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from 697d0ff to 0c70ac8 Compare October 28, 2025 18:29
pkazmierczak and others added 8 commits October 31, 2025 09:07
Fix some previously broken counts in the system deployments test and add
comments to most of the placement counts to make it a little easier to read the
test and verify it's correct.
If we leave empty keys in that map, the plan will not be considered a
no-op, which it otherwise should be.
@pkazmierczak pkazmierczak force-pushed the f-system-deployments-canaries-evict-refactor branch from 54b7110 to e51cce3 Compare October 31, 2025 08:08
@pkazmierczak pkazmierczak marked this pull request as ready for review October 31, 2025 08:10
@pkazmierczak pkazmierczak requested review from a team as code owners October 31, 2025 08:10
… state

Previously we copied the behavior found in the generic scheduler, where
we rely on reconciler results to decide if there's enough placements
made. In the system scheduler we always know exactly how many placements
there should be based on the DesiredTotal field of the deployment state,
so a better way to check completeness of the deployment is to simplify
it and base on dstate alone.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

4 participants