Skip to content

Conversation

@tgross
Copy link
Member

@tgross tgross commented Oct 28, 2025

When using deployments, the deployment watcher emits all the evals we expect to see to step through the deployment once each allocation becomes healthy for min_healthy_time. But we left the old rolling-update eval code in place and with the new job parsing of the update block that ends up emitting a eval immediately with a zero wait time and we don't end up respecting the min_healthy_time.

This is all driven from the deployment watcher now, so we can simply remove it from the scheduler.

Testing & Reproduction steps

Run two nodes. Run the following jobspec.

jobspec
job "httpd" {
  type = "system"

  group "web" {

    network {
      mode = "bridge"
      port "www" {
        to     = 8001
      }
    }

    update {
      max_parallel = 1
      min_healthy_time = "15s"
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-v", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      env {
        version = "8"
      }

      resources {
        cpu    = 128
        memory = 128
      }
    }
  }
}

Wait until the initial deployment is done and then destructively update (ex. incrementing task.env.version). Previously an eval triggered by rolling-update would run and you'd see both new allocs immediately. Now we wait 15s after the first alloc becomes healthy and then the new alloc is created via an eval triggered by the deploymentwatcher.


  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

When using deployments, the deployment watcher emits all the evals we expect to
see to step through the deployment once each allocation becomes healthy for
`min_healthy_time`. But we left the old rolling-update eval code in place and
with the new job parsing of the `update` block that ends up emitting a eval
immediately with a zero wait time and we don't end up respecting the
`min_healthy_time`.

This is all driven from the deployment watcher now, so we can simply remove it
from the scheduler.

// If the limit of placements was reached we need to create an evaluation
// to pickup from here after the stagger period.
if s.limitReached && s.nextEval == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT nextEval is now always nil ... can we just remove it? I removed it on main and everything compiled, but I hit a test failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise it appears limitReached is no longer used which may allow for some removal of code from evictAndPlace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed nextEval from this scheduler but not from the setStatus function params where it's consumed, because the generic scheduler writes evals there still. I want to leave limitReached in place because I know @pkazmierczak is busy working on that area of code; we'll clean it up afterwards.

Copy link
Contributor

@pkazmierczak pkazmierczak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and as mentioned, we can clean up further when #26953 lands.

@tgross tgross merged commit c61bc83 into main Oct 29, 2025
39 checks passed
@tgross tgross deleted the systemdeployments-no-rolling-evals branch October 29, 2025 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants