
Conversation

pchila
Member

@pchila pchila commented Jul 1, 2025

What does this PR do?

This PR introduces a new --rollback flag to the elastic-agent upgrade command that switches back to a previous elastic-agent installation, effectively rolling back an upgrade.
This PR makes the elastic-agent main process "take over" the watcher applocker in order to write the rollback request before running the watcher again, which then performs the rollback.
The Upgrader now receives a rollback boolean which triggers:

  1. a takeover of the watcher applocker
  2. extraction of the versionedHome of the version we want to roll back to
  3. launch of a new elastic-agent watch --rollback command, which takes care of the actual rollback, restart, and cleanup of the agent

On the watcher side, the watch loop is now interruptible so that the watcher can exit gracefully.

This PR also changes the way the watcher process is launched and signaled on Windows, which is needed to implement any sort of graceful shutdown. The gist of it is that watcher processes are now always launched while holding a Windows console (even when the elastic-agent service does not have one). A watcher process can be terminated gracefully with elastic-agent watch --takedown, which connects to the watcher console and sends a Ctrl+Break event, cutting a watch operation short without side effects.

Why is it important?

This is the initial implementation of a manual rollback flow.
In this first iteration, the rollback only works during the grace period (that is, while the watcher is still running); follow-up PRs will extend this functionality to cover the full agent.upgrade.rollback.window.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • [ ] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

  1. Package elastic agent twice from this PR:
     SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz PLATFORMS="linux/amd64" mage -v package
     AGENT_PACKAGE_VERSION="9.2.0+20250701000000" SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz PLATFORMS="linux/amd64" mage -v package
  2. Install the version 9.2.0-SNAPSHOT as usual.
  3. Set a rollback window duration > 0, for example 2 hours:
     agent.upgrade:
       rollback:
         window: 2h
  4. Trigger an upgrade to the other package (saved on disk):
     elastic-agent upgrade --skip-verify --source-uri=file:///vagrant/build/distributions 9.2.0+20250701000000-SNAPSHOT
  5. Wait for the new agent to come online and the upgrade details to signal the UPG_WATCHING state.
  6. Manually roll back to the previous version:
     elastic-agent upgrade --rollback 9.2.0-SNAPSHOT

Notes:

  • Trying to roll back after the grace period does not work and may break the agent install (this is because, at the end of the grace period, the watcher is still cleaning up the upgrade marker and the previous install)

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@mergify mergify bot assigned pchila Jul 1, 2025
Contributor

mergify bot commented Jul 1, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-\d.\d is the label that automatically backports to the 8.\d branch, where \d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@pchila pchila force-pushed the lock-free-manual-rollback branch 2 times, most recently from a7e6486 to 33bfe58 Compare July 11, 2025 07:09
Contributor

mergify bot commented Jul 11, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b lock-free-manual-rollback upstream/lock-free-manual-rollback
git merge upstream/main
git push upstream lock-free-manual-rollback

@pchila pchila force-pushed the lock-free-manual-rollback branch 2 times, most recently from 764b7ad to e4f6b45 Compare July 15, 2025 16:41
@pchila pchila changed the title [DO NOT MERGE] - Lock free manual rollback PoC Lock free manual rollback Jul 25, 2025
@pchila pchila changed the title Lock free manual rollback Elastic Agent upgrade: lock-free manual rollback Jul 25, 2025
@pchila pchila force-pushed the lock-free-manual-rollback branch 3 times, most recently from 6873b56 to 3901e20 Compare July 30, 2025 14:54
@pchila pchila linked an issue Aug 1, 2025 that may be closed by this pull request
@pchila pchila added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team enhancement New feature or request labels Aug 1, 2025
@pchila pchila force-pushed the lock-free-manual-rollback branch from 64c5b4c to 251de28 Compare August 6, 2025 17:22
@pchila pchila force-pushed the lock-free-manual-rollback branch 2 times, most recently from ace5304 to 9d0725c Compare August 15, 2025 10:31
@pchila pchila force-pushed the lock-free-manual-rollback branch 3 times, most recently from 6e5c7e0 to 1f936ce Compare August 28, 2025 06:29
ycombinator previously approved these changes Sep 16, 2025
Contributor

@pkoutsovasilis pkoutsovasilis left a comment


@pchila I did some manual testing of this, and I noticed that if I hit the rollback cmd during either UPG_DOWNLOADING or UPG_EXTRACTING I get an error that it cannot stat the upgrade marker. However, this recovers after a while but I feel that maybe the rollback should be allowed only in UPG_WATCHING?

PS: If it makes any difference I did find another bug for my merged PR #9634 which I am gonna open a fix about 😢

@pchila
Member Author

pchila commented Sep 16, 2025

@pchila I did some manual testing of this, and I noticed that if I hit the rollback cmd during either UPG_DOWNLOADING or UPG_EXTRACTING I get an error that it cannot stat the upgrade marker. However, this recovers after a while but I feel that maybe the rollback should be allowed only in UPG_WATCHING?

@pkoutsovasilis
Good catch! That is an actual race condition we should protect ourselves from.
Added a state check in f05a783
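A state gate of this kind could look roughly like the following sketch: a manual rollback is only accepted while the upgrade details report UPG_WATCHING. The function and constant names are hypothetical (only the state names come from this thread), so this illustrates the check rather than reproducing commit f05a783:

```go
package main

import (
	"errors"
	"fmt"
)

// upgradeState mirrors the upgrade-details states discussed above.
type upgradeState string

const (
	stateDownloading upgradeState = "UPG_DOWNLOADING"
	stateExtracting  upgradeState = "UPG_EXTRACTING"
	stateRestarting  upgradeState = "UPG_RESTARTING"
	stateWatching    upgradeState = "UPG_WATCHING"
)

var errRollbackNotAllowed = errors.New("manual rollback is only allowed while the upgrade watcher is running")

// checkRollbackAllowed rejects a manual rollback unless the watcher is
// already watching, avoiding the race where the upgrade marker is still
// being written or the agent is mid-restart.
func checkRollbackAllowed(s upgradeState) error {
	if s != stateWatching {
		return fmt.Errorf("%w (current state: %s)", errRollbackNotAllowed, s)
	}
	return nil
}

func main() {
	for _, s := range []upgradeState{stateDownloading, stateWatching} {
		if err := checkRollbackAllowed(s); err != nil {
			fmt.Printf("%s: rejected\n", s)
		} else {
			fmt.Printf("%s: allowed\n", s)
		}
	}
}
```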

@elasticmachine
Collaborator

💚 Build Succeeded

History

cc @pchila

Contributor

@pkoutsovasilis pkoutsovasilis left a comment


ok after the last commit I see that sudo elastic-agent upgrade --rollback 9.2.0-SNAPSHOT doesn't execute on states UPG_EXTRACTING and UPG_DOWNLOADING. However, if I snipe the command when the state is at UPG_RESTARTING this executes and prints this message Rollback triggered to version , Elastic Agent is currently restarting. Then when agent actually restarts and I re-issue the rollback cmd I get Error: Failed trigger upgrade of daemon: no rollbacks available

@pchila
Member Author

pchila commented Sep 17, 2025

ok after the last commit I see that sudo elastic-agent upgrade --rollback 9.2.0-SNAPSHOT doesn't execute on states UPG_EXTRACTING and UPG_DOWNLOADING. However, if I snipe the command when the state is at UPG_RESTARTING this executes and prints this message Rollback triggered to version , Elastic Agent is currently restarting. Then when agent actually restarts and I re-issue the rollback cmd I get Error: Failed trigger upgrade of daemon: no rollbacks available

@pkoutsovasilis
The upgrades/rollbacks triggered from the command line and executed by the GRPC server do not have any synchronization/queuing; they execute concurrently.
What is happening is that as soon as a watcher starts watching after an upgrade, it writes the state in the update marker, which in turn updates the upgrade details to details.StateWatching, and that allows a rollback request to execute even if the agent is about to restart.
The rollback code cannot tell whether the restart has already happened, so it tries to execute a rollback while things are still in flux, ending up with weird behaviors like the one you described.

Will have a look and check what options are there to close this loophole without rewriting the whole thing 😉

@pchila
Member Author

pchila commented Sep 17, 2025

@pkoutsovasilis
I tried to reproduce the issue and noticed that the error message comes from

fmt.Fprintf(input.streams.Out, "Upgrade triggered to version %s, Elastic Agent is currently restarting\n", version)

and that the command interprets the agent's GRPC server shutting down as the operation completing correctly (this already happens in main for upgrade commands due to #4519).

In my tests I can reproduce the error message, but I checked and the rollback is not actually performed: the agent restarts after the upgrade, and I can then still roll back correctly to the previous version by reissuing the command (the install is not corrupted or broken from what I can see).

root@2225b2f69a3c:/# elastic-agent upgrade --rollback 9.2.0-SNAPSHOT
Rollback triggered to version , Elastic Agent is currently restarting
root@2225b2f69a3c:/# elastic-agent version
Binary: 9.2.0+20250917-SNAPSHOT (build: f05a783e904b72bd6e281a068e65c2268a4eac93 at 2025-09-17 09:39:37 +0000 UTC)
Daemon: 9.2.0+20250917-SNAPSHOT (build: f05a783e904b72bd6e281a068e65c2268a4eac93 at 2025-09-17 09:39:37 +0000 UTC)
root@2225b2f69a3c:/# elastic-agent upgrade --rollback 9.2.0-SNAPSHOT
Rollback triggered to version 9.2.0-SNAPSHOT, Elastic Agent is currently restarting
root@2225b2f69a3c:/#

Contributor

@pkoutsovasilis pkoutsovasilis left a comment


Thanks for taking a look together with me on this @pchila 🙏 After our sync, I’m fairly sure I simply forgot to add the configuration bits required to have available rollbacks, which explains why I was getting the no rollbacks available message.

I do agree that the current Elastic Agent is currently restarting output isn’t very helpful in the context of rollback - it’s not clear whether the rollback will eventually happen or not. That said, since re-issuing the command after restart works as expected and the install isn’t left in a broken state, this LGTM ✅

@pchila pchila merged commit 9bc4e8f into elastic:main Sep 17, 2025
23 checks passed
@cmacknz
Member

cmacknz commented Sep 17, 2025

Thanks all for the extra attention to manual testing this one got before merge :)

cmacknz pushed a commit that referenced this pull request Sep 18, 2025
* Add rollback field to UpgradeRequest

* Introduce rollback parameter to upgrade

* Concurrently retry taking over watcher

* Gracefully shutdown agent watcher

* Add rollbacks available to upgrade marker

* disable rollback window by default

* Add formal checks to manual rollback arguments

* Add minimum version check for creating rollbacks entries in update marker

* Gracefully terminate watcher process on windows

* Allow watcher to listen to signals only during watch loop

* make watcher rollback only if the agent has not been already rolled back

* Remove parent death signal for watcher on linux

* Distinguish between upgrade and rollback operations in upgrade subcommand

* remove DESIRED_OUTCOME in favor of watch --rollback

* Add version agent rollbacks to in manual rollback reason

* Check upgrade details state before allowing a manual rollback
@ebeahan
Member

ebeahan commented Sep 18, 2025

🚀

intxgo pushed a commit to intxgo/elastic-agent that referenced this pull request Sep 24, 2025
Labels

  • enhancement (New feature or request)
  • skip-changelog
  • Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

8 participants