Elastic Agent upgrade: lock-free manual rollback #8767
@pchila I did some manual testing of this, and I noticed that if I hit the rollback cmd during either `UPG_DOWNLOADING` or `UPG_EXTRACTING` I get an error that it cannot stat the upgrade marker. This recovers after a while, but I feel that maybe the rollback should be allowed only in `UPG_WATCHING`?
PS: If it makes any difference, I did find another bug in my merged PR #9634, which I am going to open a fix for 😢
OK, after the last commit I see that `sudo elastic-agent upgrade --rollback 9.2.0-SNAPSHOT` doesn't execute in states `UPG_EXTRACTING` and `UPG_DOWNLOADING`. However, if I snipe the command when the state is `UPG_RESTARTING`, it executes and prints this message: `Rollback triggered to version , Elastic Agent is currently restarting`. Then, when the agent actually restarts and I re-issue the rollback cmd, I get `Error: Failed trigger upgrade of daemon: no rollbacks available`.
@pkoutsovasilis Will have a look and check what options there are to close this loophole without rewriting the whole thing 😉
@pkoutsovasilis In my tests I can reproduce the error message, but I checked and the rollback is not actually performed: instead the agent restarts after the upgrade, and then I can still roll back correctly to the previous version by reissuing the command (the install is not corrupted or broken from what I can see).
root@2225b2f69a3c:/# elastic-agent upgrade --rollback 9.2.0-SNAPSHOT
Rollback triggered to version , Elastic Agent is currently restarting
root@2225b2f69a3c:/# elastic-agent version
Binary: 9.2.0+20250917-SNAPSHOT (build: f05a783e904b72bd6e281a068e65c2268a4eac93 at 2025-09-17 09:39:37 +0000 UTC)
Daemon: 9.2.0+20250917-SNAPSHOT (build: f05a783e904b72bd6e281a068e65c2268a4eac93 at 2025-09-17 09:39:37 +0000 UTC)
root@2225b2f69a3c:/# elastic-agent upgrade --rollback 9.2.0-SNAPSHOT
Rollback triggered to version 9.2.0-SNAPSHOT, Elastic Agent is currently restarting
root@2225b2f69a3c:/#
Thanks for taking a look at this together with me @pchila 🙏 After our sync, I’m fairly sure I simply forgot to add the configuration bits required to have available rollbacks, which explains why I was getting the `no rollbacks available` message.
I do agree that the current `Elastic Agent is currently restarting` output isn’t very helpful in the context of rollback - it’s not clear whether the rollback will eventually happen or not. That said, since re-issuing the command after restart works as expected and the install isn’t left in a broken state, this LGTM ✅
Thanks all for the extra attention to manual testing this one got before merge :)
* Add rollback field to UpgradeRequest
* Introduce rollback parameter to upgrade
* Concurrently retry taking over watcher
* Gracefully shutdown agent watcher
* Add rollbacks available to upgrade marker
* disable rollback window by default
* Add formal checks to manual rollback arguments
* Add minimum version check for creating rollbacks entries in update marker
* Gracefully terminate watcher process on windows
* Allow watcher to listen to signals only during watch loop
* make watcher rollback only if the agent has not been already rolled back
* Remove parent death signal for watcher on linux
* Distinguish between upgrade and rollback operations in upgrade subcommand
* remove DESIRED_OUTCOME in favor of watch --rollback
* Add version agent rollbacks to in manual rollback reason
* Check upgrade details state before allowing a manual rollback
What does this PR do?
This PR introduces a new `--rollback` flag to the `elastic-agent upgrade` command that will switch back to a previous elastic-agent installation, effectively rolling back an upgrade.
This PR makes the elastic agent main process "take over" the watcher applocker to write the rollback request before running the watcher again, which will perform the rollback.
The Upgrader now receives a `rollback` boolean which will trigger:
* `elastic-agent watch --rollback` command which will take care of the actual rollback, restart and cleanup of the agent
On the watcher side, the watch loop is now interruptible so that the watcher can exit gracefully.
This PR also changes the way the watcher process is launched and signaled on Windows, which is needed to implement any sort of graceful shutdown. The gist of it is that the watcher processes are now always launched while holding a Windows console (even when the elastic-agent service does not have one). The watcher processes can be terminated gracefully by using `elastic-agent watch --takedown`, which will connect to the watcher console and send a `Ctrl+Break` event, cutting short a watch operation without side effects.
Why is it important?
This is the initial implementation for a manual rollback flow.
In this first iteration, the rollback only works during the grace period (that is, while the watcher is still running); in follow-up PRs this functionality will be extended for the duration of `agent.upgrade.rollback.window`.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[ ] I have added an entry in `./changelog/fragments` using the changelog tool
[ ] I have added an integration test or an E2E test
Disruptive User Impact
How to test this PR locally
`SNAPSHOT=true EXTERNAL=true PACKAGES=tar.gz PLATFORMS="linux/amd64" mage -v package`
`9.2.0-SNAPSHOT` as usual
Notes:
Related issues
`rollback_window` is set #6880
`--rollback` option to `elastic-agent` upgrade subcommand #6887
`UPG_WATCHING` state #6890
Questions to ask yourself