-
Notifications
You must be signed in to change notification settings - Fork 188
fix: scheduled upgrade details state #9562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: scheduled upgrade details state #9562
Conversation
8e685fc
to
978475c
Compare
💔 Build Failed
Failed CI StepsHistory
|
978475c
to
a0c4283
Compare
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
Tested this PR with the given instructions. I see the scheduled upgrade being reported both in the Fleet UI and in the output of I restarted Elastic Agent three different ways after that: by running Before the scheduled time arrived, I canceled the upgrade via the Fleet UI. I see that the Reviewing the code changes now... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good, just a two nits that I found.
Overall the testing looks good. The integration testing is good. I do have a request for an extra integration test. I didn't see a test case for scheduling an upgrade action, the action running, it failing, then scheduling a new action not scheduled, that replaces that action with the same version and URI. I believe that could be added with an upgrade image behind a transparent proxy, where the first action always fails to download, and then the following up action it allows it to succeed.
Thanks for the review @blakerouse - appreciate the close look! On the extra integration test: In general I agree more coverage around upgrades is always valuable. That said the flow you’re describing kicks in after an upgrade action is dispatched. At that point the action leaves the dispatcher’s action queue and becomes part of bkgActions slice which is managed by the upgrader. So it isn’t exercising the dispatcher changes in this PR. I’d like to keep this PR scoped to "actions that go through the dispatcher" where the related issues are, and avoid adding an integration test that primarily validates coordinator behavior here. I can, however, add a unit test in this PR that forces the upgrade action handler to fail initially, then schedules a new upgrade action (same version/URI) and verifies the dispatcher dispatches the new one - matching the spirit of your ask while staying within dispatcher boundaries. WDYT? |
@pkoutsovasilis A unit test for that would work. |
…ly the expiration of retried stored actions, and update upgrade details on retries
c976e06
to
cabe8c7
Compare
@blakerouse @ycombinator while working on the unit test we agreed on, I ran into a few more issues (story of my life
We now keep only the first encountered upgrade action in input order. I’m not sure how/why fleetgateway can send multiple upgrade actions, but it causes problems. Example: if we get both an immediate upgrade and a scheduled upgrade, which one should set the upgrade details? And what if the scheduled one gets dispatched in the next window while the immediate one is still running? That feels like something that shouldn’t be allowed, but I’d like your thoughts @blakerouse @ycombinator
Previously (code link), retried immediate upgrade actions (no expiration) were incorrectly treated as expired because
The dispatcher now updates upgrade details on retry, including the error from the prior attempt. Unfortunately, Fleet UI still shows only the
Learned here that the coordinator upgrader can skip setting upgrade details (e.g. if the action is ignored because it’s the same version/URI). Without resetting, the last dispatcher-set upgrade details would “stick” to the agent and, since the action gets ack’ed, there’d be no cancellation path. Fix: whenever the dispatcher sees a dispatchable upgrade action, it sets upgrade details to Now, all that said, I think there’s another upcoming issue: in PR #8767, available rollback versions are becoming part of the upgrade details. That can create issues since the dispatcher knows nothing about which versions are available for rollback, so it could incorrectly set upgrade details with these missing. From what I understand now about how upgrade details flow through the code, rollback versions don’t really belong in upgrade details - they should probably be part of the check-in request body instead. Fleet UI could then combine those with upgrade details to decide whether to show a rollback option, similar to the existing flow that Fleet UI decides that there is an available upgrade for an agent. @pchila @ebeahan let's discuss about that - maybe I am missing something |
|
@pkoutsovasilis agreed, looking at the upgrade details' behavior that this PR is dealing with for scheduled upgrades and retries I feel like the list of available rollbacks should not be in the details metadata (that location was chosen initially as a "convenient" spot since that was already part of agent status and check-in payload). Let's discuss offline with @ebeahan involvement how to avoid introducing more complexity on the upgrade details functionality. |
@pkoutsovasilis Great that these extra issues where caught now, with added unit tests. The changes adding in that commit look good, very well commented.
I don't think there is really away of having Fleet prevent this. I could see two requests coming in near the same time one scheduled and the other not. To me the non-scheduled wins and should just happen, the scheduled one should just be ignored as the upgrade is now happening. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM — glad to have this fixed!
I still have one question but it's not blocking this PR. Depending on the answer, we can make a follow up PR.
@Mergifyio backport 8.17 8.18 8.19 9.0 9.1 |
✅ Backports have been created
|
* fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go # internal/pkg/agent/application/application.go # internal/pkg/agent/application/application_test.go # internal/pkg/agent/application/coordinator/coordinator.go # internal/pkg/agent/application/coordinator/coordinator_test.go # internal/pkg/agent/cmd/run.go
* fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/application/application.go # internal/pkg/agent/application/coordinator/coordinator.go # internal/pkg/agent/cmd/run.go
* fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/cmd/run.go
* fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/application/application.go # internal/pkg/agent/application/coordinator/coordinator.go # internal/pkg/agent/cmd/run.go
* fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/application/application.go # internal/pkg/agent/application/coordinator/coordinator.go # internal/pkg/agent/cmd/run.go
* fix: scheduled upgrade details state (#9562) * fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/application/application.go # internal/pkg/agent/application/coordinator/coordinator.go # internal/pkg/agent/cmd/run.go * fix: resolve conflicts * fix: define missing test helper func * fix: ignore QF1007 merge conditional assignment into variable declaration --------- Co-authored-by: Panos Koutsovasilis <[email protected]>
* fix: scheduled upgrade details state (#9562) * fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/application/application.go # internal/pkg/agent/application/coordinator/coordinator.go # internal/pkg/agent/cmd/run.go * fix: resolve conflicts * fix: define missing test helper func * fix: ignore QF1007 merge conditional assignment into variable declaration --------- Co-authored-by: Panos Koutsovasilis <[email protected]>
* fix: scheduled upgrade details state (#9562) * fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/cmd/run.go * fix: resolve conflicts * fix: define missing test helper func * fix: ignore QF1007 merge conditional assignment into variable declaration --------- Co-authored-by: Panos Koutsovasilis <[email protected]>
* fix: scheduled upgrade details state (#9562) * fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/application/application.go # internal/pkg/agent/application/coordinator/coordinator.go # internal/pkg/agent/cmd/run.go * fix: resolve conflicts * fix: define missing test helper func * fix: ignore QF1007 merge conditional assignment into variable declaration --------- Co-authored-by: Panos Koutsovasilis <[email protected]>
* fix: scheduled upgrade details state (#9562) * fix: persisting and reporting of upgrade details * ci: align and extend dispatcher unit-tests * ci: update coordinator and application new signatures in unit-tests * ci: add integration tests for scheduled upgrade details * doc: add changelog fragment * doc: reword existing and add more comments in code * feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{} * fix: remove redundant continue * fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries (cherry picked from commit ff80471) # Conflicts: # internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go # internal/pkg/agent/application/application.go # internal/pkg/agent/application/application_test.go # internal/pkg/agent/application/coordinator/coordinator.go # internal/pkg/agent/application/coordinator/coordinator_test.go # internal/pkg/agent/cmd/run.go * fix: resolve conflicts * fix: define missing test helper func --------- Co-authored-by: Panos Koutsovasilis <[email protected]>
What does this PR do?
This PR fixes how Elastic Agent persists and reports upgrade details across check-ins, restarts, cancellations, and expired scheduled upgrade actions.
Specifically:
Coordinator
is initialized with the correct upgrade details, either from the upgrade marker file or queued Fleet actions.ActionDispatcher
logic to:nil
when an upgrade action is dispatched without an error.testing/integration/ess/scheduled_upgrade_details_test.go
to verify that scheduled upgrade details are:All the new business logic is captured by 95ab857 (+234 -126 lines changed) and cabe8c7 (+174 -20 lines changed)
Why is it important?
Previously, scheduled upgrade details could be lost or incorrectly reported:
These inconsistencies caused confusion in Fleet UI and sometimes left users unable to perform upgrades.
With this fix, users now get a consistent and accurate view of upcoming, active, or failed upgrades across the entire Agent lifecycle.
Checklist
./changelog/fragments
using the changelog toolDisruptive User Impact
None expected. This change only improves correctness of reported upgrade details.
Users upgrading to this version will see more consistent and reliable upgrade state reporting in Fleet.
How to test this PR locally
AGENT_PACKAGE_VERSION=9.1.1 EXTERNAL=true SNAPSHOT=true PLATFORMS="linux/arm64" PACKAGES="tar.gz" mage package
(ty @ycombinator for the proposal)Screen.Recording.2025-08-26.at.5.12.03.PM.mov
Also I tested the approach mentioned here to force update an agent and everything is reflected correctly to the upgrade details
Screen.Recording.2025-08-27.at.12.42.49.AM.mov
(PS: For the needs of the videos above I compiled this PR with pseudo prior version 9.1.1)
Related issues