Skip to content

Conversation

pkoutsovasilis
Copy link
Contributor

@pkoutsovasilis pkoutsovasilis commented Aug 25, 2025

What does this PR do?

This PR fixes how Elastic Agent persists and reports upgrade details across check-ins, restarts, cancellations, and expired scheduled upgrade actions.

Specifically:

  • (95ab857) Ensures the Coordinator is initialized with the correct upgrade details, either from the upgrade marker file or queued Fleet actions.
  • (95ab857, cabe8c7) Extends the ActionDispatcher logic to:
    • Deduplicate upgrade actions from the received fleetgateway actions by keeping only the first encountered upgrade in input order.
    • Correctly track when upgrade details need to be updated (new upgrade actions, cancellations, expirations, etc.).
    • Correctly handle upgrade action cancellations.
    • Correctly handle expired scheduled upgrade actions and keep persisting them until explicitly canceled or replaced by a newer upgrade action.
    • Correctly handle retried upgrade actions.
    • Correctly update upgrade details when an upgrade action is scheduled for a retry.
    • Correctly update the upgrade details to nil when an upgrade action is dispatched without an error.
  • (a16a9f3) Refactors and extends related dispatcher tests to include an actual action queue and cover new flows (queueing, canceling, expiring).
  • (3e7324e) Adds new integration tests under testing/integration/ess/scheduled_upgrade_details_test.go to verify that scheduled upgrade details are:
    • Reported correctly when received.
    • Reported correctly when expired.
    • Preserved across Agent restarts.
    • Cleared on cancel.

All the new business logic is captured by 95ab857 (+234 -126 lines changed) and cabe8c7 (+174 -20 lines changed)

Why is it important?

Previously, scheduled upgrade details could be lost or incorrectly reported:

  • They were not consistently preserved across Agent restarts.
  • Cancel actions did not clear the reported upgrade details.
  • Expired scheduled upgrade actions were treated as ephemeral and not persisted.

These inconsistencies caused confusion in Fleet UI and sometimes left users unable to perform upgrades.
With this fix, users now get a consistent and accurate view of upcoming, active, or failed upgrades across the entire Agent lifecycle.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None expected. This change only improves correctness of reported upgrade details.
Users upgrading to this version will see more consistent and reliable upgrade state reporting in Fleet.

How to test this PR locally

  1. Run the new integration tests
  2. Or build the code of this PR
  • Provision a 9.1.2 cloud stack
  • AGENT_PACKAGE_VERSION=9.1.1 EXTERNAL=true SNAPSHOT=true PLATFORMS="linux/arm64" PACKAGES="tar.gz" mage package (ty @ycombinator for the proposal)
  • Install and enroll the Agent to Fleet
  • Schedule an upgrade → check it is reported.
  • Restart the Agent → check details persist.
  • Cancel the upgrade → check details clear.
Screen.Recording.2025-08-26.at.5.12.03.PM.mov

Also I tested the approach mentioned here to force update an agent and everything is reflected correctly to the upgrade details

Screen.Recording.2025-08-27.at.12.42.49.AM.mov

(PS: For the needs of the videos above I compiled this PR with pseudo prior version 9.1.1)

Related issues

@pkoutsovasilis pkoutsovasilis self-assigned this Aug 25, 2025
@pkoutsovasilis pkoutsovasilis added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team skip-changelog backport-active-all Automated backport with mergify to all the active branches labels Aug 25, 2025
@pkoutsovasilis pkoutsovasilis force-pushed the fix/upgrade_details branch 3 times, most recently from 8e685fc to 978475c Compare August 26, 2025 14:23
@elasticmachine
Copy link
Collaborator

elasticmachine commented Aug 26, 2025

💔 Build Failed

Failed CI Steps

History

cc @pkoutsovasilis

@pkoutsovasilis pkoutsovasilis marked this pull request as ready for review August 26, 2025 20:51
@pkoutsovasilis pkoutsovasilis requested a review from a team as a code owner August 26, 2025 20:51
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@ycombinator
Copy link
Contributor

Tested this PR with the given instructions. I see the scheduled upgrade being reported both in the Fleet UI and in the output of sudo elastic-agent status as part of the upgrade_details key 👍.

I restarted Elastic Agent three different ways after that: by running sudo elastic-agent restart, by stopping and starting the elastic-agent service, and by killing the elastic-agent process and letting the service bring it back up. In each case, the UPG_SCHEDULED state in upgrade_details remained the same as before 👍.

Before the scheduled time arrived, I canceled the upgrade via the Fleet UI. I see that the UPG_SCHEDULED state is no longer seen either in the Fleet UI or in the output of sudo elastic-agent status as part of the upgrade_details key 👍.

Reviewing the code changes now...

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, just a two nits that I found.

Overall the testing looks good. The integration testing is good. I do have a request for an extra integration test. I didn't see a test case for scheduling an upgrade action, the action running, it failing, then scheduling a new action not scheduled, that replaces that action with the same version and URI. I believe that could be added with an upgrade image behind a transparent proxy, where the first action always fails to download, and then the following up action it allows it to succeed.

@pkoutsovasilis
Copy link
Contributor Author

pkoutsovasilis commented Aug 27, 2025

Thanks for the review @blakerouse - appreciate the close look!

On the extra integration test: In general I agree more coverage around upgrades is always valuable. That said the flow you’re describing kicks in after an upgrade action is dispatched. At that point the action leaves the dispatcher’s action queue and becomes part of bkgActions slice which is managed by the upgrader. So it isn’t exercising the dispatcher changes in this PR. I’d like to keep this PR scoped to "actions that go through the dispatcher" where the related issues are, and avoid adding an integration test that primarily validates coordinator behavior here.

I can, however, add a unit test in this PR that forces the upgrade action handler to fail initially, then schedules a new upgrade action (same version/URI) and verifies the dispatcher dispatches the new one - matching the spirit of your ask while staying within dispatcher boundaries. WDYT?

@blakerouse
Copy link
Contributor

@pkoutsovasilis A unit test for that would work.

…ly the expiration of retried stored actions, and update upgrade details on retries
@pkoutsovasilis
Copy link
Contributor Author

pkoutsovasilis commented Aug 28, 2025

@blakerouse @ycombinator while working on the unit test we agreed on, I ran into a few more issues (story of my life :bowtie:). That’s why I added a new commit (cabe8c7) with additional business logic changes. I’ve updated the PR description, but to call them out specifically:

Deduplicate upgrade actions from the received fleetgateway actions by keeping only the first encountered upgrade in input order.

We now keep only the first encountered upgrade action in input order. I’m not sure how/why fleetgateway can send multiple upgrade actions, but it causes problems. Example: if we get both an immediate upgrade and a scheduled upgrade, which one should set the upgrade details? And what if the scheduled one gets dispatched in the next window while the immediate one is still running? That feels like something that shouldn’t be allowed, but I’d like your thoughts @blakerouse @ycombinator

Correctly handle retried upgrade actions.

Previously (code link), retried immediate upgrade actions (no expiration) were incorrectly treated as expired because exp was 0, and ts.After(exp) was always true.

Correctly update upgrade details when an upgrade action is scheduled for a retry.

The dispatcher now updates upgrade details on retry, including the error from the prior attempt. Unfortunately, Fleet UI still shows only the UPG_SCHEDULED tag without an error indicator (e.g. a warning/triangle), but the agent state correctly captures this - so I think this is fine.

Correctly update the upgrade details to nil when an upgrade action is dispatched without an error.

Learned here that the coordinator upgrader can skip setting upgrade details (e.g. if the action is ignored because it’s the same version/URI). Without resetting, the last dispatcher-set upgrade details would “stick” to the agent and, since the action gets ack’ed, there’d be no cancellation path. Fix: whenever the dispatcher sees a dispatchable upgrade action, it sets upgrade details to nil. That way, if the upgrader doesn’t set new ones, we don’t end up with stale data. IMO this needs to be investigated properly and most probably change


Now, all that said, I think there’s another upcoming issue: in PR #8767, available rollback versions are becoming part of the upgrade details. That can create issues since the dispatcher knows nothing about which versions are available for rollback, so it could incorrectly set upgrade details with these missing. From what I understand now about how upgrade details flow through the code, rollback versions don’t really belong in upgrade details - they should probably be part of the check-in request body instead. Fleet UI could then combine those with upgrade details to decide whether to show a rollback option, similar to the existing flow that Fleet UI decides that there is an available upgrade for an agent.

@pchila @ebeahan let's discuss about that - maybe I am missing something

Copy link

@pchila
Copy link
Member

pchila commented Aug 28, 2025

Now, all that said, I think there’s another upcoming issue: in PR #8767, available rollback versions are becoming part of the upgrade details. That can create issues since the dispatcher knows nothing about which versions are available for rollback, so it could incorrectly set upgrade details with these missing. From what I understand now about how upgrade details flow through the code, rollback versions don’t really belong in upgrade details - they should probably be part of the check-in request body instead. Fleet UI could then combine those with upgrade details to decide whether to show a rollback option, similar to the existing flow that Fleet UI decides that there is an available upgrade for an agent.

@pkoutsovasilis agreed, looking at the upgrade details' behavior that this PR is dealing with for scheduled upgrades and retries I feel like the list of available rollbacks should not be in the details metadata (that location was chosen initially as a "convenient" spot since that was already part of agent status and check-in payload).

Let's discuss offline with @ebeahan involvement how to avoid introducing more complexity on the upgrade details functionality.

@blakerouse
Copy link
Contributor

@blakerouse @ycombinator while working on the unit test we agreed on, I ran into a few more issues (story of my life :bowtie:). That’s why I added a new commit (cabe8c7) with additional business logic changes. I’ve updated the PR description, but to call them out specifically:

@pkoutsovasilis Great that these extra issues where caught now, with added unit tests. The changes adding in that commit look good, very well commented.

Deduplicate upgrade actions from the received fleetgateway actions by keeping only the first encountered upgrade in input order.

We now keep only the first encountered upgrade action in input order. I’m not sure how/why fleetgateway can send multiple upgrade actions, but it causes problems. Example: if we get both an immediate upgrade and a scheduled upgrade, which one should set the upgrade details? And what if the scheduled one gets dispatched in the next window while the immediate one is still running? That feels like something that shouldn’t be allowed, but I’d like your thoughts @blakerouse @ycombinator

I don't think there is really away of having Fleet prevent this. I could see two requests coming in near the same time one scheduled and the other not. To me the non-scheduled wins and should just happen, the scheduled one should just be ignored as the upgrade is now happening.

Copy link
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — glad to have this fixed!

I still have one question but it's not blocking this PR. Depending on the answer, we can make a follow up PR.

@pkoutsovasilis pkoutsovasilis merged commit ff80471 into elastic:main Sep 1, 2025
19 checks passed
Copy link
Contributor

github-actions bot commented Sep 1, 2025

@Mergifyio backport 8.17 8.18 8.19 9.0 9.1

Copy link
Contributor

mergify bot commented Sep 1, 2025

backport 8.17 8.18 8.19 9.0 9.1

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Sep 1, 2025
* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/application_test.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/application/coordinator/coordinator_test.go
#	internal/pkg/agent/cmd/run.go
mergify bot pushed a commit that referenced this pull request Sep 1, 2025
* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/cmd/run.go
mergify bot pushed a commit that referenced this pull request Sep 1, 2025
* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/cmd/run.go
mergify bot pushed a commit that referenced this pull request Sep 1, 2025
* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/cmd/run.go
mergify bot pushed a commit that referenced this pull request Sep 1, 2025
* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/cmd/run.go
pkoutsovasilis added a commit that referenced this pull request Sep 1, 2025
* fix: scheduled upgrade details state (#9562)

* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/cmd/run.go

* fix: resolve conflicts

* fix: define missing test helper func

* fix: ignore QF1007 merge conditional assignment into variable declaration

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Sep 1, 2025
* fix: scheduled upgrade details state (#9562)

* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/cmd/run.go

* fix: resolve conflicts

* fix: define missing test helper func

* fix: ignore QF1007 merge conditional assignment into variable declaration

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Sep 1, 2025
* fix: scheduled upgrade details state (#9562)

* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/cmd/run.go

* fix: resolve conflicts

* fix: define missing test helper func

* fix: ignore QF1007 merge conditional assignment into variable declaration

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Sep 1, 2025
* fix: scheduled upgrade details state (#9562)

* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/cmd/run.go

* fix: resolve conflicts

* fix: define missing test helper func

* fix: ignore QF1007 merge conditional assignment into variable declaration

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
pkoutsovasilis added a commit that referenced this pull request Sep 1, 2025
* fix: scheduled upgrade details state (#9562)

* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/application_test.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/application/coordinator/coordinator_test.go
#	internal/pkg/agent/cmd/run.go

* fix: resolve conflicts

* fix: define missing test helper func

---------

Co-authored-by: Panos Koutsovasilis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-active-all Automated backport with mergify to all the active branches bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Elastic-agent loses UPG_SCHEDULED status when restarted Agents remains in UPG_SCHEDULED state when upgrade action is cancelled
5 participants