mergify[bot]
Contributor

@mergify mergify bot commented Sep 1, 2025

What does this PR do?

This PR fixes how Elastic Agent persists and reports upgrade details across check-ins, restarts, cancellations, and expired scheduled upgrade actions.

Specifically:

  • (95ab857) Ensures the Coordinator is initialized with the correct upgrade details, either from the upgrade marker file or queued Fleet actions.
  • (95ab857, cabe8c7) Extends the ActionDispatcher logic to:
    • Deduplicate upgrade actions from the received fleetgateway actions by keeping only the first encountered upgrade in input order.
    • Correctly track when upgrade details need to be updated (new upgrade actions, cancellations, expirations, etc.).
    • Correctly handle upgrade action cancellations.
    • Correctly handle expired scheduled upgrade actions and keep persisting them until explicitly canceled or replaced by a newer upgrade action.
    • Correctly handle retried upgrade actions.
    • Correctly update upgrade details when an upgrade action is scheduled for a retry.
    • Correctly update the upgrade details to nil when an upgrade action is dispatched without an error.
  • (a16a9f3) Refactors and extends related dispatcher tests to include an actual action queue and cover new flows (queueing, canceling, expiring).
  • (3e7324e) Adds new integration tests under testing/integration/ess/scheduled_upgrade_details_test.go to verify that scheduled upgrade details are:
    • Reported correctly when received.
    • Reported correctly when expired.
    • Preserved across Agent restarts.
    • Cleared on cancel.

All of the new business logic is contained in 95ab857 (+234 -126 lines changed) and cabe8c7 (+174 -20 lines changed).
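
For readers unfamiliar with the dispatcher, the deduplication rule listed above (keep only the first upgrade action, in input order) can be illustrated with a minimal sketch. This is not the code from the PR: the `Action` type, the `"UPGRADE"` type string, and the `dedupeUpgrades` helper are hypothetical names chosen for illustration only.

```go
package main

import "fmt"

// Action is a hypothetical, simplified stand-in for an action received
// from Fleet. It is not the type used in the Elastic Agent code base.
type Action struct {
	ID   string
	Type string // illustrative values: "UPGRADE", "POLICY_CHANGE"
}

// dedupeUpgrades keeps every non-upgrade action and only the first
// upgrade action it encounters, preserving the input order.
func dedupeUpgrades(actions []Action) []Action {
	out := make([]Action, 0, len(actions))
	seenUpgrade := false
	for _, a := range actions {
		if a.Type == "UPGRADE" {
			if seenUpgrade {
				continue // drop later upgrade actions
			}
			seenUpgrade = true
		}
		out = append(out, a)
	}
	return out
}

func main() {
	in := []Action{
		{ID: "upgrade-1", Type: "UPGRADE"},
		{ID: "policy-1", Type: "POLICY_CHANGE"},
		{ID: "upgrade-2", Type: "UPGRADE"},
	}
	for _, a := range dedupeUpgrades(in) {
		fmt.Println(a.ID, a.Type)
	}
}
```

In this sketch `upgrade-2` is dropped while `upgrade-1` and `policy-1` keep their relative order; the real dispatcher also has to account for queued, retried, and canceled actions, which are not modeled here.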

Why is it important?

Previously, scheduled upgrade details could be lost or incorrectly reported:

  • They were not consistently preserved across Agent restarts.
  • Cancel actions did not clear the reported upgrade details.
  • Expired scheduled upgrade actions were treated as ephemeral and not persisted.

These inconsistencies caused confusion in the Fleet UI and sometimes left users unable to perform upgrades.
With this fix, users get a consistent and accurate view of upcoming, active, or failed upgrades across the entire Agent lifecycle.
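
The intended reporting behavior can be summarized as a small state transition, sketched below under stated assumptions. The `Details` and `Event` types, the event kinds, and the `"SCHEDULED"` / `"EXPIRED"` labels are hypothetical placeholders, not the agent's actual types or state names, and persistence across restarts (marker file, action queue) is not modeled.

```go
package main

import "fmt"

// Details is a hypothetical, reduced form of the upgrade details
// reported to Fleet. A nil *Details means nothing is reported.
type Details struct {
	ActionID string
	State    string // placeholder labels: "SCHEDULED", "EXPIRED"
}

// Event is a hypothetical description of what happened to an upgrade action.
type Event struct {
	Kind     string // "scheduled", "expired", "canceled", "dispatched"
	ActionID string
}

// apply returns the details to report after ev has been processed.
func apply(cur *Details, ev Event) *Details {
	switch ev.Kind {
	case "scheduled":
		// A newer upgrade action replaces whatever was reported before.
		return &Details{ActionID: ev.ActionID, State: "SCHEDULED"}
	case "expired":
		// Expired details stay reported until canceled or replaced.
		if cur != nil && cur.ActionID == ev.ActionID {
			return &Details{ActionID: cur.ActionID, State: "EXPIRED"}
		}
	case "canceled", "dispatched":
		// Cancellation, or dispatch without error, clears the details.
		if cur != nil && cur.ActionID == ev.ActionID {
			return nil
		}
	}
	return cur
}

func main() {
	var cur *Details
	events := []Event{
		{Kind: "scheduled", ActionID: "upgrade-1"},
		{Kind: "expired", ActionID: "upgrade-1"},
		{Kind: "canceled", ActionID: "upgrade-1"},
	}
	for _, ev := range events {
		cur = apply(cur, ev)
		if cur == nil {
			fmt.Println(ev.Kind, "-> no details")
		} else {
			fmt.Println(ev.Kind, "->", cur.State)
		}
	}
}
```

The point of the sketch is that an expired action does not silently disappear: its details remain until a cancel or a newer upgrade action arrives.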

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

None expected. This change only improves correctness of reported upgrade details.
Users upgrading to this version will see more consistent and reliable upgrade state reporting in Fleet.

How to test this PR locally

  1. Run the new integration tests, or
  2. Build the code of this PR and test manually:
    • Provision a 9.1.2 cloud stack.
    • AGENT_PACKAGE_VERSION=9.1.1 EXTERNAL=true SNAPSHOT=true PLATFORMS="linux/arm64" PACKAGES="tar.gz" mage package (ty @ycombinator for the proposal)
    • Install and enroll the Agent to Fleet.
    • Schedule an upgrade → check it is reported.
    • Restart the Agent → check details persist.
    • Cancel the upgrade → check details clear.
Screen.Recording.2025-08-26.at.5.12.03.PM.mov

I also tested the approach mentioned here to force-upgrade an agent, and everything is reflected correctly in the upgrade details.

Screen.Recording.2025-08-27.at.12.42.49.AM.mov

(PS: For the videos above, I compiled this PR with a pseudo prior version of 9.1.1.)

Related issues

* fix: persisting and reporting of upgrade details

* ci: align and extend dispatcher unit-tests

* ci: update coordinator and application new signatures in unit-tests

* ci: add integration tests for scheduled upgrade details

* doc: add changelog fragment

* doc: reword existing and add more comments in code

* feat: change queuedUpgradeActions inside dispatchCancelActions to have values of struct{}

* fix: remove redundant continue

* fix: dedupe upgrade actions from fleetgateway actions, handle correctly the expiration of retried stored actions, and update upgrade details on retries

(cherry picked from commit ff80471)

# Conflicts:
#	internal/pkg/agent/application/application.go
#	internal/pkg/agent/application/coordinator/coordinator.go
#	internal/pkg/agent/cmd/run.go
@mergify mergify bot added the backport and conflicts (There is a conflict in the backported pull request) labels Sep 1, 2025
@mergify mergify bot requested a review from a team as a code owner September 1, 2025 08:15
@mergify mergify bot requested review from nkvoll and blakerouse and removed request for a team September 1, 2025 08:15
Contributor Author

mergify bot commented Sep 1, 2025

Cherry-pick of ff80471 has failed:

On branch mergify/bp/8.18/pr-9562
Your branch is up to date with 'origin/8.18'.

You are currently cherry-picking commit ff8047180.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   changelog/fragments/1756218044-fix-upgrade-details-state.yaml
	modified:   internal/pkg/agent/application/actions/handlers/handler_action_upgrade_test.go
	modified:   internal/pkg/agent/application/application_test.go
	modified:   internal/pkg/agent/application/coordinator/coordinator_test.go
	modified:   internal/pkg/agent/application/dispatcher/dispatcher.go
	modified:   internal/pkg/agent/application/dispatcher/dispatcher_test.go
	modified:   internal/pkg/agent/application/managed_mode.go
	modified:   testing/fleetservertest/ackableactions.go
	modified:   testing/fleetservertest/checkin.go
	modified:   testing/fleetservertest/models.go
	new file:   testing/integration/ess/scheduled_upgrade_details_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   internal/pkg/agent/application/application.go
	both modified:   internal/pkg/agent/application/coordinator/coordinator.go
	both modified:   internal/pkg/agent/cmd/run.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot mentioned this pull request Sep 1, 2025
8 tasks
@github-actions github-actions bot added the bug (Something isn't working) and Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) labels Sep 1, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis force-pushed the mergify/bp/8.18/pr-9562 branch from fda3d90 to c9b0649 Compare September 1, 2025 08:36
@pkoutsovasilis pkoutsovasilis removed the conflicts (There is a conflict in the backported pull request) label Sep 1, 2025

@elasticmachine
Collaborator

💚 Build Succeeded

History

cc @pkoutsovasilis

@pkoutsovasilis pkoutsovasilis merged commit 17eed84 into 8.18 Sep 1, 2025
18 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/8.18/pr-9562 branch September 1, 2025 12:33