[Fleet] Add retry logic to automatic agent upgrades #212744

jillguyonnet · 2025-02-28T10:33:46Z

Summary

Relates https://github.com/elastic/ingest-dev/issues/4720

This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in #211019.

A complementary fleet-server change, elastic/fleet-server#4528, sets the agent's upgrade_attempts to null once the upgrade is complete.

Approach

A new upgrade_attempts property is added to agents and stored in the agent doc (ES mapping update in [Fleet] Add upgrade_attempts to .fleet-agents index elasticsearch#123256).
When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' upgrade_attempts.
The default retry delays are ['30m', '1h', '2h', '4h', '8h', '16h', '24h'] and can be overridden with the new xpack.fleet.autoUpgrades.retryDelays setting.
On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. https://github.com/elastic/ingest-dev/issues/4720#issuecomment-2671660795).
Once an agent has used up the maximum number of retries defined by the retry delays array without a successful upgrade, it is no longer retried.
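The retry rules in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `parseDelayMs`, `getRetryDecision`, and the `RetryDecision` type are hypothetical names, and it assumes that `upgrade_attempts` holds ISO timestamps (in any order) and that the Nth recorded attempt must age by the Nth delay before the next retry. Only the default delays come from the description.

```typescript
// Default delays quoted in the PR description.
const DEFAULT_RETRY_DELAYS = ['30m', '1h', '2h', '4h', '8h', '16h', '24h'];

const UNIT_MS: Record<string, number> = { m: 60_000, h: 3_600_000 };

// Hypothetical parser: only handles the "<number>m" / "<number>h" shapes used above.
function parseDelayMs(delay: string): number {
  const match = /^(\d+)([mh])$/.exec(delay);
  if (!match) {
    throw new Error(`Unsupported delay format: ${delay}`);
  }
  return Number(match[1]) * UNIT_MS[match[2]];
}

type RetryDecision = 'not_tracked' | 'ready' | 'waiting' | 'exhausted';

// Assumption: one timestamp is recorded per upgrade attempt sent by the task.
function getRetryDecision(
  upgradeAttempts: string[] | null | undefined,
  now: Date,
  retryDelays: string[] = DEFAULT_RETRY_DELAYS
): RetryDecision {
  if (!upgradeAttempts || upgradeAttempts.length === 0) {
    return 'not_tracked'; // the agent was never upgraded through the task
  }
  if (upgradeAttempts.length > retryDelays.length) {
    return 'exhausted'; // initial attempt plus every configured retry has been used
  }
  // Use the most recent timestamp so the result does not depend on array order.
  const lastAttemptMs = Math.max(...upgradeAttempts.map((ts) => Date.parse(ts)));
  const delayMs = parseDelayMs(retryDelays[upgradeAttempts.length - 1]);
  return now.getTime() - lastAttemptMs >= delayMs ? 'ready' : 'waiting';
}
```

With the default delays, an agent whose single recorded attempt is 45 minutes old would be reported as `'ready'`, since the first delay is 30 minutes.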

Testing

The ES query for fetching agents with existing upgrade_attempts needs the updated mappings, so it might be necessary to pull the latest main in the elasticsearch repo and run yarn es source instead of yarn es snapshot (requires an up-to-date Java environment, currently 23).

In order to test that upgrade_attempts is set to null when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528.

Checklist

Unit or functional tests were updated or added to match the most common scenarios
The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

Identify risks

Low-probability risk of incorrectly triggering agent upgrades. This feature is currently behind the enableAutomaticAgentUpgrades feature flag.

docs/settings/fleet-settings.asciidoc

jillguyonnet · 2025-02-28T10:43:13Z

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts

One important piece of this logic is that we can't just pick up agents stuck in updating now since we're explicitly handling retries. I'm allowing agents stuck in updating with no upgrade_attempts (i.e. that were not upgraded through this task) to be selected.

One thing that came to mind while implementing this is that we could probably make this task more performant by modifying the kuery for selecting candidate agents for upgrade: instead of allowing all status:updating agents, we could restrict it to only allow status:updating AND NOT upgrade_attempts:* AND upgrade_details.target_version:{version} AND upgrade_details.state:UPG_FAILED. The only difference would be that it wouldn't pick up agents on older versions without upgrade_details that are stuck in updating.

@juliaElastic Pinging you on this comment for your thoughts on it.

We could include those stuck agents without upgrade_details in a date range query, something like upgrade_started_at < now-2h.
One caveat with these agents is we don't have target_version on them, so we don't necessarily auto upgrade them to the same version as initially, but I think it's still better to try to upgrade them rather than keep them stuck.

Thanks for the suggestion, pushed a change in this direction.
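For readers following this thread, the kuery being discussed could be assembled along these lines. This is only an illustrative sketch: `buildStuckInUpdatingKuery` is a hypothetical helper, the way the two suggestions are combined with OR is an assumption, and the final kuery in the PR may differ (a later comment reports that queries on upgrade_details needed further fixes).

```typescript
// Hypothetical helper combining the two suggestions from this thread:
// 1. agents whose upgrade to the target version failed (have upgrade_details), and
// 2. agents stuck in updating for longer than a cutoff (date range on upgrade_started_at).
function buildStuckInUpdatingKuery(targetVersion: string, stuckSince: string = 'now-2h'): string {
  const failedOnTarget =
    `(upgrade_details.target_version:${targetVersion} AND upgrade_details.state:UPG_FAILED)`;
  const oldStuck = `upgrade_started_at < ${stuckSince}`;
  // Agents already tracked for retry (upgrade_attempts set) are excluded,
  // because the retry pass handles them separately.
  return `status:updating AND NOT upgrade_attempts:* AND (${failedOnTarget} OR ${oldStuck})`;
}

const exampleKuery = buildStuckInUpdatingKuery('9.1.0');
```

Keeping the `NOT upgrade_attempts:*` clause in this query is what separates "new candidates" from "agents waiting for a retry".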

jillguyonnet · 2025-02-28T10:49:05Z

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts

 const TITLE = 'Fleet Automatic agent upgrades';
 const SCOPE = ['fleet'];
-const INTERVAL = '1h';
+const INTERVAL = '30m';


Changing this to the default min retry delay.

…t --include-path /api/status --include-path /api/alerting/rule/ --include-path /api/alerting/rules --include-path /api/actions --include-path /api/security/role --include-path /api/spaces --include-path /api/fleet --include-path /api/dashboards --update'

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts

x-pack/platform/plugins/shared/fleet/server/services/agents/upgrade_action_runner.ts

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts

Co-authored-by: Julia Bardi <[email protected]>

jillguyonnet · 2025-03-03T13:29:51Z

@elasticmachine merge upstream

juliaElastic · 2025-03-04T12:12:51Z

Tested locally with horde agents and got the error below. It looks like the query doesn't handle agents with missing upgrade_details well.

[2025-03-04T13:09:05.797+01:00][ERROR][plugins.fleet.fleet:automatic-agent-upgrade-task:1.0.0] [AutomaticAgentUpgradeTask] Error: ResponseError: search_phase_execution_exception
        Root causes:
                query_shard_exception: failed to create query: [nested] failed to find nested object under path [upgrade_details.target_version]

I could get past it by adding the nestedIgnoreUnmapped parameter here:
https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/fleet/server/services/agents/crud.ts#L414

const query = kueryNode ? { query: toElasticsearchQuery(kueryNode, undefined, {nestedIgnoreUnmapped: true}) } : {};

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts

juliaElastic · 2025-03-04T14:09:23Z

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts

+  }
+
+  private isAgentReadyForRetry(agent: Agent, agentPolicy: AgentPolicy) {
+    if (!agent.upgrade_attempts) {


When testing this, it seems the upgrade_attempts array keeps being extended with more items even after max attempts are exceeded, and way faster than the retry delays.
Not sure where the bug is, I'm testing with an agent doc that's manually updated to be in failed upgrade state.

It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0

Thanks for reporting this. I'm currently troubleshooting with a real agent and I think with the change in elastic/fleet-server#4528 upgrade_attempts gets cleared after a failed upgrade, which is not the intent. Let me troubleshoot this further and see if I can reproduce the issue you are seeing.

I've done more testing with real agents and a new approach in elastic/fleet-server#4528. Retrying seems to work for agents with upgrade details (it's hitting the max retry attempts exceeded for agent log as expected). Not sure if that's due to my last fixes or if I'm missing some flow.

There is however currently a gap for agents with no upgrade details: as you pointed out in #212744 (comment), the ES queries don't work as expected. Furthermore, the current fleet-server change relies on upgrade details, so retries don't work for these agents. I'm still looking at how to improve queries, but I'd welcome some feedback on the approach.

It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0

Currently the logic does this:

1. Count how many agents need to be on the target version.
2. Subtract the number of agents already on it or updating to it.
3. Subtract the number of agents marked for retry (with upgrade_attempts set).
4. If this is still a positive number, fetch more agents with findAndUpgradeCandidateAgents.

findAndUpgradeCandidateAgents should definitely not attempt to upgrade an agent that was marked for retry. In theory the AND (NOT upgrade_attempts:*) part of the agents fetcher kuery should prevent that (I fixed that). Do you have any concerns about this?
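The counting steps described in this comment amount to a small piece of arithmetic, sketched below. The interface and function names are hypothetical and the inputs are assumed to have been computed by earlier queries; the sketch only shows why agents marked for retry reduce the number of new candidates fetched.

```typescript
// Hypothetical shape for the counts the task gathers before fetching candidates.
interface UpgradeCounts {
  requiredOnTargetVersion: number; // agents that need to be on the target version
  onOrUpdatingToTarget: number; // agents already on it or updating to it
  markedForRetry: number; // agents with upgrade_attempts set
}

// Returns how many additional candidate agents should be fetched (never negative).
function countAdditionalAgentsNeeded(counts: UpgradeCounts): number {
  const remaining =
    counts.requiredOnTargetVersion - counts.onOrUpdatingToTarget - counts.markedForRetry;
  return Math.max(0, remaining);
}
```

Only when the result is positive would the task go on to call findAndUpgradeCandidateAgents for that many agents.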

I think it's okay if we don't retry agents without upgrade details, just count them as updating.
The logic sounds good, we should be able to test that it works as expected.

Should we avoid setting upgrade_attempts on agents below 8.12.0 then? If we do that, they would naturally get retried by the task (with no limit to the number of attempts).

I think that sounds reasonable.

Pushed the change, along with a fix (the task was erroring if no agents were returned by the fetcher).

I've tested again with real agents below and above 8.12.0, as well as horde agents. I'm not seeing any issues, please let me know if you do.
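The outcome of this thread (only tracking attempts for agents that report upgrade details) might look something like the sketch below. It is written under assumptions: `AgentLike`, `isAtLeast`, and `recordUpgradeAttempt` are hypothetical names, the version check follows the 8.12.0 cutoff mentioned above (the related commit is titled "Only set upgrade_attempts if agent has upgrade_details", so the real check may look at the field rather than the version), and appending the timestamp at the end of the array is a guess.

```typescript
// Hypothetical minimal agent shape for this sketch.
interface AgentLike {
  id: string;
  version: string;
  upgrade_attempts?: string[] | null;
}

// Compares "major.minor.patch" versions, ignoring suffixes such as "-SNAPSHOT".
function isAtLeast(version: string, minimum: string): boolean {
  const parse = (v: string): number[] => v.split('-')[0].split('.').map((part) => Number(part));
  const a = parse(version);
  const b = parse(minimum);
  for (let i = 0; i < 3; i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) {
      return diff > 0;
    }
  }
  return true;
}

// Returns a copy of the agent with the attempt recorded, or the agent unchanged
// when it is too old to report upgrade details (assumed cutoff: 8.12.0).
function recordUpgradeAttempt(agent: AgentLike, now: Date): AgentLike {
  if (!isAtLeast(agent.version, '8.12.0')) {
    return agent; // left untracked, so the task keeps picking it up with no attempt limit
  }
  return {
    ...agent,
    upgrade_attempts: [...(agent.upgrade_attempts ?? []), now.toISOString()],
  };
}
```

Leaving older agents untracked matches the trade-off agreed above: they are retried naturally by the task, without a cap on the number of attempts.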

jillguyonnet · 2025-03-06T15:25:19Z

@elasticmachine merge upstream

elasticmachine · 2025-03-06T15:29:34Z

Pinging @elastic/fleet (Team:Fleet)

juliaElastic

LGTM

elasticmachine · 2025-03-06T19:31:25Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 9b1ca3d

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #19 / discover/context_awareness extension getAdditionalCellActions data view mode should render additional cell actions for logs data source
[job] [logs] FTR Configs #5 / Fleet Endpoints fleet_proxies_crud PUT /proxies/{itemId} should allow to update an existing fleet proxy

Metrics [docs]

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id	before	after	diff
`fleet`	158.8KB	158.8KB	+31.0B

Unknown metric groups

API count

id	before	after	diff
`fleet`	1463	1464	+1

History

cc @jillguyonnet

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>

jillguyonnet self-assigned this Feb 28, 2025

jillguyonnet commented Feb 28, 2025

docs/settings/fleet-settings.asciidoc (outdated)

jillguyonnet commented Feb 28, 2025

jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 57a093b to f066052 Compare February 28, 2025 10:44

jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from f066052 to 835b23d Compare February 28, 2025 10:45

jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 835b23d to 5b2c811 Compare February 28, 2025 10:46

[Fleet] Add retry logic to automatic agent upgrades

28582f7

jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 5b2c811 to 28582f7 Compare February 28, 2025 10:48

jillguyonnet commented Feb 28, 2025

jillguyonnet added the Team:Fleet, release_note:skip, backport:skip, and v9.1.0 labels Feb 28, 2025

Remove asciidoc change

7207089

jillguyonnet requested a review from juliaElastic February 28, 2025 10:57

jillguyonnet mentioned this pull request Feb 28, 2025

Clear agent.upgrade_attempts on upgrade complete elastic/fleet-server#4528

Merged

8 tasks

juliaElastic reviewed Feb 28, 2025

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts

juliaElastic reviewed Feb 28, 2025

x-pack/platform/plugins/shared/fleet/server/services/agents/upgrade_action_runner.ts (outdated)

Feedback

08aaf9c

juliaElastic reviewed Mar 3, 2025

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts (outdated)

Fix oldStuckInUpdatingKuery

1839863

Co-authored-by: Julia Bardi <[email protected]>

Merge branch 'main' into fleet/4720-automatic-upgrades-retry

6e57501

juliaElastic reviewed Mar 4, 2025

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts (outdated)

juliaElastic reviewed Mar 4, 2025

x-pack/platform/plugins/shared/fleet/server/tasks/automatic_agent_upgrade_task.ts (outdated)

juliaElastic reviewed Mar 4, 2025

jillguyonnet and others added 5 commits March 6, 2025 12:56

Fix queries

e183081

Fix query

f0d0833

[CI] Auto-commit changed files from 'make api-docs'

a5812e5

Only set upgrade_attempts if agent has upgrade_details

d0bde15

Check if agentsFetcher returned any agents

831883e

Merge branch 'main' into fleet/4720-automatic-upgrades-retry

16c0627

jillguyonnet marked this pull request as ready for review March 6, 2025 15:29

jillguyonnet requested a review from a team as a code owner March 6, 2025 15:29

Fix config type

f7a59cc

juliaElastic approved these changes Mar 6, 2025

Fix frontend unit test

9b1ca3d

jillguyonnet merged commit bdbc2ef into elastic:main Mar 6, 2025
9 checks passed

jillguyonnet deleted the fleet/4720-automatic-upgrades-retry branch March 6, 2025 20:31

juliaElastic mentioned this pull request Apr 4, 2025

Clear upgrade_attempts on handleAck elastic/fleet-server#4762

Merged

8 tasks

This was referenced Apr 8, 2025

[8.x](backport #4528) Clear agent.upgrade_attempts on upgrade complete elastic/fleet-server#4777

Merged

[8.x](backport #4762) Clear upgrade_attempts on handleAck elastic/fleet-server#4778

Merged
