Skip to content

Conversation

@jillguyonnet
Copy link
Contributor

@jillguyonnet jillguyonnet commented Feb 28, 2025

Summary

Relates https://github.com/elastic/ingest-dev/issues/4720

This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in #211019.

Complementary fleet-server change which sets the agent's upgrade_attempts to null once the upgrade is complete.: elastic/fleet-server#4528

Approach

  • A new upgrade_attempts property is added to agents and stored in the agent doc (ES mapping update in [Fleet] Add upgrade_attempts to .fleet-agents index elasticsearch#123256).
  • When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' upgrade_attempts.
  • The default retry delays are ['30m', '1h', '2h', '4h', '8h', '16h', '24h'] and can be overridden with the new xpack.fleet.autoUpgrades.retryDelays setting.
  • On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. https://github.com/elastic/ingest-dev/issues/4720#issuecomment-2671660795).
  • Once an agent has completed and failed the max retries defined by the retry delays array, it is no longer retried.

Testing

The ES query for fetching agents with existing upgrade_attempts needs the updated mappings, so it might be necessary to pull the latest main in the elasticsearch repo and run yarn es source instead of yarn es snapshot (requires an up-to-date Java environment, currently 23).

In order to test that upgrade_attempts is set to null when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528.

Checklist

  • Unit or functional tests were updated or added to match the most common scenarios
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

Identify risks

Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the enableAutomaticAgentUpgrades feature flag.

@jillguyonnet jillguyonnet self-assigned this Feb 28, 2025
@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One important piece of this logic is that we can't just pick up agents stuck in updating now since we're explicitly handling retries. I'm allowing agents stuck in updating with no upgrade_attempts (i.e. that were not upgraded through this task) to be selected.

One thing that came to mind while implementing this is that we could probably make this task more performant by modifying the kuery for selecting candidate agents for upgrade: instead of allowing all status:updating agents, we could restrict it to only allow status:updating AND NOT upgrade_attempts:* AND upgrade_details.target_version:{version} AND upgrade_details.state:UPG_FAILED. The only difference would be that it wouldn't pick up agents on older versions without upgrade_details that are stuck in updating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juliaElastic Pinging you on this comment for your thoughts on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could include those stuck agents without upgrade_details in a date range query, something like upgrade_started_at < now-2h.
One caveat with these agents is we don't have target_version on them, so we don't necessarily auto upgrade them to the same version as initially, but I think it's still better to try to upgrade them rather than keep them stuck.

Copy link
Contributor Author

@jillguyonnet jillguyonnet Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, pushed a change in this direction.

@jillguyonnet jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 57a093b to f066052 Compare February 28, 2025 10:44
@github-actions

This comment was marked as outdated.

@jillguyonnet jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from f066052 to 835b23d Compare February 28, 2025 10:45
@github-actions

This comment was marked as outdated.

@jillguyonnet jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 835b23d to 5b2c811 Compare February 28, 2025 10:46
@github-actions

This comment was marked as outdated.

@jillguyonnet jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 5b2c811 to 28582f7 Compare February 28, 2025 10:48
@github-actions

This comment was marked as outdated.

const TITLE = 'Fleet Automatic agent upgrades';
const SCOPE = ['fleet'];
const INTERVAL = '1h';
const INTERVAL = '30m';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this to the default min retry delay.

@jillguyonnet jillguyonnet added Team:Fleet Team label for Observability Data Collection Fleet team release_note:skip Skip the PR/issue when compiling release notes backport:skip This PR does not require backporting v9.1.0 labels Feb 28, 2025
…t --include-path /api/status --include-path /api/alerting/rule/ --include-path /api/alerting/rules --include-path /api/actions --include-path /api/security/role --include-path /api/spaces --include-path /api/fleet --include-path /api/dashboards --update'
Co-authored-by: Julia Bardi <[email protected]>
@jillguyonnet
Copy link
Contributor Author

@elasticmachine merge upstream

@juliaElastic
Copy link
Contributor

juliaElastic commented Mar 4, 2025

Tested locally with horde agents, and getting this error, it looks like missing upgrade_details doesn't work well with the query.

[2025-03-04T13:09:05.797+01:00][ERROR][plugins.fleet.fleet:automatic-agent-upgrade-task:1.0.0] [AutomaticAgentUpgradeTask] Error: ResponseError: search_phase_execution_exception
        Root causes:
                query_shard_exception: failed to create query: [nested] failed to find nested object under path [upgrade_details.target_version]

I could get past it by adding the nestedIgnoreUnmapped parameter here:
https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/fleet/server/services/agents/crud.ts#L414

const query = kueryNode ? { query: toElasticsearchQuery(kueryNode, undefined, {nestedIgnoreUnmapped: true}) } : {};

}

private isAgentReadyForRetry(agent: Agent, agentPolicy: AgentPolicy) {
if (!agent.upgrade_attempts) {
Copy link
Contributor

@juliaElastic juliaElastic Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When testing this, it seems the upgrade_attempts array keeps being extended with more items even after max attempts are exceeded, and way faster than the retry delays.
Not sure where the bug is, I'm testing with an agent doc that's manually updated to be in failed upgrade state.

It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reporting this. I'm currently troubleshooting with a real agent and I think with the change in elastic/fleet-server#4528 upgrade_attempts gets cleared after a failed upgrade, which is not the intent. Let me troubleshoot this further and see if I can reproduce the issue you are seeing.

Copy link
Contributor Author

@jillguyonnet jillguyonnet Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done more testing with real agents and a new approach in elastic/fleet-server#4528. Retrying seems to work for agents with upgrade details (it's hitting the max retry attempts exceeded for agent log as expected). Not sure if that's due to my last fixes or if I'm missing some flow.

There is however currently a gap for agents with no upgrade details: as you pointed out in #212744 (comment), the ES queries don't work as expected. Furthermore, the current fleet-server change relies on upgrade details, so retries don't work for these agents. I'm still looking at how to improve queries, but I'd welcome some feedback on the approach.

It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0

Currently the logic does this:

  • count how many agents need to be on the target version
  • subtract number of agents already on it or updating to it
  • subtract number of agents marked for retry (with upgrade_attempts set)
  • if this is still a positive number, fetch more agents with findAndUpgradeCandidateAgents

findAndUpgradeCandidateAgents should definitely not attempt to upgrade an agent that was marked for retry. In theory the AND (NOT upgrade_attempts:*) part of the agents fetcher kuery should prevent that (I fixed that). Do you have any concerns about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's okay if we don't retry agents without upgrade details, just count them as updating.
The logic sounds good, we should be able to test that it works as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we avoid setting upgrade_attempts on agents below 8.12.0 then? If we do that, they would naturally get retried by the task (with no limit to the number of attempts).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that sounds reasonable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed the change, along with a fix (the task was erroring if no agents were returned by the fetcher).

I've tested again with real agents below and above 8.12.0, as well as horde agents. I'm not seeing any issues, please let me know if you do.

@jillguyonnet
Copy link
Contributor Author

@elasticmachine merge upstream

@jillguyonnet jillguyonnet marked this pull request as ready for review March 6, 2025 15:29
@jillguyonnet jillguyonnet requested a review from a team as a code owner March 6, 2025 15:29
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

Copy link
Contributor

@juliaElastic juliaElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@elasticmachine
Copy link
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #19 / discover/context_awareness extension getAdditionalCellActions data view mode should render additional cell actions for logs data source
  • [job] [logs] FTR Configs #5 / Fleet Endpoints fleet_proxies_crud PUT /proxies/{itemId} should allow to update an existing fleet proxy

Metrics [docs]

Page load bundle

Size of the bundles that are downloaded on every page load. Target size is below 100kb

id before after diff
fleet 158.8KB 158.8KB +31.0B
Unknown metric groups

API count

id before after diff
fleet 1463 1464 +1

History

cc @jillguyonnet

@jillguyonnet jillguyonnet merged commit bdbc2ef into elastic:main Mar 6, 2025
9 checks passed
@jillguyonnet jillguyonnet deleted the fleet/4720-automatic-upgrades-retry branch March 6, 2025 20:31
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Mar 22, 2025
## Summary

Relates elastic/ingest-dev#4720

This PR adds retry logic to the task that handles automatic agent
upgrades originally implemented in
elastic#211019.

Complementary fleet-server change which sets the agent's
`upgrade_attempts` to `null` once the upgrade is complete.:
elastic/fleet-server#4528

### Approach

- A new `upgrade_attempts` property is added to agents and stored in the
agent doc (ES mapping update in
elastic/elasticsearch#123256).
- When a bulk upgrade action is sent from the automatic upgrade task, it
pushes the timestamp of the upgrade to the affected agents'
`upgrade_attempts`.
- The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h',
'24h']` and can be overridden with the new
`xpack.fleet.autoUpgrades.retryDelays` setting.
- On every run, the automatic upgrade task will first process retries
and then query more agents if necessary (cf.
elastic/ingest-dev#4720 (comment)).
- Once an agent has completed and failed the max retries defined by the
retry delays array, it is no longer retried.

### Testing

The ES query for fetching agents with existing `upgrade_attempts` needs
the updated mappings, so it might be necessary to pull the latest `main`
in the `elasticsearch` repo and run `yarn es source` instead of `yarn es
snapshot` (requires an up-to-date Java environment, currently 23).

In order to test that `upgrade_attempts` is set to `null` when the
upgrade is complete, fleet-server should be run in dev using the change
in elastic/fleet-server#4528.

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

### Identify risks

Low probability risk of incorrectly triggering agent upgrades. This
feature is currently behind the `enableAutomaticAgentUpgrades` feature
flag.

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
juliaElastic added a commit to juliaElastic/kibana that referenced this pull request Apr 8, 2025
Relates elastic/ingest-dev#4720

This PR adds retry logic to the task that handles automatic agent
upgrades originally implemented in
elastic#211019.

Complementary fleet-server change which sets the agent's
`upgrade_attempts` to `null` once the upgrade is complete.:
elastic/fleet-server#4528

- A new `upgrade_attempts` property is added to agents and stored in the
agent doc (ES mapping update in
elastic/elasticsearch#123256).
- When a bulk upgrade action is sent from the automatic upgrade task, it
pushes the timestamp of the upgrade to the affected agents'
`upgrade_attempts`.
- The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h',
'24h']` and can be overridden with the new
`xpack.fleet.autoUpgrades.retryDelays` setting.
- On every run, the automatic upgrade task will first process retries
and then query more agents if necessary (cf.
elastic/ingest-dev#4720 (comment)).
- Once an agent has completed and failed the max retries defined by the
retry delays array, it is no longer retried.

The ES query for fetching agents with existing `upgrade_attempts` needs
the updated mappings, so it might be necessary to pull the latest `main`
in the `elasticsearch` repo and run `yarn es source` instead of `yarn es
snapshot` (requires an up-to-date Java environment, currently 23).

In order to test that `upgrade_attempts` is set to `null` when the
upgrade is complete, fleet-server should be run in dev using the change
in elastic/fleet-server#4528.

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

Low probability risk of incorrectly triggering agent upgrades. This
feature is currently behind the `enableAutomaticAgentUpgrades` feature
flag.

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:Fleet Team label for Observability Data Collection Fleet team v9.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants