
Conversation

@jillguyonnet
Contributor

@jillguyonnet jillguyonnet commented Feb 28, 2025

What is the problem this PR solves?

elastic/kibana#212744 adds retry logic to the task that automatically upgrades agents. Agents upgraded through this task have their new upgrade_attempts property populated, but that PR is missing a way to clear this property when the upgrade completes successfully.

How does this PR solve the problem?

The change in this PR clears upgrade_attempts when the agent's upgrade details enter the UPG_WATCHING state and are processed in handleCheckin.
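As a rough sketch of that flow (illustrative types and names only, not the exact fleet-server code): once the reported state is UPG_WATCHING, the partial-update document sent to Elasticsearch sets upgrade_attempts to JSON null.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types only; the real fleet-server models are richer.
type UpgradeDetails struct {
	State string `json:"state"`
}

type Agent struct {
	UpgradeDetails *UpgradeDetails `json:"upgrade_details"`
}

// clearUpgradeAttemptsIfWatching sketches the idea: once the agent
// reports UPG_WATCHING, the new binary is running and being watched,
// so the retry bookkeeping can be dropped. Writing nil into the
// partial-update doc marshals to JSON null, which clears the field.
func clearUpgradeAttemptsIfWatching(agent *Agent, doc map[string]interface{}) {
	if agent.UpgradeDetails != nil && agent.UpgradeDetails.State == "UPG_WATCHING" {
		doc["upgrade_attempts"] = nil
	}
}

func main() {
	agent := &Agent{UpgradeDetails: &UpgradeDetails{State: "UPG_WATCHING"}}
	doc := map[string]interface{}{}
	clearUpgradeAttemptsIfWatching(agent, doc)
	out, _ := json.Marshal(map[string]interface{}{"doc": doc})
	fmt.Println(string(out)) // {"doc":{"upgrade_attempts":null}}
}
```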

How to test this PR locally

This should be tested alongside elastic/kibana#212744 (or after it is merged - this is fine, since automatic upgrades are currently behind the enableAutomaticAgentUpgrades feature flag). With this change, agents upgraded through the automatic upgrade task should have their upgrade_attempts property set to null when the upgrade is successful.

Testing should also validate that upgrade_attempts stays set if the upgrade fails, e.g. after requesting an upgrade to an invalid version.
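One hedged way to verify this locally is to read the agent document straight out of Elasticsearch. The .fleet-agents index and the upgrade_attempts field come from this PR; the host, credentials, and agent ID below are placeholders for a local dev stack.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholders for a local dev stack; adjust credentials and agent ID.
	agentID := "<agent-id>"
	url := "http://elastic:changeme@localhost:9200/.fleet-agents/_doc/" + agentID

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// After a successful automatic upgrade, upgrade_attempts should be null;
	// after a failed one (e.g. an invalid target version), the attempt
	// timestamps should still be present.
	fmt.Println(string(body))
}
```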

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring they will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates https://github.com/elastic/ingest-dev/issues/4720

@jillguyonnet jillguyonnet added the enhancement New feature or request label Feb 28, 2025
@jillguyonnet jillguyonnet self-assigned this Feb 28, 2025
@mergify
Contributor

mergify bot commented Feb 28, 2025

This pull request does not have a backport label. Could you fix it @jillguyonnet? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch, where /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@jillguyonnet jillguyonnet added the backport-skip Skip notification from the automated backport with mergify label Feb 28, 2025
@cmacknz cmacknz added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 28, 2025
Member

@pchila pchila left a comment


Code looks sensible; holding off on approval until we have a green CI run. It seems we are getting some errors from ECH when creating a stack, and 503s when trying to clean it up (not sure if that's related to the failed creation).

michel-laterman
michel-laterman previously approved these changes Mar 3, 2025
Contributor

@michel-laterman michel-laterman left a comment


lgtm, merge when CI is green

@cmacknz
Member

cmacknz commented Mar 3, 2025

There is a test you should update to check that this field is properly reset:

name: "agent has details checkin details are nil",
agent: &model.Agent{ESDocument: esd, Agent: &model.AgentMetadata{ID: "test-agent"}, UpgradeDetails: &model.UpgradeDetails{}},
details: nil,
bulk: func() *ftesting.MockBulk {
mBulk := ftesting.NewMockBulk()
mBulk.On("Update", mock.Anything, dl.FleetAgents, "doc-ID", mock.MatchedBy(func(p []byte) bool {
doc := struct {
Doc map[string]interface{} `json:"doc"`
}{}
if err := json.Unmarshal(p, &doc); err != nil {
t.Logf("bulk match unmarshal error: %v", err)
return false
}
return doc.Doc[dl.FieldUpgradeDetails] == nil && doc.Doc[dl.FieldUpgradeStartedAt] == nil && doc.Doc[dl.FieldUpgradeStatus] == nil && doc.Doc[dl.FieldUpgradedAt] != ""
}), mock.Anything, mock.Anything).Return(nil)
return mBulk
},
cache: func() *testcache.MockCache {
return testcache.NewMockCache()
},
err: nil,
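A hedged sketch of the suggested update: the MatchedBy predicate gains one more condition, so the test only passes when upgrade_attempts is cleared too. The standalone program below mirrors that predicate over a raw payload; the field-name constants are assumptions standing in for fleet-server's dl constants (e.g. a dl.FieldUpgradeAttempts added by this PR).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed field names standing in for fleet-server's dl constants;
// verify against the actual code.
const (
	fieldUpgradeDetails  = "upgrade_details"
	fieldUpgradeAttempts = "upgrade_attempts"
	fieldUpgradedAt      = "upgraded_at"
)

// matches mirrors the MatchedBy predicate above, extended with the
// extra upgrade_attempts == nil condition the updated test would assert.
func matches(p []byte) bool {
	doc := struct {
		Doc map[string]interface{} `json:"doc"`
	}{}
	if err := json.Unmarshal(p, &doc); err != nil {
		return false
	}
	return doc.Doc[fieldUpgradeDetails] == nil &&
		doc.Doc[fieldUpgradeAttempts] == nil &&
		doc.Doc[fieldUpgradedAt] != ""
}

func main() {
	payload := []byte(`{"doc":{"upgrade_details":null,"upgrade_attempts":null,"upgraded_at":"2025-03-03T00:00:00Z"}}`)
	fmt.Println(matches(payload)) // true
}
```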

@jillguyonnet
Contributor Author

Upon more testing with elastic/kibana#212744, this is not behaving as expected: upgrade_attempts is cleared even after failed upgrades. 👀

@jillguyonnet
Contributor Author

jillguyonnet commented Mar 5, 2025

OK, my original approach didn't do what was expected, i.e. only reset upgrade_attempts if the upgrade is complete and successful. This is because in handleAck the upgrade is considered complete when done retrying, even if the outcome was failure. I was also under the impression that upgrade_details was updated as part of it, but I see that's actually handled in handleCheckin, which makes sense.

I have changed the approach to reset upgrade_attempts when the agent reports an upgrade_details.state value of UPG_WATCHING in handleCheckin. I could be missing some edge cases, but it seems to meet the primary expectation. My testing so far looks good.

@michel-laterman Would you be able to review this approach? If it looks OK, I will update the tests.

Edit: it seems we need to support agents with no upgrade details as well, so this won't be sufficient. Never mind.
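A minimal sketch of covering that extra case, reusing the illustrative types from the earlier sketch; this is an assumption about the direction raised in the edit, not the merged code.

```go
package main

import "fmt"

type UpgradeDetails struct {
	State string
}

// shouldClearUpgradeAttempts covers both cases raised in this thread:
// agents reporting UPG_WATCHING, and agents that send no upgrade
// details at all (e.g. older agents). Illustrative only.
func shouldClearUpgradeAttempts(details *UpgradeDetails) bool {
	if details == nil {
		// No upgrade details reported; don't let attempts linger.
		return true
	}
	return details.State == "UPG_WATCHING"
}

func main() {
	fmt.Println(shouldClearUpgradeAttempts(nil))                                    // true
	fmt.Println(shouldClearUpgradeAttempts(&UpgradeDetails{State: "UPG_WATCHING"})) // true
	fmt.Println(shouldClearUpgradeAttempts(&UpgradeDetails{State: "UPG_FAILED"}))   // false
}
```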

jillguyonnet added a commit to elastic/kibana that referenced this pull request Mar 6, 2025
## Summary

Relates elastic/ingest-dev#4720

This PR adds retry logic to the task that handles automatic agent
upgrades originally implemented in
#211019.

Complementary fleet-server change which sets the agent's
`upgrade_attempts` to `null` once the upgrade is complete:
elastic/fleet-server#4528

### Approach

- A new `upgrade_attempts` property is added to agents and stored in the
agent doc (ES mapping update in
elastic/elasticsearch#123256).
- When a bulk upgrade action is sent from the automatic upgrade task, it
pushes the timestamp of the upgrade to the affected agents'
`upgrade_attempts`.
- The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h',
'24h']` and can be overridden with the new
`xpack.fleet.autoUpgrades.retryDelays` setting.
- On every run, the automatic upgrade task will first process retries
and then query more agents if necessary (cf.
elastic/ingest-dev#4720 (comment)).
- Once an agent has exhausted the retries defined by the retry delays
array, it is no longer retried (a sketch of this scheduling follows this list).
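The retry scheduling itself is implemented in Kibana; purely as an illustration of the logic described above, here is a Go sketch that derives the next attempt time from the upgrade_attempts timestamps and the delay array (all names are illustrative, and the delay list is the default quoted above).

```go
package main

import (
	"fmt"
	"time"
)

// Default retry delays from the description above.
var retryDelays = []time.Duration{
	30 * time.Minute, 1 * time.Hour, 2 * time.Hour, 4 * time.Hour,
	8 * time.Hour, 16 * time.Hour, 24 * time.Hour,
}

// nextRetry decides, from the timestamps already pushed to
// upgrade_attempts, whether the agent is due for another attempt.
// It returns the zero time and false once the delays are exhausted.
func nextRetry(attempts []time.Time, now time.Time) (time.Time, bool) {
	if len(attempts) == 0 || len(attempts) > len(retryDelays) {
		// Never attempted (handled by the main query) or max retries used up.
		return time.Time{}, false
	}
	last := attempts[len(attempts)-1]
	due := last.Add(retryDelays[len(attempts)-1])
	return due, !due.After(now)
}

func main() {
	now := time.Now()
	attempts := []time.Time{now.Add(-45 * time.Minute)} // one attempt, 45m ago
	due, ready := nextRetry(attempts, now)
	fmt.Println(due, ready) // first delay is 30m, so this agent is ready
}
```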

### Testing

The ES query for fetching agents with existing `upgrade_attempts` needs
the updated mappings, so it might be necessary to pull the latest `main`
in the `elasticsearch` repo and run `yarn es source` instead of `yarn es
snapshot` (requires an up-to-date Java environment, currently 23).

In order to test that `upgrade_attempts` is set to `null` when the
upgrade is complete, fleet-server should be run in dev using the change
in elastic/fleet-server#4528.

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

### Identify risks

Low probability risk of incorrectly triggering agent upgrades. This
feature is currently behind the `enableAutomaticAgentUpgrades` feature
flag.

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
michel-laterman
michel-laterman previously approved these changes Mar 7, 2025
Contributor

@michel-laterman michel-laterman left a comment


lgtm

@juliaElastic
Contributor

@michel-laterman could you approve again? I had to fix the SonarQube failure.

Contributor

@blakerouse blakerouse left a comment


Looks good.

@juliaElastic juliaElastic merged commit 2b40416 into elastic:main Mar 10, 2025
9 checks passed
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Mar 22, 2025
juliaElastic added a commit to juliaElastic/kibana that referenced this pull request Apr 8, 2025
@juliaElastic juliaElastic added backport-8.x Automated backport to the 8.x branch with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Apr 8, 2025
mergify bot pushed a commit that referenced this pull request Apr 8, 2025
* Clear agent.upgrade_attemps on upgrade complete

* This actually works

* Silence nolintlint error in handleCheckin.go

* Remove nolint comment altogether

* Add changelog

* Update handleCheckin unit test

* Change approach

* Revert unit test change

* This seems needed

* Run make generate

* Remove internal link

* add unit test

* reduce complexity

* return nil if action is nil

---------

Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: Julia Bardi <[email protected]>
(cherry picked from commit 2b40416)
juliaElastic pushed a commit that referenced this pull request Apr 8, 2025
(cherry picked from commit 2b40416)

Co-authored-by: Jill Guyonnet <[email protected]>
juliaElastic pushed a commit that referenced this pull request Apr 8, 2025
(cherry picked from commit 2b40416)

Co-authored-by: Jill Guyonnet <[email protected]>
juliaElastic added a commit that referenced this pull request Apr 9, 2025
* Clear upgrade_attempts on handleAck (#4762)

* clear upgrade_attempts on handleAck

* clear upgrade_attempts if upgrade_details is missing

* added unit test

(cherry picked from commit fb093cc)

* Clear agent.upgrade_attempts on upgrade complete (#4528) (#4777)

(cherry picked from commit 2b40416)

Co-authored-by: Jill Guyonnet <[email protected]>

---------

Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Jill Guyonnet <[email protected]>