@juliaElastic
Contributor

@juliaElastic juliaElastic commented Apr 2, 2025

What is the problem this PR solves?

upgrade_attempts was not cleared correctly when the agent doesn't have upgrade_details (for example, horde agents or older agent versions).

How does this PR solve the problem?

Clear upgrade_attempts when the upgrade is acked (at the same time as the upgrade_started_at field is cleared).

How to test this PR locally

Test with an agent policy that has an auto upgrade config and a few horde agents enrolled. Verify that after the upgrade completes, upgrade_attempts is set to null.
On ack, the upgrade_attempts field is only cleared if the agent has no upgrade_details. This was tested with a real agent upgraded to a non-existent version, which ended up in the UPG_FAILED state.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates #4528
Relates elastic/kibana#212744

@juliaElastic juliaElastic added the bug (Something isn't working) label Apr 2, 2025
@juliaElastic juliaElastic self-assigned this Apr 2, 2025
@juliaElastic juliaElastic requested a review from a team as a code owner April 2, 2025 11:56
@juliaElastic juliaElastic requested review from kaanyalti and pchila April 2, 2025 11:56
@mergify
Contributor

mergify bot commented Apr 2, 2025

This pull request does not have a backport label. Could you fix it @juliaElastic? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@juliaElastic juliaElastic added the backport-skip (Skip notification from the automated backport with mergify) and enhancement (New feature or request) labels and removed the bug (Something isn't working) label Apr 2, 2025
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team) label Apr 2, 2025
@juliaElastic
Contributor Author

@jillguyonnet I saw this comment, do you know the quickest way to reproduce the failed upgrade? I tried upgrading to a non-existent version, but the retries take a long time (about 2 hours).

@jillguyonnet
Contributor

> @jillguyonnet I saw this #4528 (comment), do you know the quickest way to reproduce the failed upgrade? I tried upgrading to a non-existent version, but the retries take a long time (about 2 hours).

Strange, IIRC that's also what I did. I don't remember the agent retrying for that long (essentially it went into UPG_FAILED pretty quickly, maybe ~10 min).

@juliaElastic
Contributor Author

> > @jillguyonnet I saw this #4528 (comment), do you know the quickest way to reproduce the failed upgrade? I tried upgrading to a non-existent version, but the retries take a long time (about 2 hours).
>
> Strange, IIRC that's also what I did. I don't remember the agent retrying for that long (essentially it went into UPG_FAILED pretty quickly, maybe ~10 min).

Thanks, I'm seeing now the UPG_FAILED after 15 mins, I wasn't sure when it would happen as upgrade_details had retry_until in 2 hours.

@jillguyonnet
Contributor

> Thanks, I'm seeing now the UPG_FAILED after 15 mins, I wasn't sure when it would happen as upgrade_details had retry_until in 2 hours.

Yeah, I did notice that as well, but it transitioned into UPG_FAILED much quicker. Not sure if that's expected, come to think of it.

@juliaElastic juliaElastic merged commit fb093cc into elastic:main Apr 7, 2025
9 checks passed
@juliaElastic juliaElastic added the backport-8.x (Automated backport to the 8.x branch with mergify) label and removed the backport-skip (Skip notification from the automated backport with mergify) label Apr 8, 2025
mergify bot pushed a commit that referenced this pull request Apr 8, 2025
* clear upgrade_attempts on handleAck

* clear upgrade_attempts if upgrade_details is missing

* added unit test

(cherry picked from commit fb093cc)
juliaElastic added a commit that referenced this pull request Apr 9, 2025
* Clear upgrade_attempts on handleAck (#4762)

* clear upgrade_attempts on handleAck

* clear upgrade_attempts if upgrade_details is missing

* added unit test

(cherry picked from commit fb093cc)

* Clear agent.upgrade_attempts on upgrade complete (#4528) (#4777)

* Clear agent.upgrade_attempts on upgrade complete

* This actually works

* Silence nolintlint error in handleCheckin.go

* Remove nolint comment altogether

* Add changelog

* Update handleCheckin unit test

* Change approach

* Revert unit test change

* This seems needed

* Run make generate

* Remove internal link

* add unit test

* reduce complexity

* return nil if action is nil

---------

Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: Julia Bardi <[email protected]>
(cherry picked from commit 2b40416)

Co-authored-by: Jill Guyonnet <[email protected]>

---------

Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Jill Guyonnet <[email protected]>

Labels

  • backport-8.x (Automated backport to the 8.x branch with mergify)
  • enhancement (New feature or request)
  • Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team)

4 participants