
Conversation

@jillguyonnet
Contributor

@jillguyonnet jillguyonnet commented Feb 28, 2025

What is the problem this PR solves?

elastic/kibana#212744 adds retry logic to the task that automatically upgrades agents. Agents upgraded through this task have their new upgrade_attempts property populated, but that PR is missing a way to clear this property when the upgrade completes successfully.

How does this PR solve the problem?

The change in this PR clears upgrade_attempts when the agent's upgrade details enter the UPG_WATCHING state and are processed in handleCheckin.
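As a rough sketch of that flow (illustrative types and names only, not the exact fleet-server code): once the reported state is UPG_WATCHING, the partial-update document sent to Elasticsearch sets upgrade_attempts to JSON null.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types only; the real fleet-server models are richer.
type UpgradeDetails struct {
	State string `json:"state"`
}

type Agent struct {
	UpgradeDetails *UpgradeDetails `json:"upgrade_details"`
}

// clearUpgradeAttemptsIfWatching sketches the idea: once the agent
// reports UPG_WATCHING, the new binary is running and being watched,
// so the retry bookkeeping can be dropped. Writing nil into the
// partial-update doc marshals to JSON null, which clears the field.
func clearUpgradeAttemptsIfWatching(agent *Agent, doc map[string]interface{}) {
	if agent.UpgradeDetails != nil && agent.UpgradeDetails.State == "UPG_WATCHING" {
		doc["upgrade_attempts"] = nil
	}
}

func main() {
	agent := &Agent{UpgradeDetails: &UpgradeDetails{State: "UPG_WATCHING"}}
	doc := map[string]interface{}{}
	clearUpgradeAttemptsIfWatching(agent, doc)
	out, _ := json.Marshal(map[string]interface{}{"doc": doc})
	fmt.Println(string(out)) // {"doc":{"upgrade_attempts":null}}
}
```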

How to test this PR locally

This should be tested alongside elastic/kibana#212744 (or after it is merged - this is fine, since automatic upgrades are currently behind the enableAutomaticAgentUpgrades feature flag). With this change, agents upgraded through the automatic upgrade task should have their upgrade_attempts property set to null when the upgrade is successful.

Testing should also validate that upgrade_attempts stays set if the upgrade fails, e.g. after requesting an upgrade to an invalid version.
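One hedged way to verify this locally is to read the agent document straight out of Elasticsearch. The .fleet-agents index and the upgrade_attempts field come from this PR; the host, credentials, and agent ID below are placeholders for a local dev stack.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholders for a local dev stack; adjust credentials and agent ID.
	agentID := "<agent-id>"
	url := "http://elastic:changeme@localhost:9200/.fleet-agents/_doc/" + agentID

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// After a successful automatic upgrade, upgrade_attempts should be null;
	// after a failed one (e.g. an invalid target version), the attempt
	// timestamps should still be present.
	fmt.Println(string(body))
}
```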

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring they will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

Relates https://github.com/elastic/ingest-dev/issues/4720

@jillguyonnet jillguyonnet added the enhancement New feature or request label Feb 28, 2025
@jillguyonnet jillguyonnet self-assigned this Feb 28, 2025
@mergify
Contributor

mergify bot commented Feb 28, 2025

This pull request does not have a backport label. Could you fix it @jillguyonnet? 🙏
To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch, where /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@jillguyonnet jillguyonnet added the backport-skip Skip notification from the automated backport with mergify label Feb 28, 2025
@cmacknz cmacknz added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 28, 2025
Member

@pchila pchila left a comment


Code looks sensible; holding off on approval until we have a green CI run. It seems we are getting some errors from ECH when creating a stack, and 503s when trying to clean it up (not sure if that's related to the failed creation).

michel-laterman
michel-laterman previously approved these changes Mar 3, 2025
Contributor

@michel-laterman michel-laterman left a comment


lgtm, merge when CI is green

@cmacknz
Member

cmacknz commented Mar 3, 2025

There is a test you should update to check that this field is properly reset:

name: "agent has details checkin details are nil",
agent: &model.Agent{ESDocument: esd, Agent: &model.AgentMetadata{ID: "test-agent"}, UpgradeDetails: &model.UpgradeDetails{}},
details: nil,
bulk: func() *ftesting.MockBulk {
mBulk := ftesting.NewMockBulk()
mBulk.On("Update", mock.Anything, dl.FleetAgents, "doc-ID", mock.MatchedBy(func(p []byte) bool {
doc := struct {
Doc map[string]interface{} `json:"doc"`
}{}
if err := json.Unmarshal(p, &doc); err != nil {
t.Logf("bulk match unmarshal error: %v", err)
return false
}
return doc.Doc[dl.FieldUpgradeDetails] == nil && doc.Doc[dl.FieldUpgradeStartedAt] == nil && doc.Doc[dl.FieldUpgradeStatus] == nil && doc.Doc[dl.FieldUpgradedAt] != ""
}), mock.Anything, mock.Anything).Return(nil)
return mBulk
},
cache: func() *testcache.MockCache {
return testcache.NewMockCache()
},
err: nil,
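A hedged sketch of the suggested update: the MatchedBy predicate gains one more condition, so the test only passes when upgrade_attempts is cleared too. The standalone program below mirrors that predicate over a raw payload; the field-name constants are assumptions standing in for fleet-server's dl constants (e.g. a dl.FieldUpgradeAttempts added by this PR).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Assumed field names standing in for fleet-server's dl constants;
// verify against the actual code.
const (
	fieldUpgradeDetails  = "upgrade_details"
	fieldUpgradeAttempts = "upgrade_attempts"
	fieldUpgradedAt      = "upgraded_at"
)

// matches mirrors the MatchedBy predicate above, extended with the
// extra upgrade_attempts == nil condition the updated test would assert.
func matches(p []byte) bool {
	doc := struct {
		Doc map[string]interface{} `json:"doc"`
	}{}
	if err := json.Unmarshal(p, &doc); err != nil {
		return false
	}
	return doc.Doc[fieldUpgradeDetails] == nil &&
		doc.Doc[fieldUpgradeAttempts] == nil &&
		doc.Doc[fieldUpgradedAt] != ""
}

func main() {
	payload := []byte(`{"doc":{"upgrade_details":null,"upgrade_attempts":null,"upgraded_at":"2025-03-03T00:00:00Z"}}`)
	fmt.Println(matches(payload)) // true
}
```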

@jillguyonnet
Contributor Author

Upon more testing with elastic/kibana#212744, this is not behaving as expected: upgrade_attempts is cleared even after failed upgrades. 👀

@jillguyonnet
Contributor Author

jillguyonnet commented Mar 5, 2025

OK, my original approach didn't do what was expected, i.e. only reset upgrade_attempts if the upgrade is complete and successful. This is because in handleAck the upgrade is considered complete when done retrying, even if the outcome was failure. I was also under the impression that upgrade_details was updated as part of it, but I see that's actually handled in handleCheckin, which makes sense.

I have changed the approach to reset upgrade_attempts when the agent reports an upgrade_details.state value of UPG_WATCHING in handleCheckin. I could be missing some edge cases, but it seems to meet the primary expectation. My testing so far looks good.

@michel-laterman Would you be able to review this approach? If it looks OK, I will update the tests.

Edit: it seems we need to support agents with no upgrade details as well, so this won't be sufficient. Never mind.
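A minimal sketch of covering that extra case, reusing the illustrative types from the earlier sketch; this is an assumption about the direction raised in the edit, not the merged code.

```go
package main

import "fmt"

type UpgradeDetails struct {
	State string
}

// shouldClearUpgradeAttempts covers both cases raised in this thread:
// agents reporting UPG_WATCHING, and agents that send no upgrade
// details at all (e.g. older agents). Illustrative only.
func shouldClearUpgradeAttempts(details *UpgradeDetails) bool {
	if details == nil {
		// No upgrade details reported; don't let attempts linger.
		return true
	}
	return details.State == "UPG_WATCHING"
}

func main() {
	fmt.Println(shouldClearUpgradeAttempts(nil))                                    // true
	fmt.Println(shouldClearUpgradeAttempts(&UpgradeDetails{State: "UPG_WATCHING"})) // true
	fmt.Println(shouldClearUpgradeAttempts(&UpgradeDetails{State: "UPG_FAILED"}))   // false
}
```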

jillguyonnet added a commit to elastic/kibana that referenced this pull request Mar 6, 2025
## Summary

Relates elastic/ingest-dev#4720

This PR adds retry logic to the task that handles automatic agent
upgrades originally implemented in
#211019.

Complementary fleet-server change which sets the agent's
`upgrade_attempts` to `null` once the upgrade is complete:
elastic/fleet-server#4528

### Approach

- A new `upgrade_attempts` property is added to agents and stored in the
agent doc (ES mapping update in
elastic/elasticsearch#123256).
- When a bulk upgrade action is sent from the automatic upgrade task, it
pushes the timestamp of the upgrade to the affected agents'
`upgrade_attempts`.
- The default retry delays are `['30m', '1h', '2h', '4h', '8h', '16h',
'24h']` and can be overridden with the new
`xpack.fleet.autoUpgrades.retryDelays` setting.
- On every run, the automatic upgrade task will first process retries
and then query more agents if necessary (cf.
elastic/ingest-dev#4720 (comment)).
- Once an agent has exhausted the retries defined by the retry delays
array, it is no longer retried (a sketch of this scheduling follows this list).
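The retry scheduling itself is implemented in Kibana; purely as an illustration of the logic described above, here is a Go sketch that derives the next attempt time from the upgrade_attempts timestamps and the delay array (all names are illustrative, and the delay list is the default quoted above).

```go
package main

import (
	"fmt"
	"time"
)

// Default retry delays from the description above.
var retryDelays = []time.Duration{
	30 * time.Minute, 1 * time.Hour, 2 * time.Hour, 4 * time.Hour,
	8 * time.Hour, 16 * time.Hour, 24 * time.Hour,
}

// nextRetry decides, from the timestamps already pushed to
// upgrade_attempts, whether the agent is due for another attempt.
// It returns the zero time and false once the delays are exhausted.
func nextRetry(attempts []time.Time, now time.Time) (time.Time, bool) {
	if len(attempts) == 0 || len(attempts) > len(retryDelays) {
		// Never attempted (handled by the main query) or max retries used up.
		return time.Time{}, false
	}
	last := attempts[len(attempts)-1]
	due := last.Add(retryDelays[len(attempts)-1])
	return due, !due.After(now)
}

func main() {
	now := time.Now()
	attempts := []time.Time{now.Add(-45 * time.Minute)} // one attempt, 45m ago
	due, ready := nextRetry(attempts, now)
	fmt.Println(due, ready) // first delay is 30m, so this agent is ready
}
```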

### Testing

The ES query for fetching agents with existing `upgrade_attempts` needs
the updated mappings, so it might be necessary to pull the latest `main`
in the `elasticsearch` repo and run `yarn es source` instead of `yarn es
snapshot` (requires an up-to-date Java environment, currently 23).

In order to test that `upgrade_attempts` is set to `null` when the
upgrade is complete, fleet-server should be run in dev using the change
in elastic/fleet-server#4528.

### Checklist

- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] The PR description includes the appropriate Release Notes section,
and the correct `release_note:*` label is applied per the
[guidelines](https://www.elastic.co/guide/en/kibana/master/contributing.html#kibana-release-notes-process)

### Identify risks

Low probability risk of incorrectly triggering agent upgrades. This
feature is currently behind the `enableAutomaticAgentUpgrades` feature
flag.

---------

Co-authored-by: kibanamachine <[email protected]>
Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
michel-laterman
michel-laterman previously approved these changes Mar 7, 2025
Contributor

@michel-laterman michel-laterman left a comment


lgtm

@juliaElastic
Contributor

@michel-laterman could you approve again? I had to fix the SonarQube failure.

Contributor

@blakerouse blakerouse left a comment


Looks good.

@juliaElastic juliaElastic merged commit 2b40416 into elastic:main Mar 10, 2025
9 checks passed
CAWilson94 pushed a commit to CAWilson94/kibana that referenced this pull request Mar 22, 2025
juliaElastic added a commit to juliaElastic/kibana that referenced this pull request Apr 8, 2025
@juliaElastic juliaElastic added backport-8.x Automated backport to the 8.x branch with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Apr 8, 2025
mergify bot pushed a commit that referenced this pull request Apr 8, 2025
* Clear agent.upgrade_attemps on upgrade complete

* This actually works

* Silence nolintlint error in handleCheckin.go

* Remove nolint comment altogether

* Add changelog

* Update handleCheckin unit test

* Change approach

* Revert unit test change

* This seems needed

* Run make generate

* Remove internal link

* add unit test

* reduce complexity

* return nil if action is nil

---------

Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: Julia Bardi <[email protected]>
(cherry picked from commit 2b40416)
juliaElastic pushed a commit that referenced this pull request Apr 8, 2025
(cherry picked from commit 2b40416)

Co-authored-by: Jill Guyonnet <[email protected]>
juliaElastic pushed a commit that referenced this pull request Apr 8, 2025
(cherry picked from commit 2b40416)

Co-authored-by: Jill Guyonnet <[email protected]>
juliaElastic added a commit that referenced this pull request Apr 9, 2025
* Clear upgrade_attempts on handleAck (#4762)

* clear upgrade_attempts on handleAck

* clear upgrade_attempts if upgrade_details is missing

* added unit test

(cherry picked from commit fb093cc)

* Clear agent.upgrade_attempts on upgrade complete (#4528) (#4777)

(cherry picked from commit 2b40416)

Co-authored-by: Jill Guyonnet <[email protected]>

---------

Co-authored-by: Julia Bardi <[email protected]>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Jill Guyonnet <[email protected]>