From 4e68b23e8a4b9903b830c79f51b3082572d3e14a Mon Sep 17 00:00:00 2001
From: Karen Metts
Date: Thu, 18 Sep 2025 16:21:13 -0400
Subject: [PATCH 1/4] Doc: Add known issue 9.0.77: Agent stuck on failed upgrade

---
 docs/release-notes/known-issues.md | 31 ++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/docs/release-notes/known-issues.md b/docs/release-notes/known-issues.md
index d25cbfb652a..70a1afb20df 100644
--- a/docs/release-notes/known-issues.md
+++ b/docs/release-notes/known-issues.md
@@ -23,6 +23,37 @@ Known issues are significant defects or limitations that may impact your impleme
 % Workaround description.
 % :::
 
+
+:::{dropdown} Failed upgrades leave {{agent}} stuck until restart
+
+**Applies to: {{agent}} 8.18.7, 9.0.7**
+
+On September 17, 2025, a known issue was discovered that can cause {{agent}} upgrades to get stuck if an upgrade attempt fails early. This happens because the coordinator’s overrideState remains set, leaving the agent in a state that appears to be upgrading.
+
+**Conditions**
+
+This issue is triggered if the upgrade fails during one of the early checks inside Coordinator.Upgrade, for example:
+
+- The agent is not upgradeable
+- Capabilities check denies the upgrade
+- Most commonly: When {{agent}} is tamper-protected and Endpoint returns an error during action proxying, for example, because the upgrade action signature is invalid, missing, or fails verification. This causes the coordinator’s override state to be stuck.
+
+**Symptoms**
+
+- {{fleet}} shows the upgrade action in progress, even though the upgrade remains stuck
+- No further upgrade attempts succeed
+- Elastic-agent status shows an override state indicating upgrade
+
+**Workaround**
+
+Restart the {{agent}} to clear the coordinator’s overrideState and allow new upgrade attempts to proceed.
+
+**Resolution**
+This issue was fixed in [#9992](https://github.com/elastic/elastic-agent/pull/9992), which ensures that the coordinator clears its override state whenever an early failure occurs.
+
+The fix will be included in versions 9.1.4, 8.19.4, 9.0.8, and 8.18.8.
+:::
+
 :::{dropdown} [Windows] {{agent}} does not process Windows security events
 
 **Applies to: {{agent}} 8.19.0, 9.1.0 (Windows only)**

From 8359cda0e36870cc9ba83421157253f7054be6d0 Mon Sep 17 00:00:00 2001
From: Karen Metts <35154725+karenzone@users.noreply.github.com>
Date: Fri, 19 Sep 2025 15:45:58 -0400
Subject: [PATCH 2/4] Apply suggestions from code review

Co-authored-by: Colleen McGinnis
---
 docs/release-notes/known-issues.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/release-notes/known-issues.md b/docs/release-notes/known-issues.md
index 70a1afb20df..979d1b39229 100644
--- a/docs/release-notes/known-issues.md
+++ b/docs/release-notes/known-issues.md
@@ -28,30 +28,30 @@ Known issues are significant defects or limitations that may impact your impleme
 
 **Applies to: {{agent}} 8.18.7, 9.0.7**
 
-On September 17, 2025, a known issue was discovered that can cause {{agent}} upgrades to get stuck if an upgrade attempt fails early. This happens because the coordinator’s overrideState remains set, leaving the agent in a state that appears to be upgrading.
+On September 17, 2025, a known issue was discovered that can cause {{agent}} upgrades to get stuck if an upgrade attempt fails early. This happens because the coordinator’s `overrideState` remains set, leaving the agent in a state that appears to be upgrading.
 
 **Conditions**
 
-This issue is triggered if the upgrade fails during one of the early checks inside Coordinator.Upgrade, for example:
+This issue is triggered if the upgrade fails during one of the early checks inside `Coordinator.Upgrade`, for example:
 
 - The agent is not upgradeable
 - Capabilities check denies the upgrade
-- Most commonly: When {{agent}} is tamper-protected and Endpoint returns an error during action proxying, for example, because the upgrade action signature is invalid, missing, or fails verification. This causes the coordinator’s override state to be stuck.
+- When {agent} is tamper-protected, Endpoint must validate that the upgrade action was correctly signed by Kibana to allow the upgrade. If the signature is missing, invalid, or the connection between {agent} and Endpoint was interrupted, the validation fails. This causes the agent coordinator's override state to become stuck until the agent is restarted.
 
 **Symptoms**
 
 - {{fleet}} shows the upgrade action in progress, even though the upgrade remains stuck
 - No further upgrade attempts succeed
-- Elastic-agent status shows an override state indicating upgrade
+- Elastic Agent status shows an override state indicating upgrade
 
 **Workaround**
 
-Restart the {{agent}} to clear the coordinator’s overrideState and allow new upgrade attempts to proceed.
+Restart the {{agent}} to clear the coordinator’s `overrideState` and allow new upgrade attempts to proceed.
 
 **Resolution**
 This issue was fixed in [#9992](https://github.com/elastic/elastic-agent/pull/9992), which ensures that the coordinator clears its override state whenever an early failure occurs.
 
-The fix will be included in versions 9.1.4, 8.19.4, 9.0.8, and 8.18.8.
+The fix is included in versions 9.1.4 and 8.19.4, and planned for versions 9.0.8 and 8.18.8.
 :::
 
 :::{dropdown} [Windows] {{agent}} does not process Windows security events

From cefb21c2abc4302f6f585071be6cd87eb8ef7879 Mon Sep 17 00:00:00 2001
From: Karen Metts <35154725+karenzone@users.noreply.github.com>
Date: Fri, 19 Sep 2025 15:48:13 -0400
Subject: [PATCH 3/4] Port over review comments from 8.x known issue review

---
 docs/release-notes/known-issues.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/release-notes/known-issues.md b/docs/release-notes/known-issues.md
index 979d1b39229..d2e12d73a02 100644
--- a/docs/release-notes/known-issues.md
+++ b/docs/release-notes/known-issues.md
@@ -28,7 +28,7 @@ Known issues are significant defects or limitations that may impact your impleme
 
 **Applies to: {{agent}} 8.18.7, 9.0.7**
 
-On September 17, 2025, a known issue was discovered that can cause {{agent}} upgrades to get stuck if an upgrade attempt fails early. This happens because the coordinator’s `overrideState` remains set, leaving the agent in a state that appears to be upgrading.
+On September 17, 2025, a known issue was discovered that can cause {{agent}} upgrades to get stuck if an upgrade attempt fails under specific conditions. This happens because the coordinator’s `overrideState` remains set, leaving the agent in a state that appears to be upgrading.
 
 **Conditions**
 

From 987ee256de4560d6962d994e940db2f85f2ef43a Mon Sep 17 00:00:00 2001
From: Karen Metts <35154725+karenzone@users.noreply.github.com>
Date: Fri, 19 Sep 2025 16:01:38 -0400
Subject: [PATCH 4/4] Use MD format for attribute

---
 docs/release-notes/known-issues.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/release-notes/known-issues.md b/docs/release-notes/known-issues.md
index d2e12d73a02..bf28749231b 100644
--- a/docs/release-notes/known-issues.md
+++ b/docs/release-notes/known-issues.md
@@ -36,7 +36,7 @@ This issue is triggered if the upgrade fails during one of the early checks insi
 
 - The agent is not upgradeable
 - Capabilities check denies the upgrade
-- When {agent} is tamper-protected, Endpoint must validate that the upgrade action was correctly signed by Kibana to allow the upgrade. If the signature is missing, invalid, or the connection between {agent} and Endpoint was interrupted, the validation fails. This causes the agent coordinator's override state to become stuck until the agent is restarted.
+- When {{agent}} is tamper-protected, Endpoint must validate that the upgrade action was correctly signed by Kibana to allow the upgrade. If the signature is missing, invalid, or the connection between {{agent}} and Endpoint was interrupted, the validation fails. This causes the agent coordinator's override state to become stuck until the agent is restarted.
 
 **Symptoms**
 