@nkvoll nkvoll commented Sep 1, 2025

What does this PR do?

This PR adds the aggregated status of the agent node to the liveness health check.

As a bonus, it also adds status code assertions to the tests, which were missing before (all liveness/readiness tests were passing without any assertions).
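To illustrate the kind of assertion that was missing, here is a minimal sketch, not the code from this PR. `fakeLiveness`, `statusFor` and `checkStatus` are hypothetical names made up for the example, and the handler is a stand-in for the real liveness endpoint. A test that only sends the request passes no matter what comes back; comparing the recorded status code is what makes it fail on a regression.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// fakeLiveness is a hypothetical stand-in for the real liveness handler.
func fakeLiveness(healthy bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !healthy {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor sends a GET to the handler in-process and returns the recorded status code.
func statusFor(h http.Handler, target string) int {
	req := httptest.NewRequest(http.MethodGet, target, nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

// checkStatus is the explicit assertion: it returns an error when the code differs.
func checkStatus(h http.Handler, target string, want int) error {
	if got := statusFor(h, target); got != want {
		return fmt.Errorf("GET %s: got HTTP %d, want %d", target, got, want)
	}
	return nil
}

func main() {
	// Unhealthy handler, 500 expected: the assertion passes and returns nil.
	fmt.Println(checkStatus(fakeLiveness(false), "/liveness?failon=failed", http.StatusInternalServerError))
	// Healthy handler, 500 expected: the assertion reports the mismatch.
	fmt.Println(checkStatus(fakeLiveness(true), "/liveness", http.StatusInternalServerError))
}
```

In an actual test the same comparison would go through `*testing.T` (for example `t.Errorf`) rather than a returned error.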

Why is it important?

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • I have made corresponding changes to the default configuration files.
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

Liveness probes will now fail if the configuration is invalid, likely causing the container to be restarted (see https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/#liveness-probe).
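For context on why a failing probe leads to a restart, here is a simplified model of how a Kubernetes HTTP liveness probe is evaluated. It is a sketch under stated assumptions, not kubelet code: `probeSucceeded` and `restartTriggered` are hypothetical helper names, and the model ignores timeouts, periods and initial delays. It relies only on two documented rules: any status code from 200 up to (but not including) 400 counts as success, and the container is restarted after `failureThreshold` consecutive failures (default 3).

```go
package main

import "fmt"

// probeSucceeded applies the Kubernetes HTTP probe rule: 2xx and 3xx are successes.
func probeSucceeded(statusCode int) bool {
	return statusCode >= 200 && statusCode < 400
}

// restartTriggered reports whether the sequence of probe results contains
// failureThreshold consecutive failures. Hypothetical helper for illustration.
func restartTriggered(codes []int, failureThreshold int) bool {
	consecutive := 0
	for _, code := range codes {
		if probeSucceeded(code) {
			consecutive = 0 // a success resets the failure streak
			continue
		}
		consecutive++
		if consecutive >= failureThreshold {
			return true
		}
	}
	return false
}

func main() {
	const failureThreshold = 3 // Kubernetes default

	// An agent whose configuration becomes invalid answers 500 on every probe.
	fmt.Println(restartTriggered([]int{200, 500, 500, 500}, failureThreshold))
	// A transient failure followed by a success does not reach the threshold.
	fmt.Println(restartTriggered([]int{500, 200, 500, 500}, failureThreshold))
}
```

Under this model, an agent with a persistently invalid configuration reaches the threshold after three probe periods, and the restart does not help if the same configuration is loaded again.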

How to test this PR locally

  1. Create an elastic-agent.yml file with an invalid output, e.g. set use_output: nonexistent on an input.
  2. Start elastic-agent with relevant monitoring endpoints enabled.
  3. Verify that the agent reports a FAILED status with elastic-agent status:
┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (FAILED) Invalid component model: failed to render components: invalid 'inputs.0.use_output', references an unknown output 'nonexistent'
   └─ info
      ├─ id: e1a1e08b-9b0c-4394-a024-d35b823d415b
      ├─ version: 9.2.0
      └─ commit: ff80471809aca1f2280ce55f0e24f85cefec5d55
  4. Liveness probes should fail when failon is degraded or failed; failon=heartbeat still returns 200:
$ curl -w 'HTTP %{http_code}\n' 'http://localhost:6792/liveness?failon=degraded'
HTTP 500
$ curl -w 'HTTP %{http_code}\n' 'http://localhost:6792/liveness?failon=failed'
HTTP 500
$ curl -w 'HTTP %{http_code}\n' 'http://localhost:6792/liveness?failon=heartbeat'
HTTP 200
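The responses above can be modeled with a small sketch. This is an illustration of the intended behavior, not the handler from this PR: `State`, `Snapshot`, `trips` and `livenessCode` are hypothetical names, and the coordinator heartbeat check itself is left out. The point is that the aggregated agent state is evaluated in addition to the component states, so a FAILED agent with no running components still fails the probe.

```go
package main

import (
	"fmt"
	"net/http"
)

// State is a hypothetical stand-in for the agent's health states.
type State int

const (
	Healthy State = iota
	Degraded
	Failed
)

// Snapshot is a hypothetical view of the agent: its aggregated state
// plus the state of each component.
type Snapshot struct {
	Agent      State
	Components []State
}

// trips reports whether a state should fail the probe for a given failon value.
// Any value other than "degraded" or "failed" is treated as heartbeat-only here,
// which is an assumption of this sketch.
func trips(s State, failon string) bool {
	switch failon {
	case "degraded":
		return s == Degraded || s == Failed
	case "failed":
		return s == Failed
	default:
		return false
	}
}

// livenessCode checks the aggregated agent state first, then every component.
func livenessCode(snap Snapshot, failon string) int {
	if trips(snap.Agent, failon) {
		return http.StatusInternalServerError
	}
	for _, c := range snap.Components {
		if trips(c, failon) {
			return http.StatusInternalServerError
		}
	}
	return http.StatusOK
}

func main() {
	// Invalid configuration: the agent itself is FAILED and no components exist.
	snap := Snapshot{Agent: Failed}
	for _, mode := range []string{"degraded", "failed", "heartbeat"} {
		fmt.Printf("failon=%s -> HTTP %d\n", mode, livenessCode(snap, mode))
	}
}
```

With only the component loop and no check on `snap.Agent`, the same snapshot would return 200 for every mode, which is the behavior described in the linked issue.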

Related issues

@nkvoll nkvoll requested a review from a team as a code owner September 1, 2025 13:44
@mergify mergify bot assigned nkvoll Sep 1, 2025
mergify bot commented Sep 1, 2025

This pull request does not have a backport label. Could you fix it @nkvoll? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@@ -76,12 +77,10 @@ func livenessHandler(coord CoordinatorState) func(http.ResponseWriter, *http.Req
return fmt.Errorf("error handling form values: %w", err)
}

// if user has requested `coordinator` mode, just revert to that, skip everything else
if !failConfig.Degraded && !failConfig.Failed && failConfig.Heartbeat {
Member Author

I removed this part as it's already covered by lines 70-73, without the encapsulating if-statement.

nkvoll commented Sep 1, 2025

From my testing, if the agent starts up in this state, it doesn't seem to start any components, but if the configuration is edited while the agent is running, it keeps all existing components as-is.

This makes me wonder if what currently happens in the liveness endpoint should be happening in the readiness endpoint instead. Worth discussing? /cc @cmacknz @blakerouse

@elasticmachine

💛 Build succeeded, but was flaky

Failed CI Steps

cc @nkvoll

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Sep 2, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Successfully merging this pull request may close these issues.

Liveness endpoint does not consider overall agent state, only component state
3 participants