Add /readiness and /liveness when enrolling with the container #9612

blakerouse · 2025-08-27T20:04:10Z

What does this PR do?

Starts the monitoring endpoint when Elastic Agent is running as a container and has enrollment enabled. This allows the health checks in Kubernetes to succeed even while the agent is still enrolling into Fleet. If enrollment takes longer than the Kubernetes health checks allow, the pod is not reported as healthy, which can cause it to be killed before it ever enrolls.

This also adds a simple /readiness endpoint, so that being ready is reported separately from being alive.

Why is it important?

This ensures that health checks do not prevent enrollment from actually working. If enrollment itself fails, the container still exits and restarts, so that behavior is unchanged.

Checklist

I have read and understood the pull request guidelines of this project.
My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
~~[ ] I have added an integration test or an E2E test~~

Disruptive User Impact

None

How to test this PR locally

$ docker run -it --rm -p 5066:5066 -e FLEET_ENROLL=1 -e FLEET_URL=https://invalid-url:443 -e FLEET_ENROLLMENT_TOKEN=invalid-token ${build_image}
$ curl -v http://localhost:5066/readiness
$ curl -v http://localhost:5066/liveness

Related issues

Closes #9611 (Liveness endpoint is not present when the container is enrolling)

mergify · 2025-08-27T20:04:48Z

This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

elasticmachine · 2025-08-28T00:30:29Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elasticmachine · 2025-08-28T02:33:36Z

💚 Build Succeeded

Buildkite Build
Commit: 726fb96

History

💔 Build #25827 failed 043e710

cc @blakerouse

pchila

Left a couple of comments, mostly for clarification.
Reading the PR description, I cannot help but wonder if keeping the elastic-agent alive long enough to complete enrollment should be implemented with a startup probe instead since readiness and liveness probes will not start until the startupProbe succeeds.
Maybe @pkoutsovasilis can weigh in on that as well

internal/pkg/agent/application/monitoring/liveness.go

internal/pkg/agent/application/monitoring/server_test.go

blakerouse · 2025-08-28T12:04:20Z

> Left a couple of comments, mostly for clarification. Reading the PR description, I cannot help but wonder if keeping the elastic-agent alive long enough to complete enrollment should be implemented with a startup probe instead since readiness and liveness probes will not start until the startupProbe succeeds. Maybe @pkoutsovasilis can weigh in on that as well

@pchila This is meant to not fail for agentless. We do not want the container to stop trying to enroll; we handle that with FLEET_ENROLL_TIMEOUT=-1. We want the pod to always be considered in good health: if it's trying to enroll, then it's good. The default is not FLEET_ENROLL_TIMEOUT=-1, so for normal use cases the container will fail to enroll and exit non-zero, and that will be reported as a crash of the pod on the Kubernetes side.

pchila · 2025-08-28T12:31:54Z

> @pchila This is meant to not fail for agentless. We do not want the container to stop trying to enroll; we handle that with FLEET_ENROLL_TIMEOUT=-1. We want the pod to always be considered in good health: if it's trying to enroll, then it's good. The default is not FLEET_ENROLL_TIMEOUT=-1, so for normal use cases the container will fail to enroll and exit non-zero, and that will be reported as a crash of the pod on the Kubernetes side.

The startup probe is meant to signal k8s when a container can be considered "started" so that liveness and readiness checks can take over. If elastic-agent is still enrolling and would normally fail the liveness checks, using the startup probe can be useful to give the agent container the time it needs without changing the liveness and readiness checks.

blakerouse · 2025-08-28T17:57:23Z

@pchila I assume you are saying that using a startup probe could remove the need to have the health check endpoints available during enrollment. That still will not work, because the startup probe has a failureThreshold, so at some point it will still mark the container as failed. We want this to run indefinitely: if it's trying to enroll, then it is working.

internal/pkg/agent/application/monitoring/readiness.go

pchila · 2025-08-29T08:02:41Z

> @pchila I assume you are saying that using a startup probe could remove the need to have the health check endpoints available during enrollment. That still will not work, because the startup probe has a failureThreshold, so at some point it will still mark the container as failed. We want this to run indefinitely: if it's trying to enroll, then it is working.

I would think that an agent container that didn't manage to enroll after a given period of time (let's say 5 or 10 minutes, for example) is not starting correctly and should be terminated (and restarted according to its restartPolicy).
Having an agent that keeps running and trying to enroll indefinitely while reporting itself alive and ready doesn't feel correct, but I may be misunderstanding the use case.

If you already evaluated the startup probe and discarded it because there's no upper bound to the time it takes an agent to enroll, I assume that presenting the elastic-agent as alive and ready, even if it's not able to perform any work until enrollment happens, is a deliberate design choice.

blakerouse · 2025-08-29T12:19:08Z

> > @pchila I assume you are saying that using a startup probe could remove the need to have the health check endpoints available during enrollment. That still will not work, because the startup probe has a failureThreshold, so at some point it will still mark the container as failed. We want this to run indefinitely: if it's trying to enroll, then it is working.
>
> I would think that an agent container that didn't manage to enroll after a given period of time (let's say 5 or 10 minutes, for example) is not starting correctly and should be terminated (and restarted according to its restartPolicy). Having an agent that keeps running and trying to enroll indefinitely while reporting itself alive and ready doesn't feel correct, but I may be misunderstanding the use case.
>
> If you already evaluated the startup probe and discarded it because there's no upper bound to the time it takes an agent to enroll, I assume that presenting the elastic-agent as alive and ready, even if it's not able to perform any work until enrollment happens, is a deliberate design choice.

Yes, this is a deliberate design choice. As the PR description explains, a normal container will still fail with enrollment because enrollment has a timeout. That timeout is configurable, so users can adjust it to what they find appropriate. In this case the timeout is -1, which means run indefinitely.

blakerouse · 2025-08-29T12:21:28Z

@ycombinator I applied your suggestion, but that dismissed your review. If you can look again that would be great.

elastic-sonarqube · 2025-08-29T13:53:53Z

Quality Gate passed

Issues
3 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
43.6% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

github-actions · 2025-08-29T17:28:42Z

@Mergifyio backport 8.17 8.18 8.19 9.0 9.1

mergify · 2025-08-29T17:28:52Z

backport 8.17 8.18 8.19 9.0 9.1

✅ Backports have been created

#9646 [8.17] (backport #9612) Add /readiness and /liveness when enrolling with the container has been created for branch 8.17 but encountered conflicts
#9647 [8.18] (backport #9612) Add /readiness and /liveness when enrolling with the container has been created for branch 8.18 but encountered conflicts
#9648 [8.19] (backport #9612) Add /readiness and /liveness when enrolling with the container has been created for branch 8.19 but encountered conflicts
#9649 [9.0] (backport #9612) Add /readiness and /liveness when enrolling with the container has been created for branch 9.0 but encountered conflicts
#9650 [9.1] (backport #9612) Add /readiness and /liveness when enrolling with the container has been created for branch 9.1 but encountered conflicts

@ycombinator

* Enable health checking pre-enroll in container start-up path. Add readiness endpoint.
* Fix func signature.
* Fix tests.
* Add changelog entry.
* Fix imports.
* Apply suggestion from @ycombinator

Co-authored-by: Shaunak Kashyap <[email protected]>

---------

Co-authored-by: Shaunak Kashyap <[email protected]>
(cherry picked from commit c028f68)

# Conflicts:
#   internal/pkg/agent/application/monitoring/process.go
#   internal/pkg/agent/application/monitoring/v1_monitor.go
#   internal/pkg/agent/cmd/container.go

@ycombinator

* Enable health checking pre-enroll in container start-up path. Add readiness endpoint.
* Fix func signature.
* Fix tests.
* Add changelog entry.
* Fix imports.
* Apply suggestion from @ycombinator

Co-authored-by: Shaunak Kashyap <[email protected]>

---------

Co-authored-by: Shaunak Kashyap <[email protected]>
(cherry picked from commit c028f68)

# Conflicts:
#   internal/pkg/agent/application/monitoring/process.go
#   internal/pkg/agent/application/monitoring/server_test.go
#   internal/pkg/agent/application/monitoring/v1_monitor.go
#   internal/pkg/agent/cmd/container.go

@ycombinator

* Enable health checking pre-enroll in container start-up path. Add readiness endpoint.
* Fix func signature.
* Fix tests.
* Add changelog entry.
* Fix imports.
* Apply suggestion from @ycombinator

Co-authored-by: Shaunak Kashyap <[email protected]>

---------

Co-authored-by: Shaunak Kashyap <[email protected]>
(cherry picked from commit c028f68)

# Conflicts:
#   internal/pkg/agent/application/monitoring/server_test.go
#   internal/pkg/agent/cmd/container.go

blakerouse added 3 commits August 27, 2025 10:47

Enable health checking pre-enroll in container start-up path. Add readiness endpoint.

c16d6a1

Fix func signature.

00a4afa

Fix tests.

2f8b40a

blakerouse requested a review from a team as a code owner August 27, 2025 20:04

blakerouse requested review from ycombinator and pchila August 27, 2025 20:04

mergify bot assigned blakerouse Aug 27, 2025

blakerouse added 2 commits August 27, 2025 16:05

Add changelog entry.

043e710

Fix imports.

726fb96

blakerouse added the Team:Elastic-Agent-Control-Plane and backport-active-all labels Aug 28, 2025

pchila reviewed Aug 28, 2025

internal/pkg/agent/application/monitoring/liveness.go (resolved)

internal/pkg/agent/application/monitoring/server_test.go (resolved)

ycombinator reviewed Aug 28, 2025

internal/pkg/agent/application/monitoring/readiness.go (outdated, resolved)

ycombinator previously approved these changes Aug 28, 2025

Apply suggestion from @ycombinator

a707872

Co-authored-by: Shaunak Kashyap <[email protected]>

blakerouse dismissed ycombinator’s stale review via a707872 August 29, 2025 12:17

Merge branch 'main' into healthcheck-before-enroll

d75d800

ycombinator approved these changes Aug 29, 2025

blakerouse merged commit c028f68 into elastic:main Aug 29, 2025
19 checks passed

blakerouse deleted the healthcheck-before-enroll branch August 29, 2025 17:28

This was referenced Aug 29, 2025

[8.17] (backport #9612) Add /readiness and /liveness when enrolling with the container #9646

Closed

[8.18] (backport #9612) Add /readiness and /liveness when enrolling with the container #9647

Open

mergify bot mentioned this pull request Aug 29, 2025

[8.19] (backport #9612) Add /readiness and /liveness when enrolling with the container #9648

Open

5 tasks

This was referenced Aug 29, 2025

[9.0] (backport #9612) Add /readiness and /liveness when enrolling with the container #9649

Open

[9.1] (backport #9612) Add /readiness and /liveness when enrolling with the container #9650

Open
