@blakerouse blakerouse commented Aug 27, 2025

What does this PR do?

Starts the monitoring endpoint when Elastic Agent is running as a container and has enrollment enabled. This allows the Kubernetes health checks to succeed even while the agent is still enrolling into Fleet. If enrollment takes longer than the health checks allow, the pod is not reported as healthy, which can cause it to be killed before it ever enrolls.

This also adds a simple /readiness endpoint, so that probes can distinguish between the agent being ready and the agent being alive.
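As a generic illustration of the ready-versus-alive split, here is a minimal Go sketch of two probe handlers. It is not the agent's implementation: `healthState`, `newHealthMux`, and `probe` are hypothetical names, and the condition that gates readiness (a simple flag) is an assumption, since the PR description does not say what the agent's /readiness handler checks.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// healthState is a hypothetical holder for the readiness signal used in
// this sketch; it is not a type from the Elastic Agent code base.
type healthState struct {
	ready atomic.Bool
}

// newHealthMux registers the two probe endpoints with different semantics.
func newHealthMux(s *healthState) *http.ServeMux {
	mux := http.NewServeMux()

	// Liveness: answers 200 whenever the process can serve HTTP at all.
	mux.HandleFunc("/liveness", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: answers 200 only once the ready flag has been set
	// (assumed gate, chosen only to show the distinction).
	mux.HandleFunc("/readiness", func(w http.ResponseWriter, _ *http.Request) {
		if !s.ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-memory GET against the handler and returns the status code.
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	s := &healthState{}
	mux := newHealthMux(s)

	fmt.Println(probe(mux, "/liveness"), probe(mux, "/readiness"))
	s.ready.Store(true)
	fmt.Println(probe(mux, "/liveness"), probe(mux, "/readiness"))
}
```

In a Kubernetes pod spec, a liveness probe would point at the first path and a readiness probe at the second, so a failing readiness check removes the pod from service without restarting it.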

Why is it important?

This ensures that health checks do not prevent enrollment from completing. If enrollment itself fails, the container still exits and restarts, so the existing failure behavior is unchanged.

Checklist

  • [x] I have read and understood the pull request guidelines of this project.
  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in ./changelog/fragments using the changelog tool
  • [ ] I have added an integration test or an E2E test

Disruptive User Impact

None

How to test this PR locally

$ docker run -it --rm -p 5066:5066 -e FLEET_ENROLL=1 -e FLEET_URL=https://invalid-url:443 -e FLEET_ENROLLMENT_TOKEN=invalid-token ${build_image}
$ curl -v http://localhost:5066/readiness
$ curl -v http://localhost:5066/liveness

Related issues

@blakerouse blakerouse requested a review from a team as a code owner August 27, 2025 20:04
Contributor

mergify bot commented Aug 27, 2025

This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label that automatically backports to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@blakerouse blakerouse added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-active-all Automated backport with mergify to all the active branches labels Aug 28, 2025
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@elasticmachine
Collaborator

💚 Build Succeeded

History

cc @blakerouse

Member

@pchila pchila left a comment

Left a couple of comments, mostly for clarification.
Reading the PR description, I cannot help but wonder if keeping the elastic-agent alive long enough to complete enrollment should be implemented with a startup probe instead since readiness and liveness probes will not start until the startupProbe succeeds.
Maybe @pkoutsovasilis can weigh in on that as well

@blakerouse
Contributor Author

> Left a couple of comments, mostly for clarification. Reading the PR description, I cannot help but wonder if keeping the elastic-agent alive long enough to complete enrollment should be implemented with a startup probe instead since readiness and liveness probes will not start until the startupProbe succeeds. Maybe @pkoutsovasilis can weigh in on that as well

@pchila This is meant to not fail for agentless. We do not want the container to stop trying to enroll; we handle that with FLEET_ENROLL_TIMEOUT=-1. We want the pod to always be considered in good health: if it's trying to enroll, then it's good. The default is not FLEET_ENROLL_TIMEOUT=-1, so for normal use cases a container that fails to enroll will exit non-zero, and that will be reported as a crash of the pod on the Kubernetes side.

@pchila
Member

pchila commented Aug 28, 2025

> @pchila This is meant to not fail for agentless. We do not want the container to stop trying to enroll; we handle that with FLEET_ENROLL_TIMEOUT=-1. We want the pod to always be considered in good health: if it's trying to enroll, then it's good. The default is not FLEET_ENROLL_TIMEOUT=-1, so for normal use cases a container that fails to enroll will exit non-zero, and that will be reported as a crash of the pod on the Kubernetes side.

The startup probe is meant to signal to k8s when a container can be considered "started" so that liveness and readiness checks can take over. If elastic-agent is still enrolling and would normally fail the liveness checks, a startup probe can give the agent container the time it needs without changing the liveness and readiness checks.

@blakerouse
Contributor Author

@pchila I assume you are saying that using a startup probe could remove the need to have the health check endpoints available at all during enrollment. That still will not work, because there is a failureThreshold on the startup probe, so at some point it will still mark the container as failed. We want this to run indefinitely; if it's trying to enroll, then it is working.
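The bound that failureThreshold places on a startup probe can be worked out directly: the container has roughly initialDelaySeconds + failureThreshold × periodSeconds before the kubelet gives up on it. A minimal Go sketch of that arithmetic follows; the `startupProbe` struct and the values used are illustrative assumptions, not settings from any Elastic manifest.

```go
package main

import (
	"fmt"
	"time"
)

// startupProbe holds the three Kubernetes probe fields that bound startup
// time. The struct itself is hypothetical; only the field names mirror the
// Kubernetes probe settings.
type startupProbe struct {
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// maxStartupWindow approximates how long a container may keep failing its
// startup probe before it is considered failed and restarted.
func maxStartupWindow(p startupProbe) time.Duration {
	secs := p.InitialDelaySeconds + p.PeriodSeconds*p.FailureThreshold
	return time.Duration(secs) * time.Second
}

func main() {
	// Example values only: probe every 10s, tolerate 30 failures.
	p := startupProbe{PeriodSeconds: 10, FailureThreshold: 30}
	fmt.Println(maxStartupWindow(p)) // 5m0s
}
```

However large the threshold is set, the window stays finite, which is the limitation described above for an enrollment that is meant to retry without an upper bound.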

ycombinator previously approved these changes Aug 28, 2025
@pchila
Member

pchila commented Aug 29, 2025

> @pchila I assume you are saying that using a startup probe could remove the need to have the health check endpoints available at all during enrollment. That still will not work, because there is a failureThreshold on the startup probe, so at some point it will still mark the container as failed. We want this to run indefinitely; if it's trying to enroll, then it is working.

I would think that an agent container that didn't manage to enroll after a given period of time (let's say 5 or 10 minutes, for example) is not starting correctly and should be terminated (and restarted according to its restartPolicy).
Having an agent that keeps running and trying to enroll indefinitely while reporting itself alive and ready doesn't feel correct, but I may be misunderstanding the use case.

If you already evaluated the startup probe and discarded it because there's no upper bound on the time it takes an agent to enroll, I assume that presenting the elastic-agent as alive and ready, even though it cannot perform any work until enrollment happens, is a deliberate design choice.

Co-authored-by: Shaunak Kashyap <[email protected]>
@blakerouse
Contributor Author

> > @pchila I assume you are saying that using a startup probe could remove the need to have the health check endpoints available at all during enrollment. That still will not work, because there is a failureThreshold on the startup probe, so at some point it will still mark the container as failed. We want this to run indefinitely; if it's trying to enroll, then it is working.

> I would think that an agent container that didn't manage to enroll after a given period of time (let's say 5 or 10 minutes, for example) is not starting correctly and should be terminated (and restarted according to its restartPolicy). Having an agent that keeps running and trying to enroll indefinitely while reporting itself alive and ready doesn't feel correct, but I may be misunderstanding the use case.

> If you already evaluated the startup probe and discarded it because there's no upper bound on the time it takes an agent to enroll, I assume that presenting the elastic-agent as alive and ready, even though it cannot perform any work until enrollment happens, is a deliberate design choice.

Yes, this is a deliberate design choice. As the PR description explains, a normal container will still fail when enrollment does not succeed, because enrollment has a timeout. That timeout is configurable, so users can adjust it to what they find appropriate. In this case the timeout is -1, which means run indefinitely.
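The two timeout behaviors described here can be sketched as a retry loop in Go. This is an assumption-laden illustration, not the agent's enrollment code: `enrollWithTimeout` and `errFleetUnavailable` are hypothetical names, and the only behavior taken from this thread is that a timeout of -1 means "retry indefinitely" while a bounded timeout eventually gives up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errFleetUnavailable is a made-up error standing in for a failed attempt.
var errFleetUnavailable = errors.New("fleet unavailable")

// enrollWithTimeout retries attempt every delay until it succeeds or the
// timeout elapses. A negative timeout means "no deadline" in this sketch.
func enrollWithTimeout(timeout, delay time.Duration, attempt func() error) error {
	ctx := context.Background()
	if timeout >= 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	for {
		err := attempt()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Bounded case: give up so the caller can exit non-zero.
			return fmt.Errorf("enrollment timed out: %w", err)
		case <-time.After(delay):
			// Wait, then try again.
		}
	}
}

func main() {
	calls := 0
	flaky := func() error {
		calls++
		if calls < 3 {
			return errFleetUnavailable
		}
		return nil
	}
	// Negative timeout: keeps retrying until the attempt succeeds.
	fmt.Println(enrollWithTimeout(-1, time.Millisecond, flaky), calls)

	// Bounded timeout: an attempt that never succeeds eventually gives up.
	never := func() error { return errFleetUnavailable }
	fmt.Println(enrollWithTimeout(20*time.Millisecond, time.Millisecond, never))
}
```

With a bounded timeout the error surfaces to the caller, which matches the described case of the container exiting non-zero; with a negative timeout the loop only ends on success, so the health endpoints have to stay reachable for as long as it runs.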

@blakerouse
Contributor Author

@ycombinator I applied your suggestion, but that dismissed your review. If you can look again that would be great.

@blakerouse blakerouse merged commit c028f68 into elastic:main Aug 29, 2025
19 checks passed
@blakerouse blakerouse deleted the healthcheck-before-enroll branch August 29, 2025 17:28
Contributor

@Mergifyio backport 8.17 8.18 8.19 9.0 9.1

Contributor

mergify bot commented Aug 29, 2025

backport 8.17 8.18 8.19 9.0 9.1

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Aug 29, 2025
* Enable health checking pre-enroll in container start-up path. Add readiness endpoint.

* Fix func signature.

* Fix tests.

* Add changelog entry.

* Fix imports.

* Apply suggestion from @ycombinator

Co-authored-by: Shaunak Kashyap <[email protected]>

---------

Co-authored-by: Shaunak Kashyap <[email protected]>
(cherry picked from commit c028f68)

# Conflicts:
#	internal/pkg/agent/application/monitoring/process.go
#	internal/pkg/agent/application/monitoring/v1_monitor.go
#	internal/pkg/agent/cmd/container.go
mergify bot pushed a commit that referenced this pull request Aug 29, 2025
(cherry picked from commit c028f68)

# Conflicts:
#	internal/pkg/agent/application/monitoring/process.go
#	internal/pkg/agent/application/monitoring/server_test.go
#	internal/pkg/agent/application/monitoring/v1_monitor.go
#	internal/pkg/agent/cmd/container.go
mergify bot pushed a commit that referenced this pull request Aug 29, 2025
(cherry picked from commit c028f68)

# Conflicts:
#	internal/pkg/agent/application/monitoring/server_test.go
#	internal/pkg/agent/cmd/container.go
mergify bot pushed a commit that referenced this pull request Aug 29, 2025
(cherry picked from commit c028f68)

# Conflicts:
#	internal/pkg/agent/application/monitoring/process.go
#	internal/pkg/agent/application/monitoring/server_test.go
#	internal/pkg/agent/application/monitoring/v1_monitor.go
#	internal/pkg/agent/cmd/container.go
mergify bot pushed a commit that referenced this pull request Aug 29, 2025
(cherry picked from commit c028f68)

# Conflicts:
#	internal/pkg/agent/application/monitoring/process.go
#	internal/pkg/agent/application/monitoring/v1_monitor.go
#	internal/pkg/agent/cmd/container.go
blakerouse added a commit that referenced this pull request Sep 2, 2025
(cherry picked from commit c028f68)

# Conflicts:
#	internal/pkg/agent/application/monitoring/process.go
#	internal/pkg/agent/application/monitoring/v1_monitor.go
#	internal/pkg/agent/cmd/container.go
Labels
backport-active-all Automated backport with mergify to all the active branches Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Development

Successfully merging this pull request may close these issues.

Liveness endpoint is not present when the container is enrolling
4 participants