-
Notifications
You must be signed in to change notification settings - Fork 188
Add /readiness and /liveness when enrolling with the container #9612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add /readiness and /liveness when enrolling with the container #9612
Conversation
This pull request does not have a backport label. Could you fix it @blakerouse? 🙏
|
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
💚 Build Succeeded
History
cc @blakerouse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of comments, mostly for clarification.
Reading the PR description, I cannot help but wonder if keeping the elastic-agent alive long enough to complete enrollment should be implemented with a startup probe
instead since readiness and liveness probes will not start until the startupProbe succeeds.
Maybe @pkoutsovasilis can weigh in on that as well
@pchila This is meant to not fail for agentless. We do not want the container to stop trying to enroll, we handle that with |
The startup probe is meant to signal k8s when a container can be considered "started" so that liveness and readiness checks can take over. If elastic-agent is still enrolling and would normally fail the liveness checks, using the startup probe can be useful to give agent container the time it needs without changing the liveness and readiness checks. |
@pchila I assume you are saying that using a startup probe could remove the need to have the health check endpoints even available during enrollment. That still will not work, because there is a |
I would think that an agent container that didn't managed to enroll after a given period of time (let's say 5 or 10 mins for example) is not starting correctly and it should be terminated (and restarted according to its If you already evaluated the startup probe and discarded it because there's no upper bound to the time it takes to an agent to enroll, I assume that presenting the elastic-agent alive and ready even if it's not able to perform any work until enrollment happens is a deliberate design choice. |
Co-authored-by: Shaunak Kashyap <[email protected]>
Yes this is a deliberate design choice. As the PR description explains a normal container will still fail with enrollment because enrollment has a timeout. That timeout is configurable, so users can adjust to what they find appropriate. In this case the timeout is -1, run indefinately. |
@ycombinator I applied your suggestion, but that dismissed your review. If you can look again that would be great. |
|
@Mergifyio backport 8.17 8.18 8.19 9.0 9.1 |
✅ Backports have been created
|
* Enable health checking pre-enroll in container start-up path. Add readiness endpoint. * Fix func signature. * Fix tests. * Add changelog entry. * Fix imports. * Apply suggestion from @ycombinator Co-authored-by: Shaunak Kashyap <[email protected]> --------- Co-authored-by: Shaunak Kashyap <[email protected]> (cherry picked from commit c028f68) # Conflicts: # internal/pkg/agent/application/monitoring/process.go # internal/pkg/agent/application/monitoring/v1_monitor.go # internal/pkg/agent/cmd/container.go
* Enable health checking pre-enroll in container start-up path. Add readiness endpoint. * Fix func signature. * Fix tests. * Add changelog entry. * Fix imports. * Apply suggestion from @ycombinator Co-authored-by: Shaunak Kashyap <[email protected]> --------- Co-authored-by: Shaunak Kashyap <[email protected]> (cherry picked from commit c028f68) # Conflicts: # internal/pkg/agent/application/monitoring/process.go # internal/pkg/agent/application/monitoring/server_test.go # internal/pkg/agent/application/monitoring/v1_monitor.go # internal/pkg/agent/cmd/container.go
* Enable health checking pre-enroll in container start-up path. Add readiness endpoint. * Fix func signature. * Fix tests. * Add changelog entry. * Fix imports. * Apply suggestion from @ycombinator Co-authored-by: Shaunak Kashyap <[email protected]> --------- Co-authored-by: Shaunak Kashyap <[email protected]> (cherry picked from commit c028f68) # Conflicts: # internal/pkg/agent/application/monitoring/server_test.go # internal/pkg/agent/cmd/container.go
* Enable health checking pre-enroll in container start-up path. Add readiness endpoint. * Fix func signature. * Fix tests. * Add changelog entry. * Fix imports. * Apply suggestion from @ycombinator Co-authored-by: Shaunak Kashyap <[email protected]> --------- Co-authored-by: Shaunak Kashyap <[email protected]> (cherry picked from commit c028f68) # Conflicts: # internal/pkg/agent/application/monitoring/process.go # internal/pkg/agent/application/monitoring/server_test.go # internal/pkg/agent/application/monitoring/v1_monitor.go # internal/pkg/agent/cmd/container.go
* Enable health checking pre-enroll in container start-up path. Add readiness endpoint. * Fix func signature. * Fix tests. * Add changelog entry. * Fix imports. * Apply suggestion from @ycombinator Co-authored-by: Shaunak Kashyap <[email protected]> --------- Co-authored-by: Shaunak Kashyap <[email protected]> (cherry picked from commit c028f68) # Conflicts: # internal/pkg/agent/application/monitoring/process.go # internal/pkg/agent/application/monitoring/v1_monitor.go # internal/pkg/agent/cmd/container.go
* Enable health checking pre-enroll in container start-up path. Add readiness endpoint. * Fix func signature. * Fix tests. * Add changelog entry. * Fix imports. * Apply suggestion from @ycombinator Co-authored-by: Shaunak Kashyap <[email protected]> --------- Co-authored-by: Shaunak Kashyap <[email protected]> (cherry picked from commit c028f68) # Conflicts: # internal/pkg/agent/application/monitoring/process.go # internal/pkg/agent/application/monitoring/v1_monitor.go # internal/pkg/agent/cmd/container.go
What does this PR do?
Starts the monitoring endpoint when Elastic Agent is running as a container and has enrollment enabled. This allows the healthchecks in Kubernetes to succeed even when it is enrolling into Fleet. If enrollment to Fleet takes longer than the k8s healthchecks then it can result in the pod not being reported as healthy, which can cause it to be killed and then it never enrolls.
This also adds a simple
/readiness
endpoint that provides a different between ready and alive. There is a difference, and this provides the difference between the two.Why is it important?
This ensures that healthchecks do not prevent enrollment from actually working. Enrollment will still fail and the container will restart, giving the same behavior.
Checklist
[ ] I have made corresponding changes to the documentation[ ] I have made corresponding change to the default configuration files./changelog/fragments
using the changelog tool[ ] I have added an integration test or an E2E testDisruptive User Impact
None
How to test this PR locally
Related issues