fix: zombie processes during restart #10650
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
LGTM, some relatively minor nitpicks.
small nits, but it looks good, will approve after green CI
LGTM, one nit about a comment that isn't blocking.
💛 Build succeeded, but was flaky
* fix: zombie processes during restart by extending shutdown timeout to 35s
* fix: linter QF1003 could use tagged switch on state
* fix: linter QF1012
* doc: add changelog
* doc: reword test code comments
* fix: make otel manager process stop timeout way shorter
* doc: add more documentation
* doc: remove changelog fragment
* doc: reword managerShutdownTimeout comment
(cherry picked from commit 9c001b0)
* fix: zombie processes during restart by extending shutdown timeout to 35s
* fix: linter QF1003 could use tagged switch on state
* fix: linter QF1012
* doc: add changelog
* doc: reword test code comments
* fix: make otel manager process stop timeout way shorter
* doc: add more documentation
* doc: remove changelog fragment
* doc: reword managerShutdownTimeout comment
(cherry picked from commit 9c001b0)
# Conflicts:
# internal/pkg/agent/application/application.go
# internal/pkg/otel/manager/execution.go
# internal/pkg/otel/manager/execution_subprocess.go
# internal/pkg/otel/manager/manager.go
# testing/integration/ess/metrics_monitoring_test.go
* fix: zombie processes during restart by extending shutdown timeout to 35s
* fix: linter QF1003 could use tagged switch on state
* fix: linter QF1012
* doc: add changelog
* doc: reword test code comments
* fix: make otel manager process stop timeout way shorter
* doc: add more documentation
* doc: remove changelog fragment
* doc: reword managerShutdownTimeout comment
(cherry picked from commit 9c001b0)
Co-authored-by: Panos Koutsovasilis <[email protected]>
What does this PR do?
This PR fixes zombie/defunct processes that are left behind when Elastic Agent re-executes itself during restart. The fix involves:
- ... `Wait()` is called
Why is it important?
Root Cause
When the Elastic Agent re-executes itself during restart, the following sequence occurs:
- ... `execve` itself
- ... `execve`, all threads other than the calling thread are destroyed
- ... `PDeathSig` mechanism we enable for subprocesses
Why This Affects EDOT More Than Beats
Beats subprocesses typically terminate almost immediately (within the 5-second window), so they don't become zombies. However, the EDOT collector's shutdown time seemed to be affected by:
Impact
This fix ensures proper process cleanup regardless of shutdown duration while maintaining graceful termination when possible.
Checklist
- ... `./changelog/fragments` using the changelog tool
Disruptive User Impact
Users may notice:
How to test this PR locally
Run `TestMetricsMonitoringCorrectBinaries` integration test
Related issues