
Conversation


@mergify mergify bot commented Oct 24, 2025

What does this PR do?

This PR fixes zombie/defunct processes that are left behind when Elastic Agent re-executes itself during restart. The fix involves:

  1. Decreasing the EDOT collector shutdown timeout from 30 seconds to 3 seconds so it fits within the coordinator's default 5-second shutdown timeout
    • Adding a safety net that waits an additional second after killing a process to ensure Wait() is called (see the sketch after this list)
  2. Improving graceful shutdown handling in the EDOT collector subprocess manager to ensure proper process cleanup
  3. Adding debug logging throughout the shutdown process to better trace subprocess termination
  4. Adding an integration test that verifies no zombie processes are left behind after agent restart
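
A minimal sketch of the stop-then-reap safety net described in point 1; the function name, signal choice, and timeouts here are illustrative assumptions, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// stopAndReap asks a subprocess to stop and, if it does not exit within
// shutdownTimeout, kills it. Both paths end in Wait(), so the parent always
// reaps the child's exit status and no defunct/zombie entry is left behind.
func stopAndReap(cmd *exec.Cmd, shutdownTimeout time.Duration) error {
	// Request graceful shutdown (the real manager may use a different
	// mechanism, e.g. closing stdin or a control connection).
	_ = cmd.Process.Signal(syscall.SIGTERM)

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err // exited in time; Wait already reaped it
	case <-time.After(shutdownTimeout):
		// Graceful shutdown took too long: force-kill, then give Wait one
		// extra second to collect the exit status (the "safety net").
		_ = cmd.Process.Kill()
		select {
		case err := <-done:
			return err
		case <-time.After(time.Second):
			return errors.New("subprocess killed but exit status not collected")
		}
	}
}

func main() {
	cmd := exec.Command("sleep", "60")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	fmt.Println("stop result:", stopAndReap(cmd, 3*time.Second))
}
```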

Why is it important?

Root Cause

When the Elastic Agent re-executes itself during restart, the following sequence occurs:

  1. If a subprocess (particularly the EDOT collector or command components) takes longer than the coordinator's 5-second shutdown timeout, the agent proceeds to execve itself
  2. During execve, all threads other than the calling thread are destroyed
  3. This triggers the PDeathSig mechanism we enable for subprocesses (sketched after this list)
  4. However, the parent process (the pre-execve Elastic Agent) never reaps (waits for) the exit status of the spawned subprocesses
  5. Result: these subprocesses end up as defunct/zombie processes
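
For context, a minimal Linux-only sketch of how PDeathSig is typically enabled through exec.Cmd's SysProcAttr; the package, helper name, and binary path are illustrative assumptions, not the agent's exact code:

```go
//go:build linux

package subproc

import (
	"os/exec"
	"syscall"
)

// startWithPdeathsig starts a subprocess that receives SIGKILL from the kernel
// when the thread that spawned it dies, which is what happens to every thread
// except the calling one during execve. Unless the (pre-execve) parent also
// calls cmd.Wait(), the killed child's exit status is never collected and it
// remains in the process table as a defunct/zombie entry.
func startWithPdeathsig(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...) // e.g. the EDOT collector binary
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Pdeathsig: syscall.SIGKILL, // Linux-only field
	}
	return cmd, cmd.Start()
}
```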

Why This Affects EDOT More Than Beats

Beats subprocesses typically terminate almost immediately (within the 5-second window), so they don't become zombies. However, the EDOT collector's shutdown time seemed to be affected by:

  • Number of pipeline workers
  • Elasticsearch exporter configuration

Impact

  • Resource leaks: Zombie processes consume PIDs and kernel memory
  • Operational issues: Accumulation of zombies over multiple restarts
  • Config update delays: the EDOT subprocess restarts on every config change, and 20+ second shutdowns add significant latency

This fix ensures proper process cleanup regardless of shutdown duration while maintaining graceful termination when possible.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

Users may notice:

  • Agent restarts may take longer (up to 35 seconds instead of 5 seconds in the worst case)
  • However, this ensures clean shutdowns and prevents zombie accumulation
  • The tradeoff is worthwhile as zombie processes can cause operational issues over time

How to test this PR locally

Run the TestMetricsMonitoringCorrectBinaries integration test.

Related issues

* fix: zombie processes during restart by extending shutdown timeout to 35s

* fix: linter QF1003 could use tagged switch on state

* fix: linter QF1012

* doc: add changelog

* doc: reword test code comments

* fix: make otel manager process stop timeout way shorter

* doc: add more documentation

* doc: remove changelog fragment

* doc: reword managerShutdownTimeout comment

(cherry picked from commit 9c001b0)
@mergify mergify bot added the backport label Oct 24, 2025
@mergify mergify bot requested a review from a team as a code owner October 24, 2025 12:27
@mergify mergify bot requested review from blakerouse and pkoutsovasilis and removed request for a team October 24, 2025 12:27
@github-actions github-actions bot added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team skip-changelog labels Oct 24, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis enabled auto-merge (squash) October 24, 2025 12:36

elasticmachine commented Oct 24, 2025

💔 Build Failed


cc @pkoutsovasilis

@pkoutsovasilis pkoutsovasilis merged commit 2d30374 into 8.19 Oct 24, 2025
19 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/8.19/pr-10650 branch October 24, 2025 17:37