Description
- Relates [beats receivers] Defunct elastic-agent otel --supervised process left behind when Elastic Agent re-executes itself #10632
- Relates fix: zombie processes during restart #10650
The coordinator waits up to 5s for the managers to exit:
elastic-agent/internal/pkg/agent/application/coordinator/coordinator.go
Lines 231 to 233 in 1237cc3
    // managerShutdownTimeout is how long the coordinator will wait during shutdown
    // to receive termination states from its managers.
    const managerShutdownTimeout = time.Second * 5
#10650 introduced a new collector stop timeout of 3s to guarantee that the collector sub-process is killed and still waited for, avoiding a defunct zombie process.
Similarly, the default stop timeout for other sub-processes is 30s, and it is by coincidence that existing sub-processes exit faster than this:
elastic-agent/pkg/core/process/config.go
Line 22 in 1237cc3
    StopTimeout: 30 * time.Second,
The amount of time to wait before killing sub-processes needs to be configurable by the user, and in most cases should be set to a longer value such as the existing 30s (which the collector timeout originally was) to allow for graceful shutdown and for final data to be shipped. The collector is currently special-cased to a shorter timeout because, at the time of writing, it has to restart whenever its configuration changes.
However, in certain circumstances waiting this long is inconvenient. For example, when an Elastic Agent is enrolled into Fleet it restarts itself and would incur this shutdown delay. In this case the existing configuration is being discarded, so the agent can restart immediately.
Acceptance Criteria
- The shutdown timeout for all sub-processes needs to be configurable by the user in elastic-agent.yml.
- Cases where graceful shutdown is not necessary should use the lowest timeout possible by default. For example, when enrolling Elastic Agent into a new agent policy.
- The increased shutdown delay must not cause unexpected impact on other parts of the agent that re-execute as part of their implementation, particularly the upgrade process.