Conversation
… check-in in parallel to its initialisation (#49796) The Start method from BeatV2Manager is split into two methods: - PreInit: responsible for starting the Elastic Agent client and start replying to check-ins. - PostInit: responsible for setting the Beats status to 'Running' and start executing Unit changes. A new method, WaitForStop is also added. It stops the BeatV2Manager and waits until all goroutines have returned. Currently it is only used in tests that use `testing.T` as the logger output to ensure no panics happen because the logger was used after the test ended. Multiple lint warnings are fixed GenAI-Assisted: Yes Human-Reviewed: Yes Tool: Cursor-CLI, Model: GPT-5.3 Codex Extra High Fast (cherry picked from commit 034546f) # Conflicts: # x-pack/libbeat/management/managerV2.go # x-pack/osquerybeat/beater/osquerybeat_status_test.go
|
Cherry-pick of 034546f has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
🤖 GitHub commentsJust comment with:
|
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
|
Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform) |
As described in #49388, `BeatV2Manager` can miss the shutdown signal because its `Stop` method notifies the manager by sending to its signal channel `stopChan` rather than closing it, but there are two goroutines that both listen on that channel. This PR changes `Stop` to close the channel rather than just sending. It also removes the second `stopChan` listener in `watchErrChan`, since the main goroutine already calls the context canceler for that helper when `stopChan` unblocks (this isn't strictly necessary but it will keep error states visible for a little longer during shutdown, and is what was previously happening in the "good" path where the main worker received the stop signal first). (cherry picked from commit d39cb49)
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
Backports the ManagerV2 startup refactor to allow Beats to respond to Elastic Agent check-ins in parallel with expensive initialization (e.g., Filebeat state store load), and adds a shutdown-wait helper primarily for tests.
Changes:
- Split
BeatV2Manager.Start()intoPreInit()(start client/check-ins) andPostInit()(mark running + allow applying unit changes), keepingStart()as a backwards-compatible wrapper. - Add
WaitForStop(timeout)to stop the manager and wait for its goroutines to exit; update tests to use it to avoid post-test logger panics. - Move Filebeat management startup earlier (
PreInit()before state store load;PostInit()once fully initialized), and adjust integration/unit tests + changelog fragments.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
x-pack/otel/otelmanager/manager.go |
Implements new Manager interface methods (PreInit/PostInit/WaitForStop) for the Otel manager stub. |
x-pack/libbeat/management/managerV2.go |
Core refactor: introduces PreInit/PostInit, WaitForStop, stop channel semantics, and buffering behavior before beat readiness. |
x-pack/libbeat/management/managerV2_test.go |
Updates tests to use WaitForStop and adds coverage for buffered units being applied after PostInit. |
x-pack/filebeat/tests/integration/status_reporter_test.go |
Minor test fix (stream selection + gosec suppression comment). |
x-pack/filebeat/tests/integration/managerV2_test.go |
Aligns expectations with updated startup/status timing. |
libbeat/management/management.go |
Extends Manager interface with PreInit/PostInit/WaitForStop; updates FallbackManager. |
libbeat/cmd/instance/beat_test.go |
Updates mock manager to satisfy new interface. |
libbeat/beat/beat_test.go |
Updates test manager to satisfy new interface. |
filebeat/beater/filebeat.go |
Starts manager check-in loop earlier (PreInit) and calls PostInit once initialization completes. |
changelog/fragments/1774984035-move-manager-start.yaml |
Changelog entry for Filebeat crash-loop fix under Elastic Agent. |
changelog/fragments/1773263871-beats-manager-shutdown.yaml |
Changelog entry for manager shutdown race fix backport dependency. |
Comments suppressed due to low confidence (1)
x-pack/libbeat/management/managerV2.go:304
PreInitdrops the underlying error fromcm.client.Start(ctx)(it returns a constant message). This makes troubleshooting connection/startup failures much harder; wrap the original error (e.g., using%w) so the caller gets the real cause.
ctx := context.Background()
err := cm.client.Start(ctx)
if err != nil {
return fmt.Errorf("error starting connection to client")
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func (cm *BeatV2Manager) watchErrChan(ctx context.Context) { | ||
| for { | ||
| select { | ||
| case <-ctx.Done(): | ||
| return | ||
| case err := <-cm.client.Errors(): | ||
| // Don't print the context cancelled errors that happen normally during shutdown, restart, etc | ||
| if !errors.Is(err, context.Canceled) { | ||
| cm.logger.Errorf("elastic-agent-client error: %s", err) | ||
| } | ||
| case <-cm.stopChan: | ||
| return | ||
| } | ||
| } |
There was a problem hiding this comment.
watchErrChan now only exits on ctx.Done(), but Stop() relies on unitListen calling stopBeat() to cancel errCanceller. If unitListen exits via the signal path (it returns without calling stopBeat()), the context is never canceled and this goroutine can run forever (and WaitForStop can hang/time out). Consider ensuring Stop()/shutdown always cancels errCanceller and stops the client even when stopFunc must not be invoked.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
TL;DRBuildkite failed in Remediation
Investigation detailsRoot Cause
The failing commit updated Running Evidence
Verification
Follow-upAfter committing the formatting fix, if CI still reports dirty files, run root Note 🔒 Integrity filtering filtered 1 itemIntegrity filtering activated and filtered the following item during workflow execution.
What is this? | From workflow: PR Buildkite Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
This backport also includes the commit from another backport (#49449). Those two backports are inter-dependent and need to be merged together.
Proposed commit message
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesstresstest.shscript to run them under stress conditions and race detector to verify their stability../changelog/fragmentsusing the changelog tool.## Disruptive User Impact## Author's ChecklistHow to test this PR locally
Because this PR is a implementation detail change, there is no directly observable behaviour change. The best way to test is to run the new and existing tests.
Run the new test
Run the tests from the modified packages
Related issues
Manager.Start()before state store loading to prevent check-in timeout livelock #49512## Use cases## Screenshots## LogsThis is an automatic backport of pull request #49796 done by [Mergify](https://mergify.com).