Conversation
… check-in in parallel to its initialisation (#49796) The Start method from BeatV2Manager is split into two methods: - PreInit: responsible for starting the Elastic Agent client and start replying to check-ins. - PostInit: responsible for setting the Beats status to 'Running' and start executing Unit changes. A new method, WaitForStop is also added. It stops the BeatV2Manager and waits until all goroutines have returned. Currently it is only used in tests that use `testing.T` as the logger output to ensure no panics happen because the logger was used after the test ended. Multiple lint warnings are fixed GenAI-Assisted: Yes Human-Reviewed: Yes Tool: Cursor-CLI, Model: GPT-5.3 Codex Extra High Fast (cherry picked from commit 034546f) # Conflicts: # x-pack/libbeat/management/managerV2.go # x-pack/osquerybeat/beater/osquerybeat_status_test.go
|
Cherry-pick of 034546f has failed: To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally |
🤖 GitHub commentsJust comment with:
|
|
Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform) |
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
TL;DRBuildkite is failing for two verified reasons: unresolved merge-conflict markers in Remediation
Investigation detailsRoot Cause
Evidence
Verification
Follow-up
Note 🔒 Integrity filtering filtered 1 itemIntegrity filtering activated and filtered the following item during workflow execution.
What is this? | From workflow: PR Buildkite Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
As described in #49388, `BeatV2Manager` can miss the shutdown signal because its `Stop` method notifies the manager by sending to its signal channel `stopChan` rather than closing it, but there are two goroutines that both listen on that channel. This PR changes `Stop` to close the channel rather than just sending. It also removes the second `stopChan` listener in `watchErrChan`, since the main goroutine already calls the context canceler for that helper when `stopChan` unblocks (this isn't strictly necessary but it will keep error states visible for a little longer during shutdown, and is what was previously happening in the "good" path where the main worker received the stop signal first). (cherry picked from commit d39cb49)
TL;DR
Remediation
Investigation detailsRoot CauseThe failing step is Evidence
Validation
Follow-up
Note 🔒 Integrity filtering filtered 14 itemsIntegrity filtering activated and filtered the following items during workflow execution.
What is this? | From workflow: PR Actions Detective Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not. |
There was a problem hiding this comment.
Pull request overview
Backport that refactors the Elastic Agent V2 management startup sequence so Beats can begin responding to Agent check-ins earlier (during potentially slow initialization), and adds a shutdown-wait helper primarily for safer tests.
Changes:
- Split
BeatV2Manager.Start()intoPreInit()(start client + check-ins) andPostInit()(mark running + enable applying unit changes); keepStart()as a deprecated compatibility wrapper. - Add
WaitForStop(timeout)to the management interface and implementations; update tests to use it to avoid goroutine/logging issues during teardown. - Move Filebeat’s manager startup earlier in
Run()(callPreInit()before expensive initialization;PostInit()once ready).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
x-pack/libbeat/management/managerV2.go |
Implements PreInit/PostInit, deprecates Start, introduces WaitForStop, and gates applying unit changes until the beat is ready. |
libbeat/management/management.go |
Extends the Manager interface with PreInit, PostInit, and WaitForStop. |
filebeat/beater/filebeat.go |
Calls Manager.PreInit() earlier and Manager.PostInit() once initialization completes. |
x-pack/libbeat/management/managerV2_test.go |
Adds coverage for buffered unit application after PostInit; updates shutdown to wait for stop. |
x-pack/filebeat/tests/integration/*.go |
Adjusts integration expectations around unit state/streams and adds minor lint suppressions. |
x-pack/otel/otelmanager/manager.go |
Updates Otel manager to satisfy the expanded Manager interface. |
libbeat/beat/beat_test.go, libbeat/cmd/instance/beat_test.go |
Updates mock managers for the new interface methods. |
changelog/fragments/*.yaml |
Adds changelog entries for the Filebeat crash-loop fix and manager shutdown behavior. |
Comments suppressed due to low confidence (1)
x-pack/libbeat/management/managerV2.go:305
- PreInit() drops the underlying error from cm.client.Start(ctx): it returns a generic "error starting connection to client" without including the original error, which makes troubleshooting failures harder. Wrap/return the underlying error (e.g., using %w) so callers get the cause in logs and test output.
cm.logger.Debug("Manager starting")
ctx := context.Background()
err := cm.client.Start(ctx)
if err != nil {
return fmt.Errorf("error starting connection to client")
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This backport also includes the commit from another backport (#49448). Those two backports are inter-dependent and need to be merged together.
Proposed commit message
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration filesstresstest.shscript to run them under stress conditions and race detector to verify their stability../changelog/fragmentsusing the changelog tool.## Disruptive User Impact## Author's ChecklistHow to test this PR locally
Because this PR is a implementation detail change, there is no directly observable behaviour change. The best way to test is to run the new and existing tests.
Run the new test
Run the tests from the modified packages
Related issues
Manager.Start()before state store loading to prevent check-in timeout livelock #49512## Use cases## Screenshots## LogsThis is an automatic backport of pull request #49796 done by [Mergify](https://mergify.com).