[9.2](backport #49796) Split BeatV2Manager Start into two methods, so Beats can reply to check-in in parallel to its initialisation #49849

Merged
belimawr merged 5 commits into 9.2 from mergify/bp/9.2/pr-49796 on Apr 1, 2026

Conversation

@mergify mergify bot commented Apr 1, 2026

This backport also includes the commit from another backport (#49449). Those two backports are inter-dependent and need to be merged together.

Proposed commit message

The Start method from BeatV2Manager is split into two methods:
 - PreInit: responsible for starting the Elastic Agent client and
   replying to check-ins.
 - PostInit: responsible for setting the Beats status to 'Running' and
   starting to execute Unit changes.

A new method, WaitForStop, is also added. It stops the BeatV2Manager
and waits until all goroutines have returned. Currently it is only
used in tests that use `testing.T` as the logger output to ensure no
panics happen because the logger was used after the test ended.
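
The stop-and-wait behaviour described above could be sketched roughly as follows. This is a minimal illustration and not the BeatV2Manager code: the `manager` type, its fields, `startWorkers`, and the boolean return value are all hypothetical. Only the idea (signal the goroutines to stop, then wait for them with a timeout) comes from the description.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// manager is a hypothetical stand-in for a component that owns goroutines.
type manager struct {
	stopChan chan struct{}
	stopOnce sync.Once
	wg       sync.WaitGroup
}

func newManager() *manager {
	return &manager{stopChan: make(chan struct{})}
}

// startWorkers launches n goroutines that run until stopChan is closed.
func (m *manager) startWorkers(n int) {
	for i := 0; i < n; i++ {
		m.wg.Add(1)
		go func() {
			defer m.wg.Done()
			<-m.stopChan
		}()
	}
}

// Stop signals every goroutine to return. It is safe to call more than once.
func (m *manager) Stop() {
	m.stopOnce.Do(func() { close(m.stopChan) })
}

// WaitForStop stops the manager and reports whether all goroutines
// returned before the timeout elapsed.
func (m *manager) WaitForStop(timeout time.Duration) bool {
	m.Stop()
	done := make(chan struct{})
	go func() {
		m.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	m := newManager()
	m.startWorkers(3)
	fmt.Println("stopped cleanly:", m.WaitForStop(time.Second))
}
```

In a test that logs through `testing.T`, a call like this at the end of the test would ensure no goroutine is still logging after the test returns.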

Multiple lint warnings are fixed.

GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor-CLI, Model: GPT-5.3 Codex Extra High Fast

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

## Disruptive User Impact
## Author's Checklist

How to test this PR locally

Because this PR is an implementation-detail change, there is no directly observable behaviour change. The best way to test it is to run the new and existing tests.

Run the new test

cd x-pack/libbeat/management/
go test -count=1 -v . -run=TestManagerV2_PreInitAppliesBufferedUnitsAfterPostInit

Run the tests from the modified packages

# Run all management unit tests
go test -count=1 ./x-pack/libbeat/management/...

cd x-pack/filebeat
mage BuildSystemTestbinary
mage -v docker:composeUP

# Run all integration tests from the ManagerV2
go test -count=1 -tags=integration ./tests/integration -run="TestInputReloadUnderElasticAgent|TestFailedOutputReportsUnhealthy|TestRecoverFromInvalidOutputConfiguration|TestAgentPackageVersionOnStartUpInfo|TestHTTPJSONInputReloadUnderElasticAgentWithElasticStateStore|TestReloadErrorHandling|TestPipelineConnectionErrorFailsInput"

# Run all integration tests
go test -count=1 -tags=integration ./tests/integration

Related issues

## Use cases
## Screenshots
## Logs


This is an automatic backport of pull request #49796 done by [Mergify](https://mergify.com).

… check-in in parallel to its initialisation (#49796)

(cherry picked from commit 034546f)

# Conflicts:
#	x-pack/libbeat/management/managerV2.go
#	x-pack/osquerybeat/beater/osquerybeat_status_test.go
@mergify mergify bot requested a review from a team as a code owner April 1, 2026 14:09
@mergify mergify bot added the backport and conflicts (There is a conflict in the backported pull request) labels Apr 1, 2026
@mergify mergify bot requested review from a team as code owners April 1, 2026 14:09
@mergify mergify bot requested review from andrzej-stencel and khushijain21 and removed request for a team April 1, 2026 14:09
mergify bot commented Apr 1, 2026

Cherry-pick of 034546f has failed:

On branch mergify/bp/9.2/pr-49796
Your branch is up to date with 'origin/9.2'.

You are currently cherry-picking commit 034546fe9.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   changelog/fragments/1774984035-move-manager-start.yaml
	modified:   filebeat/beater/filebeat.go
	modified:   libbeat/beat/beat_test.go
	modified:   libbeat/cmd/instance/beat_test.go
	modified:   libbeat/management/management.go
	modified:   x-pack/filebeat/tests/integration/managerV2_test.go
	modified:   x-pack/filebeat/tests/integration/status_reporter_test.go
	modified:   x-pack/libbeat/management/managerV2_test.go
	modified:   x-pack/otel/otelmanager/manager.go

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	both modified:   x-pack/libbeat/management/managerV2.go
	deleted by us:   x-pack/osquerybeat/beater/osquerybeat_status_test.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot requested review from samuelvl and swiatekm and removed request for a team April 1, 2026 14:09
@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Apr 1, 2026
github-actions bot commented Apr 1, 2026

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@github-actions github-actions bot added the Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) and Team:Security-Linux Platform (Linux Platform Team in Security Solution) labels Apr 1, 2026
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@elasticmachine

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)

@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Apr 1, 2026
belimawr and others added 2 commits April 1, 2026 10:40
As described in #49388, `BeatV2Manager` can miss the shutdown signal because its `Stop` method notifies the manager by sending to its signal channel `stopChan` rather than closing it, but there are two goroutines that both listen on that channel.

This PR changes `Stop` to close the channel rather than just sending. It also removes the second `stopChan` listener in `watchErrChan`, since the main goroutine already calls the context canceler for that helper when `stopChan` unblocks (this isn't strictly necessary but it will keep error states visible for a little longer during shutdown, and is what was previously happening in the "good" path where the main worker received the stop signal first).

(cherry picked from commit d39cb49)
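
The difference between sending on a stop channel and closing it can be illustrated with a small standalone program. This is only a sketch of the general Go channel behaviour the commit message relies on; `wokenListeners` and the `release` channel are hypothetical helpers written for this demonstration, not code from `BeatV2Manager`.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// wokenListeners starts n goroutines that all wait on the same stop channel,
// signals the channel either by closing it or by sending a single value,
// and returns how many listeners observed the stop signal.
func wokenListeners(n int, useClose bool) int {
	stop := make(chan struct{})
	// release only exists so that the demo terminates in the "send" case,
	// where the listeners that missed the signal would otherwise block forever.
	release := make(chan struct{})

	var woken atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			select {
			case <-stop:
				woken.Add(1)
			case <-release:
			}
		}()
	}

	if useClose {
		close(stop) // a closed channel unblocks every receiver
	} else {
		stop <- struct{}{} // a single send is received by exactly one listener
		close(release)
	}
	wg.Wait()
	return int(woken.Load())
}

func main() {
	fmt.Printf("send woke %d of 2 listeners\n", wokenListeners(2, false))  // send woke 1 of 2 listeners
	fmt.Printf("close woke %d of 2 listeners\n", wokenListeners(2, true)) // close woke 2 of 2 listeners
}
```

With two listeners and a single send, which listener receives the value is not deterministic, which matches the missed shutdown signal described above.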
@belimawr belimawr enabled auto-merge (squash) April 1, 2026 15:23
@belimawr belimawr requested a review from Copilot April 1, 2026 15:27
Copilot AI left a comment

Pull request overview

Backports the ManagerV2 startup refactor to allow Beats to respond to Elastic Agent check-ins in parallel with expensive initialization (e.g., Filebeat state store load), and adds a shutdown-wait helper primarily for tests.

Changes:

  • Split BeatV2Manager.Start() into PreInit() (start client/check-ins) and PostInit() (mark running + allow applying unit changes), keeping Start() as a backwards-compatible wrapper.
  • Add WaitForStop(timeout) to stop the manager and wait for its goroutines to exit; update tests to use it to avoid post-test logger panics.
  • Move Filebeat management startup earlier (PreInit() before state store load; PostInit() once fully initialized), and adjust integration/unit tests + changelog fragments.
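
The split described in the list above might look roughly like the following sketch. It is a simplified, hypothetical model: the `manager` type, the `OnUnitChange` method, the string-typed units, and the exact signatures are assumptions made for illustration, and the real `BeatV2Manager` differs. It only shows the shape of the idea: unit changes that arrive between `PreInit` and `PostInit` are buffered and applied once the Beat is ready, while `Start` remains a wrapper that calls both.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical, much-simplified model of a two-phase start.
type manager struct {
	mu      sync.Mutex
	ready   bool     // set by PostInit
	pending []string // unit changes received before PostInit
	applied []string // unit changes that have been executed
}

// PreInit stands in for starting the client and replying to check-ins.
// In this sketch there is nothing to start, so it only returns nil.
func (m *manager) PreInit() error {
	return nil
}

// OnUnitChange is a hypothetical hook called for each incoming unit change.
// Changes are buffered until the Beat reports that it is ready.
func (m *manager) OnUnitChange(unit string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.ready {
		m.pending = append(m.pending, unit)
		return
	}
	m.applied = append(m.applied, unit)
}

// PostInit marks the Beat as ready and applies everything buffered so far.
func (m *manager) PostInit() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.ready = true
	m.applied = append(m.applied, m.pending...)
	m.pending = nil
}

// Start keeps the old single-call behaviour by running both phases.
func (m *manager) Start() error {
	if err := m.PreInit(); err != nil {
		return err
	}
	m.PostInit()
	return nil
}

func main() {
	m := &manager{}
	if err := m.PreInit(); err != nil {
		fmt.Println("PreInit failed:", err)
		return
	}
	m.OnUnitChange("input-1") // arrives during slow initialisation
	fmt.Println("applied before PostInit:", len(m.applied)) // 0
	m.PostInit()
	fmt.Println("applied after PostInit:", len(m.applied)) // 1
}
```

A caller with expensive initialisation would call `PreInit` first, do its slow work, and then call `PostInit`, whereas callers that do not care can keep calling `Start`.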

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| x-pack/otel/otelmanager/manager.go | Implements new Manager interface methods (PreInit/PostInit/WaitForStop) for the Otel manager stub. |
| x-pack/libbeat/management/managerV2.go | Core refactor: introduces PreInit/PostInit, WaitForStop, stop channel semantics, and buffering behavior before beat readiness. |
| x-pack/libbeat/management/managerV2_test.go | Updates tests to use WaitForStop and adds coverage for buffered units being applied after PostInit. |
| x-pack/filebeat/tests/integration/status_reporter_test.go | Minor test fix (stream selection + gosec suppression comment). |
| x-pack/filebeat/tests/integration/managerV2_test.go | Aligns expectations with updated startup/status timing. |
| libbeat/management/management.go | Extends Manager interface with PreInit/PostInit/WaitForStop; updates FallbackManager. |
| libbeat/cmd/instance/beat_test.go | Updates mock manager to satisfy new interface. |
| libbeat/beat/beat_test.go | Updates test manager to satisfy new interface. |
| filebeat/beater/filebeat.go | Starts manager check-in loop earlier (PreInit) and calls PostInit once initialization completes. |
| changelog/fragments/1774984035-move-manager-start.yaml | Changelog entry for Filebeat crash-loop fix under Elastic Agent. |
| changelog/fragments/1773263871-beats-manager-shutdown.yaml | Changelog entry for manager shutdown race fix backport dependency. |
Comments suppressed due to low confidence (1)

x-pack/libbeat/management/managerV2.go:304

  • PreInit drops the underlying error from cm.client.Start(ctx) (it returns a constant message). This makes troubleshooting connection/startup failures much harder; wrap the original error (e.g., using %w) so the caller gets the real cause.
	ctx := context.Background()
	err := cm.client.Start(ctx)
	if err != nil {
		return fmt.Errorf("error starting connection to client")
	}
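
The effect of wrapping with `%w`, as the comment suggests, can be seen in a small standalone sketch. The `startClient` function, the `errDial` sentinel, and the two `preInit...` helpers are hypothetical stand-ins for this demonstration; only the error message text is taken from the snippet above.

```go
package main

import (
	"errors"
	"fmt"
)

// errDial is a hypothetical underlying failure returned by the client.
var errDial = errors.New("connection refused")

// startClient is a hypothetical stand-in for cm.client.Start(ctx).
func startClient() error { return errDial }

// preInitDropping mirrors the snippet above: the cause is discarded.
func preInitDropping() error {
	if err := startClient(); err != nil {
		return fmt.Errorf("error starting connection to client")
	}
	return nil
}

// preInitWrapping keeps the cause in the error chain by using %w.
func preInitWrapping() error {
	if err := startClient(); err != nil {
		return fmt.Errorf("error starting connection to client: %w", err)
	}
	return nil
}

// keepsCause reports whether errDial is still reachable through err's chain.
func keepsCause(err error) bool { return errors.Is(err, errDial) }

func main() {
	fmt.Println(keepsCause(preInitDropping())) // false
	fmt.Println(keepsCause(preInitWrapping())) // true
	fmt.Println(preInitWrapping())             // error starting connection to client: connection refused
}
```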

Comment on lines 537 to 548
func (cm *BeatV2Manager) watchErrChan(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case err := <-cm.client.Errors():
			// Don't print the context cancelled errors that happen normally during shutdown, restart, etc
			if !errors.Is(err, context.Canceled) {
				cm.logger.Errorf("elastic-agent-client error: %s", err)
			}
		case <-cm.stopChan:
			return
		}
	}
Copilot AI Apr 1, 2026

watchErrChan now only exits on ctx.Done(), but Stop() relies on unitListen calling stopBeat() to cancel errCanceller. If unitListen exits via the signal path (it returns without calling stopBeat()), the context is never canceled and this goroutine can run forever (and WaitForStop can hang/time out). Consider ensuring Stop()/shutdown always cancels errCanceller and stops the client even when stopFunc must not be invoked.
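
One way to address this kind of concern, sketched here under assumptions, is to route every exit path through a single idempotent shutdown function that always cancels the watcher's context. The `manager` type, `watchErrors`, `shutdown`, and the `exited` flag below are hypothetical names for illustration and do not reflect how the PR resolved the comment.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// manager is a hypothetical component with one helper goroutine that
// only exits when its context is cancelled.
type manager struct {
	cancelWatcher context.CancelFunc
	stopOnce      sync.Once
	wg            sync.WaitGroup
	exited        bool // written by the watcher, read after wg.Wait()
}

func newManager() *manager {
	ctx, cancel := context.WithCancel(context.Background())
	m := &manager{cancelWatcher: cancel}
	m.wg.Add(1)
	go m.watchErrors(ctx)
	return m
}

// watchErrors blocks until its context is cancelled.
func (m *manager) watchErrors(ctx context.Context) {
	defer m.wg.Done()
	<-ctx.Done()
	m.exited = true
}

// shutdown is meant to be called from every exit path (explicit stop,
// signal handling, and so on). sync.Once makes repeated calls harmless.
func (m *manager) shutdown() {
	m.stopOnce.Do(func() { m.cancelWatcher() })
}

// wait blocks until the watcher goroutine has returned.
func (m *manager) wait() { m.wg.Wait() }

func main() {
	m := newManager()
	m.shutdown() // for example, from the stop path
	m.shutdown() // for example, from the signal path; no effect the second time
	m.wait()
	fmt.Println("watcher exited:", m.exited) // watcher exited: true
}
```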

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
github-actions bot commented Apr 1, 2026

TL;DR

Buildkite failed in Libbeat: Run check/update because libbeat/management/management.go is not gofmt-normalized in commit cc3c3d6f6e3403076f2a813749b7188c546ef85a. Regenerate/format and commit the resulting change.

Remediation

  • Run gofmt -w libbeat/management/management.go (or make -C libbeat update) and commit the formatting-only diff.
  • Re-run CI (or locally run make -C libbeat check update && make check-no-changes) to confirm the tree is clean.
Investigation details

Root Cause

mage check failed dirty-tree validation and reported exactly one modified file: libbeat/management/management.go.

The failing commit updated WaitForStop in that file, and the committed version is not gofmt-normalized around the adjacent one-line methods (Enabled, AgentInfo, PreInit, PostInit, Start) near libbeat/management/management.go lines ~183-188.

Running gofmt -d against the file content from commit cc3c3d6f6e3403076f2a813749b7188c546ef85a produces formatting changes in that section, which explains why make ... check update mutates it and then check-no-changes fails.

Evidence

Error: some files are not up-to-date. Run 'make update' then review and commit the changes. Modified: [libbeat/management/management.go]
make: *** [scripts/Makefile:151: check] Error 1

Verification

  • Verified from prefetched failed log: /tmp/gh-aw/buildkite-logs/beats-libbeat-libbeat-run-checkupdate.txt.
  • Verified commit contents for cc3c3d6... and confirmed gofmt -d on that blob emits a diff for libbeat/management/management.go.
  • Full local make -C libbeat check update was not runnable in this runner due to missing mage in PATH, but the failure mechanism is directly evidenced by CI log + gofmt diff on the commit blob.

Follow-up

After committing the formatting fix, if CI still reports dirty files, run root make update once to ensure no additional generated artifacts are pending.

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.


From workflow: PR Buildkite Detective

@belimawr belimawr merged commit f673dff into 9.2 Apr 1, 2026
206 of 209 checks passed
@belimawr belimawr deleted the mergify/bp/9.2/pr-49796 branch April 1, 2026 19:01
Labels

  • backport
  • conflicts (There is a conflict in the backported pull request)
  • Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)
  • Team:Security-Linux Platform (Linux Platform Team in Security Solution)

4 participants