
[8.19](backport #49796) Split BeatV2Manager Start into two methods, so Beats can reply to check-in in parallel to its initialisation #49848

Merged
belimawr merged 6 commits into 8.19 from mergify/bp/8.19/pr-49796
Apr 1, 2026

Conversation

@mergify mergify bot commented Apr 1, 2026

This backport also includes the commit from another backport (#49448). Those two backports are inter-dependent and need to be merged together.

Proposed commit message

The Start method from BeatV2Manager is split into two methods:
 - PreInit: responsible for starting the Elastic Agent client and
   starting to reply to check-ins.
 - PostInit: responsible for setting the Beats status to 'Running' and
   starting to execute Unit changes.

A new method, WaitForStop, is also added. It stops the BeatV2Manager
and waits until all goroutines have returned. Currently it is only
used in tests that use `testing.T` as the logger output to ensure no
panics happen because the logger was used after the test ended.

Multiple lint warnings are fixed.

GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor-CLI, Model: GPT-5.3 Codex Extra High Fast

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

## Disruptive User Impact
## Author's Checklist

How to test this PR locally

Because this PR changes an implementation detail, there is no directly observable behaviour change. The best way to test it is to run the new and existing tests.

Run the new test

cd x-pack/libbeat/management/
go test -count=1 -v . -run=TestManagerV2_PreInitAppliesBufferedUnitsAfterPostInit

Run the tests from the modified packages

# Run all management unit tests
go test -count=1 ./x-pack/libbeat/management/...

cd x-pack/filebeat
mage BuildSystemTestbinary
mage -v docker:composeUP

# Run all integration tests from the ManagerV2
go test -count=1 -tags=integration ./tests/integration -run="TestInputReloadUnderElasticAgent|TestFailedOutputReportsUnhealthy|TestRecoverFromInvalidOutputConfiguration|TestAgentPackageVersionOnStartUpInfo|TestHTTPJSONInputReloadUnderElasticAgentWithElasticStateStore|TestReloadErrorHandling|TestPipelineConnectionErrorFailsInput"

# Run all integration tests
go test -count=1 -tags=integration ./tests/integration

Related issues

## Use cases
## Screenshots
## Logs


This is an automatic backport of pull request #49796 done by [Mergify](https://mergify.com).

… check-in in parallel to its initialisation (#49796)

(cherry picked from commit 034546f)

# Conflicts:
#	x-pack/libbeat/management/managerV2.go
#	x-pack/osquerybeat/beater/osquerybeat_status_test.go
@mergify mergify bot added the backport and conflicts (There is a conflict in the backported pull request) labels Apr 1, 2026
@mergify mergify bot requested review from a team as code owners April 1, 2026 14:08

mergify bot commented Apr 1, 2026

Cherry-pick of 034546f has failed:

On branch mergify/bp/8.19/pr-49796
Your branch is up to date with 'origin/8.19'.

You are currently cherry-picking commit 034546fe9.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	new file:   changelog/fragments/1774984035-move-manager-start.yaml
	modified:   filebeat/beater/filebeat.go
	modified:   libbeat/beat/beat_test.go
	modified:   libbeat/cmd/instance/beat_test.go
	modified:   libbeat/management/management.go
	modified:   x-pack/filebeat/tests/integration/managerV2_test.go
	modified:   x-pack/filebeat/tests/integration/status_reporter_test.go
	modified:   x-pack/libbeat/management/managerV2_test.go
	modified:   x-pack/otel/otelmanager/manager.go

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	both modified:   x-pack/libbeat/management/managerV2.go
	deleted by us:   x-pack/osquerybeat/beater/osquerybeat_status_test.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot requested review from faec and leehinman and removed request for a team April 1, 2026 14:09
@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Apr 1, 2026
@mergify mergify bot requested review from blakerouse and ycombinator and removed request for a team April 1, 2026 14:09

github-actions bot commented Apr 1, 2026

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@github-actions github-actions bot added the Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team) and Team:Security-Linux Platform (Linux Platform Team in Security Solution) labels Apr 1, 2026
@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Apr 1, 2026
@elasticmachine

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)

@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)


github-actions bot commented Apr 1, 2026

TL;DR

Buildkite is failing for two verified reasons: unresolved merge-conflict markers in x-pack/libbeat/management/managerV2.go and an import in x-pack/osquerybeat/beater/osquerybeat_status_test.go that cannot be resolved on this backport branch. Resolve both, then rerun CI.

Remediation

  • Resolve and remove conflict markers in x-pack/libbeat/management/managerV2.go (<<<<<<<, =======, >>>>>>>) and keep the intended merged code.
  • In x-pack/osquerybeat/beater/osquerybeat_status_test.go, replace/remove libbeat/beatmonitoring usage (line 18 and NewMonitoring() call sites) with an API available on 8.19, or backport the missing package if that is intentional.
  • Validate with:
    • pre-commit run --all-files
    • make -C x-pack/libbeat check update
    • make -C filebeat check update (or one representative beat pipeline first)
Investigation details

Root Cause

  1. Configuration / merge-conflict artifact

    • Multiple pre-commit jobs fail on check-merge-conflict and point to the same file and lines.
    • go vet in x-pack jobs also fails parsing the same markers.
  2. Code/backport mismatch in test import

    • Multiple check-no-changes jobs fail while resolving github.com/elastic/beats/v7/libbeat/beatmonitoring from x-pack/osquerybeat/beater/osquerybeat_status_test.go.
    • On this CI branch, that package is not being resolved in-module, so go attempts module download and fails.

Evidence

  • Build: https://buildkite.com/elastic/beats/builds/43419
  • Representative log evidence:
    • /tmp/gh-aw/buildkite-logs/beats-libbeat-libbeat-run-pre-commit.txt lines 131-143:
      • check for merge conflicts...Failed
      • x-pack/libbeat/management/managerV2.go:94: Merge conflict string '<<<<<<<' found
      • ...:95: '=======' found
      • ...:99: '>>>>>>>' found
    • /tmp/gh-aw/buildkite-logs/beats-xpack-winlogbeat-x-packwinlogbeat-run-checkupdate.txt lines 133-141:
      • ../libbeat/management/managerV2.go:94:1: syntax error: unexpected <<
      • ...:95:1: syntax error: unexpected ==
      • ...:99:1: syntax error: unexpected >>
    • /tmp/gh-aw/buildkite-logs/filebeat-filebeat-run-checkupdate.txt lines 135-145:
      • finding module for package github.com/elastic/beats/v7/libbeat/beatmonitoring
      • module github.com/elastic/beats@latest found (v7.6.2+incompatible), but does not contain package .../libbeat/beatmonitoring
  • Source reference:
    • x-pack/osquerybeat/beater/osquerybeat_status_test.go:18 imports github.com/elastic/beats/v7/libbeat/beatmonitoring.

Verification

  • Local reproduction was not run because this workflow only has prefetched Buildkite logs for this failing run.
  • Findings are based on repeated, consistent errors across 19 failed jobs.

Follow-up

  • After resolving conflicts and the osquerybeat test import mismatch, rerun CI; if only infra/transient failures remain, retry affected jobs.

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.


What is this? | From workflow: PR Buildkite Detective


As described in #49388, `BeatV2Manager` can miss the shutdown signal: its `Stop` method notifies the manager by sending to its signal channel `stopChan` rather than closing it, but two goroutines listen on that channel, so a single send reaches only one of them.

This PR changes `Stop` to close the channel rather than just sending. It also removes the second `stopChan` listener in `watchErrChan`, since the main goroutine already calls the context canceler for that helper when `stopChan` unblocks (this isn't strictly necessary but it will keep error states visible for a little longer during shutdown, and is what was previously happening in the "good" path where the main worker received the stop signal first).

(cherry picked from commit d39cb49)

github-actions bot commented Apr 1, 2026

TL;DR

golangci-lint failed in all three matrix jobs because x-pack/libbeat/management/managerV2.go introduces two unused struct fields and one errors.Is call with reversed arguments; tests also call deprecated Start(). Fix those code issues and rerun CI.

Remediation

  • In x-pack/libbeat/management/managerV2.go, remove or use the unused fields reported by lint:
    • stopOnce sync.Once (around line 94)
    • wg sync.WaitGroup (around line 126)
  • In watchErrChan, fix the errors.Is argument order:
    • from errors.Is(context.Canceled, err)
    • to errors.Is(err, context.Canceled)
  • In x-pack/libbeat/management/managerV2_test.go, replace deprecated m.Start() usage in flagged tests with PreInit(...) + PostInit(...) (as suggested by staticcheck), then rerun lint.
Investigation details

Root Cause

The failing step is golangci-lint in all OS variants (ubuntu-latest, macos-latest, windows-latest). The same 6 lint issues are reported each time: 4 staticcheck + 2 unused.

Evidence

  • Workflow: https://github.com/elastic/beats/actions/runs/23854897321
  • Job/step: lint (ubuntu-latest) / golangci-lint (same findings on macOS/Windows)
  • Key log excerpt:
    • x-pack/libbeat/management/managerV2.go:544:8: SA1032: arguments have the wrong order (staticcheck)
    • x-pack/libbeat/management/managerV2_test.go:398:8: SA1019: m.Start is deprecated: Use [PreInit] and [PostInit] instead
    • x-pack/libbeat/management/managerV2_test.go:644:12: SA1019: m.Start is deprecated: Use [PreInit] and [PostInit] instead
    • x-pack/libbeat/management/managerV2_test.go:804:12: SA1019: m.Start is deprecated: Use [PreInit] and [PostInit] instead
    • x-pack/libbeat/management/managerV2.go:94:2: field stopOnce is unused (unused)
    • x-pack/libbeat/management/managerV2.go:126:2: field wg is unused (unused)

Validation

  • Not run locally (this was a log-based investigation only).

Follow-up

  • After applying the fixes above, rerun golangci-lint (or rerun the PR check) to confirm the matrix passes.

Note

🔒 Integrity filtering filtered 14 items

Integrity filtering activated and filtered the following items during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.


What is this? | From workflow: PR Actions Detective


@belimawr belimawr requested a review from Copilot April 1, 2026 15:29
@belimawr belimawr enabled auto-merge (squash) April 1, 2026 15:30

Copilot AI left a comment

Pull request overview

Backport that refactors the Elastic Agent V2 management startup sequence so Beats can begin responding to Agent check-ins earlier (during potentially slow initialization), and adds a shutdown-wait helper primarily for safer tests.

Changes:

  • Split BeatV2Manager.Start() into PreInit() (start client + check-ins) and PostInit() (mark running + enable applying unit changes); keep Start() as a deprecated compatibility wrapper.
  • Add WaitForStop(timeout) to the management interface and implementations; update tests to use it to avoid goroutine/logging issues during teardown.
  • Move Filebeat’s manager startup earlier in Run() (call PreInit() before expensive initialization; PostInit() once ready).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| x-pack/libbeat/management/managerV2.go | Implements PreInit/PostInit, deprecates Start, introduces WaitForStop, and gates applying unit changes until the beat is ready. |
| libbeat/management/management.go | Extends the Manager interface with PreInit, PostInit, and WaitForStop. |
| filebeat/beater/filebeat.go | Calls Manager.PreInit() earlier and Manager.PostInit() once initialization completes. |
| x-pack/libbeat/management/managerV2_test.go | Adds coverage for buffered unit application after PostInit; updates shutdown to wait for stop. |
| x-pack/filebeat/tests/integration/*.go | Adjusts integration expectations around unit state/streams and adds minor lint suppressions. |
| x-pack/otel/otelmanager/manager.go | Updates Otel manager to satisfy the expanded Manager interface. |
| libbeat/beat/beat_test.go, libbeat/cmd/instance/beat_test.go | Updates mock managers for the new interface methods. |
| changelog/fragments/*.yaml | Adds changelog entries for the Filebeat crash-loop fix and manager shutdown behavior. |
Comments suppressed due to low confidence (1)

x-pack/libbeat/management/managerV2.go:305

  • PreInit() drops the underlying error from cm.client.Start(ctx): it returns a generic "error starting connection to client" without including the original error, which makes troubleshooting failures harder. Wrap/return the underlying error (e.g., using %w) so callers get the cause in logs and test output.
	cm.logger.Debug("Manager starting")
	ctx := context.Background()
	err := cm.client.Start(ctx)
	if err != nil {
		return fmt.Errorf("error starting connection to client")
	}


belimawr and others added 2 commits April 1, 2026 11:42
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@belimawr belimawr merged commit 3f10c97 into 8.19 Apr 1, 2026
203 of 206 checks passed
@belimawr belimawr deleted the mergify/bp/8.19/pr-49796 branch April 1, 2026 17:37

Labels

backport, conflicts (There is a conflict in the backported pull request), Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team), Team:Security-Linux Platform (Linux Platform Team in Security Solution)


4 participants