[9.2](backport #49632) Fix check config and take over interaction #49784

Open
mergify[bot] wants to merge 4 commits into 9.2 from mergify/bp/9.2/pr-49632

@mergify
Contributor

@mergify mergify bot commented Mar 30, 2026

Proposed commit message

When using Filestream's take_over feature with autodiscover, files were being
re-ingested from the beginning instead of continuing from the offset recorded
by the Log input.

Autodiscover validates each rendered configuration by instantiating the input
with a temporary, suffixed ID before starting it. Because take_over ran during
input initialisation, states were migrated to the temporary ID rather than the
real input ID. When the real input started, the Log input states had already
been consumed, so all files appeared new.

The fix moves the take_over migration step from input initialisation to input
start. This ensures that config validation (CheckConfig) never triggers state
migration, and only the input that actually runs performs the takeover.
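The ordering change described above can be sketched with a toy model. Everything in this sketch is hypothetical (the `registry` map, the key layout, and the `newInput`, `checkConfig`, and `start` names are illustrative assumptions, not the real Beats types or functions); it only shows why migrating at start keeps validation free of side effects.

```go
package main

import "fmt"

// registry is a toy stand-in for the Filebeat registry: key -> offset.
// Hypothetical type; it does not mirror the real Beats implementation.
type registry map[string]int64

// input models a Filestream input with take_over enabled.
type input struct {
	id  string
	reg registry
}

// newInput models input initialisation. In this sketch it never touches the
// registry, because it also runs for throwaway instances built by validation.
func newInput(id string, reg registry) *input {
	return &input{id: id, reg: reg}
}

// checkConfig models autodiscover validation: it builds an input under a
// temporary, suffixed ID and discards it without starting it.
func checkConfig(id string, reg registry) error {
	_ = newInput(id+"-check", reg)
	return nil
}

// start models input start, the only place where state is migrated.
// It returns the offset the input will resume from.
func (in *input) start(file string) int64 {
	logKey := "log::" + file
	fsKey := "filestream::" + in.id + "::" + file
	if off, ok := in.reg[logKey]; ok {
		in.reg[fsKey] = off
	}
	return in.reg[fsKey]
}

func main() {
	reg := registry{"log::/var/log/app.log": 1024}

	_ = checkConfig("my-id", reg)
	fmt.Println(len(reg)) // prints 1: validation migrated nothing

	in := newInput("my-id", reg)
	fmt.Println(in.start("/var/log/app.log")) // prints 1024
}
```

If the migration were done inside `newInput` instead, the entry would be written under the `my-id-check` key during validation, which is the failure mode the commit message describes.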

Additionally, the Log input state is no longer deleted from the registry after
migration. Instead, Filestream checks whether it already holds a state for the
file before migrating, skipping the takeover if a state is found. This makes
the mechanism idempotent and removes reliance on the TTL=-2 heuristic that was
used to detect previously-migrated states.
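A minimal sketch of the idempotency check, assuming a flat key/offset store. The `store` type, the key format, and the `takeOver` signature are assumptions made for illustration and are not taken from the PR's code.

```go
package main

import "fmt"

// store is a hypothetical flat registry: key -> offset.
type store map[string]int64

// takeOver copies the Log input offset for file to the Filestream key unless
// Filestream already holds a state for that file. It reports whether a
// migration happened. The Log input entry is deliberately left in place.
func takeOver(s store, inputID, file string) bool {
	fsKey := "filestream::" + inputID + "::" + file
	if _, ok := s[fsKey]; ok {
		return false // already migrated (or already ingesting): skip
	}
	off, ok := s["log::"+file]
	if !ok {
		return false // nothing to take over
	}
	s[fsKey] = off
	return true
}

func main() {
	s := store{"log::/tmp/a.log": 300}

	fmt.Println(takeOver(s, "id", "/tmp/a.log")) // prints true

	// Filestream keeps reading and advances its own offset.
	s["filestream::id::/tmp/a.log"] = 450

	// A second takeover is a no-op and does not rewind the offset.
	fmt.Println(takeOver(s, "id", "/tmp/a.log")) // prints false
	fmt.Println(s["filestream::id::/tmp/a.log"]) // prints 450
}
```

Because the decision depends only on whether a Filestream state exists, no marker on the Log input entry is needed in this model.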

Last, but not least, a few other issues in the TakeOver implementation
are also fixed:
- Incorrect resource release
- ephemeralStore is now locked throughout the whole TakeOver duration
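The locking point can be illustrated as follows. This is a sketch under assumptions: `toyStore` is a hypothetical type loosely inspired by the store named above, not the actual `ephemeralStore`, and the method shape is invented. It shows a single lock held across both the existence check and the write, with `defer` guaranteeing release on every return path.

```go
package main

import (
	"fmt"
	"sync"
)

// toyStore is a hypothetical in-memory store guarded by one mutex.
type toyStore struct {
	mu    sync.Mutex
	table map[string]int64
}

// takeOver holds the lock for the whole check-then-write sequence, so no
// other goroutine can insert the same key between the check and the write.
func (s *toyStore) takeOver(fsKey string, logOffset int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock() // released on every return path

	if _, ok := s.table[fsKey]; ok {
		return false
	}
	s.table[fsKey] = logOffset
	return true
}

func main() {
	s := &toyStore{table: map[string]int64{}}

	const workers = 8
	results := make([]bool, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = s.takeOver("filestream::id::/tmp/a.log", int64(i))
		}(i)
	}
	wg.Wait()

	migrations := 0
	for _, ok := range results {
		if ok {
			migrations++
		}
	}
	fmt.Println("migrations:", migrations) // prints "migrations: 1"
}
```

Locking only around the check, or only around the write, would let two goroutines both pass the check before either writes.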

GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Claude-CLI, Model: Claude 4.6 Opus (Thinking)
Tool: Cursor-CLI, Model: GPT-5.3 Codex Extra High

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works. Where relevant, I have used the stresstest.sh script to run them under stress conditions and race detector to verify their stability.
  • I have added an entry in ./changelog/fragments using the changelog tool.

## Disruptive User Impact
## Author's Checklist

How to test this PR locally

The integration test TestAutodiscoverFilestreamTakeOverDoesNotReingest requires Kind (and Docker) to create a K8s cluster for testing.

Run the tests

cd filebeat
go test -v -run '(?i)takeover' ./input/filestream/... -race
mage buildSystemTestBinary
go test -v -tags integration -run '(?i)takeover' ./tests/integration/... -race

Manual Test: Filestream take_over does not re-ingest with autodiscover

Requirements: Linux, Docker, root (needs /var/lib/docker/containers read access)


  1. Start a container that writes one log line per second:

    docker run -d --name flog-test mingrammer/flog -l -d 1 -s 1
    export CONTAINER_ID=$(docker inspect -f '{{.Id}}' flog-test)
    
  2. Start Filebeat with the Log input via autodiscover, pointed at the container log file:

    # filebeat-log.yml
    filebeat.autodiscover:
      providers:
        - type: docker
          templates:
            - condition:
                contains:
                  docker.container.id: ${CONTAINER_ID}
              config:
                - type: log
                  allow_deprecated_use: true
                  paths:
                    - /var/lib/docker/containers/${data.docker.container.id}/*.log
                  json:
                    message_key: log
                    keys_under_root: true
                    overwrite_keys: true
    output.file:
      path: /tmp/fb-test
      filename: output
      rotate_on_startup: false
    
    logging:
      to_stderr: true

    Start Filebeat:

    filebeat -c filebeat-log.yml
    
  3. Wait until at least 5 events appear in the output file, then stop Filebeat. Note the line count:

    wc -l /tmp/fb-test/output*
    
  4. Restart Filebeat with the Filestream input and take_over.enabled: true, using the same output file (no rotation):

    # filebeat-filestream.yml
    filebeat.autodiscover:
      providers:
        - type: docker
          templates:
            - condition:
                contains:
                  docker.container.id: ${CONTAINER_ID}
              config:
                - type: filestream
                  id: "${data.docker.container.id}-logs"
                  take_over:
                    enabled: true
                  file_identity.native: ~
                  prospector.scanner.fingerprint.enabled: false
                  close.on_state_change.inactive: 2s
                  paths:
                    - /var/lib/docker/containers/${data.docker.container.id}/*.log
                  parsers:
                    - container: ~
    output.file:
      path: /tmp/fb-test
      filename: output
      rotate_on_startup: false
    
    logging:
      to_stderr: true
      level: debug
      selectors:
        - "input.filestream"

    Start Filebeat:

    filebeat -c filebeat-filestream.yml

  5. Wait until at least 2 new lines appear in the output (check with wc -l /tmp/fb-test/output*), confirming Filestream picked up where the Log input left off.

  6. Stop the container and count the total lines it generated:

    docker stop flog-test
    GENERATED=$(docker logs flog-test 2>/dev/null | wc -l)
    echo "Container generated: $GENERATED"

  7. Wait for Filebeat to log "File is inactive. Closing.", then stop it and count total ingested events:

    TOTAL_INGESTED=$(cat /tmp/fb-test/output* | wc -l)
    echo "Total ingested: $TOTAL_INGESTED"
    

Expected result

TOTAL_INGESTED == GENERATED

No lines should be duplicated or missing. If TOTAL_INGESTED > GENERATED, re-ingestion occurred — the Filestream input restarted from offset 0 instead of continuing from where the Log input stopped.
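The comparison above can be written as a small helper. This is an illustrative sketch only: the `verdict` function and its return strings are hypothetical, and it compares line counts, so it cannot detect a duplicate and a missing line that cancel each other out.

```go
package main

import "fmt"

// verdict classifies the manual test outcome from the two line counts.
// Hypothetical helper: generated is the number of lines the container wrote,
// ingested is the number of events in the Filebeat output file(s).
func verdict(generated, ingested int) string {
	switch {
	case ingested == generated:
		return "ok"
	case ingested > generated:
		return "re-ingestion: more events than generated lines"
	default:
		return "missing: fewer events than generated lines"
	}
}

func main() {
	fmt.Println(verdict(120, 120)) // prints "ok"
	fmt.Println(verdict(120, 185)) // prints "re-ingestion: more events than generated lines"
	fmt.Println(verdict(120, 118)) // prints "missing: fewer events than generated lines"
}
```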

Related issues

## Use cases
## Screenshots
## Logs


This is an automatic backport of pull request #49632 done by [Mergify](https://mergify.com).

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
(cherry picked from commit 8a648cf)

# Conflicts:
#	filebeat/input/filestream/internal/input-logfile/input.go
#	filebeat/input/filestream/internal/input-logfile/store.go
@mergify mergify bot requested review from a team as code owners March 30, 2026 15:03
@mergify mergify bot requested review from AndersonQ and andrzej-stencel and removed request for a team March 30, 2026 15:03
@mergify mergify bot added the backport and conflicts labels Mar 30, 2026
@botelastic botelastic bot added the needs_team label Mar 30, 2026
@mergify
Contributor Author

mergify bot commented Mar 30, 2026

Cherry-pick of 8a648cf has failed:

On branch mergify/bp/9.2/pr-49632
Your branch is up to date with 'origin/9.2'.

You are currently cherry-picking commit 8a648cf55.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   .buildkite/filebeat/filebeat-pipeline.yml
	new file:   changelog/fragments/1769099693-filestream-takeover-autodiscover-reingest.yaml
	modified:   filebeat/input/filestream/internal/input-logfile/manager.go
	modified:   filebeat/input/filestream/internal/input-logfile/manager_test.go
	modified:   filebeat/input/filestream/internal/input-logfile/prospector.go
	modified:   filebeat/input/filestream/internal/input-logfile/store_test.go
	modified:   filebeat/input/filestream/prospector.go
	modified:   filebeat/tests/integration/autodiscover_test.go
	new file:   filebeat/tests/integration/testdata/autodiscover/take-over-filestream-input-k8s.yml
	new file:   filebeat/tests/integration/testdata/autodiscover/take-over-log-input-k8s.yml

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   filebeat/input/filestream/internal/input-logfile/input.go
	both modified:   filebeat/input/filestream/internal/input-logfile/store.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@github-actions github-actions bot added the Team:Elastic-Agent-Data-Plane and bugfix labels Mar 30, 2026
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team label Mar 30, 2026
GenAI-Assisted: Yes
Human-Reviewed: Yes
Tool: Cursor-CLI, Model: GPT-5.3 Codex Extra High Fast
@github-actions
Contributor

TL;DR

All 4 failing Buildkite jobs are failing before packaging starts because mage is not available in the agent environment (No preset version installed for command mage, exit 126). This is a CI configuration/environment issue, not a test/assertion failure in Beat code.

Remediation

  • Ensure the beats-xpack-agentbeat pipeline sets ASDF_MAGE_VERSION=1.15.0 (or installs mage 1.15.0) before running .buildkite/scripts/packaging/packaging.sh x-pack/agentbeat.
  • Re-run the failed packaging jobs after fixing pipeline env/tool bootstrap.
  • Validation: confirm mage --version succeeds in the job, then verify artifacts appear under x-pack/agentbeat/build/distributions/.

Investigation details

Root Cause

The packaging script invokes mage directly:

  • .buildkite/scripts/packaging/packaging.sh:23: mage package

In each failed Buildkite log, execution reaches ~~~ Packaging : x-pack/agentbeat and immediately fails with:

  • No preset version installed for command mage
  • Suggested install: asdf install mage 1.15.0
  • Job exits with status 126

This happened consistently in all 4 failing jobs:

  • Agentbeat packaging Linux
  • Agentbeat packaging linux/amd64 FIPS
  • Agentbeat packaging linux/arm64 FIPS
  • Agentbeat packaging windows/arm64

The repository’s other Buildkite pipeline definitions commonly set ASDF_MAGE_VERSION: 1.15.0 (for example .buildkite/packaging.pipeline.yml:5), which matches the version requested in the failing logs.

Evidence

  • Build: https://buildkite.com/elastic/beats/builds/43253
  • Failing command: .buildkite/scripts/packaging/packaging.sh x-pack/agentbeat
  • Key excerpt (from all failed logs):
    • ~~~ Packaging : x-pack/agentbeat
    • No preset version installed for command mage
    • asdf install mage 1.15.0
    • Error: The command exited with status 126

Verification

  • Local code inspection performed for packaging entrypoint and pipeline config references.
  • Build/test reruns were not executed here (read-only detective workflow; no CI config mutation).

Follow-up

  • If beats-xpack-agentbeat is generated from a shared template, update the template so ASDF_MAGE_VERSION stays aligned and this does not regress.

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.


What is this? | From workflow: PR Buildkite Detective

Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not.

Labels

backport, bugfix, conflicts, Team:Elastic-Agent-Data-Plane
