
Commit 11d1fe7

strawgate and claude authored
Add PR Buildkite Detective trigger workflow (#49589)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 4666186 commit 11d1fe7

6 files changed (+522 / -0 lines)
Lines changed: 101 additions & 0 deletions
```yaml
name: "Sweeper: Filestream Registry and State Machine"
on:
  schedule:
    - cron: "0 9 * * 2"
  workflow_dispatch:

permissions:
  actions: read
  contents: read
  issues: write
  pull-requests: read

jobs:
  run:
    uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
    with:
      title-prefix: "[filestream-registry]"
      severity-threshold: "high"
      additional-instructions: |
        You are a **re-ingestion and data-loss sweeper** for the Filestream input. Your goal is
        to find every code path that can cause a file to be re-read from the beginning or events
        to be silently dropped — and write a test or reproduction scenario for each one.

        ## The component

        The Filestream input lives under `filebeat/input/filestream/`. Its registry (which tracks
        how far each file has been read) is implemented under
        `filebeat/input/filestream/internal/input-logfile/`. The registry is the heart of
        Filestream's correctness guarantee: if it loses a file's offset, that file will be
        re-ingested from the beginning, producing duplicate data in the output.

        ## The bug class

        There are two failure modes that have produced bugs repeatedly in this component:

        **Re-ingestion**: A file's offset is lost or reset to zero. This happens when a registry
        entry is prematurely cleaned, when a file is matched to the wrong registry entry (identity
        mismatch), when a cursor update is silently discarded, or when a migration path
        (take_over, identifier type change) fails to preserve the offset.

        **Silent data loss**: Events are published to the pipeline but the cursor is never
        advanced. This happens when an ACK callback is discarded due to a version mismatch,
        when a resource is cleaned while events are still in-flight, or when an error path
        exits without calling the event's done/ACK callback.

        ## How to investigate

        Read the registry store, the harvester lifecycle, and the ACK/publish path end-to-end.
        Understand how a file's state flows from "new file discovered" through "harvesting"
        through "ACK received" to "registry updated on disk". Then ask: at each transition,
        what can go wrong?

        Key areas to focus on:

        **Identity and matching**: How does the input decide which registry entry belongs to a
        given file? There are two identity modes (inode-based and fingerprint-based). What
        happens when files are rotated, renamed, or when the identity mode is changed in config?
        Can a file end up matched to the wrong entry?

        **Cursor update lifecycle**: When an event is ACKed, how does the cursor get written
        to the persistent registry? Are there conditions (version mismatch, resource marked
        deleted, concurrent cleanup) where a valid ACK silently discards the cursor update?
        If so, the input will re-read from the last successfully persisted offset on next start.

        **Cleanup timing**: The `clean_inactive` and `clean_removed` settings delete registry
        entries after some time. What is the earliest moment an entry can be cleaned? Is it
        possible for an entry to be cleaned while a harvester for that file is still running
        or while events from that file are in-flight in the pipeline?

        **Migration paths**: When filestream takes over from another input type, or when
        `harvester_limit` causes files to be queued, are there windows where the state for a
        file can be lost or reset?

        ## For each risk you confirm

        Write a unit test using the existing test infrastructure in the package (look at existing
        `*_test.go` files for helper patterns). The test should set up the registry in the
        relevant state, trigger the problematic transition, and assert that the cursor is
        preserved. Run `go test ./filebeat/input/filestream/...` to confirm the failure.

        ## The bar for filing

        Only report findings that a real user could encounter with a realistic Filestream
        configuration. The bug must be triggerable through normal user actions: starting and
        stopping Filebeat, rotating log files, changing config, hitting harvester limits, or
        running on a filesystem that performs renames. Do not file findings that require
        manually corrupting the registry store or calling internal functions in an order that
        the real code path never produces. If you cannot describe a concrete sequence of
        user-observable events that leads to the bug, it is not worth filing.

        ## Output

        File a single issue containing:
        - Confirmed risks with test code or reproduction steps, the exact code path, and the
          fix direction
        - A description of any scenarios you investigated and found safe, so reviewers know
          the coverage
        - A priority ranking: which risks are most likely to affect users in production vs
          only in edge-case configurations
    secrets:
      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
```
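The "cursor update lifecycle" hazard this prompt describes is an ACK whose cursor update is silently discarded after a version mismatch, so the offset is never persisted even though the events were published. A toy versioned store can sketch that failure mode; the `entry` and `store` types and the `ackUpdate` function here are illustrative, not the real `input-logfile` API:

```go
package main

import "fmt"

// entry models a registry entry whose version is bumped whenever the
// entry is recreated (e.g. after cleanup). Illustrative only.
type entry struct {
	version int
	offset  int64
}

type store struct{ entries map[string]*entry }

// ackUpdate applies a cursor update only if the version still matches.
// Returning false is the "silently discarded" path the sweeper hunts
// for: on the next start, the file is re-read from the stale offset.
func (s *store) ackUpdate(key string, version int, offset int64) bool {
	e, ok := s.entries[key]
	if !ok || e.version != version {
		return false // discarded: offset never persisted
	}
	if offset > e.offset {
		e.offset = offset
	}
	return true
}

func main() {
	s := &store{entries: map[string]*entry{"log": {version: 1}}}

	ok := s.ackUpdate("log", 1, 4096) // normal ACK: offset advances
	fmt.Println(ok, s.entries["log"].offset)

	// Cleanup recreates the entry with a new version while events from
	// the old harvester generation are still in flight…
	s.entries["log"] = &entry{version: 2}

	ok = s.ackUpdate("log", 1, 8192) // stale version: update is dropped
	fmt.Println(ok, s.entries["log"].offset)
}
```

A real test in the package would drive the same transition through the actual store and assert the persisted cursor, rather than this standalone model.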
Lines changed: 100 additions & 0 deletions
```yaml
name: "Sweeper: Libbeat Pipeline Shutdown and Queue Lifecycle"
on:
  schedule:
    - cron: "0 9 * * 3"
  workflow_dispatch:

permissions:
  actions: read
  contents: read
  issues: write
  pull-requests: read

jobs:
  run:
    uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
    with:
      title-prefix: "[libbeat-pipeline-lifecycle]"
      severity-threshold: "high"
      additional-instructions: |
        You are a **pipeline lifecycle sweeper** for libbeat. Your goal is to find every path
        where events can be silently dropped, goroutines can leak, or panics can be triggered
        during startup, shutdown, output reconnection, or backpressure — and write a failing
        test for each confirmed issue.

        ## The component

        The libbeat publisher pipeline connects inputs to outputs. It lives under
        `libbeat/publisher/pipeline/` (the pipeline orchestrator and consumer),
        `libbeat/publisher/queue/` (memqueue and diskqueue implementations), and
        `libbeat/outputs/` (elasticsearch, logstash, kafka, and others). The pipeline has a
        carefully ordered shutdown sequence that, if violated, produces panics (send on closed
        channel) or hangs (goroutines blocked waiting for signals that never come).

        ## The bug class

        Three categories of bugs recur here:

        **Shutdown ordering**: The pipeline shuts down in stages — queue, then consumer, then
        output workers. If any stage closes a channel or signals done before the upstream stage
        has finished sending to it, the result is either a "send on closed channel" panic or
        a goroutine that blocks forever. Look for channels that are closed by one goroutine
        while another goroutine may still be sending to them.

        **Signal broadcasting**: Go channels deliver a message to exactly ONE receiver. When
        multiple goroutines need to observe a shutdown signal, the correct pattern is
        `close(ch)` (with `sync.Once` to prevent double-close), not `ch <- struct{}{}`.
        Any buffered-send channel used as a signal to multiple goroutines is a latent bug —
        only one goroutine will wake up, the others hang forever.

        **ACK callback blocking**: When events are ACKed, the queue calls user-provided callbacks
        synchronously. If a callback does slow work (filesystem I/O, network), it blocks the
        ACK loop, which blocks the queue's shutdown drain, which blocks the pipeline's
        `WaitClose()`. This manifests as a hang on graceful shutdown. Look for ACK callbacks
        that do more than increment a counter.

        ## How to investigate

        Read the pipeline shutdown sequence end-to-end. Understand what each component closes,
        in what order, and what it is waiting for before considering itself done. Then look at
        each channel in the system and ask: who sends to this channel, who receives from it,
        and what happens when the pipeline is being torn down while a send or receive is in
        progress?

        Also read the queue implementations (memqueue and diskqueue) for:
        - Response channels returned to object pools — if a channel is pooled while another
          goroutine still holds a reference to it, the next user of the channel will receive
          a stale message
        - Error paths in the disk queue's write path — if a write fails midway, is the partial
          state cleaned up, or will the next startup encounter corrupt data?

        For outputs, read the backoff client wrapper. If `Close()` is called while a Publish
        is sleeping in a backoff wait, does the sleep abort immediately (correct) or block
        until the full backoff duration elapses (incorrect, causes slow shutdown)?

        ## For each risk you confirm

        Write a Go test. Use goroutines, channels, and short timeouts to create the concurrent
        scenario. For shutdown hangs, the test should call `Close()` and assert it returns within
        a short timeout. For panics, use `require.NotPanics`. Run with `-race` where applicable:
        `go test -race ./libbeat/publisher/...`

        ## The bar for filing

        Only report findings that a real deployment could hit. Shutdown races need to be
        triggerable under normal operating conditions (e.g. sending SIGTERM while the output
        is processing events), not only under artificially timed test scenarios that can't
        occur in practice. For goroutine leaks and hangs, confirm that the problematic code
        path is reachable from the normal pipeline lifecycle — not just from unit test helpers
        that bypass the real startup sequence. If the finding requires a precondition that no
        production deployment would have, skip it.

        ## Output

        File a single issue containing:
        - Confirmed issues with test code or reproduction steps, the specific code path, and
          the fix direction
        - A note on which issues are only detectable under `-race` vs reproducible deterministically
        - Any components you audited and found clean
    secrets:
      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
```
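The broadcast pattern this prompt insists on, a closed channel guarded by `sync.Once` rather than a buffered send, can be shown in miniature. The `shutdownSignal` type is a generic sketch, not libbeat's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownSignal broadcasts shutdown to any number of waiters.
// close(ch) wakes every receiver; a send like ch <- struct{}{} would
// wake exactly one, leaving the rest blocked forever.
type shutdownSignal struct {
	once sync.Once
	ch   chan struct{}
}

func newShutdownSignal() *shutdownSignal {
	return &shutdownSignal{ch: make(chan struct{})}
}

// Close is safe to call more than once; sync.Once prevents the
// "close of closed channel" panic.
func (s *shutdownSignal) Close() { s.once.Do(func() { close(s.ch) }) }

// Done returns a channel that is closed once shutdown begins.
func (s *shutdownSignal) Done() <-chan struct{} { return s.ch }

func main() {
	sig := newShutdownSignal()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-sig.Done() // every goroutine observes the broadcast
		}()
	}
	sig.Close()
	sig.Close() // idempotent: no panic on double close
	wg.Wait()   // returns only if all three goroutines woke up
	fmt.Println("all 3 goroutines observed shutdown")
}
```

With a buffered-send signal instead, `wg.Wait()` in this sketch would hang, which is exactly the latent bug the audit is asked to look for.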
Lines changed: 96 additions & 0 deletions
```yaml
name: "Sweeper: OTel BeatReceiver Global State Isolation"
on:
  schedule:
    - cron: "0 9 * * 4"
  workflow_dispatch:

permissions:
  actions: read
  contents: read
  issues: write
  pull-requests: read

jobs:
  run:
    uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
    with:
      title-prefix: "[otel-beatreceiver-isolation]"
      severity-threshold: "high"
      additional-instructions: |
        You are a **global state isolation sweeper** for the OTel BeatReceiver integration.
        Your goal is to find every package-level variable that is written during receiver
        construction and determine whether concurrent construction of multiple receivers will
        produce a data race, incorrect behavior, or a panic.

        ## The component

        Beats can run as OpenTelemetry receivers inside an OTel Collector process. The receiver
        implementations live under `x-pack/filebeat/fbreceiver/` and
        `x-pack/metricbeat/mbreceiver/`. When an OTel Collector is configured with multiple
        beat receivers, or when receivers are restarted, their factory and construction code
        runs concurrently in the same process.

        The shared initialization code lives under `x-pack/libbeat/cmd/instance/` — this is
        where a Beat is constructed for use as a receiver. It is the highest-risk area because
        it was written before OTel receiver support existed, and some of it still assumes it
        runs once per process.

        ## The bug class

        The recurring pattern is: **code that was safe when only one beat ran per process
        becomes a data race or produces incorrect behavior when two receivers initialize
        concurrently**. This manifests as:

        - A package-level variable written by both receivers — the second write silently
          overwrites the first, and one receiver ends up with the other's configuration
        - A function called during construction that panics on the second invocation
          (e.g. registering a plugin that is already registered)
        - A global singleton initialized by the first receiver and then read by the second,
          which gets the first receiver's value instead of its own

        ## How to investigate

        Read the beat construction path for receivers — from the factory's `Create*()` function
        down through all initialization calls. For each package-level function call or variable
        assignment you encounter, ask three questions:

        1. **Is it idempotent?** If two receivers call it with the same arguments, is the
           result the same as calling it once?
        2. **Is it thread-safe?** If two receivers call it concurrently with different arguments,
           does it produce a data race?
        3. **Is it per-instance or per-process?** State that should be per-receiver but is
           stored globally will cause receivers to interfere with each other.

        Pay particular attention to: path initialization (each receiver should resolve paths
        relative to its own config, not a global), plugin and processor registration (should
        happen once at startup, not once per receiver), version and identity fields (each
        receiver should report its own), and manager/factory singletons.

        ## For each risk you confirm

        Write a test that constructs two receivers concurrently (look at existing tests in the
        receiver packages for construction patterns), then asserts they each have independent
        state. Run with `-race`: `go test -race ./x-pack/filebeat/fbreceiver/... ./x-pack/metricbeat/mbreceiver/...`

        For double-registration panics, write a test that constructs a receiver twice sequentially
        and asserts the second construction does not panic.

        ## The bar for filing

        Only report findings relevant to a real OTel Collector deployment that uses multiple
        beat receivers, or that restarts receivers (e.g. on config reload). A global variable
        that is written once during process startup and never again is not a problem — that is
        normal Go initialization. The bug must be triggerable by constructing two receivers
        concurrently or sequentially in the same process, which is exactly what the OTel
        Collector does. If concurrent construction is safe but the behavior is merely surprising
        or inconsistent, that is worth noting but not necessarily worth filing as a bug.

        ## Output

        File a single issue containing:
        - Confirmed races or incorrect behaviors with test code, the specific global variable,
          and the fix direction (sync.Once guard, per-instance state, idempotency check)
        - Confirmed panics on double-construction with reproduction and fix direction
        - A summary of what you found to already be safe and why
    secrets:
      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
```
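One of the fix directions this prompt names, a `sync.Once` guard around process-global plugin registration, can be sketched as follows. The registry, the plugin name, and `newReceiver` are invented for illustration and do not mirror the actual Beats plugin API:

```go
package main

import (
	"fmt"
	"sync"
)

// plugins simulates a package-level plugin table of the kind that
// panics when the same name is registered twice. Illustrative only.
var (
	plugins     = map[string]func(){}
	pluginsMu   sync.Mutex
	registerOne sync.Once
)

// mustRegister is the process-global, panic-on-duplicate style common
// in registration code written for one beat per process.
func mustRegister(name string, f func()) {
	pluginsMu.Lock()
	defer pluginsMu.Unlock()
	if _, dup := plugins[name]; dup {
		panic("plugin already registered: " + name)
	}
	plugins[name] = f
}

// newReceiver guards the global registration with sync.Once, so that
// constructing a second receiver in the same process is safe.
func newReceiver() string {
	registerOne.Do(func() {
		mustRegister("decode_json", func() {})
	})
	return "receiver"
}

func main() {
	a := newReceiver()
	b := newReceiver() // second construction: no duplicate-registration panic
	fmt.Println(a, b, len(plugins))
}
```

Without the `sync.Once` guard, the second `newReceiver` call would hit the duplicate check and panic, which is the double-construction failure the suggested test is meant to catch.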
