
Commit 11d1fe7

strawgate and claude authored
Add PR Buildkite Detective trigger workflow (#49589)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 4666186 commit 11d1fe7

6 files changed (+522 / -0 lines)
Lines changed: 101 additions & 0 deletions
```yaml
name: "Sweeper: Filestream Registry and State Machine"
on:
  schedule:
    - cron: "0 9 * * 2"
  workflow_dispatch:

permissions:
  actions: read
  contents: read
  issues: write
  pull-requests: read

jobs:
  run:
    uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
    with:
      title-prefix: "[filestream-registry]"
      severity-threshold: "high"
      additional-instructions: |
        You are a **re-ingestion and data-loss sweeper** for the Filestream input. Your goal is
        to find every code path that can cause a file to be re-read from the beginning or events
        to be silently dropped — and write a test or reproduction scenario for each one.

        ## The component

        The Filestream input lives under `filebeat/input/filestream/`. Its registry (which tracks
        how far each file has been read) is implemented under
        `filebeat/input/filestream/internal/input-logfile/`. The registry is the heart of
        Filestream's correctness guarantee: if it loses a file's offset, that file will be
        re-ingested from the beginning, producing duplicate data in the output.

        ## The bug class

        There are two failure modes that have produced bugs repeatedly in this component:

        **Re-ingestion**: A file's offset is lost or reset to zero. This happens when a registry
        entry is prematurely cleaned, when a file is matched to the wrong registry entry (identity
        mismatch), when a cursor update is silently discarded, or when a migration path
        (take_over, identifier type change) fails to preserve the offset.

        **Silent data loss**: Events are published to the pipeline but the cursor is never
        advanced. This happens when an ACK callback is discarded due to a version mismatch,
        when a resource is cleaned while events are still in-flight, or when an error path
        exits without calling the event's done/ACK callback.

        ## How to investigate

        Read the registry store, the harvester lifecycle, and the ACK/publish path end-to-end.
        Understand how a file's state flows from "new file discovered" through "harvesting"
        through "ACK received" to "registry updated on disk". Then ask: at each transition,
        what can go wrong?

        Key areas to focus on:

        **Identity and matching**: How does the input decide which registry entry belongs to a
        given file? There are two identity modes (inode-based and fingerprint-based). What
        happens when files are rotated, renamed, or when the identity mode is changed in config?
        Can a file end up matched to the wrong entry?

        **Cursor update lifecycle**: When an event is ACKed, how does the cursor get written
        to the persistent registry? Are there conditions (version mismatch, resource marked
        deleted, concurrent cleanup) where a valid ACK silently discards the cursor update?
        If so, the input will re-read from the last successfully persisted offset on next start.

        **Cleanup timing**: The `clean_inactive` and `clean_removed` settings delete registry
        entries after some time. What is the earliest moment an entry can be cleaned? Is it
        possible for an entry to be cleaned while a harvester for that file is still running
        or while events from that file are in-flight in the pipeline?

        **Migration paths**: When filestream takes over from another input type, or when
        `harvester_limit` causes files to be queued, are there windows where the state for a
        file can be lost or reset?

        ## For each risk you confirm

        Write a unit test using the existing test infrastructure in the package (look at existing
        `*_test.go` files for helper patterns). The test should set up the registry in the
        relevant state, trigger the problematic transition, and assert that the cursor is
        preserved. Run `go test ./filebeat/input/filestream/...` to confirm the failure.

        ## The bar for filing

        Only report findings that a real user could encounter with a realistic Filestream
        configuration. The bug must be triggerable through normal user actions: starting and
        stopping Filebeat, rotating log files, changing config, hitting harvester limits, or
        running on a filesystem that performs renames. Do not file findings that require
        manually corrupting the registry store or calling internal functions in an order that
        the real code path never produces. If you cannot describe a concrete sequence of
        user-observable events that leads to the bug, it is not worth filing.

        ## Output

        File a single issue containing:
        - Confirmed risks with test code or reproduction steps, the exact code path, and the
          fix direction
        - A description of any scenarios you investigated and found safe, so reviewers know
          the coverage
        - A priority ranking: which risks are most likely to affect users in production vs
          only in edge-case configurations
    secrets:
      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
```
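The "cursor update lifecycle" hazard this prompt describes is an ACK whose cursor update is silently discarded after a version mismatch, so the offset is never persisted even though the events were published. A toy versioned store can sketch that failure mode; the `entry` and `store` types and the `ackUpdate` function here are illustrative, not the real `input-logfile` API:

```go
package main

import "fmt"

// entry models a registry entry whose version is bumped whenever the
// entry is recreated (e.g. after cleanup). Illustrative only.
type entry struct {
	version int
	offset  int64
}

type store struct{ entries map[string]*entry }

// ackUpdate applies a cursor update only if the version still matches.
// Returning false is the "silently discarded" path the sweeper hunts
// for: on the next start, the file is re-read from the stale offset.
func (s *store) ackUpdate(key string, version int, offset int64) bool {
	e, ok := s.entries[key]
	if !ok || e.version != version {
		return false // discarded: offset never persisted
	}
	if offset > e.offset {
		e.offset = offset
	}
	return true
}

func main() {
	s := &store{entries: map[string]*entry{"log": {version: 1}}}

	ok := s.ackUpdate("log", 1, 4096) // normal ACK: offset advances
	fmt.Println(ok, s.entries["log"].offset)

	// Cleanup recreates the entry with a new version while events from
	// the old harvester generation are still in flight…
	s.entries["log"] = &entry{version: 2}

	ok = s.ackUpdate("log", 1, 8192) // stale version: update is dropped
	fmt.Println(ok, s.entries["log"].offset)
}
```

A real test in the package would drive the same transition through the actual store and assert the persisted cursor, rather than this standalone model.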
Lines changed: 100 additions & 0 deletions
```yaml
name: "Sweeper: Libbeat Pipeline Shutdown and Queue Lifecycle"
on:
  schedule:
    - cron: "0 9 * * 3"
  workflow_dispatch:

permissions:
  actions: read
  contents: read
  issues: write
  pull-requests: read

jobs:
  run:
    uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
    with:
      title-prefix: "[libbeat-pipeline-lifecycle]"
      severity-threshold: "high"
      additional-instructions: |
        You are a **pipeline lifecycle sweeper** for libbeat. Your goal is to find every path
        where events can be silently dropped, goroutines can leak, or panics can be triggered
        during startup, shutdown, output reconnection, or backpressure — and write a failing
        test for each confirmed issue.

        ## The component

        The libbeat publisher pipeline connects inputs to outputs. It lives under
        `libbeat/publisher/pipeline/` (the pipeline orchestrator and consumer),
        `libbeat/publisher/queue/` (memqueue and diskqueue implementations), and
        `libbeat/outputs/` (elasticsearch, logstash, kafka, and others). The pipeline has a
        carefully ordered shutdown sequence that, if violated, produces panics (send on closed
        channel) or hangs (goroutines blocked waiting for signals that never come).

        ## The bug class

        Three categories of bugs recur here:

        **Shutdown ordering**: The pipeline shuts down in stages — queue, then consumer, then
        output workers. If any stage closes a channel or signals done before the upstream stage
        has finished sending to it, the result is either a "send on closed channel" panic or
        a goroutine that blocks forever. Look for channels that are closed by one goroutine
        while another goroutine may still be sending to them.

        **Signal broadcasting**: Go channels deliver a message to exactly ONE receiver. When
        multiple goroutines need to observe a shutdown signal, the correct pattern is
        `close(ch)` (with `sync.Once` to prevent double-close), not `ch <- struct{}{}`.
        Any buffered-send channel used as a signal to multiple goroutines is a latent bug —
        only one goroutine will wake up, the others hang forever.

        **ACK callback blocking**: When events are ACKed, the queue calls user-provided callbacks
        synchronously. If a callback does slow work (filesystem I/O, network), it blocks the
        ACK loop, which blocks the queue's shutdown drain, which blocks the pipeline's
        `WaitClose()`. This manifests as a hang on graceful shutdown. Look for ACK callbacks
        that do more than increment a counter.

        ## How to investigate

        Read the pipeline shutdown sequence end-to-end. Understand what each component closes,
        in what order, and what it is waiting for before considering itself done. Then look at
        each channel in the system and ask: who sends to this channel, who receives from it,
        and what happens when the pipeline is being torn down while a send or receive is in
        progress?

        Also read the queue implementations (memqueue and diskqueue) for:
        - Response channels returned to object pools — if a channel is pooled while another
          goroutine still holds a reference to it, the next user of the channel will receive
          a stale message
        - Error paths in the disk queue's write path — if a write fails midway, is the partial
          state cleaned up, or will the next startup encounter corrupt data?

        For outputs, read the backoff client wrapper. If `Close()` is called while a Publish
        is sleeping in a backoff wait, does the sleep abort immediately (correct) or block
        until the full backoff duration elapses (incorrect, causes slow shutdown)?

        ## For each risk you confirm

        Write a Go test. Use goroutines, channels, and short timeouts to create the concurrent
        scenario. For shutdown hangs, the test should call `Close()` and assert it returns within
        a short timeout. For panics, use `require.NotPanics`. Run with `-race` where applicable:
        `go test -race ./libbeat/publisher/...`

        ## The bar for filing

        Only report findings that a real deployment could hit. Shutdown races need to be
        triggerable under normal operating conditions (e.g. sending SIGTERM while the output
        is processing events), not only under artificially timed test scenarios that can't
        occur in practice. For goroutine leaks and hangs, confirm that the problematic code
        path is reachable from the normal pipeline lifecycle — not just from unit test helpers
        that bypass the real startup sequence. If the finding requires a precondition that no
        production deployment would have, skip it.

        ## Output

        File a single issue containing:
        - Confirmed issues with test code or reproduction steps, the specific code path, and
          the fix direction
        - A note on which issues are only detectable under `-race` vs reproducible deterministically
        - Any components you audited and found clean
    secrets:
      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
```
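The broadcast pattern this prompt insists on, a closed channel guarded by `sync.Once` rather than a buffered send, can be shown in miniature. The `shutdownSignal` type is a generic sketch, not libbeat's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownSignal broadcasts shutdown to any number of waiters.
// close(ch) wakes every receiver; a send like ch <- struct{}{} would
// wake exactly one, leaving the rest blocked forever.
type shutdownSignal struct {
	once sync.Once
	ch   chan struct{}
}

func newShutdownSignal() *shutdownSignal {
	return &shutdownSignal{ch: make(chan struct{})}
}

// Close is safe to call more than once; sync.Once prevents the
// "close of closed channel" panic.
func (s *shutdownSignal) Close() { s.once.Do(func() { close(s.ch) }) }

// Done returns a channel that is closed once shutdown begins.
func (s *shutdownSignal) Done() <-chan struct{} { return s.ch }

func main() {
	sig := newShutdownSignal()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-sig.Done() // every goroutine observes the broadcast
		}()
	}
	sig.Close()
	sig.Close() // idempotent: no panic on double close
	wg.Wait()   // returns only if all three goroutines woke up
	fmt.Println("all 3 goroutines observed shutdown")
}
```

With a buffered-send signal instead, `wg.Wait()` in this sketch would hang, which is exactly the latent bug the audit is asked to look for.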
Lines changed: 96 additions & 0 deletions
```yaml
name: "Sweeper: OTel BeatReceiver Global State Isolation"
on:
  schedule:
    - cron: "0 9 * * 4"
  workflow_dispatch:

permissions:
  actions: read
  contents: read
  issues: write
  pull-requests: read

jobs:
  run:
    uses: elastic/ai-github-actions/.github/workflows/gh-aw-code-quality-audit.lock.yml@v0
    with:
      title-prefix: "[otel-beatreceiver-isolation]"
      severity-threshold: "high"
      additional-instructions: |
        You are a **global state isolation sweeper** for the OTel BeatReceiver integration.
        Your goal is to find every package-level variable that is written during receiver
        construction and determine whether concurrent construction of multiple receivers will
        produce a data race, incorrect behavior, or a panic.

        ## The component

        Beats can run as OpenTelemetry receivers inside an OTel Collector process. The receiver
        implementations live under `x-pack/filebeat/fbreceiver/` and
        `x-pack/metricbeat/mbreceiver/`. When an OTel Collector is configured with multiple
        beat receivers, or when receivers are restarted, their factory and construction code
        runs concurrently in the same process.

        The shared initialization code lives under `x-pack/libbeat/cmd/instance/` — this is
        where a Beat is constructed for use as a receiver. It is the highest-risk area because
        it was written before OTel receiver support existed, and some of it still assumes it
        runs once per process.

        ## The bug class

        The recurring pattern is: **code that was safe when only one beat ran per process
        becomes a data race or produces incorrect behavior when two receivers initialize
        concurrently**. This manifests as:

        - A package-level variable written by both receivers — the second write silently
          overwrites the first, and one receiver ends up with the other's configuration
        - A function called during construction that panics on the second invocation
          (e.g. registering a plugin that is already registered)
        - A global singleton initialized by the first receiver and then read by the second,
          which gets the first receiver's value instead of its own

        ## How to investigate

        Read the beat construction path for receivers — from the factory's `Create*()` function
        down through all initialization calls. For each package-level function call or variable
        assignment you encounter, ask three questions:

        1. **Is it idempotent?** If two receivers call it with the same arguments, is the
           result the same as calling it once?
        2. **Is it thread-safe?** If two receivers call it concurrently with different arguments,
           does it produce a data race?
        3. **Is it per-instance or per-process?** State that should be per-receiver but is
           stored globally will cause receivers to interfere with each other.

        Pay particular attention to: path initialization (each receiver should resolve paths
        relative to its own config, not a global), plugin and processor registration (should
        happen once at startup, not once per receiver), version and identity fields (each
        receiver should report its own), and manager/factory singletons.

        ## For each risk you confirm

        Write a test that constructs two receivers concurrently (look at existing tests in the
        receiver packages for construction patterns), then asserts they each have independent
        state. Run with `-race`: `go test -race ./x-pack/filebeat/fbreceiver/... ./x-pack/metricbeat/mbreceiver/...`

        For double-registration panics, write a test that constructs a receiver twice sequentially
        and asserts the second construction does not panic.

        ## The bar for filing

        Only report findings relevant to a real OTel Collector deployment that uses multiple
        beat receivers, or that restarts receivers (e.g. on config reload). A global variable
        that is written once during process startup and never again is not a problem — that is
        normal Go initialization. The bug must be triggerable by constructing two receivers
        concurrently or sequentially in the same process, which is exactly what the OTel
        Collector does. If concurrent construction is safe but the behavior is merely surprising
        or inconsistent, that is worth noting but not necessarily worth filing as a bug.

        ## Output

        File a single issue containing:
        - Confirmed races or incorrect behaviors with test code, the specific global variable,
          and the fix direction (sync.Once guard, per-instance state, idempotency check)
        - Confirmed panics on double-construction with reproduction and fix direction
        - A summary of what you found to already be safe and why
    secrets:
      COPILOT_GITHUB_TOKEN: ${{ secrets.COPILOT_GITHUB_TOKEN }}
```
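One of the fix directions this prompt names, a `sync.Once` guard around process-global plugin registration, can be sketched as follows. The registry, the plugin name, and `newReceiver` are invented for illustration and do not mirror the actual Beats plugin API:

```go
package main

import (
	"fmt"
	"sync"
)

// plugins simulates a package-level plugin table of the kind that
// panics when the same name is registered twice. Illustrative only.
var (
	plugins     = map[string]func(){}
	pluginsMu   sync.Mutex
	registerOne sync.Once
)

// mustRegister is the process-global, panic-on-duplicate style common
// in registration code written for one beat per process.
func mustRegister(name string, f func()) {
	pluginsMu.Lock()
	defer pluginsMu.Unlock()
	if _, dup := plugins[name]; dup {
		panic("plugin already registered: " + name)
	}
	plugins[name] = f
}

// newReceiver guards the global registration with sync.Once, so that
// constructing a second receiver in the same process is safe.
func newReceiver() string {
	registerOne.Do(func() {
		mustRegister("decode_json", func() {})
	})
	return "receiver"
}

func main() {
	a := newReceiver()
	b := newReceiver() // second construction: no duplicate-registration panic
	fmt.Println(a, b, len(plugins))
}
```

Without the `sync.Once` guard, the second `newReceiver` call would hit the duplicate check and panic, which is the double-construction failure the suggested test is meant to catch.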
