[bpf-ci-bot] Flaky test: stream_success/stream_arena_callback_fault on test_progs_cpuv4 #449
Description
Summary
The stream_success/stream_arena_callback_fault BPF selftest is flaky,
failing intermittently on the test_progs_cpuv4 variant in BPF CI
(kernel-patches/bpf). The failure appears on both baseline branches
(to-test) and independent patch series, confirming it is not
patch-specific.
Failure Pattern
do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
get_stream:FAIL:stream read unexpected stream read: actual 0 <= expected 0
run_subtest:FAIL:1307 Unexpected retval from get_stream(): 0, errno = 95
#428/8 stream_success/stream_arena_callback_fault:FAIL
The BPF program executes successfully (retval 0), but the stream read
returns 0 bytes when it expects diagnostic output about an arena fault.
Affected CI Runs
| Run ID | Branch | Date | Arch/Compiler |
|---|---|---|---|
| 22269280442 | to-test (baseline) | Feb 22 | x86_64 / gcc-15, test_progs_cpuv4 |
| 22326859477 | series/1056806=>bpf-next | Feb 23 | x86_64 / gcc-15, test_progs_cpuv4 |
The test passes on most runs, including the regular test_progs and
test_progs_no_alu32 variants, which indicates a timing-sensitive flake
rather than a deterministic failure.
Root Cause Analysis
Test structure
The test (tools/testing/selftests/bpf/progs/stream.c:228) creates a
BPF timer callback that intentionally writes to an unmapped arena address:
static __noinline int timer_cb(void *map, int *key, struct bpf_timer *timer)
{
	int __arena *addr = (int __arena *)0xdeadbeef;

	arena_ptr = &arena;
	*addr = 1; /* Triggers arena fault */
	return 0;
}

int stream_arena_callback_fault(void *ctx)
{
	...
	bpf_timer_init(arr_timer, &array, 1);
	bpf_timer_set_callback(arr_timer, timer_cb);
	bpf_timer_start(arr_timer, 0, 0); /* 0ns delay, softirq mode */
	return 0;
}

The fault is handled by ex_handler_bpf() → bpf_prog_report_arena_violation()
which writes diagnostic output to the program's BPF stream. The test
framework then reads the stream via bpf_prog_stream_read() and validates
it matches the expected __stderr() patterns.
The race condition
bpf_timer_start() with nsecs=0 and flags=0 uses HRTIMER_MODE_REL_SOFT
(kernel/bpf/helpers.c:1520), which means the timer callback executes in
hrtimer_run_softirq() context. The execution sequence is:
1. BPF syscall program calls bpf_timer_start(arr_timer, 0, 0) → hrtimer is programmed to expire at current time
2. BPF program returns → bpf_prog_test_run_opts() returns to userspace
3. (sometime later) Hardware timer interrupt fires → HRTIMER_SOFTIRQ is raised
4. Softirq handler calls bpf_timer_cb() → timer callback triggers arena fault → fault handler writes to stream
5. Test framework calls bpf_prog_stream_read() → reads stream data
If step 5 happens before step 4, the stream is empty and the test
fails. This is the observed failure mode.
The race is inherent to the asynchronous nature of softirq timers. While
the softirq often runs during the return-to-userspace path of the
preceding syscall (step 2), this depends on the hardware timer interrupt
having already fired by that point. With a 0-delay softirq timer, the
hrtimer is programmed to expire "now", but the hardware timer interrupt
may not fire until the next tick (up to 1ms with HZ=1000).
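The ordering problem between steps 4 and 5 can be illustrated with a small deterministic model. This is only a sketch: `stream_model`, `model_timer_callback()`, `model_stream_read()`, and `model_run()` are hypothetical names invented for illustration, the message text is a placeholder, and none of it reflects the kernel's actual stream implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the program's BPF stream. */
struct stream_model {
	char data[64];
	size_t len;
};

/* Models step 4: the fault handler writes a diagnostic to the stream. */
static void model_timer_callback(struct stream_model *s)
{
	static const char msg[] = "arena fault\n"; /* placeholder text */

	memcpy(s->data, msg, sizeof(msg) - 1);
	s->len = sizeof(msg) - 1;
}

/* Models step 5: the test framework reads whatever is in the stream. */
static size_t model_stream_read(const struct stream_model *s,
				char *buf, size_t cap)
{
	size_t n = s->len < cap ? s->len : cap;

	memcpy(buf, s->data, n);
	return n;
}

/*
 * Returns the number of bytes the reader observes, given the (assumed)
 * times at which the softirq callback and the userspace read happen.
 */
static size_t model_run(unsigned long callback_time_us,
			unsigned long read_time_us)
{
	struct stream_model s = { .len = 0 };
	char buf[64];
	size_t n;

	if (callback_time_us <= read_time_us) {
		/* Step 4 before step 5: the diagnostic is visible. */
		model_timer_callback(&s);
		return model_stream_read(&s, buf, sizeof(buf));
	}

	/* Step 5 before step 4: the read sees an empty stream. */
	n = model_stream_read(&s, buf, sizeof(buf));
	model_timer_callback(&s);
	return n;
}
```

In this model, `model_run(10, 50)` returns a non-zero byte count, while `model_run(900, 50)` returns 0, which corresponds to the "actual 0 <= expected 0" failure in the log.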
Why cpuv4 specifically?
The cpuv4 variant compiles BPF programs with -mcpu=v4, which enables
newer BPF instructions. The JIT code may execute marginally faster, or
the different instruction encoding may affect scheduling. However, the
fundamental issue is timing-dependent and could in principle manifest
on any variant; cpuv4 is simply where it has been observed so far.
Contrast with other stream_arena tests
Other tests in the same file don't have this issue because they trigger
arena faults synchronously:
- stream_arena_write_fault: directly accesses unmapped arena memory
- stream_arena_read_fault: directly reads unmapped arena memory
- stream_arena_subprog_fault: calls a subprogram that accesses arena
These faults occur during the BPF syscall program execution itself, so
the stream output is available immediately when the program returns.
Proposed Fix
Add retry logic to get_stream() in test_loader.c. When the stream
read returns 0 bytes, retry up to 10 times with 1ms delays (10ms total
budget). This gives the softirq enough time to process the timer callback
and populate the stream.
For tests with synchronous stream output, the first read succeeds
immediately with no added latency. The retry is only exercised when the
stream is empty, which should only happen with asynchronous producers like
timer callbacks.
The fix also guards text[ret] with ret > 0 to prevent undefined
behavior if bpf_prog_stream_read returns a negative error code.
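The shape of the retry and the `ret > 0` guard can be sketched in standalone C as follows. This is not the attached patch: `read_stream_with_retry()`, the `stream_read_fn` callback type, and the `delayed_source` / `delayed_read()` stand-in are hypothetical names used so the example runs without a BPF program; in test_loader.c the read would go through bpf_prog_stream_read() instead.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define STREAM_READ_RETRIES	10		/* "up to 10 times" */
#define STREAM_READ_DELAY_NS	1000000L	/* 1ms between attempts */

/* Hypothetical reader: bytes read, 0 if the stream is empty, <0 on error. */
typedef int (*stream_read_fn)(void *ctx, char *buf, size_t len);

static int read_stream_with_retry(stream_read_fn read_fn, void *ctx,
				  char *text, size_t cap)
{
	struct timespec delay = { .tv_sec = 0, .tv_nsec = STREAM_READ_DELAY_NS };
	int attempt, ret = 0;

	assert(text && cap > 1);

	/* One initial read plus up to STREAM_READ_RETRIES retries. */
	for (attempt = 0; attempt <= STREAM_READ_RETRIES; attempt++) {
		ret = read_fn(ctx, text, cap - 1);
		if (ret != 0)
			break;	/* data arrived, or a real error occurred */
		if (attempt < STREAM_READ_RETRIES)
			nanosleep(&delay, NULL);
	}

	/* Only index text[] with a positive length; never with an error code. */
	if (ret > 0)
		text[ret] = '\0';
	else
		text[0] = '\0';
	return ret;
}

/*
 * Hypothetical stand-in for an asynchronous producer: the first
 * `empty_reads` calls find the stream empty, later calls return data.
 */
struct delayed_source {
	int empty_reads;
	int calls;
};

static int delayed_read(void *ctx, char *buf, size_t len)
{
	static const char msg[] = "fault";
	struct delayed_source *src = ctx;
	size_t n = sizeof(msg) - 1;

	if (src->calls++ < src->empty_reads)
		return 0;
	if (n > len)
		n = len;
	memcpy(buf, msg, n);
	return (int)n;
}
```

With a `delayed_source` whose `empty_reads` is 3, the helper returns on the fourth read after roughly 3ms of sleeping; a synchronous producer would return on the first read with no delay, and a negative return is passed through without touching `text[ret]`.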
See the attached patch:
0001-selftests-bpf-fix-flaky-stream_arena_callback_fault-test.patch
Impact
This flaky test affects BPF CI signal quality by occasionally causing
unrelated patch series to appear as failing. Since the failure occurs on
baseline branches (to-test), it affects all CI runs that include
test_progs_cpuv4.