[bpf-ci-bot] Flaky test: stream_success/stream_arena_callback_fault on test_progs_cpuv4 #449

@kernel-patches-review-bot

Summary

The stream_success/stream_arena_callback_fault BPF selftest is flaky,
failing intermittently on the test_progs_cpuv4 variant in BPF CI
(kernel-patches/bpf). The failure appears on both baseline branches
(to-test) and independent patch series, confirming it is not
patch-specific.

Failure Pattern

do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
get_stream:FAIL:stream read unexpected stream read: actual 0 <= expected 0
run_subtest:FAIL:1307 Unexpected retval from get_stream(): 0, errno = 95
#428/8   stream_success/stream_arena_callback_fault:FAIL

The BPF program executes successfully (retval 0), but the stream read
returns 0 bytes where the test expects diagnostic output about an arena
fault.

Affected CI Runs

Run ID       Branch                     Date    Arch / Compiler, Variant
22269280442  to-test (baseline)         Feb 22  x86_64 / gcc-15, test_progs_cpuv4
22326859477  series/1056806=>bpf-next   Feb 23  x86_64 / gcc-15, test_progs_cpuv4

The test passes on most runs, including the regular test_progs and
test_progs_no_alu32 variants, which points to a timing-sensitive flake
rather than a deterministic failure.

Root Cause Analysis

Test structure

The test (tools/testing/selftests/bpf/progs/stream.c:228) creates a
BPF timer callback that intentionally writes to an unmapped arena address:

static __noinline int timer_cb(void *map, int *key, struct bpf_timer *timer)
{
    int __arena *addr = (int __arena *)0xdeadbeef;
    arena_ptr = &arena;
    *addr = 1;  /* Triggers arena fault */
    return 0;
}

int stream_arena_callback_fault(void *ctx)
{
    ...
    bpf_timer_init(arr_timer, &array, 1);
    bpf_timer_set_callback(arr_timer, timer_cb);
    bpf_timer_start(arr_timer, 0, 0);  /* 0ns delay, softirq mode */
    return 0;
}

The fault is handled by ex_handler_bpf() → bpf_prog_report_arena_violation(),
which writes diagnostic output to the program's BPF stream. The test
framework then reads the stream via bpf_prog_stream_read() and validates
that it matches the expected __stderr() patterns.

The race condition

bpf_timer_start() with flags=0 selects HRTIMER_MODE_REL_SOFT
(kernel/bpf/helpers.c:1520), which means the timer callback executes in
hrtimer_run_softirq() context; nsecs=0 makes the timer expire
immediately. The execution sequence is:

  1. BPF syscall program calls bpf_timer_start(arr_timer, 0, 0) →
    hrtimer is programmed to expire at current time
  2. BPF program returns → bpf_prog_test_run_opts() returns to userspace
  3. (sometime later) Hardware timer interrupt fires → TIMER_SOFTIRQ
    is raised
  4. Softirq handler calls bpf_timer_cb() → timer callback triggers
    arena fault → fault handler writes to stream
  5. Test framework calls bpf_prog_stream_read() → reads stream data

If step 5 happens before step 4, the stream is empty and the test
fails. This is the observed failure mode.
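
The ordering in steps 1 to 5 can be reproduced with a single-threaded toy
model. This is only a userspace sketch of the sequencing: every toy_* name
is hypothetical and none of it is kernel code. Arming the timer merely
queues the callback, so a read issued before the deferred context runs
sees an empty stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names exist in the kernel or libbpf. */
struct toy_stream {
    char data[64];
    size_t len;
};

typedef void (*toy_cb)(struct toy_stream *s);

struct toy_timer {
    toy_cb cb;
    int pending;
};

/* Step 1: arm the timer. The callback is queued, not executed. */
static void toy_timer_start(struct toy_timer *t, toy_cb cb)
{
    assert(t && cb);
    t->cb = cb;
    t->pending = 1;
}

/* Steps 3-4: the deferred (softirq-like) context runs pending work. */
static void toy_run_softirq(struct toy_timer *t, struct toy_stream *s)
{
    if (t->pending) {
        t->pending = 0;
        t->cb(s);
    }
}

/* Stands in for the fault handler writing its diagnostic to the stream. */
static void toy_fault_cb(struct toy_stream *s)
{
    static const char msg[] = "arena fault";

    memcpy(s->data, msg, sizeof(msg) - 1);
    s->len = sizeof(msg) - 1;
}

/* Step 5: consume up to len bytes; returns the number of bytes read. */
static size_t toy_stream_read(struct toy_stream *s, char *buf, size_t len)
{
    size_t n = s->len < len ? s->len : len;

    memcpy(buf, s->data, n);
    memmove(s->data, s->data + n, s->len - n);
    s->len -= n;
    return n;
}
```

Calling toy_stream_read() between toy_timer_start() and toy_run_softirq()
returns 0, which mirrors the failing order; calling it after
toy_run_softirq() returns the diagnostic.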

The race is inherent to the asynchronous nature of softirq timers. While
the softirq often runs during the return-to-userspace path of the
preceding syscall (step 2), this depends on the hardware timer interrupt
having already fired by that point. With a 0-delay softirq timer, the
hrtimer is programmed to expire "now", but the hardware timer interrupt
may not fire until the next tick (up to 1ms with HZ=1000).

Why cpuv4 specifically?

The cpuv4 variant compiles BPF programs with -mcpu=v4, which enables
newer BPF instructions. The JIT code may execute marginally faster, or
the different instruction encoding may affect scheduling. However, the
underlying race is timing-dependent and could in principle occur on any
variant; cpuv4 is simply where it has been observed so far.

Contrast with other stream_arena tests

Other tests in the same file don't have this issue because they trigger
arena faults synchronously:

  • stream_arena_write_fault — directly accesses unmapped arena memory
  • stream_arena_read_fault — directly reads unmapped arena memory
  • stream_arena_subprog_fault — calls a subprogram that accesses arena

These faults occur during the BPF syscall program execution itself, so
the stream output is available immediately when the program returns.

Proposed Fix

Add retry logic to get_stream() in test_loader.c. When the stream
read returns 0 bytes, retry up to 10 times with 1ms delays (10ms total
budget). This gives the softirq enough time to process the timer callback
and populate the stream.

For tests with synchronous stream output, the first read succeeds
immediately with no added latency. The retry is only exercised when the
stream is empty, which should only happen with asynchronous producers like
timer callbacks.

The fix also guards text[ret] with ret > 0 to prevent undefined
behavior if bpf_prog_stream_read returns a negative error code.
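
A minimal sketch of the retry idea follows. It is not the attached patch:
the reader is abstracted behind a hypothetical stream_read_fn callback
(in the selftest that role would be played by a wrapper around
bpf_prog_stream_read()), and read_stream_with_retry, fake_stream and
fake_read are illustrative names. The retry count and delay follow the
numbers given above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for nanosleep() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical reader: returns bytes read, 0 if the stream is empty,
 * or a negative error code. */
typedef int (*stream_read_fn)(void *ctx, char *buf, size_t len);

#define STREAM_READ_RETRIES  10        /* retries after the first read */
#define STREAM_READ_DELAY_NS 1000000L  /* 1ms between attempts */

/* Read into buf, retrying only while the stream is empty. Returns the
 * last result of read_fn; buf is NUL-terminated only when ret > 0. */
static int read_stream_with_retry(stream_read_fn read_fn, void *ctx,
                                  char *buf, size_t buf_len)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = STREAM_READ_DELAY_NS };
    int ret = 0;

    assert(read_fn && buf && buf_len > 1);

    for (int attempt = 0; attempt <= STREAM_READ_RETRIES; attempt++) {
        ret = read_fn(ctx, buf, buf_len - 1);  /* leave room for NUL */
        if (ret != 0)
            break;                 /* data arrived, or a hard error */
        if (attempt < STREAM_READ_RETRIES)
            nanosleep(&delay, NULL);
    }

    if (ret > 0)
        buf[ret] = '\0';           /* never index with a negative ret */
    return ret;
}

/* Demonstration stub: reports an empty stream for the first
 * empty_reads calls, then returns a fixed message. */
struct fake_stream {
    int empty_reads;
    int calls;
};

static int fake_read(void *ctx, char *buf, size_t len)
{
    static const char msg[] = "arena fault";
    struct fake_stream *s = ctx;
    size_t n = sizeof(msg) - 1;

    s->calls++;
    if (s->empty_reads-- > 0)
        return 0;
    if (n > len)
        n = len;
    memcpy(buf, msg, n);
    return (int)n;
}
```

With a synchronous producer the first call returns data and the loop
exits without sleeping, so only an empty stream pays the delay. A reader
that never produces data costs at most 10 sleeps of 1ms before the
caller sees 0.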

See the attached patch:
0001-selftests-bpf-fix-flaky-stream_arena_callback_fault-test.patch

Impact

This flaky test affects BPF CI signal quality by occasionally causing
unrelated patch series to appear as failing. Since the failure occurs on
baseline branches (to-test), it affects all CI runs that include
test_progs_cpuv4.
