[bpf-ci-bot] Flaky test: stream_success/stream_arena_callback_fault on test_progs_cpuv4


## Summary

The `stream_success/stream_arena_callback_fault` BPF selftest is flaky,
failing intermittently on the `test_progs_cpuv4` variant in BPF CI
(`kernel-patches/bpf`). The failure appears on both the baseline branch
(`to-test`) and an independent patch series, so it is not
patch-specific.

## Failure Pattern

```
do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
get_stream:FAIL:stream read unexpected stream read: actual 0 <= expected 0
run_subtest:FAIL:1307 Unexpected retval from get_stream(): 0, errno = 95
#428/8   stream_success/stream_arena_callback_fault:FAIL
```

The BPF program executes successfully (retval 0), but the stream read
returns 0 bytes when it expects diagnostic output about an arena fault.

## Affected CI Runs

| Run ID | Branch | Date | Arch/Compiler |
|--------|--------|------|---------------|
| 22269280442 | `to-test` (baseline) | Feb 22 | x86_64 / gcc-15, test_progs_cpuv4 |
| 22326859477 | `series/1056806=>bpf-next` | Feb 23 | x86_64 / gcc-15, test_progs_cpuv4 |

The test passes on most runs, including the regular `test_progs` and
`test_progs_no_alu32` variants, which points to a timing-sensitive flake
rather than a deterministic failure.

## Root Cause Analysis

### Test structure

The test (`tools/testing/selftests/bpf/progs/stream.c:228`) creates a
BPF timer callback that intentionally writes to an unmapped arena address:

```c
static __noinline int timer_cb(void *map, int *key, struct bpf_timer *timer)
{
    int __arena *addr = (int __arena *)0xdeadbeef;
    arena_ptr = &arena;
    *addr = 1;  /* Triggers arena fault */
    return 0;
}

int stream_arena_callback_fault(void *ctx)
{
    ...
    bpf_timer_init(arr_timer, &array, 1);
    bpf_timer_set_callback(arr_timer, timer_cb);
    bpf_timer_start(arr_timer, 0, 0);  /* 0ns delay, softirq mode */
    return 0;
}
```

The fault is handled by `ex_handler_bpf()` → `bpf_prog_report_arena_violation()`
which writes diagnostic output to the program's BPF stream. The test
framework then reads the stream via `bpf_prog_stream_read()` and validates
it matches the expected `__stderr()` patterns.

### The race condition

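`bpf_timer_start()` called with `flags=0` arms the hrtimer in
`HRTIMER_MODE_REL_SOFT` (`kernel/bpf/helpers.c:1520`), so the timer
callback executes in `hrtimer_run_softirq()` context. With `nsecs=0` the
timer is due immediately, but the callback still runs asynchronously. The
execution sequence is: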

1. BPF syscall program calls `bpf_timer_start(arr_timer, 0, 0)` →
   hrtimer is programmed to expire at current time
2. BPF program returns → `bpf_prog_test_run_opts()` returns to userspace
3. *(sometime later)* Hardware timer interrupt fires → `HRTIMER_SOFTIRQ`
   is raised
4. Softirq handler calls `bpf_timer_cb()` → timer callback triggers
   arena fault → fault handler writes to stream
5. Test framework calls `bpf_prog_stream_read()` → reads stream data

**If step 5 happens before step 4**, the stream is empty and the test
fails. This is the observed failure mode.

The race is inherent to the asynchronous nature of softirq timers. While
the softirq often runs during the return-to-userspace path of the
preceding syscall (step 2), this depends on the hardware timer interrupt
having already fired by that point. With a 0-delay softirq timer, the
hrtimer is programmed to expire "now", but the hardware timer interrupt
may not fire until the next tick (up to 1ms with HZ=1000).
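
The ordering problem in steps 1-5 can be illustrated with a small single-threaded toy model. This is only a sketch: every `toy_*` name is hypothetical and exists neither in the kernel nor in the selftests, and the message text is a placeholder rather than the real diagnostic output. It shows only that a read issued before the deferred callback runs sees an empty stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names exist in the kernel or selftests. */
struct toy_prog {
    int timer_pending;   /* set by the program, cleared by the "softirq" */
    char stream[64];     /* stands in for the program's BPF stream */
    size_t stream_len;   /* bytes currently queued in the stream */
};

/* Steps 1-2: the program arms a 0-delay timer and returns right away. */
static void toy_prog_run(struct toy_prog *p)
{
    p->timer_pending = 1;
}

/* Steps 3-4: deferred work runs the callback, which faults and logs. */
static void toy_softirq(struct toy_prog *p)
{
    static const char msg[] = "arena fault";  /* placeholder text */

    if (!p->timer_pending)
        return;
    p->timer_pending = 0;
    memcpy(p->stream, msg, sizeof(msg) - 1);
    p->stream_len = sizeof(msg) - 1;
}

/* Step 5: drain up to cap bytes from the stream; returns bytes copied. */
static size_t toy_stream_read(struct toy_prog *p, char *buf, size_t cap)
{
    size_t n;

    assert(cap > 0);
    n = p->stream_len < cap ? p->stream_len : cap;
    memcpy(buf, p->stream, n);
    memmove(p->stream, p->stream + n, p->stream_len - n);
    p->stream_len -= n;
    return n;
}

/* Returns what the first read sees for a given ordering of steps 4 and 5. */
static size_t toy_first_read(int read_before_softirq)
{
    struct toy_prog p = {0};
    char buf[64];

    toy_prog_run(&p);
    if (!read_before_softirq)
        toy_softirq(&p);
    return toy_stream_read(&p, buf, sizeof(buf));
}
```

In this model `toy_first_read(1)` returns 0, which corresponds to the observed failure, while `toy_first_read(0)` returns the length of the placeholder message.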

### Why cpuv4 specifically?

The cpuv4 variant compiles BPF programs with `-mcpu=v4`, which enables
newer BPF instructions. The resulting JIT code may execute marginally
faster, or the different instruction encoding may shift timing slightly;
neither effect has been measured. The underlying race is timing-dependent
and could in principle show up on any variant. Both observed failures
happened to be on cpuv4.

### Contrast with other stream_arena tests

Other tests in the same file don't have this issue because they trigger
arena faults synchronously:

- `stream_arena_write_fault`: directly writes to unmapped arena memory
- `stream_arena_read_fault`: directly reads unmapped arena memory
- `stream_arena_subprog_fault`: calls a subprogram that accesses arena

These faults occur during the BPF syscall program execution itself, so
the stream output is available immediately when the program returns.

## Proposed Fix

Add retry logic to `get_stream()` in `test_loader.c`. When the stream
read returns 0 bytes, retry up to 10 times with 1ms delays (10ms total
budget). This gives the softirq enough time to process the timer callback
and populate the stream.

For tests with synchronous stream output, the first read succeeds
immediately with no added latency. The retry is only exercised when the
stream is empty, which should only happen with asynchronous producers like
timer callbacks.

The fix also guards `text[ret]` with `ret > 0` to prevent undefined
behavior if `bpf_prog_stream_read` returns a negative error code.
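
A minimal sketch of the retry shape described above, not the attached patch itself. The `stream_read_fn` callback type, `read_stream_with_retry()`, and the `fake_stream` test double are hypothetical names introduced for illustration; in the selftest the reader role is played by `bpf_prog_stream_read()`. NUL-terminating at index 0 when the read returns 0 or an error is one possible way to apply the `ret > 0` guard and is an assumption here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical reader: returns bytes read, 0 if empty, < 0 on error. */
typedef int (*stream_read_fn)(void *ctx, char *buf, size_t len);

#define STREAM_READ_RETRIES   10        /* retries after the first read */
#define STREAM_RETRY_DELAY_NS 1000000L  /* 1ms between attempts */

static int read_stream_with_retry(stream_read_fn read_fn, void *ctx,
                                  char *text, size_t cap)
{
    struct timespec delay = { .tv_sec = 0, .tv_nsec = STREAM_RETRY_DELAY_NS };
    int ret = 0;

    assert(cap > 1);
    for (int attempt = 0; attempt <= STREAM_READ_RETRIES; attempt++) {
        ret = read_fn(ctx, text, cap - 1);
        if (ret != 0)
            break;                      /* data arrived, or a real error */
        if (attempt < STREAM_READ_RETRIES)
            nanosleep(&delay, NULL);    /* give the softirq time to run */
    }
    /* Only index with ret when it is a valid, positive byte count. */
    text[ret > 0 ? ret : 0] = '\0';
    return ret;
}

/* Test double: reports an empty stream for the first empty_reads calls. */
struct fake_stream {
    int empty_reads;
    int calls;
};

static int fake_read(void *ctx, char *buf, size_t len)
{
    struct fake_stream *fs = ctx;

    if (fs->calls++ < fs->empty_reads)
        return 0;
    if (len < 2)
        return -1;                      /* simulated error: buffer too small */
    memcpy(buf, "ok", 2);
    return 2;
}
```

With a synchronous producer the first call returns data and the loop exits without sleeping, so only asynchronous producers pay for the delay. A producer that never writes costs at most ten sleeps before the function returns 0.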

See the attached patch:
`0001-selftests-bpf-fix-flaky-stream_arena_callback_fault-test.patch`

## Impact

This flaky test degrades BPF CI signal quality by occasionally making
unrelated patch series appear to fail. Because the failure also occurs on
the baseline branch (`to-test`), it can hit any CI run that includes
`test_progs_cpuv4`.

