
bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT #11019

Closed
kernel-patches-daemon-bpf[bot] wants to merge 1 commit into bpf_base from series/1053000 => bpf

Conversation

@kernel-patches-daemon-bpf

Pull request for series with
subject: bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1053000

@kernel-patches-daemon-bpf
Author

Upstream branch: 1a7eb7a
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1053000
version: 1

@kernel-patches-daemon-bpf
Author

Upstream branch: 8552286
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1053000
version: 1

@kernel-patches-daemon-bpf
Author

Upstream branch: 8419dbe
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1053000
version: 1

On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.

The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() and does not disable
preemption. spin_lock() also becomes a sleeping rt_mutex. Together,
this allows CFS scheduling to preempt a task during bq_flush_to_queue(),
enabling another task on the same CPU to enter bq_enqueue() and operate
on the same per-CPU bq concurrently.

This leads to several races:

1. Double __list_del_clearprev(): after spin_unlock() in
 bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
 bq_flush_to_queue() on the same bq when bq->count reaches
 CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
 on the same bq->flush_node; the second call dereferences the
 prev pointer that was already set to NULL by the first.

2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
 the packet queue while bq_flush_to_queue() is processing it.

The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:

Task A (xdp_do_flush)          Task B (cpu_map_enqueue)
----------------------         ------------------------
bq_flush_to_queue(bq)
  spin_lock(&q->producer_lock)
  /* flush bq->q[] to ptr_ring */
  bq->count = 0
  spin_unlock(&q->producer_lock)
                                 bq_enqueue(rcpu, xdpf)
  <-- CFS preempts Task A -->      bq->q[bq->count++] = xdpf
                                   /* ... more enqueues until full ... */
                                   bq_flush_to_queue(bq)
                                     spin_lock(&q->producer_lock)
                                     /* flush to ptr_ring */
                                     spin_unlock(&q->producer_lock)
                                     __list_del_clearprev(flush_node)
                                       /* sets flush_node.prev = NULL */
  <-- Task A resumes -->
  __list_del_clearprev(flush_node)
    flush_node.prev->next = ...
    /* prev is NULL -> kernel oops */

Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
maps to preempt_disable/enable with zero additional overhead. On
PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
access to the bq.

An alternative approach of snapshotting bq->count and bq->q[] before
releasing the producer_lock was considered, but it requires copying
the entire bq->q[] array on every flush, adding unnecessary overhead.
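A rough sketch of the change described above (struct layout and call
sites follow the existing kernel/bpf/cpumap.c code; the exact patch may
differ, and the hypothetical field name bq_lock is illustrative only):

```c
struct xdp_bulk_queue {
	void *q[CPU_MAP_BULK_SIZE];
	struct list_head flush_node;
	struct bpf_cpu_map_entry *obj;
	unsigned int count;
	local_lock_t bq_lock;	/* new: serializes enqueue vs. flush on this CPU */
};

static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
{
	struct xdp_bulk_queue *bq;

	/* preempt_disable() on !RT; per-CPU sleeping lock on PREEMPT_RT */
	local_lock(&rcpu->bulkq->bq_lock);
	bq = this_cpu_ptr(rcpu->bulkq);

	if (unlikely(bq->count == CPU_MAP_BULK_SIZE))
		bq_flush_to_queue(bq);

	bq->q[bq->count++] = xdpf;
	/* ... existing flush_node handling ... */
	local_unlock(&rcpu->bulkq->bq_lock);
}
```

__cpu_map_flush() would take the same lock around its call to
bq_flush_to_queue(), so the flush and the __list_del_clearprev() that
follows it can no longer interleave with a concurrent bq_enqueue() on
the same CPU.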

To reproduce, insert an mdelay(100) between spin_unlock() and
__list_del_clearprev() in bq_flush_to_queue(), then run the reproducer
provided by syzkaller.
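The debug-only change that widens the race window looks roughly like
this (not part of the fix; placement per the description above):

```c
	/* bq_flush_to_queue(), kernel/bpf/cpumap.c */
	spin_unlock(&q->producer_lock);
	mdelay(100);	/* debug only: give CFS a chance to preempt here */
	__list_del_clearprev(&bq->flush_node);
```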

Panic:
===
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] SMP PTI
CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
RIP: 0010:bq_flush_to_queue+0x145/0x200
RSP: 0018:ffffc90000eb78e8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffc90000eb7a10 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000eb7928 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffe8ffffc255c0 R14: 0000000000000000 R15: ffff88800c01b800
FS:  00007fae0762d6c0(0000) GS:ffff8880f9c75000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000d698005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
 <TASK>
 __cpu_map_flush+0x2c/0x70
 xdp_do_flush+0x64/0x1b0
 xdp_test_run_batch.constprop.0+0x4d4/0x6d0
 bpf_test_run_xdp_live+0x24b/0x3e0
 ? get_page_from_freelist+0x5ea/0x1cf0
 ? __pfx_xdp_test_run_init_page+0x10/0x10
 ? __check_object_size+0x25e/0x320
 ? bpf_dispatcher_change_prog+0x20d/0x4e0
 bpf_prog_test_run_xdp+0x4a1/0x6e0
 ? rt_spin_unlock+0x61/0xb0
 __sys_bpf+0x44a/0x2760
 ? set_ptes.isra.0+0x3b/0x90
 ? rt_spin_unlock+0x61/0xb0
 ? do_anonymous_page+0x106/0x430
 __x64_sys_bpf+0x1a/0x30
 x64_sys_call+0x146c/0x26e0
 do_syscall_64+0xd5/0x5a0
 ? do_user_addr_fault+0x1d0/0x870
 ? irqentry_exit+0x3e/0x670
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fae0777928d
RSP: 002b:00007fae0762ce68 EFLAGS: 00000206 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007fae0762dcdc RCX: 00007fae0777928d
RDX: 0000000000000050 RSI: 0000200000000000 RDI: 000000000000000a
RBP: 00007fae0762ce90 R08: 00007fae0762d6c0 R09: 203a6362696c6720
R10: 0000000000000000 R11: 0000000000000206 R12: 00007fae0762d6c0
R13: ffffffffffffff88 R14: 000000000000006e R15: 00007ffdd6e54110
 </TASK>
Modules linked in:
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
RIP: 0010:bq_flush_to_queue+0x145/0x200
RSP: 0018:ffffc90000eb78e8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffc90000eb7a10 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000eb7928 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffe8ffffc255c0 R14: 0000000000000000 R15: ffff88800c01b800
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000d698005 CR4: 0000000000770ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Fixes: d2d6422 ("x86: Allow to enable PREEMPT_RT.")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>