bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT#11019
Closed
kernel-patches-daemon-bpf[bot] wants to merge 1 commit intobpf_basefrom
Closed
bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT#11019kernel-patches-daemon-bpf[bot] wants to merge 1 commit intobpf_basefrom
kernel-patches-daemon-bpf[bot] wants to merge 1 commit intobpf_basefrom
Conversation
Author
|
Upstream branch: 1a7eb7a |
073dbc3 to
964e178
Compare
Author
|
Upstream branch: 8552286 |
42f1acb to
616781f
Compare
964e178 to
90e5b9d
Compare
Author
|
Upstream branch: 8419dbe |
On PREEMPT_RT kernels, the per-CPU xdp_bulk_queue (bq) can be accessed
concurrently by multiple preemptible tasks on the same CPU.
The original code assumes bq_enqueue() and __cpu_map_flush() run
atomically with respect to each other on the same CPU, relying on
local_bh_disable() to prevent preemption. However, on PREEMPT_RT,
local_bh_disable() only calls migrate_disable() and does not disable
preemption. spin_lock() also becomes a sleeping rt_mutex. Together,
this allows CFS scheduling to preempt a task during bq_flush_to_queue(),
enabling another task on the same CPU to enter bq_enqueue() and operate
on the same per-CPU bq concurrently.
This leads to several races:
1. Double __list_del_clearprev(): after spin_unlock() in
bq_flush_to_queue(), a preempting task can call bq_enqueue() ->
bq_flush_to_queue() on the same bq when bq->count reaches
CPU_MAP_BULK_SIZE. Both tasks then call __list_del_clearprev()
on the same bq->flush_node, the second call dereferences the
prev pointer that was already set to NULL by the first.
2. bq->count and bq->q[] races: concurrent bq_enqueue() can corrupt
the packet queue while bq_flush_to_queue() is processing it.
The race between task A (__cpu_map_flush -> bq_flush_to_queue) and
task B (bq_enqueue -> bq_flush_to_queue) on the same CPU:
Task A (xdp_do_flush) Task B (cpu_map_enqueue)
---------------------- ------------------------
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush bq->q[] to ptr_ring */
bq->count = 0
spin_unlock(&q->producer_lock)
bq_enqueue(rcpu, xdpf)
<-- CFS preempts Task A --> bq->q[bq->count++] = xdpf
/* ... more enqueues until full ... */
bq_flush_to_queue(bq)
spin_lock(&q->producer_lock)
/* flush to ptr_ring */
spin_unlock(&q->producer_lock)
__list_del_clearprev(flush_node)
/* sets flush_node.prev = NULL */
<-- Task A resumes -->
__list_del_clearprev(flush_node)
flush_node.prev->next = ...
/* prev is NULL -> kernel oops */
Fix this by adding a local_lock_t to xdp_bulk_queue and acquiring it
in bq_enqueue() and __cpu_map_flush(). On non-RT kernels, local_lock
maps to preempt_disable/enable with zero additional overhead. On
PREEMPT_RT, it provides a per-CPU sleeping lock that serializes
access to the bq.
An alternative approach of snapshotting bq->count and bq->q[] before
releasing the producer_lock was considered, but it requires copying
the entire bq->q[] array on every flush, adding unnecessary overhead.
To reproduce, insert an mdelay(100) between spin_unlock() and
__list_del_clearprev() in bq_flush_to_queue(), then run reproducer
provided by syzkaller.
Panic:
===
BUG: kernel NULL pointer dereference, address: 0000000000000000
PF: supervisor write access in kernel mode
PF: error_code(0x0002) - not-present page
PGD 0 P4D 0
Oops: Oops: 0002 [#1] SMP PTI
CPU: 0 UID: 0 PID: 377 Comm: a.out Not tainted 6.19.0+ #21 PREEMPT_RT
RIP: 0010:bq_flush_to_queue+0x145/0x200
RSP: 0018:ffffc90000eb78e8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffc90000eb7a10 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000eb7928 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffe8ffffc255c0 R14: 0000000000000000 R15: ffff88800c01b800
FS: 00007fae0762d6c0(0000) GS:ffff8880f9c75000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000d698005 CR4: 0000000000770ef0
PKRU: 55555554
Call Trace:
<TASK>
__cpu_map_flush+0x2c/0x70
xdp_do_flush+0x64/0x1b0
xdp_test_run_batch.constprop.0+0x4d4/0x6d0
bpf_test_run_xdp_live+0x24b/0x3e0
? get_page_from_freelist+0x5ea/0x1cf0
? __pfx_xdp_test_run_init_page+0x10/0x10
? __check_object_size+0x25e/0x320
? bpf_dispatcher_change_prog+0x20d/0x4e0
bpf_prog_test_run_xdp+0x4a1/0x6e0
? rt_spin_unlock+0x61/0xb0
__sys_bpf+0x44a/0x2760
? set_ptes.isra.0+0x3b/0x90
? rt_spin_unlock+0x61/0xb0
? do_anonymous_page+0x106/0x430
__x64_sys_bpf+0x1a/0x30
x64_sys_call+0x146c/0x26e0
do_syscall_64+0xd5/0x5a0
? do_user_addr_fault+0x1d0/0x870
? irqentry_exit+0x3e/0x670
? clear_bhb_loop+0x30/0x80
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7fae0777928d
RSP: 002b:00007fae0762ce68 EFLAGS: 00000206 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007fae0762dcdc RCX: 00007fae0777928d
RDX: 0000000000000050 RSI: 0000200000000000 RDI: 000000000000000a
RBP: 00007fae0762ce90 R08: 00007fae0762d6c0 R09: 203a6362696c6720
R10: 0000000000000000 R11: 0000000000000206 R12: 00007fae0762d6c0
R13: ffffffffffffff88 R14: 000000000000006e R15: 00007ffdd6e54110
</TASK>
Modules linked in:
CR2: 0000000000000000
---[ end trace 0000000000000000 ]---
RIP: 0010:bq_flush_to_queue+0x145/0x200
RSP: 0018:ffffc90000eb78e8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffc90000eb7a10 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc90000eb7928 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffe8ffffc255c0 R14: 0000000000000000 R15: ffff88800c01b800
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000d698005 CR4: 0000000000770ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Fixes: d2d6422 ("x86: Allow to enable PREEMPT_RT.")
Reported-by: syzbot+2b3391f44313b3983e91@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69369331.a70a0220.38f243.009d.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
616781f to
cc8f527
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pull request for series with
subject: bpf: cpumap: fix race in bq_flush_to_queue on PREEMPT_RT
version: 1
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1053000