
Commit 76b8c14

peilin-ye authored and Kernel Patches Daemon committed
bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init()
Currently, calling bpf_map_kmalloc_node() from __bpf_async_init() can
cause various locking issues; see the following stack trace (edited for
style) as one example:

 ...
 [10.011566] do_raw_spin_lock.cold
 [10.011570] try_to_wake_up                 (5) double-acquiring the same
 [10.011575] kick_pool                          rq_lock, causing a hardlockup
 [10.011579] __queue_work
 [10.011582] queue_work_on
 [10.011585] kernfs_notify
 [10.011589] cgroup_file_notify
 [10.011593] try_charge_memcg               (4) memcg accounting raises an
 [10.011597] obj_cgroup_charge_pages            MEMCG_MAX event
 [10.011599] obj_cgroup_charge_account
 [10.011600] __memcg_slab_post_alloc_hook
 [10.011603] __kmalloc_node_noprof
 ...
 [10.011611] bpf_map_kmalloc_node
 [10.011612] __bpf_async_init
 [10.011615] bpf_timer_init                 (3) BPF calls bpf_timer_init()
 [10.011617] bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
 [10.011619] bpf__sched_ext_ops_runnable
 [10.011620] enqueue_task_scx               (2) BPF runs with rq_lock held
 [10.011622] enqueue_task
 [10.011626] ttwu_do_activate
 [10.011629] sched_ttwu_pending             (1) grabs rq_lock
 ...

The above was reproduced on bpf-next (b338cf8) by modifying
./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
ops.runnable(), and hacking [1] the memcg accounting code a bit to make
a bpf_timer_init() call much more likely to raise an MEMCG_MAX event.

We have also run into other similar variants (both internally and on
bpf-next), including double-acquiring cgroup_file_kn_lock, the same
worker_pool::lock, etc.

As suggested by Shakeel, fix this by using __GFP_HIGH instead of
GFP_ATOMIC in __bpf_async_init(), so that if try_charge_memcg() raises
an MEMCG_MAX event, we call __memcg_memory_event() with
@allow_spinning=false and skip calling cgroup_file_notify(), in order
to avoid the locking issues described above.

Depends on mm patch "memcg: skip cgroup_file_notify if spinning is not
allowed".

Tested with vmtest.sh (llvm-18, x86-64):

  $ ./test_progs -a '*timer*' -a '*wq*'
  ...
  Summary: 7/12 PASSED, 0 SKIPPED, 0 FAILED

[1] Making bpf_timer_init() much more likely to raise an MEMCG_MAX
    event (gist-only, for brevity):

    kernel/bpf/helpers.c:__bpf_async_init():

      -	cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
      +	cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_HACK,
      +				  map->numa_node);

    mm/memcontrol.c:try_charge_memcg():

       	if (!do_memsw_account() ||
      -	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
      -		if (page_counter_try_charge(&memcg->memory, batch, &counter))
      +	    page_counter_try_charge_hack(&memcg->memsw, batch, &counter,
      +					 gfp_mask & __GFP_HACK)) {
      +		if (page_counter_try_charge_hack(&memcg->memory, batch,
      +						 &counter,
      +						 gfp_mask & __GFP_HACK))
       			goto done_restock;

    mm/page_counter.c:page_counter_try_charge():

      -bool page_counter_try_charge(struct page_counter *counter,
      -			     unsigned long nr_pages,
      -			     struct page_counter **fail)
      +bool page_counter_try_charge_hack(struct page_counter *counter,
      +				  unsigned long nr_pages,
      +				  struct page_counter **fail, bool hack)
       {
       	...
      -	if (new > c->max) {
      +	if (hack || new > c->max) {	// goto failed;
       		atomic_long_sub(nr_pages, &c->usage);

Fixes: b00628b ("bpf: Introduce bpf timers.")
Suggested-by: Shakeel Butt <[email protected]>
Signed-off-by: Peilin Ye <[email protected]>
1 parent 83390c0 commit 76b8c14

File tree: 1 file changed (+8, -2 lines)

kernel/bpf/helpers.c

Lines changed: 8 additions & 2 deletions
@@ -1274,8 +1274,14 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		goto out;
 	}
 
-	/* allocate hrtimer via map_kmalloc to use memcg accounting */
-	cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
+	/* Allocate via bpf_map_kmalloc_node() for memcg accounting. Use
+	 * __GFP_HIGH instead of GFP_ATOMIC to avoid calling
+	 * cgroup_file_notify() if an MEMCG_MAX event is raised by
+	 * try_charge_memcg(). This prevents various locking issues, including
+	 * double-acquiring locks that may already be held here (e.g.,
+	 * cgroup_file_kn_lock, rq_lock).
+	 */
+	cb = bpf_map_kmalloc_node(map, size, __GFP_HIGH, map->numa_node);
 	if (!cb) {
 		ret = -ENOMEM;
 		goto out;
