BPF task work #9403


Draft: mykyta5 wants to merge 4 commits into bpf-next_base from bpf_task_work

Conversation

mykyta5 (Contributor) commented on Aug 2, 2025

No description provided.

mykyta5 force-pushed the bpf_task_work branch 8 times, most recently from 09fdfb7 to 9b0503b on August 6, 2025 at 14:12
kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 8 times, most recently from 601ea2d to 7390c2c on August 12, 2025 at 22:59
This patch adds the necessary plumbing in the verifier, syscall and maps to
support the new kfunc bpf_task_work_schedule and the kernel structure
bpf_task_work. The idea is similar to how bpf_wq and bpf_timer are already
handled.
The verifier changes validate calls to bpf_task_work_schedule to make sure
they are safe and the expected invariants hold.
The BTF part is required to detect the bpf_task_work structure inside a map
value and store its offset, which will be used in the next patch to
calculate key and value addresses.
The arraymap and hashtab changes are needed to handle freeing of
bpf_task_work: run the code needed to deinitialize it, for example cancel
the task_work callback if possible.
The use of bpf_task_work and the proper implementation of the kfuncs are
introduced in the next patch.
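
For illustration, this mirrors the existing bpf_timer/bpf_wq usage pattern on
the BPF side: the new struct bpf_task_work is embedded in a map value and
located via BTF. A minimal sketch (map and struct names are made up for the
example; the kfunc itself is introduced in the later patches and is not shown):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Illustrative map value embedding the new kernel structure, like bpf_timer. */
struct elem {
	struct bpf_task_work tw;
	__u64 payload;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 128);
	__type(key, int);
	__type(value, struct elem);
} tw_map SEC(".maps");

char LICENSE[] SEC("license") = "GPL";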

Signed-off-by: Mykyta Yatsenko <[email protected]>
The calculation of a BPF map key, given a pointer to a value, is already
duplicated in a couple of places in the helpers, and the next patch
introduces another use case.
This patch extracts that functionality into a separate function.
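
The duplicated pattern being factored out is roughly the following (the helper
name here is illustrative, not necessarily the one the patch introduces):
array keys are element indices recovered from the value's offset, while
hash-style maps store the key immediately before the value, rounded up to 8
bytes.

/* Illustrative sketch of the extracted helper; the real name and signature may differ. */
static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
{
	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
		struct bpf_array *array = container_of(map, struct bpf_array, map);

		/* Array "keys" are element indices: derive the index from the value offset. */
		*arr_idx = ((char *)value - array->value) / array->elem_size;
		return arr_idx;
	}
	/* Hash-style maps store the key right before the value. */
	return (void *)((char *)value - round_up(map->key_size, 8));
}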

Signed-off-by: Mykyta Yatsenko <[email protected]>
mykyta5 force-pushed the bpf_task_work branch 2 times, most recently from dbe62d1 to 7089387 on August 13, 2025 at 12:39
Implementation of the bpf_task_work_schedule kfuncs.

Main components (a rough sketch of these types follows this list):
 * struct bpf_task_work_context – Metadata and state management per task
work.
 * enum bpf_task_work_state – A state machine to serialize work
 scheduling and execution.
 * bpf_task_work_schedule() – The central helper that initiates
scheduling.
 * bpf_task_work_callback() – Invoked when the actual task_work runs.
 * bpf_task_work_irq() – An intermediate step (runs in softirq context)
to enqueue task work.
 * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
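
The sketch referenced above, using the states described in this message; the
field layout is an assumption and will not match the patch line for line:

/* Sketch only: state names come from this description, fields are assumed. */
enum bpf_task_work_state {
	BPF_TW_STANDBY = 0,	/* idle, ready to be scheduled */
	BPF_TW_PENDING,		/* schedule requested, irq_work queued */
	BPF_TW_SCHEDULING,	/* irq_work running, about to task_work_add() */
	BPF_TW_SCHEDULED,	/* task_work successfully added */
	BPF_TW_RUNNING,		/* BPF callback executing */
	BPF_TW_FREED,		/* owning map value deleted */
};

struct bpf_task_work_context {
	enum bpf_task_work_state state;	/* driven by cmpxchg, see below */
	struct bpf_map *map;		/* map owning the value */
	struct bpf_prog *prog;		/* program providing the callback */
	struct task_struct *task;	/* target task, reference held */
	bpf_task_work_callback_t callback_fn;
	struct callback_head work;	/* passed to task_work_add() */
	struct irq_work irq_work;	/* defers scheduling out of restricted contexts */
};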

Flow of task work scheduling (the enqueue step is sketched after this list):
 1) bpf_task_work_schedule_* is called from BPF code.
 2) The state transitions from STANDBY to PENDING.
 3) irq_work_queue() schedules bpf_task_work_irq().
 4) The state transitions from PENDING to SCHEDULING.
 5) bpf_task_work_irq() attempts task_work_add(). If successful, the state
    transitions to SCHEDULED.
 6) The task work calls bpf_task_work_callback(), which transitions the
    state to RUNNING.
 7) The BPF callback is executed.
 8) The context is cleaned up, refcounts are released, and the state is set
    back to STANDBY.
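
The enqueue step (1-3) as a minimal sketch, assuming the context layout
sketched earlier; reference counting and error handling are deliberately
elided:

/* Sketch of the enqueue path; refcounting and most error handling elided. */
static int bpf_task_work_schedule(struct task_struct *task,
				  struct bpf_task_work_context *ctx,
				  struct bpf_map *map,
				  bpf_task_work_callback_t callback_fn)
{
	/* Only one caller may move STANDBY -> PENDING; everyone else backs off. */
	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY)
		return -EBUSY;

	ctx->task = task;
	ctx->map = map;
	ctx->callback_fn = callback_fn;

	/* task_work_add() cannot be called from every context, so defer via irq_work. */
	irq_work_queue(&ctx->irq_work);
	return 0;
}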

Map value deletion
If a map value that contains a bpf_task_work_context is deleted, the BPF map
implementation calls bpf_task_work_cancel_and_free().
Deletion is handled by atomically setting the state to FREED and either
releasing the references directly or letting the scheduling path do so,
depending on the last state before the deletion (a sketch follows this
list):
 * SCHEDULING: release references in bpf_task_work_cancel_and_free() and
   expect bpf_task_work_irq() to cancel the task work.
 * SCHEDULED: release references and try to cancel the task work in
   bpf_task_work_cancel_and_free().
 * other states: one of bpf_task_work_irq(), bpf_task_work_schedule() or
   bpf_task_work_callback() should clean up upon detecting the state
   switching to FREED.
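
The sketch referenced above: bpf_reset_task_work_context() is taken from the
diff excerpts further down, everything else (including the simplified
parameter) is an assumption:

/* Sketch of the deletion path; in the real code this receives the map value's
 * bpf_task_work and looks up ctx from it, and how ctx memory is ultimately
 * freed (RCU, bpf_mem_alloc) is discussed in the review comments below.
 */
static void bpf_task_work_cancel_and_free(struct bpf_task_work_context *ctx)
{
	enum bpf_task_work_state prev = xchg(&ctx->state, BPF_TW_FREED);

	switch (prev) {
	case BPF_TW_SCHEDULED:
		/* Work is already queued on the task: try to pull it back... */
		task_work_cancel(ctx->task, &ctx->work);
		fallthrough;
	case BPF_TW_SCHEDULING:
		/* ...and drop the references taken at schedule time. */
		bpf_reset_task_work_context(ctx);
		break;
	default:
		/* Other paths observe FREED and clean up on their own. */
		break;
	}
}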

The state transitions are controlled with atomic_cmpxchg, ensuring:
 * Only one thread can successfully enqueue work.
 * Proper handling of concurrent deletes (BPF_TW_FREED).
 * Safe rollback if task_work_add() fails (see the sketch below).
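
A possible shape of bpf_task_work_irq() implementing the FREED checks and the
rollback; this illustrates the described invariants rather than reproducing
the patch:

/* Illustrative irq_work handler; refcount and freeing details simplified. */
static void bpf_task_work_irq(struct irq_work *irq_work)
{
	struct bpf_task_work_context *ctx =
		container_of(irq_work, struct bpf_task_work_context, irq_work);

	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) == BPF_TW_FREED) {
		/* Map value was deleted before we got here: nothing to schedule. */
		bpf_reset_task_work_context(ctx);
		return;
	}

	if (task_work_add(ctx->task, &ctx->work, TWA_RESUME)) {
		/* Target task is exiting: roll back so the slot can be reused, */
		/* unless the map value was freed in the meantime. */
		if (cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY) == BPF_TW_FREED)
			bpf_reset_task_work_context(ctx);
		return;
	}

	if (cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED) == BPF_TW_FREED)
		/* Deleted while we were adding the work: cancel what was just queued. */
		task_work_cancel(ctx->task, &ctx->work);
}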

Signed-off-by: Mykyta Yatsenko <[email protected]>
Introduce selftests that check the BPF task work scheduling mechanism and
validate that the verifier does not accept incorrect calls to the
bpf_task_work_schedule kfunc.

Signed-off-by: Mykyta Yatsenko <[email protected]>
@@ -7418,6 +7418,10 @@ struct bpf_timer {
__u64 __opaque[2];
} __attribute__((aligned(8)));

struct bpf_task_work {
__u64 ctx;

Contributor:

why not follow the __opaque pattern?
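
For reference, the __opaque pattern the reviewer refers to is the one struct
bpf_timer uses above; applied here it would read roughly:

/* What the suggestion would look like, mirroring struct bpf_timer above. */
struct bpf_task_work {
	__u64 __opaque;
} __attribute__((aligned(8)));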

@@ -3703,8 +3703,53 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
return bpf_strnstr(s1__ign, s2__ign, XATTR_SIZE_MAX);
}

typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *);

Contributor:

keep argument names?
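
Named parameters, as the reviewer asks for, would make the typedef
self-documenting; the names below are assumptions, chosen by analogy with the
bpf_timer/bpf_wq callbacks, not taken from the patch:

typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);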

{
-	return func_id == BPF_FUNC_timer_set_callback;
+	return func_id == BPF_FUNC_timer_set_callback || is_task_work_add_kfunc(func_id);

Contributor:

I believe I pointed this out before, maybe I'm forgetting the answer, but this looks wrong to me. This check is meant for old-style helpers, not kfuncs.

@@ -12751,7 +12844,8 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id)

static bool is_async_callback_calling_kfunc(u32 btf_id)
{
-	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
+	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl] ||

Contributor:

nit: this should probably be a call to is_bpf_wq_set_callback_impl_kfunc()?
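
The nit would turn the open-coded comparison into helper calls, roughly as
below; the second operand is an assumption, since the added line above is
truncated:

/* Sketch of the suggested form. */
static bool is_async_callback_calling_kfunc(u32 btf_id)
{
	return is_bpf_wq_set_callback_impl_kfunc(btf_id) ||
	       is_task_work_add_kfunc(btf_id);
}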

* Otherwise it is safe to access map key value under the rcu_read_lock
*/
rcu_read_lock_trace();
state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);

Contributor:

I think the comment is warranted here explaining why we are trying SCHEDULING -> RUNNING transition at all, and why it has to be done before SCHEDULED -> RUNNING, it's kind of tricky and non-obvious

err = -EPERM;
goto release_prog;
}
ctx = bpf_task_work_aquire_ctx(tw, map);

Contributor:

typo: acquire

return ERR_PTR(-ENOMEM);
memset(ctx, 0, sizeof(*ctx));
if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)ctx) != 0) {
kfree(ctx);

Contributor:

I have this nagging feeling that we can't just do kfree(ctx), this ctx has to be RCU protected and (maybe) refcounted, otherwise we run the risk of races between kfree and another CPU still accessing ctx fields


ctx = (void *)atomic_long_read(ctx_ptr);
if (!ctx) {
ctx = bpf_map_kmalloc_node(map,

Contributor:

discussed offline, this has to be bpf_mem_alloc()
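
A sketch of the direction discussed, assuming the global BPF allocator is
used; the wrapper names are illustrative, and pairing the free side with
bpf_mem_free_rcu() would also address the RCU concern raised earlier:

#include <linux/bpf_mem_alloc.h>

/* Allocation side: use the global BPF allocator instead of bpf_map_kmalloc_node(). */
static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
{
	struct bpf_task_work_context *ctx;

	ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(*ctx));
	if (ctx)
		memset(ctx, 0, sizeof(*ctx));
	return ctx;
}

/* Free side: defer past RCU (and RCU tasks trace) grace periods so concurrent
 * readers still holding the pointer stay safe, per the earlier comment.
 */
static void bpf_task_work_context_free(struct bpf_task_work_context *ctx)
{
	bpf_mem_free_rcu(&bpf_global_ma, ctx);
}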

bpf_reset_task_work_context(ctx);
fallthrough;
case BPF_TW_STANDBY:
kfree(ctx);

Contributor:

as I mentioned somewhere else, we can't just kfree(ctx) here, that ctx can still be referenced from one of irq/task work callbacks

Contributor (author):

In this particular case it should not be referenced in task work or irq work, because the state we found it in was STANDBY, although it's still possible that bpf_task_work_schedule has a pointer to the ctx

Contributor:

sure, so you agree with me that we can't just kfree it? it has to be RCU protected, right?

memset(ctx, 0, sizeof(*ctx));
if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)ctx) != 0) {
kfree(ctx);
return ERR_PTR(-EBUSY);

Contributor:

shouldn't we retry here, someone might have allocated memory, scheduled and executed callback, and returned to STANDBY, so we can reuse that state here (very unlikely, but with NMI involvement this can theoretically happen)
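
The suggested retry could look roughly like this inside the acquire helper,
reusing the illustrative alloc/free wrappers from the earlier sketch; the loop
structure is an assumption and the real fix may differ:

/* Sketch of the retry the reviewer suggests: if someone else installed a
 * context in the meantime, reuse it instead of failing with -EBUSY.
 */
for (;;) {
	ctx = (void *)atomic_long_read(ctx_ptr);
	if (ctx)
		return ctx;	/* reuse the already-installed context */

	new_ctx = bpf_task_work_context_alloc();
	if (!new_ctx)
		return ERR_PTR(-ENOMEM);

	if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)new_ctx) == 0)
		return new_ctx;

	/* Lost the race: free ours and retry with whatever got installed. */
	bpf_task_work_context_free(new_ctx);
}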

kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 5 times, most recently from 61c9cef to 715d6cb on August 15, 2025 at 23:56