BPF task work #9403


Draft: mykyta5 wants to merge 4 commits into bpf-next_base from bpf_task_work

Conversation

mykyta5 (Contributor) commented on Aug 2, 2025

No description provided.

mykyta5 force-pushed the bpf_task_work branch 8 times, most recently from 09fdfb7 to 9b0503b on August 6, 2025 at 14:12
kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 8 times, most recently from 601ea2d to 7390c2c on August 12, 2025 at 22:59
This patch adds the necessary plumbing in the verifier, syscall and maps to
support the new kfunc bpf_task_work_schedule and the kernel structure
bpf_task_work. The idea is similar to how bpf_wq and bpf_timer are already
handled.
The verifier changes validate calls to bpf_task_work_schedule to make sure
they are safe and the expected invariants hold.
The BTF part is required to detect the bpf_task_work structure inside a map
value and store its offset, which will be used in the next patch to
calculate key and value addresses.
The arraymap and hashtab changes are needed to handle freeing of
bpf_task_work: run the code needed to deinitialize it, for example cancel
the task_work callback if possible.
The use of bpf_task_work and the proper implementation of the kfuncs are
introduced in the next patch.
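
For illustration, this mirrors the existing bpf_timer/bpf_wq usage pattern on
the BPF side: the new struct bpf_task_work is embedded in a map value and
located via BTF. A minimal sketch (map and struct names are made up for the
example; the kfunc itself is introduced in the later patches and is not shown):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* Illustrative map value embedding the new kernel structure, like bpf_timer. */
struct elem {
	struct bpf_task_work tw;
	__u64 payload;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 128);
	__type(key, int);
	__type(value, struct elem);
} tw_map SEC(".maps");

char LICENSE[] SEC("license") = "GPL";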

Signed-off-by: Mykyta Yatsenko <[email protected]>
The calculation of a BPF map key, given a pointer to a value, is already
duplicated in a couple of places in the helpers, and the next patch
introduces another use case.
This patch extracts that functionality into a separate function.
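
The duplicated pattern being factored out is roughly the following (the helper
name here is illustrative, not necessarily the one the patch introduces):
array keys are element indices recovered from the value's offset, while
hash-style maps store the key immediately before the value, rounded up to 8
bytes.

/* Illustrative sketch of the extracted helper; the real name and signature may differ. */
static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
{
	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
		struct bpf_array *array = container_of(map, struct bpf_array, map);

		/* Array "keys" are element indices: derive the index from the value offset. */
		*arr_idx = ((char *)value - array->value) / array->elem_size;
		return arr_idx;
	}
	/* Hash-style maps store the key right before the value. */
	return (void *)((char *)value - round_up(map->key_size, 8));
}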

Signed-off-by: Mykyta Yatsenko <[email protected]>
mykyta5 force-pushed the bpf_task_work branch 2 times, most recently from dbe62d1 to 7089387 on August 13, 2025 at 12:39
Implementation of the bpf_task_work_schedule kfuncs.

Main components (a rough sketch of these types follows this list):
 * struct bpf_task_work_context – Metadata and state management per task
work.
 * enum bpf_task_work_state – A state machine to serialize work
 scheduling and execution.
 * bpf_task_work_schedule() – The central helper that initiates
scheduling.
 * bpf_task_work_callback() – Invoked when the actual task_work runs.
 * bpf_task_work_irq() – An intermediate step (runs in softirq context)
to enqueue task work.
 * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
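
The sketch referenced above, using the states described in this message; the
field layout is an assumption and will not match the patch line for line:

/* Sketch only: state names come from this description, fields are assumed. */
enum bpf_task_work_state {
	BPF_TW_STANDBY = 0,	/* idle, ready to be scheduled */
	BPF_TW_PENDING,		/* schedule requested, irq_work queued */
	BPF_TW_SCHEDULING,	/* irq_work running, about to task_work_add() */
	BPF_TW_SCHEDULED,	/* task_work successfully added */
	BPF_TW_RUNNING,		/* BPF callback executing */
	BPF_TW_FREED,		/* owning map value deleted */
};

struct bpf_task_work_context {
	enum bpf_task_work_state state;	/* driven by cmpxchg, see below */
	struct bpf_map *map;		/* map owning the value */
	struct bpf_prog *prog;		/* program providing the callback */
	struct task_struct *task;	/* target task, reference held */
	bpf_task_work_callback_t callback_fn;
	struct callback_head work;	/* passed to task_work_add() */
	struct irq_work irq_work;	/* defers scheduling out of restricted contexts */
};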

Flow of task work scheduling (the enqueue step is sketched after this list):
 1) bpf_task_work_schedule_* is called from BPF code.
 2) The state transitions from STANDBY to PENDING.
 3) irq_work_queue() schedules bpf_task_work_irq().
 4) The state transitions from PENDING to SCHEDULING.
 5) bpf_task_work_irq() attempts task_work_add(). If successful, the state
    transitions to SCHEDULED.
 6) The task work calls bpf_task_work_callback(), which transitions the
    state to RUNNING.
 7) The BPF callback is executed.
 8) The context is cleaned up, refcounts are released, and the state is set
    back to STANDBY.
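
The enqueue step (1-3) as a minimal sketch, assuming the context layout
sketched earlier; reference counting and error handling are deliberately
elided:

/* Sketch of the enqueue path; refcounting and most error handling elided. */
static int bpf_task_work_schedule(struct task_struct *task,
				  struct bpf_task_work_context *ctx,
				  struct bpf_map *map,
				  bpf_task_work_callback_t callback_fn)
{
	/* Only one caller may move STANDBY -> PENDING; everyone else backs off. */
	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY)
		return -EBUSY;

	ctx->task = task;
	ctx->map = map;
	ctx->callback_fn = callback_fn;

	/* task_work_add() cannot be called from every context, so defer via irq_work. */
	irq_work_queue(&ctx->irq_work);
	return 0;
}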

Map value deletion
If a map value that contains a bpf_task_work_context is deleted, the BPF map
implementation calls bpf_task_work_cancel_and_free().
Deletion is handled by atomically setting the state to FREED and either
releasing the references directly or letting the scheduling path do so,
depending on the last state before the deletion (a sketch follows this
list):
 * SCHEDULING: release references in bpf_task_work_cancel_and_free() and
   expect bpf_task_work_irq() to cancel the task work.
 * SCHEDULED: release references and try to cancel the task work in
   bpf_task_work_cancel_and_free().
 * other states: one of bpf_task_work_irq(), bpf_task_work_schedule() or
   bpf_task_work_callback() should clean up upon detecting the state
   switching to FREED.
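
The sketch referenced above: bpf_reset_task_work_context() is taken from the
diff excerpts further down, everything else (including the simplified
parameter) is an assumption:

/* Sketch of the deletion path; in the real code this receives the map value's
 * bpf_task_work and looks up ctx from it, and how ctx memory is ultimately
 * freed (RCU, bpf_mem_alloc) is discussed in the review comments below.
 */
static void bpf_task_work_cancel_and_free(struct bpf_task_work_context *ctx)
{
	enum bpf_task_work_state prev = xchg(&ctx->state, BPF_TW_FREED);

	switch (prev) {
	case BPF_TW_SCHEDULED:
		/* Work is already queued on the task: try to pull it back... */
		task_work_cancel(ctx->task, &ctx->work);
		fallthrough;
	case BPF_TW_SCHEDULING:
		/* ...and drop the references taken at schedule time. */
		bpf_reset_task_work_context(ctx);
		break;
	default:
		/* Other paths observe FREED and clean up on their own. */
		break;
	}
}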

The state transitions are controlled with atomic_cmpxchg, ensuring:
 * Only one thread can successfully enqueue work.
 * Proper handling of concurrent deletes (BPF_TW_FREED).
 * Safe rollback if task_work_add() fails (see the sketch below).
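
A possible shape of bpf_task_work_irq() implementing the FREED checks and the
rollback; this illustrates the described invariants rather than reproducing
the patch:

/* Illustrative irq_work handler; refcount and freeing details simplified. */
static void bpf_task_work_irq(struct irq_work *irq_work)
{
	struct bpf_task_work_context *ctx =
		container_of(irq_work, struct bpf_task_work_context, irq_work);

	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) == BPF_TW_FREED) {
		/* Map value was deleted before we got here: nothing to schedule. */
		bpf_reset_task_work_context(ctx);
		return;
	}

	if (task_work_add(ctx->task, &ctx->work, TWA_RESUME)) {
		/* Target task is exiting: roll back so the slot can be reused, */
		/* unless the map value was freed in the meantime. */
		if (cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY) == BPF_TW_FREED)
			bpf_reset_task_work_context(ctx);
		return;
	}

	if (cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED) == BPF_TW_FREED)
		/* Deleted while we were adding the work: cancel what was just queued. */
		task_work_cancel(ctx->task, &ctx->work);
}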

Signed-off-by: Mykyta Yatsenko <[email protected]>
Introduce selftests that check the BPF task work scheduling mechanism and
validate that the verifier does not accept incorrect calls to the
bpf_task_work_schedule kfunc.

Signed-off-by: Mykyta Yatsenko <[email protected]>
@@ -7418,6 +7418,10 @@ struct bpf_timer {
__u64 __opaque[2];
} __attribute__((aligned(8)));

struct bpf_task_work {
__u64 ctx;

Contributor:

why not follow the __opaque pattern?
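
For reference, the __opaque pattern the reviewer refers to is the one struct
bpf_timer uses above; applied here it would read roughly:

/* What the suggestion would look like, mirroring struct bpf_timer above. */
struct bpf_task_work {
	__u64 __opaque;
} __attribute__((aligned(8)));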

@@ -3703,8 +3703,53 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
return bpf_strnstr(s1__ign, s2__ign, XATTR_SIZE_MAX);
}

typedef void (*bpf_task_work_callback_t)(struct bpf_map *, void *, void *);

Contributor:

keep argument names?
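
Named parameters, as the reviewer asks for, would make the typedef
self-documenting; the names below are assumptions, chosen by analogy with the
bpf_timer/bpf_wq callbacks, not taken from the patch:

typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);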

{
-	return func_id == BPF_FUNC_timer_set_callback;
+	return func_id == BPF_FUNC_timer_set_callback || is_task_work_add_kfunc(func_id);

Contributor:

I believe I pointed this out before, maybe I'm forgetting the answer, but this looks wrong to me. This check is meant for old-style helpers, not kfuncs.

@@ -12751,7 +12844,8 @@ static bool is_sync_callback_calling_kfunc(u32 btf_id)

static bool is_async_callback_calling_kfunc(u32 btf_id)
{
-	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl];
+	return btf_id == special_kfunc_list[KF_bpf_wq_set_callback_impl] ||

Contributor:

nit: this should probably be a call to is_bpf_wq_set_callback_impl_kfunc()?
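
The nit would turn the open-coded comparison into helper calls, roughly as
below; the second operand is an assumption, since the added line above is
truncated:

/* Sketch of the suggested form. */
static bool is_async_callback_calling_kfunc(u32 btf_id)
{
	return is_bpf_wq_set_callback_impl_kfunc(btf_id) ||
	       is_task_work_add_kfunc(btf_id);
}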

* Otherwise it is safe to access map key value under the rcu_read_lock
*/
rcu_read_lock_trace();
state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);

Contributor:

I think the comment is warranted here explaining why we are trying SCHEDULING -> RUNNING transition at all, and why it has to be done before SCHEDULED -> RUNNING, it's kind of tricky and non-obvious

err = -EPERM;
goto release_prog;
}
ctx = bpf_task_work_aquire_ctx(tw, map);

Contributor:

typo: acquire

return ERR_PTR(-ENOMEM);
memset(ctx, 0, sizeof(*ctx));
if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)ctx) != 0) {
kfree(ctx);

Contributor:

I have this nagging feeling that we can't just do kfree(ctx), this ctx has to be RCU protected and (maybe) refcounted, otherwise we run the risk of races between kfree and another CPU still accessing ctx fields


ctx = (void *)atomic_long_read(ctx_ptr);
if (!ctx) {
ctx = bpf_map_kmalloc_node(map,

Contributor:

discussed offline, this has to be bpf_mem_alloc()
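
A sketch of the direction discussed, assuming the global BPF allocator is
used; the wrapper names are illustrative, and pairing the free side with
bpf_mem_free_rcu() would also address the RCU concern raised earlier:

#include <linux/bpf_mem_alloc.h>

/* Allocation side: use the global BPF allocator instead of bpf_map_kmalloc_node(). */
static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
{
	struct bpf_task_work_context *ctx;

	ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(*ctx));
	if (ctx)
		memset(ctx, 0, sizeof(*ctx));
	return ctx;
}

/* Free side: defer past RCU (and RCU tasks trace) grace periods so concurrent
 * readers still holding the pointer stay safe, per the earlier comment.
 */
static void bpf_task_work_context_free(struct bpf_task_work_context *ctx)
{
	bpf_mem_free_rcu(&bpf_global_ma, ctx);
}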

bpf_reset_task_work_context(ctx);
fallthrough;
case BPF_TW_STANDBY:
kfree(ctx);

Contributor:

as I mentioned somewhere else, we can't just kfree(ctx) here, that ctx can still be referenced from one of irq/task work callbacks

Contributor (author):

In this particular case it should not be referenced in task work or irq work, because the state we found it in was STANDBY, although it's still possible that bpf_task_work_schedule has a pointer to the ctx

Contributor:

sure, so you agree with me that we can't just kfree it? it has to be RCU protected, right?

memset(ctx, 0, sizeof(*ctx));
if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)ctx) != 0) {
kfree(ctx);
return ERR_PTR(-EBUSY);

Contributor:

shouldn't we retry here, someone might have allocated memory, scheduled and executed callback, and returned to STANDBY, so we can reuse that state here (very unlikely, but with NMI involvement this can theoretically happen)
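
The suggested retry could look roughly like this inside the acquire helper,
reusing the illustrative alloc/free wrappers from the earlier sketch; the loop
structure is an assumption and the real fix may differ:

/* Sketch of the retry the reviewer suggests: if someone else installed a
 * context in the meantime, reuse it instead of failing with -EBUSY.
 */
for (;;) {
	ctx = (void *)atomic_long_read(ctx_ptr);
	if (ctx)
		return ctx;	/* reuse the already-installed context */

	new_ctx = bpf_task_work_context_alloc();
	if (!new_ctx)
		return ERR_PTR(-ENOMEM);

	if (atomic_long_cmpxchg(ctx_ptr, 0, (unsigned long)new_ctx) == 0)
		return new_ctx;

	/* Lost the race: free ours and retry with whatever got installed. */
	bpf_task_work_context_free(new_ctx);
}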

kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch 5 times, most recently from 61c9cef to 715d6cb on August 15, 2025 at 23:56