Conversation

@kernel-patches-daemon-bpf

Pull request for series with
subject: mm: BPF OOM
version: 3
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339

@kernel-patches-daemon-bpf
Author

Upstream branch: 78980b4
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339
version: 3

@kernel-patches-daemon-bpf
Author

Upstream branch: 8016abd
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339
version: 3

@kernel-patches-review-bot

commit 7f62628df4a78ae37e90a720f490018e3dbde345
Author: Roman Gushchin <[email protected]>

bpf: selftests: introduce read_cgroup_file() helper

This patch adds a read_cgroup_file() helper function for BPF selftests to
read from cgroup control files, such as statistics files.

> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
> index 20cede4db3ce..fc5f22409ce5 100644
> --- a/tools/testing/selftests/bpf/cgroup_helpers.c
> +++ b/tools/testing/selftests/bpf/cgroup_helpers.c

[ ... ]

> +/**
> + * read_cgroup_file() - Read from a cgroup file
> + * @relative_path: The cgroup path, relative to the workdir
> + * @file: The name of the file in cgroupfs to read from
> + * @buf: Buffer to read from the file
          ^^^^^^^^^^^^^^^^^^^^^^^^

This isn't a bug, but should this say "Buffer to store data read from the
file" or "Buffer to read into"? The current wording suggests data is read
from the buffer, but the code reads from the file into the buffer. Compare
with write_cgroup_file(), which says "Buffer to write to the file" - that
phrasing works for writes, but the inverse doesn't quite work for reads.

> + * @size: Size of the buffer
> + *
> + * Read from a file in the given cgroup's directory.
> + *
> + * If successful, the number of read bytes is returned.
> + */
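A minimal sketch of the suggested rewording (one possible phrasing, not the
author's final text):

    /**
     * read_cgroup_file() - Read from a cgroup file
     * @relative_path: The cgroup path, relative to the workdir
     * @file: The name of the file in cgroupfs to read from
     * @buf: Buffer to store the data read from the file
     * @size: Size of the buffer
     *
     * Read from a file in the given cgroup's directory.
     *
     * If successful, the number of read bytes is returned.
     */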

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: selftests: introduce read_cgroup_file() helper
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21382620878

AI-authorship-score: low
AI-authorship-explanation: The code closely follows existing patterns in the file with a minor documentation error typical of copy-paste adaptation, suggesting human authorship.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: Minor documentation error in test code where @buf description says the opposite of what the code does.

@kernel-patches-review-bot

commit 33a0b9d2dea18cc35e8e70e8c1630d2228965e9d
Author: Roman Gushchin <[email protected]>

bpf: allow attaching struct_ops to cgroups

This commit introduces the ability to attach BPF struct_ops to cgroups.
Users pass the BPF_F_CGROUP_FD flag and a target cgroup fd when creating a
struct_ops link. The cgroup maintains a list of attached struct ops links.

> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 69988af44b37..7b1903be6f69 100644

[ ... ]

> @@ -307,12 +308,23 @@ static void cgroup_bpf_release(struct work_struct *work)
>  					       bpf.release_work);
>  	struct bpf_prog_array *old_array;
>  	struct list_head *storages = &cgrp->bpf.storages;
> +	struct bpf_struct_ops_link *st_link, *st_tmp;
>  	struct bpf_cgroup_storage *storage, *stmp;
> +	LIST_HEAD(st_links);
>
>  	unsigned int atype;
>
>  	cgroup_lock();
>
> +	list_splice_init(&cgrp->bpf.struct_ops_links, &st_links);
> +	list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
> +		st_link->cgroup = NULL;
> +		st_link->cgroup_removed = true;
> +		cgroup_put(cgrp);
> +		if (IS_ERR(bpf_link_inc_not_zero(&st_link->link)))
> +			list_del(&st_link->list);
> +	}

Can this code access st_link after it has been freed?

Once st_link->cgroup is set to NULL, if another CPU concurrently closes
the link FD, bpf_struct_ops_map_link_dealloc() will see st_link->cgroup
as NULL, skip the cgroup_bpf_detach_struct_ops() call, and proceed to
kfree(st_link). Then the bpf_link_inc_not_zero() call here would access
freed memory.

The race window is between setting st_link->cgroup = NULL and calling
bpf_link_inc_not_zero(). During this window:

  CPU 0 (cgroup_bpf_release)     CPU 1 (close link FD)
  ---------------------------    ---------------------
  st_link->cgroup = NULL
                                 bpf_link_put_direct()
                                   bpf_link_free()
                                     bpf_struct_ops_map_link_dealloc()
                                       if (st_link->cgroup) // false
                                       kfree(st_link) // freed!
  bpf_link_inc_not_zero() // use-after-free

Would it be safer to call bpf_link_inc_not_zero() before modifying
st_link fields? That way, if the link is already at refcount zero,
the loop would skip it; otherwise, the held reference would prevent
the concurrent free.
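A rough sketch of that ordering (assuming the same cgroup_lock context as
cgroup_bpf_release(); whether cgroup_put() is still owed for skipped links
depends on the dealloc path and is left open here):

    list_splice_init(&cgrp->bpf.struct_ops_links, &st_links);
    list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
            /* Pin the link before touching its fields. */
            if (IS_ERR(bpf_link_inc_not_zero(&st_link->link))) {
                    /* Already being freed; don't dereference it further. */
                    list_del(&st_link->list);
                    continue;
            }
            st_link->cgroup = NULL;
            st_link->cgroup_removed = true;
            cgroup_put(cgrp);
    }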



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: allow attaching struct_ops to cgroups
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21382620878

AI-authorship-score: low
AI-authorship-explanation: The commit is authored by a known kernel maintainer with a consistent, human-like technical writing style and appropriate kernel development patterns.
issues-found: 1
issue-severity-score: medium
issue-severity-explanation: Use-after-free race condition in cgroup_bpf_release() that can cause kernel crash when link FD is closed concurrently with cgroup deletion.

@kernel-patches-review-bot

commit bf7e84edd9392f7a8fb1bc796b47c6f7f93e4fe3
Author: Roman Gushchin <[email protected]>

libbpf: introduce bpf_map__attach_struct_ops_opts()

Introduces bpf_map__attach_struct_ops_opts() with a new bpf_struct_ops_opts
structure to allow attaching struct_ops to cgroups via target_fd and flags.

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 46d2762f5993..9ba67089bf9d 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c

[ ... ]

> @@ -13503,7 +13510,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>  		return &link->link;
>  	}
>
> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> +	link_opts.flags = OPTS_GET(opts, flags, 0);
> +	target_fd = OPTS_GET(opts, target_fd, 0);
> +	fd = bpf_link_create(map->fd, target_fd, BPF_STRUCT_OPS, &link_opts);

The code reads flags and target_fd from opts, but the expected_revision field
declared in struct bpf_struct_ops_opts is never read. Similar opts structures
(bpf_cgroup_opts, bpf_tcx_opts, bpf_netkit_opts) all pass expected_revision
through to bpf_link_create_opts. For example:

    link_create_opts.cgroup.expected_revision = OPTS_GET(opts, expected_revision, 0);

However, bpf_link_create_opts has no struct_ops sub-struct to receive
expected_revision, and the kernel's bpf_attr link_create union has no
struct_ops.expected_revision field either.

Is expected_revision in bpf_struct_ops_opts intentionally unused? If so,
should the field be removed from the struct to avoid user confusion? If users
set expected_revision expecting it to be honored, the value is silently
ignored.
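For illustration, a hypothetical caller (field values made up) would silently
lose the revision check:

    LIBBPF_OPTS(bpf_struct_ops_opts, opts,
            .flags = BPF_F_CGROUP_FD,
            .target_fd = cgroup_fd,
            .expected_revision = 1,    /* never read; silently ignored */
    );
    link = bpf_map__attach_struct_ops_opts(map, &opts);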

[ ... ]

> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index dfc37a615578..2c28cf80e7fe 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -920,6 +920,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +
> +struct bpf_struct_ops_opts {
> +	/* size of this struct, for forward/backward compatibility */
> +	size_t sz;
> +	__u32 flags;
> +	__u32 target_fd;
> +	__u64 expected_revision;
                 ^^^^^^^^^^^^^^^^^

The expected_revision field is declared here, matching the pattern used in
bpf_cgroup_opts, bpf_tcx_opts, and bpf_netkit_opts. However, unlike those
structures where expected_revision is read and passed through, this field
has no corresponding code path in bpf_map__attach_struct_ops_opts().

> +	size_t :0;
> +};
> +#define bpf_struct_ops_opts__last_field expected_revision

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: libbpf: introduce bpf_map__attach_struct_ops_opts()
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21382620878

AI-authorship-score: low
AI-authorship-explanation: The commit follows kernel conventions and existing code patterns, with an oversight (unused expected_revision field) more typical of human development than AI-generated code.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The expected_revision field in bpf_struct_ops_opts is declared but never used, causing silent data loss if users set this field expecting it to be honored.

@kernel-patches-daemon-bpf
Author

Forwarding comment 3802813229 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3802814758 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3802823021 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Upstream branch: 8016abd
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339
version: 3

@kernel-patches-daemon-bpf
Author

Upstream branch: 1456ebb
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339
version: 3

@kernel-patches-daemon-bpf
Author

Upstream branch: 35538db
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339
version: 3

rgushchin and others added 17 commits January 27, 2026 14:54
Move struct bpf_struct_ops_link's definition into bpf.h,
where other custom bpf link definitions are.

It's necessary to access its members from outside of the generic
bpf_struct_ops implementation, which will be done by the following
patches in the series.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Introduce the ability to attach bpf struct_ops'es to cgroups.

From the user's standpoint it works in the following way:
a user passes the BPF_F_CGROUP_FD flag and specifies the target cgroup
fd while creating a struct_ops link. As a result, the bpf struct_ops
link is created and attached to a cgroup.

The cgroup.bpf structure maintains a list of attached struct ops links.
If the cgroup is deleted, attached struct ops'es are auto-detached
and the userspace program gets a notification.

This change doesn't answer the question of how bpf programs belonging
to these struct ops'es will be executed. That will be done individually
for every bpf struct ops which supports this.

Please note that unlike "normal" bpf programs, struct ops'es
are not propagated to cgroup sub-trees.

Signed-off-by: Roman Gushchin <[email protected]>
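A sketch of the described user flow (BPF_F_CGROUP_FD is introduced by this
series; the exact uapi may differ):

    /* Attach a struct_ops map to a cgroup; path and fds are examples. */
    LIBBPF_OPTS(bpf_link_create_opts, opts, .flags = BPF_F_CGROUP_FD);
    int cgroup_fd = open("/sys/fs/cgroup/test", O_RDONLY);
    int link_fd = bpf_link_create(bpf_map__fd(st_ops_map), cgroup_fd,
                                  BPF_STRUCT_OPS, &opts);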
bpf_map__attach_struct_ops() returns -EINVAL instead of -ENOMEM
on memory allocation failure. Fix it.

Fixes: 590a008 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops(), which takes an additional struct
bpf_struct_ops_opts argument.

This allows passing a target_fd and the BPF_F_CGROUP_FD flag,
and thus attaching the struct ops to a cgroup.

Signed-off-by: Roman Gushchin <[email protected]>
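A minimal usage sketch under the API described above (error handling elided;
cgroup_fd is assumed to be an open cgroup directory fd):

    LIBBPF_OPTS(bpf_struct_ops_opts, opts,
            .flags = BPF_F_CGROUP_FD,
            .target_fd = cgroup_fd,
    );
    struct bpf_link *link = bpf_map__attach_struct_ops_opts(map, &opts);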
Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted or NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which is required, for example, for iterating over the memcg's subtree.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
Acked-by: Yafang Shao <[email protected]>
mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Introduce a bpf struct ops for implementing custom OOM handling
policies.

It's possible to load one bpf_oom_ops for the system and one
bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
cgroup tree is traversed from the OOM'ing memcg up to the root and
corresponding BPF OOM handlers are executed until some memory is
freed. If no memory is freed, the kernel OOM killer is invoked.

The struct ops provides the bpf_handle_out_of_memory() callback,
which is expected to return 1 if it was able to free some memory and 0
otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
field of the oom_control structure, which is expected to be set by
kfuncs suitable for releasing memory (which will be introduced later
in the patch series). If both are set, the OOM is considered handled;
otherwise the next OOM handler in the chain is executed, e.g. the BPF
OOM handler attached to the parent cgroup or the kernel OOM killer.

The bpf_handle_out_of_memory() callback program is sleepable to allow
using iterators, e.g. cgroup iterators. The callback receives struct
oom_control as an argument, so it can determine the scope of the OOM
event: whether it is a memcg-wide or a system-wide OOM. It also receives
bpf_struct_ops_link as the second argument, so it can detect the
cgroup level at which this specific instance is attached.

The bpf_handle_out_of_memory() callback is executed just before the
kernel victim task selection algorithm, so all heuristics and sysctls
like panic on oom and sysctl_oom_kill_allocating_task are respected.

The struct ops has a name field, which allows defining a custom
name for the implemented policy. It's printed in the OOM report
in the oom_handler=<name> format, but only if a bpf handler is invoked.

Signed-off-by: Roman Gushchin <[email protected]>
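A skeletal handler under these semantics might look as follows (section and
member names are inferred from the commit text, not verified against the
series):

    #include <vmlinux.h>
    #include <bpf/bpf_tracing.h>

    char _license[] SEC("license") = "GPL";

    SEC("struct_ops.s/handle_out_of_memory")
    int BPF_PROG(handle_out_of_memory, struct oom_control *oc,
                 struct bpf_struct_ops_link *link)
    {
            /* A real policy would free memory here, e.g. via the
             * bpf_oom_kill_process() kfunc introduced later in the
             * series (which sets oc->bpf_memory_freed on success),
             * and return 1. Returning 0 defers to the next handler
             * in the chain. */
            return 0;
    }

    SEC(".struct_ops.link")
    struct bpf_oom_ops defer_ops = {
            .handle_out_of_memory = (void *)handle_out_of_memory,
            .name = "defer",
    };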
Introduce the bpf_oom_kill_process() bpf kfunc, which is supposed
to be used by BPF OOM programs. It allows killing a process
in exactly the same way the OOM killer does: using the OOM reaper,
bumping the corresponding memcg and global statistics, respecting
memory.oom.group, etc.

On success, it sets the oom_control's bpf_memory_freed field to true,
enabling the bpf program to bypass the kernel OOM killer.

Signed-off-by: Roman Gushchin <[email protected]>
Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
an out-of-memory event and triggering the corresponding kernel OOM
handling mechanism.

It takes a trusted memcg pointer (or NULL for system-wide OOMs)
as an argument, as well as the page order.

If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
can be declared and handled in the system at once, so if the function
is called in parallel to another OOM handling, it bails out with -EBUSY.
This mode is suited for global OOMs: any concurrent OOM will likely
do the job and release some memory. In blocking mode (which is
suited for memcg OOMs) the execution waits on the oom_lock mutex.

The function is declared as sleepable, which guarantees that it won't
be called from an atomic context. This is required by the OOM handling
code, which shouldn't be called from a non-blocking context.

Handling a memcg OOM almost always requires taking the css_set_lock
spinlock. The fact that bpf_out_of_memory() is sleepable also
guarantees that it can't be called with css_set_lock acquired,
so the kernel can't deadlock on it.

To avoid deadlocks on the oom lock, the function is filtered out for
bpf oom struct ops programs and all tracing programs.

Signed-off-by: Roman Gushchin <[email protected]>
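From a sleepable handler, a call might look roughly like this (the argument
list is inferred from the commit text and may not match the final kfunc
signature):

    /* Declare a memcg OOM, waiting on oom_lock if another OOM is
     * already being handled (the blocking mode suited for memcg OOMs). */
    bpf_out_of_memory(memcg, 0, BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK);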
Export the tsk_is_oom_victim() helper as a BPF kfunc.
It's very useful for avoiding redundant oom kills.

Signed-off-by: Roman Gushchin <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Implement a read_cgroup_file() helper to read from cgroup control files,
e.g. statistics files.

Signed-off-by: Roman Gushchin <[email protected]>
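Based on the kernel-doc quoted in the review above, usage looks like this
(the return type and cgroup path are assumptions):

    char buf[4096];
    ssize_t n = read_cgroup_file("/oom_test", "memory.events",
                                 buf, sizeof(buf));
    if (n < 0)
            /* handle the error */;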
Implement a kselftest for the OOM handling functionality.

The OOM handling policy implemented in BPF is to kill all tasks
belonging to the biggest leaf cgroup that doesn't contain
unkillable tasks (tasks with oom_score_adj set to -1000).
Pagecache size is excluded from the accounting.

The test creates a hierarchy of memory cgroups, causes an
OOM at the top level, checks that the expected process is
killed and verifies the memcg's oom statistics.

The same BPF OOM policy is attached to a memory cgroup and
system-wide. In the first case the program does nothing and
returns false, so it's executed a second time, when it
properly handles the OOM.

Signed-off-by: Roman Gushchin <[email protected]>
Add a tracepoint to psi_avgs_work().

It can be used to attach a bpf handler which can monitor
PSI values system-wide or for specific cgroup(s) and
potentially perform some actions, e.g. declare an OOM.

Signed-off-by: Roman Gushchin <[email protected]>
To allow more efficient filtering of cgroups in the psi work
tracepoint handler, let's add a u64 cgroup_id field to the psi_group
structure. For system PSI, 0 will be used.

Signed-off-by: Roman Gushchin <[email protected]>
Allow calling bpf_out_of_memory() from a PSI tracepoint to enable
PSI-based OOM killer policies.

Signed-off-by: Roman Gushchin <[email protected]>
Include CONFIG_PSI to allow dependent tests to build.

Suggested-by: Song Liu <[email protected]>
Signed-off-by: JP Kobryn <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Add a PSI struct ops test.

The test creates a cgroup with two child sub-cgroups, sets up
memory.high for one of them and puts a memory-hungry process
there (initially frozen).

The memory-hungry task creates high memory pressure in one
memory cgroup, which triggers a PSI event. The PSI BPF handler
declares a memcg oom in the corresponding cgroup.

Signed-off-by: Roman Gushchin <[email protected]>
@kernel-patches-daemon-bpf
Author

Upstream branch: 35538db
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339
version: 3

@kernel-patches-daemon-bpf
Author

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1047339 expired. Closing PR.
