@kernel-patches-daemon-bpf

Pull request for series with
subject: mm: memcontrol: Add BPF hooks for memory controller
version: 5
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047479

rgushchin and others added 12 commits January 27, 2026 01:57
Move the definition of struct bpf_struct_ops_link into bpf.h,
where the other custom bpf link definitions live.

This makes its members accessible from outside the generic
bpf_struct_ops implementation, which the following patches
in the series rely on.

Signed-off-by: Roman Gushchin <[email protected]>
When a struct ops is being attached and a bpf link is created,
allow passing a cgroup fd via bpf attr, so that the struct ops
can be attached to a cgroup instead of globally.

An attached struct ops doesn't hold a reference to the cgroup;
it only preserves the cgroup id.

Signed-off-by: Roman Gushchin <[email protected]>
Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted-or-NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which is required, for example, for iterating the memcg's subtree.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.

Signed-off-by: Roman Gushchin <[email protected]>
Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops(), which takes an additional struct
bpf_struct_ops_opts argument.

struct bpf_struct_ops_opts has the relative_fd member, which allows
passing an additional file descriptor argument. It can be used to
attach struct ops maps to cgroups.

Signed-off-by: Roman Gushchin <[email protected]>
To support features like allowing overrides in cgroup hierarchies,
we need a way to pass flags from userspace to the kernel when
attaching a struct_ops.

Extend `bpf_struct_ops_link` to include a `flags` field. This field
is populated from `attr->link_create.flags` during link creation. This
will allow struct_ops implementations, such as the upcoming memory
controller ops, to interpret these flags and modify their attachment
behavior accordingly.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>
Building on the previous change that added flags to the kernel's link
creation path, this patch exposes this functionality through libbpf.

The `bpf_struct_ops_opts` struct is extended with a `flags` member,
which is then passed to the `bpf_link_create` syscall within
`bpf_map__attach_struct_ops_opts`.

This enables userspace applications to pass flags, such as
`BPF_F_ALLOW_OVERRIDE`, when attaching struct_ops to cgroups,
providing more control over the attachment behavior in nested
hierarchies.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>
Introduce BPF struct_ops support to the memory controller, enabling
custom and dynamic control over memory pressure. This is achieved
through a new struct_ops type, `memcg_bpf_ops`.

This new interface allows a BPF program to implement hooks that
influence a memory cgroup's behavior. The `memcg_bpf_ops` struct
provides the following hooks:

- `get_high_delay_ms`: Returns a custom throttling delay in
  milliseconds for a cgroup that has breached its `memory.high`
  limit. This is the primary mechanism for BPF-driven throttling.

- `below_low`: Overrides the `memory.low` protection check. If this
  hook returns true, the cgroup is considered to be protected by its
  `memory.low` setting, regardless of its actual usage.

- `below_min`: Similar to `below_low`, this overrides the `memory.min`
  protection check.

- `handle_cgroup_online`/`offline`: Callbacks invoked when a cgroup
  with an attached program comes online or goes offline, allowing for
  state management.

This patch integrates these hooks into the core memory control logic.
The `get_high_delay_ms` value is incorporated into charge paths like
`try_charge_memcg` and the high-limit handler
`__mem_cgroup_handle_over_high`. The `below_low` and `below_min`
hooks are checked within their respective protection functions.

Lifecycle management is handled to ensure BPF programs are correctly
inherited by child cgroups and cleaned up on detachment. SRCU is used
to protect concurrent access to the `memcg->bpf_ops` pointer.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>
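
The hook semantics described in the commit message above can be pictured with a small userspace model in plain C. This is only a sketch: the struct layouts, the callback signatures, and every `model_*` name are assumptions made for illustration and are not taken from the patch, where the hooks are BPF programs attached through `memcg_bpf_ops`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_memcg_ops;

/* Hypothetical stand-in for a memory cgroup; not the kernel's struct. */
struct model_memcg {
	unsigned long usage;	/* current usage, in pages */
	unsigned long low;	/* effective memory.low, in pages */
	const struct model_memcg_ops *ops;	/* NULL when nothing is attached */
};

/* Hypothetical callback table; the signatures are assumptions. */
struct model_memcg_ops {
	unsigned int (*get_high_delay_ms)(const struct model_memcg *memcg);
	bool (*below_low)(const struct model_memcg *memcg);
};

/* memory.low check: an attached hook that returns true forces protection. */
static bool model_below_low(const struct model_memcg *memcg)
{
	assert(memcg != NULL);
	if (memcg->ops && memcg->ops->below_low && memcg->ops->below_low(memcg))
		return true;
	return memcg->usage <= memcg->low;
}

/* Extra throttling delay requested by the hook, or 0 when none is attached. */
static unsigned int model_high_delay_ms(const struct model_memcg *memcg)
{
	assert(memcg != NULL);
	if (memcg->ops && memcg->ops->get_high_delay_ms)
		return memcg->ops->get_high_delay_ms(memcg);
	return 0;
}

/* Example policy: always report protection and ask for a fixed 100 ms delay. */
static bool always_protected(const struct model_memcg *memcg)
{
	(void)memcg;
	return true;
}

static unsigned int fixed_delay(const struct model_memcg *memcg)
{
	(void)memcg;
	return 100;
}

static const struct model_memcg_ops example_ops = {
	.get_high_delay_ms = fixed_delay,
	.below_low = always_protected,
};
```

With `example_ops` attached, a cgroup whose usage is above its low boundary is still reported as protected, and the delay helper returns 100; without ops the model falls back to the plain usage comparison and a zero delay.
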
Add a comprehensive selftest suite for the `memcg_bpf_ops`
functionality. These tests validate that BPF programs can correctly
influence memory cgroup throttling behavior by implementing the new
hooks.

The test suite is added in `prog_tests/memcg_ops.c` and covers
several key scenarios:

1. `test_memcg_ops_over_high`:
   Verifies that a BPF program can trigger throttling on a low-priority
   cgroup by returning a delay from the `get_high_delay_ms` hook when a
   high-priority cgroup is under pressure.

2. `test_memcg_ops_below_low_over_high`:
   Tests the combination of the `below_low` and `get_high_delay_ms`
   hooks, ensuring they work together as expected.

3. `test_memcg_ops_below_min_over_high`:
   Validates the interaction between the `below_min` and
   `get_high_delay_ms` hooks.

The test framework sets up a cgroup hierarchy with high and low
priority groups, attaches BPF programs, runs memory-intensive
workloads, and asserts that the observed throttling (measured by
workload execution time) matches expectations.

The BPF program (`progs/memcg_ops.c`) uses a tracepoint on
`memcg:count_memcg_events` (specifically PGFAULT) to detect memory
pressure and trigger the appropriate hooks in response. This test
suite provides essential validation for the new memory control
mechanisms.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>
To allow for more flexible attachment policies in nested cgroup
hierarchies, this patch introduces support for the
`BPF_F_ALLOW_OVERRIDE` flag for `memcg_bpf_ops`.

When a `memcg_bpf_ops` is attached to a cgroup with this flag, it
permits child cgroups to attach their own, different `memcg_bpf_ops`,
overriding the parent's inherited program. Without this flag,
attaching a BPF program to a cgroup that already has one (either
directly or via inheritance) will fail.

The implementation involves:
- Adding a `bpf_ops_flags` field to `struct mem_cgroup`.
- During registration (`bpf_memcg_ops_reg`), checking for existing
  programs and the `BPF_F_ALLOW_OVERRIDE` flag.
- During unregistration (`bpf_memcg_ops_unreg`), correctly restoring
  the parent's BPF program to the cgroup hierarchy.
- Ensuring flags are inherited by child cgroups during online events.

This change enables complex, multi-level policy enforcement where
different subtrees of the cgroup hierarchy can have distinct memory
management BPF programs.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>
Add a new selftest, `test_memcg_ops_hierarchies`, to validate the
behavior of attaching `memcg_bpf_ops` in a nested cgroup hierarchy,
specifically testing the `BPF_F_ALLOW_OVERRIDE` flag.

The test case performs the following steps:
1. Creates a three-level deep cgroup hierarchy: `/cg`, `/cg/cg`, and
   `/cg/cg/cg`.
2. Attaches a BPF struct_ops to the top-level cgroup (`/cg`) with the
   `BPF_F_ALLOW_OVERRIDE` flag.
3. Successfully attaches a new struct_ops to the middle cgroup
   (`/cg/cg`) without the flag, overriding the inherited one.
4. Asserts that attaching another struct_ops to the deepest cgroup
   (`/cg/cg/cg`) fails with -EBUSY, because its parent did not specify
   `BPF_F_ALLOW_OVERRIDE`.

This test ensures that the attachment logic correctly enforces the
override rules across a cgroup subtree.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>
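
The override rules exercised by this test can be sketched as a toy userspace model. Several things here are assumptions: the hierarchy is reduced to a single chain (one child per cgroup), `MODEL_F_ALLOW_OVERRIDE` is a stand-in for `BPF_F_ALLOW_OVERRIDE`, and the first check in `model_attach()` guesses at logic that is not spelled out in the commit messages. It is not the code from `bpf_memcg_ops_reg()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_F_ALLOW_OVERRIDE 1u	/* stand-in for BPF_F_ALLOW_OVERRIDE */

/* One cgroup in a single-chain hierarchy (each node has at most one child). */
struct model_cg {
	struct model_cg *child;
	const void *ops;	/* attached or inherited ops, NULL if none */
	unsigned int flags;	/* flags of the attachment that set ops */
};

/*
 * Attach ops to cg and propagate it to every descendant.
 * Returns 0 on success or -EBUSY when the override rules forbid it.
 */
static int model_attach(struct model_cg *cg, const void *ops, unsigned int flags)
{
	const void *old_ops;
	struct model_cg *iter;

	assert(cg != NULL && ops != NULL);
	old_ops = cg->ops;

	/* An existing program may only be replaced if it allowed overrides. */
	if (old_ops && !(cg->flags & MODEL_F_ALLOW_OVERRIDE))
		return -EBUSY;

	/* A descendant with its own, different program blocks the attach. */
	for (iter = cg; iter; iter = iter->child)
		if (iter->ops && iter->ops != old_ops)
			return -EBUSY;

	/* Second pass: install ops and flags on the whole subtree. */
	for (iter = cg; iter; iter = iter->child) {
		iter->ops = ops;
		iter->flags = flags;
	}
	return 0;
}
```

Replaying the test steps against this model: attaching to the top node with the flag succeeds, attaching to the middle node without the flag succeeds and replaces the inherited ops below it, and a further attach to the deepest node returns `-EBUSY`.
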
Add a sample program to demonstrate a practical use case for the
`memcg_bpf_ops` feature: priority-based memory throttling.

The sample consists of a BPF program and a userspace loader:

1. memcg.bpf.c: A BPF program that monitors PGFAULT events on a
   high-priority cgroup. When activity exceeds a threshold, it uses
   the `get_high_delay_ms`, `below_low`, or `below_min` hooks to
   apply pressure on a low-priority cgroup.

2. memcg.c: A userspace loader that configures and attaches the BPF
   program. It takes command-line arguments for the high and low
   priority cgroup paths, a pressure threshold, and the desired
   throttling delay (`over_high_ms`).

This provides a clear, working example of how to implement a dynamic,
priority-aware memory management policy. A user can create two
cgroups, run workloads of different priorities, and observe the
low-priority workload being throttled to protect the high-priority one.

Example usage:
  # ./memcg --low_path /sys/fs/cgroup/low \
  #         --high_path /sys/fs/cgroup/high \
  #         --threshold 100 --over_high_ms 1024

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>
@kernel-patches-daemon-bpf

Upstream branch: 8016abd
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1047479
version: 5

@kernel-patches-review-bot

commit 5605dc8c8ea3d755c5b2e1e3b08b9c0438b74665
Author: Hui Zhu <[email protected]>

mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops

This patch adds support for BPF_F_ALLOW_OVERRIDE flag to allow child
cgroups to attach their own memcg_bpf_ops, overriding inherited programs
from parent cgroups.

> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> index e746eb9cbd56..7cd983e350d7 100644
> --- a/mm/bpf_memcontrol.c
> +++ b/mm/bpf_memcontrol.c

[ ... ]

> static int bpf_memcg_ops_reg(void *kdata, struct bpf_link *link)
> {
> 	struct bpf_struct_ops_link *ops_link
> 		= container_of(link, struct bpf_struct_ops_link, link);
>-	struct memcg_bpf_ops *ops = kdata;
>+	struct memcg_bpf_ops *ops = kdata, *old_ops;
> 	struct mem_cgroup *memcg, *iter = NULL;
> 	int err = 0;

[ ... ]

>+	/* Check for incompatible bpf_ops in descendants. */
> 	while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
>-		if (READ_ONCE(iter->bpf_ops)) {
>-			mem_cgroup_iter_break(memcg, iter);
>+		struct memcg_bpf_ops *iter_ops = READ_ONCE(iter->bpf_ops);
>+
>+		if (iter_ops && iter_ops != old_ops) {
>+			/* cannot override existing bpf_ops of sub-cgroup. */
> 			err = -EBUSY;
>-			break;
>+			goto unlock_out;
                        ^^^^

It looks like this error path leaks a CSS reference. The original code
called mem_cgroup_iter_break(memcg, iter) before breaking out of the loop,
which releases the CSS reference held on iter. The new code jumps directly
to unlock_out without releasing this reference.

mem_cgroup_iter() acquires a reference on each returned memcg via
css_tryget() in mm/memcontrol.c. When breaking out of the loop early,
mem_cgroup_iter_break() must be called to release that reference.

Should this be:

    if (iter_ops && iter_ops != old_ops) {
        mem_cgroup_iter_break(memcg, iter);
        err = -EBUSY;
        goto unlock_out;
    }

> 		}
>+	}
>+
>+	while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
> 		WRITE_ONCE(iter->bpf_ops, ops);
>+		iter->bpf_ops_flags = ops_link->flags;
> 	}
>-	if (err)
>-		clean_memcg_bpf_ops(memcg, ops);
>+
>+unlock_out:
> 	cgroup_unlock();

[ ... ]
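
The leak described in this review can be reproduced in a toy userspace model of a reference-taking iterator. The names and the flat array walk are assumptions for illustration only; the real `mem_cgroup_iter()` walks a tree, and details such as any special handling of the root are ignored here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy node with a reference count; stands in for a memcg and its CSS. */
struct model_node {
	int refs;
	int has_ops;
};

/*
 * Return the node after prev (or the first one) holding a new reference,
 * and drop the reference held on prev. Returns NULL at the end of the walk.
 */
static struct model_node *model_iter(struct model_node *nodes, size_t n,
				     struct model_node *prev)
{
	size_t next = prev ? (size_t)(prev - nodes) + 1 : 0;

	if (prev)
		prev->refs--;
	if (next >= n)
		return NULL;
	nodes[next].refs++;
	return &nodes[next];
}

/* Counterpart of mem_cgroup_iter_break(): drop the reference on prev. */
static void model_iter_break(struct model_node *prev)
{
	if (prev)
		prev->refs--;
}

/* Returns -1 if any node has ops; release_on_exit selects the fixed variant. */
static int model_scan(struct model_node *nodes, size_t n, int release_on_exit)
{
	struct model_node *iter = NULL;

	assert(nodes != NULL);
	while ((iter = model_iter(nodes, n, iter))) {
		if (iter->has_ops) {
			if (release_on_exit)
				model_iter_break(iter);
			return -1;	/* early exit, like "goto unlock_out" */
		}
	}
	return 0;
}

/* Sum of outstanding references; non-zero after a scan means a leak. */
static int model_outstanding(const struct model_node *nodes, size_t n)
{
	int total = 0;

	for (size_t i = 0; i < n; i++)
		total += nodes[i].refs;
	return total;
}
```

A scan that exits early with `release_on_exit` set leaves no outstanding references, whereas the same scan without it leaves one reference on the node where the walk stopped.
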



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21392729810

AI-authorship-score: low
AI-authorship-explanation: The commit follows standard kernel documentation patterns without distinctive AI markers.
issues-found: 1
issue-severity-score: medium
issue-severity-explanation: CSS reference leak in error path leads to memcg lifetime issues but does not cause immediate system instability.

@kernel-patches-review-bot

commit ec406f5cccc01d230180ca220ebf52d99fe413c7
Author: Hui Zhu <[email protected]>

samples/bpf: Add memcg priority control example

This commit adds a sample BPF program and userspace loader demonstrating
priority-based memory throttling using the memcg_bpf_ops feature. The
BPF program monitors PGFAULT events on a high-priority cgroup and applies
pressure on a low-priority cgroup when activity exceeds a threshold.

> diff --git a/samples/bpf/memcg.bpf.c b/samples/bpf/memcg.bpf.c
> new file mode 100644
> index 000000000000..e611ac0e641a
> --- /dev/null
> +++ b/samples/bpf/memcg.bpf.c

[ ... ]

> +SEC("tp/memcg/count_memcg_events")
> +int
> +handle_count_memcg_events(struct trace_event_raw_memcg_rstat_events *ctx)
> +{
> +	u32 key = 0;
> +	struct AggregationData *data;
> +	u64 current_ts;
> +
> +	if (ctx->id != local_config.high_cgroup_id ||
> +	    (ctx->item != PGFAULT))
                         ^^^^^^^

The BPF code monitors PGFAULT events here.

> diff --git a/samples/bpf/memcg.c b/samples/bpf/memcg.c
> new file mode 100644
> index 000000000000..0c47ed53f6ae
> --- /dev/null
> +++ b/samples/bpf/memcg.c

[ ... ]

> +static void usage(char *name)
> +{

[ ... ]

> +	fprintf(stderr,
> +		"  -t, --threshold=VALUE  The sum of 'val' PGSCAN of\n");
                                                     ^^^^^^

Should this say PGFAULT instead of PGSCAN? The commit message and the BPF
code in memcg.bpf.c both indicate that the program monitors PGFAULT events,
but the help text here mentions PGSCAN.



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: samples/bpf: Add memcg priority control example
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21392729810

AI-authorship-score: low
AI-authorship-explanation: Code is a straightforward adaptation of existing selftest code with standard kernel patterns; documentation error is more consistent with human copy-paste mistake than AI generation.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: Documentation mismatch in help text (PGSCAN vs PGFAULT) could cause user confusion but has no runtime impact.

@kernel-patches-review-bot

commit 380bb1065b4190dd6548209ddc6827289922a8c8
Author: Roman Gushchin <[email protected]>

libbpf: introduce bpf_map__attach_struct_ops_opts()

This patch introduces bpf_map__attach_struct_ops_opts(), an extended version
of bpf_map__attach_struct_ops() that takes an additional bpf_struct_ops_opts
argument for attaching struct ops maps to cgroups.

> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index 5846de364209..84a53c594f48 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -884,6 +884,14 @@ int bpf_link_create(int prog_fd, int target_fd,
>  		if (!OPTS_ZEROED(opts, cgroup))
>  			return libbpf_err(-EINVAL);
>  		break;
> +	case BPF_STRUCT_OPS:
> +		relative_fd = OPTS_GET(opts, cgroup.relative_fd, 0);
> +		attr.link_create.cgroup.relative_fd = relative_fd;
> +		attr.link_create.cgroup.expected_revision =
> +			OPTS_GET(opts, cgroup.expected_revision, 0);

The bpf_link_create() BPF_STRUCT_OPS case reads expected_revision from
the opts, but looking at bpf_map__attach_struct_ops_opts() below...

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 0c8bf0b5cce4..70a00da54ff5 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c

[ ... ]

> @@ -13503,7 +13510,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>  		return &link->link;
>  	}
>
> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> +	link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
> +
> +	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);

Should there also be a line here to forward expected_revision? The
bpf_struct_ops_opts struct defines expected_revision and the BPF_STRUCT_OPS
case in bpf_link_create() reads it, but this function only sets relative_fd.
If a user passes expected_revision in their opts, it appears to be silently
ignored.

> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index dfc37a615578..5aef44bcfcc2 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h

[ ... ]

> +struct bpf_struct_ops_opts {
> +	/* size of this struct, for forward/backward compatibility */
> +	size_t sz;
> +	__u32 flags;
> +	__u32 relative_fd;
> +	__u64 expected_revision;
                ^^^^^^^^^^^^^^^^
> +	size_t :0;
> +};
> +#define bpf_struct_ops_opts__last_field expected_revision

The expected_revision field is declared here and is even the __last_field,
but it does not appear to be forwarded to bpf_link_create_opts in
bpf_map__attach_struct_ops_opts(). Is this intentional, or was the
forwarding line missed?
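
If the omission is unintentional, the forwarding could look roughly like the sketch below. It uses simplified stand-in structs and a simplified `MODEL_OPTS_GET()` macro written in the spirit of libbpf's `OPTS_GET()`; none of the `model_*` names exist in libbpf, and placing `flags` in the link options is a simplification of this model rather than a statement about the real API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the public opts struct quoted above. */
struct model_struct_ops_opts {
	size_t sz;
	uint32_t flags;
	uint32_t relative_fd;
	uint64_t expected_revision;
};

/* Simplified stand-in for the options handed to the link-create call. */
struct model_link_opts {
	uint32_t flags;
	uint32_t relative_fd;
	uint64_t expected_revision;
};

/*
 * Read a field only if the caller's struct is large enough to contain it;
 * otherwise use the fallback value.
 */
#define MODEL_OPTS_GET(opts, field, fallback)				\
	(((opts) != NULL &&						\
	  (opts)->sz >= offsetof(struct model_struct_ops_opts, field) +	\
			    sizeof((opts)->field))			\
		 ? (opts)->field					\
		 : (fallback))

/* Copy every user-visible field so that none is silently dropped. */
static void model_forward_opts(const struct model_struct_ops_opts *opts,
			       struct model_link_opts *link_opts)
{
	assert(link_opts != NULL);
	link_opts->flags = MODEL_OPTS_GET(opts, flags, 0);
	link_opts->relative_fd = MODEL_OPTS_GET(opts, relative_fd, 0);
	link_opts->expected_revision = MODEL_OPTS_GET(opts, expected_revision, 0);
}
```

The point of the model is that a caller who sets `expected_revision` sees it arrive on the other side, while a NULL or shorter opts struct still yields zeroed defaults.
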



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: libbpf: introduce bpf_map__attach_struct_ops_opts()
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21392729810

AI-authorship-score: low
AI-authorship-explanation: Commit message and code follow standard kernel/libbpf conventions without unusual patterns typically associated with AI-generated content.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The expected_revision field in bpf_struct_ops_opts is defined but silently ignored, which could cause unexpected behavior for users relying on revision checking, but does not cause crashes or security issues.

@kernel-patches-review-bot

commit db77479affdbf2aa45dd7873e993c1879fbb74fa
Author: Hui Zhu <[email protected]>

selftests/bpf: Add tests for memcg_bpf_ops

This commit adds a selftest suite for the memcg_bpf_ops functionality.
The tests validate BPF programs that influence memory cgroup throttling
behavior through the new hooks.

> diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
> new file mode 100644
> index 000000000000..a596926ea233
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c

[ ... ]

> +void test_memcg_ops_over_high(void)
> +{
> +	int err, map_fd;
> +	struct memcg_ops *skel = NULL;

[ ... ]

> +	skel = memcg_ops__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
> +		goto out;

[ ... ]

> +out:
> +	bpf_link__destroy(link);
> +	bpf_link__destroy(link2);
> +	memcg_ops__detach(skel);
> +	memcg_ops__destroy(skel);
                  ^^^^

Can this cause a NULL pointer dereference? When memcg_ops__open_and_load()
fails, skel is NULL and the code jumps to out: where memcg_ops__detach(skel)
is called. The generated __detach() function does not check for NULL, unlike
__destroy() which does. Since __destroy() already calls __detach() internally
after checking for NULL, the explicit memcg_ops__detach(skel) call appears
to be unnecessary and would crash when skel is NULL.

The same pattern appears in test_memcg_ops_below_low_over_high() and
test_memcg_ops_below_min_over_high().
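
A cleanup path that relies on the NULL-safe destroy alone could be modelled as follows. The `model_skel` type and helpers are hypothetical stand-ins that mirror the behaviour described in this review (detach without a NULL check, destroy with one); they are not the functions generated by bpftool.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical skeleton object; real skeletons are generated by bpftool. */
struct model_skel {
	int attached;
};

static int model_detach_calls;	/* counts detach invocations, for illustration */

/* Like the generated __detach() as described above: no NULL check. */
static void model_skel_detach(struct model_skel *skel)
{
	assert(skel != NULL);	/* a NULL skel here is a bug in the caller */
	skel->attached = 0;
	model_detach_calls++;
}

/* Like the generated __destroy(): NULL-safe, and detaches internally. */
static void model_skel_destroy(struct model_skel *skel)
{
	if (!skel)
		return;
	model_skel_detach(skel);
	free(skel);
}

/* Cleanup path that relies on destroy alone, so a failed load is harmless. */
static void model_cleanup(struct model_skel *skel)
{
	model_skel_destroy(skel);
}
```

Calling `model_cleanup(NULL)` is a no-op in this model, and a loaded skeleton is still detached exactly once before it is freed.
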



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: selftests/bpf: Add tests for memcg_bpf_ops
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21392729810

AI-authorship-score: medium
AI-authorship-explanation: The commit message uses verbose, explanatory language with markdown formatting that is somewhat atypical for kernel commits, and the repetitive test function structure with the redundant detach call suggests possible AI assistance in code generation.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The memcg_ops__detach(NULL) call can crash the test if the skeleton fails to load, but this only affects the selftest program and not the kernel itself.

@kernel-patches-review-bot

commit 8c4e972cad3327d34815224916cef478241cf9c8
Author: Hui Zhu <[email protected]>

mm: memcontrol: Add BPF struct_ops for memory controller

This commit introduces BPF struct_ops support for the memory controller,
enabling custom memory pressure control through hooks like get_high_delay_ms,
below_low, below_min, and cgroup online/offline handlers.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1f74fce27677..8d90575aa77d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c

[ ... ]

> @@ -2530,10 +2537,14 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * kernel. If this is successful, the return path will see it
>  	 * when it rechecks the overage and simply bail out.
>  	 */
> -	if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
> -	    !(current->flags & PF_MEMALLOC) &&
> -	    gfpflags_allow_blocking(gfp_mask))
> -		__mem_cgroup_handle_over_high(gfp_mask);
> +	if (gfpflags_allow_blocking(gfp_mask)) {
> +		unsigned long bpf_high_delay;
> +
> +		bpf_high_delay = bpf_memcg_get_high_delay(orig_memcg);
> +		if (bpf_high_delay ||
> +		    current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH)
> +			__mem_cgroup_handle_over_high(gfp_mask, bpf_high_delay);
> +	}

The old code checked !(current->flags & PF_MEMALLOC) before calling
__mem_cgroup_handle_over_high(). This check appears to be missing in the
new code.

When PF_MEMALLOC is set, the task is already in the memory reclaim context.
Calling __mem_cgroup_handle_over_high() from such a context would invoke
reclaim_high() -> try_to_free_mem_cgroup_pages(), potentially causing
reclaim recursion.

Was the removal of the PF_MEMALLOC check intentional? If a BPF program
returns a non-zero delay from get_high_delay_ms while the task is in a
reclaim context, could this lead to recursive reclaim or deadlock?
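
Assuming the removal was not intentional, the condition with the `PF_MEMALLOC` check restored could be expressed as the predicate below. This is a userspace sketch: the `MODEL_*` constants are stand-ins whose values matter only inside the model, and the function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for kernel constants; the values here are only for the model. */
#define MODEL_PF_MEMALLOC	0x0800u
#define MODEL_CHARGE_BATCH	64u

/*
 * Decide whether the charge path should call the over-high handler.
 * task_flags:      flags of the current task
 * may_block:       result of gfpflags_allow_blocking(gfp_mask)
 * pages_over_high: current->memcg_nr_pages_over_high
 * bpf_delay:       delay returned by the BPF hook, 0 if none
 */
static bool model_should_handle_over_high(unsigned int task_flags,
					  bool may_block,
					  unsigned int pages_over_high,
					  unsigned long bpf_delay)
{
	/* Never enter the handler from reclaim context or when blocking is not allowed. */
	if ((task_flags & MODEL_PF_MEMALLOC) || !may_block)
		return false;
	return bpf_delay != 0 || pages_over_high > MODEL_CHARGE_BATCH;
}
```

With this shape, a non-zero BPF delay can trigger the handler on its own, but never for a task that already has the reclaim flag set.
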



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: memcontrol: Add BPF struct_ops for memory controller
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21392729810

AI-authorship-score: low
AI-authorship-explanation: The code shows typical human characteristics including a refactoring oversight that AI would likely not make; the commit follows standard kernel conventions and has multiple human authors.
issues-found: 1
issue-severity-score: medium
issue-severity-explanation: Missing PF_MEMALLOC check could cause memory reclaim recursion when BPF programs are attached, potentially leading to system instability.

@kernel-patches-daemon-bpf

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1047479 expired. Closing PR.
