@kernel-patches-daemon-bpf

Pull request for series with
subject: mm: memcontrol: Add BPF hooks for memory controller
version: 6
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1050636

rgushchin and others added 12 commits February 4, 2026 01:06
Move the definition of struct bpf_struct_ops_link into bpf.h,
where the other custom bpf link definitions live.

This is necessary so that its members can be accessed from outside
the generic bpf_struct_ops implementation, which later patches in
the series will do.

Signed-off-by: Roman Gushchin <[email protected]>

When a struct ops is being attached and a bpf link is created,
allow passing a cgroup fd via bpf attr, so that the struct ops
can be attached to a cgroup instead of globally.

The attached struct ops doesn't hold a reference to the cgroup;
it only preserves the cgroup id.

Signed-off-by: Roman Gushchin <[email protected]>

Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted-or-NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which is required, for example, for iterating over the memcg's subtree.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>

mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.

Signed-off-by: Roman Gushchin <[email protected]>

Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops(), which takes an additional struct
bpf_struct_ops_opts argument.

struct bpf_struct_ops_opts has the relative_fd member, which allows
passing an additional file descriptor argument. It can be used to
attach struct ops maps to cgroups.

Signed-off-by: Roman Gushchin <[email protected]>

To support features like allowing overrides in cgroup hierarchies,
we need a way to pass flags from userspace to the kernel when
attaching a struct_ops.

Extend `bpf_struct_ops_link` to include a `flags` field. This field
is populated from `attr->link_create.flags` during link creation. This
will allow struct_ops implementations, such as the upcoming memory
controller ops, to interpret these flags and modify their attachment
behavior accordingly.

UAPI Change:
This patch updates the comment in include/uapi/linux/bpf.h to reflect
that the cgroup-bpf attach flags (such as BPF_F_ALLOW_OVERRIDE) are
now applicable to both BPF_PROG_ATTACH and BPF_LINK_CREATE commands.
Previously, these flags were only documented for BPF_PROG_ATTACH.

The actual flag definitions remain unchanged, so this is a compatible
extension of the existing API. Older userspace will continue to work
(by not passing flags), and newer userspace can opt-in to the new
functionality by setting appropriate flags.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>

Building on the previous change that added flags to the kernel's link
creation path, this patch exposes this functionality through libbpf.

The `bpf_struct_ops_opts` struct is extended with a `flags` member,
which is then passed to the `bpf_link_create` syscall within
`bpf_map__attach_struct_ops_opts`.

This enables userspace applications to pass flags, such as
`BPF_F_ALLOW_OVERRIDE`, when attaching struct_ops to cgroups,
providing more control over the attachment behavior in nested
hierarchies.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>

Introduce BPF struct_ops support to the memory controller, enabling
custom and dynamic control over memory pressure. This is achieved
through a new struct_ops type, `memcg_bpf_ops`.

This new interface allows a BPF program to implement hooks that
influence a memory cgroup's behavior. The `memcg_bpf_ops` struct
provides the following hooks:

- `get_high_delay_ms`: Returns a custom throttling delay in
  milliseconds for a cgroup that has breached its `memory.high`
  limit. This is the primary mechanism for BPF-driven throttling.

- `below_low`: Overrides the `memory.low` protection check. If this
  hook returns true, the cgroup is considered to be protected by its
  `memory.low` setting, regardless of its actual usage.

- `below_min`: Similar to `below_low`, this overrides the `memory.min`
  protection check.

- `handle_cgroup_online`/`offline`: Callbacks invoked when a cgroup
  with an attached program comes online or goes offline, allowing for
  state management.

This patch integrates these hooks into the core memory control logic.
The `get_high_delay_ms` value is incorporated into charge paths like
`try_charge_memcg` and the high-limit handler
`__mem_cgroup_handle_over_high`. The `below_low` and `below_min`
hooks are checked within their respective protection functions.

Lifecycle management is handled to ensure BPF programs are correctly
inherited by child cgroups and cleaned up on detachment. SRCU is used
to protect concurrent access to the `memcg->bpf_ops` pointer.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>

Add a comprehensive selftest suite for the `memcg_bpf_ops`
functionality. These tests validate that BPF programs can correctly
influence memory cgroup throttling behavior by implementing the new
hooks.

The test suite is added in `prog_tests/memcg_ops.c` and covers
several key scenarios:

1. `test_memcg_ops_over_high`:
   Verifies that a BPF program can trigger throttling on a low-priority
   cgroup by returning a delay from the `get_high_delay_ms` hook when a
   high-priority cgroup is under pressure.

2. `test_memcg_ops_below_low_over_high`:
   Tests the combination of the `below_low` and `get_high_delay_ms`
   hooks, ensuring they work together as expected.

3. `test_memcg_ops_below_min_over_high`:
   Validates the interaction between the `below_min` and
   `get_high_delay_ms` hooks.

The test framework sets up a cgroup hierarchy with high and low
priority groups, attaches BPF programs, runs memory-intensive
workloads, and asserts that the observed throttling (measured by
workload execution time) matches expectations.

The BPF program (`progs/memcg_ops.c`) uses a tracepoint on
`memcg:count_memcg_events` (specifically PGFAULT) to detect memory
pressure and trigger the appropriate hooks in response. This test
suite provides essential validation for the new memory control
mechanisms.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>

To allow for more flexible attachment policies in nested cgroup
hierarchies, this patch introduces support for the
`BPF_F_ALLOW_OVERRIDE` flag for `memcg_bpf_ops`.

When a `memcg_bpf_ops` is attached to a cgroup with this flag, it
permits child cgroups to attach their own, different `memcg_bpf_ops`,
overriding the parent's inherited program. Without this flag,
attaching a BPF program to a cgroup that already has one (either
directly or via inheritance) will fail.

The implementation involves:
- Adding a `bpf_ops_flags` field to `struct mem_cgroup`.
- During registration (`bpf_memcg_ops_reg`), checking for existing
  programs and the `BPF_F_ALLOW_OVERRIDE` flag.
- During unregistration (`bpf_memcg_ops_unreg`), correctly restoring
  the parent's BPF program to the cgroup hierarchy.
- Ensuring flags are inherited by child cgroups during online events.

This change enables complex, multi-level policy enforcement where
different subtrees of the cgroup hierarchy can have distinct memory
management BPF programs.

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>

Add a new selftest, `test_memcg_ops_hierarchies`, to validate the
behavior of attaching `memcg_bpf_ops` in a nested cgroup hierarchy,
specifically testing the `BPF_F_ALLOW_OVERRIDE` flag.

The test case performs the following steps:
1. Creates a three-level deep cgroup hierarchy: `/cg`, `/cg/cg`, and
   `/cg/cg/cg`.
2. Attaches a BPF struct_ops to the top-level cgroup (`/cg`) with the
   `BPF_F_ALLOW_OVERRIDE` flag.
3. Successfully attaches a new struct_ops to the middle cgroup
   (`/cg/cg`) without the flag, overriding the inherited one.
4. Asserts that attaching another struct_ops to the deepest cgroup
   (`/cg/cg/cg`) fails with -EBUSY, because its parent did not specify
   `BPF_F_ALLOW_OVERRIDE`.

This test ensures that the attachment logic correctly enforces the
override rules across a cgroup subtree.
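
The override rules exercised by the steps above can be sketched as a small userspace model. This is only an illustration under stated assumptions: `toy_cgroup`, `toy_attach()` and `TOY_F_ALLOW_OVERRIDE` are hypothetical names, the flag value is arbitrary, and the model covers a single parent-to-child chain rather than the real `bpf_memcg_ops_reg` logic.

```c
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for BPF_F_ALLOW_OVERRIDE; the numeric value is illustrative only. */
#define TOY_F_ALLOW_OVERRIDE 1u

/* Hypothetical model of one cgroup in a chain; index 0 is the top level. */
struct toy_cgroup {
	const char *ops;    /* name of the effective ops, NULL if none */
	unsigned int flags; /* flags the effective ops were attached with */
};

/*
 * Attach `ops` at depth `idx` of a chain of `n` cgroups.
 *
 * Fails with -EBUSY if the cgroup already has ops (attached directly or
 * inherited) that were attached without the override flag. On success the
 * cgroup and every cgroup below it take on the new ops and flags.
 */
static int toy_attach(struct toy_cgroup *chain, size_t n, size_t idx,
		      const char *ops, unsigned int flags)
{
	size_t i;

	if (idx >= n)
		return -EINVAL;
	if (chain[idx].ops && !(chain[idx].flags & TOY_F_ALLOW_OVERRIDE))
		return -EBUSY;

	for (i = idx; i < n; i++) {
		chain[i].ops = ops;
		chain[i].flags = flags;
	}
	return 0;
}
```

With a three-element chain, attaching at index 0 with the toy flag and then at index 1 without it both succeed in this model, and a third attach at index 2 returns -EBUSY, which mirrors steps 2 to 4 of the test.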

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>

Add a sample program to demonstrate a practical use case for the
`memcg_bpf_ops` feature: priority-based memory throttling.

The sample consists of a BPF program and a userspace loader:

1. memcg.bpf.c: A BPF program that monitors PGFAULT events on a
   high-priority cgroup. When activity exceeds a threshold, it uses
   the `get_high_delay_ms`, `below_low`, or `below_min` hooks to
   apply pressure on a low-priority cgroup.

2. memcg.c: A userspace loader that configures and attaches the BPF
   program. It takes command-line arguments for the high and low
   priority cgroup paths, a pressure threshold, and the desired
   throttling delay (`over_high_ms`).

This provides a clear, working example of how to implement a dynamic,
priority-aware memory management policy. A user can create two
cgroups, run workloads of different priorities, and observe the
low-priority workload being throttled to protect the high-priority one.

Example usage:
  # ./memcg --low_path /sys/fs/cgroup/low \
  #         --high_path /sys/fs/cgroup/high \
  #         --threshold 100 --over_high_ms 1024

Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Hui Zhu <[email protected]>
@kernel-patches-daemon-bpf

Upstream branch: b28dac3
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1050636
version: 6

@kernel-patches-review-bot

commit da7a5b313339b1c26dae93b1f2d1d371eb0c2618
Author: Hui Zhu <[email protected]>

selftests/bpf: Add test for memcg_bpf_ops hierarchies

This commit adds a selftest to validate the behavior of attaching
memcg_bpf_ops in a nested cgroup hierarchy, specifically testing
the BPF_F_ALLOW_OVERRIDE flag.

> diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
> index 8c787439f..378ee3b3b 100644
> --- a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
> +++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c

[ ... ]

> +void test_memcg_ops_hierarchies(void)
> +{
> +	int ret, first = -1, second = -1, third = -1;
> +	struct memcg_ops *skel = NULL;

[ ... ]

> +cleanup:
> +	bpf_link__destroy(link1);
> +	bpf_link__destroy(link2);
> +	bpf_link__destroy(link3);
> +	memcg_ops__detach(skel);
> +	memcg_ops__destroy(skel);

Can this crash if skel is NULL? Unlike the other tests in this file
which guard these calls with if (skel), this cleanup path calls
memcg_ops__detach() unconditionally. If any of the earlier goto cleanup
paths are taken before memcg_ops__open_and_load() succeeds, skel remains
NULL.

The generated skeleton __detach() function does not have a NULL check:

    static inline void
    %1$s__detach(struct %1$s *obj)
    {
        bpf_object__detach_skeleton(obj->skeleton);
    }

This would dereference NULL when accessing obj->skeleton.

For comparison, test_memcg_ops_over_high(), test_memcg_ops_below_low_over_high(),
and test_memcg_ops_below_min_over_high() in the same file all use:

    if (skel) {
        memcg_ops__detach(skel);
        memcg_ops__destroy(skel);
    }

> +	close(first);
> +	close(second);
> +	close(third);
> +	cleanup_cgroup_environment();
> +}
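
For illustration, the guarded cleanup pattern used by the other tests can be sketched in a self-contained form. `toy_skel` and its helpers are hypothetical stand-ins written for this sketch, not the generated `memcg_ops` skeleton API. The only assumption carried over is that detach dereferences its argument unconditionally, as in the template quoted above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a generated skeleton object. */
struct toy_skel {
	int attached;
};

/* Counts detach calls so the guard's effect can be observed. */
static int toy_detach_calls;

/* Like the quoted __detach() template, this dereferences skel unconditionally. */
static void toy_skel__detach(struct toy_skel *skel)
{
	skel->attached = 0;
	toy_detach_calls++;
}

static void toy_skel__destroy(struct toy_skel *skel)
{
	free(skel);
}

/*
 * Cleanup helper that is safe on every exit path, including those taken
 * before the skeleton was opened (skel == NULL).
 */
static void toy_cleanup(struct toy_skel *skel)
{
	if (skel) {
		toy_skel__detach(skel);
		toy_skel__destroy(skel);
	}
}
```

Calling `toy_cleanup(NULL)` is a no-op here, whereas calling `toy_skel__detach(NULL)` directly would dereference a NULL pointer.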



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: selftests/bpf: Add test for memcg_bpf_ops hierarchies
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660

AI-authorship-score: low
AI-authorship-explanation: The code follows consistent patterns with other tests in the file and uses standard BPF selftest conventions, suggesting human authorship with good domain knowledge.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: NULL pointer dereference crash in selftest cleanup path when cgroup setup fails, affecting test reliability but not production kernel code.

@kernel-patches-review-bot

commit fb84369b2a7c318b87f69c80a92ee3bb31f3114f
Author: Hui Zhu <[email protected]>

samples/bpf: Add memcg priority control example

This commit adds a sample BPF program demonstrating the memcg_bpf_ops
feature for priority-based memory throttling. The sample consists of a
BPF program that monitors PGFAULT events and a userspace loader.

> 1. memcg.bpf.c: A BPF program that monitors PGFAULT events on a
>    high-priority cgroup. When activity exceeds a threshold, it uses
>    the `get_high_delay_ms`, `below_low`, or `below_min` hooks to
>    apply pressure on a low-priority cgroup.

Is this description accurate? Looking at the code, below_low and
below_min are implemented in high_mcg_ops which is attached to the
high-priority cgroup:

    high_mcg_ops = {
        .below_low = (void *)below_low_impl,
        .below_min = (void *)below_min_impl,
    };

And in memcg.c, high_mcg_ops is attached via:

    .relative_fd = high_cgroup_fd,

According to the memcg_bpf_ops documentation, below_low and below_min
provide protection for a cgroup (returning true means the cgroup is
protected from reclaim), rather than applying pressure. Only
get_high_delay_ms (attached to the low-priority cgroup via low_mcg_ops)
actually throttles/applies pressure.

Could the commit message clarify that below_low/below_min protect the
high-priority cgroup, while get_high_delay_ms throttles the low-priority
one?

> diff --git a/samples/bpf/memcg.c b/samples/bpf/memcg.c
> new file mode 100644
> index 000000000..0ed174608
> --- /dev/null
> +++ b/samples/bpf/memcg.c

[ ... ]

> +	prog = bpf_object__find_program_by_name(obj,
> +						"handle_count_memcg_events");
> +	if (!prog) {
> +		fprintf(stderr,
> +			"ERROR: finding a prog in BPF object file failed\n");
> +		goto out;
> +	}

If bpf_object__find_program_by_name() fails here, should the err variable
be set to an error code before the goto? As written, err could be 0 (from
the previous successful bpf_map_update_elem() call), causing main() to
return 0 (success) even though the program failed.
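
A minimal sketch of the pattern being asked for, with the error code assigned before the jump. `toy_find_prog()` and `toy_load()` are hypothetical helpers invented for this sketch, and -ENOENT is only one plausible choice of error code; the sample may prefer a different one.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup standing in for bpf_object__find_program_by_name(). */
static const char *toy_find_prog(const char *const *names, size_t n,
				 const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(names[i], name) == 0)
			return names[i];
	return NULL;
}

/* Returns 0 on success or a negative errno value on failure. */
static int toy_load(const char *const *names, size_t n, const char *wanted)
{
	int err = 0; /* still 0 here, as after an earlier successful call */
	const char *prog;

	prog = toy_find_prog(names, n, wanted);
	if (!prog) {
		err = -ENOENT; /* set the error before jumping to cleanup */
		goto out;
	}

	/* ... attaching the program would go here ... */
out:
	return err;
}
```

Without the assignment inside the `if (!prog)` branch, `toy_load()` would return 0 for a missing program, which is the behavior questioned above.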




In-Reply-To-Subject: samples/bpf: Add memcg priority control example
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660

AI-authorship-score: low
AI-authorship-explanation: The technical misunderstanding about hook purposes and dual authorship suggest human collaboration rather than AI generation.
issues-found: 2
issue-severity-score: low
issue-severity-explanation: The issues found are a misleading commit message description and a missing error code assignment in sample code, neither of which causes system instability.

@kernel-patches-review-bot

commit e83943e725841e2380aca0cf073e0a359fc7a0f7
Author: Hui Zhu <[email protected]>

libbpf: Support passing user-defined flags for struct_ops

This patch adds a single line to pass the flags field from bpf_struct_ops_opts
to bpf_link_create_opts, enabling userspace to pass flags like
BPF_F_ALLOW_OVERRIDE when attaching struct_ops.

> The `bpf_struct_ops_opts` struct is extended with a `flags` member,
> which is then passed to the `bpf_link_create` syscall within
> `bpf_map__attach_struct_ops_opts`.

The commit message states the struct is "extended with a `flags` member", but
the flags field already exists in bpf_struct_ops_opts. It was added in commit
a09c73b18 ("libbpf: introduce bpf_map__attach_struct_ops_opts()"):

    struct bpf_struct_ops_opts {
        size_t sz;
        __u32 flags;
        __u32 relative_fd;
        __u64 expected_revision;
        size_t :0;
    };

This patch adds code to USE the existing flags field, not extend the struct.
Would a message like "Pass the existing flags field from bpf_struct_ops_opts
to bpf_link_create" be more accurate?




In-Reply-To-Subject: libbpf: Support passing user-defined flags for struct_ops
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660

AI-authorship-score: medium
AI-authorship-explanation: The commit message contains a factual inaccuracy about extending the struct when the field already existed, which could indicate AI-generated text that didn't fully analyze the prior commit in the series.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The only issue is an inaccurate commit message that claims to extend a struct with a field that already exists; this is a documentation issue, not a code bug.

@kernel-patches-review-bot

commit f201f3066f96e0cbfa957ce452f76fa883e14951
Author: Hui Zhu <[email protected]>

bpf: Pass flags in bpf_link_create for struct_ops

This commit extends bpf_struct_ops_link to include a flags field populated
from attr->link_create.flags during link creation, enabling struct_ops
implementations to interpret attachment flags.

> UAPI Change:
> This patch updates the comment in include/uapi/linux/bpf.h to reflect
> that the cgroup-bpf attach flags (such as BPF_F_ALLOW_OVERRIDE) are
> now applicable to both BPF_PROG_ATTACH and BPF_LINK_CREATE commands.

The commit message states that include/uapi/linux/bpf.h was updated, but
the diff shows only tools/include/uapi/linux/bpf.h was modified.

Should include/uapi/linux/bpf.h also be updated to match? Currently the
main UAPI header still says:

    /* cgroup-bpf attach flags used in BPF_PROG_ATTACH command

while the tools copy now says:

    /* cgroup-bpf attach flags used in BPF_PROG_ATTACH and BPF_LINK_CREATE command

This leaves the two copies inconsistent.




In-Reply-To-Subject: bpf: Pass flags in bpf_link_create for struct_ops
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660

AI-authorship-score: low
AI-authorship-explanation: The minimal code change, multi-author collaboration, and human-like error in the commit message suggest human authorship, though the well-structured commit message has some AI-like characteristics.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The commit message incorrectly claims to update include/uapi/linux/bpf.h when only tools/include/uapi/linux/bpf.h was modified, leaving the two UAPI header copies inconsistent.

@kernel-patches-review-bot

commit a09c73b18bab14463b86906d2d0fd2fd66e6198a
Author: Roman Gushchin <[email protected]>

libbpf: introduce bpf_map__attach_struct_ops_opts()

This commit adds bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops() that takes a struct bpf_struct_ops_opts argument
with fields including relative_fd, flags, and expected_revision for attaching
struct ops maps to cgroups.

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 0c8bf0b5c..70a00da54 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c

[ ... ]

> @@ -13503,7 +13510,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>  		return &link->link;
>  	}
>
> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> +	link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
> +
> +	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);

The expected_revision field from bpf_struct_ops_opts is defined in the public
API but doesn't appear to be forwarded here. bpf_link_create() reads
opts->cgroup.expected_revision for BPF_STRUCT_OPS, and similar functions like
bpf_program__attach_cgroup_opts() forward all fields including expected_revision.

Was the expected_revision field intentionally left unused, or should there be a
line here like:

    link_opts.cgroup.expected_revision = OPTS_GET(opts, expected_revision, 0);

Note: A later commit in this series (e83943e72 "libbpf: Support passing
user-defined flags for struct_ops") adds forwarding for the flags field, but
expected_revision remains unused through the end of the series.
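
To make the concern concrete, here is a self-contained sketch of size-checked option forwarding in which every caller-visible field is copied. The `toy_*` structs and the `TOY_OPTS_HAS`/`TOY_OPTS_GET` macros are hypothetical and written only for this sketch. They imitate the general idea behind libbpf's `OPTS_GET()` and are not its actual definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the caller-facing options quoted in this review. */
struct toy_struct_ops_opts {
	size_t sz; /* size of this struct as the caller compiled it */
	uint32_t flags;
	uint32_t relative_fd;
	uint64_t expected_revision;
};

/* Hypothetical stand-in for the cgroup part of the link-create options. */
struct toy_link_opts {
	uint32_t flags;
	uint32_t relative_fd;
	uint64_t expected_revision;
};

/* True if opts is non-NULL and large enough to contain `field`. */
#define TOY_OPTS_HAS(opts, type, field) \
	((opts) && (opts)->sz >= offsetof(type, field) + sizeof(((type *)0)->field))

/* Read `field` if the caller's struct has it, otherwise use `fallback`. */
#define TOY_OPTS_GET(opts, type, field, fallback) \
	(TOY_OPTS_HAS(opts, type, field) ? (opts)->field : (fallback))

/* Forward every public field, so nothing the caller sets is silently dropped. */
static void toy_forward_opts(const struct toy_struct_ops_opts *opts,
			     struct toy_link_opts *out)
{
	out->flags = TOY_OPTS_GET(opts, struct toy_struct_ops_opts, flags, 0);
	out->relative_fd =
		TOY_OPTS_GET(opts, struct toy_struct_ops_opts, relative_fd, 0);
	out->expected_revision =
		TOY_OPTS_GET(opts, struct toy_struct_ops_opts, expected_revision, 0);
}
```

In this model, dropping the last assignment in `toy_forward_opts()` would reproduce the situation described above: a caller could set `expected_revision` and get no error, yet the value would never reach the link-create options.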

> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index dfc37a615..5aef44bcf 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -920,6 +920,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +
> +struct bpf_struct_ops_opts {
> +	/* size of this struct, for forward/backward compatibility */
> +	size_t sz;
> +	__u32 flags;
> +	__u32 relative_fd;
> +	__u64 expected_revision;
                ^^^^^^^^^^^^^^^^^
The expected_revision field is defined here but is not read by
bpf_map__attach_struct_ops_opts() in libbpf.c, so any value set by users
will be silently ignored.

> +	size_t :0;
> +};




In-Reply-To-Subject: libbpf: introduce bpf_map__attach_struct_ops_opts()
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21665371660

AI-authorship-score: low
AI-authorship-explanation: The commit follows standard libbpf API extension patterns with consistent naming and structure typical of experienced kernel developers.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: The expected_revision field in the public API struct is silently ignored, which could confuse users but does not cause system instability or crashes.

@kernel-patches-daemon-bpf

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1050636 expired. Closing PR.
