
@kernel-patches-daemon-bpf-rc

Pull request for series with
subject: bpf: Allow decoupling memcg from sk->sk_prot->memory_allocated.
version: 6
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1000212

If memcg is enabled, accept() acquires lock_sock() twice for each new
TCP/MPTCP socket in inet_csk_accept() and __inet_accept().

Let's move memcg operations from inet_csk_accept() to __inet_accept().

Note that SCTP somehow allocates a new socket by sk_alloc() in
sk->sk_prot->accept() and clones fields manually, instead of using
sk_clone_lock().

mem_cgroup_sk_alloc() is called for SCTP before __inet_accept(),
so I added the protocol check in __inet_accept(), but this can be
removed once SCTP uses sk_clone_lock().
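A rough sketch of the intended flow (not the actual diff; the exact
check and charging logic are assumptions) inside __inet_accept():

  /* Do the memcg setup under the lock_sock() that __inet_accept()
   * already holds for the child, instead of a second lock/unlock pair
   * in inet_csk_accept().  SCTP is skipped because its ->accept()
   * already called mem_cgroup_sk_alloc() via sk_alloc().
   */
  if (mem_cgroup_sockets_enabled && newsk->sk_protocol != IPPROTO_SCTP) {
          mem_cgroup_sk_alloc(newsk);
          /* ... charge the memory the child accumulated before accept() ... */
  }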

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>

We will store a flag in sk->sk_memcg by bpf_setsockopt() at the
BPF_CGROUP_INET_SOCK_CREATE hook.

BPF_CGROUP_INET_SOCK_CREATE is invoked by __cgroup_bpf_run_filter_sk()
that passes a pointer to struct sock to the bpf prog as void *ctx.

However, there is no bpf_func_proto for bpf_setsockopt() that receives
the ctx as a pointer to struct sock.

Also, bpf_getsockopt() will be necessary for a cgroup with multiple
bpf progs running.

Let's add new bpf_setsockopt() and bpf_getsockopt() variants for
BPF_CGROUP_INET_SOCK_CREATE.

Note that inet_create() does not run under lock_sock() and has the
same semantics as bpf_lsm_unlocked_sockopt_hooks.
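
As a sketch only (the function and proto names are assumptions, modeled
on the existing bpf_sk_setsockopt() and __bpf_setsockopt() in
net/core/filter.c), a ctx-based variant would look roughly like:

  BPF_CALL_5(bpf_sock_create_setsockopt, struct sock *, sk,
             int, level, int, optname, char *, optval, int, optlen)
  {
          /* inet_create() runs without lock_sock(), so reuse the
           * unlocked path shared with bpf_lsm_unlocked_sockopt_hooks.
           */
          return __bpf_setsockopt(sk, level, optname, optval, optlen);
  }

  static const struct bpf_func_proto bpf_sock_create_setsockopt_proto = {
          .func           = bpf_sock_create_setsockopt,
          .gpl_only       = false,
          .ret_type       = RET_INTEGER,
          .arg1_type      = ARG_PTR_TO_CTX,
          .arg2_type      = ARG_ANYTHING,
          .arg3_type      = ARG_ANYTHING,
          .arg4_type      = ARG_PTR_TO_MEM | MEM_RDONLY,
          .arg5_type      = ARG_CONST_SIZE,
  };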

Signed-off-by: Kuniyuki Iwashima <[email protected]>
We will decouple sockets from the global protocol memory accounting
if sockets have SK_BPF_MEMCG_SOCK_ISOLATED.

This can be set (and cleared) at the BPF_CGROUP_INET_SOCK_CREATE
hook by bpf_setsockopt() and is inherited by child sockets.

  u32 flags = SK_BPF_MEMCG_SOCK_ISOLATED;

  bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                 &flags, sizeof(flags));
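
A fuller sketch of such a cgroup/sock_create program (assuming the
SK_BPF_MEMCG_FLAGS optname and SK_BPF_MEMCG_SOCK_ISOLATED flag from
this series are visible in the patched uapi headers; program and
function names are illustrative):

  // SPDX-License-Identifier: GPL-2.0
  #include <linux/bpf.h>
  #include <sys/socket.h>
  #include <bpf/bpf_helpers.h>

  SEC("cgroup/sock_create")
  int isolate_sk_memcg(struct bpf_sock *ctx)
  {
          __u32 flags = SK_BPF_MEMCG_SOCK_ISOLATED;

          /* Charge this socket to memcg only, not to
           * sk->sk_prot->memory_allocated.
           */
          bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_MEMCG_FLAGS,
                         &flags, sizeof(flags));

          return 1;       /* allow socket creation */
  }

  char LICENSE[] SEC("license") = "GPL";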

SK_BPF_MEMCG_FLAGS is only supported at BPF_CGROUP_INET_SOCK_CREATE
and not on other hooks, for the following reasons:

  1. UDP charges memory under sk->sk_receive_queue.lock instead
     of lock_sock()

  2. For TCP child sockets, memory accounting is adjusted only in
     __inet_accept(), to which the sk->sk_memcg allocation is deferred

  3. Modifying the flag after an skb has been charged to the socket
     would require the same adjustment during bpf_setsockopt() and
     would complicate the logic unnecessarily

We can support other hooks later if a real use case justifies that.

Given sk->sk_memcg can be accessed in the fast path, it would
be preferable to place the flag field in the same cache line as
sk->sk_memcg.

However, struct sock does not have such a 1-byte hole.

Let's store the flag in the lowest bit of sk->sk_memcg and add
a helper to check the bit.
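
A minimal sketch of the idea (only mem_cgroup_sk_isolated() is named by
this series; the flag value and the accessor that strips it are
assumptions):

  #define SK_MEMCG_ISOLATED       1UL

  static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
  {
          /* struct mem_cgroup is at least word-aligned, so bit 0 of the
           * pointer is always free; mask it off before dereferencing.
           */
          return (struct mem_cgroup *)((unsigned long)sk->sk_memcg &
                                       ~SK_MEMCG_ISOLATED);
  }

  static inline bool mem_cgroup_sk_isolated(const struct sock *sk)
  {
          return (unsigned long)sk->sk_memcg & SK_MEMCG_ISOLATED;
  }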

In the next patch, if mem_cgroup_sk_isolated() returns true,
the socket will not be charged to sk->sk_prot->memory_allocated.

Signed-off-by: Kuniyuki Iwashima <[email protected]>

…ing.

Some protocols (e.g., TCP, UDP) implement memory accounting for socket
buffers and charge memory to per-protocol global counters pointed to by
sk->sk_prot->memory_allocated.

When running under a non-root cgroup, this memory is also charged to the
memcg as "sock" in memory.stat.

We do not need to pay the cost of two orthogonal memory accounting
mechanisms.

Let's decouple sockets in memcg from the global per-protocol memory
accounting if sockets have SK_BPF_MEMCG_SOCK_ISOLATED in sk->sk_memcg.

Note that this does NOT disable memcg accounting, but only the
per-protocol one.

If mem_cgroup_sk_isolated(sk) returns true, the per-protocol memory
accounting is skipped.
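
For example (a sketch of the idea, not the exact diff), the charge path
in __sk_mem_raise_allocated() would leave the per-protocol counter
untouched for isolated sockets:

  if (!mem_cgroup_sk_isolated(sk))
          sk_memory_allocated_add(sk, amt);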

  # cat /sys/fs/cgroup/test/memory.stat | grep sock
  sock 2757468160 <--------------------------------- charged to memcg
  # cat /proc/net/sockstat | grep TCP
  TCP: inuse 2006 orphan 0 tw 0 alloc 2008 mem 0 <-- not charged to tcp_mem

In __inet_accept(), we need to reclaim the per-protocol counts that were
already charged for child sockets, because sk->sk_memcg is not allocated
until accept().

trace_sock_exceed_buf_limit() will always show 0 as the accounted amount
for isolated sockets, but the amount can be obtained via memory.stat.

Most changes are in inlined code and hard to trace, but a microbenchmark
on __sk_mem_raise_allocated() during neper/tcp_stream showed that more
samples completed faster with the bpf prog attached.  The difference
will be more visible under tcp_mem pressure.

  # bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; }
    kretprobe:__sk_mem_raise_allocated /@start[tid]/
    { @EnD[tid] = nsecs - @start[tid]; @times = hist(@EnD[tid]); delete(@start[tid]); }'
  # tcp_stream -6 -F 1000 -N -T 256

Without bpf prog:

  [128, 256)          3846 |                                                    |
  [256, 512)       1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
  [1K, 2K)          198207 |@@@@@@                                              |
  [2K, 4K)           31199 |@                                                   |

With bpf prog in the next patch:
  (must be attached before tcp_stream)
  # bpftool prog load sk_memcg.bpf.o /sys/fs/bpf/sk_memcg type cgroup/sock_create
  # bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/sk_memcg

  [128, 256)          6413 |                                                    |
  [256, 512)       1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [512, 1K)        1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
  [1K, 2K)          117031 |@@@@                                                |
  [2K, 4K)           11773 |                                                    |

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Nacked-by: Johannes Weiner <[email protected]>

The test does the following for IPv4/IPv6 x TCP/UDP sockets
with/without SK_BPF_MEMCG_SOCK_ISOLATED.

  1. Create socket pairs
  2. Send a bunch of data that requires more than 1024 pages
  3. Read memory_allocated from sk->sk_prot->memory_allocated and
     sk->sk_prot->memory_per_cpu_fw_alloc
  4. Check if unread data is charged to memory_allocated

If SK_BPF_MEMCG_SOCK_ISOLATED is set, memory_allocated should not
change, but we allow a small error (up to 10 pages) in case other
processes on the host use some amount of TCP/UDP memory.

The number of allocated pages is buffered in the per-cpu variable
{tcp,udp}_memory_per_cpu_fw_alloc, up to +/- net.core.mem_pcpu_rsv,
before being reported to {tcp,udp}_memory_allocated.

In step 3, memory_allocated is calculated from the two variables twice,
at fentry and fexit of the socket create function, to check whether the
per-cpu value was drained during the calculation.  In that case, step 3
is retried.
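
A sketch of that calculation from BPF (the attach point and variable
names here are assumptions, not the selftest's exact code):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  #define MAX_CPUS 256

  extern atomic_long_t tcp_memory_allocated __ksym;
  extern int tcp_memory_per_cpu_fw_alloc __ksym;  /* per-cpu */

  long fentry_allocated, fexit_allocated;

  static long sum_memory_allocated(void)
  {
          long sum = tcp_memory_allocated.counter;
          int *per_cpu;
          int i;

          /* Add the per-cpu forward-alloc cache on top of the global
           * counter; bpf_per_cpu_ptr() returns NULL past the last CPU.
           */
          for (i = 0; i < MAX_CPUS; i++) {
                  per_cpu = bpf_per_cpu_ptr(&tcp_memory_per_cpu_fw_alloc, i);
                  if (!per_cpu)
                          break;
                  sum += *per_cpu;
          }
          return sum;
  }

  SEC("fentry/inet_create")
  int BPF_PROG(fentry_create)
  {
          fentry_allocated = sum_memory_allocated();
          return 0;
  }

  SEC("fexit/inet_create")
  int BPF_PROG(fexit_create)
  {
          fexit_allocated = sum_memory_allocated();
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";

If the two values disagree, the per-cpu cache was drained while they
were being read, and step 3 is retried.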

We use kern_sync_rcu() for UDP because the UDP receive queue is
destroyed after an RCU grace period.

The test takes ~2s on QEMU (64 CPUs) w/ KVM but takes 6s w/o KVM.

  # time ./test_progs -t sk_memcg
  #370/1   sk_memcg/TCP  :OK
  #370/2   sk_memcg/UDP  :OK
  #370/3   sk_memcg/TCPv6:OK
  #370/4   sk_memcg/UDPv6:OK
  #370     sk_memcg:OK
  Summary: 1/4 PASSED, 0 SKIPPED, 0 FAILED

  real	0m1.623s
  user	0m0.165s
  sys	0m0.366s

Signed-off-by: Kuniyuki Iwashima <[email protected]>
@kernel-patches-daemon-bpf-rc

Upstream branch: 2dfd8b8
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1000212
version: 6

@kernel-patches-daemon-bpf-rc

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1000212 expired. Closing PR.
