
@kernel-patches-daemon-bpf-rc

Pull request for series with
subject: mm, bpf: BPF-MM, BPF-THP
version: 10
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1011915

laoar added 9 commits October 15, 2025 07:26
The khugepaged_enter_vma() function requires handling in two specific
scenarios:
1. New VMA creation
  When a new VMA is created (for anon vma, it is deferred to pagefault), if
  vma->vm_mm is not present in khugepaged_mm_slot, it must be added. In
  this case, khugepaged_enter_vma() is called after vma->vm_flags have been
  set, allowing direct use of the VMA's flags.
2. VMA flag modification
  When vma->vm_flags are modified (particularly when VM_HUGEPAGE is set),
  the system must recheck whether to add vma->vm_mm to khugepaged_mm_slot.
  Currently, khugepaged_enter_vma() is called before the flag update, so
  the call must be relocated to occur after vma->vm_flags have been set.

In the VMA merging path, khugepaged_enter_vma() is also called. For this
case, since VMA merging only occurs when the vm_flags of both VMAs are
identical (excluding special flags like VM_SOFTDIRTY), we can safely use
target->vm_flags instead. (It is worth noting that khugepaged_enter_vma()
can be removed from the VMA merging path because the VMA has already been
added in the two aforementioned cases. We will address this cleanup in a
separate patch.)

After this change, we can further remove the vm_flags parameter from
thp_vma_allowable_order(). That will be handled in a followup patch.

Signed-off-by: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Usama Arif <[email protected]>
Because all calls to thp_vma_allowable_order() pass vma->vm_flags as the
vma_flags argument, we can remove the parameter and have the function
access vma->vm_flags directly.

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Usama Arif <[email protected]>
This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, in khugepaged, during
  swap-in, or on another path.
- VMA's memory advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
  PSI system data or associated cgroup PSI metrics

The kernel API of this new BPF hook is as follows:

/**
 * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
 * @vma: vm_area_struct associated with the THP allocation
 * @type: TVA type for current @vma
 * @orders: Bitmask of available THP orders for this allocation
 *
 * Return: The suggested THP order for allocation from the BPF program. Must be
 *         a valid, available order.
 */
typedef int thp_order_fn_t(struct vm_area_struct *vma,
			   enum tva_type type,
			   unsigned long orders);

Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.

This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.

This BPF hook enables the implementation of flexible THP allocation
policies at the system, per-cgroup, or per-task level.

This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note
that this capability is currently unstable and may undergo significant
changes—including potential removal—in future kernel versions.

Signed-off-by: Yafang Shao <[email protected]>
The new BPF capability enables finer-grained THP policy decisions by
introducing separate handling for swap faults versus normal page faults.

As highlighted by Barry:

  We’ve observed that swapping in large folios can lead to more
  swap thrashing for some workloads, e.g. kernel build. Consequently,
  some workloads might prefer swapping in smaller folios than those
  allocated by alloc_anon_folio().

While prctl() could potentially be extended to leverage this new policy,
doing so would require modifications to the uAPI.

Signed-off-by: Yafang Shao <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: Usama Arif <[email protected]>
Cc: Barry Song <[email protected]>
khugepaged_enter_vma() ultimately invokes any attached BPF function with
the TVA_KHUGEPAGED flag set when determining whether or not to enable
khugepaged THP for a freshly faulted in VMA.

Currently, on fault, we invoke this in do_huge_pmd_anonymous_page() (called
from create_huge_pmd()), and only after we have already checked whether an
allowable TVA_PAGEFAULT order is specified.

Since we might want to disallow THP on fault-in but allow it via
khugepaged, we move things around so we always attempt to enter
khugepaged upon fault.

This change is safe because:
- khugepaged operates at the MM level rather than per-VMA. Even if the THP
  allocation fails during a page fault due to transient conditions
  (e.g., memory pressure), it is safe to add this MM to khugepaged for
  subsequent defragmentation.
- If __thp_vma_allowable_orders(TVA_PAGEFAULT) returns 0, then
  __thp_vma_allowable_orders(TVA_KHUGEPAGED) will also return 0.

While we could also extend prctl() to utilize this new policy, such a
change would require a uAPI modification to PR_SET_THP_DISABLE.

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lance Yang <[email protected]>
Cc: Usama Arif <[email protected]>
When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The
owner can be NULL. With this change, BPF helpers can safely access
mm->owner to retrieve the associated task from the mm. We can then make
policy decisions based on task attributes.

The typical use case is as follows:

  bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
      goto out;

  /* Do something based on the task attribute */

out:
  bpf_rcu_read_unlock();

Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
The vma->vm_mm pointer might be NULL, and it can be accessed outside of RCU;
thus, we can mark it as trusted_or_null. With this change, BPF helpers can
safely access vma->vm_mm to retrieve the associated mm_struct from the VMA.
We can then make policy decisions based on the VMA.

The "trusted" annotation enables direct access to vma->vm_mm within kfuncs
marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and
bpf_task_under_cgroup(). Conversely, "null" enforcement requires all
callsites using vma->vm_mm to perform NULL checks.

The lsm selftest must be modified because it directly accesses vma->vm_mm
without a NULL pointer check; otherwise it will break due to this
change.

For the VMA based THP policy, the use case is as follows:

  @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
  if (!@mm)
      return;
  bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
      goto out;
  @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);

  /* make the decision based on the @cgroup1 attribute */

  bpf_cgroup_release(@cgroup1); // release the associated cgroup
out:
  bpf_rcu_read_unlock();

PSI memory information can be obtained from the associated cgroup to inform
policy decisions. Since upstream PSI support is currently limited to cgroup
v2, the following example demonstrates cgroup v2 implementation:

  @owner = @mm->owner;
  if (@owner) {
      // @ancestor_cgid is user-configured
      @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
      if (bpf_task_under_cgroup(@owner, @ancestor)) {
          @psi_group = @ancestor->psi;

          /* Extract PSI metrics from @psi_group and
           * implement policy logic based on the values
           */

      }
  }

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
Cc: "Liam R. Howlett" <[email protected]>
This test case implements a basic THP policy that sets THPeligible to 1 for
a specific task and to 0 for all others. I selected THPeligible for
verification because its straightforward nature makes it ideal for
validating the BPF THP policy functionality.

The following configs must be enabled for this test:

  CONFIG_BPF_THP=y
  CONFIG_MEMCG=y
  CONFIG_TRANSPARENT_HUGEPAGE=y

Signed-off-by: Yafang Shao <[email protected]>
Add the documentation.

Signed-off-by: Yafang Shao <[email protected]>
@kernel-patches-daemon-bpf-rc
Author

Upstream branch: 48a97ff
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1011915
version: 10

@kernel-patches-review-bot

commit 058107f10f0c4defd1fba88ee18f74ae5c206294
Author: Yafang Shao <[email protected]>

mm: thp: add support for BPF based THP order selection

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning through a hook that allows BPF programs to influence THP order
selection. The implementation tracks mm_struct instances using a list
managed with RCU-protected pointers and spinlock synchronization.

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 46126ce2f..50faf3860 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16520,6 +16520,7 @@ F:	include/linux/huge_mm.h
>  F:	include/linux/khugepaged.h
>  F:	include/trace/events/huge_memory.h
>  F:	mm/huge_memory.c
> +F:	mm/huge_memory_bpf.c
>  F:	mm/khugepaged.c
>  F:	mm/mm_slot.h
>  F:	tools/testing/selftests/mm/khugepaged.c

[ ... ]

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 90e5790c3..9d4dc6169 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -33,6 +33,7 @@
>  struct address_space;
>  struct futex_private_hash;
>  struct mem_cgroup;
> +struct bpf_mm_ops;
>
>  typedef struct {
>  	unsigned long f;
> @@ -930,6 +931,19 @@ struct mm_cid {
>  };
>  #endif
>
> +#ifdef CONFIG_BPF_THP
> +struct bpf_thp_ops;
> +#endif
> +
> +#ifdef CONFIG_BPF_MM
> +struct bpf_mm_ops {
> +#ifdef CONFIG_BPF_THP
> +	struct bpf_thp_ops __rcu *bpf_thp;
> +	struct list_head bpf_thp_list;
> +#endif
> +};
> +#endif
> +

[ ... ]

> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> new file mode 100644
> index 000000000..24ab432cb
> --- /dev/null
> +++ b/mm/huge_memory_bpf.c

[ ... ]

> +void bpf_thp_fork(struct mm_struct *mm, struct mm_struct *old_mm)
> +{
> +	struct bpf_thp_mm_list *mm_list;
> +	struct bpf_thp_ops *bpf_thp;
> +
> +	if (!rcu_access_pointer(old_mm->bpf_mm.bpf_thp))
> +		return;
> +
> +	spin_lock(&thp_ops_lock);
> +	bpf_thp = rcu_dereference_protected(old_mm->bpf_mm.bpf_thp,
> +					    lockdep_is_held(&thp_ops_lock));
> +	if (!bpf_thp) {
> +		spin_unlock(&thp_ops_lock);
> +		return;
> +	}
> +
> +	/* The new mm is still under initilization */
> +	RCU_INIT_POINTER(mm->bpf_mm.bpf_thp, bpf_thp);
> +
> +	mm_list = &bpf_thp->mm_list;
> +	list_add_tail(&mm->bpf_mm.bpf_thp_list, &mm_list->list);
                              ^^^^^^^^^^^^^^^^^^^^^^^
> +	spin_unlock(&thp_ops_lock);
> +}

Is the mm->bpf_mm.bpf_thp_list field initialized before this
list_add_tail() call?

In kernel/fork.c, mm_alloc() does memset(mm, 0, sizeof(*mm)) followed
by mm_init(). The bpf_thp_list field is never explicitly initialized
with INIT_LIST_HEAD() before bpf_thp_fork() is called from dup_mmap().

This means list_add_tail() operates on an uninitialized list_head with
NULL prev/next pointers instead of a properly initialized node (where
prev and next point to itself). This pattern differs from how other
mm_struct list fields are handled - for example, kernel/fork.c properly
calls INIT_LIST_HEAD(&mm->mmlist) during mm initialization.

The execution path is:
  copy_mm() -> dup_mm() -> mm_alloc() [memset to zero]
           -> mm_init() [no bpf_thp_list init]
           -> dup_mmap() -> bpf_thp_fork() [list_add_tail on uninitialized node]

Can this corrupt the bpf_thp ops tracking list?

[ ... ]

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5fd3b80fd..8ac7d3046 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1844,6 +1844,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  	vma_iter_free(&vmi);
>  	if (!retval) {
>  		mt_set_in_rcu(vmi.mas.tree);
> +		bpf_thp_fork(mm, oldmm);
>  		ksm_fork(mm, oldmm);
>  		khugepaged_fork(mm, oldmm);
>  	} else {



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: thp: add support for BPF based THP order selection
CI run summary: https://github.com/kernel-patches/bpf-rc/actions/runs/18532322684

@kernel-patches-daemon-bpf-rc
Author

Forwarding comment 3406802393 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-review-bot

commit 64eb2583d7e5f843344ab11c71f8a0ba0c0628c0
Author: Yafang Shao <[email protected]>

mm: thp: remove vm_flags parameter from khugepaged_enter_vma()

This patch refactors khugepaged_enter_vma() to remove the vm_flags
parameter and introduces a new khugepaged_enter_mm() helper. The change
also relocates the khugepaged registration call in the madvise path to
occur after vm_flags are updated rather than before.

> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index eb1946a70..b30814d3d 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h

[ ... ]

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4..ac6601f30 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c

[ ... ]

> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index abe54f004..0849fb652 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c

[ ... ]

> diff --git a/mm/madvise.c b/mm/madvise.c
> index fb1c86e63..8de7c3930 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1425,6 +1425,13 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>  	VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
>
>  	error = madvise_update_vma(new_flags, madv_behavior);
> +	/*
> +	 * If the vma become good for khugepaged to scan,
> +	 * register it here without waiting a page fault that
> +	 * may not happen any time soon.
> +	 */
> +	if (!error && new_flags & VM_HUGEPAGE)
> +		khugepaged_enter_mm(vma->vm_mm);

Can this access freed memory? In madvise_vma_behavior(), the local
variable 'vma' is captured at the function entry. When
madvise_update_vma() merges VMAs, it may free the original VMA and
update madv_behavior->vma to point to a different VMA (the merge result).

Looking at the VMA merge code in vma_merge_existing_range():

    vmg->__remove_middle = left_side && right_side;

When the merge spans the entire VMA, the middle VMA (the original 'vma')
is freed in vma_complete():

    vm_area_free(vp->remove);

After madvise_update_vma() returns, the code uses the local 'vma'
variable to access vma->vm_mm, but this VMA may have been freed if
merging occurred. The correct approach would be to use
madv_behavior->vma->vm_mm, which madvise_update_vma() updates to point
to the valid VMA after merging.

> diff --git a/mm/vma.c b/mm/vma.c
> index abe0da33c..872eb3c92 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
CI run summary: https://github.com/kernel-patches/bpf-rc/actions/runs/18532322684

@kernel-patches-daemon-bpf-rc
Author

Forwarding comment 3406846173 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf-rc
Author

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1011915 expired. Closing PR.

@kernel-patches-daemon-bpf-rc bot deleted the series/1011915=>bpf-next branch October 17, 2025 22:24