
@kernel-patches-daemon-bpf-rc

Pull request for series with
subject: mm, bpf: BPF-MM, BPF-THP
version: 10
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1011915

laoar added 9 commits October 15, 2025 07:26
The khugepaged_enter_vma() function requires handling in two specific
scenarios:
1. New VMA creation
  When a new VMA is created (for anon vma, it is deferred to pagefault), if
  vma->vm_mm is not present in khugepaged_mm_slot, it must be added. In
  this case, khugepaged_enter_vma() is called after vma->vm_flags have been
  set, allowing direct use of the VMA's flags.
2. VMA flag modification
  When vma->vm_flags are modified (particularly when VM_HUGEPAGE is set),
  the system must recheck whether to add vma->vm_mm to khugepaged_mm_slot.
  Currently, khugepaged_enter_vma() is called before the flag update, so
  the call must be relocated to occur after vma->vm_flags have been set.

In the VMA merging path, khugepaged_enter_vma() is also called. For this
case, since VMA merging only occurs when the vm_flags of both VMAs are
identical (excluding special flags like VM_SOFTDIRTY), we can safely use
target->vm_flags instead. (It is worth noting that khugepaged_enter_vma()
can be removed from the VMA merging path because the VMA has already been
added in the two aforementioned cases. We will address this cleanup in a
separate patch.)

After this change, we can further remove the vm_flags parameter from
thp_vma_allowable_order(). That will be handled in a followup patch.

Signed-off-by: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Usama Arif <[email protected]>
Because all calls to thp_vma_allowable_order() pass vma->vm_flags as the
vma_flags argument, we can remove the parameter and have the function
access vma->vm_flags directly.

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Usama Arif <[email protected]>
This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, in khugepaged, during
  swap-in, or on another path.
- VMA's memory advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
  PSI system data or associated cgroup PSI metrics

The kernel API of this new BPF hook is as follows:

/**
 * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
 * @vma: vm_area_struct associated with the THP allocation
 * @type: TVA type for current @vma
 * @orders: Bitmask of available THP orders for this allocation
 *
 * Return: The suggested THP order for allocation from the BPF program. Must be
 *         a valid, available order.
 */
typedef int thp_order_fn_t(struct vm_area_struct *vma,
			   enum tva_type type,
			   unsigned long orders);

Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.

This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.

This BPF hook enables the implementation of flexible THP allocation
policies at the system, per-cgroup, or per-task level.

This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note
that this capability is currently unstable and may undergo significant
changes—including potential removal—in future kernel versions.

Signed-off-by: Yafang Shao <[email protected]>
The new BPF capability enables finer-grained THP policy decisions by
introducing separate handling for swap faults versus normal page faults.

As highlighted by Barry:

  We’ve observed that swapping in large folios can lead to more
  swap thrashing for some workloads, e.g. kernel build. Consequently,
  some workloads might prefer swapping in smaller folios than those
  allocated by alloc_anon_folio().

While prctl() could potentially be extended to leverage this new policy,
doing so would require modifications to the uAPI.

Signed-off-by: Yafang Shao <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: Usama Arif <[email protected]>
Cc: Barry Song <[email protected]>
khugepaged_enter_vma() ultimately invokes any attached BPF function with
the TVA_KHUGEPAGED flag set when determining whether or not to enable
khugepaged THP for a freshly faulted in VMA.

Currently, on fault, we invoke this in do_huge_pmd_anonymous_page() (called
from create_huge_pmd()), and only after we have already checked whether an
allowable TVA_PAGEFAULT order is specified.

Since we might want to disallow THP on fault-in but allow it via
khugepaged, we move things around so we always attempt to enter
khugepaged upon fault.

This change is safe because:
- khugepaged operates at the MM level rather than per-VMA. Even if the THP
  allocation fails during a page fault due to transient conditions
  (e.g., memory pressure), it is safe to add this MM to khugepaged for
  subsequent defragmentation.
- If __thp_vma_allowable_orders(TVA_PAGEFAULT) returns 0, then
  __thp_vma_allowable_orders(TVA_KHUGEPAGED) will also return 0.

While we could also extend prctl() to utilize this new policy, such a
change would require a uAPI modification to PR_SET_THP_DISABLE.

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lance Yang <[email protected]>
Cc: Usama Arif <[email protected]>
When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The
owner can be NULL. With this change, BPF helpers can safely access
mm->owner to retrieve the associated task from the mm. We can then make
policy decisions based on task attributes.

The typical use case is as follows:

  bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
      goto out;

  /* Do something based on the task attribute */

out:
  bpf_rcu_read_unlock();

Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
The vma->vm_mm pointer might be NULL, and it can be accessed outside of RCU;
thus, we can mark it as trusted_or_null. With this change, BPF helpers can
safely access vma->vm_mm to retrieve the associated mm_struct from the VMA.
We can then make policy decisions based on the VMA.

The "trusted" annotation enables direct access to vma->vm_mm within kfuncs
marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and
bpf_task_under_cgroup(). Conversely, "null" enforcement requires all
callsites using vma->vm_mm to perform NULL checks.

The lsm selftest must be modified because it directly accesses vma->vm_mm
without a NULL pointer check; otherwise it will break due to this
change.

For the VMA based THP policy, the use case is as follows:

  @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
  if (!@mm)
      return;
  bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
      goto out;
  @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);

  /* make the decision based on the @cgroup1 attribute */

  bpf_cgroup_release(@cgroup1); // release the associated cgroup
out:
  bpf_rcu_read_unlock();

PSI memory information can be obtained from the associated cgroup to inform
policy decisions. Since upstream PSI support is currently limited to cgroup
v2, the following example demonstrates cgroup v2 implementation:

  @owner = @mm->owner;
  if (@owner) {
      // @ancestor_cgid is user-configured
      @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
      if (bpf_task_under_cgroup(@owner, @ancestor)) {
          @psi_group = @ancestor->psi;

          /* Extract PSI metrics from @psi_group and
           * implement policy logic based on the values
           */

      }
  }

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lorenzo Stoakes <[email protected]>
Cc: "Liam R. Howlett" <[email protected]>
This test case implements a basic THP policy that sets THPeligible to 1 for
a specific task and to 0 for all others. I selected THPeligible for
verification because its straightforward nature makes it ideal for
validating the BPF THP policy functionality.

The following configs must be enabled for this test:

  CONFIG_BPF_THP=y
  CONFIG_MEMCG=y
  CONFIG_TRANSPARENT_HUGEPAGE=y

Signed-off-by: Yafang Shao <[email protected]>
Add the documentation.

Signed-off-by: Yafang Shao <[email protected]>
@kernel-patches-daemon-bpf-rc
Author

Upstream branch: 48a97ff
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1011915
version: 10

@kernel-patches-review-bot

commit 058107f10f0c4defd1fba88ee18f74ae5c206294
Author: Yafang Shao <[email protected]>

mm: thp: add support for BPF based THP order selection

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning through a hook that allows BPF programs to influence THP order
selection. The implementation tracks mm_struct instances using a list
managed with RCU-protected pointers and spinlock synchronization.

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 46126ce2f..50faf3860 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16520,6 +16520,7 @@ F:	include/linux/huge_mm.h
>  F:	include/linux/khugepaged.h
>  F:	include/trace/events/huge_memory.h
>  F:	mm/huge_memory.c
> +F:	mm/huge_memory_bpf.c
>  F:	mm/khugepaged.c
>  F:	mm/mm_slot.h
>  F:	tools/testing/selftests/mm/khugepaged.c

[ ... ]

> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 90e5790c3..9d4dc6169 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -33,6 +33,7 @@
>  struct address_space;
>  struct futex_private_hash;
>  struct mem_cgroup;
> +struct bpf_mm_ops;
>
>  typedef struct {
>  	unsigned long f;
> @@ -930,6 +931,19 @@ struct mm_cid {
>  };
>  #endif
>
> +#ifdef CONFIG_BPF_THP
> +struct bpf_thp_ops;
> +#endif
> +
> +#ifdef CONFIG_BPF_MM
> +struct bpf_mm_ops {
> +#ifdef CONFIG_BPF_THP
> +	struct bpf_thp_ops __rcu *bpf_thp;
> +	struct list_head bpf_thp_list;
> +#endif
> +};
> +#endif
> +

[ ... ]

> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> new file mode 100644
> index 000000000..24ab432cb
> --- /dev/null
> +++ b/mm/huge_memory_bpf.c

[ ... ]

> +void bpf_thp_fork(struct mm_struct *mm, struct mm_struct *old_mm)
> +{
> +	struct bpf_thp_mm_list *mm_list;
> +	struct bpf_thp_ops *bpf_thp;
> +
> +	if (!rcu_access_pointer(old_mm->bpf_mm.bpf_thp))
> +		return;
> +
> +	spin_lock(&thp_ops_lock);
> +	bpf_thp = rcu_dereference_protected(old_mm->bpf_mm.bpf_thp,
> +					    lockdep_is_held(&thp_ops_lock));
> +	if (!bpf_thp) {
> +		spin_unlock(&thp_ops_lock);
> +		return;
> +	}
> +
> +	/* The new mm is still under initilization */
> +	RCU_INIT_POINTER(mm->bpf_mm.bpf_thp, bpf_thp);
> +
> +	mm_list = &bpf_thp->mm_list;
> +	list_add_tail(&mm->bpf_mm.bpf_thp_list, &mm_list->list);
                              ^^^^^^^^^^^^^^^^^^^^^^^
> +	spin_unlock(&thp_ops_lock);
> +}

Is the mm->bpf_mm.bpf_thp_list field initialized before this
list_add_tail() call?

In kernel/fork.c, mm_alloc() does memset(mm, 0, sizeof(*mm)) followed
by mm_init(). The bpf_thp_list field is never explicitly initialized
with INIT_LIST_HEAD() before bpf_thp_fork() is called from dup_mmap().

This means list_add_tail() operates on an uninitialized list_head with
NULL prev/next pointers instead of a properly initialized node (where
prev and next point to itself). This pattern differs from how other
mm_struct list fields are handled - for example, kernel/fork.c properly
calls INIT_LIST_HEAD(&mm->mmlist) during mm initialization.

The execution path is:
  copy_mm() -> dup_mm() -> mm_alloc() [memset to zero]
           -> mm_init() [no bpf_thp_list init]
           -> dup_mmap() -> bpf_thp_fork() [list_add_tail on uninitialized node]

Can this corrupt the bpf_thp ops tracking list?

[ ... ]

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5fd3b80fd..8ac7d3046 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1844,6 +1844,7 @@ __latent_entropy int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
>  	vma_iter_free(&vmi);
>  	if (!retval) {
>  		mt_set_in_rcu(vmi.mas.tree);
> +		bpf_thp_fork(mm, oldmm);
>  		ksm_fork(mm, oldmm);
>  		khugepaged_fork(mm, oldmm);
>  	} else {



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: thp: add support for BPF based THP order selection
CI run summary: https://github.com/kernel-patches/bpf-rc/actions/runs/18532322684

@kernel-patches-daemon-bpf-rc
Author

Forwarding comment 3406802393 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-review-bot

commit 64eb2583d7e5f843344ab11c71f8a0ba0c0628c0
Author: Yafang Shao <[email protected]>

mm: thp: remove vm_flags parameter from khugepaged_enter_vma()

This patch refactors khugepaged_enter_vma() to remove the vm_flags
parameter and introduces a new khugepaged_enter_mm() helper. The change
also relocates the khugepaged registration call in the madvise path to
occur after vm_flags are updated rather than before.

> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index eb1946a70..b30814d3d 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h

[ ... ]

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4..ac6601f30 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c

[ ... ]

> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index abe54f004..0849fb652 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c

[ ... ]

> diff --git a/mm/madvise.c b/mm/madvise.c
> index fb1c86e63..8de7c3930 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1425,6 +1425,13 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
>  	VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
>
>  	error = madvise_update_vma(new_flags, madv_behavior);
> +	/*
> +	 * If the vma become good for khugepaged to scan,
> +	 * register it here without waiting a page fault that
> +	 * may not happen any time soon.
> +	 */
> +	if (!error && new_flags & VM_HUGEPAGE)
> +		khugepaged_enter_mm(vma->vm_mm);

Can this access freed memory? In madvise_vma_behavior(), the local
variable 'vma' is captured at the function entry. When
madvise_update_vma() merges VMAs, it may free the original VMA and
update madv_behavior->vma to point to a different VMA (the merge result).

Looking at the VMA merge code in vma_merge_existing_range():

    vmg->__remove_middle = left_side && right_side;

When the merge spans the entire VMA, the middle VMA (the original 'vma')
is freed in vma_complete():

    vm_area_free(vp->remove);

After madvise_update_vma() returns, the code uses the local 'vma'
variable to access vma->vm_mm, but this VMA may have been freed if
merging occurred. The correct approach would be to use
madv_behavior->vma->vm_mm, which madvise_update_vma() updates to point
to the valid VMA after merging.

> diff --git a/mm/vma.c b/mm/vma.c
> index abe0da33c..872eb3c92 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: thp: remove vm_flags parameter from khugepaged_enter_vma()
CI run summary: https://github.com/kernel-patches/bpf-rc/actions/runs/18532322684

@kernel-patches-daemon-bpf-rc
Author

Forwarding comment 3406846173 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf-rc
Author

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1011915 expired. Closing PR.

@kernel-patches-daemon-bpf-rc bot deleted the series/1011915=>bpf-next branch October 17, 2025 22:24