Thp v11 test #10040

laoar · 2025-10-20T02:48:03Z

No description provided.

The khugepaged_enter_vma() function requires handling in two specific scenarios:

1. New VMA creation

   When a new VMA is created (for anon vma, it is deferred to pagefault), if vma->vm_mm is not present in khugepaged_mm_slot, it must be added. In this case, khugepaged_enter_vma() is called after vma->vm_flags have been set, allowing direct use of the VMA's flags.

2. VMA flag modification

   When vma->vm_flags are modified (particularly when VM_HUGEPAGE is set), the system must recheck whether to add vma->vm_mm to khugepaged_mm_slot. Currently, khugepaged_enter_vma() is called before the flag update, so the call must be relocated to occur after vma->vm_flags have been set.

In the VMA merging path, khugepaged_enter_vma() is also called. For this case, since VMA merging only occurs when the vm_flags of both VMAs are identical (excluding special flags like VM_SOFTDIRTY), we can safely use target->vm_flags instead. (It is worth noting that khugepaged_enter_vma() can be removed from the VMA merging path because the VMA has already been added in the two aforementioned cases. We will address this cleanup in a separate patch.)

After this change, we can further remove the vm_flags parameter from thp_vma_allowable_order(). That will be handled in a followup patch.

Signed-off-by: Yafang Shao <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Usama Arif <[email protected]>

Because all calls to thp_vma_allowable_order() pass vma->vm_flags as the vma_flags argument, we can remove the parameter and have the function access vma->vm_flags directly.

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Usama Arif <[email protected]>


The Motivation
==============

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF programs to influence THP order selection based on factors such as:

- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, swap or other paths.
- VMA's memory advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
  PSI system data or associated cgroup PSI metrics

The BPF-THP Interface
=====================

The kernel API of this new BPF hook is as follows,

  /**
   * thp_get_order: Get the suggested THP order from a BPF program for allocation
   * @vma: vm_area_struct associated with the THP allocation
   * @type: TVA type for current @vma
   * @orders: Bitmask of available THP orders for this allocation
   *
   * Return: The suggested THP order for allocation from the BPF program. Must be
   *         a valid, available order.
   */
  int thp_get_order(struct vm_area_struct *vma,
                    enum tva_type type,
                    unsigned long orders);

This functionality is only active when system-wide THP is configured to madvise or always mode. It remains disabled in never mode. Additionally, if THP is explicitly disabled for a specific task via prctl(), this BPF functionality will also be unavailable for that task.

The Design of Per Process BPF-THP
=================================

As suggested by Alexei, we need to scope the BPF-THP [0].

Scoping BPF-THP to cgroup is not acceptable
-------------------------------------------

As explained by Gutierrez: [1]

1. It breaks the cgroup hierarchy when 2 siblings have different THP policies
2. Cgroup was designed for resource management, not for grouping processes and tuning those processes
3. We set a precedent for other people adding new flags to cgroup and potentially polluting cgroups. We may end up with cgroups having tens of different flags, making sysadmin's job more complex

Scoping BPF-THP to process
--------------------------

To eliminate potential conflicts among competing BPF-THP instances, we enforce that each process is exclusively managed by a single BPF-THP. This approach has received agreement from David [2].

When registering a BPF-THP, we specify the PID of a target task. The BPF-THP is then installed in the task's `mm_struct`

  struct mm_struct {
      struct bpf_thp_ops __rcu *bpf_thp;
  };

Inheritance Behavior:

- Existing child processes are unaffected
- Newly forked children inherit the BPF-THP from their parent
- The BPF-THP persists across execve() calls

A new linked list tracks all tasks managed by each BPF-THP instance:

- Newly managed tasks are added to the list
- Exiting tasks are automatically removed from the list
- During BPF-THP unregistration (e.g., when the BPF link is removed), all managed tasks have their bpf_thp pointer set to NULL
- BPF-THP instances can be dynamically updated, with all tracked tasks automatically migrating to the new version.

This design simplifies BPF-THP management in production environments by providing clear lifecycle management and preventing conflicts between multiple BPF-THP instances.

WARNING
=======

This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note that this capability is currently unstable and may undergo significant changes, including potential removal, in future kernel versions.

Link: https://lore.kernel.org/linux-mm/CAADnVQJtrJZOCWZKH498GBA8M0mYVztApk54mOEejs8Wr3nSiw@mail.gmail.com/ [0]
Link: https://lore.kernel.org/linux-mm/[email protected]/ [1]
Link: https://lore.kernel.org/linux-mm/[email protected]/ [2]

Signed-off-by: Yafang Shao <[email protected]>
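The per-process lifecycle described above (exclusive ownership, inheritance on fork, removal on exit, clearing on unregistration) can be sketched as a userspace model. This is a minimal illustration under stated assumptions, not the patch's implementation: struct model_mm, struct model_policy and every model_* function are hypothetical names, locking and RCU are ignored, and -EBUSY is used for the "already managed" case because the selftest description in this series reports EBUSY for it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_policy;

/* Hypothetical reduction of mm_struct: one policy pointer plus a list link. */
struct model_mm {
    struct model_policy *bpf_thp;   /* NULL when the process is unmanaged */
    struct model_mm *next;          /* link in the owning policy's list */
};

/* Hypothetical BPF-THP instance tracking the processes it manages. */
struct model_policy {
    struct model_mm *managed;       /* head of the managed list */
};

/* Attach a policy to a process; each process accepts a single policy. */
static int model_attach(struct model_policy *p, struct model_mm *mm)
{
    assert(p != NULL && mm != NULL);
    if (mm->bpf_thp)
        return -EBUSY;
    mm->bpf_thp = p;
    mm->next = p->managed;
    p->managed = mm;
    return 0;
}

/* A newly forked child inherits the parent's policy, if there is one. */
static void model_fork(const struct model_mm *parent, struct model_mm *child)
{
    child->bpf_thp = NULL;
    child->next = NULL;
    if (parent->bpf_thp)
        model_attach(parent->bpf_thp, child);
}

/* An exiting process is unlinked from its policy's list. */
static void model_exit(struct model_mm *mm)
{
    struct model_policy *p = mm->bpf_thp;

    if (!p)
        return;
    for (struct model_mm **pp = &p->managed; *pp; pp = &(*pp)->next) {
        if (*pp == mm) {
            *pp = mm->next;
            break;
        }
    }
    mm->bpf_thp = NULL;
    mm->next = NULL;
}

/* Unregistration clears the policy pointer of every managed process. */
static void model_unregister(struct model_policy *p)
{
    while (p->managed) {
        struct model_mm *mm = p->managed;

        p->managed = mm->next;
        mm->bpf_thp = NULL;
        mm->next = NULL;
    }
}
```

Keeping the list on the policy side is what lets unregistration reach every managed process without scanning all tasks; the model keeps that property but nothing else about the kernel data structures.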

The new BPF capability enables finer-grained THP policy decisions by introducing separate handling for swap faults versus normal page faults.

As highlighted by Barry:

  We've observed that swapping in large folios can lead to more swap thrashing for some workloads, e.g. kernel build. Consequently, some workloads might prefer swapping in smaller folios than those allocated by alloc_anon_folio().

While prctl() could potentially be extended to leverage this new policy, doing so would require modifications to the uAPI.

Signed-off-by: Yafang Shao <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Acked-by: Usama Arif <[email protected]>
Cc: Barry Song <[email protected]>
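A policy that prefers smaller folios on swap-in than on a normal fault could look roughly like the following userspace sketch. It is not a BPF program and does not use the kernel's types: enum model_tva, its members, the swap_cap parameter and the -1 "no suitable order" return value are all hypothetical, chosen only to show order selection from an orders bitmask.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-ins for the allocation contexts a policy can see. */
enum model_tva {
    MODEL_TVA_PAGEFAULT,
    MODEL_TVA_SWAP_FAULT,
    MODEL_TVA_KHUGEPAGED
};

/* Return the largest order set in the bitmask that is <= cap, or -1. */
static int highest_order_at_most(unsigned long orders, int cap)
{
    int max_bit = (int)(sizeof(orders) * CHAR_BIT) - 1;

    if (cap > max_bit)
        cap = max_bit;
    for (int order = cap; order >= 0; order--) {
        if (orders & (1UL << order))
            return order;
    }
    return -1;
}

/*
 * Swap faults are limited to swap_cap; every other context takes the
 * largest available order.
 */
static int model_get_order(enum model_tva type, unsigned long orders,
                           int swap_cap)
{
    int no_cap = (int)(sizeof(orders) * CHAR_BIT) - 1;

    assert(swap_cap >= 0);
    if (type == MODEL_TVA_SWAP_FAULT)
        return highest_order_at_most(orders, swap_cap);
    return highest_order_at_most(orders, no_cap);
}
```

With orders 9, 4 and 2 available and swap_cap set to 4, the model picks order 9 for a normal fault and order 4 for a swap fault.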

khugepaged_enter_vma() ultimately invokes any attached BPF function with the TVA_KHUGEPAGED flag set when determining whether or not to enable khugepaged THP for a freshly faulted in VMA.

Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as invoked by create_huge_pmd() and only when we have already checked to see if an allowable TVA_PAGEFAULT order is specified.

Since we might want to disallow THP on fault-in but allow it via khugepaged, we move things around so we always attempt to enter khugepaged upon fault.

This change is safe because:

- khugepaged operates at the MM level rather than per-VMA. Because the THP allocation might fail during page faults due to transient conditions (e.g., memory pressure), it is safe to add this MM to khugepaged for subsequent defragmentation.
- If __thp_vma_allowable_orders(TVA_PAGEFAULT) returns 0, then __thp_vma_allowable_orders(TVA_KHUGEPAGED) will also return 0.

While we could also extend prctl() to utilize this new policy, such a change would require a uAPI modification to PR_SET_THP_DISABLE.

Signed-off-by: Yafang Shao <[email protected]>
Acked-by: Lance Yang <[email protected]>
Cc: Usama Arif <[email protected]>
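The control-flow change can be pictured with a tiny userspace model that compares the two orderings. It is a sketch under assumptions: struct model_fault and both functions are hypothetical, and pf_orders and kh_orders simply stand for the allowable-order bitmasks a policy would report for the page-fault and khugepaged contexts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of handling one anonymous fault. */
struct model_fault {
    int khugepaged_entered;   /* mm handed to khugepaged */
    int thp_attempted;        /* THP allocation tried at fault time */
};

/* Old flow: khugepaged is only considered once a fault order is allowed. */
static void model_fault_old(unsigned long pf_orders, unsigned long kh_orders,
                            struct model_fault *out)
{
    assert(out != NULL);
    out->khugepaged_entered = 0;
    out->thp_attempted = 0;
    if (pf_orders == 0)
        return;
    out->thp_attempted = 1;
    out->khugepaged_entered = (kh_orders != 0);
}

/* New flow: khugepaged is considered on every fault, before the order check. */
static void model_fault_new(unsigned long pf_orders, unsigned long kh_orders,
                            struct model_fault *out)
{
    assert(out != NULL);
    out->khugepaged_entered = (kh_orders != 0);
    out->thp_attempted = (pf_orders != 0);
}
```

The two flows differ only when a policy reports no order for the page-fault context but some order for khugepaged, which is the case the commit message wants to support.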

The per-process BPF-THP mode is unsuitable for managing shared resources such as shmem THP and file-backed THP. This aligns with known cgroup limitations for similar scenarios [0].

Introduce a global BPF-THP mode to address this gap. When registered:

- All existing per-process instances are disabled
- New per-process registrations are blocked
- Existing per-process instances remain registered (no forced unregistration)

The global mode takes precedence over per-process instances. Updates are type-isolated: global instances can only be updated by new global instances, and per-process instances by new per-process instances.

Link: https://lore.kernel.org/linux-mm/[email protected]/ [0]

Signed-off-by: Yafang Shao <[email protected]>
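The precedence and type-isolation rules can be captured in a small userspace model. Everything here is hypothetical (the scope_* names, the structs, and the error codes): the series only states EBUSY for attaching to a process that already has a program, so the use of -EBUSY for the other blocked registrations and -EINVAL for a cross-type update is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum scope_kind { SCOPE_PER_PROCESS, SCOPE_GLOBAL };

/* Hypothetical policy instance tagged with its scope. */
struct scope_policy {
    enum scope_kind kind;
};

struct scope_system { const struct scope_policy *global; };
struct scope_task   { const struct scope_policy *local; };

/* Register a policy; task may be NULL when registering a global policy. */
static int scope_attach(struct scope_system *sys, struct scope_task *task,
                        const struct scope_policy *p)
{
    assert(sys != NULL && p != NULL);
    if (p->kind == SCOPE_GLOBAL) {
        if (sys->global)
            return -EBUSY;          /* assumed error code */
        sys->global = p;            /* allowed even if local policies exist */
        return 0;
    }
    assert(task != NULL);
    if (sys->global)
        return -EBUSY;              /* assumed: new per-process blocked */
    if (task->local)
        return -EBUSY;              /* one policy per process */
    task->local = p;
    return 0;
}

/* Replace the policy in a slot; only same-kind replacements are accepted. */
static int scope_update(const struct scope_policy **slot,
                        const struct scope_policy *next)
{
    assert(slot != NULL && next != NULL);
    if (!*slot || (*slot)->kind != next->kind)
        return -EINVAL;             /* assumed error code */
    *slot = next;
    return 0;
}

/* The global policy, when present, overrides the per-process one. */
static const struct scope_policy *
scope_effective(const struct scope_system *sys, const struct scope_task *task)
{
    return sys->global ? sys->global : task->local;
}
```

Note that registering the global policy leaves task->local untouched, mirroring the statement that existing per-process instances stay registered but are disabled.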

Add the documentation.

Signed-off-by: Yafang Shao <[email protected]>

This test case implements a basic THP policy that sets THPeligible to 0 for a specific task. I selected THPeligible for verification because its straightforward nature makes it ideal for validating the BPF THP policy functionality.

The following configs must be enabled for this test:

  CONFIG_BPF_MM=y
  CONFIG_BPF_THP=y
  CONFIG_TRANSPARENT_HUGEPAGE=y

Signed-off-by: Yafang Shao <[email protected]>
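THPeligible is reported per mapping in /proc/<pid>/smaps, so a check of this kind comes down to reading that field. The helper below is a minimal sketch of such a parser working on an in-memory buffer; parse_thp_eligible() is a hypothetical name, and the selftest's actual helper may be written differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the value (0 or 1) of the first "THPeligible:" field found in
 * smaps-style text, or -1 if the field is missing or has no number.
 */
static int parse_thp_eligible(const char *smaps_text)
{
    static const char key[] = "THPeligible:";
    const char *p;
    char *end;
    long val;

    assert(smaps_text != NULL);
    p = strstr(smaps_text, key);
    if (!p)
        return -1;
    p += sizeof(key) - 1;           /* skip past the key itself */
    val = strtol(p, &end, 10);      /* strtol skips the padding spaces */
    if (end == p)
        return -1;
    return val != 0;
}
```

A test would read the smaps text for the target mapping into a buffer and expect this helper to return 0 while the policy is attached.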

This test case exercises the BPF THP update mechanism by modifying an existing policy. The behavior confirms that:

- EBUSY error occurs when attempting to install a BPF program on a process that already has an active BPF program
- Updates to currently running programs are successfully processed
- Local prog can't be updated by a global prog
- Global prog can't be updated by a local prog
- Global prog can be attached even if there's a local prog
- Local prog can't be attached if there's a global prog

Signed-off-by: Yafang Shao <[email protected]>

Verify that child processes correctly inherit BPF-THP policy from their parent during fork() operations.

Signed-off-by: Yafang Shao <[email protected]>

Kernel Patches Daemon and others added 11 commits October 18, 2025 19:33

adding ci files

6116807

Documentation: add BPF THP

4b0464e

Add the documentation. Signed-off-by: Yafang Shao <[email protected]>

selftests/bpf: add test case for BPF-THP inheritance across fork

92bc316

Verify that child processes correctly inherit BPF-THP policy from their parent during fork() operations. Signed-off-by: Yafang Shao <[email protected]>

kernel-patches-daemon-bpf bot force-pushed the bpf-next_base branch from 6116807 to 7b565ed Compare October 20, 2025 20:24
