-
Notifications
You must be signed in to change notification settings - Fork 5
mm, bpf: BPF-MM, BPF-THP #6146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mm, bpf: BPF-MM, BPF-THP #6146
Conversation
The khugepaged_enter_vma() function requires handling in two specific scenarios: 1. New VMA creation When a new VMA is created (for anon vma, it is deferred to pagefault), if vma->vm_mm is not present in khugepaged_mm_slot, it must be added. In this case, khugepaged_enter_vma() is called after vma->vm_flags have been set, allowing direct use of the VMA's flags. 2. VMA flag modification When vma->vm_flags are modified (particularly when VM_HUGEPAGE is set), the system must recheck whether to add vma->vm_mm to khugepaged_mm_slot. Currently, khugepaged_enter_vma() is called before the flag update, so the call must be relocated to occur after vma->vm_flags have been set. In the VMA merging path, khugepaged_enter_vma() is also called. For this case, since VMA merging only occurs when the vm_flags of both VMAs are identical (excluding special flags like VM_SOFTDIRTY), we can safely use target->vm_flags instead. (It is worth noting that khugepaged_enter_vma() can be removed from the VMA merging path because the VMA has already been added in the two aforementioned cases. We will address this cleanup in a separate patch.) After this change, we can further remove vm_flags parameter from thp_vma_allowable_order(). That will be handled in a followup patch. Signed-off-by: Yafang Shao <[email protected]> Cc: Yang Shi <[email protected]> Cc: Usama Arif <[email protected]>
Because all calls to thp_vma_allowable_order() pass vma->vm_flags as the vma_flags argument, we can remove the parameter and have the function access vma->vm_flags directly. Signed-off-by: Yafang Shao <[email protected]> Acked-by: Usama Arif <[email protected]>
This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF programs to influence THP order selection based on factors such as: - Workload identity For example, workloads running in specific containers or cgroups. - Allocation context Whether the allocation occurs during a page fault, khugepaged, swap or other paths. - VMA's memory advice settings MADV_HUGEPAGE or MADV_NOHUGEPAGE - Memory pressure PSI system data or associated cgroup PSI metrics The kernel API of this new BPF hook is as follows, /** * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation * @vma: vm_area_struct associated with the THP allocation * @type: TVA type for current @vma * @orders: Bitmask of available THP orders for this allocation * * Return: The suggested THP order for allocation from the BPF program. Must be * a valid, available order. */ typedef int thp_order_fn_t(struct vm_area_struct *vma, enum tva_type type, unsigned long orders); Only a single BPF program can be attached at any given time, though it can be dynamically updated to adjust the policy. The implementation supports anonymous THP, shmem THP, and mTHP, with future extensions planned for file-backed THP. This functionality is only active when system-wide THP is configured to madvise or always mode. It remains disabled in never mode. Additionally, if THP is explicitly disabled for a specific task via prctl(), this BPF functionality will also be unavailable for that task. This BPF hook enables the implementation of flexible THP allocation policies at the system, per-cgroup, or per-task level. This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note that this capability is currently unstable and may undergo significant changes—including potential removal—in future kernel versions. Signed-off-by: Yafang Shao <[email protected]>
The new BPF capability enables finer-grained THP policy decisions by introducing separate handling for swap faults versus normal page faults. As highlighted by Barry: We’ve observed that swapping in large folios can lead to more swap thrashing for some workloads- e.g. kernel build. Consequently, some workloads might prefer swapping in smaller folios than those allocated by alloc_anon_folio(). While prtcl() could potentially be extended to leverage this new policy, doing so would require modifications to the uAPI. Signed-off-by: Yafang Shao <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Acked-by: Usama Arif <[email protected]> Cc: Barry Song <[email protected]>
khugepaged_enter_vma() ultimately invokes any attached BPF function with the TVA_KHUGEPAGED flag set when determining whether or not to enable khugepaged THP for a freshly faulted in VMA. Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as invoked by create_huge_pmd() and only when we have already checked to see if an allowable TVA_PAGEFAULT order is specified. Since we might want to disallow THP on fault-in but allow it via khugepaged, we move things around so we always attempt to enter khugepaged upon fault. This change is safe because: - khugepaged operates at the MM level rather than per-VMA. The THP allocation might fail during page faults due to transient conditions (e.g., memory pressure), it is safe to add this MM to khugepaged for subsequent defragmentation. - If __thp_vma_allowable_orders(TVA_PAGEFAULT) returns 0, then __thp_vma_allowable_orders(TVA_KHUGEPAGED) will also return 0. While we could also extend prctl() to utilize this new policy, such a change would require a uAPI modification to PR_SET_THP_DISABLE. Signed-off-by: Yafang Shao <[email protected]> Acked-by: Lance Yang <[email protected]> Cc: Usama Arif <[email protected]>
When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The owner can be NULL. With this change, BPF helpers can safely access mm->owner to retrieve the associated task from the mm. We can then make policy decision based on the task attribute. The typical use case is as follows, bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field @owner = @mm->owner; // mm_struct::owner is rcu trusted or null if (!@owner) goto out; /* Do something based on the task attribute */ out: bpf_rcu_read_unlock(); Suggested-by: Andrii Nakryiko <[email protected]> Signed-off-by: Yafang Shao <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]>
The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus, we can mark it as trusted_or_null. With this change, BPF helpers can safely access vma->vm_mm to retrieve the associated mm_struct from the VMA. Then we can make policy decision from the VMA. The "trusted" annotation enables direct access to vma->vm_mm within kfuncs marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and bpf_task_under_cgroup(). Conversely, "null" enforcement requires all callsites using vma->vm_mm to perform NULL checks. The lsm selftest must be modified because it directly accesses vma->vm_mm without a NULL pointer check; otherwise it will break due to this change. For the VMA based THP policy, the use case is as follows, @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null if (!@mm) return; bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner @owner = @mm->owner; // mm_struct::owner is rcu trusted or null if (!@owner) goto out; @Cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID); /* make the decision based on the @Cgroup1 attribute */ bpf_cgroup_release(@Cgroup1); // release the associated cgroup out: bpf_rcu_read_unlock(); PSI memory information can be obtained from the associated cgroup to inform policy decisions. Since upstream PSI support is currently limited to cgroup v2, the following example demonstrates cgroup v2 implementation: @owner = @mm->owner; if (@owner) { // @ancestor_cgid is user-configured @ancestor = bpf_cgroup_from_id(@ancestor_cgid); if (bpf_task_under_cgroup(@owner, @ancestor)) { @psi_group = @ancestor->psi; /* Extract PSI metrics from @psi_group and * implement policy logic based on the values */ } } Signed-off-by: Yafang Shao <[email protected]> Acked-by: Lorenzo Stoakes <[email protected]> Cc: "Liam R. Howlett" <[email protected]>
This test case implements a basic THP policy that sets THPeligible to 1 for a specific task and to 0 for all others. I selected THPeligible for verification because its straightforward nature makes it ideal for validating the BPF THP policy functionality. Below configs must be enabled for this test: CONFIG_BPF_THP=y CONFIG_MEMCG=y CONFIG_TRANSPARENT_HUGEPAGE=y Signed-off-by: Yafang Shao <[email protected]>
Add the documentation. Signed-off-by: Yafang Shao <[email protected]>
|
Upstream branch: 48a97ff |
AI reviewed your patch. Please fix the bug or email reply why it's not a bug. In-Reply-To-Subject: |
|
Forwarding comment 3406802393 via email |
AI reviewed your patch. Please fix the bug or email reply why it's not a bug. In-Reply-To-Subject: |
|
Forwarding comment 3406846173 via email |
|
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=1011915 expired. Closing PR. |
Pull request for series with
subject: mm, bpf: BPF-MM, BPF-THP
version: 10
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1011915