
Commit fc4951c

Hugh Dickins authored and gregkh committed
mm/thp: fix deferred split unqueue naming and locking
commit f8f931b upstream.

Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting.

Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()).

Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately).

Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data to 0 without checking and unqueueing a THP folio from its deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data.

That goes back to 5.4 commit 87eaceb ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace.

Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment.

Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting the "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case.
The 87eaceb commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false.

Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases.

Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com
Fixes: 87eaceb ("mm: thp: make deferred split shrinker memcg aware")
Fixes: dafff3f ("mm: split underused THPs")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Upstream commit itself does not apply cleanly, because there are fewer calls to folio_undo_large_rmappable() in this tree (in particular, folio migration does not migrate memcg charge), and mm/memcontrol-v1.c has not been split out of mm/memcontrol.c. ]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
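For context on the point above that "the split_queue_lock depends on the memcg (when memcg is enabled)": the deferred split queue, and hence the lock protecting it, is selected per-memcg when memory cgroups are enabled, per-node otherwise. A condensed sketch of get_deferred_split_queue() in mm/huge_memory.c (CONFIG_MEMCG case; details vary across kernel versions, shown for orientation only):

	static struct deferred_split *get_deferred_split_queue(struct folio *folio)
	{
		struct mem_cgroup *memcg = folio_memcg(folio);
		struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));

		if (memcg)
			return &memcg->deferred_split_queue;	/* memcg's lock */
		return &pgdat->deferred_split_queue;		/* pgdat's lock */
	}

Once folio->memcg_data has been reset, folio_memcg() is NULL, so a later unqueue picks the pgdat's queue and lock while the folio may still be linked into the memcg's list: the corruption described above.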
1 parent eb6b6d3 commit fc4951c

4 files changed: +61 −18 lines changed

mm/huge_memory.c

Lines changed: 26 additions & 9 deletions
@@ -2767,18 +2767,38 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	return ret;
 }
 
-void __folio_undo_large_rmappable(struct folio *folio)
+/*
+ * __folio_unqueue_deferred_split() is not to be called directly:
+ * the folio_unqueue_deferred_split() inline wrapper in mm/internal.h
+ * limits its calls to those folios which may have a _deferred_list for
+ * queueing THP splits, and that list is (racily observed to be) non-empty.
+ *
+ * It is unsafe to call folio_unqueue_deferred_split() until folio refcount is
+ * zero: because even when split_queue_lock is held, a non-empty _deferred_list
+ * might be in use on deferred_split_scan()'s unlocked on-stack list.
+ *
+ * If memory cgroups are enabled, split_queue_lock is in the mem_cgroup: it is
+ * therefore important to unqueue deferred split before changing folio memcg.
+ */
+bool __folio_unqueue_deferred_split(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
 	unsigned long flags;
+	bool unqueued = false;
+
+	WARN_ON_ONCE(folio_ref_count(folio));
+	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
 
 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (!list_empty(&folio->_deferred_list)) {
 		ds_queue->split_queue_len--;
 		list_del_init(&folio->_deferred_list);
+		unqueued = true;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+	return unqueued;	/* useful for debug warnings */
 }
 
 void deferred_split_folio(struct folio *folio)
@@ -2797,14 +2817,11 @@ void deferred_split_folio(struct folio *folio)
 		return;
 
 	/*
-	 * The try_to_unmap() in page reclaim path might reach here too,
-	 * this may cause a race condition to corrupt deferred split queue.
-	 * And, if page reclaim is already handling the same folio, it is
-	 * unnecessary to handle it again in shrinker.
-	 *
-	 * Check the swapcache flag to determine if the folio is being
-	 * handled by page reclaim since THP swap would add the folio into
-	 * swap cache before calling try_to_unmap().
+	 * Exclude swapcache: originally to avoid a corrupt deferred split
+	 * queue. Nowadays that is fully prevented by mem_cgroup_swapout();
+	 * but if page reclaim is already handling the same folio, it is
+	 * unnecessary to handle it again in the shrinker, so excluding
+	 * swapcache here may still be a useful optimization.
 	 */
 	if (folio_test_swapcache(folio))
 		return;
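The refcount-zero rule in the new comment above follows from how deferred_split_scan() consumes the queue: it privately moves queued folios onto an unlocked on-stack list before splitting them, so a non-empty _deferred_list does not prove the folio is on any ds_queue->split_queue. A condensed sketch of that pattern (the real shrinker callback does more, and details differ between kernel versions):

	LIST_HEAD(list);	/* on-stack, used after the lock is dropped */

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
							_deferred_list) {
		if (folio_try_get(folio)) {
			/* _deferred_list stays non-empty: now on "list" */
			list_move(&folio->_deferred_list, &list);
		} else {
			/* we lost race with folio_put(): really unqueue */
			list_del_init(&folio->_deferred_list);
			ds_queue->split_queue_len--;
		}
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
	/* folios on the on-stack "list" are split or requeued, unlocked */

While a folio sits on that private list its refcount is raised and its _deferred_list is non-empty, yet __folio_unqueue_deferred_split() cannot find it under split_queue_lock: hence the WARN_ON_ONCE(folio_ref_count(folio)) above.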

mm/internal.h

Lines changed: 5 additions & 5 deletions
@@ -413,21 +413,21 @@ static inline void folio_set_order(struct folio *folio, unsigned int order)
 #endif
 }
 
-void __folio_undo_large_rmappable(struct folio *folio);
-static inline void folio_undo_large_rmappable(struct folio *folio)
+bool __folio_unqueue_deferred_split(struct folio *folio);
+static inline bool folio_unqueue_deferred_split(struct folio *folio)
 {
 	if (folio_order(folio) <= 1 || !folio_test_large_rmappable(folio))
-		return;
+		return false;
 
 	/*
 	 * At this point, there is no one trying to add the folio to
 	 * deferred_list. If folio is not in deferred_list, it's safe
 	 * to check without acquiring the split_queue_lock.
 	 */
 	if (data_race(list_empty(&folio->_deferred_list)))
-		return;
+		return false;
 
-	__folio_undo_large_rmappable(folio);
+	return __folio_unqueue_deferred_split(folio);
 }
 
 static inline struct folio *page_rmappable_folio(struct page *page)
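The bool return from this wrapper lets call sites which expect the folio to be off the queue already turn the call into a sanity check rather than an action; both styles appear in the mm/memcontrol.c hunks below. An illustrative fragment (not itself part of the patch):

	/* Normal use: unqueue now, before folio->memcg_data is changed */
	folio_unqueue_deferred_split(folio);
	folio->memcg_data = 0;

	/* Debug use: warn if anything was (wrongly) still queued here */
	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));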

mm/memcontrol.c

Lines changed: 29 additions & 3 deletions
@@ -5873,6 +5873,8 @@ static int mem_cgroup_move_account(struct page *page,
 	css_get(&to->css);
 	css_put(&from->css);
 
+	/* Warning should never happen, so don't worry about refcount non-0 */
+	WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 	folio->memcg_data = (unsigned long)to;
 
 	__folio_memcg_unlock(from);
@@ -6237,7 +6239,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct page *page;
+	struct folio *folio;
+	bool tried_split_before = false;
 
+retry_pmd:
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
 		if (mc.precharge < HPAGE_PMD_NR) {
@@ -6247,6 +6252,28 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
 		if (target_type == MC_TARGET_PAGE) {
 			page = target.page;
+			folio = page_folio(page);
+			/*
+			 * Deferred split queue locking depends on memcg,
+			 * and unqueue is unsafe unless folio refcount is 0:
+			 * split or skip if on the queue? first try to split.
+			 */
+			if (!list_empty(&folio->_deferred_list)) {
+				spin_unlock(ptl);
+				if (!tried_split_before)
+					split_folio(folio);
+				folio_unlock(folio);
+				folio_put(folio);
+				if (tried_split_before)
+					return 0;
+				tried_split_before = true;
+				goto retry_pmd;
+			}
+			/*
+			 * So long as that pmd lock is held, the folio cannot
+			 * be racily added to the _deferred_list, because
+			 * page_remove_rmap() will find it still pmdmapped.
+			 */
 			if (isolate_lru_page(page)) {
 				if (!mem_cgroup_move_account(page, true,
 							     mc.from, mc.to)) {
@@ -7153,9 +7180,6 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 	struct obj_cgroup *objcg;
 
 	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
-	VM_BUG_ON_FOLIO(folio_order(folio) > 1 &&
-			!folio_test_hugetlb(folio) &&
-			!list_empty(&folio->_deferred_list), folio);
 
 	/*
 	 * Nobody should be changing or seriously looking at
@@ -7202,6 +7226,7 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
 		ug->nr_memory += nr_pages;
 		ug->pgpgout++;
 
+		WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
 		folio->memcg_data = 0;
 	}
 
@@ -7495,6 +7520,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	VM_BUG_ON_FOLIO(oldid, folio);
 	mod_memcg_state(swap_memcg, MEMCG_SWAP, nr_entries);
 
+	folio_unqueue_deferred_split(folio);
 	folio->memcg_data = 0;
 
 	if (!mem_cgroup_is_root(memcg))

mm/page_alloc.c

Lines changed: 1 addition & 1 deletion
@@ -600,7 +600,7 @@ void destroy_large_folio(struct folio *folio)
 		return;
 	}
 
-	folio_undo_large_rmappable(folio);
+	folio_unqueue_deferred_split(folio);
 	mem_cgroup_uncharge(folio);
 	free_the_page(&folio->page, folio_order(folio));
 }
