
Commit 93c0476

ryncsn authored and akpm00 committed
mm/shmem, swap: rework swap entry and index calculation for large swapin
Instead of calculating the swap entry differently in different swapin paths, calculate it early, before the swap cache lookup, and use it for both the lookup and the later swapin. After swapin has brought in a folio, simply round the entry down by the size of the folio.

This is simple and effective enough to verify the swap value. A folio's swap entry is always aligned by its size. Any kind of parallel split or race is acceptable, because the final shmem_add_to_page_cache ensures that all entries covered by the folio are correct, and thus there will be no data corruption.

This also prevents false-positive cache lookups. If a shmem read request's index points to the middle of a large swap entry, shmem previously tried the swap cache lookup using the large swap entry's starting value (which is the first sub swap entry of this large entry). That leads to false-positive lookup results if only the first few swap entries are cached but the actual swap entry pointed to by the index is uncached. This is not a rare event, as swap readahead always tries to cache order 0 folios when possible.

And this shouldn't cause any increase in repeated faults. Instead, no matter how the shmem mapping is split in parallel, as long as the mapping still contains the right entries, the swapin will succeed.

The final object size and stack usage are also reduced due to the simplified code:

./scripts/bloat-o-meter mm/shmem.o.old mm/shmem.o
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-145 (-145)
Function                 old     new   delta
shmem_swapin_folio      4056    3911    -145
Total: Before=33242, After=33097, chg -0.44%

Stack usage (Before vs After):
mm/shmem.c:2314:12:shmem_swapin_folio    264    static
mm/shmem.c:2314:12:shmem_swapin_folio    256    static

While at it, round down the index too if the swap entry is rounded down. The index is used either for folio reallocation or for confirming the mapping content. In either case, it should be aligned with the swap folio.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kairui Song <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Tested-by: Baolin Wang <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
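To make the false-positive scenario above concrete, here is a minimal userspace sketch, not kernel code, of the sub-entry calculation the patch moves in front of the swap cache lookup. The order, index, swap offset, and the set of "cached" offsets are invented for illustration, and the round_down() macro only mirrors the kernel helper for power-of-two sizes.

/*
 * Userspace sketch of the sub swap entry derivation described above.
 * Hypothetical setup: an order-4 (16-page) swap entry starting at swap
 * offset 1024 backs indices 64..79, but readahead has only cached the
 * first two order-0 folios (offsets 1024 and 1025).
 */
#include <stdbool.h>
#include <stdio.h>

#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

static bool in_swap_cache(unsigned long off)
{
	return off == 1024 || off == 1025; /* toy stand-in for the swap cache */
}

int main(void)
{
	unsigned long entry_start = 1024; /* swap offset stored for the large entry */
	int order = 4;                    /* large entry covers 1 << 4 pages */
	unsigned long index = 70;         /* faulting index, middle of the entry */

	/* old behaviour: look up using the large entry's starting value */
	printf("lookup at %lu: %s (false positive)\n", entry_start,
	       in_swap_cache(entry_start) ? "hit" : "miss");

	/* new behaviour: derive the sub entry for this index first */
	unsigned long off = index - round_down(index, 1UL << order);
	unsigned long sub = entry_start + off;
	printf("lookup at %lu: %s (correct, proceed to swapin)\n", sub,
	       in_swap_cache(sub) ? "hit" : "miss");
	return 0;
}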
1 parent 1326359 commit 93c0476

File tree

1 file changed: +33 -34 lines changed

mm/shmem.c

Lines changed: 33 additions & 34 deletions
@@ -2302,7 +2302,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
 	if (xas_error(&xas))
 		return xas_error(&xas);
 
-	return entry_order;
+	return 0;
 }
 
 /*
@@ -2323,19 +2323,19 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	bool skip_swapcache = false;
-	int error, nr_pages, order, split_order;
+	int error, nr_pages, order;
 	pgoff_t offset;
 
 	VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
 	index_entry = radix_to_swp_entry(*foliop);
 	swap = index_entry;
 	*foliop = NULL;
 
-	if (is_poisoned_swp_entry(swap))
+	if (is_poisoned_swp_entry(index_entry))
 		return -EIO;
 
-	si = get_swap_device(swap);
-	order = shmem_confirm_swap(mapping, index, swap);
+	si = get_swap_device(index_entry);
+	order = shmem_confirm_swap(mapping, index, index_entry);
 	if (unlikely(!si)) {
 		if (order < 0)
 			return -EEXIST;
@@ -2347,6 +2347,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		return -EEXIST;
 	}
 
+	/* index may point to the middle of a large entry, get the sub entry */
+	if (order) {
+		offset = index - round_down(index, 1 << order);
+		swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+	}
+
 	/* Look it up and read it in.. */
 	folio = swap_cache_get_folio(swap, NULL, 0);
 	if (!folio) {
@@ -2359,31 +2365,24 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
 			/* Direct swapin skipping swap cache & readahead */
-			folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
+			folio = shmem_swap_alloc_folio(inode, vma, index,
+						       index_entry, order, gfp);
 			if (IS_ERR(folio)) {
 				error = PTR_ERR(folio);
 				folio = NULL;
 				goto failed;
 			}
 			skip_swapcache = true;
 		} else {
-			/*
-			 * Cached swapin only supports order 0 folio, it is
-			 * necessary to recalculate the new swap entry based on
-			 * the offset, as the swapin index might be unalgined.
-			 */
-			if (order) {
-				offset = index - round_down(index, 1 << order);
-				swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
-			}
-
+			/* Cached swapin only supports order 0 folio */
			folio = shmem_swapin_cluster(swap, gfp, info, index);
 			if (!folio) {
 				error = -ENOMEM;
 				goto failed;
 			}
 		}
 	}
+
 	if (order > folio_order(folio)) {
 		/*
 		 * Swapin may get smaller folios due to various reasons:
@@ -2393,24 +2392,25 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		 * large swap entries. In such cases, we should split the
 		 * large swap entry to prevent possible data corruption.
 		 */
-		split_order = shmem_split_large_entry(inode, index, index_entry, gfp);
-		if (split_order < 0) {
-			error = split_order;
+		error = shmem_split_large_entry(inode, index, index_entry, gfp);
+		if (error)
 			goto failed_nolock;
-		}
+	}
 
-		/*
-		 * If the large swap entry has already been split, it is
-		 * necessary to recalculate the new swap entry based on
-		 * the old order alignment.
-		 */
-		if (split_order > 0) {
-			offset = index - round_down(index, 1 << split_order);
-			swap = swp_entry(swp_type(swap), swp_offset(index_entry) + offset);
-		}
-	} else if (order < folio_order(folio)) {
-		swap.val = round_down(swap.val, 1 << folio_order(folio));
-		index = round_down(index, 1 << folio_order(folio));
+	/*
+	 * If the folio is large, round down swap and index by folio size.
+	 * No matter what race occurs, the swap layer ensures we either get
+	 * a valid folio that has its swap entry aligned by size, or a
+	 * temporarily invalid one which we'll abort very soon and retry.
+	 *
+	 * shmem_add_to_page_cache ensures the whole range contains expected
+	 * entries and prevents any corruption, so any race split is fine
+	 * too, it will succeed as long as the entries are still there.
+	 */
+	nr_pages = folio_nr_pages(folio);
+	if (nr_pages > 1) {
+		swap.val = round_down(swap.val, nr_pages);
+		index = round_down(index, nr_pages);
 	}
 
 	/*
@@ -2446,8 +2446,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		goto failed;
 	}
 
-	error = shmem_add_to_page_cache(folio, mapping,
-					round_down(index, nr_pages),
+	error = shmem_add_to_page_cache(folio, mapping, index,
 					swp_to_radix_entry(swap), gfp);
 	if (error)
 		goto failed;
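As a follow-up to the hunks above, here is a small userspace sketch of the round-down step that now happens once, right after swapin, instead of being recomputed in each path. The folio size and the starting swap offset/index values are hypothetical, and round_down() again only mirrors the kernel helper for power-of-two sizes.

/*
 * Userspace sketch (not kernel code) of the post-swapin round-down:
 * whatever folio came back, align swap.val and index down by its size,
 * so the rounded index can be passed to shmem_add_to_page_cache() as-is.
 */
#include <stdio.h>

#define round_down(x, y) ((x) & ~((unsigned long)(y) - 1))

int main(void)
{
	unsigned long swap_val = 1030; /* sub entry computed before the lookup */
	unsigned long index = 70;      /* faulting index */
	unsigned long nr_pages = 4;    /* folio_nr_pages() of the swapped-in folio */

	if (nr_pages > 1) {
		/* a folio's swap entry is always aligned by its size */
		swap_val = round_down(swap_val, nr_pages);
		index = round_down(index, nr_pages);
	}
	printf("swap.val=%lu index=%lu\n", swap_val, index);
	return 0;
}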
