Commit 866fb14

adam900710 authored and kdave committed
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people; the
common factor is bad page state error messages like this:

  BUG: Bad page state in process kswapd0  pfn:d6e840
  page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c pfn:0xd6e840
  aops:btree_aops ino:1
  flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
  page_type: 0xffffffff()
  raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
  raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: non-NULL mapping

[CAUSE]
Commit 09e6cef ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changed the sequence when allocating a new
extent buffer.

Previously we always called grab_extent_buffer() under
mapping->i_private_lock to ensure the safety of modifications to
folio::private (which, for the regular sectorsize case, is a pointer to
the extent buffer).  After the refactor that is no longer the case,
which can lead to the following race:

Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, while thread B is trying to release the page at X + 4K (the
second page of the extent buffer at X):

           Thread A                 |                 Thread B
-----------------------------------+-------------------------------------
                                   | btree_release_folio()
                                   | | This is for the page at X + 4K,
                                   | | not page X.
                                   | |
alloc_extent_buffer()              | |- release_extent_buffer()
|- filemap_add_folio() for the     | |  |- atomic_dec_and_test(eb->refs)
|  page at bytenr X (the first     | |  |
|  page).                          | |  |
|  Which returned -EEXIST.         | |  |
|                                  | |  |
|- filemap_lock_folio()            | |  |
|  Returned the first page locked. | |  |
|                                  | |  |
|- grab_extent_buffer()            | |  |
|  |- atomic_inc_not_zero()        | |  |
|  |  Returned false               | |  |
|  |- folio_detach_private()       | |  |- folio_detach_private() for X
|     |- folio_test_private()      | |     |- folio_test_private()
|        Returned true             | |     |  Returned true
|        |- folio_put()            | |     |- folio_put()

Now there are two puts on the same folio X, leading to a refcount
underflow of that folio and eventually causing the BUG_ON() on
page->mapping.

The condition is not that easy to hit:

- The release must be triggered for a middle page of an eb.
  If the release is on the first page of an eb, the page lock would
  kick in and prevent the race.

- folio_detach_private() has a very small race window.
  It is only between folio_test_private() and folio_clear_private().

That window is exactly what mapping->i_private_lock is used to protect,
and commit 09e6cef ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.

At that time, I thought the page lock would kick in, as
filemap_release_folio() also requires the page to be locked, but forgot
that filemap_release_folio() only locks one page, not all pages of an
extent buffer.

[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.

Furthermore, to prevent future problems, add an extra
lockdep_assert_held() to ensure we are holding the proper lock.
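To make the window concrete: folio_detach_private() (from
include/linux/pagemap.h) performs the test, clear, and put as separate
steps.  The following is a minimal illustrative sketch, not btrfs code
(the helper name is invented; folio_detach_private() and
mapping->i_private_lock are the real kernel names), of how holding the
lock across the call closes the window:

#include <linux/pagemap.h>
#include <linux/spinlock.h>

/*
 * Illustrative only -- this helper is not in the kernel tree.
 * folio_detach_private() expands to roughly:
 *
 *	if (!folio_test_private(folio))
 *		return NULL;
 *	folio_clear_private(folio);
 *	folio_put(folio);
 *
 * Two unserialized callers can both pass the folio_test_private()
 * check before either reaches folio_clear_private(), so folio_put()
 * runs twice and the folio refcount underflows.  Holding
 * i_private_lock across the call makes the test-and-clear atomic with
 * respect to other holders: the loser sees the flag already cleared
 * and gets NULL back.
 */
static void *detach_folio_private_serialized(struct address_space *mapping,
					     struct folio *folio)
{
	void *priv;

	spin_lock(&mapping->i_private_lock);
	priv = folio_detach_private(folio);
	spin_unlock(&mapping->i_private_lock);
	return priv;
}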
A reproducer that is able to hit the race (it takes a few minutes with
instrumented code inserting delays into alloc_extent_buffer()):

  #!/bin/sh
  drop_caches () {
          while(true); do
                  echo 3 > /proc/sys/vm/drop_caches
                  echo 1 > /proc/sys/vm/compact_memory
          done
  }

  run_tar () {
          while(true); do
                  for x in `seq 1 80` ; do
                          tar cf /dev/zero /mnt > /dev/null &
                  done
                  wait
          done
  }

  mkfs.btrfs -f -d single -m single /dev/vda
  mount -o noatime /dev/vda /mnt
  # create 200,000 files, 1K each
  ./simoop -n 200000 -E -f 1k /mnt
  drop_caches &
  (run_tar)

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Fixes: 09e6cef ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
1 parent c83ea43 commit 866fb14

File tree

1 file changed (+31 -29 lines)

fs/btrfs/extent_io.c

Lines changed: 31 additions & 29 deletions
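Before the line-by-line diff, a paraphrased sketch of the fixed flow
(a distillation for orientation, not the verbatim patch): both the
fresh-insert path and the reuse-existing-folio path of
attach_eb_folio_to_filemap() now converge on a single i_private_lock
critical section, so folio::private is never attached or probed
unlocked.  Retry, error handling, and the existing-eb early return are
elided here; all identifiers are from the patch below:

/* Paraphrased skeleton of the fixed attach_eb_folio_to_filemap(). */
static int attach_eb_folio_sketch(struct extent_buffer *eb, int i,
				  struct btrfs_subpage *prealloc)
{
	struct btrfs_fs_info *fs_info = eb->fs_info;
	struct address_space *mapping = fs_info->btree_inode->i_mapping;
	const unsigned long index = eb->start >> PAGE_SHIFT;
	struct folio *existing_folio = NULL;
	int ret;

	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
				GFP_NOFS | __GFP_NOFAIL);
	if (!ret)
		goto finish;	/* our new folio went in cleanly */

	/* -EEXIST: look up and lock the folio that beat us to the slot. */
	existing_folio = filemap_lock_folio(mapping, index + i);

finish:
	spin_lock(&mapping->i_private_lock);
	if (existing_folio) {
		/*
		 * Reuse the existing folio (and possibly probe it for a
		 * live eb via grab_extent_buffer()) under the same lock
		 * that the release side, detach_extent_buffer_page(),
		 * takes around folio_detach_private().
		 */
		__free_page(folio_page(eb->folios[i], 0));
		eb->folios[i] = existing_folio;
	}
	ret = attach_extent_buffer_folio(eb, eb->folios[i], prealloc);
	btrfs_folio_inc_eb_refs(fs_info, eb->folios[i]);
	spin_unlock(&mapping->i_private_lock);
	return ret;
}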
@@ -2913,6 +2913,8 @@ static struct extent_buffer *grab_extent_buffer(
 	struct folio *folio = page_folio(page);
 	struct extent_buffer *exists;
 
+	lockdep_assert_held(&page->mapping->i_private_lock);
+
 	/*
 	 * For subpage case, we completely rely on radix tree to ensure we
 	 * don't try to insert two ebs for the same bytenr. So here we always
@@ -2980,13 +2982,14 @@ static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
  * The caller needs to free the existing folios and retry using the same order.
  */
 static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
+				      struct btrfs_subpage *prealloc,
 				      struct extent_buffer **found_eb_ret)
 {
 
 	struct btrfs_fs_info *fs_info = eb->fs_info;
 	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	const unsigned long index = eb->start >> PAGE_SHIFT;
-	struct folio *existing_folio;
+	struct folio *existing_folio = NULL;
 	int ret;
 
 	ASSERT(found_eb_ret);
@@ -2998,12 +3001,14 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 	ret = filemap_add_folio(mapping, eb->folios[i], index + i,
 				GFP_NOFS | __GFP_NOFAIL);
 	if (!ret)
-		return 0;
+		goto finish;
 
 	existing_folio = filemap_lock_folio(mapping, index + i);
 	/* The page cache only exists for a very short time, just retry. */
-	if (IS_ERR(existing_folio))
+	if (IS_ERR(existing_folio)) {
+		existing_folio = NULL;
 		goto retry;
+	}
 
 	/* For now, we should only have single-page folios for btree inode. */
 	ASSERT(folio_nr_pages(existing_folio) == 1);
@@ -3014,21 +3019,21 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 		return -EAGAIN;
 	}
 
-	if (fs_info->nodesize < PAGE_SIZE) {
-		/*
-		 * We're going to reuse the existing page, can drop our page
-		 * and subpage structure now.
-		 */
+finish:
+	spin_lock(&mapping->i_private_lock);
+	if (existing_folio && fs_info->nodesize < PAGE_SIZE) {
+		/* We're going to reuse the existing page, can drop our folio now. */
 		__free_page(folio_page(eb->folios[i], 0));
 		eb->folios[i] = existing_folio;
-	} else {
+	} else if (existing_folio) {
 		struct extent_buffer *existing_eb;
 
 		existing_eb = grab_extent_buffer(fs_info,
 						 folio_page(existing_folio, 0));
 		if (existing_eb) {
 			/* The extent buffer still exists, we can use it directly. */
 			*found_eb_ret = existing_eb;
+			spin_unlock(&mapping->i_private_lock);
 			folio_unlock(existing_folio);
 			folio_put(existing_folio);
 			return 1;
@@ -3037,6 +3042,22 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
 		__free_page(folio_page(eb->folios[i], 0));
 		eb->folios[i] = existing_folio;
 	}
+	eb->folio_size = folio_size(eb->folios[i]);
+	eb->folio_shift = folio_shift(eb->folios[i]);
+	/* Should not fail, as we have preallocated the memory. */
+	ret = attach_extent_buffer_folio(eb, eb->folios[i], prealloc);
+	ASSERT(!ret);
+	/*
+	 * To inform we have an extra eb under allocation, so that
+	 * detach_extent_buffer_page() won't release the folio private when the
+	 * eb hasn't been inserted into radix tree yet.
+	 *
+	 * The ref will be decreased when the eb releases the page, in
+	 * detach_extent_buffer_page(). Thus needs no special handling in the
+	 * error path.
+	 */
+	btrfs_folio_inc_eb_refs(fs_info, eb->folios[i]);
+	spin_unlock(&mapping->i_private_lock);
 	return 0;
 }
 
@@ -3048,7 +3069,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	int attached = 0;
 	struct extent_buffer *eb;
 	struct extent_buffer *existing_eb = NULL;
-	struct address_space *mapping = fs_info->btree_inode->i_mapping;
 	struct btrfs_subpage *prealloc = NULL;
 	u64 lockdep_owner = owner_root;
 	bool page_contig = true;
@@ -3114,7 +3134,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	for (int i = 0; i < num_folios; i++) {
 		struct folio *folio;
 
-		ret = attach_eb_folio_to_filemap(eb, i, &existing_eb);
+		ret = attach_eb_folio_to_filemap(eb, i, prealloc, &existing_eb);
 		if (ret > 0) {
 			ASSERT(existing_eb);
 			goto out;
@@ -3151,24 +3171,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		 * and free the allocated page.
 		 */
 		folio = eb->folios[i];
-		eb->folio_size = folio_size(folio);
-		eb->folio_shift = folio_shift(folio);
-		spin_lock(&mapping->i_private_lock);
-		/* Should not fail, as we have preallocated the memory */
-		ret = attach_extent_buffer_folio(eb, folio, prealloc);
-		ASSERT(!ret);
-		/*
-		 * To inform we have extra eb under allocation, so that
-		 * detach_extent_buffer_page() won't release the folio private
-		 * when the eb hasn't yet been inserted into radix tree.
-		 *
-		 * The ref will be decreased when the eb released the page, in
-		 * detach_extent_buffer_page().
-		 * Thus needs no special handling in error path.
-		 */
-		btrfs_folio_inc_eb_refs(fs_info, folio);
-		spin_unlock(&mapping->i_private_lock);
-
 		WARN_ON(btrfs_folio_test_dirty(fs_info, folio, eb->start, eb->len));
 
 		/*
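A note on the assertion the patch adds to grab_extent_buffer():
lockdep_assert_held() compiles to nothing unless lockdep is enabled
(CONFIG_LOCKDEP), and with lockdep it WARNs, with a full lock trace, if
the caller does not hold the lock, so the locking contract is enforced
rather than merely documented.  A hedged illustration of the pattern,
with an invented helper name (not part of this patch):

#include <linux/lockdep.h>
#include <linux/pagemap.h>

/*
 * Invented helper, illustration only.  With CONFIG_LOCKDEP=y, calling
 * this without mapping->i_private_lock held triggers a WARN and a lock
 * trace; in production builds the assertion vanishes entirely.
 */
static void attach_private_checked(struct address_space *mapping,
				   struct folio *folio, void *priv)
{
	lockdep_assert_held(&mapping->i_private_lock);
	folio_attach_private(folio, priv);	/* safe: lock is held */
}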
