
Commit 978b63f

fdmanana authored and kdave committed
btrfs: fix race when detecting delalloc ranges during fiemap
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc).

This however introduced a race that makes us miss delalloc ranges for file regions that are currently holes, so the caller of fiemap will not be aware that there's data for some file regions. This can be quite serious for some use cases - for example in coreutils versions before 9.0, the cp program used fiemap to detect holes and data in the source file, copying only regions with data (extents or delalloc) from the source file to the destination file in order to preserve holes (see the documentation for its --sparse command line option). This means that if cp was used with a source file that had delalloc in a hole, the destination file could end up without that data, which is effectively a data loss issue, if it happened to hit the race described below.

The race happens like this:

1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that has delalloc in the file range [64M, 65M[, which is currently a hole;

2) Fiemap locks the inode in shared mode, then starts iterating the inode's subvolume tree searching for file extent items, without having the whole fiemap target range locked in the inode's io tree - the change introduced recently by commit b0ad381 ("btrfs: fix deadlock with fiemap and extent locking"). It only locks ranges in the io tree when it finds a hole or prealloc extent since that commit;

3) Note that fiemap clones each leaf before using it, and this is to avoid deadlocks when locking a file range in the inode's io tree and the fiemap buffer is memory mapped to some file, because writing to the page with btrfs_page_mkwrite() will wait on any ordered extent for the page's range and the ordered extent needs to lock the range and may need to modify the same leaf, therefore leading to a deadlock on the leaf;

4) While iterating the file extent items in the cloned leaf before finding the hole in the range [64M, 65M[, the delalloc in that range is flushed and its ordered extent completes - meaning the corresponding file extent item is in the inode's subvolume tree, but not present in the cloned leaf that fiemap is iterating over;

5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in the cloned leaf (or a file extent item with disk_bytenr == 0 in case the NO_HOLES feature is not enabled), it will lock that file range in the inode's io tree and then search for delalloc by checking for the EXTENT_DELALLOC bit in the io tree for that range and ordered extents (with btrfs_find_delalloc_in_range()). But it finds nothing since the delalloc in that range was already flushed and the ordered extent completed and is gone - as a result fiemap will not report that there's delalloc or an extent for the range [64M, 65M[, so user space will be misled into thinking that there's a hole in that range.

This could actually be sporadically triggered with test case generic/094 from fstests, which reports a missing extent/delalloc range like this:

    generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
        --- tests/generic/094.out	2020-06-10 19:29:03.830519425 +0100
        +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad	2024-02-28 11:00:00.381071525 +0000
        @@ -1,3 +1,9 @@
         QA output created by 094
         fiemap run with sync
         fiemap run without sync
        +ERROR: couldn't find extent at 7
        +map is 'HHDDHPPDPHPH'
        +logical: [       5..       6] phys: 301517..301518 flags: 0x800 tot: 2
        +logical: [       8..       8] phys: 301520..301520 flags: 0x800 tot: 1
        ...
        (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)

So in order to fix this, while still avoiding deadlocks in the case where the fiemap buffer is memory mapped to the same file, change fiemap to work like the following:

1) Always lock the whole range in the inode's io tree before starting to iterate the inode's subvolume tree searching for file extent items, just like we did before commit b0ad381 ("btrfs: fix deadlock with fiemap and extent locking");

2) Now instead of writing to the fiemap buffer every time we have an extent to report, write instead to a temporary buffer (1 page). When that buffer becomes full, stop iterating the file extent items, unlock the range in the io tree, release the search path, submit all the entries kept in that buffer to the fiemap buffer, and then resume the search for file extent items after locking again the remainder of the range in the io tree.

A buffer with the size of a page allows for 146 entries in a system with 4K pages. This is a large enough value to get good performance by avoiding too many restarts of the search for file extent items. In other words this preserves the huge performance gains made in the last two years to fiemap, while avoiding the deadlocks in case the fiemap buffer is memory mapped to the same file (useless in practice, but possible and exercised by fuzz testing and syzbot).

Fixes: b0ad381 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
1 parent ae6bd7f commit 978b63f
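The flush-and-restart scheme described in the commit message can be sketched in user space. The following is a simplified model, not the kernel code: `emit()`, `scan()`, `FLUSH_CACHE` and the buffer size of 4 are stand-ins for `emit_fiemap_extent()`, `extent_fiemap()`, `BTRFS_FIEMAP_FLUSH_CACHE` and the one-page entries array.

```c
#include <assert.h>

#define FLUSH_CACHE (-4096)	/* stand-in for BTRFS_FIEMAP_FLUSH_CACHE */
#define NR_ENTRIES 4		/* tiny buffer, to force frequent restarts */

struct entry { unsigned long long offset, len; };

struct cache {
	struct entry entries[NR_ENTRIES];
	int entries_pos;
	unsigned long long next_search_offset;
};

/* Buffer one extent; signal a flush once the buffer is full. */
static int emit(struct cache *c, unsigned long long offset, unsigned long long len)
{
	if (c->entries_pos == NR_ENTRIES) {
		struct entry *last = &c->entries[NR_ENTRIES - 1];

		/* Resume after the last *stored* extent, not the current one. */
		c->next_search_offset = last->offset + last->len;
		return FLUSH_CACHE;
	}
	c->entries[c->entries_pos].offset = offset;
	c->entries[c->entries_pos].len = len;
	c->entries_pos++;
	return 0;
}

/*
 * Scan [start, start + count) emitting unit-length extents, flushing the
 * buffer and restarting the scan whenever it fills up. Returns the total
 * number of entries flushed to the (imaginary) destination buffer.
 */
static long long scan(unsigned long long start, unsigned long long count)
{
	struct cache c = { .entries_pos = 0 };
	long long flushed = 0;

restart:
	for (unsigned long long off = start; off < start + count; off++) {
		if (emit(&c, off, 1) == FLUSH_CACHE) {
			flushed += c.entries_pos;	/* "copy to user buffer" */
			c.entries_pos = 0;
			count -= c.next_search_offset - start;
			start = c.next_search_offset;
			goto restart;
		}
	}
	return flushed + c.entries_pos;			/* final flush */
}
```

Despite the restarts, every extent in the range is flushed exactly once, which is the property the real patch relies on.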

File tree: 1 file changed, +160 / -61 lines changed

fs/btrfs/extent_io.c

Lines changed: 160 additions & 61 deletions
@@ -2453,19 +2453,94 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 	return try_release_extent_state(tree, page, mask);
 }
 
+struct btrfs_fiemap_entry {
+	u64 offset;
+	u64 phys;
+	u64 len;
+	u32 flags;
+};
+
 /*
- * To cache previous fiemap extent
+ * Indicate the caller of emit_fiemap_extent() that it needs to unlock the file
+ * range from the inode's io tree, unlock the subvolume tree search path, flush
+ * the fiemap cache and relock the file range and research the subvolume tree.
+ * The value here is something negative that can't be confused with a valid
+ * errno value and different from 1 because that's also a return value from
+ * fiemap_fill_next_extent() and also it's often used to mean some btree search
+ * did not find a key, so make it some distinct negative value.
+ */
+#define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1))
+
+/*
+ * Used to:
+ *
+ * - Cache the next entry to be emitted to the fiemap buffer, so that we can
+ *   merge extents that are contiguous and can be grouped as a single one;
  *
- * Will be used for merging fiemap extent
+ * - Store extents ready to be written to the fiemap buffer in an intermediary
+ *   buffer. This intermediary buffer is to ensure that in case the fiemap
+ *   buffer is memory mapped to the fiemap target file, we don't deadlock
+ *   during btrfs_page_mkwrite(). This is because during fiemap we are locking
+ *   an extent range in order to prevent races with delalloc flushing and
+ *   ordered extent completion, which is needed in order to reliably detect
+ *   delalloc in holes and prealloc extents. And this can lead to a deadlock
+ *   if the fiemap buffer is memory mapped to the file we are running fiemap
+ *   against (a silly, useless in practice scenario, but possible) because
+ *   btrfs_page_mkwrite() will try to lock the same extent range.
  */
 struct fiemap_cache {
+	/* An array of ready fiemap entries. */
+	struct btrfs_fiemap_entry *entries;
+	/* Number of entries in the entries array. */
+	int entries_size;
+	/* Index of the next entry in the entries array to write to. */
+	int entries_pos;
+	/*
+	 * Once the entries array is full, this indicates what's the offset for
+	 * the next file extent item we must search for in the inode's subvolume
+	 * tree after unlocking the extent range in the inode's io tree and
+	 * releasing the search path.
+	 */
+	u64 next_search_offset;
+	/*
+	 * This matches struct fiemap_extent_info::fi_mapped_extents, we use it
+	 * to count ourselves emitted extents and stop instead of relying on
+	 * fiemap_fill_next_extent() because we buffer ready fiemap entries at
+	 * the @entries array, and we want to stop as soon as we hit the max
+	 * amount of extents to map, not just to save time but also to make the
+	 * logic at extent_fiemap() simpler.
+	 */
+	unsigned int extents_mapped;
+	/* Fields for the cached extent (unsubmitted, not ready, extent). */
 	u64 offset;
 	u64 phys;
 	u64 len;
 	u32 flags;
 	bool cached;
 };
 
+static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo,
+			      struct fiemap_cache *cache)
+{
+	for (int i = 0; i < cache->entries_pos; i++) {
+		struct btrfs_fiemap_entry *entry = &cache->entries[i];
+		int ret;
+
+		ret = fiemap_fill_next_extent(fieinfo, entry->offset,
+					      entry->phys, entry->len,
+					      entry->flags);
+		/*
+		 * Ignore 1 (reached max entries) because we keep track of that
+		 * ourselves in emit_fiemap_extent().
+		 */
+		if (ret < 0)
+			return ret;
+	}
+	cache->entries_pos = 0;
+
+	return 0;
+}
+
 /*
  * Helper to submit fiemap extent.
  *
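The sentinel introduced above must be distinguishable both from 1 (the "reached max entries" return of fiemap_fill_next_extent()) and from every valid negative errno. A small user-space check of that property, assuming MAX_ERRNO is 4095 as defined in the kernel's include/linux/err.h:

```c
#include <assert.h>

#define MAX_ERRNO 4095			/* as in include/linux/err.h */
#define FLUSH_CACHE (-(MAX_ERRNO + 1))	/* mirrors BTRFS_FIEMAP_FLUSH_CACHE */

/* Returns 1 if v lies in the range of valid -errno return values. */
static int is_errno_value(long v)
{
	return v < 0 && v >= -MAX_ERRNO;
}
```

Because the sentinel sits just outside the errno range, `ret < 0` checks still catch real errors while `ret == BTRFS_FIEMAP_FLUSH_CACHE` can be tested separately.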
@@ -2480,8 +2555,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 			      struct fiemap_cache *cache,
 			      u64 offset, u64 phys, u64 len, u32 flags)
 {
+	struct btrfs_fiemap_entry *entry;
 	u64 cache_end;
-	int ret = 0;
 
 	/* Set at the end of extent_fiemap(). */
 	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
@@ -2494,7 +2569,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 	 * find an extent that starts at an offset behind the end offset of the
 	 * previous extent we processed. This happens if fiemap is called
 	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
-	 * while we call btrfs_next_leaf() (through fiemap_next_leaf_item()).
+	 * after we had to unlock the file range, release the search path, emit
+	 * the fiemap extents stored in the buffer (cache->entries array) and
+	 * the lock the remainder of the range and re-search the btree.
 	 *
 	 * For example we are in leaf X processing its last item, which is the
 	 * file extent item for file range [512K, 1M[, and after
@@ -2607,11 +2684,35 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 
 emit:
 	/* Not mergeable, need to submit cached one */
-	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
-				      cache->len, cache->flags);
-	cache->cached = false;
-	if (ret)
-		return ret;
+
+	if (cache->entries_pos == cache->entries_size) {
+		/*
+		 * We will need to research for the end offset of the last
+		 * stored extent and not from the current offset, because after
+		 * unlocking the range and releasing the path, if there's a hole
+		 * between that end offset and this current offset, a new extent
+		 * may have been inserted due to a new write, so we don't want
+		 * to miss it.
+		 */
+		entry = &cache->entries[cache->entries_size - 1];
+		cache->next_search_offset = entry->offset + entry->len;
+		cache->cached = false;
+
+		return BTRFS_FIEMAP_FLUSH_CACHE;
+	}
+
+	entry = &cache->entries[cache->entries_pos];
+	entry->offset = cache->offset;
+	entry->phys = cache->phys;
+	entry->len = cache->len;
+	entry->flags = cache->flags;
+	cache->entries_pos++;
+	cache->extents_mapped++;
+
+	if (cache->extents_mapped == fieinfo->fi_extents_max) {
+		cache->cached = false;
+		return 1;
+	}
 assign:
 	cache->cached = true;
 	cache->offset = offset;
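The `emit:` path above only runs when the cached extent cannot be merged with the next one. As a user-space sketch of the mergeability test applied earlier in emit_fiemap_extent() (struct `ext` and `mergeable()` are simplified stand-ins, not the kernel's types): two extents are grouped when they are contiguous in both file offset and physical offset and carry identical flags.

```c
#include <assert.h>

struct ext {
	unsigned long long offset;	/* file offset */
	unsigned long long phys;	/* disk offset */
	unsigned long long len;
	unsigned int flags;
};

/*
 * The cached extent and the next one can be reported as a single fiemap
 * entry when the next extent starts exactly where the cached one ends,
 * both logically and physically, with the same flags.
 */
static int mergeable(const struct ext *cached, const struct ext *next)
{
	return cached->offset + cached->len == next->offset &&
	       cached->phys + cached->len == next->phys &&
	       cached->flags == next->flags;
}
```

When `mergeable()` holds, the kernel simply extends `cache->len` instead of consuming a slot in the entries array.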
@@ -2737,8 +2838,8 @@ static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path
 	 * neighbour leaf).
 	 * We also need the private clone because holding a read lock on an
 	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
-	 * when we call fiemap_fill_next_extent(), because that may cause a page
-	 * fault when filling the user space buffer with fiemap data.
+	 * when we check if extents are shared, as backref walking may need to
+	 * lock the same leaf we are processing.
 	 */
 	clone = btrfs_clone_extent_buffer(path->nodes[0]);
 	if (!clone)
@@ -2778,34 +2879,16 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
 	 * it beyond i_size.
 	 */
 	while (cur_offset < end && cur_offset < i_size) {
-		struct extent_state *cached_state = NULL;
 		u64 delalloc_start;
 		u64 delalloc_end;
 		u64 prealloc_start;
-		u64 lockstart;
-		u64 lockend;
 		u64 prealloc_len = 0;
 		bool delalloc;
 
-		lockstart = round_down(cur_offset, inode->root->fs_info->sectorsize);
-		lockend = round_up(end, inode->root->fs_info->sectorsize);
-
-		/*
-		 * We are only locking for the delalloc range because that's the
-		 * only thing that can change here. With fiemap we have a lock
-		 * on the inode, so no buffered or direct writes can happen.
-		 *
-		 * However mmaps and normal page writeback will cause this to
-		 * change arbitrarily. We have to lock the extent lock here to
-		 * make sure that nobody messes with the tree while we're doing
-		 * btrfs_find_delalloc_in_range.
-		 */
-		lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
 							delalloc_cached_state,
 							&delalloc_start,
 							&delalloc_end);
-		unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 		if (!delalloc)
 			break;
 
@@ -2973,6 +3056,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		  u64 start, u64 len)
 {
 	const u64 ino = btrfs_ino(inode);
+	struct extent_state *cached_state = NULL;
 	struct extent_state *delalloc_cached_state = NULL;
 	struct btrfs_path *path;
 	struct fiemap_cache cache = { 0 };
@@ -2985,26 +3069,33 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	bool stopped = false;
 	int ret;
 
+	cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry);
+	cache.entries = kmalloc_array(cache.entries_size,
+				      sizeof(struct btrfs_fiemap_entry),
+				      GFP_KERNEL);
 	backref_ctx = btrfs_alloc_backref_share_check_ctx();
 	path = btrfs_alloc_path();
-	if (!backref_ctx || !path) {
+	if (!cache.entries || !backref_ctx || !path) {
 		ret = -ENOMEM;
 		goto out;
 	}
 
+restart:
 	range_start = round_down(start, sectorsize);
 	range_end = round_up(start + len, sectorsize);
 	prev_extent_end = range_start;
 
+	lock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
 	if (ret < 0)
-		goto out;
+		goto out_unlock;
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
 	ret = fiemap_search_slot(inode, path, range_start);
 	if (ret < 0) {
-		goto out;
+		goto out_unlock;
 	} else if (ret > 0) {
 		/*
 		 * No file extent item found, but we may have delalloc between
@@ -3051,7 +3142,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 						 backref_ctx, 0, 0, 0,
 						 prev_extent_end, hole_end);
 		if (ret < 0) {
-			goto out;
+			goto out_unlock;
 		} else if (ret > 0) {
 			/* fiemap_fill_next_extent() told us to stop. */
 			stopped = true;
@@ -3107,7 +3198,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 						  extent_gen,
 						  backref_ctx);
 			if (ret < 0)
-				goto out;
+				goto out_unlock;
 			else if (ret > 0)
 				flags |= FIEMAP_EXTENT_SHARED;
 		}
@@ -3118,9 +3209,9 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 
 		if (ret < 0) {
-			goto out;
+			goto out_unlock;
 		} else if (ret > 0) {
-			/* fiemap_fill_next_extent() told us to stop. */
+			/* emit_fiemap_extent() told us to stop. */
 			stopped = true;
 			break;
 		}
@@ -3129,12 +3220,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 next_item:
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			goto out;
+			goto out_unlock;
 		}
 
 		ret = fiemap_next_leaf_item(inode, path);
 		if (ret < 0) {
-			goto out;
+			goto out_unlock;
 		} else if (ret > 0) {
 			/* No more file extent items for this inode. */
 			break;
@@ -3143,61 +3234,69 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	}
 
 check_eof_delalloc:
-	/*
-	 * Release (and free) the path before emitting any final entries to
-	 * fiemap_fill_next_extent() to keep lockdep happy. This is because
-	 * once we find no more file extent items exist, we may have a
-	 * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
-	 * faults when copying data to the user space buffer.
-	 */
-	btrfs_free_path(path);
-	path = NULL;
-
 	if (!stopped && prev_extent_end < range_end) {
 		ret = fiemap_process_hole(inode, fieinfo, &cache,
 					  &delalloc_cached_state, backref_ctx,
 					  0, 0, 0, prev_extent_end, range_end - 1);
 		if (ret < 0)
-			goto out;
+			goto out_unlock;
 		prev_extent_end = range_end;
 	}
 
 	if (cache.cached && cache.offset + cache.len >= last_extent_end) {
 		const u64 i_size = i_size_read(&inode->vfs_inode);
 
 		if (prev_extent_end < i_size) {
-			struct extent_state *cached_state = NULL;
 			u64 delalloc_start;
 			u64 delalloc_end;
-			u64 lockstart;
-			u64 lockend;
 			bool delalloc;
 
-			lockstart = round_down(prev_extent_end, sectorsize);
-			lockend = round_up(i_size, sectorsize);
-
-			/*
-			 * See the comment in fiemap_process_hole as to why
-			 * we're doing the locking here.
-			 */
-			lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 			delalloc = btrfs_find_delalloc_in_range(inode,
 								prev_extent_end,
 								i_size - 1,
 								&delalloc_cached_state,
 								&delalloc_start,
 								&delalloc_end);
-			unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 			if (!delalloc)
 				cache.flags |= FIEMAP_EXTENT_LAST;
 		} else {
 			cache.flags |= FIEMAP_EXTENT_LAST;
 		}
 	}
 
+out_unlock:
+	unlock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
+	if (ret == BTRFS_FIEMAP_FLUSH_CACHE) {
+		btrfs_release_path(path);
+		ret = flush_fiemap_cache(fieinfo, &cache);
+		if (ret)
+			goto out;
+		len -= cache.next_search_offset - start;
+		start = cache.next_search_offset;
+		goto restart;
+	} else if (ret < 0) {
+		goto out;
+	}
+
+	/*
+	 * Must free the path before emitting to the fiemap buffer because we
+	 * may have a non-cloned leaf and if the fiemap buffer is memory mapped
+	 * to a file, a write into it (through btrfs_page_mkwrite()) may trigger
+	 * waiting for an ordered extent that in order to complete needs to
+	 * modify that leaf, therefore leading to a deadlock.
+	 */
+	btrfs_free_path(path);
+	path = NULL;
+
+	ret = flush_fiemap_cache(fieinfo, &cache);
+	if (ret)
+		goto out;
+
 	ret = emit_last_fiemap_cache(fieinfo, &cache);
 out:
 	free_extent_state(delalloc_cached_state);
+	kfree(cache.entries);
 	btrfs_free_backref_share_ctx(backref_ctx);
 	btrfs_free_path(path);
 	return ret;
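The restart bookkeeping in the last hunk (`len -= cache.next_search_offset - start; start = cache.next_search_offset;`) preserves the end offset `start + len`, so after relocking, the range still ends at the same `range_end` as before. A user-space sketch of that invariant (struct `range` and `restart_at()` are illustrative names, not kernel code):

```c
#include <assert.h>

struct range { unsigned long long start, len; };

/*
 * Apply the patch's restart bookkeeping and return the new end offset.
 * Shrinking len by exactly the amount start advances keeps start + len
 * constant, so the relocked extent range ends where the original did.
 */
static unsigned long long restart_at(struct range *r, unsigned long long next)
{
	r->len -= next - r->start;	/* len -= cache.next_search_offset - start */
	r->start = next;		/* start = cache.next_search_offset */
	return r->start + r->len;
}
```

This is why the `goto restart` path can recompute `range_start`/`range_end` with the same `round_down`/`round_up` expressions without ever walking past the caller's requested range.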