Skip to content

Commit 1bbb97b

Browse files
adam900710kdave
authored andcommitted
btrfs: scrub: Require mandatory block group RO for dev-replace
[BUG] For dev-replace test cases with fsstress, like btrfs/06[45] btrfs/071, looped runs can lead to random failure, where scrub finds csum error. The possibility is not high, around 1/20 to 1/100, but it's causing data corruption. The bug is observable after commit b12de52 ("btrfs: scrub: Don't check free space before marking a block group RO") [CAUSE] Dev-replace has two source of writes: - Write duplication All writes to source device will also be duplicated to target device. Content: Not yet persisted data/meta - Scrub copy Dev-replace reused scrub code to iterate through existing extents, and copy the verified data to target device. Content: Previously persisted data and metadata The difference in contents makes the following race possible: Regular Writer | Dev-replace ----------------------------------------------------------------- ^ | | Preallocate one data extent | | at bytenr X, len 1M | v | ^ Commit transaction | | Now extent [X, X+1M) is in | v commit root | ================== Dev replace starts ========================= | ^ | | Scrub extent [X, X+1M) | | Read [X, X+1M) | | (The content are mostly garbage | | since it's preallocated) ^ | v | Write back happens for | | extent [X, X+512K) | | New data writes to both | | source and target dev. | v | | ^ | | Scrub writes back extent [X, X+1M) | | to target device. | | This will over write the new data in | | [X, X+512K) | v This race can only happen for nocow writes. Thus metadata and data cow writes are safe, as COW will never overwrite extents of previous transaction (in commit root). This behavior can be confirmed by disabling all fallocate related calls in fsstress (*), then all related tests can pass a 2000 run loop. *: FSSTRESS_AVOID="-f fallocate=0 -f allocsp=0 -f zero=0 -f insert=0 \ -f collapse=0 -f punch=0 -f resvsp=0" I didn't expect resvsp ioctl will fallback to fallocate in VFS... [FIX] Make dev-replace to require mandatory block group RO, and wait for current nocow writes before calling scrub_chunk(). This patch will mostly revert commit 76a8efa ("btrfs: Continue replace when set_block_ro failed") for dev-replace path. The side effect is, dev-replace can be more strict on avaialble space, but definitely worth to avoid data corruption. Reported-by: Filipe Manana <[email protected]> Fixes: 76a8efa ("btrfs: Continue replace when set_block_ro failed") Fixes: b12de52 ("btrfs: scrub: Don't check free space before marking a block group RO") Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent b35cf1f commit 1bbb97b

File tree

1 file changed

+28
-5
lines changed

1 file changed

+28
-5
lines changed

fs/btrfs/scrub.c

Lines changed: 28 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -3577,17 +3577,27 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
35773577
* This can easily boost the amount of SYSTEM chunks if cleaner
35783578
* thread can't be triggered fast enough, and use up all space
35793579
* of btrfs_super_block::sys_chunk_array
3580+
*
3581+
* While for dev replace, we need to try our best to mark block
3582+
* group RO, to prevent race between:
3583+
* - Write duplication
3584+
* Contains latest data
3585+
* - Scrub copy
3586+
* Contains data from commit tree
3587+
*
3588+
* If target block group is not marked RO, nocow writes can
3589+
* be overwritten by scrub copy, causing data corruption.
3590+
* So for dev-replace, it's not allowed to continue if a block
3591+
* group is not RO.
35803592
*/
3581-
ret = btrfs_inc_block_group_ro(cache, false);
3582-
scrub_pause_off(fs_info);
3583-
3593+
ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
35843594
if (ret == 0) {
35853595
ro_set = 1;
3586-
} else if (ret == -ENOSPC) {
3596+
} else if (ret == -ENOSPC && !sctx->is_dev_replace) {
35873597
/*
35883598
* btrfs_inc_block_group_ro return -ENOSPC when it
35893599
* failed in creating new chunk for metadata.
3590-
* It is not a problem for scrub/replace, because
3600+
* It is not a problem for scrub, because
35913601
* metadata are always cowed, and our scrub paused
35923602
* commit_transactions.
35933603
*/
@@ -3596,9 +3606,22 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
35963606
btrfs_warn(fs_info,
35973607
"failed setting block group ro: %d", ret);
35983608
btrfs_put_block_group(cache);
3609+
scrub_pause_off(fs_info);
35993610
break;
36003611
}
36013612

3613+
/*
3614+
* Now the target block is marked RO, wait for nocow writes to
3615+
* finish before dev-replace.
3616+
* COW is fine, as COW never overwrites extents in commit tree.
3617+
*/
3618+
if (sctx->is_dev_replace) {
3619+
btrfs_wait_nocow_writers(cache);
3620+
btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start,
3621+
cache->length);
3622+
}
3623+
3624+
scrub_pause_off(fs_info);
36023625
down_write(&dev_replace->rwsem);
36033626
dev_replace->cursor_right = found_key.offset + length;
36043627
dev_replace->cursor_left = found_key.offset;

0 commit comments

Comments
 (0)