
Commit 9786531

adam900710 authored and kdave committed
btrfs: fix corruption reading compressed range when block size is smaller than page size
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs block
size, the following workload can easily lead to a corrupted read:

  mkfs.btrfs -f -s 4k $dev > /dev/null
  mount -o compress $dev $mnt
  xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
  echo "correct result:"
  od -Ad -t x1 $mnt/base
  xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
         -c "reflink $mnt/base 0 32k 32k" \
         -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
  echo "incorrect result:"
  od -Ad -t x1 $mnt/new
  umount $mnt

This shows the following result:

  correct result:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0065536
  incorrect result:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0065536

Notice the zeros in the range [32K, 60K), which are incorrect.

[CAUSE]
With extra trace printk, it shows the following events during od
(some unrelated info removed, like CPU and context):

  od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000

The "r/i" indicates the root and inode number. In our case the file
"new" is using ino 258 from the fs tree (root 5).

Notice that the @prev_em_start pointer is NULL. This means
btrfs_do_readpage() is called from btrfs_read_folio(), not from
btrfs_readahead().
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768

The above 32K blocks will be read from the first half of the compressed
data extent.

  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768

Note that here there is no btrfs_submit_compressed_read() call, which is
incorrect. Although both extent maps at 0 and 32K point to the same
compressed data, their offsets are different and thus they can not be
merged into the same read.

So this means the compressed data read merge check is doing something
wrong.

  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
  od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
  od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440

The function btrfs_submit_compressed_read() is only called at the end of
the folio read.
The compressed bio will only have an extent map of range [0, 32K), but
the original bio passed in is for the whole 64K folio. This will cause
the decompression part to only fill the first 32K, leaving the rest
untouched (aka, filled with zeros).

This incorrect compressed read merge leads to the above data corruption.

Similar problems have happened in the past; commit 808f80b ("Btrfs:
update fix for read corruption of compressed and shared extents") did
pretty much the same fix for readahead. But that was back in 2015, when
btrfs only supported the bs (block size) == ps (page size) case. This
meant btrfs_do_readpage() only needed to handle a folio containing
exactly one block, and only btrfs_readahead() could lead to a read
covering multiple blocks. Thus only btrfs_readahead() passes a non-NULL
@prev_em_start pointer.

With the v5.15 kernel btrfs introduced bs < ps support. This breaks the
above assumption that a folio can only contain one block: now
btrfs_read_folio() can also read multiple blocks in one go. But
btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check will never be triggered.

In theory, this can also happen for btrfs with large folios, but since
large folios are still experimental we don't need to bother with them,
thus only bs < ps support is affected for now.

[FIX]
Instead of passing @prev_em_start to do the proper compressed extent
check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
the existing bio force submission logic will always be triggered.

CC: [email protected] # 5.15+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
1 parent 6db1df4 commit 9786531

1 file changed: +30 -10 lines changed

fs/btrfs/extent_io.c

Lines changed: 30 additions & 10 deletions
@@ -111,6 +111,24 @@ struct btrfs_bio_ctrl {
 	 */
 	unsigned long submit_bitmap;
 	struct readahead_control *ractl;
+
+	/*
+	 * The start offset of the last used extent map by a read operation.
+	 *
+	 * This is for proper compressed read merge.
+	 * U64_MAX means we are starting the read and have made no progress yet.
+	 *
+	 * The current btrfs_bio_is_contig() only uses disk_bytenr as
+	 * the condition to check if the read can be merged with previous
+	 * bio, which is not correct. E.g. two file extents pointing to the
+	 * same extent but with different offset.
+	 *
+	 * So here we need to do extra checks to only merge reads that are
+	 * covered by the same extent map.
+	 * Just extent_map::start will be enough, as they are unique
+	 * inside the same inode.
+	 */
+	u64 last_em_start;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -909,7 +927,7 @@ static void btrfs_readahead_expand(struct readahead_control *ractl,
  * return 0 on success, otherwise return error
  */
 static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
-			     struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
+			     struct btrfs_bio_ctrl *bio_ctrl)
 {
 	struct inode *inode = folio->mapping->host;
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
@@ -1019,12 +1037,11 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 	 * non-optimal behavior (submitting 2 bios for the same extent).
 	 */
 	if (compress_type != BTRFS_COMPRESS_NONE &&
-	    prev_em_start && *prev_em_start != (u64)-1 &&
-	    *prev_em_start != em->start)
+	    bio_ctrl->last_em_start != U64_MAX &&
+	    bio_ctrl->last_em_start != em->start)
 		force_bio_submit = true;
 
-	if (prev_em_start)
-		*prev_em_start = em->start;
+	bio_ctrl->last_em_start = em->start;
 
 	btrfs_free_extent_map(em);
 	em = NULL;
@@ -1238,12 +1255,15 @@ int btrfs_read_folio(struct file *file, struct folio *folio)
 	const u64 start = folio_pos(folio);
 	const u64 end = start + folio_size(folio) - 1;
 	struct extent_state *cached_state = NULL;
-	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
+	struct btrfs_bio_ctrl bio_ctrl = {
+		.opf = REQ_OP_READ,
+		.last_em_start = U64_MAX,
+	};
 	struct extent_map *em_cached = NULL;
 	int ret;
 
 	lock_extents_for_read(inode, start, end, &cached_state);
-	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
+	ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
 	btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
 
 	btrfs_free_extent_map(em_cached);
@@ -2583,20 +2603,20 @@ void btrfs_readahead(struct readahead_control *rac)
 {
 	struct btrfs_bio_ctrl bio_ctrl = {
 		.opf = REQ_OP_READ | REQ_RAHEAD,
-		.ractl = rac
+		.ractl = rac,
+		.last_em_start = U64_MAX,
 	};
 	struct folio *folio;
 	struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
 	const u64 start = readahead_pos(rac);
 	const u64 end = start + readahead_length(rac) - 1;
 	struct extent_state *cached_state = NULL;
 	struct extent_map *em_cached = NULL;
-	u64 prev_em_start = (u64)-1;
 
 	lock_extents_for_read(inode, start, end, &cached_state);
 
 	while ((folio = readahead_folio(rac)) != NULL)
-		btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
+		btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
 
 	btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);