Commit 3b19db2

boryas authored and kdave committed
btrfs: try to search for data csums in commit root
If you run a workload with:

- a cgroup that does tons of parallel data reading, with a working set
  much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
  with no memory limit

(see full code listing at the bottom for a reproducer)

then what quickly occurs is:

- a large number of threads trying to read the csum tree
- a decent number of threads deleting csums while running delayed refs
- a large number of threads in direct reclaim, and thus high memory
  pressure

The result of this is that we write back the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums, which means repeatedly write locking the higher levels
of the tree.

As a result, we achieve an unpleasant priority inversion. We have:

- a high degree of contention on the eb rwsem of the csum root node
  (and other upper nodes)
- a memory starved cgroup doing tons of reclaim on CPU
- many reader threads in the memory starved cgroup "holding" the sem as
  readers, but not scheduling promptly (i.e., task __state == 0, but
  not running on a CPU)
- btrfs_commit_transaction stuck trying to acquire the sem as a writer
  (running delayed refs, deleting csums for unreferenced data extents)

This results in arbitrarily long transactions, which in turn seriously
degrades performance for any cgroup using the filesystem (the victim
cgroup in the script). This is not an academic problem: we see exactly
this in production at Meta, with one cgroup over its memory limit
ruining btrfs performance for the whole system and stalling critical
system services that depend on btrfs syncs.

The underlying scheduling "problem" with global rwsems is thorny and
apparently well known; it was discussed at LPC 2024, for example.
As a result, our main short-term lever is to reduce contention on our
various rwsems, with an eye to reducing the frequency of write locking,
so as to avoid disabling the read lock fast acquisition path.

Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem is only taken for writing once, near
the end of the transaction commit, no matter how much memory pressure
there is, so there is far less contention between readers and writers.

This change detects when we are trying to read an old extent (according
to the extent map generation) and wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to look up the csums in lookup_bio_sums, we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.

Note that a single bio_ctrl might collect a few extent_maps into a
single bio, so it is important to track the maximum generation across
all the extent_maps used for each bio to make an accurate decision on
whether it is valid to look in the commit root. If any extent_map is
updated in the current generation, we can't use the commit root.
To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):

====== big-read.c ======
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

#define BUF_SZ (128 * (1 << 10UL))

int read_once(int fd, size_t sz)
{
	char buf[BUF_SZ];
	size_t rd = 0;
	int ret = 0;

	while (rd < sz) {
		ret = read(fd, buf, BUF_SZ);
		if (ret < 0) {
			if (errno == EINTR)
				continue;
			fprintf(stderr, "read failed: %d\n", errno);
			return -errno;
		} else if (ret == 0) {
			break;
		} else {
			rd += ret;
		}
	}
	return rd;
}

int read_loop(char *fname)
{
	int fd;
	struct stat st;
	size_t sz = 0;
	int ret;

	while (1) {
		fd = open(fname, O_RDONLY);
		if (fd == -1) {
			perror("open");
			return 1;
		}
		if (!sz) {
			if (!fstat(fd, &st)) {
				sz = st.st_size;
			} else {
				perror("stat");
				return 1;
			}
		}
		ret = read_once(fd, sz);
		close(fd);
	}
}

int main(int argc, char *argv[])
{
	if (argc != 2) {
		fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
		return 1;
	}
	return read_loop(argv[1]);
}

====== repro.sh ======
#!/usr/bin/env bash

SCRIPT=$(readlink -f "$0")
DIR=$(dirname "$SCRIPT")

dev=$1
mnt=$2
shift
shift

CG_ROOT=/sys/fs/cgroup
BAD_CG=$CG_ROOT/bad-nbr
GOOD_CG=$CG_ROOT/good-nbr

NR_BIGGOS=1
NR_LITTLE=10
NR_VICTIMS=32
NR_VILLAINS=512

START_SEC=$(date +%s)

_elapsed() {
	echo "elapsed: $(($(date +%s) - $START_SEC))"
}

_stats() {
	local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

	echo "================"
	date
	_elapsed
	cat $sysfs/commit_stats
	cat $BAD_CG/memory.pressure
}

_setup_cgs() {
	echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
	mkdir -p $GOOD_CG
	mkdir -p $BAD_CG
	echo max > $BAD_CG/memory.max
	# memory.high much less than the working set will cause heavy reclaim
	echo $((1 << 30)) > $BAD_CG/memory.high
	# victims get a subset of villain CPUs
	echo 0 > $GOOD_CG/cpuset.cpus
	echo 0,1,2,3 > $BAD_CG/cpuset.cpus
}
_kill_cg() {
	local cg=$1
	local attempts=0

	echo "kill cgroup $cg"
	[ -f $cg/cgroup.procs ] || return
	while true; do
		attempts=$((attempts + 1))
		echo 1 > $cg/cgroup.kill
		sleep 1
		procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
		[ $procs -eq 0 ] && break
	done
	rmdir $cg
	echo "killed cgroup $cg in $attempts attempts"
}

_biggo_vol() {
	echo $mnt/biggo_vol.$1
}

_biggo_file() {
	echo $(_biggo_vol $1)/biggo
}

_subvoled_biggos() {
	total_sz=$((10 << 30))
	per_sz=$((total_sz / $NR_VILLAINS))
	dd_count=$((per_sz >> 20))
	echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
	for i in $(seq $NR_VILLAINS)
	do
		btrfs subvol create $(_biggo_vol $i) &>/dev/null
		dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
	done
	echo "done creating subvols."
}

_setup() {
	[ -f .done ] && rm .done
	findmnt -n $dev && exit 1
	if [ -f .re-mkfs ]; then
		mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
	else
		echo "touch .re-mkfs to populate the test fs"
	fi
	mount -o noatime $dev $mnt || exit 3
	[ -f .re-mkfs ] && _subvoled_biggos
	_setup_cgs
}

_my_cleanup() {
	echo "CLEANUP!"
	_kill_cg $BAD_CG
	_kill_cg $GOOD_CG
	sleep 1
	umount $mnt
}

_bad_exit() {
	_err "Unexpected Exit! $?"
	_stats
	exit $?
}

trap _my_cleanup EXIT
trap _bad_exit INT TERM

_setup

# Use a lot of page cache reading the big file
_villain() {
	local i=$1

	echo $BASHPID > $BAD_CG/cgroup.procs
	$DIR/big-read $(_biggo_file $i)
}

# Hit del_csum a lot by overwriting lots of small new files
_victim() {
	echo $BASHPID > $GOOD_CG/cgroup.procs
	i=0
	while (true)
	do
		local tmp=$mnt/tmp.$i

		dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
		i=$((i+1))
		[ $i -eq $NR_LITTLE ] && i=0
	done
}

_one_sync() {
	echo "sync..."
	before=$(date +%s)
	sync
	after=$(date +%s)
	echo "sync done in $((after - before))s"
	_stats
}

# sync in a loop
_sync() {
	echo "start sync loop"
	syncs=0
	echo $BASHPID > $GOOD_CG/cgroup.procs
	while true
	do
		[ -f .done ] && break
		_one_sync
		syncs=$((syncs + 1))
		[ -f .done ] && break
		sleep 10
	done
	if [ $syncs -eq 0 ]; then
		echo "do at least one sync!"
		_one_sync
	fi
	echo "sync loop done."
}

_sleep() {
	local time=${1-60}
	local now=$(date +%s)
	local end=$((now + time))

	while [ $now -lt $end ]; do
		echo "SLEEP: $((end - now))s left. Sleep 10."
		sleep 10
		now=$(date +%s)
	done
}

echo "start $NR_VILLAINS villains"
for i in $(seq $NR_VILLAINS)
do
	_villain $i &
	disown # get rid of annoying log on kill (done via cgroup anyway)
done

echo "start $NR_VICTIMS victims"
for i in $(seq $NR_VICTIMS)
do
	_victim &
	disown
done

_sync &
SYNC_PID=$!

_sleep $1
_elapsed
touch .done
wait $SYNC_PID

echo "OK"
exit 0

Without this patch, that reproducer:
- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes

sync done in 388s
================
Wed Jul 9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274

419 hits state R, D comms big-read
 btrfs_tree_read_lock_nested
 btrfs_read_lock_root_node
 btrfs_search_slot
 btrfs_lookup_csum
 btrfs_lookup_bio_sums
 btrfs_submit_bbio

1 hits state D comms btrfs-transacti
 btrfs_tree_lock_nested
 btrfs_lock_root_node
 btrfs_search_slot
 btrfs_del_csums
 __btrfs_run_delayed_refs
 btrfs_run_delayed_refs

With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally
you can still trigger a rather long commit (a couple of seconds), but
never one that is minutes long.
sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168

Some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:

145 hits state R comms bash, sleep, dd, shuf
 shrink_folio_list
 shrink_lruvec
 shrink_node
 do_try_to_free_pages
 try_to_free_mem_cgroup_pages
 reclaim_high

Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
1 parent 95f0ea1 commit 3b19db2

5 files changed: +97 -3 lines changed

fs/btrfs/bio.c
Lines changed: 1 addition & 0 deletions

@@ -93,6 +93,7 @@ static struct btrfs_bio *btrfs_split_bio(struct btrfs_fs_info *fs_info,
 		refcount_inc(&orig_bbio->ordered->refs);
 		bbio->ordered = orig_bbio->ordered;
 	}
+	bbio->csum_search_commit_root = orig_bbio->csum_search_commit_root;
 	atomic_inc(&orig_bbio->pending_ios);
 	return bbio;
 }

fs/btrfs/bio.h
Lines changed: 2 additions & 0 deletions

@@ -82,6 +82,8 @@ struct btrfs_bio {
 	/* Save the first error status of split bio. */
 	blk_status_t status;

+	/* Use the commit root to look up csums (data read bio only). */
+	bool csum_search_commit_root;
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
 	 * bytes for entire btrfs_bio but relies on bio being last.

fs/btrfs/compression.c
Lines changed: 1 addition & 0 deletions

@@ -602,6 +602,7 @@ void btrfs_submit_compressed_read(struct btrfs_bio *bbio)
 	cb->compressed_len = compressed_len;
 	cb->compress_type = btrfs_extent_map_compression(em);
 	cb->orig_bbio = bbio;
+	cb->bbio.csum_search_commit_root = bbio->csum_search_commit_root;

 	btrfs_free_extent_map(em);

fs/btrfs/extent_io.c
Lines changed: 61 additions & 3 deletions

@@ -101,6 +101,26 @@ struct btrfs_bio_ctrl {
 	enum btrfs_compression_type compress_type;
 	u32 len_to_oe_boundary;
 	blk_opf_t opf;
+	/*
+	 * For data read bios, we attempt to optimize csum lookups if the extent
+	 * generation is older than the current one. To make this possible, we
+	 * need to track the maximum generation of an extent in a bio_ctrl to
+	 * make the decision when submitting the bio.
+	 *
+	 * The pattern between do_readpage(), submit_one_bio() and
+	 * submit_extent_folio() is quite subtle, so tracking this is tricky.
+	 *
+	 * As we process extent E, we might submit a bio with existing built up
+	 * extents before adding E to a new bio, or we might just add E to the
+	 * bio. As a result, E's generation could apply to the current bio or
+	 * to the next one, so we need to be careful to update the bio_ctrl's
+	 * generation with E's only when we are sure E is added to bio_ctrl->bbio
+	 * in submit_extent_folio().
+	 *
+	 * See the comment in btrfs_lookup_bio_sums() for more detail on the
+	 * need for this optimization.
+	 */
+	u64 generation;
 	btrfs_bio_end_io_t end_io_func;
 	struct writeback_control *wbc;

@@ -113,6 +133,26 @@ struct btrfs_bio_ctrl {
 	struct readahead_control *ractl;
 };

+/*
+ * Helper to set the csum search commit root option for a bio_ctrl's bbio
+ * before submitting the bio.
+ *
+ * Only for use by submit_one_bio().
+ */
+static void bio_set_csum_search_commit_root(struct btrfs_bio_ctrl *bio_ctrl)
+{
+	struct btrfs_bio *bbio = bio_ctrl->bbio;
+
+	ASSERT(bbio);
+
+	if (!(btrfs_op(&bbio->bio) == BTRFS_MAP_READ && is_data_inode(bbio->inode)))
+		return;
+
+	bio_ctrl->bbio->csum_search_commit_root =
+		(bio_ctrl->generation &&
+		 bio_ctrl->generation < btrfs_get_fs_generation(bbio->inode->root->fs_info));
+}
+
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 {
 	struct btrfs_bio *bbio = bio_ctrl->bbio;

@@ -123,6 +163,8 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
 	/* Caller should ensure the bio has at least some range added */
 	ASSERT(bbio->bio.bi_iter.bi_size);

+	bio_set_csum_search_commit_root(bio_ctrl);
+
 	if (btrfs_op(&bbio->bio) == BTRFS_MAP_READ &&
 	    bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
 		btrfs_submit_compressed_read(bbio);

@@ -131,6 +173,12 @@ static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)

 	/* The bbio is owned by the end_io handler now */
 	bio_ctrl->bbio = NULL;
+	/*
+	 * We used the generation to decide whether to lookup csums in the
+	 * commit_root or not when we called bio_set_csum_search_commit_root()
+	 * above. Now, reset the generation for the next bio.
+	 */
+	bio_ctrl->generation = 0;
 }

 /*

@@ -701,6 +749,8 @@ static void alloc_new_bio(struct btrfs_inode *inode,
  * @size:	portion of page that we want to write to
  * @pg_offset:	offset of the new bio or to check whether we are adding
  *		a contiguous page to the previous one
+ * @read_em_generation: generation of the extent_map we are submitting
+ *			(only used for read)
  *
  * The will either add the page into the existing @bio_ctrl->bbio, or allocate a
  * new one in @bio_ctrl->bbio.

@@ -709,7 +759,8 @@ static void alloc_new_bio(struct btrfs_inode *inode,
  */
 static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
 				u64 disk_bytenr, struct folio *folio,
-				size_t size, unsigned long pg_offset)
+				size_t size, unsigned long pg_offset,
+				u64 read_em_generation)
 {
 	struct btrfs_inode *inode = folio_to_inode(folio);
 	loff_t file_offset = folio_pos(folio) + pg_offset;

@@ -740,6 +791,11 @@ static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
 			submit_one_bio(bio_ctrl);
 			continue;
 		}
+		/*
+		 * Now that the folio is definitely added to the bio, include its
+		 * generation in the max generation calculation.
+		 */
+		bio_ctrl->generation = max(bio_ctrl->generation, read_em_generation);
 		bio_ctrl->next_file_offset += len;

 		if (bio_ctrl->wbc)

@@ -942,6 +998,7 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		bool force_bio_submit = false;
 		u64 disk_bytenr;
 		u64 block_start;
+		u64 em_gen;

 		ASSERT(IS_ALIGNED(cur, fs_info->sectorsize));
 		if (cur >= last_byte) {

@@ -1026,6 +1083,7 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		if (prev_em_start)
 			*prev_em_start = em->start;

+		em_gen = em->generation;
 		btrfs_free_extent_map(em);
 		em = NULL;

@@ -1049,7 +1107,7 @@ static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
 		if (force_bio_submit)
 			submit_one_bio(bio_ctrl);
 		submit_extent_folio(bio_ctrl, disk_bytenr, folio, blocksize,
-				    pg_offset);
+				    pg_offset, em_gen);
 	}
 	return 0;
 }

@@ -1571,7 +1629,7 @@ static int submit_one_sector(struct btrfs_inode *inode,
 	ASSERT(folio_test_writeback(folio));

 	submit_extent_folio(bio_ctrl, disk_bytenr, folio,
-			    sectorsize, filepos - folio_pos(folio));
+			    sectorsize, filepos - folio_pos(folio), 0);
 	return 0;
 }

fs/btrfs/file-item.c
Lines changed: 32 additions & 0 deletions

@@ -397,6 +397,36 @@ int btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		path->skip_locking = 1;
 	}

+	/*
+	 * If we are searching for a csum of an extent from a past
+	 * transaction, we can search in the commit root and reduce
+	 * lock contention on the csum tree extent buffers.
+	 *
+	 * This is important because that lock is an rwsem which gets
+	 * pretty heavy write load under memory pressure and sustained
+	 * csum overwrites, unlike the commit_root_sem. (Memory pressure
+	 * makes us writeback the nodes multiple times per transaction,
+	 * which makes us cow them each time, taking the write lock.)
+	 *
+	 * Due to how rwsem is implemented, there is a possible
+	 * priority inversion where the readers holding the lock don't
+	 * get scheduled (say they're in a cgroup stuck in heavy reclaim)
+	 * which then blocks writers, including transaction commit. By
+	 * using a semaphore with fewer writers (only a commit switching
+	 * the roots), we make this issue less likely.
+	 *
+	 * Note that we don't rely on btrfs_search_slot to lock the
+	 * commit root csum. We call search_slot multiple times, which would
+	 * create a potential race where a commit comes in between searches
+	 * while we are not holding the commit_root_sem, and we get csums
+	 * from across transactions.
+	 */
+	if (bbio->csum_search_commit_root) {
+		path->search_commit_root = 1;
+		path->skip_locking = 1;
+		down_read(&fs_info->commit_root_sem);
+	}
+
 	while (bio_offset < orig_len) {
 		int count;
 		u64 cur_disk_bytenr = orig_disk_bytenr + bio_offset;

@@ -442,6 +472,8 @@ int btrfs_lookup_bio_sums(struct btrfs_bio *bbio)
 		bio_offset += count * sectorsize;
 	}

+	if (bbio->csum_search_commit_root)
+		up_read(&fs_info->commit_root_sem);
 	return ret;
 }
