Commit 6422881

boryas authored and kdave committed
btrfs: fix ssd_spread overallocation
If the ssd_spread mount option is enabled, then we run the so-called clustered allocator for data block groups. In practice, this results in creating a btrfs_free_cluster which caches a block_group and borrows its free extents for allocation.

Since the introduction of allocation size classes, there has been a bug in the interaction between that feature and ssd_spread. find_free_extent() has a number of nested loops: the loop going over the allocation stages, stored in ffe_ctl->loop and managed by find_free_extent_update_loop(); the loop over the raid levels; and the loop over all the block_groups in a space_info. The size class feature relies on the block_group loop to ensure it gets a chance to see a block_group of a given size class. However, the clustered allocator uses the cached cluster block_group and breaks that loop: each call to do_allocation() will really just go back to the same cached block_group.

Normally, this is OK, as the allocation either succeeds and we don't want to loop any more, or it fails, and we clear the cluster and return its space to the block_group. But with size classes, the allocation can succeed, then later fail, outside of do_allocation(), due to a size class mismatch. That latter failure is not properly handled due to the highly complex multi-loop logic. The result is a painful loop where we continue to allocate the same num_bytes from the cluster in a tight loop until it fails and releases the cluster and lets us try a new block_group. But by then, we have skipped great swaths of the available block_groups and are likely to fail to allocate, looping the outer loop.

In pathological cases like the reproducer below, the cached block_group is often the very last one, in which case we don't perform this tight bg loop but instead rip through the ffe stages to LOOP_CHUNK_ALLOC and allocate a chunk, which is now the last one, and we enter the tight inner loop until an allocation failure. Then allocation succeeds on the final block_group, and if the next allocation is a size mismatch, the exact same thing happens again.

Triggering this is as easy as mounting with -o ssd_spread and then running:

  mount -o ssd_spread $dev $mnt
  dd if=/dev/zero of=$mnt/big bs=16M count=1 &>/dev/null
  dd if=/dev/zero of=$mnt/med bs=4M count=1 &>/dev/null
  sync

If you do the two writes + sync in a loop, you can force btrfs to spin an excessive amount on semi-successful clustered allocations, before ultimately failing and advancing to the stage where we force a chunk allocation. This results in 2G of data allocated per iteration, despite only using ~20M of data. By using a small size-classed extent, the inner loop takes longer and we can spin for longer.

The simplest, shortest-term fix to unbreak this is to make the clustered allocator size_class aware in the dumbest way, where it fails on size class mismatch. This may hinder the operation of the clustered allocator, but better hindered than completely broken and terribly overallocating. Further re-design improvements are also in the works.

Fixes: 52bb7a2 ("btrfs: introduce size class to block group allocator")
Reported-by: David Sterba <[email protected]>
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Boris Burkov <[email protected]>
Signed-off-by: David Sterba <[email protected]>
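The shape of the bug can be shown in miniature. Below is a minimal userspace C sketch, not kernel code: the types, names, and sizes (block_group, size_class, a 16M cluster, 4M extents) are illustrative assumptions that only model how the cached cluster keeps "succeeding" while the size class check keeps failing, so the block_group loop cannot advance until the cluster is fully drained.

#include <stdio.h>

/* Illustrative stand-ins only; these are not the kernel's types. */
enum size_class { SZ_NONE, SZ_SMALL, SZ_MEDIUM, SZ_LARGE };

struct block_group {
	enum size_class size_class;
	long free_bytes;
};

/* The cluster caches one block group and lends out its free space. */
static struct block_group *cluster_bg;

static long alloc_from_cluster(long num_bytes)
{
	if (cluster_bg && cluster_bg->free_bytes >= num_bytes) {
		cluster_bg->free_bytes -= num_bytes;
		return num_bytes;	/* the cluster allocation "succeeds" */
	}
	cluster_bg = NULL;		/* only failure releases the cluster */
	return -1;
}

int main(void)
{
	/* The cluster has cached a block group of the LARGE size class. */
	struct block_group large_bg = { SZ_LARGE, 16L << 20 };
	int spins = 0;

	cluster_bg = &large_bg;
	/* Now allocate 4M extents belonging to the MEDIUM size class. */
	while (1) {
		if (alloc_from_cluster(4L << 20) < 0)
			break;	/* cluster drained: the bg loop may advance */
		if (cluster_bg->size_class == SZ_MEDIUM)
			break;	/* a matching bg would stop the loop here */
		spins++;	/* mismatch is noticed only after "success" */
	}
	printf("cluster allocations wasted on a mismatched bg: %d\n", spins);
	return 0;
}

With the patch below, the size class is checked before borrowing from the cached cluster block group, so the very first mismatch takes the release_cluster path instead of spinning.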
1 parent 544e4f9 commit 6422881

1 file changed: +17 −16 lines changed

fs/btrfs/extent-tree.c

@@ -3651,6 +3651,21 @@ btrfs_release_block_group(struct btrfs_block_group *cache,
 	btrfs_put_block_group(cache);
 }
 
+static bool find_free_extent_check_size_class(const struct find_free_extent_ctl *ffe_ctl,
+					      const struct btrfs_block_group *bg)
+{
+	if (ffe_ctl->policy == BTRFS_EXTENT_ALLOC_ZONED)
+		return true;
+	if (!btrfs_block_group_should_use_size_class(bg))
+		return true;
+	if (ffe_ctl->loop >= LOOP_WRONG_SIZE_CLASS)
+		return true;
+	if (ffe_ctl->loop >= LOOP_UNSET_SIZE_CLASS &&
+	    bg->size_class == BTRFS_BG_SZ_NONE)
+		return true;
+	return ffe_ctl->size_class == bg->size_class;
+}
+
 /*
  * Helper function for find_free_extent().
  *
@@ -3672,7 +3687,8 @@ static int find_free_extent_clustered(struct btrfs_block_group *bg,
 	if (!cluster_bg)
 		goto refill_cluster;
 	if (cluster_bg != bg && (cluster_bg->ro ||
-	    !block_group_bits(cluster_bg, ffe_ctl->flags)))
+	    !block_group_bits(cluster_bg, ffe_ctl->flags) ||
+	    !find_free_extent_check_size_class(ffe_ctl, cluster_bg)))
 		goto release_cluster;
 
 	offset = btrfs_alloc_from_cluster(cluster_bg, last_ptr,
@@ -4229,21 +4245,6 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
 	return -ENOSPC;
 }
 
-static bool find_free_extent_check_size_class(struct find_free_extent_ctl *ffe_ctl,
-					      struct btrfs_block_group *bg)
-{
-	if (ffe_ctl->policy == BTRFS_EXTENT_ALLOC_ZONED)
-		return true;
-	if (!btrfs_block_group_should_use_size_class(bg))
-		return true;
-	if (ffe_ctl->loop >= LOOP_WRONG_SIZE_CLASS)
-		return true;
-	if (ffe_ctl->loop >= LOOP_UNSET_SIZE_CLASS &&
-	    bg->size_class == BTRFS_BG_SZ_NONE)
-		return true;
-	return ffe_ctl->size_class == bg->size_class;
-}
-
 static int prepare_allocation_clustered(struct btrfs_fs_info *fs_info,
 					struct find_free_extent_ctl *ffe_ctl,
 					struct btrfs_space_info *space_info,
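As a side note on the moved helper: its body is a decision ladder that relaxes the size class requirement as the allocation stages advance. The standalone sketch below is a userspace model with simplified stand-in enums (the kernel has more stages than these three, and the zoned-policy and should-use-size-class early returns are omitted); it shows a MEDIUM allocation being rejected by a LARGE block group in the early stages and accepted only once LOOP_WRONG_SIZE_CLASS is reached.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins; the kernel has more stages between these. */
enum loop_stage { LOOP_EARLY, LOOP_UNSET_SIZE_CLASS, LOOP_WRONG_SIZE_CLASS };
enum size_class { SZ_NONE, SZ_SMALL, SZ_MEDIUM, SZ_LARGE };

/*
 * Same ladder as find_free_extent_check_size_class(), minus the
 * zoned-policy and should-use-size-class early returns.
 */
static bool check_size_class(enum loop_stage loop, enum size_class want,
			     enum size_class bg)
{
	if (loop >= LOOP_WRONG_SIZE_CLASS)
		return true;	/* last resort: accept any size class */
	if (loop >= LOOP_UNSET_SIZE_CLASS && bg == SZ_NONE)
		return true;	/* unclassified block groups become fair game */
	return want == bg;	/* early stages demand an exact match */
}

int main(void)
{
	/* A MEDIUM allocation probing a LARGE block group at each stage. */
	for (int l = LOOP_EARLY; l <= LOOP_WRONG_SIZE_CLASS; l++)
		printf("stage %d: %s\n", l,
		       check_size_class(l, SZ_MEDIUM, SZ_LARGE) ?
		       "accept" : "reject");
	return 0;
}

Rejecting the cached cluster block group in the early stages is exactly what restores the block_group loop's chance to find a correctly sized candidate.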
