Commit 12c5128

fdmanana authored and kdave committed
btrfs: add new unused block groups to the list of unused block groups

Space reservations for metadata are, most of the time, pessimistic, as we reserve space for the worst possible cases: tree heights are at the maximum possible height (8), we need to COW every extent buffer in a tree path, we need to split extent buffers, etc.

For data, we generally reserve the exact amount of space we are going to allocate. The exception here is when using compression, in which case we reserve space matching the uncompressed size, as the compression only happens at writeback time and in the worst possible case we need that amount of space in case the data is not compressible.

This means that when there is no available space in the corresponding space_info object, we may need to allocate a new block group, and then that block group might not be used after all. In this case the block group is never added to the list of unused block groups and ends up never being deleted, except if we unmount and mount the fs again, as when reading block groups from disk we add unused ones to the list of unused block groups (fs_info->unused_bgs). Otherwise a block group is only added to the list of unused block groups when we deallocate the last extent from it, so if no extent is ever allocated, the block group is kept around forever.

This also means that if we have a bunch of tasks reserving space in parallel, we can end up allocating many block groups that are never used, or that are kept around for too long without being used. This has the potential to result in ENOSPC failures, for example if we allocate too many metadata block groups and then end up in a state without enough unallocated space to allocate a new data block group.

This is more likely to happen with metadata reservations as of kernel 6.7, namely since commit 28270e2 ("btrfs: always reserve space for delayed refs when starting transaction"), because we started to always reserve space for delayed references when starting a transaction handle for a non-zero number of items, and also to try to reserve space to fill the gap between the delayed block reserve's reserved space and its size.

So to avoid this, when finishing the creation of a new block group, add the block group to the list of unused block groups if it's still unused at that time. This way, the next time the cleaner kthread runs, it will delete the block group if it's still unused and not needed to satisfy existing space reservations.

Reported-by: Ivan Shapovalov <[email protected]>
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
CC: [email protected] # 6.7+
Reviewed-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Boris Burkov <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

1 parent f4a9f21 commit 12c5128
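
To give a feel for how pessimistic the metadata reservation described in the commit message can be, here is a rough userspace sketch, not the kernel's code. It assumes that a worst-case insertion COWs one extent buffer at each of the 8 possible tree levels and may also split each of them (hence the factor of 2, which is an assumption of this sketch). The helper names `worst_case_metadata_bytes` and `unused_reserved_bytes` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Maximum tree height mentioned in the commit message. */
#define MAX_TREE_LEVEL 8

/*
 * Hypothetical helper: worst-case bytes reserved to insert num_items items.
 * Assumes every level of the tree path is COWed and may also be split
 * (the factor of 2 is an assumption of this sketch).
 */
static inline uint64_t worst_case_metadata_bytes(uint32_t nodesize,
						 unsigned int num_items)
{
	assert(nodesize > 0);
	return (uint64_t)nodesize * MAX_TREE_LEVEL * 2 * num_items;
}

/*
 * Hypothetical helper: part of a reservation that is never turned into an
 * allocation. Returns 0 if more was allocated than reserved.
 */
static inline uint64_t unused_reserved_bytes(uint64_t reserved,
					     uint64_t allocated)
{
	return reserved > allocated ? reserved - allocated : 0;
}
```

Under these assumptions, a single item with a 16 KiB node size reserves 16384 * 8 * 2 = 262144 bytes. If only one extent buffer ends up being COWed, 245760 of those bytes are never allocated. The same subtraction applies to compressed data: a 128 KiB write that compresses to 32 KiB leaves 98304 reserved bytes unallocated.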

1 file changed: 31 additions, 0 deletions

fs/btrfs/block-group.c

Lines changed: 31 additions & 0 deletions
@@ -2729,6 +2729,37 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 		btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
 		list_del_init(&block_group->bg_list);
 		clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
+
+		/*
+		 * If the block group is still unused, add it to the list of
+		 * unused block groups. The block group may have been created in
+		 * order to satisfy a space reservation, in which case the
+		 * extent allocation only happens later. But often we don't
+		 * actually need to allocate space that we previously reserved,
+		 * so the block group may become unused for a long time. For
+		 * example for metadata we generally reserve space for a worst
+		 * possible scenario, but then don't end up allocating all that
+		 * space or none at all (due to no need to COW, extent buffers
+		 * were already COWed in the current transaction and still
+		 * unwritten, tree heights lower than the maximum possible
+		 * height, etc). For data we generally reserve the exact amount
+		 * of space we are going to allocate later, the exception is
+		 * when using compression, as we must reserve space based on the
+		 * uncompressed data size, because the compression is only done
+		 * when writeback is triggered and we don't know how much space
+		 * we are actually going to need, so we reserve the uncompressed
+		 * size because the data may be uncompressible in the worst case.
+		 */
+		if (ret == 0) {
+			bool used;
+
+			spin_lock(&block_group->lock);
+			used = btrfs_is_block_group_used(block_group);
+			spin_unlock(&block_group->lock);
+
+			if (!used)
+				btrfs_mark_bg_unused(block_group);
+		}
 	}
 	btrfs_trans_release_chunk_metadata(trans);
 }
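
The effect of the hunk above can be illustrated with a small userspace model. This is a simplified sketch, not kernel code: the `model_*` types and functions are hypothetical stand-ins, locking is omitted, and the cleaner is reduced to "delete whatever is on the list and still has no bytes used", ignoring the check against existing space reservations. The `mark_new_unused` flag switches between the behavior before this change (false) and after it (true).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a block group; not the kernel's structure. */
struct model_bg {
	uint64_t used;                /* bytes allocated from this block group */
	bool on_unused_list;          /* models membership in fs_info->unused_bgs */
	struct model_bg *next_unused; /* next entry in the modeled list */
};

struct model_fs {
	struct model_bg *unused_bgs;  /* head of the modeled unused list */
	bool mark_new_unused;         /* true models the behavior with this change */
};

/* Queue a block group for the cleaner, at most once. */
static inline void model_mark_unused(struct model_fs *fs, struct model_bg *bg)
{
	if (bg->on_unused_list)
		return;
	bg->on_unused_list = true;
	bg->next_unused = fs->unused_bgs;
	fs->unused_bgs = bg;
}

/* Models the end of block group creation (the spot patched by the diff). */
static inline void model_finish_creation(struct model_fs *fs,
					 struct model_bg *bg, int ret)
{
	assert(fs != NULL && bg != NULL);
	if (fs->mark_new_unused && ret == 0 && bg->used == 0)
		model_mark_unused(fs, bg);
}

/* Models freeing an extent: queues the group when its last extent goes. */
static inline void model_free_extent(struct model_fs *fs, struct model_bg *bg,
				     uint64_t len)
{
	assert(len <= bg->used);
	bg->used -= len;
	if (bg->used == 0)
		model_mark_unused(fs, bg);
}

/* Models one cleaner pass; returns how many block groups it would delete. */
static inline size_t model_cleaner_run(struct model_fs *fs)
{
	size_t deleted = 0;
	struct model_bg *bg = fs->unused_bgs;

	fs->unused_bgs = NULL;
	while (bg != NULL) {
		struct model_bg *next = bg->next_unused;

		bg->on_unused_list = false;
		bg->next_unused = NULL;
		if (bg->used == 0)
			deleted++;
		bg = next;
	}
	return deleted;
}
```

With `mark_new_unused` set to false, a block group that is created but never allocated from never reaches the list, so `model_cleaner_run()` has nothing to delete. With the flag set to true, `model_finish_creation()` queues it and the next cleaner pass removes it.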