Commit 485df75

fdmanana authored and kdave committed
btrfs: always pin deleted leaves when there are active tree mod log users
When freeing a tree block we may end up adding its extent back to the free space cache/tree, provided that there are no more references to it, it was created in the current transaction, and writeback for it never happened. This is generally fine; however, when we have tree mod log operations it can result in inconsistent versions of a btree after unwinding extent buffers with the recorded tree mod log operations. This is because:

* We only log operations for nodes (adding and removing key/pointers); for leaves we don't log anything;
* This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation for a node that points to a leaf that was deleted;
* Before we apply the logged operation to unwind a node, that leaf's extent can be allocated again, either as a node or as a leaf, and possibly for another btree. This is possible if the leaf was created in the current transaction and writeback for it never started, in which case btrfs_free_tree_block() returns its extent to the free space cache/tree;
* Then, before the tree mod log operation is applied, some task allocates the metadata extent that was just freed and uses it either as a leaf or as a node for some btree (the same btree or another one, it does not matter);
* After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation, the target node has an item pointing to that metadata extent, which now has content different from what it had before the leaf was deleted. It might now belong to a different btree and be a node rather than a leaf.

As a consequence, searches performed after the unwinding can return unpredictable results.

So make sure we pin extent buffers corresponding to leaves when there are tree mod log users.

CC: [email protected] # 4.14+
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>
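The failure sequence listed above can be replayed in a small user-space model. The sketch below is hypothetical and is not btrfs code: a fixed pool of metadata extents with first-fit allocation stands in for the free space cache/tree, all `toy_*` names are invented for illustration, and `toy_rewind_sees_leaf()` stands in for applying a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation. It assumes a freed extent keeps its old contents until it is reallocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical toy model, NOT btrfs code: a tiny pool of metadata
 * extents with first-fit allocation standing in for the free space
 * cache/tree. A freed extent keeps its last contents until reused.
 */
#define TOY_NR_EXTENTS 4

enum toy_kind { TOY_UNUSED, TOY_LEAF, TOY_NODE };

struct toy_extent {
	bool in_use;          /* currently referenced by some btree */
	bool pinned;          /* pinned: the allocator must skip it */
	enum toy_kind kind;   /* last contents written to the extent */
	int owner_tree;       /* btree that last used the extent */
};

static struct toy_extent toy_pool[TOY_NR_EXTENTS];

/* Hand out the lowest-numbered extent that is neither in use nor pinned. */
static int toy_alloc(enum toy_kind kind, int owner_tree)
{
	for (int i = 0; i < TOY_NR_EXTENTS; i++) {
		if (!toy_pool[i].in_use && !toy_pool[i].pinned) {
			toy_pool[i].in_use = true;
			toy_pool[i].kind = kind;
			toy_pool[i].owner_tree = owner_tree;
			return i;
		}
	}
	return -1;
}

/*
 * Free a block that has no more references, was created in the current
 * transaction and was never written. Without pinning, its extent becomes
 * reusable immediately; with pinning, it stays reserved.
 */
static void toy_free_tree_block(int idx, bool must_pin)
{
	toy_pool[idx].in_use = false;
	toy_pool[idx].pinned = must_pin;
}

/*
 * Stand-in for applying a logged MOD_LOG_KEY_REMOVE_WHILE_FREEING
 * operation: the rewound node points at extent `idx` again. Returns
 * true if that extent still holds the leaf of `tree` it used to hold.
 */
static bool toy_rewind_sees_leaf(int idx, int tree)
{
	return toy_pool[idx].kind == TOY_LEAF &&
	       toy_pool[idx].owner_tree == tree;
}

/* Replay the sequence from the commit message for one value of must_pin. */
static bool toy_scenario(bool must_pin)
{
	memset(toy_pool, 0, sizeof(toy_pool));

	/* A leaf of tree 1 is created in the current transaction. */
	int leaf = toy_alloc(TOY_LEAF, 1);
	assert(leaf >= 0);

	/* The leaf is deleted; only the parent node's pointer removal is logged. */
	toy_free_tree_block(leaf, must_pin);

	/* Another task allocates a node for tree 2 before the rewind. */
	int other = toy_alloc(TOY_NODE, 2);
	assert(other >= 0);

	/* Unwind the parent node and follow its restored pointer. */
	return toy_rewind_sees_leaf(leaf, 1);
}
```

With `must_pin == false` the freed leaf's extent is handed straight back out as a node of another tree, so the restored pointer no longer reaches the original leaf. With `must_pin == true` the allocator skips the extent, so the old leaf contents are still there when the node is rewound.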
1 parent dbcc7d5 commit 485df75

1 file changed: +22 -1 lines changed

fs/btrfs/extent-tree.c

Lines changed: 22 additions & 1 deletion
```diff
@@ -3323,6 +3323,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 	if (last_ref && btrfs_header_generation(buf) == trans->transid) {
 		struct btrfs_block_group *cache;
+		bool must_pin = false;
 
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 			ret = check_ref_cleanup(trans, buf->start);
@@ -3340,7 +3341,27 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
-		if (btrfs_is_zoned(fs_info)) {
+		/*
+		 * If this is a leaf and there are tree mod log users, we may
+		 * have recorded mod log operations that point to this leaf.
+		 * So we must make sure no one reuses this leaf's extent before
+		 * mod log operations are applied to a node, otherwise after
+		 * rewinding a node using the mod log operations we get an
+		 * inconsistent btree, as the leaf's extent may now be used as
+		 * a node or leaf for another different btree.
+		 * We are safe from races here because at this point no other
+		 * node or root points to this extent buffer, so if after this
+		 * check a new tree mod log user joins, it will not be able to
+		 * find a node pointing to this leaf and record operations that
+		 * point to this leaf.
+		 */
+		if (btrfs_header_level(buf) == 0) {
+			read_lock(&fs_info->tree_mod_log_lock);
+			must_pin = !list_empty(&fs_info->tree_mod_seq_list);
+			read_unlock(&fs_info->tree_mod_log_lock);
+		}
+
+		if (must_pin || btrfs_is_zoned(fs_info)) {
 			btrfs_redirty_list_add(trans->transaction, buf);
 			pin_down_extent(trans, cache, buf->start, buf->len, 1);
 			btrfs_put_block_group(cache);
```
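The decision added by the patch can be isolated as a predicate. The following is a hypothetical user-space sketch, not kernel code: `struct toy_fs_info` and the `toy_*` helpers are invented for illustration, and a C11 atomic counter of active tree mod log users replaces the kernel's `!list_empty(&fs_info->tree_mod_seq_list)` check taken under `read_lock(&fs_info->tree_mod_log_lock)`. It only models the branch shown in the diff, reached for a block whose last reference is dropped and which was created in the current transaction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the state the patch reads. The kernel
 * checks !list_empty(&fs_info->tree_mod_seq_list) under
 * read_lock(&fs_info->tree_mod_log_lock); here an atomic counter of
 * active tree mod log users stands in for that list.
 */
struct toy_fs_info {
	atomic_int nr_tree_mod_log_users;
	bool zoned;
};

static void toy_fs_init(struct toy_fs_info *fs, bool zoned)
{
	atomic_init(&fs->nr_tree_mod_log_users, 0);
	fs->zoned = zoned;
}

/* A tree mod log user registers itself (toy stand-in). */
static void toy_mod_log_join(struct toy_fs_info *fs)
{
	atomic_fetch_add(&fs->nr_tree_mod_log_users, 1);
}

/* A tree mod log user unregisters itself (toy stand-in). */
static void toy_mod_log_leave(struct toy_fs_info *fs)
{
	int prev = atomic_fetch_sub(&fs->nr_tree_mod_log_users, 1);

	assert(prev > 0); /* leaving without joining would be a bug */
}

/*
 * Decide whether a freed tree block must be pinned instead of being
 * returned to the free space cache right away. Only leaves (level 0)
 * need the tree mod log check; zoned filesystems always pin.
 */
static bool toy_should_pin(struct toy_fs_info *fs, int level)
{
	bool must_pin = false;

	if (level == 0)
		must_pin = atomic_load(&fs->nr_tree_mod_log_users) > 0;

	return must_pin || fs->zoned;
}
```

Only leaves are checked because, as the commit message explains, tree mod log operations are recorded for nodes but not for leaves. Per the comment in the diff, a user that joins after the check cannot find a node pointing to the leaf, so a momentarily stale answer does no harm.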
