Skip to content

Commit 97477d5

Browse files
fdmananakdave
authored andcommitted
btrfs: fix log tree replay failure due to file with 0 links and extents
If we log a new inode (not persisted in a past transaction) that has 0 links and extents, then log another inode with an higher inode number, we end up with failing to replay the log tree with -EINVAL. The steps for this are: 1) create new file A 2) write some data to file A 3) open an fd on file A 4) unlink file A 5) fsync file A using the previously open fd 6) create file B (has higher inode number than file A) 7) fsync file B 8) power fail before current transaction commits Now when attempting to mount the fs, the log replay will fail with -ENOENT at replay_one_extent() when attempting to replay the first extent of file A. The failure comes when trying to open the inode for file A in the subvolume tree, since it doesn't exist. Before commit 5f61b96 ("btrfs: fix inode lookup error handling during log replay"), the returned error was -EIO instead of -ENOENT, since we converted any errors when attempting to read an inode during log replay to -EIO. The reason for this is that the log replay procedure fails to ignore the current inode when we are at the stage LOG_WALK_REPLAY_ALL, our current inode has 0 links and last inode we processed in the previous stage has a non 0 link count. In other words, the issue is that at replay_one_extent() we only update wc->ignore_cur_inode if the current replay stage is LOG_WALK_REPLAY_INODES. Fix this by updating wc->ignore_cur_inode whenever we find an inode item regardless of the current replay stage. This is a simple solution and easy to backport, but later we can do other alternatives like avoid logging extents or inode items other than the inode item for inodes with a link count of 0. The problem with the wc->ignore_cur_inode logic has been around since commit f2d72f4 ("Btrfs: fix warning when replaying log after fsync of a tmpfile") but it only became frequent to hit since the more recent commit 5e85262 ("btrfs: fix fsync of files with no hard links not persisting deletion"), because we stopped skipping inodes with a link count of 0 when logging, while before the problem would only be triggered if trying to replay a log tree created with an older kernel which has a logged inode with 0 links. A test case for fstests will be submitted soon. Reported-by: Peter Jung <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Reported-by: burneddi <[email protected]> Link: https://lore.kernel.org/linux-btrfs/lh4W-Lwc0Mbk-QvBhhQyZxf6VbM3E8VtIvU3fPIQgweP_Q1n7wtlUZQc33sYlCKYd-o6rryJQfhHaNAOWWRKxpAXhM8NZPojzsJPyHMf2qY=@protonmail.com/#t Reported-by: Russell Haley <[email protected]> Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Fixes: f2d72f4 ("Btrfs: fix warning when replaying log after fsync of a tmpfile") Reviewed-by: Boris Burkov <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>
1 parent 21a57bc commit 97477d5

File tree

1 file changed

+29
-16
lines changed

1 file changed

+29
-16
lines changed

fs/btrfs/tree-log.c

Lines changed: 29 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -321,8 +321,7 @@ struct walk_control {
321321

322322
/*
323323
* Ignore any items from the inode currently being processed. Needs
324-
* to be set every time we find a BTRFS_INODE_ITEM_KEY and we are in
325-
* the LOG_WALK_REPLAY_INODES stage.
324+
* to be set every time we find a BTRFS_INODE_ITEM_KEY.
326325
*/
327326
bool ignore_cur_inode;
328327

@@ -2602,30 +2601,44 @@ static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb,
26022601

26032602
nritems = btrfs_header_nritems(eb);
26042603
for (i = 0; i < nritems; i++) {
2605-
btrfs_item_key_to_cpu(eb, &key, i);
2604+
struct btrfs_inode_item *inode_item;
26062605

2607-
/* inode keys are done during the first stage */
2608-
if (key.type == BTRFS_INODE_ITEM_KEY &&
2609-
wc->stage == LOG_WALK_REPLAY_INODES) {
2610-
struct btrfs_inode_item *inode_item;
2611-
u32 mode;
2606+
btrfs_item_key_to_cpu(eb, &key, i);
26122607

2613-
inode_item = btrfs_item_ptr(eb, i,
2614-
struct btrfs_inode_item);
2608+
if (key.type == BTRFS_INODE_ITEM_KEY) {
2609+
inode_item = btrfs_item_ptr(eb, i, struct btrfs_inode_item);
26152610
/*
2616-
* If we have a tmpfile (O_TMPFILE) that got fsync'ed
2617-
* and never got linked before the fsync, skip it, as
2618-
* replaying it is pointless since it would be deleted
2619-
* later. We skip logging tmpfiles, but it's always
2620-
* possible we are replaying a log created with a kernel
2621-
* that used to log tmpfiles.
2611+
* An inode with no links is either:
2612+
*
2613+
* 1) A tmpfile (O_TMPFILE) that got fsync'ed and never
2614+
* got linked before the fsync, skip it, as replaying
2615+
* it is pointless since it would be deleted later.
2616+
* We skip logging tmpfiles, but it's always possible
2617+
* we are replaying a log created with a kernel that
2618+
* used to log tmpfiles;
2619+
*
2620+
* 2) A non-tmpfile which got its last link deleted
2621+
* while holding an open fd on it and later got
2622+
* fsynced through that fd. We always log the
2623+
* parent inodes when inode->last_unlink_trans is
2624+
* set to the current transaction, so ignore all the
2625+
* inode items for this inode. We will delete the
2626+
* inode when processing the parent directory with
2627+
* replay_dir_deletes().
26222628
*/
26232629
if (btrfs_inode_nlink(eb, inode_item) == 0) {
26242630
wc->ignore_cur_inode = true;
26252631
continue;
26262632
} else {
26272633
wc->ignore_cur_inode = false;
26282634
}
2635+
}
2636+
2637+
/* Inode keys are done during the first stage. */
2638+
if (key.type == BTRFS_INODE_ITEM_KEY &&
2639+
wc->stage == LOG_WALK_REPLAY_INODES) {
2640+
u32 mode;
2641+
26292642
ret = replay_xattr_deletes(trans, root, log, path, key.objectid);
26302643
if (ret)
26312644
break;

0 commit comments

Comments
 (0)