Skip to content

Commit 7b5f775

Browse files
Dave Chinnercmaiolino
authored andcommitted
xfs: fix unmount hang with unflushable inodes stuck in the AIL
Unmount of a shutdown filesystem can hang with stale inode cluster buffers in the AIL like so: [95964.140623] Call Trace: [95964.144641] __schedule+0x699/0xb70 [95964.154003] schedule+0x64/0xd0 [95964.156851] xfs_ail_push_all_sync+0x9b/0xf0 [95964.164816] xfs_unmount_flush_inodes+0x41/0x70 [95964.168698] xfs_unmountfs+0x7f/0x170 [95964.171846] xfs_fs_put_super+0x3b/0x90 [95964.175216] generic_shutdown_super+0x77/0x160 [95964.178060] kill_block_super+0x1b/0x40 [95964.180553] xfs_kill_sb+0x12/0x30 [95964.182796] deactivate_locked_super+0x38/0x100 [95964.185735] deactivate_super+0x41/0x50 [95964.188245] cleanup_mnt+0x9f/0x160 [95964.190519] __cleanup_mnt+0x12/0x20 [95964.192899] task_work_run+0x89/0xb0 [95964.195221] resume_user_mode_work+0x4f/0x60 [95964.197931] syscall_exit_to_user_mode+0x76/0xb0 [95964.201003] do_syscall_64+0x74/0x130 $ pstree -N mnt |grep umount |-check-parallel---nsexec---run_test.sh---753---umount It always seems to be generic/753 that triggers this, and repeating a quick group test run triggers it every 10-15 iterations. Hence it generally triggers once up every 30-40 minutes of test time. just running generic/753 by itself or concurrently with a limited group of tests doesn't reproduce this issue at all. Tracing on a hung system shows the AIL repeating every 50ms a log force followed by an attempt to push pinned, aborted inodes from the AIL (trimmed for brevity): xfs_log_force: lsn 0x1c caller xfsaild+0x18e xfs_log_force: lsn 0x0 caller xlog_cil_flush+0xbd xfs_log_force: lsn 0x1c caller xfs_log_force+0x77 xfs_ail_pinned: lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED .... The inode log items are marked as aborted, which means that either: a) a transaction commit has occurred, seen an error or shutdown, and called xfs_trans_free_items() to abort the items. This should happen before any pinning of log items occurs. or b) a dirty transaction has been cancelled. This should also happen before any pinning of log items occurs. or c) AIL insertion at journal IO completion is marked as aborted. In this case, the log item is pinned by the CIL until journal IO completes and hence needs to be unpinned. This is then done after the ->iop_committed() callback is run, so the pin count should be balanced correctly. Yet none of these seemed to be occurring. Further tracing indicated this: d) Shutdown during CIL pushing resulting in log item completion being called from checkpoint abort processing. Items are unpinned and released without serialisation against each other, journal IO completion or transaction commit completion. In this case, we may still have a transaction commit in flight that holds a reference to a xfs_buf_log_item (BLI) after CIL insertion. e.g. a synchronous transaction will flush the CIL before the transaction is torn down. The concurrent CIL push then aborts insertion it and drops the commit/AIL reference to the BLI. This can leave the transaction commit context with the last reference to the BLI which is dropped here: xfs_trans_free_items() ->iop_release xfs_buf_item_release xfs_buf_item_put if (XFS_LI_ABORTED) xfs_trans_ail_delete xfs_buf_item_relse() Unlike the journal completion ->iop_unpin path, this path does not run stale buffer completion process when it drops the last reference, hence leaving the stale inodes attached to the buffer sitting the AIL. There are no other references to those inodes, so there is no other mechanism to remove them from the AIL. Hence unmount hangs. The buffer lock context for stale buffers is passed to the last BLI reference. This is normally the last BLI unpin on journal IO completion. The unpin then processes the stale buffer completion and releases the buffer lock. However, if the final unpin from journal IO completion (or CIL push abort) does not hold the last reference to the BLI, there -must- still be a transaction context that references the BLI, and so that context must perform the stale buffer completion processing before the buffer is unlocked and the BLI torn down. The fix for this is to rework the xfs_buf_item_relse() path to run stale buffer completion processing if it drops the last reference to the BLI. We still hold the buffer locked, so the buffer owner and lock context is the same as if we passed the BLI and buffer to the ->iop_unpin() context to finish stale process on journal commit. However, we have to be careful here. In a shutdown state, we can be freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse() and xfs_trans_bdetach(). The existing code handles this case by considering shutdown state as "aborted", but in doing so largely masks the failure to clean up stale BLI state from the xfs_buf_item_relse() path. i.e regardless of the shutdown state and whether the item is in the AIL, we must finish the stale buffer cleanup if we are are dropping the last BLI reference from the ->iop_relse path in transaction commit context. Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Carlos Maiolino <[email protected]> Signed-off-by: Carlos Maiolino <[email protected]>
1 parent 816c330 commit 7b5f775

File tree

2 files changed

+86
-37
lines changed

2 files changed

+86
-37
lines changed

fs/xfs/xfs_buf_item.c

Lines changed: 85 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -612,43 +612,42 @@ xfs_buf_item_push(
612612
* Drop the buffer log item refcount and take appropriate action. This helper
613613
* determines whether the bli must be freed or not, since a decrement to zero
614614
* does not necessarily mean the bli is unused.
615-
*
616-
* Return true if the bli is freed, false otherwise.
617615
*/
618-
bool
616+
void
619617
xfs_buf_item_put(
620618
struct xfs_buf_log_item *bip)
621619
{
622-
struct xfs_log_item *lip = &bip->bli_item;
623-
bool aborted;
624-
bool dirty;
620+
621+
ASSERT(xfs_buf_islocked(bip->bli_buf));
625622

626623
/* drop the bli ref and return if it wasn't the last one */
627624
if (!atomic_dec_and_test(&bip->bli_refcount))
628-
return false;
625+
return;
629626

630-
/*
631-
* We dropped the last ref and must free the item if clean or aborted.
632-
* If the bli is dirty and non-aborted, the buffer was clean in the
633-
* transaction but still awaiting writeback from previous changes. In
634-
* that case, the bli is freed on buffer writeback completion.
635-
*/
636-
aborted = test_bit(XFS_LI_ABORTED, &lip->li_flags) ||
637-
xlog_is_shutdown(lip->li_log);
638-
dirty = bip->bli_flags & XFS_BLI_DIRTY;
639-
if (dirty && !aborted)
640-
return false;
627+
/* If the BLI is in the AIL, then it is still dirty and in use */
628+
if (test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags)) {
629+
ASSERT(bip->bli_flags & XFS_BLI_DIRTY);
630+
return;
631+
}
641632

642633
/*
643-
* The bli is aborted or clean. An aborted item may be in the AIL
644-
* regardless of dirty state. For example, consider an aborted
645-
* transaction that invalidated a dirty bli and cleared the dirty
646-
* state.
634+
* In shutdown conditions, we can be asked to free a dirty BLI that
635+
* isn't in the AIL. This can occur due to a checkpoint aborting a BLI
636+
* instead of inserting it into the AIL at checkpoint IO completion. If
637+
* there's another bli reference (e.g. a btree cursor holds a clean
638+
* reference) and it is released via xfs_trans_brelse(), we can get here
639+
* with that aborted, dirty BLI. In this case, it is safe to free the
640+
* dirty BLI immediately, as it is not in the AIL and there are no
641+
* other references to it.
642+
*
643+
* We should never get here with a stale BLI via that path as
644+
* xfs_trans_brelse() specifically holds onto stale buffers rather than
645+
* releasing them.
647646
*/
648-
if (aborted)
649-
xfs_trans_ail_delete(lip, 0);
647+
ASSERT(!(bip->bli_flags & XFS_BLI_DIRTY) ||
648+
test_bit(XFS_LI_ABORTED, &bip->bli_item.li_flags));
649+
ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
650650
xfs_buf_item_relse(bip);
651-
return true;
652651
}
653652

654653
/*
@@ -669,25 +668,35 @@ xfs_buf_item_put(
669668
* if necessary but do not unlock the buffer. This is for support of
670669
* xfs_trans_bhold(). Make sure the XFS_BLI_HOLD field is cleared if we don't
671670
* free the item.
671+
*
672+
* If the XFS_BLI_STALE flag is set, the last reference to the BLI *must*
673+
* perform a completion abort of any objects attached to the buffer for IO
674+
* tracking purposes. This generally only happens in shutdown situations,
675+
* normally xfs_buf_item_unpin() will drop the last BLI reference and perform
676+
* completion processing. However, because transaction completion can race with
677+
* checkpoint completion during a shutdown, this release context may end up
678+
* being the last active reference to the BLI and so needs to perform this
679+
* cleanup.
672680
*/
673681
STATIC void
674682
xfs_buf_item_release(
675683
struct xfs_log_item *lip)
676684
{
677685
struct xfs_buf_log_item *bip = BUF_ITEM(lip);
678686
struct xfs_buf *bp = bip->bli_buf;
679-
bool released;
680687
bool hold = bip->bli_flags & XFS_BLI_HOLD;
681688
bool stale = bip->bli_flags & XFS_BLI_STALE;
682-
#if defined(DEBUG) || defined(XFS_WARN)
683-
bool ordered = bip->bli_flags & XFS_BLI_ORDERED;
684-
bool dirty = bip->bli_flags & XFS_BLI_DIRTY;
685689
bool aborted = test_bit(XFS_LI_ABORTED,
686690
&lip->li_flags);
691+
bool dirty = bip->bli_flags & XFS_BLI_DIRTY;
692+
#if defined(DEBUG) || defined(XFS_WARN)
693+
bool ordered = bip->bli_flags & XFS_BLI_ORDERED;
687694
#endif
688695

689696
trace_xfs_buf_item_release(bip);
690697

698+
ASSERT(xfs_buf_islocked(bp));
699+
691700
/*
692701
* The bli dirty state should match whether the blf has logged segments
693702
* except for ordered buffers, where only the bli should be dirty.
@@ -703,16 +712,56 @@ xfs_buf_item_release(
703712
bp->b_transp = NULL;
704713
bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED);
705714

715+
/* If there are other references, then we have nothing to do. */
716+
if (!atomic_dec_and_test(&bip->bli_refcount))
717+
goto out_release;
718+
719+
/*
720+
* Stale buffer completion frees the BLI, unlocks and releases the
721+
* buffer. Neither the BLI or buffer are safe to reference after this
722+
* call, so there's nothing more we need to do here.
723+
*
724+
* If we get here with a stale buffer and references to the BLI remain,
725+
* we must not unlock the buffer as the last BLI reference owns lock
726+
* context, not us.
727+
*/
728+
if (stale) {
729+
xfs_buf_item_finish_stale(bip);
730+
xfs_buf_relse(bp);
731+
ASSERT(!hold);
732+
return;
733+
}
734+
735+
/*
736+
* Dirty or clean, aborted items are done and need to be removed from
737+
* the AIL and released. This frees the BLI, but leaves the buffer
738+
* locked and referenced.
739+
*/
740+
if (aborted || xlog_is_shutdown(lip->li_log)) {
741+
ASSERT(list_empty(&bip->bli_buf->b_li_list));
742+
xfs_buf_item_done(bp);
743+
goto out_release;
744+
}
745+
746+
/*
747+
* Clean, unreferenced BLIs can be immediately freed, leaving the buffer
748+
* locked and referenced.
749+
*
750+
* Dirty, unreferenced BLIs *must* be in the AIL awaiting writeback.
751+
*/
752+
if (!dirty)
753+
xfs_buf_item_relse(bip);
754+
else
755+
ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
756+
757+
/* Not safe to reference the BLI from here */
758+
out_release:
706759
/*
707-
* Unref the item and unlock the buffer unless held or stale. Stale
708-
* buffers remain locked until final unpin unless the bli is freed by
709-
* the unref call. The latter implies shutdown because buffer
710-
* invalidation dirties the bli and transaction.
760+
* If we get here with a stale buffer, we must not unlock the
761+
* buffer as the last BLI reference owns lock context, not us.
711762
*/
712-
released = xfs_buf_item_put(bip);
713-
if (hold || (stale && !released))
763+
if (stale || hold)
714764
return;
715-
ASSERT(!stale || aborted);
716765
xfs_buf_relse(bp);
717766
}
718767

fs/xfs/xfs_buf_item.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -49,7 +49,7 @@ struct xfs_buf_log_item {
4949

5050
int xfs_buf_item_init(struct xfs_buf *, struct xfs_mount *);
5151
void xfs_buf_item_done(struct xfs_buf *bp);
52-
bool xfs_buf_item_put(struct xfs_buf_log_item *);
52+
void xfs_buf_item_put(struct xfs_buf_log_item *bip);
5353
void xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
5454
bool xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
5555
void xfs_buf_inode_iodone(struct xfs_buf *);

0 commit comments

Comments
 (0)