Skip to content

Commit 9d11001

Browse files
Dave ChinnerDarrick J. Wong
authored andcommitted
xfs: limit iclog tail updates
From the department of "generic/482 keeps on giving", we bring you another tail update race condition: iclog: S1 C1 +-----------------------+-----------------------+ S2 EOIC Two checkpoints in a single iclog. One is complete, the other just contains the start record and overruns into a new iclog. Timeline: Before S1: Cache flush, log tail = X At S1: Metadata stable, write start record and checkpoint At C1: Write commit record, set NEED_FUA Single iclog checkpoint, so no need for NEED_FLUSH Log tail still = X, so no need for NEED_FLUSH After C1, Before S2: Cache flush, log tail = X At S2: Metadata stable, write start record and checkpoint After S2: Log tail moves to X+1 At EOIC: End of iclog, more journal data to write Releases iclog Not a commit iclog, so no need for NEED_FLUSH Writes log tail X+1 into iclog. At this point, the iclog has tail X+1 and NEED_FUA set. There has been no cache flush for the metadata between X and X+1, and the iclog writes the new tail permanently to the log. THis is sufficient to violate on disk metadata/journal ordering. We have two options here. The first is to detect this case in some manner and ensure that the partial checkpoint write sets NEED_FLUSH when the iclog is already marked NEED_FUA and the log tail changes. This seems somewhat fragile and quite complex to get right, and it doesn't actually make it obvious what underlying problem it is actually addressing from reading the code. The second option seems much cleaner to me, because it is derived directly from the requirements of the C1 commit record in the iclog. That is, when we write this commit record to the iclog, we've guaranteed that the metadata/data ordering is correct for tail update purposes. Hence if we only write the log tail into the iclog for the *first* commit record rather than the log tail at the last release, we guarantee that the log tail does not move past where the the first commit record in the log expects it to be. IOWs, taking the first option means that replay of C1 becomes dependent on future operations doing the right thing, not just the C1 checkpoint itself doing the right thing. This makes log recovery almost impossible to reason about because now we have to take into account what might or might not have happened in the future when looking at checkpoints in the log rather than just having to reconstruct the past... Signed-off-by: Dave Chinner <[email protected]> Reviewed-by: Darrick J. Wong <[email protected]> Signed-off-by: Darrick J. Wong <[email protected]>
1 parent b2ae3a9 commit 9d11001

File tree

1 file changed

+37
-13
lines changed

1 file changed

+37
-13
lines changed

fs/xfs/xfs_log.c

Lines changed: 37 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -78,13 +78,12 @@ xlog_verify_iclog(
7878
STATIC void
7979
xlog_verify_tail_lsn(
8080
struct xlog *log,
81-
struct xlog_in_core *iclog,
82-
xfs_lsn_t tail_lsn);
81+
struct xlog_in_core *iclog);
8382
#else
8483
#define xlog_verify_dest_ptr(a,b)
8584
#define xlog_verify_grant_tail(a)
8685
#define xlog_verify_iclog(a,b,c)
87-
#define xlog_verify_tail_lsn(a,b,c)
86+
#define xlog_verify_tail_lsn(a,b)
8887
#endif
8988

9089
STATIC int
@@ -489,12 +488,31 @@ xfs_log_reserve(
489488

490489
/*
491490
* Flush iclog to disk if this is the last reference to the given iclog and the
492-
* it is in the WANT_SYNC state. If the caller passes in a non-zero
493-
* @old_tail_lsn and the current log tail does not match, there may be metadata
494-
* on disk that must be persisted before this iclog is written. To satisfy that
495-
* requirement, set the XLOG_ICL_NEED_FLUSH flag as a condition for writing this
496-
* iclog with the new log tail value.
491+
* it is in the WANT_SYNC state.
492+
*
493+
* If the caller passes in a non-zero @old_tail_lsn and the current log tail
494+
* does not match, there may be metadata on disk that must be persisted before
495+
* this iclog is written. To satisfy that requirement, set the
496+
* XLOG_ICL_NEED_FLUSH flag as a condition for writing this iclog with the new
497+
* log tail value.
498+
*
499+
* If XLOG_ICL_NEED_FUA is already set on the iclog, we need to ensure that the
500+
* log tail is updated correctly. NEED_FUA indicates that the iclog will be
501+
* written to stable storage, and implies that a commit record is contained
502+
* within the iclog. We need to ensure that the log tail does not move beyond
503+
* the tail that the first commit record in the iclog ordered against, otherwise
504+
* correct recovery of that checkpoint becomes dependent on future operations
505+
* performed on this iclog.
506+
*
507+
* Hence if NEED_FUA is set and the current iclog tail lsn is empty, write the
508+
* current tail into iclog. Once the iclog tail is set, future operations must
509+
* not modify it, otherwise they potentially violate ordering constraints for
510+
* the checkpoint commit that wrote the initial tail lsn value. The tail lsn in
511+
* the iclog will get zeroed on activation of the iclog after sync, so we
512+
* always capture the tail lsn on the iclog on the first NEED_FUA release
513+
* regardless of the number of active reference counts on this iclog.
497514
*/
515+
498516
int
499517
xlog_state_release_iclog(
500518
struct xlog *log,
@@ -519,6 +537,10 @@ xlog_state_release_iclog(
519537

520538
if (old_tail_lsn && tail_lsn != old_tail_lsn)
521539
iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
540+
541+
if ((iclog->ic_flags & XLOG_ICL_NEED_FUA) &&
542+
!iclog->ic_header.h_tail_lsn)
543+
iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
522544
}
523545

524546
if (!atomic_dec_and_test(&iclog->ic_refcnt))
@@ -530,8 +552,9 @@ xlog_state_release_iclog(
530552
}
531553

532554
iclog->ic_state = XLOG_STATE_SYNCING;
533-
iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
534-
xlog_verify_tail_lsn(log, iclog, tail_lsn);
555+
if (!iclog->ic_header.h_tail_lsn)
556+
iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
557+
xlog_verify_tail_lsn(log, iclog);
535558
trace_xlog_iclog_syncing(iclog, _RET_IP_);
536559

537560
spin_unlock(&log->l_icloglock);
@@ -2579,6 +2602,7 @@ xlog_state_activate_iclog(
25792602
memset(iclog->ic_header.h_cycle_data, 0,
25802603
sizeof(iclog->ic_header.h_cycle_data));
25812604
iclog->ic_header.h_lsn = 0;
2605+
iclog->ic_header.h_tail_lsn = 0;
25822606
}
25832607

25842608
/*
@@ -3614,10 +3638,10 @@ xlog_verify_grant_tail(
36143638
STATIC void
36153639
xlog_verify_tail_lsn(
36163640
struct xlog *log,
3617-
struct xlog_in_core *iclog,
3618-
xfs_lsn_t tail_lsn)
3641+
struct xlog_in_core *iclog)
36193642
{
3620-
int blocks;
3643+
xfs_lsn_t tail_lsn = be64_to_cpu(iclog->ic_header.h_tail_lsn);
3644+
int blocks;
36213645

36223646
if (CYCLE_LSN(tail_lsn) == log->l_prev_cycle) {
36233647
blocks =

0 commit comments

Comments
 (0)