Commit ecd49f7

Author: Darrick J. Wong (committed)
xfs: fix per-cpu CIL structure aggregation racing with dying cpus
In commit 7c8ade2 ("xfs: implement percpu cil space used calculation"), the XFS committed (log) item list code was converted to use per-cpu lists and space tracking to reduce cpu contention when multiple threads are modifying different parts of the filesystem and hence end up contending on the log structures during transaction commit. Each CPU tracks its own commit items and space usage, and these do not have to be merged into the main CIL until either someone wants to push the CIL items, or we run over a soft threshold and switch to slower (but more accurate) accounting with atomics.

Unfortunately, the for_each_cpu iteration suffers from the same race with cpu dying problem that was identified in commit 8b57b11 ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a result, both CIL percpu structure aggregation functions fail to collect the items and accounted space usage at the correct point in time.

If we're lucky, the items that are collected from the online cpus exceed the space given to those cpus, and the log immediately shuts down in xlog_cil_insert_items due to the (apparent) log reservation overrun. This happens periodically with generic/650, which exercises cpu hotplug vs. the filesystem code:

  smpboot: CPU 3 is now offline
  XFS (sda3): ctx ticket reservation ran out. Need to up reservation
  XFS (sda3): ticket reservation summary:
  XFS (sda3):   unit res    = 9268 bytes
  XFS (sda3):   current res = -40 bytes
  XFS (sda3):   original count  = 1
  XFS (sda3):   remaining count = 1
  XFS (sda3): Filesystem has been shut down due to log error (0x2).

Applying the same sort of fix from 8b57b11 to the CIL code seems to make the generic/650 problem go away, but I've been told that tglx was not happy when he saw:

"...the only thing we actually need to care about is that percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when there are no CPUs dying, it has no addition overhead except for a cpumask_or() operation."

The CPU hotplug code is rather complex and difficult to understand and I don't want to try to understand the cpu hotplug locking well enough to use the cpu_dying mask.

Furthermore, there's a performance improvement that could be had here. Attach a private cpu mask to the CIL structure so that we can track exactly which cpus have accessed the percpu data at all. It doesn't matter if the cpu has since gone offline; log item aggregation will still find the items. Better yet, we skip cpus that have not recently logged anything.

Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU hot remove racing with an xfs mount can crash if the cpu_dead notifier tries to access the log but the mount hasn't yet set up the log.

Link: https://lore.kernel.org/linux-xfs/[email protected]/T/
Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://lore.kernel.org/linux-xfs/[email protected]/T/
Cc: [email protected]
Cc: [email protected]
Reported-by: [email protected]
Reported-by: [email protected]
Fixes: af1c214 ("xfs: introduce per-cpu CIL tracking structure")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
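The idea of the fix can be illustrated outside the kernel. The following is a minimal userspace sketch (all names such as `sim_insert` and `struct cil_sim` are hypothetical, and a plain `uint32_t` bit mask stands in for the kernel's `cpumask`): each CPU that touches its per-cpu bucket also sets its bit in a private mask, and aggregation walks that mask instead of the online-CPU mask, so a CPU that goes offline after logging work is still visited.

```c
/*
 * Userspace sketch of the commit's approach: track which "CPUs" ever
 * touched their per-cpu bucket in a private mask, and aggregate by
 * walking that mask rather than the set of currently-online CPUs.
 * Hypothetical names; not kernel code.
 */
#include <stdint.h>

#define NCPU 8

struct pcp_bucket {
	int space_used;                /* per-cpu accounted space */
};

struct cil_sim {
	struct pcp_bucket pcp[NCPU];   /* stand-in for per-cpu data */
	uint32_t pcpmask;              /* stand-in for ctx->cil_pcpmask */
};

/* A thread running on 'cpu' accounts space and marks its bucket dirty. */
void sim_insert(struct cil_sim *cil, int cpu, int bytes)
{
	cil->pcp[cpu].space_used += bytes;
	cil->pcpmask |= 1u << cpu;     /* like cpumask_test_and_set_cpu() */
}

/* Push-time aggregation: visit only CPUs that logged something. */
int sim_aggregate(struct cil_sim *cil)
{
	int total = 0;

	for (int cpu = 0; cpu < NCPU; cpu++) {
		if (!(cil->pcpmask & (1u << cpu)))
			continue;      /* like for_each_cpu(cpu, mask) */
		total += cil->pcp[cpu].space_used;
		cil->pcp[cpu].space_used = 0;
	}
	cil->pcpmask = 0;
	return total;
}
```

Note that whether a CPU is online at aggregation time never enters into `sim_aggregate`, which is exactly why the dying-cpu race disappears; it also skips buckets that were never dirtied, which is the performance win the commit message mentions.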
1 parent cfa2df6 commit ecd49f7

3 files changed: 22 additions, 45 deletions

fs/xfs/xfs_log_cil.c (16 additions, 36 deletions)

```diff
@@ -124,7 +124,7 @@ xlog_cil_push_pcp_aggregate(
 	struct xlog_cil_pcp	*cilpcp;
 	int			cpu;
 
-	for_each_online_cpu(cpu) {
+	for_each_cpu(cpu, &ctx->cil_pcpmask) {
 		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
 
 		ctx->ticket->t_curr_res += cilpcp->space_reserved;
@@ -165,7 +165,13 @@ xlog_cil_insert_pcp_aggregate(
 	if (!test_and_clear_bit(XLOG_CIL_PCP_SPACE, &cil->xc_flags))
 		return;
 
-	for_each_online_cpu(cpu) {
+	/*
+	 * We can race with other cpus setting cil_pcpmask.  However, we've
+	 * atomically cleared PCP_SPACE which forces other threads to add to
+	 * the global space used count.  cil_pcpmask is a superset of cilpcp
+	 * structures that could have a nonzero space_used.
+	 */
+	for_each_cpu(cpu, &ctx->cil_pcpmask) {
 		int	old, prev;
 
 		cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
@@ -554,6 +560,7 @@ xlog_cil_insert_items(
 	int			iovhdr_res = 0, split_res = 0, ctx_res = 0;
 	int			space_used;
 	int			order;
+	unsigned int		cpu_nr;
 	struct xlog_cil_pcp	*cilpcp;
 
 	ASSERT(tp);
@@ -577,7 +584,12 @@ xlog_cil_insert_items(
 	 * can't be scheduled away between split sample/update operations that
 	 * are done without outside locking to serialise them.
 	 */
-	cilpcp = get_cpu_ptr(cil->xc_pcp);
+	cpu_nr = get_cpu();
+	cilpcp = this_cpu_ptr(cil->xc_pcp);
+
+	/* Tell the future push that there was work added by this CPU. */
+	if (!cpumask_test_cpu(cpu_nr, &ctx->cil_pcpmask))
+		cpumask_test_and_set_cpu(cpu_nr, &ctx->cil_pcpmask);
 
 	/*
 	 * We need to take the CIL checkpoint unit reservation on the first
@@ -663,7 +675,7 @@ xlog_cil_insert_items(
 			continue;
 		list_add_tail(&lip->li_cil, &cilpcp->log_items);
 	}
-	put_cpu_ptr(cilpcp);
+	put_cpu();
 
 	/*
 	 * If we've overrun the reservation, dump the tx details before we move
@@ -1790,38 +1802,6 @@ xlog_cil_force_seq(
 	return 0;
 }
 
-/*
- * Move dead percpu state to the relevant CIL context structures.
- *
- * We have to lock the CIL context here to ensure that nothing is modifying
- * the percpu state, either addition or removal. Both of these are done under
- * the CIL context lock, so grabbing that exclusively here will ensure we can
- * safely drain the cilpcp for the CPU that is dying.
- */
-void
-xlog_cil_pcp_dead(
-	struct xlog		*log,
-	unsigned int		cpu)
-{
-	struct xfs_cil		*cil = log->l_cilp;
-	struct xlog_cil_pcp	*cilpcp = per_cpu_ptr(cil->xc_pcp, cpu);
-	struct xfs_cil_ctx	*ctx;
-
-	down_write(&cil->xc_ctx_lock);
-	ctx = cil->xc_ctx;
-	if (ctx->ticket)
-		ctx->ticket->t_curr_res += cilpcp->space_reserved;
-	cilpcp->space_reserved = 0;
-
-	if (!list_empty(&cilpcp->log_items))
-		list_splice_init(&cilpcp->log_items, &ctx->log_items);
-	if (!list_empty(&cilpcp->busy_extents))
-		list_splice_init(&cilpcp->busy_extents, &ctx->busy_extents);
-	atomic_add(cilpcp->space_used, &ctx->space_used);
-	cilpcp->space_used = 0;
-	up_write(&cil->xc_ctx_lock);
-}
-
 /*
  * Perform initial CIL structure initialisation.
  */
```
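One detail in the xlog_cil_insert_items hunk is worth noting: the mask bit is set with a plain test first (`if (!cpumask_test_cpu(...)) cpumask_test_and_set_cpu(...)`) rather than an unconditional atomic set, so the atomic read-modify-write only happens the first time a given CPU logs into this context. The following is a small userspace analogue of that pattern using C11 atomics (`mark_cpu_dirty` and the single-word `mask` are hypothetical stand-ins, not the kernel API):

```c
/*
 * Analogue of "test, then test-and-set" on a shared mask: the cheap
 * plain load skips the atomic RMW (and the exclusive cacheline grab it
 * implies) in the common case where the bit is already set.
 */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint mask;	/* stand-in for ctx->cil_pcpmask */

/* Returns true iff this call transitioned the cpu's bit from 0 to 1. */
bool mark_cpu_dirty(int cpu)
{
	unsigned int bit = 1u << cpu;

	/* Fast path: shared read only, no store, no cacheline ping-pong. */
	if (atomic_load_explicit(&mask, memory_order_relaxed) & bit)
		return false;

	/* Slow path: atomic set; benign if another thread raced us here. */
	return !(atomic_fetch_or(&mask, bit) & bit);
}
```

Since every insert after the first on a given CPU takes only the fast path, the mask costs almost nothing in steady state, which is what makes it acceptable on the transaction commit hot path.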

fs/xfs/xfs_log_priv.h (6 additions, 8 deletions)

```diff
@@ -231,6 +231,12 @@ struct xfs_cil_ctx {
 	struct work_struct	discard_endio_work;
 	struct work_struct	push_work;
 	atomic_t		order_id;
+
+	/*
+	 * CPUs that could have added items to the percpu CIL data.  Access is
+	 * coordinated with xc_ctx_lock.
+	 */
+	struct cpumask		cil_pcpmask;
 };
 
 /*
@@ -278,9 +284,6 @@ struct xfs_cil {
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 
 	void __percpu		*xc_pcp;	/* percpu CIL structures */
-#ifdef CONFIG_HOTPLUG_CPU
-	struct list_head	xc_pcp_list;
-#endif
 } ____cacheline_aligned_in_smp;
 
 /* xc_flags bit values */
@@ -705,9 +708,4 @@ xlog_kvmalloc(
 	return p;
 }
 
-/*
- * CIL CPU dead notifier
- */
-void xlog_cil_pcp_dead(struct xlog *log, unsigned int cpu);
-
 #endif	/* __XFS_LOG_PRIV_H__ */
```

fs/xfs/xfs_super.c (0 additions, 1 deletion)

```diff
@@ -2337,7 +2337,6 @@ xfs_cpu_dead(
 	list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) {
 		spin_unlock(&xfs_mount_list_lock);
 		xfs_inodegc_cpu_dead(mp, cpu);
-		xlog_cil_pcp_dead(mp->m_log, cpu);
 		spin_lock(&xfs_mount_list_lock);
 	}
 	spin_unlock(&xfs_mount_list_lock);
```
