Commit c6e8d51

Author: Kent Overstreet
bcachefs: Work around deadlock to btree node rewrites in journal replay
Don't mark btree nodes for rewrite if they are or would be degraded and journal replay hasn't finished, to avoid a deadlock.

This is because btree node rewrites generate more updates for the interior updates (alloc, backpointers), and if those updates touch new nodes and generate more rewrites, we can only have so many interior btree updates in flight before we deadlock on open_buckets.

The biggest cause is that we don't use the btree write buffer for the backpointer updates; this needs some real thought on locking in order to fix.

The problem with this workaround (not doing the rewrite for degraded nodes in journal replay) is that those degraded nodes persist, and we don't want that. This is a real bug when a btree node write completes with fewer replicas than we wanted and leaves a degraded node due to device _removal_, i.e. the device went away mid write.

It's less of a bug here, but still a problem because we don't yet have a way of tracking degraded data: we need another index (all extents/btree nodes, by replicas entry) in order to fix this properly (re-replicate degraded data at the earliest possible time).

Signed-off-by: Kent Overstreet <[email protected]>
1 parent fbf913c commit c6e8d51
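
To make the cascade described in the commit message concrete, here is a toy model, not bcachefs code. It assumes that every in-flight interior update holds one slot from a fixed pool (standing in for open_buckets) until the rewrite it triggered has finished. The names `toy_pool` and `toy_run_update` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a bounded resource such as open_buckets. */
struct toy_pool {
	int free;	/* slots not currently held by an in-flight update */
};

/*
 * Run one update that triggers a chain of `chained_rewrites` further
 * updates. Each update keeps its slot until the update it triggered has
 * finished, so a chain of N rewrites needs N + 1 slots at the same time.
 *
 * Returns false where the real system would block forever: every slot is
 * held by an update that is waiting on the one that cannot start.
 */
static bool toy_run_update(struct toy_pool *pool, int chained_rewrites)
{
	assert(pool->free >= 0 && chained_rewrites >= 0);

	if (pool->free == 0)
		return false;

	pool->free--;
	bool ok = chained_rewrites == 0 ||
		  toy_run_update(pool, chained_rewrites - 1);
	pool->free++;	/* the toy unwinds; a real deadlock would not */

	return ok;
}
```

In this model a pool of four slots completes a chain of three rewrites but not a chain of four, which is the shape of the problem the workaround avoids by not starting such chains during journal replay.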

1 file changed: +35 -8 lines changed
fs/bcachefs/btree_io.c

Lines changed: 35 additions & 8 deletions
@@ -1337,15 +1337,42 @@ int bch2_btree_node_read_done(struct bch_fs *c, struct bch_dev *ca,

 	btree_node_reset_sib_u64s(b);

-	scoped_guard(rcu)
-		bkey_for_each_ptr(bch2_bkey_ptrs(bkey_i_to_s(&b->key)), ptr) {
-			struct bch_dev *ca2 = bch2_dev_rcu(c, ptr->dev);
-
-			if (!ca2 || ca2->mi.state != BCH_MEMBER_STATE_rw) {
-				set_btree_node_need_rewrite(b);
-				set_btree_node_need_rewrite_degraded(b);
+	/*
+	 * XXX:
+	 *
+	 * We deadlock if too many btree updates require node rewrites while
+	 * we're still in journal replay.
+	 *
+	 * This is because btree node rewrites generate more updates for the
+	 * interior updates (alloc, backpointers), and if those updates touch
+	 * new nodes and generate more rewrites - well, you see the problem.
+	 *
+	 * The biggest cause is that we don't use the btree write buffer (for
+	 * the backpointer updates - this needs some real thought on locking in
+	 * order to fix.
+	 *
+	 * The problem with this workaround (not doing the rewrite for degraded
+	 * nodes in journal replay) is that those degraded nodes persist, and we
+	 * don't want that (this is a real bug when a btree node write completes
+	 * with fewer replicas than we wanted and leaves a degraded node due to
+	 * device _removal_, i.e. the device went away mid write).
+	 *
+	 * It's less of a bug here, but still a problem because we don't yet
+	 * have a way of tracking degraded data - we another index (all
+	 * extents/btree nodes, by replicas entry) in order to fix properly
+	 * (re-replicate degraded data at the earliest possible time).
+	 */
+	if (c->recovery.passes_complete & BIT_ULL(BCH_RECOVERY_PASS_journal_replay)) {
+		scoped_guard(rcu)
+			bkey_for_each_ptr(bch2_bkey_ptrs(bkey_i_to_s(&b->key)), ptr) {
+				struct bch_dev *ca2 = bch2_dev_rcu(c, ptr->dev);
+
+				if (!ca2 || ca2->mi.state != BCH_MEMBER_STATE_rw) {
+					set_btree_node_need_rewrite(b);
+					set_btree_node_need_rewrite_degraded(b);
+				}
 			}
-		}
+	}

 	if (!ptr_written) {
 		set_btree_node_need_rewrite(b);
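
The gating pattern in the hunk above can be read in isolation as a userspace sketch: test one bit in a 64-bit "passes complete" mask, and only then walk the node's device pointers and flag the node if any device is missing or not read-write. Every `toy_`/`TOY_` name below is hypothetical, including the pass index and the member states; the real code uses `c->recovery.passes_complete`, `bkey_for_each_ptr()` and RCU-protected device lookups, none of which are modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BIT_ULL(nr)	(1ULL << (nr))

/* Hypothetical pass index and device states, for illustration only. */
enum toy_pass		{ TOY_PASS_journal_replay = 3 };
enum toy_member_state	{ TOY_MEMBER_STATE_rw, TOY_MEMBER_STATE_ro };

struct toy_dev {
	enum toy_member_state state;
};

struct toy_node {
	const struct toy_dev *devs[4];	/* NULL models a device that is gone */
	size_t nr_devs;
	bool need_rewrite;
	bool need_rewrite_degraded;
};

/* True once the given recovery pass has its bit set in the mask. */
static bool toy_pass_complete(uint64_t passes_complete, unsigned pass)
{
	assert(pass < 64);
	return (passes_complete & BIT_ULL(pass)) != 0;
}

/*
 * Flag a node for rewrite if any replica lives on a missing or non-rw
 * device, but only after journal replay has completed.
 */
static void toy_mark_degraded(struct toy_node *b, uint64_t passes_complete)
{
	if (!toy_pass_complete(passes_complete, TOY_PASS_journal_replay))
		return;

	for (size_t i = 0; i < b->nr_devs; i++) {
		const struct toy_dev *ca = b->devs[i];

		if (!ca || ca->state != TOY_MEMBER_STATE_rw) {
			b->need_rewrite = true;
			b->need_rewrite_degraded = true;
		}
	}
}
```

As the commit message notes, the cost of gating this way is that a node read during journal replay stays degraded: in this sketch nothing re-checks the node once the bit is set later.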
