
Commit 208c155

zoumingzhe authored and axboe committed
bcache: reserve more RESERVE_BTREE buckets to prevent allocator hang
Reported an IO hang and unrecoverable error in our testing environment.

After careful research, we found that bch_allocator_thread is stuck, the call stack is as follows:

[<0>] __switch_to+0xbc/0x108
[<0>] __closure_sync+0x7c/0xbc [bcache]
[<0>] bch_prio_write+0x430/0x448 [bcache]
[<0>] bch_allocator_thread+0xb44/0xb70 [bcache]
[<0>] kthread+0x124/0x130
[<0>] ret_from_fork+0x10/0x18

Moreover, the RESERVE_BTREE type bucket slot are empty and journal_full occurs at the same time.

When the cache disk is first used, the sb.nJournal_buckets defaults to 0. So, only 8 RESERVE_BTREE type buckets are reserved. If RESERVE_BTREE type buckets used up or btree_check_reserve() failed when request handle btree split, the request will be repeatedly retried and wait for alloc thread to fill in.

After the alloc thread fills the buckets, it will call bch_prio_write(). If journal_full occurs simultaneously at this time, journal_reclaim() and btree_flush_write() will be called sequentially, journal_write cannot be completed.

This is a low probability event, we believe that reserve more RESERVE_BTREE buckets can avoid the worst situation.

Fixes: 682811b ("bcache: fix for allocator and register thread race")
Signed-off-by: Mingzhe Zou <[email protected]>
Signed-off-by: Coly Li <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
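The hang described above can be read as a circular wait. The following is a toy model only, not kernel code: the actor names, has_wait_cycle() and would_hang() are hypothetical. The edge from the journal back to the request is an interpretation of the journal_ref description in the code comment added by this commit, where the counter is only dropped after bch_btree_insert completes.

```c
#include <stdbool.h>

/* Actors in the toy model; the names are illustrative, not kernel symbols. */
enum actor { WRITER, ALLOCATOR, JOURNAL, NR_ACTORS };

/* waits_on[a] is the actor that a is blocked on, or -1 if a can make progress. */
static bool has_wait_cycle(const int waits_on[NR_ACTORS])
{
	for (int start = 0; start < NR_ACTORS; start++) {
		int cur = start;

		/* Follow the single outgoing edge at most NR_ACTORS times. */
		for (int hops = 0; hops < NR_ACTORS && cur >= 0; hops++) {
			cur = waits_on[cur];
			if (cur == start)
				return true;
		}
	}
	return false;
}

static bool would_hang(bool reserve_empty, bool journal_full)
{
	int waits_on[NR_ACTORS];

	/* A btree split with no reserved bucket retries until the allocator refills. */
	waits_on[WRITER] = reserve_empty ? ALLOCATOR : -1;
	/* After refilling, the allocator waits in bch_prio_write() for a journal write. */
	waits_on[ALLOCATOR] = JOURNAL;
	/* Assumption: a full journal cannot make room while the writer's insert is pending. */
	waits_on[JOURNAL] = journal_full ? WRITER : -1;

	return has_wait_cycle(waits_on);
}
```

In this model would_hang() is true only when both conditions hold, which matches the "low probability event" wording: either condition alone still leaves some actor able to make progress.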
1 parent 5a08e49 commit 208c155

1 file changed: +40 -8 lines changed

drivers/md/bcache/super.c

Lines changed: 40 additions & 8 deletions
@@ -2237,15 +2237,47 @@ static int cache_alloc(struct cache *ca)
 	bio_init(&ca->journal.bio, NULL, ca->journal.bio.bi_inline_vecs, 8, 0);
 
 	/*
-	 * when ca->sb.njournal_buckets is not zero, journal exists,
-	 * and in bch_journal_replay(), tree node may split,
-	 * so bucket of RESERVE_BTREE type is needed,
-	 * the worst situation is all journal buckets are valid journal,
-	 * and all the keys need to replay,
-	 * so the number of RESERVE_BTREE type buckets should be as much
-	 * as journal buckets
+	 * When the cache disk is first registered, ca->sb.njournal_buckets
+	 * is zero, and it is assigned in run_cache_set().
+	 *
+	 * When ca->sb.njournal_buckets is not zero, journal exists,
+	 * and in bch_journal_replay(), tree node may split.
+	 * The worst situation is all journal buckets are valid journal,
+	 * and all the keys need to replay, so the number of RESERVE_BTREE
+	 * type buckets should be as much as journal buckets.
+	 *
+	 * If the number of RESERVE_BTREE type buckets is too few, the
+	 * bch_allocator_thread() may hang up and unable to allocate
+	 * bucket. The situation is roughly as follows:
+	 *
+	 * 1. In bch_data_insert_keys(), if the operation is not op->replace,
+	 *    it will call the bch_journal(), which increments the journal_ref
+	 *    counter. This counter is only decremented after bch_btree_insert
+	 *    completes.
+	 *
+	 * 2. When calling bch_btree_insert, if the btree needs to split,
+	 *    it will call btree_split() and btree_check_reserve() to check
+	 *    whether there are enough reserved buckets in the RESERVE_BTREE
+	 *    slot. If not enough, bcache_btree_root() will repeatedly retry.
+	 *
+	 * 3. Normally, the bch_allocator_thread is responsible for filling
+	 *    the reservation slots from the free_inc bucket list. When the
+	 *    free_inc bucket list is exhausted, the bch_allocator_thread
+	 *    will call invalidate_buckets() until free_inc is refilled.
+	 *    Then bch_allocator_thread calls bch_prio_write() once. and
+	 *    bch_prio_write() will call bch_journal_meta() and waits for
+	 *    the journal write to complete.
+	 *
+	 * 4. During journal_write, journal_write_unlocked() is be called.
+	 *    If journal full occurs, journal_reclaim() and btree_flush_write()
+	 *    will be called sequentially, then retry journal_write.
+	 *
+	 * 5. When 2 and 4 occur together, IO will hung up and cannot recover.
+	 *
+	 * Therefore, reserve more RESERVE_BTREE type buckets.
 	 */
-	btree_buckets = ca->sb.njournal_buckets ?: 8;
+	btree_buckets = clamp_t(size_t, ca->sb.nbuckets >> 7,
+				32, SB_JOURNAL_BUCKETS);
 	free = roundup_pow_of_two(ca->sb.nbuckets) >> 10;
 	if (!free) {
 		ret = -EPERM;
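To make the sizing change concrete, here is a minimal user-space sketch of the old and new reserve calculations. It is not the kernel code: btree_reserve_old() and btree_reserve_new() are hypothetical helper names, the clamp is written out by hand rather than using the kernel's clamp_t(), and the value 256 for SB_JOURNAL_BUCKETS is an assumption taken from the bcache on-disk format header.

```c
#include <stddef.h>

/* Assumption: mirrors SB_JOURNAL_BUCKETS in the bcache on-disk format header. */
#define SB_JOURNAL_BUCKETS 256U

/* Old sizing: the journal bucket count, or 8 when the superblock reports none. */
static size_t btree_reserve_old(size_t njournal_buckets)
{
	return njournal_buckets ? njournal_buckets : 8;
}

/* New sizing: nbuckets / 128, clamped to the range [32, SB_JOURNAL_BUCKETS]. */
static size_t btree_reserve_new(size_t nbuckets)
{
	size_t want = nbuckets >> 7;

	if (want < 32)
		return 32;
	if (want > SB_JOURNAL_BUCKETS)
		return SB_JOURNAL_BUCKETS;
	return want;
}
```

Under these assumptions a freshly registered cache (njournal_buckets == 0) used to get 8 reserved buckets regardless of size, while the new formula gives 32 for a 1024-bucket device, 128 for 16384 buckets, and 256 once the device reaches 32768 buckets or more.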
