dnode_next_offset with txg > 0 can be used by dmu_object_next to iterate over all objects in a dataset that have been created or modified after a certain txg. At the moment, dmu_object_next is the only function that passes a non-zero txg to dnode_next_offset.
The purpose of dnode_next_offset is to find a block / offset at a given block level that matches certain search criteria.
One criterion can be the block's fill factor (how much of the block is in use) and another can be the block's birth txg.
In the case of dmu_object_next, the search is done on the meta-dnode, at level 0 (offsets of dnodes in the meta-dnode's array), and the search criterion is the txg.
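For context, this is roughly how dmu_object_next maps an object number to a byte offset in the meta-dnode and hands the txg down to dnode_next_offset. This is a simplified sketch from memory, not a verbatim copy of the source; newer ZFS versions also handle multi-slot dnodes here:

```c
/*
 * Simplified sketch of dmu_object_next() (not verbatim; details such as
 * multi-slot dnode handling differ between ZFS versions).
 */
int
dmu_object_next(objset_t *os, uint64_t *objectp, boolean_t hole, uint64_t txg)
{
	/* Object numbers map to byte offsets in the meta-dnode's array. */
	uint64_t offset = (*objectp + 1) << DNODE_SHIFT;
	int error;

	/* Level 0 search over the meta-dnode, filtered by birth txg. */
	error = dnode_next_offset(DMU_META_DNODE(os),
	    (hole ? DNODE_FIND_HOLE : 0), &offset, 0, DNODES_PER_BLOCK, txg);

	*objectp = offset >> DNODE_SHIFT;
	return (error);
}
```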
So, given the current offset, dnode_next_offset checks the corresponding level-zero block for a match against the search criteria.
If there is no match at level zero, the algorithm checks level 1, level 2, and so on up to the highest level. Once a matching level-n block is found, the assumption is that there must be a matching level-(n-1) block under it, and so on down to level zero.
So, essentially, the algorithm first searches upwards and then downwards.
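To make the up-then-down structure concrete, here is a rough sketch of the control flow. It is not the actual dnode_next_offset code; search_level() is a hypothetical stand-in for the per-level scan (dnode_next_offset_level() in the real source):

```c
/*
 * Rough sketch of the two-phase search in dnode_next_offset(); not the
 * actual OpenZFS code.  search_level() is a hypothetical stand-in for the
 * per-level scan: it looks for a block at level `lvl` matching the search
 * criteria (here, birth txg > txg), advances *offset to it, and returns
 * ESRCH if nothing at that level matches.
 */
static int search_level(dnode_t *dn, int lvl, uint64_t *offset, uint64_t txg);

static int
next_offset_sketch(dnode_t *dn, uint64_t *offset, int minlvl, uint64_t txg)
{
	int maxlvl = dn->dn_phys->dn_nlevels;
	int lvl;
	int error = ESRCH;

	/* Phase 1: climb up until some level has a matching block. */
	for (lvl = minlvl; lvl < maxlvl; lvl++) {
		error = search_level(dn, lvl, offset, txg);
		if (error != ESRCH)
			break;
	}

	/*
	 * Phase 2: walk back down, assuming that a match at level n
	 * guarantees a match at level n-1 underneath it.
	 */
	while (error == 0 && lvl > minlvl) {
		lvl--;
		error = search_level(dn, lvl, offset, txg);
	}

	/* If phase 2 fails, ESRCH propagates to the caller (see below). */
	return (error);
}
```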
But we can easily see that the assumption stated above is not true for the txg-based search (it is true for the search based on the fill factor).
Let's say that T is a cut-off txg.
Some new files can be created after that txg, say, at T + D1.
Then let's assume that all those files are removed at some later txg T + D2.
So, we can easily end up with a situation where many of the L0 blocks in the meta-dnode are either empty (holes) or contain only dnodes from before T (for files created before the latest snapshot).
At the same time, the L1 blocks covering those L0 blocks have a birth txg of T + D2, because that's when they were modified (due to ZFS CoW, new copies of those L1 blocks were "born" at that txg).
And that's the situation that the current algorithm cannot handle.
When it finds an L1 block with birth txg T + D2, it assumes that there must be at least one L0 block under that L1 block that was born just as recently (that is, with a birth txg greater than T).
When the algorithm then checks the L0 blocks and does not find a suitable one, it returns ESRCH.
And that is the bug.
ESRCH signals that no more suitable blocks exist.
But they can perfectly exist at farther offsets under different L1 blocks.
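A tiny, self-contained illustration of this counterexample, with made-up txg values (T = 100, T + D1 = 110, T + D2 = 120):

```c
/*
 * Self-contained illustration of the problematic layout with made-up
 * txg numbers: T = 100 is the cut-off txg, the files were created around
 * T + D1 = 110 and removed at T + D2 = 120.
 */
#include <stdio.h>
#include <stdint.h>

#define	NL0	4	/* L0 blocks covered by one L1 block (toy value) */

int
main(void)
{
	uint64_t T = 100;

	/*
	 * Birth txgs of the L0 blocks under one L1 block after the removal:
	 * 0 stands for a hole, 90 for a block holding only pre-T dnodes.
	 */
	uint64_t l0_birth[NL0] = { 0, 90, 0, 90 };

	/* The L1 block itself was rewritten (CoW) when the files were freed. */
	uint64_t l1_birth = 120;	/* T + D2 */

	int l0_match = 0;
	for (int i = 0; i < NL0; i++) {
		if (l0_birth[i] > T)
			l0_match = 1;
	}

	/*
	 * The L1 block satisfies the txg criterion, but none of the L0
	 * blocks under it do -- exactly the case where the downward phase
	 * of the search fails and ESRCH is returned, even though matching
	 * L0 blocks may exist under later L1 blocks.
	 */
	printf("L1 matches: %s\n", l1_birth > T ? "yes" : "no");
	printf("some L0 under it matches: %s\n", l0_match ? "yes" : "no");
	return (0);
}
```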
I observed the described data layout with zdb.
And I also confirmed with dtrace that that's how dnode_next_offset gave up on the search.
The bug is pretty subtle as a very specific data layout is needed to hit it (but see below).
dmu_object_next is used in a handful of places, but only dsl_destroy_head and traverse_pool can call it with a non-zero txg.
And dsl_destroy_head passes a non-zero txg only when destroying a cloned dataset.
In that case the txg is the creation txg of the latest snapshot, as only files created or modified since that snapshot are to be freed.
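For reference, the open-context freeing boils down to an iteration of this form (a sketch, not the actual dsl_destroy_head code; os stands for the clone's objset and txg for the snapshot's creation txg):

```c
/*
 * Sketch of the open-context freeing loop (not the actual dsl_destroy_head()
 * code): free every object created or modified after `txg`, relying on
 * dmu_object_next()'s txg filter.
 */
uint64_t obj = 0;
int error;

while ((error = dmu_object_next(os, &obj, B_FALSE, txg)) == 0)
	(void) dmu_free_long_object(os, obj);

/*
 * error == ESRCH is supposed to mean "no objects newer than txg remain",
 * but with this bug it can be returned prematurely, leaving the rest of
 * the work to the syncing context.
 */
```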
So, I observed the bug's impact only with dsl_destroy_head, a cloned dataset, and async_destroy disabled (otherwise there is no freeing in the open context at all and the sync-context freeing is "chunked").
What I saw was that dsl_destroy_head removed too few objects in the open context and left too much work for the sync context. The sync thread got bogged down with work and that was very visible.
Additionally, this problem happened after a previously interrupted dsl_destroy_head of that dataset.
So, the sequence of events was:
- the dataset was being destroyed and its objects were being freed in the open context
- the system got rebooted
- I believe that that is what created the layout required to hit the bug, as many objects at the beginning of the dnode array had already been destroyed
- when the cleanup of the inconsistent dataset started, no work was done in the open context, so the remaining data had to be cleaned up in the syncing context
CC @ahrens