dnode_next_offset with txg > 0 can be used by dmu_object_next to iterate over all objects in a dataset that have been created or modified after a certain txg. At the moment, dmu_object_next is the only function that passes a non-zero txg to dnode_next_offset.
The purpose of dnode_next_offset is to find a block / offset at a given block level that matches certain search criteria.
One criterion can be the block's fill factor (how much of the block is in use) and another can be the block's birth txg.
In the case of dmu_object_next, the search is done on the meta-dnode, at level 0 (offsets of dnodes in the meta-dnode's array), and the search criterion is the txg.
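For context, this is roughly how dmu_object_next maps an object number to a byte offset in the meta-dnode and hands the txg down to dnode_next_offset. This is a simplified sketch from memory, not a verbatim copy of the source; newer ZFS versions also handle multi-slot dnodes here:

```c
/*
 * Simplified sketch of dmu_object_next() (not verbatim; details such as
 * multi-slot dnode handling differ between ZFS versions).
 */
int
dmu_object_next(objset_t *os, uint64_t *objectp, boolean_t hole, uint64_t txg)
{
	/* Object numbers map to byte offsets in the meta-dnode's array. */
	uint64_t offset = (*objectp + 1) << DNODE_SHIFT;
	int error;

	/* Level 0 search over the meta-dnode, filtered by birth txg. */
	error = dnode_next_offset(DMU_META_DNODE(os),
	    (hole ? DNODE_FIND_HOLE : 0), &offset, 0, DNODES_PER_BLOCK, txg);

	*objectp = offset >> DNODE_SHIFT;
	return (error);
}
```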
So, given the current offset, dnode_next_offset checks the corresponding level-zero block for a match against the search criteria.
If there is no match at level zero, the algorithm checks level 1, level 2, and so on up to the highest level. Once a matching level-n block is found, the assumption is that there must be a matching level-(n-1) block under it, and so on down to level zero.
So, essentially, the algorithm first searches upwards and then downwards.
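To make the up-then-down structure concrete, here is a rough sketch of the control flow. It is not the actual dnode_next_offset code; search_level() is a hypothetical stand-in for the per-level scan (dnode_next_offset_level() in the real source):

```c
/*
 * Rough sketch of the two-phase search in dnode_next_offset(); not the
 * actual OpenZFS code.  search_level() is a hypothetical stand-in for the
 * per-level scan: it looks for a block at level `lvl` matching the search
 * criteria (here, birth txg > txg), advances *offset to it, and returns
 * ESRCH if nothing at that level matches.
 */
static int search_level(dnode_t *dn, int lvl, uint64_t *offset, uint64_t txg);

static int
next_offset_sketch(dnode_t *dn, uint64_t *offset, int minlvl, uint64_t txg)
{
	int maxlvl = dn->dn_phys->dn_nlevels;
	int lvl;
	int error = ESRCH;

	/* Phase 1: climb up until some level has a matching block. */
	for (lvl = minlvl; lvl < maxlvl; lvl++) {
		error = search_level(dn, lvl, offset, txg);
		if (error != ESRCH)
			break;
	}

	/*
	 * Phase 2: walk back down, assuming that a match at level n
	 * guarantees a match at level n-1 underneath it.
	 */
	while (error == 0 && lvl > minlvl) {
		lvl--;
		error = search_level(dn, lvl, offset, txg);
	}

	/* If phase 2 fails, ESRCH propagates to the caller (see below). */
	return (error);
}
```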
But we can easily see that the assumption stated above is not true for the txg-based search (it is true for the search based on the fill factor).
Let's say that T is a cut-off txg.
Some new files can be created after that txg, say, at T + D1.
Then let's assume that all those files are removed at some later txg T + D2.
So, we can easily end up with a situation where many of the L0 blocks in the meta-dnode are either empty (holes) or contain only dnodes from before T (for files created before the latest snapshot).
At the same time, the L1 blocks covering those L0 blocks have a birth txg of T + D2, because that's when they were modified (due to ZFS CoW, new copies of those L1 blocks were "born" at that txg).
And that's the situation that the current algorithm cannot handle.
When it finds an L1 block with birth txg T + D2, it assumes that there must be at least one L0 block under that L1 block that was born just as recently (that is, with a birth txg greater than T).
When the algorithm then checks the L0 blocks and does not find a suitable one, it returns ESRCH.
And that is the bug.
ESRCH signals that no more suitable blocks exist.
But they can perfectly exist at farther offsets under different L1 blocks.
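A tiny, self-contained illustration of this counterexample, with made-up txg values (T = 100, T + D1 = 110, T + D2 = 120):

```c
/*
 * Self-contained illustration of the problematic layout with made-up
 * txg numbers: T = 100 is the cut-off txg, the files were created around
 * T + D1 = 110 and removed at T + D2 = 120.
 */
#include <stdio.h>
#include <stdint.h>

#define	NL0	4	/* L0 blocks covered by one L1 block (toy value) */

int
main(void)
{
	uint64_t T = 100;

	/*
	 * Birth txgs of the L0 blocks under one L1 block after the removal:
	 * 0 stands for a hole, 90 for a block holding only pre-T dnodes.
	 */
	uint64_t l0_birth[NL0] = { 0, 90, 0, 90 };

	/* The L1 block itself was rewritten (CoW) when the files were freed. */
	uint64_t l1_birth = 120;	/* T + D2 */

	int l0_match = 0;
	for (int i = 0; i < NL0; i++) {
		if (l0_birth[i] > T)
			l0_match = 1;
	}

	/*
	 * The L1 block satisfies the txg criterion, but none of the L0
	 * blocks under it do -- exactly the case where the downward phase
	 * of the search fails and ESRCH is returned, even though matching
	 * L0 blocks may exist under later L1 blocks.
	 */
	printf("L1 matches: %s\n", l1_birth > T ? "yes" : "no");
	printf("some L0 under it matches: %s\n", l0_match ? "yes" : "no");
	return (0);
}
```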
I observed the described data layout with zdb.
And I also confirmed with dtrace that that's how dnode_next_offset gave up on the search.
The bug is pretty subtle as a very specific data layout is needed to hit it (but see below).
dmu_object_next is used in a handful of places, but only dsl_destroy_head and traverse_pool can call it with a non-zero txg.
And dsl_destroy_head passes a non-zero txg only when destroying a cloned dataset.
In that case the txg is the creation txg of the latest snapshot, as only files created or modified since that snapshot are to be freed.
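For reference, the open-context freeing boils down to an iteration of this form (a sketch, not the actual dsl_destroy_head code; os stands for the clone's objset and txg for the snapshot's creation txg):

```c
/*
 * Sketch of the open-context freeing loop (not the actual dsl_destroy_head()
 * code): free every object created or modified after `txg`, relying on
 * dmu_object_next()'s txg filter.
 */
uint64_t obj = 0;
int error;

while ((error = dmu_object_next(os, &obj, B_FALSE, txg)) == 0)
	(void) dmu_free_long_object(os, obj);

/*
 * error == ESRCH is supposed to mean "no objects newer than txg remain",
 * but with this bug it can be returned prematurely, leaving the rest of
 * the work to the syncing context.
 */
```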
So, I observed the bug's impact only with dsl_destroy_head, a cloned dataset, and async_destroy disabled (otherwise there is no freeing in the open context at all and the sync-context freeing is "chunked").
What I saw was that dsl_destroy_head removed too few objects in the open context and left too much work for the sync context. The sync thread got bogged down with work and that was very visible.
Additionally, this problem happened after a previously interrupted dsl_destroy_head of that dataset.
So, the sequence of events was:
- the dataset was being destroyed and its objects were being freed in the open context
- the system got rebooted
- I believe that that is what created the layout required to hit the bug, as many objects at the beginning of the dnode array had already been destroyed
- when the cleanup of the inconsistent dataset started, no work was done in the open context, so the remaining data had to be cleaned up in the syncing context
CC @ahrens