
Conversation

rrevans
Contributor

@rrevans rrevans commented Mar 25, 2024

This changes the basic search algorithm from a single search up and down the tree to a full depth-first traversal to handle conditions where the tree matches at a higher level but not a lower level.

Motivation and Context

Normally, a higher-level match found in the first loop of dnode_next_offset always points to a matching block in the second loop, but there are cases where this does not hold:

  1. Racing block pointer updates from dbuf_write_ready.

    Before f664f1e (#8946, "Reduce lock contention on dn_struct_rwlock"), both dbuf_write_ready and dnode_next_offset held dn_struct_rwlock, which protected against pointer writes from concurrent syncs.

    This no longer applies, so sync context can e.g. clear or fill all L1->L0 BPs before the L2->L1 BP and higher BPs are updated.

    dnode_free_range in particular can reach this case and skip over L1 blocks that need to be dirtied. Later, sync will panic in free_children when trying to clear a non-dirty indirect block.

    This case was found with ztest.

  2. txg > 0, non-hole case. This is #11196 (subtle bug in dnode_next_offset() with txg > 0).

    Freeing blocks/dnodes breaks the assumption that a match at a higher level implies a match at a lower level when filtering txg > 0.

    Whenever some but not all L0 blocks are freed, the parent L1 block is rewritten. Its updated L2->L1 BP reflects a newer birth txg.

    Later when searching by txg, if the L1 block matches since the txg is newer, it is possible that none of the remaining L1->L0 BPs match if none have been updated.

    The same behavior is possible with dnode search at L0.

    This is reachable from dsl_destroy_head for synchronous freeing. When this happens, open context fails to free objects, leaving sync context stuck freeing potentially many objects.

    This is also reachable from traverse_pool for extreme rewind where it is theoretically possible that datasets not dirtied after txg are skipped if the MOS has high enough indirection to trigger this case.

In both of these cases, without backtracking, the search ends prematurely because an ESRCH result implies no further matches in the entire object.

This PR is also a first step towards teaching dnode_next_offset to consider dirty dbuf state.

In the next PR, dnode_next_offset_level is modified to stop at any dirty indirect block when a new flag is set. This allows dnode_next_offset to match dirty L0 blocks (or freed-but-not-synced L0 ranges) the same as synced-out data blocks (or holes). However, that approach requires backtracking, since a dirty higher-level indirect may not match once the L0/L1 state is inspected (e.g. consider a data search reaching an L2 block that is dirty but where all L0 blocks previously created under that L2 are now newly freed in dirty state).

Description

Old algorithm:

  1. Start at minlvl
  2. Increment lvl until a matching block is found or maxlvl exceeded.
  3. Decrement lvl until minlvl reached or no match found.

New algorithm:

  1. Start at minlvl
  2. Do a tree traversal checking for a match at each block:
    a. If matched, decrement lvl until minlvl reached.
    b. If not matched, adjust offset to next BP at lvl+1 and increment lvl.

The new algorithm continues the search at the next possible offset at the next higher level when no match is found. This performs in-order traversal of the tree while skipping non-existing or non-matching ranges.
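
To make the contrast concrete, here is a small self-contained C model of the two drivers. It is an illustrative sketch only, not the actual OpenZFS code: level_search stands in for dnode_next_offset_level, the tree is a hard-coded match table whose L1 entry id 1 mimics a stale higher-level BP with no matching children, and EPBS/MAXVL values are toy choices. On a miss, the toy level_search already leaves *offset at the start of the next entry one level up, which is the "adjust offset to next BP at lvl+1" step above.

/*
 * Toy model of the dnode_next_offset() drivers (not the real OpenZFS code).
 * level_search() scans one level within the parent block containing *offset:
 * returns 0 with *offset at the match, or ESRCH with *offset advanced past
 * the scanned range. L1 id 1 is a stale "match" with no matching L0 children;
 * the only real L0 match is at id 12.
 */
#include <stdio.h>
#include <errno.h>

#define EPBS   2                        /* 4 entries per toy block */
#define MAXLVL 2

static const int match[MAXLVL + 1][16] = {
	[0] = { [12] = 1 },             /* L0 blocks that really match */
	[1] = { [1] = 1, [3] = 1 },     /* L1 id 1 is the stale "match" */
	[2] = { [0] = 1 },
};

static int
level_search(int lvl, unsigned *offset)
{
	unsigned id = *offset >> (lvl * EPBS);
	unsigned end = ((id >> EPBS) + 1) << EPBS;  /* one past parent block */

	for (; id < end; id++) {
		if (match[lvl][id]) {
			*offset = id << (lvl * EPBS);
			return (0);
		}
	}
	*offset = end << (lvl * EPBS);              /* skip the scanned range */
	return (ESRCH);
}

/* Old algorithm: one pass up, then one pass down. */
static int
old_search(int minlvl, int maxlvl, unsigned *offset)
{
	int lvl, error = ESRCH;

	for (lvl = minlvl; lvl <= maxlvl && error == ESRCH; lvl++)
		error = level_search(lvl, offset);
	for (lvl--; error == 0 && lvl > minlvl; )
		error = level_search(--lvl, offset);
	return (error);
}

/* New algorithm: depth-first traversal that backs up on a miss. */
static int
new_search(int minlvl, int maxlvl, unsigned *offset)
{
	int lvl = minlvl, error;

	for (;;) {
		error = level_search(lvl, offset);
		if (error == 0 && lvl > minlvl)
			lvl--;          /* match: keep descending toward minlvl */
		else if (error == ESRCH && lvl < maxlvl)
			lvl++;          /* miss: offset already advanced, back up */
		else
			return (error); /* matched at minlvl, or ESRCH at maxlvl */
	}
}

int
main(void)
{
	unsigned off = 0;
	int err = old_search(0, MAXLVL, &off);
	printf("old: error=%d offset=%u\n", err, off);
	off = 0;
	err = new_search(0, MAXLVL, &off);
	printf("new: error=%d offset=%u\n", err, off);
	return (0);
}

On this toy tree the old driver descends into the stale L1 and stops with ESRCH at offset 8, while the new driver backs up, resumes the L1 scan, and finds the real match at offset 12.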

How Has This Been Tested?

Many ztest and ZTS runs as well as llseek(SEEK_DATA/SEEK_HOLE) stress tests. This surfaced a lot of problems getting *offset semantics right, and also found a novel PANIC in free_children which this PR happens to fix.

I don't know how to really test maxlvl == 0 changes (see also comments in #11200), and it would be nice to have more unit-oriented tests for dnode_next_offset. Any feedback appreciated.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Mar 27, 2024
@behlendorf behlendorf self-requested a review March 27, 2024 22:03
rrevans added a commit to rrevans/zfs that referenced this pull request Mar 28, 2024
This walk is inherently racy w.r.t. dbuf eviction and sync.

Consider:
0. A large sparse file with 3 levels of indirection.
1. A new L1 block is added to a brand new L2 block.
2. The L1 block syncs out and is immediately evicted.
3. Before the L3->L2 BP is updated in the L3 block,
   dnode_free_range attempts to free the new L1.

In this case neither dnode_dirty_l1range nor dnode_next_offset
can find the newly synced-out L1 block and its L0 blocks:
- dnode_dirty_l1range uses in-memory index but the L1 is evicted
- dnode_next_offset considers on-disk BPs but the L3->L2 is missing

And then free_children will later PANIC because the L1 was not dirtied
during open context when freeing the range.

This case was found during testing llseek(SEEK_HOLE/SEEK_DATA)
without txg sync and is distinct from the _other_ free_children
panic found and addressed by openzfs#16025.

The fix is to replace dnode_dirty_l1range with
dnode_next_offset(DNODE_FIND_DIRTY) which knows how to
find all dirty L1 blocks.

This PR also changes to use minlvl=1 to avoid redirtying
L2 blocks that are only dirtied in a prior txg. Successive
frees otherwise needlessly redirty already-empty L1s which
wastes time during txg sync turning them back into holes.

Signed-off-by: Robert Evans <[email protected]>
@rrevans
Contributor Author

rrevans commented Mar 28, 2024

See master...rrevans:zfs:find_dirty for the rest of the patchset here.

  1. dnode_next_offset: add DNODE_FIND_DIRTY
  2. dmu_offset_next: Use DNODE_FIND_DIRTY for SEEK_HOLE/SEEK_DATA
  3. dnode_free_range: Replace L1 dirty walk with DNODE_FIND_DIRTY

@adamdmoss
Contributor

I found your notes quite educational as a background so I'm repeating the link here for future readers:
https://gist.github.com/rrevans/e6a2e14be9dea7f9711b83c2d18303d5

@tonyhutter
Contributor

@rrevans sorry no one has taken a look at this yet. I just tried pulling down the patch, but looks like it's now out of sync with master. Would you mind re-basing on top of master?

@jumbi77
Contributor

jumbi77 commented Sep 20, 2024

@rrevans sorry no one has taken a look at this yet. I just tried pulling down the patch, but looks like it's now out of sync with master. Would you mind re-basing on top of master?

Politely ping @rrevans to rebase this great work!

@rrevans
Contributor Author

rrevans commented Dec 23, 2024

@jumbi77 @tonyhutter thanks for the ping. I'll have a look here and post a rebase in the next week or so.

Edit: Updated!

rrevans added a commit to rrevans/zfs that referenced this pull request Jan 8, 2025
@rrevans
Contributor Author

rrevans commented Jan 8, 2025

@tonyhutter Updated this PR as well as the rest of the series in master...rrevans:zfs:find_dirty. Please take a look if you get a chance!

Those other patches mostly rework llseek(..., SEEK_DATA or SEEK_HOLE) so that it operates entirely on dirty state and removes the forced txg syncs. If this is accepted I'd be glad to move those ones forward too.

@jumbi77
Contributor

jumbi77 commented Jun 13, 2025

In case this optimization is still applicable, maybe @robn is interested in taking a look and finishing this? (It's just a hint, so feel free to not respond :) )

rrevans added a commit to rrevans/zfs that referenced this pull request Aug 31, 2025
@rrevans
Contributor Author

rrevans commented Aug 31, 2025

@robn For when you are back: I've updated this patch to merge at head as well as the rest of the series in master...rrevans:zfs:find_dirty.

Let me know when you might have time to pick this up? Happy to answer questions or rework the patches as needed.

(Following up from #17652 (comment))

Edit:

I could be convinced to split ... [out] ... the offset handling at boundary conditions ... in dnode_next_offset

After staring at this more, the changes to dnode_next_offset_level aren't necessary at all, and I've removed them.

The new traversal requires that, when error == 0, the output offset of dnode_next_offset_level is always >= the input offset (or <= for backwards search). If not, the walk might loop endlessly going up and down the same part of the tree. Turns out that this is already satisfied by the existing code even for edge cases.

I'll also have another think about how to test the txg > 0 case as mentioned in #11200 (review).

rrevans added a commit to rrevans/zfs that referenced this pull request Sep 1, 2025
rrevans added a commit to rrevans/zfs that referenced this pull request Sep 1, 2025
@rrevans
Contributor Author

rrevans commented Sep 13, 2025

I've been working on reproducing the two issues here more precisely. With new tooling, I've been able to clearly identify the problems and demonstrate that this PR definitely fixes them.

Apologies for the long comment, but hopefully this aids in building a better understanding of the problems here.

@robn Let me know if you have some time soon to work on this?

1) free_children panic

What causes the panic?

Consider a sparse file with 3 populated L2 blocks (recordsize=128k, compression=on, and 128k indirect blocks):

# zdb -Ovv test/ tmp | grep -v 0:0:0 | sed 's/0:[0-9a-f:]*//; s/ B=.*//'
...
               0 L3     20000L/400P F=3
     10000000000  L2    20000L/400P F=1
     10000000000   L1   20000L/400P F=1
     10000000000    L0  20000L/400P F=1
     20000000000  L2    20000L/400P F=1
     20000000000   L1   20000L/400P F=1
     20000000000    L0  20000L/400P F=1
     30000000000  L2    20000L/400P F=1
     30000000000   L1   20000L/400P F=1
     30000000000    L0  20000L/400P F=1
...

If the L0 block at offset 0x10000000000 is overwritten with zeros, sync will perform these steps:

  1. zio_write_compress will convert the all-zeros L0 block into a hole
  2. dbuf_write_ready will write the hole BP into the L1 dbuf
  3. dbuf_write_children_ready will zero out the L1 dbuf since there are no more children
  4. zio_write_compress will convert the all-zeros L1 block into a hole
  5. dbuf_write_ready will write the hole BP into the L2 dbuf
  6. dbuf_write_children_ready will zero out the L2 dbuf since there are no more children
  7. zio_write_compress will convert the all-zeros L2 dbuf into a hole
  8. dbuf_write_ready will write the new hole BP into the in-memory L3 dbuf

There is a window of time between steps (5) and (8) where the L3->L2 BP still reflects the prior on-disk state, but the L2 dbuf contains only holes (or zeros).

If the file is truncated to zero bytes during this window, dnode_free_range will perform these steps in open context:

  1. dirty the first and last L1 blocks in the file
  2. dirty all in-memory L1 blocks using dnode_dirty_l1range
  3. dirty all on-disk L1 blocks with dnode_next_offset and minlvl=2 search starting from offset 0x20000

The search for on-disk blocks proceeds as follows:

  1. dnode_next_offset searches the L2 block at offset 0x20000, which is a hole
  2. dnode_next_offset searches the L3 block at offset 0x20000 and finds the L2 BP at 0x10000000000
  3. dnode_next_offset searches the now-empty L2 at 0x10000000000 and returns ESRCH
  4. ESRCH is treated as the end of the whole search and the loop terminates before reaching 0x20000000000

(Note that unlike dmu_offset_next, dnode_free_range does not wait for the on-disk state to be clean.)

This means the L1 at offset 0x20000000000 is not dirtied at all if both:

  • it is not already in memory (was evicted, or is a freshly imported pool), and
  • it is not the last block in the range (e.g. the file ends with a hole beyond)

Later, free_children walks the L1 blocks in sync context and frees them. It panics if any L1 block is not dirty or empty, which is the case for the block at offset 0x20000000000. ZFS panics with VERIFY(BP_GET_FILL(db->db_blkptr) == 0 || db->db_dirtycnt > 0).

For this condition to be hit, the second free range must start in an L2 block prior to the one that contains the blocks freed in the first sync. This is so that it will walk the L3->L2 BP in a downward search. If it starts in the same block, then it will simply search the empty L2 block and continue upwards at an offset at the end of the L2 block.

How to reproduce it?

The window above depends on timing, so I've created two new debug tools to allow reliably hitting it:

  1. A new zinject command ZINJECT_DELAY_READY that sleeps in zio_ready after children are ready but before calling the ready callback. This allows pausing the pipeline deterministically after zeroing the L2 block but before L3->L2 BP update. rrevans@677eb8c
  2. A new ioctl ZFS_IOC_WAIT_INJECT and zinject command that blocks in the kernel until some injection handler matches and injects a fault. This wakes up immediately after sync reaches the point above to give maximum time to hit the window. rrevans@04cacd3

With these, it's now possible to create a reliable reproducer:

# create suitable dataset
zfs create test/ds -o recordsize=128k -o compression=on

# create the file
dd if=<(echo z) of=/test/ds/f bs=1 count=1 seek=1T
dd if=<(echo z) of=/test/ds/f bs=1 count=1 seek=2T
dd if=<(echo z) of=/test/ds/f bs=1 count=1 seek=3T

# sync the file to disk
zpool sync

# inject a 5 second delay at zio_ready for L2 writes
# and remount (-m) to evict all L1 dbufs
zinject -m -E 5 -l2 -T write -t data /test/ds/f

# prepare to wait (get sync token from kernel)
STATE=$(zinject -w0 -W0)

# zero out the block to eventually free it
dd if=/dev/zero of=/test/ds/f bs=1 count=1 seek=1T conv=notrunc

# wait for events after sync point above
zinject -w "${STATE?}"
zinject -c all

# now, racing with zio_ready, free the whole file
truncate --size=0 /test/ds/f

# sync will panic if truncate hits the window
zpool sync

The above script reaches the panic for me 100% of the time.

How does this PR fix the problem?

After this PR, the search will continue with an upwards search if the lower level does not match.

The script above no longer reproduces the panic with this PR applied.

What alternatives to this PR were considered?

  • Make dnode_free_range recursively walk every indirect like free_children does
  • Make dnode_free_range check the returned offset instead of just the return code
  • Make dnode_free_range trigger sync and wait for it like dmu_offset_next

2) txg > 0, non-hole case

What is the problem?

When searching for blocks or dnodes with txg > 0, dnode_next_offset will return ESRCH if, during the downward search, it encounters a block whose own birth txg matches but whose only children modified at or after that txg are holes.

Consider this DMU dataset object with compression=on:

# zdb -dvvvv test/files 0 | grep -v 0:0:0 | sed 's/0:[0-9a-f:]*//; s/ cksum=.*//'
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K    19K     512  34.3M    0.01  DMU dnode

Indirect blocks:
               0 L5       20000L/400P F=9 B=6316998/6316998
               0  L4      20000L/400P F=9 B=6316998/6316998
               0   L3     20000L/400P F=9 B=6316998/6316998
               0    L2    20000L/400P F=9 B=6316998/6316998
               0     L1   20000L/400P F=6 B=6316998/6316998
               0      L0  4000L/200P F=1 B=6316960/6316960
            4000      L0  4000L/400P F=5 B=6316998/6316998
         1000000     L1   20000L/400P F=2 B=6316986/6316986
         1000000      L0  4000L/200P F=1 B=6316978/6316978
         1004000      L0  4000L/200P F=1 B=6316986/6316986
         2000000     L1   20000L/400P F=1 B=6316990/6316990
         2000000      L0  4000L/200P F=1 B=6316990/6316990

Suppose the L0 block at 0x1004000 is freed -- after the one file in that block is deleted.

At the same time, suppose a file in L0 block 0x2000000 gets updated (e.g. mtime updated).

Then the dataset layout becomes:

               0 L5       20000L/400P F=8 B=6317082/6317082
               0  L4      20000L/400P F=8 B=6317082/6317082
               0   L3     20000L/400P F=8 B=6317082/6317082
               0    L2    20000L/400P F=8 B=6317082/6317082
               0     L1   20000L/400P F=6 B=6317082/6317082
               0      L0  4000L/200P F=1 B=6316960/6316960
            4000      L0  4000L/400P F=5 B=6317082/6317082
         1000000     L1   20000L/400P F=1 B=6317075/6317075
         1000000      L0  4000L/200P F=1 B=6316978/6316978
         2000000     L1   20000L/400P F=1 B=6317075/6317075
         2000000      L0  4000L/200P F=1 B=6317075/6317075

Notably, the L1 block at 0x1000000 has birth txg 6317075 but contains no children L0 blocks at that txg. And the L1 block at 0x2000000 is updated to the same birth txg.

Suppose dnode_next_offset is called with offset=0x8000, minlvl=0, and txg=6317074:

  1. dnode_next_offset starts upward search at L0 offset 0x8000
  2. dnode_next_offset_level finds a hole and returns ESRCH
  3. dnode_next_offset continues upward search at L1 offset 0x8000
  4. dnode_next_offset_level finds the newer L1->L0 BP for offset 0x1000000
  5. dnode_next_offset starts downward search at L0 offset 0x1000000
  6. dnode_next_offset_level matches nothing in the block
  7. dnode_next_offset then returns ESRCH with offset 0x2000000 (end of that block)

This is wrong as there is a block at 0x2000000 with matching txg that should be found instead.
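
To make step 6 concrete, here is a toy C sketch of the birth-txg filter applied to each BP (purely illustrative -- the real check lives in dnode_next_offset_level; the struct below is made up, and the births are taken from the zdb output above):

#include <stdio.h>

/* toy stand-in for a block pointer: hole flag + birth txg */
typedef struct { int hole; unsigned long long birth; } toy_bp;

/* a non-hole txg search only cares about a BP if it is not a hole
 * and was (re)written after the search txg */
static int
bp_matches(const toy_bp *bp, unsigned long long txg)
{
	return (!bp->hole && bp->birth > txg);
}

int
main(void)
{
	unsigned long long txg = 6317074;
	toy_bp l2_to_l1 = { 0, 6317075 };       /* L1 at 0x1000000 after the free */
	toy_bp l1_to_l0[2] = {
		{ 0, 6316978 },                 /* surviving L0 at 0x1000000 */
		{ 1, 0 },                       /* freed L0 at 0x1004000, now a hole */
	};

	printf("L2->L1 matches: %d\n", bp_matches(&l2_to_l1, txg));
	for (int i = 0; i < 2; i++)
		printf("L1->L0[%d] matches: %d\n", i, bp_matches(&l1_to_l0[i], txg));
	return (0);
}

The L2->L1 BP passes the filter, but neither remaining L1->L0 entry does -- the higher-level match with no lower-level match that the backtracking in this PR handles.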

To reach this state, I did the following (via script):

  1. Created 100,000 files in a new dataset
  2. Mapped object numbers to files via struct stat::st_ino
  3. Deleted all objects except 32768, 32800, and 65536
  4. Synced the pool to advance txg
  5. Deleted object 32800
  6. Updated mtime on object 65536

The same problem also applies to ordinary files, at least in theory. Consider:

# zdb -vvvv test/54 19 | grep -v 0:0:0 | sed 's/0:[0-9a-f:]*//; s/ cksum=.*//'
    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
        19    3   128K   128K   392K     512   256M    0.15  ZFS plain file
                                               280   bonus  System attributes
...
Indirect blocks:
               0 L2    20000L/400P F=3 B=6317236/6317236
               0  L1   20000L/400P F=1 B=6317219/6317219
               0   L0  20000L/20000P F=1 B=6317213/6317213
         8000000  L1   20000L/400P F=1 B=6317236/6317236
         8000000   L0  20000L/20000P F=1 B=6317219/6317219
        10000000  L1   20000L/400P F=1 B=6317226/6317226
        10000000   L0  20000L/20000P F=1 B=6317226/6317226

A similar file can be created with:

dd if=<(echo z) of=/test/f bs=1 count=1 seek=$((0x8000000)) conv=notrunc
dd if=<(echo z) of=/test/f bs=1 count=1 seek=$((0x8020000)) conv=notrunc
dd if=<(echo z) of=/test/f bs=1 count=1 seek=$((0x10000000)) conv=notrunc
zpool sync
dd if=/dev/zero of=/test/f bs=1 count=1 seek=$((0x8020000)) conv=notrunc
dd if=<(echo a) of=/test/f bs=1 count=1 seek=$((0x10000000)) conv=notrunc

Similar to the dataset above, the L1 block at offset 0x8000000 is newer because it contains a newer hole and the L1 block at offset 0x10000000 is newer because it has been modified.

The same issue happens if dnode_next_offset is called with offset=0x4000000, minlvl=1, and txg=6317225 on this file:

  • The L1 block at 0x8000000 is born at 6317236 but contains no matching L0 blocks
  • So dnode_next_offset returns ESRCH at offset 0x10000000 (the end of that block)

What does this break?

dsl_destroy_head can reach this with -o feature@async_destroy=disabled. It looks like this would cause objects at greater offsets to be skipped from open context while doing synchronous destroy. I have not reproduced this case in full.

There is no way to reach this for ordinary files as txg > 0 search is never used.

How to reproduce it?

For this, I've added two new ioctls to allow directly inspecting the objects:

  1. ZFS_IOC_NEXT_OBJ_TXG allows calling dmu_object_next from userspace with txg > 0: rrevans@8498f1e
  2. ZFS_IOC_NEXT_OFFSET allows calling dnode_next_offset from userspace for any object: rrevans@b53f5ca

For each of these, I also implemented matching libzfs_core marshallers.

Then, I can use python ctypes wrapper scripts to invoke these ioctls:

  • nextobj.py - finds the next DMU object in a dataset
  • nextoff.py - finds the next block/hole in a DMU object

With these, I can directly call the corresponding functions to observe the unwanted behavior.

For the dataset example:

  1. dmu_object_next and dnode_next_offset both fail (incorrectly) at offset 0x8000 if txg=6317074:

    # nextobj.py test/files 63 --txg=6317074 
    3 ESRCH next=65536
    # nextoff.py test 1029 0 0x8000 --minlvl=0 --blkfill=32 --txg=6317074
    3 ESRCH offset=33554432 (0x0000000002000000)
    

    This is wrong. It should return 0 (OK) at that offset.

  2. But then searching from a greater offset finds the dnodes correctly:

    # nextobj.py test/files 32767 --txg=6317074 
    0 OK next=65536
    # nextoff.py test 1029 0 0x1000000 --minlvl=0 --blkfill=32 --txg=6317074
    0 OK offset=33554432 (0x0000000002000000)
    
  3. If search starts exactly one object (or byte) before, both operations still fail with ESRCH:

    # nextobj.py test/files 32766 --txg=6317074 
    3 ESRCH next=65536
    # nextoff.py test 1029 0 0xffffff --minlvl=0 --blkfill=32 --txg=6317074
    3 ESRCH offset=33554432 (0x0000000002000000)
    

For the plain file example:

  1. dnode_next_offset shows the same behavior with txg=6317225:

    # nextoff.py test 54 19 0x7ffffff --minlvl=1 --blkfill=1 --txg=6317225
    3 ESRCH offset=268435456 (0x0000000010000000)
    # nextoff.py test 54 19 0x8000000 --minlvl=1 --blkfill=1 --txg=6317225
    0 OK offset=268435456 (0x0000000010000000)
    

How does this PR fix the problem?

Same as above, search will continue with an upwards search when this condition occurs.

Once this patch is applied, the cases above no longer fail:

Dataset case:

# nextobj.py test/files 32766 --txg=6317074
0 OK next=65536
# nextoff.py test 1029 0 0xffffff --minlvl=0 --blkfill=32 --txg=6317074
0 OK offset=33554432 (0x0000000002000000)

Plain file case:

# nextoff.py test 54 19 0x7ffffff --minlvl=1 --blkfill=1 --txg=6317225
0 OK offset=268435456 (0x0000000010000000)

The new tools can also observe exactly the ESRCH error seen by dnode_free_range for the free_children panic case:

  • Before the sync starts:

    # nextoff.py test 150 3 0x20000 --minlvl=2 --blkfill=1
    0 OK offset=1099511627776 (0x0000010000000000)
    
  • During the ready delay:

    # nextoff.py test 150 3 0x20000 --minlvl=2 --blkfill=1
    3 ESRCH offset=1236950581248 (0x0000012000000000)
    

    This is wrong. It should return 0 (OK) and offset 0x20000000000.

  • After sync finishes:

    nextoff.py test 150 3 0x20000 --minlvl=2 --blkfill=1
    0 OK offset=2199023255552 (0x0000020000000000)
    

With the patch applied:

  • Before the sync starts:

    # nextoff.py test 150 3 0x20000 --minlvl=2 --blkfill=1
    0 OK offset=1099511627776 (0x0000010000000000)
    
  • During the ready delay:

    # nextoff.py test 150 3 0x20000 --minlvl=2 --blkfill=1
    0 OK offset=2199023255552 (0x0000020000000000)
    

    This is the correct outcome and explains why the panic is fixed.

  • After sync finishes:

    nextoff.py test 150 3 0x20000 --minlvl=2 --blkfill=1
    0 OK offset=2199023255552 (0x0000020000000000)
    

Conclusion

The new tooling reliably reproduces both issues and demonstrates that this PR fixes both the panic and the txg > 0 case.

Contributor

@behlendorf behlendorf left a comment

Thanks for the detailed walk-through of how both of these problem cases can occur. It took me a while to digest everything, but I was eventually able to convince myself this looks right. Let's see if we can also get @robn and @avg-I to take a look. @avg-I identified the txg > 0 issue and proposed an alternate fix in #11200, so I'm sure he's familiar with this bit of the code.

It'd be great to additionally pull in your zinject improvements. Those could definitely be handy in the future when trying to reproduce other similar subtle bugs. Plus, it would let us add your test case for this to the test suite.

@rrevans
Contributor Author

rrevans commented Sep 25, 2025

Thanks for reviewing @behlendorf!

I'll send along the other PRs for the zinject tooling.

Also I've been working on a separate PR to clean up the dnode_next_offset_level offset handling, which:

  1. is pretty confusing to read since it has to map offsets to blkid and back each call
  2. requires dnode_next_block to handle edge cases where the offset is at the limit of iteration
  3. makes it hard to prove that dnode_next_offset never loops (e.g. offset always increases or decreases)

I have a draft commit that addresses this by having dnode_next_offset_level accept a non-pointer blkid plus a pointer to an index into that block's BP or dnode array. Then dnode_next_offset deals with changing the blkid, and the dnode_next_block business moves inline (it becomes ++blkid and --blkid).
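
Purely as an illustration of that iteration shape (this is not the draft commit itself, and the names below are made up), the per-level scan would take a block id by value plus a pointer to an entry index, leaving the caller to step block ids directly:

#include <stdio.h>
#include <errno.h>

#define EPB 4   /* toy entries per block */

/* scan entries [*idx, EPB) of block blkid; 0 on match, ESRCH if none */
static int
level_scan(unsigned long long blkid, int *idx, const int *entries)
{
	for (; *idx < EPB; (*idx)++)
		if (entries[blkid * EPB + *idx])
			return (0);
	return (ESRCH);
}

int
main(void)
{
	int entries[2 * EPB] = { 0, 0, 0, 0, 0, 0, 1, 0 }; /* match: blk 1, idx 2 */
	unsigned long long blkid = 0;
	int idx = 0, err;

	while ((err = level_scan(blkid, &idx, entries)) != 0 && blkid + 1 < 2) {
		blkid++;        /* advancing a level is just ++blkid ... */
		idx = 0;        /* ... plus resetting the entry index */
	}
	printf("err=%d blkid=%llu idx=%d\n", err, blkid, idx);
	return (0);
}

Monotonicity then falls out directly: within a level the index only increases, and across levels the caller only steps blkid in one direction.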

This approach fixes those above issues and makes it clear that dnode_next_offset_level always iterates in a single direction. Let me know if you feel like it would be worthwhile to merge that into this PR or as a separate one.

@akashb-22
Contributor

@rrevans Your script to reliably reproduce the VERIFY(BP_GET_FILL(db->db_blkptr) == 0 || db->db_dirtycnt > 0) check in free_children is exceptionally good. We've encountered these issues only rarely and haven't been able to consistently reproduce them ourselves.
Additionally, I'd be interested to know if you have any cases that occur without the zinject patch.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 25, 2025
@behlendorf
Contributor

@rrevans your draft commit looks nice, that really helps with the readability and the ability to reason about this code. However, rather than fold those changes into this PR, let me merge this more limited change and you can follow up with that bigger rework in a new PR.

@rrevans
Contributor Author

rrevans commented Sep 25, 2025

@rrevans Your script to reliably reproduce the VERIFY(BP_GET_FILL(db->db_blkptr) == 0 || db->db_dirtycnt > 0) check in free_children is exceptionally good. We've encountered these issues only rarely and haven't been able to consistently reproduce them ourselves.

Thanks!

Additionally, I'd be interested to know if you have any cases that occur without the zinject patch.

Short answer not in production.

So I've been deep in this code because I am in pursuit of sync-free llseek(SEEK_HOLE/DATA), where the DFS traversal is mandatory to skip over live state that mismatches the committed on-disk state.

Since this change is non-trivial, I have been running synthetic tests en masse on my development machine. This is mainly ztest in a loop, the ZTS suite in a loop, and custom llseek stressors. (I don't have prod workloads on ZFS; my day job is infra software, but I contribute here on my own time.)

The free panic occurred about once per day in a ztest loop as I recall, but I've lost track of the setup conditions.

Hope this helps!

Edit: My notes say I was running my llseek stressor x100 with specific tuning, and was able to trigger it within 20-30 seconds. I can give that another go if it helps you? TL;DR holes added and removed from 100 large sparse files randomly.

@behlendorf behlendorf merged commit 26b0f56 into openzfs:master Sep 25, 2025
49 of 56 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Sep 25, 2025
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Akash B <[email protected]>
Signed-off-by: Robert Evans <[email protected]>
Closes openzfs#16025
Closes openzfs#11196
@robn
Member

robn commented Sep 26, 2025

Sorry for the no-show, life excuses etc. This is great work, and the explainer is A+. Thank you!

@rrevans rrevans deleted the traverse branch September 26, 2025 04:08
@rrevans
Contributor Author

rrevans commented Sep 26, 2025

@rrevans your draft commit looks nice, that really helps with the readability and the ability to reason about this code. However, rather than fold those changes into this PR, let me merge this more limited change and you can follow up with that bigger rework in a new PR.

The offset --> blkid cleanup is #17792
