dnode_next_offset: backtrack if lower level does not match #16025
This walk is inherently racy w.r.t. dbuf eviction and sync. Consider:

0. A large sparse file with 3 levels of indirection.
1. A new L1 block is added to a brand new L2 block.
2. The L1 block syncs out and is immediately evicted.
3. Before the L3->L2 BP is updated in the L3 block, dnode_free_range attempts to free the new L1.

In this case neither dnode_dirty_l1range nor dnode_next_offset can find the newly synced-out L1 block and its L0 blocks:

- dnode_dirty_l1range uses the in-memory index, but the L1 is evicted
- dnode_next_offset considers on-disk BPs, but the L3->L2 BP is missing

And then free_children will later PANIC because the L1 was not dirtied during open context when freeing the range.

This case was found during testing llseek(SEEK_HOLE/SEEK_DATA) without txg sync and is distinct from the _other_ free_children panic found and addressed by openzfs#16025.

The fix is to replace dnode_dirty_l1range with dnode_next_offset(DNODE_FIND_DIRTY), which knows how to find all dirty L1 blocks. This PR also changes to use minlvl=1 to avoid redirtying L2 blocks that are only dirtied in a prior txg. Successive frees otherwise needlessly redirty already-empty L1s, which wastes time during txg sync turning them back into holes.

Signed-off-by: Robert Evans <[email protected]>
See master...rrevans:zfs:find_dirty for the rest of the patchset.
I found your notes quite educational as background, so I'm repeating the link here for future readers:
@rrevans sorry no one has taken a look at this yet. I just tried pulling down the patch, but it looks like it's now out of sync with master. Would you mind rebasing on top of master?
@jumbi77 @tonyhutter thanks for the ping. I'll have a look here and post a rebase in the next week or so. Edit: Updated!
@tonyhutter Updated this PR as well as the rest of the series in master...rrevans:zfs:find_dirty. Please take a look if you get a chance! Those other patches mostly rework
In case this optimization is still applicable, maybe @robn is interested in taking a look and finishing this? (It's just a hint, so feel free not to respond :) )
@robn For when you are back: I've updated this patch to merge at head as well as the rest of the series in master...rrevans:zfs:find_dirty. Let me know when you might have time to pick this up? Happy to answer questions or rework the patches as needed. (Following up from #17652 (comment)) Edit:
Upon staring more, the changes to The new traversal requires that, when error == 0, the output offset of I'll also have another think about how to test the
This changes the basic search algorithm from a single search up and down the tree to a full depth-first traversal to handle conditions where the tree matches at a higher level but not a lower level.

Normally higher level blocks always point to matching blocks, but there are cases where this does not happen:

1. Racing block pointer updates from dbuf_write_ready.

   Before f664f1e (openzfs#8946), both dbuf_write_ready and dnode_next_offset held dn_struct_rwlock, which protected against pointer writes from concurrent syncs. This no longer applies, so sync context can e.g. clear or fill all L1->L0 BPs before the L2->L1 BP and higher BPs are updated. dnode_free_range in particular can reach this case and skip over L1 blocks that need to be dirtied. Later, sync will panic in free_children when trying to clear a non-dirty indirect block. This case was found with ztest.

2. txg > 0, non-hole case. This is openzfs#11196.

   Freeing blocks/dnodes breaks the assumption that a match at a higher level implies a match at a lower level when filtering txg > 0. Whenever some but not all L0 blocks are freed, the parent L1 block is rewritten. Its updated L2->L1 BP reflects a newer birth txg. Later when searching by txg, if the L1 block matches since the txg is newer, it is possible that none of the remaining L1->L0 BPs match if none have been updated. The same behavior is possible with dnode search at L0.

   This is reachable from dsl_destroy_head for synchronous freeing. When this happens, open context fails to free objects, leaving sync context stuck freeing potentially many objects. This is also reachable from traverse_pool for extreme rewind, where it is theoretically possible that datasets not dirtied after txg are skipped if the MOS has high enough indirection to trigger this case.

In both of these cases, without backtracking the search ends prematurely, as an ESRCH result implies no more matches in the entire object.

Signed-off-by: Robert Evans <[email protected]>
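The broken invariant in case 2 can be condensed into a few lines (a toy model; `Block`, `free_l0`, and the birth-txg bookkeeping are hypothetical stand-ins for the real on-disk structures, not ZFS code):

```python
# Toy model of how a partial free rewrites an indirect block with a
# new birth txg, breaking "parent match implies child match".

class Block:
    def __init__(self, birth, children=None):
        self.birth = birth          # txg in which this block was written
        self.children = children if children is not None else {}

# An L1 indirect block with two L0 children, all written at txg 10.
l1 = Block(10, {0: Block(10), 1: Block(10)})

def free_l0(parent, idx, txg):
    """Freeing one child punches a hole and rewrites the parent,
    so the parent's birth txg advances to the freeing txg."""
    del parent.children[idx]
    parent.birth = txg

free_l0(l1, 1, txg=20)

# The parent now matches a "newer than txg 15" filter...
assert l1.birth > 15
# ...but no surviving child does, so a search filtering on txg > 15
# must backtrack rather than report ESRCH when all children fail.
assert all(c.birth <= 15 for c in l1.children.values())
```

This is exactly the state the commit message describes: the L1's updated BP reflects a newer birth txg even though none of its remaining children were updated.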
I've been working on reproducing the two issues here more precisely. With new tooling, I've been able to clearly identify the problems and demonstrate that this PR definitely fixes them. Apologies for the long comment, but hopefully this aids in building a better understanding of the problems here. @robn Let me know if you have some time soon to work on this?

1) free_children panic

What causes the panic?

Consider a sparse file with 3 populated L2 blocks:
If the L0 block at offset 0x10000000000 is overwritten with zeros, sync will perform these steps:
There is a window of time between steps (5) and (8) where the L3->L2 BP still reflects the prior on-disk state, but the L2 dbuf contains only holes (or zeros). If the file is truncated to zero bytes during this window,
The search for on-disk blocks proceeds as follows:
(Note that unlike This means the L1 at offset 0x20000000000 is not dirtied at all if both:
Later, For this condition to be hit, the second free range must start in an L2 block prior to the one that contains the blocks freed in the first sync. This is so that it will walk the L3->L2 BP in a downward search. If it starts in the same block, then it will simply search the empty L2 block and continue upward at an offset at the end of the L2 block.

How to reproduce it?

The window above depends on timing, so I've created two new debug tools to allow reliably hitting it:
With these, it's now possible to create a reliable reproducer:

```shell
# create suitable dataset
zfs create test/ds -o recordsize=128k -o compression=on

# create the file
dd if=<(echo z) of=/test/ds/f bs=1 count=1 seek=1T
dd if=<(echo z) of=/test/ds/f bs=1 count=1 seek=2T
dd if=<(echo z) of=/test/ds/f bs=1 count=1 seek=3T

# sync the file to disk
zpool sync

# inject a 5 second delay at zio_ready for L2 writes
# and remount (-m) to evict all L1 dbufs
zinject -m -E 5 -l2 -T write -t data /test/ds/f

# prepare to wait (get sync token from kernel)
STATE=$(zinject -w0 -W0)

# zero out the block to eventually free it
dd if=/dev/zero of=/test/ds/f bs=1 count=1 seek=1T conv=notrunc

# wait for events after sync point above
zinject -w "${STATE?}"
zinject -c all

# now, racing with zio_ready, free the whole file
truncate --size=0 /test/ds/f

# sync will panic if truncate hits the window
zpool sync
```

The above script reaches the panic for me 100% of the time.

How does this PR fix the problem?

After this PR, the search will continue with an upward search if the lower level does not match. The script above no longer reproduces the panic with this PR applied.

Alternatives considered to this PR?
2) txg > 0, non-hole case

What is the problem?

When searching for blocks or dnodes with

Consider this DMU dataset object with
Suppose the L0 block at 0x1004000 is freed -- after the one file in that block is deleted. At the same time, suppose a file in L0 block 0x2000000 gets updated (e.g. mtime updated). Then the dataset layout becomes:
Notably, the L1 block at 0x1000000 has birth txg 6317075 but contains no child L0 blocks at that txg. And the L1 block at 0x2000000 is updated to the same birth txg. Suppose
This is wrong as there is a block at 0x2000000 with matching txg that should be found instead. To reach this state, I did the following (via script):
The same problem also applies to ordinary files, at least in theory. Consider:
A similar file can be created with:

```shell
dd if=<(echo z) of=/test/f bs=1 count=1 seek=$((0x8000000)) conv=notrunc
dd if=<(echo z) of=/test/f bs=1 count=1 seek=$((0x8020000)) conv=notrunc
dd if=<(echo z) of=/test/f bs=1 count=1 seek=$((0x10000000)) conv=notrunc
zpool sync
dd if=/dev/zero of=/test/f bs=1 count=1 seek=$((0x8020000)) conv=notrunc
dd if=<(echo a) of=/test/f bs=1 count=1 seek=$((0x10000000)) conv=notrunc
```

Similar to the dataset above, the L1 block at offset 0x8000000 is newer because it contains a newer hole, and the L1 block at offset 0x10000000 is newer because it has been modified. The same issue happens if
What does this break?
There is no way to reach this for ordinary files as

How to reproduce it?

For this, I've added two new ioctls to allow directly inspecting the objects:
For each of these, I also implemented matching Then, I can use python
With these, I can directly call the corresponding functions to observe the unwanted behavior. For the dataset example:
For the plain file example:
How does this PR fix the problem?

Same as above, the search will continue with an upward search when this condition occurs. Once this patch is applied, the cases above no longer fail.

Dataset case:
Plain file case:
The new tools can also observe exactly the
With the patch applied:
Conclusion

The new tooling conclusively reproduces and demonstrates that this PR fixes both the panic and the
Thanks for the detailed walk-through of how both of these problem cases can occur. It took me a while to digest everything, but I was able to eventually convince myself this looks right. Let's see if we can also get @robn and @avg-I to take a look. @avg-I identified the txg > 0 issue and proposed an alternate fix in #11200, so I'm sure he's familiar with this bit of the code.
It'd be great to additionally pull in your zinject improvements. Those could definitely be handy in the future when trying to reproduce other similar subtle bugs. Plus, it would let us add your test case for this to the test suite.
Thanks for reviewing @behlendorf! I'll send along the other PRs for the

Also I've been working on a separate PR to clean up the
I have a draft commit that addresses this by having This approach fixes those above issues and makes it clear that
@rrevans Your script to reliably reproduce the
@rrevans your draft commit looks nice; that really helps with the readability and ability to reason about this code. However, rather than folding those changes into this PR, let me merge this more limited change and you can follow up with that bigger rework in a new PR.
Thanks!
Short answer: not in production. So I've been deep in this code because I am in pursuit of sync-free llseek(SEEK_HOLE/DATA), where the DFS traversal is mandatory to skip over live state that mismatches the disk-committed state. Since this change is non-trivial, I have been running synthetic tests against my development machine en masse. This is mainly ztest in a loop, the ZTS suite in a loop, and custom llseek stressors. (I don't have prod workloads on ZFS; my day job is infra software, but I contribute here on my own time.) The free panic occurred about once per day in a ztest loop as I recall, but I've lost track of the setup conditions. Hope this helps!

Edit: My notes say I was running my llseek stressor x100 with specific tuning and was able to trigger it in 20-30 seconds. I can give that another go if it helps you? TL;DR: holes added and removed from 100 large sparse files randomly.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Akash B <[email protected]>
Signed-off-by: Robert Evans <[email protected]>
Closes openzfs#16025
Closes openzfs#11196
Sorry for the no-show, life excuses etc. This is great work, and the explainer is A+. Thank you!
The offset --> blkid cleanup is #17792
This changes the basic search algorithm from a single search up and down the tree to a full depth-first traversal to handle conditions where the tree matches at a higher level but not a lower level.
Motivation and Context
Normally a higher level search match in the first loop of dnode_next_offset always points to a matching block in the second loop, but there are cases where this does not happen:

1. Racing block pointer updates from dbuf_write_ready.

   Before f664f1e (Reduce lock contention on dn_struct_rwlock, #8946), both dbuf_write_ready and dnode_next_offset held dn_struct_rwlock, which protected against pointer writes from concurrent syncs. This no longer applies, so sync context can e.g. clear or fill all L1->L0 BPs before the L2->L1 BP and higher BPs are updated. dnode_free_range in particular can reach this case and skip over L1 blocks that need to be dirtied. Later, sync will panic in free_children when trying to clear a non-dirty indirect block. This case was found with ztest.

2. txg > 0, non-hole case. This is #11196 (subtle bug in dnode_next_offset() with txg > 0).

   Freeing blocks/dnodes breaks the assumption that a match at a higher level implies a match at a lower level when filtering txg > 0. Whenever some but not all L0 blocks are freed, the parent L1 block is rewritten. Its updated L2->L1 BP reflects a newer birth txg. Later when searching by txg, if the L1 block matches since the txg is newer, it is possible that none of the remaining L1->L0 BPs match if none have been updated. The same behavior is possible with dnode search at L0.

   This is reachable from dsl_destroy_head for synchronous freeing. When this happens, open context fails to free objects, leaving sync context stuck freeing potentially many objects. This is also reachable from traverse_pool for extreme rewind, where it is theoretically possible that datasets not dirtied after txg are skipped if the MOS has high enough indirection to trigger this case.

In both of these cases, without backtracking the search ends prematurely, as an ESRCH result implies no further matches in the entire object.

This PR is also a first step towards teaching dnode_next_offset to consider dirty dbuf state. In the next PR, dnode_next_offset_level is modified to stop at any dirty indirect block when a new flag is set. This allows dnode_next_offset to match dirty L0 blocks (or freed-but-not-synced L0 ranges) the same as synced-out data blocks (or holes). However, that approach requires backtracking since a dirty higher-level indirect may not match once the L0/L1 state is inspected (e.g. consider a data search reaching an L2 block that is dirty but all L0 blocks previously created under that L2 are now newly freed in dirty state).

Description
Old algorithm:

1. Start at minlvl.
2. Increment lvl until a matching block is found or maxlvl is exceeded.
3. Decrement lvl until minlvl is reached or no match is found.

New algorithm:

1. Start at minlvl.
2. a. If matched, decrement lvl until minlvl is reached.
   b. If not matched, adjust offset to the next BP at lvl+1 and increment lvl.

The new algorithm continues the search at the next possible offset at the next higher level when no match is found. This performs an in-order traversal of the tree while skipping non-existing or non-matching ranges.
How Has This Been Tested?

Many ztest and ZTS runs as well as seek(SEEK_DATA/SEEK_HOLE) stress tests. This surfaced a lot of problems getting *offset semantics right, and also found a novel PANIC in free_children which this happens to fix.

I don't know how to really test maxlvl == 0 changes (see also comments in #11200), and it would be nice to have more unit-oriented tests for dnode_next_offset. Any feedback appreciated.

Types of changes

Checklist:

Signed-off-by.