dnode: fix how we track and check dirtyness #17658
Bumped when we take the dirty hold in dnode_setdirty(), dropped when the dnode is finally cleaned up after sync in dnode_rele_task() or userquota_updates_task(). This gives us a way to check if the dnode is dirty on any txg without having to rely on outside information (eg presence on a dirty list), which has been a rich source of bugs in the past. Suggested-by: Robert Evans <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
dn_dirty_txg only existed for DNODE_IS_DIRTY(). In turn, that only existed to ensure that a dnode was clean before making it eligible for removal from the array of cached dnodes attached to the object 0 L0 dbuf. dn_dirtycnt is enough to check that now, so use it directly and remove the rest. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Old debug param, not used for anything. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Only used for a couple of debug assertions which had very little value. Setting it required taking certain locks, so we can remove all that too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
That's a lot of fixes. Muckraking indeed. One thought: I think dn_dirtyctx and firstset are worth keeping, as they enforce the invariant that dnodes only get dirtied from open or sync context but never both. Or I think that's an invariant? I don't remember clearly, but it's worth understanding before ripping that part out. At least to write down why it's obsolete if it is.
As time goes on, I find myself more and more delighted by stupid wordplay. Not yet sure if it's just generally getting old or if it's a symptom of very little adult human contact (100% home office), that is, no immediate feedback that no, please, stop.
ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT ||
dn->dn_dirtyctx == DN_UNDIRTIED || dn->dn_dirtyctx ==
(dmu_tx_is_syncing(tx) ? DN_DIRTY_SYNC : DN_DIRTY_OPEN));
The assertions are debug-only, but the setup wasn't, so the call to The invariant is still true I expect; it intuitively makes sense, but I don't really have this locking deeply internalised. If we want to keep it, I think the way I would try to do it is make
Clever, but maybe too clever for a debug facility? I think it's much clearer to have the plain counter because it's so much more obvious.
A fix that's also a simplification, hooray!
LGTM.
I agree, that's why I just ripped it all out.
@robn Looking closely, I don't think If we want to really protect these invariants, we'd prefer to teach Anyway, TL;DR- I agree
Thanks for seeing this through! It's all quite elegant and tidy. LGTM
If we are keeping score of the number of times this bug has shown up in the past, add https://github.com/openzfs/zfs/pull/16019/files#r1536719525 to the list?
Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Robert Evans <[email protected]> Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16297 Closes #17652 Closes #17658
dn_dirty_txg only existed for DNODE_IS_DIRTY(). In turn, that only existed to ensure that a dnode was clean before making it eligible for removal from the array of cached dnodes attached to the object 0 L0 dbuf. dn_dirtycnt is enough to check that now, so use it directly and remove the rest. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Robert Evans <[email protected]> Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16297 Closes #17652 Closes #17658
Old debug param, not used for anything. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Robert Evans <[email protected]> Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16297 Closes #17652 Closes #17658
Only used for a couple of debug assertions which had very little value. Setting it required taking certain locks, so we can remove all that too. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Robert Evans <[email protected]> Reviewed-by: Adam Moss <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #16297 Closes #17652 Closes #17658
Seems to be good otherwise. Thanks.
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
Fixes longstanding issues around tracking and checking for dnode dirtiness. Made famous by #15526, it came to our attention again in a recent CI failure (see #17652).
Fixes: #17652
Fixes: #16297
Fixes: #15526
Fixes: #13143
Fixes: #11900
Fixes: #11824
Fixes: #9104
Fixes: #9068
Fixes: #8048
Fixes: #7997
Fixes: #7933
Fixes: #7733
Fixes: #7147
Closes: #15615
(Yes, that's a ridiculously long list. Might not even be all of them; those were the ones that seemed to match by description, crash style and call trace).
Description
Here we add a counter to `dnode_t`, `dn_dirtycnt`, that counts the number of txgs this dnode is dirty on. This is incremented the first time a dnode is made dirty on a txg (`dnode_setdirty()`), and decremented once it has been synced to disk (`dnode_rele_task()`, `userquota_updates_task()`).

Full credit to @rrevans for the analysis and suggestion, see #15615 (comment) and #17652 (comment).
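
The counting scheme described above can be sketched as a small userspace model. This is not the OpenZFS code: `model_dnode_t`, the `model_*` functions, the `dn_dirty_on[]` flags, the `MODEL_TXG_SIZE` constant and the use of a pthread mutex are all assumptions made for illustration. It only shows the idea of bumping a counter the first time a dnode is dirtied on a txg, dropping it after that txg has synced, and checking it under the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: number of txg slots tracked by this model. */
#define	MODEL_TXG_SIZE	4

/* Hypothetical stand-in for dnode_t; only what the sketch needs. */
typedef struct model_dnode {
	pthread_mutex_t	dn_mtx;
	/* Model-only flag: already dirty on this txg slot? */
	bool		dn_dirty_on[MODEL_TXG_SIZE];
	/* Number of txgs this dnode is dirty on (type is an assumption). */
	uint32_t	dn_dirtycnt;
} model_dnode_t;

static void
model_dnode_init(model_dnode_t *dn)
{
	pthread_mutex_init(&dn->dn_mtx, NULL);
	for (int i = 0; i < MODEL_TXG_SIZE; i++)
		dn->dn_dirty_on[i] = false;
	dn->dn_dirtycnt = 0;
}

/* Models the dirtying side: bump only on the first dirty for a txg. */
static void
model_dnode_setdirty(model_dnode_t *dn, uint64_t txg)
{
	pthread_mutex_lock(&dn->dn_mtx);
	if (!dn->dn_dirty_on[txg % MODEL_TXG_SIZE]) {
		dn->dn_dirty_on[txg % MODEL_TXG_SIZE] = true;
		dn->dn_dirtycnt++;
	}
	pthread_mutex_unlock(&dn->dn_mtx);
}

/* Models the post-sync cleanup: drop the count for the synced txg. */
static void
model_dnode_synced(model_dnode_t *dn, uint64_t txg)
{
	pthread_mutex_lock(&dn->dn_mtx);
	assert(dn->dn_dirty_on[txg % MODEL_TXG_SIZE]);
	assert(dn->dn_dirtycnt > 0);
	dn->dn_dirty_on[txg % MODEL_TXG_SIZE] = false;
	dn->dn_dirtycnt--;
	pthread_mutex_unlock(&dn->dn_mtx);
}

/* The dirty check needs no outside information, only the counter. */
static bool
model_dnode_is_dirty(model_dnode_t *dn)
{
	pthread_mutex_lock(&dn->dn_mtx);
	bool dirty = (dn->dn_dirtycnt > 0);
	pthread_mutex_unlock(&dn->dn_mtx);
	return (dirty);
}
```

In this model, a dnode dirtied on two txgs stays dirty until both have synced, regardless of whether it currently appears on any dirty list.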
After that, `dnode_is_dirty()` becomes a simple check under lock.

Finally, we remove the other efforts at dirtyness tracking: `dn_dirtyctx`, `dn_dirtyctx_firstset`, `dn_dirty_txg`, `DNODE_IS_DIRTY()`, `dnode_set_dirtyctx()`. These were either unused, insufficient, or made redundant by the new counter, and have all made the situation just a little more complicated each time.

How Has This Been Tested?
1000 runs of `seekflood 2000 6` completed without issue. Previous attempts had triggered the crash in #17652 at ~400 and ~80.

Full ZTS run completed on 6.12.38.