client/db: Close missing body gaps for non archive nodes#11332
Conversation
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
| // Non archive nodes cannot fill the missing block gap with bodies. | ||
| // If the gap is present, it means that every restart will try to fill the gap: | ||
| // - a block request is made for each and every block in the gap | ||
| // - the request is fulfilled putting pressure on the network and other nodes | ||
| // - upon receiving the block, the block cannot be executed since the state | ||
| // of the parent block might have been discarded | ||
| // - then the sync engine closes the gap in memory, but never in DB. | ||
| // | ||
| // This leads to inefficient syncing and high CPU usage on every restart. To mitigate this, | ||
| // remove the gap from the DB if we detect it and the current node is not an archive. |
There was a problem hiding this comment.
How did it come to this? Because someone restarted the node using a different pruning mode?
There was a problem hiding this comment.
Apparently, it was just the collator functioning "normally" and no params were changed other than switching to the master PR (which now has import_existing: true).
I believe I had a look over the client info a few years ago, reporting gaps that should be filled or that should have never existed, but I dismissed that as stopping the node at the wrong time or switching params.
I think this code was silently filling the memory gap but never informed the DB layer about not caring for the block execution:
polkadot-sdk/substrate/client/service/src/client/client.rs
Lines 1809 to 1812 in 2b9576c
And now with the optimization of not even fetching the BODY, its assumed that the DB will never close it:
polkadot-sdk/substrate/client/service/src/client/client.rs
Lines 1809 to 1812 in 2b9576c
Not entirely sure what might have caused the gap in the first place, since the nodes were running without the optimization of stripping BODY initially 🤔 So i would have expected that all bodies were fetched as well
lrubasze
left a comment
There was a problem hiding this comment.
LGTM!
Just thinking that maybe it would make sense to add a test (or modify current) that checks if multi-block gaps are closed as well. Seems more real-world scenario.
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
|
Still unclear to me what happened. IMO this PR is fixing the symptom, not the issue. It looks like we write this kind of gap only here: polkadot-sdk/substrate/client/db/src/lib.rs Line 1914 in 8604871 But there we only go if polkadot-sdk/substrate/client/db/src/lib.rs Line 1895 in 8604871 So this means that this issue occured at the tip. And at the tip I would expect that we can just request the body? Lukasz changes where only for gap sync below the warp target. |
|
Versions deployed on the chain:
Before Lukas changes #10373, we checked if the incoming block had a body: polkadot-sdk/substrate/client/db/src/lib.rs Line 1520 in 4e02aa2 polkadot-sdk/substrate/client/db/src/lib.rs Line 1714 in 4e02aa2
After the changes, the two checks are separate (ie we check if the body exists in the DB not the import) polkadot-sdk/substrate/client/db/src/lib.rs Lines 1595 to 1599 in d5c5fbe So, before the Lukas change, we opened a gap that wasn't supposed to get open (most probably via importing a new block as best): polkadot-sdk/cumulus/client/consensus/common/src/parachain_consensus.rs Lines 531 to 534 in d5c5fbe Then because the import was header-only, the client/db assumes we need to fil the body via gap sync. Which then cascades into a sync loop restart when we switched to Lukas's changes:
TLDR;
Would love to double check this with you guys 🙏 @skunert @lrubasze |
|
So you are saying that all nodes that switch from old releases to the new release will have these "accidental gaps" and run into this bug? If yes, then we need to merge this. |
Yep, this has happened on 6 collators: 3 from polkadot-yap and 3 from kusama-yap parachains |
|
Have double checked AH-Westend which contains the latest 2603-rc4:
|
|
@lexnv amazing analysis and great explanation. |
|
Thanks a lot Lukas for double checking 🙏 |
|
/cmd prdoc --audience node_dev --bump patch |
…e_dev --bump patch'
|
@lexnv we need to backport this? |
|
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts. git fetch origin backport-11332-to-stable2512
git worktree add --checkout .worktree/backport-11332-to-stable2512 backport-11332-to-stable2512
cd .worktree/backport-11332-to-stable2512
git reset --hard HEAD^
git cherry-pick -x 0f64cfcaf9f1665216d2ea5e5f66a8e632ce423c
git push --force-with-lease |
This PR closes missing body gaps in the database for non-archive nodes. Effectively, a missing body gap cannot be closed on the DB side if the node is non-archive. Since execution is already skipped, the node will close the memory gap in the sync engine; however, the gap remains open in the db. This leads to wasting resources at every startup: - client info contains a gap that cannot be filled (since we don't have the state around for execution) - blocks are fetched from the connected peers - gap is filled by ignoring blocks in the sync engine Further, for collators on origin master this causes an infinite loop of sync engine restarts that get punished via banning and disconnecting. For more details and root cause check: - #11330 Part of: - #11299 --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io> Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com> (cherry picked from commit 0f64cfc)
|
Successfully created backport PR for |
Backport #11332 into `stable2603` from lexnv. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot. <!-- # To be used by other automation, do not modify: original-pr-number: #${pull_number} --> Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io> Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com> Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This PR closes missing body gaps in the database for non-archive nodes.
Effectively, a missing body gap cannot be closed on the DB side if the node is non-archive. Since execution is already skipped, the node will close the memory gap in the sync engine; however, the gap remains open in the db.
This leads to wasting resources at every startup:
Further, for collators on origin master this causes an infinite loop of sync engine restarts that get punished via banning and disconnecting. For more details and root cause check:
Part of: