client/db: Close missing body gaps for non archive nodes#11332

Merged
lexnv merged 6 commits into master from lexnv/nuke-gaps
Mar 19, 2026

Contributor

@lexnv lexnv commented Mar 10, 2026

This PR closes missing body gaps in the database for non-archive nodes.

Effectively, a missing body gap cannot be closed on the DB side if the node is non-archive. Since execution is already skipped, the node will close the memory gap in the sync engine; however, the gap remains open in the db.

This leads to wasting resources at every startup:

  • client info contains a gap that cannot be filled (since we don't have the state around for execution)
  • blocks are fetched from the connected peers
  • gap is filled by ignoring blocks in the sync engine
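
As a rough illustration of the intended DB-side behaviour, here is a minimal sketch. It is not the actual `client/db` code: `BlockGap`, `BlockGapType` and `gap_to_keep` are simplified, hypothetical stand-ins, and the only rule modelled is "drop a missing-body gap at startup when the node is not an archive".

```rust
/// Simplified stand-in for the gap types discussed in this PR (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BlockGapType {
    /// Both headers and bodies are missing in the range.
    MissingHeaderAndBody,
    /// Headers are present, bodies are missing.
    MissingBody,
}

/// Simplified stand-in for a gap recorded in the client info (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BlockGap {
    start: u64,
    end: u64,
    gap_type: BlockGapType,
}

/// Decide which gap, if any, should stay recorded in the DB at startup.
///
/// Assumption: a non-archive node can never fill a `MissingBody` gap,
/// so keeping it only triggers useless block requests on every restart.
fn gap_to_keep(gap: Option<BlockGap>, is_archive: bool) -> Option<BlockGap> {
    match gap {
        Some(g) if g.gap_type == BlockGapType::MissingBody && !is_archive => None,
        other => other,
    }
}
```

In this model, `gap_to_keep(Some(gap), false)` returns `None` for a `MissingBody` gap, while archive nodes and `MissingHeaderAndBody` gaps are left untouched.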

Further, for collators on origin master this causes an infinite loop of sync engine restarts that get punished via banning and disconnecting. For more details and the root cause, see:

  • #11330

Part of:

  • #11299

lexnv added 3 commits March 10, 2026 14:22
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv self-assigned this Mar 10, 2026
@lexnv lexnv added the T0-node This PR/Issue is related to the topic “node”. label Mar 10, 2026
Comment on lines +1366 to +1375
// Non archive nodes cannot fill the missing block gap with bodies.
// If the gap is present, it means that every restart will try to fill the gap:
// - a block request is made for each and every block in the gap
// - the request is fulfilled putting pressure on the network and other nodes
// - upon receiving the block, the block cannot be executed since the state
//   of the parent block might have been discarded
// - then the sync engine closes the gap in memory, but never in DB.
//
// This leads to inefficient syncing and high CPU usage on every restart. To mitigate this,
// remove the gap from the DB if we detect it and the current node is not an archive.
Member

How did it come to this? Because someone restarted the node using a different pruning mode?

Contributor Author

Apparently, it was just the collator functioning "normally" and no params were changed other than switching to the master PR (which now has import_existing: true).

I believe I had a look over the client info a few years ago, reporting gaps that should be filled or that should have never existed, but I dismissed that as stopping the node at the wrong time or switching params.

I think this code was silently filling the memory gap but never informed the DB layer about not caring for the block execution:

And now with the optimization of not even fetching the BODY, it's assumed that the DB will never close it:

Not entirely sure what might have caused the gap in the first place, since the nodes were initially running without the optimization of stripping BODY 🤔 So I would have expected all bodies to have been fetched as well.

Contributor

@lrubasze lrubasze left a comment

LGTM!

Just thinking that maybe it would make sense to add a test (or modify the current one) that checks that multi-block gaps are closed as well. That seems like a more realistic scenario.

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Contributor

skunert commented Mar 12, 2026

Still unclear to me what happened. IMO this PR is fixing the symptom, not the issue.

It looks like we write this kind of gap only here:

gap_type: BlockGapType::MissingBody,

But we only get there if create_gap is true and the block is above the best block:

} else if operation.create_gap {

So this means that this issue occurred at the tip. And at the tip I would expect that we can just request the body? Lukasz's changes were only for gap sync below the warp target.

Contributor Author

lexnv commented Mar 13, 2026

Versions deployed on the chain:

  • t0: stable2512-2 clean sync
  • t1: master-14852d21 import_existing: true, which caused the sync loop
  • t2: stable260

Before Lukas's changes in #10373, we checked if the incoming block had a body:

let existing_body = pending_block.body.is_some();

let should_check_block_gap = !existing_header || !existing_body;

  • a header-only import might trick the DB into thinking the body doesn't exist, although the body is already in the DB

After the changes, the two checks are separate (i.e. we check whether the body exists in the DB, not in the import):

// Body in DB (not incoming block) - needed to update gap when adding body to existing
// header.
let body_exists_in_db = self.blockchain.body(hash)?.is_some();
// Incoming block has body - used for fast sync gap handling.
let incoming_has_body = pending_block.body.is_some();
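
To make the difference concrete, here is a minimal, self-contained sketch of the two decisions. It is a simplified model and not the real `client/db` code: `ImportView`, `should_check_gap_old` and `should_check_gap_new` are hypothetical names, and the exact shape of the post-#10373 condition is an assumption based only on the snippets quoted in this comment.

```rust
/// What the import path knows about a block (hypothetical, simplified).
struct ImportView {
    /// The header is already stored in the DB.
    header_in_db: bool,
    /// The body is already stored in the DB.
    body_in_db: bool,
    /// The incoming import carries a body.
    incoming_has_body: bool,
}

/// Check as quoted for the code before #10373:
/// "existing body" is derived from the incoming block, not from the DB.
fn should_check_gap_old(v: &ImportView) -> bool {
    let existing_header = v.header_in_db;
    let existing_body = v.incoming_has_body;
    !existing_header || !existing_body
}

/// Assumed shape of the check after #10373: the DB decides whether the body exists.
/// `incoming_has_body` is kept separate (used for fast sync gap handling in the
/// real code) and deliberately plays no role in this model.
fn should_check_gap_new(v: &ImportView) -> bool {
    !v.header_in_db || !v.body_in_db
}
```

For a header-only re-import of a block whose header and body are both already stored (`header_in_db: true, body_in_db: true, incoming_has_body: false`), the old check returns `true` and the assumed new check returns `false`.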

So, before Lukas's change, we opened a gap that wasn't supposed to be opened (most probably via importing a new block as best):

async fn import_block_as_new_best<Block, P>(hash: Block::Hash, header: Block::Header, parachain: &P)

// Make it the new best block
let mut block_import_params = BlockImportParams::new(BlockOrigin::ConsensusBroadcast, header);
block_import_params.fork_choice = Some(ForkChoiceStrategy::Custom(true));
block_import_params.import_existing = true;

Then, because the import was header-only, the client/db assumes we need to fill the body via gap sync. This then cascades into a sync restart loop once we switched to Lukas's changes:

  • we request the block to fill the gap
  • now because import_existing: true we execute the block
  • execution fails because there's no parent state
  • we go back to step 1 trying to fill the gap
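
The four steps above can be modelled with a small bounded simulation. This is only a sketch under simplifying assumptions: `run_gap_sync` and `SyncOutcome` are hypothetical names, the block request always succeeds, and the real sync engine has no restart budget (`max_restarts` exists only so the model terminates).

```rust
/// Outcome of the modelled gap sync (hypothetical model).
#[derive(Debug, PartialEq, Eq)]
enum SyncOutcome {
    /// The sync engine stops retrying: no gap was recorded or the block went through.
    Settled,
    /// The engine was still restarting when the attempt budget ran out.
    StillLooping { restarts: u32 },
}

/// Model the restart loop described in the steps above.
fn run_gap_sync(
    gap_open: bool,
    has_parent_state: bool,
    import_existing: bool,
    max_restarts: u32,
) -> SyncOutcome {
    if !gap_open {
        // Nothing recorded in the DB, so gap sync never starts.
        return SyncOutcome::Settled;
    }
    let mut restarts = 0;
    while restarts < max_restarts {
        // Step 1: request the block to fill the gap (always succeeds in this model).
        // Step 2: with `import_existing: true` the block is executed.
        let needs_execution = import_existing;
        // Step 3: execution requires the parent state.
        let imported = !needs_execution || has_parent_state;
        if imported {
            return SyncOutcome::Settled;
        }
        // Step 4: back to step 1.
        restarts += 1;
    }
    SyncOutcome::StillLooping { restarts }
}
```

In this model, `run_gap_sync(true, false, true, 5)` exhausts its budget and reports `StillLooping { restarts: 5 }`, whereas removing the gap (`gap_open: false`) settles immediately.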

TL;DR:

  • stable2512-2 had a bug in the DB which opened a gap when it shouldn't have
  • master-14852d21 is actually fixing the bug, but because we also changed the import_existing to true, and we hit the bug from stable2512-2, we try to fill a gap and enter a sync restart loop
  • Since we got data for this happening on 2 parachains (yap polkadot and yap kusama), I would still consider this fix

Would love to double check this with you guys 🙏 @skunert @lrubasze

Contributor

skunert commented Mar 17, 2026

So you are saying that all nodes that switch from old releases to the new release will have these "accidental gaps" and run into this bug? If yes, then we need to merge this.

Contributor Author

lexnv commented Mar 18, 2026

> So you are saying that all nodes that switch from old releases to the new release will have these "accidental gaps" and run into this bug? If yes, then we need to merge this.

Yep, this has happened on 6 collators: 3 from polkadot-yap and 3 from kusama-yap parachains

Contributor Author

lexnv commented Mar 18, 2026

Have double checked AH-Westend which contains the latest 2603-rc4:

@lrubasze
Contributor

@lexnv amazing analysis and great explanation.
Pretty twisted scenario :)

Contributor Author

lexnv commented Mar 18, 2026

Thanks a lot Lukas for double checking 🙏
Yep, this scenario took me by surprise as well :D

Contributor Author

lexnv commented Mar 19, 2026

/cmd prdoc --audience node_dev --bump patch

@lexnv lexnv added this pull request to the merge queue Mar 19, 2026
Merged via the queue into master with commit 0f64cfc Mar 19, 2026
245 of 249 checks passed
@lexnv lexnv deleted the lexnv/nuke-gaps branch March 19, 2026 16:13
Member

bkchr commented Mar 19, 2026

@lexnv we need to backport this?

@lexnv lexnv added the A4-backport-stable2512 Pull request must be backported to the stable2512 release branch label Mar 20, 2026
@lexnv lexnv added the A4-backport-stable2603 Pull request must be backported to the stable2603 release branch label Mar 20, 2026
@paritytech-release-backport-bot

Created backport PR for stable2512:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-11332-to-stable2512
git worktree add --checkout .worktree/backport-11332-to-stable2512 backport-11332-to-stable2512
cd .worktree/backport-11332-to-stable2512
git reset --hard HEAD^
git cherry-pick -x 0f64cfcaf9f1665216d2ea5e5f66a8e632ce423c
git push --force-with-lease

paritytech-release-backport-bot bot pushed a commit that referenced this pull request Mar 20, 2026
This PR closes missing body gaps in the database for non-archive nodes.

Effectively, a missing body gap cannot be closed on the DB side if the
node is non-archive. Since execution is already skipped, the node will
close the memory gap in the sync engine; however, the gap remains open
in the db.

This leads to wasting resources at every startup:
- client info contains a gap that cannot be filled (since we don't have
the state around for execution)
- blocks are fetched from the connected peers
- gap is filled by ignoring blocks in the sync engine

Further, for collators on origin master this causes an infinite loop of
sync engine restarts that get punished via banning and disconnecting.
For more details and root cause check:
- #11330

Part of:
- #11299

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
(cherry picked from commit 0f64cfc)
@paritytech-release-backport-bot

Successfully created backport PR for stable2603:

EgorPopelyaev pushed a commit that referenced this pull request Mar 24, 2026
Backport #11332 into `stable2603` from lexnv.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.

<!--
  # To be used by other automation, do not modify:
  original-pr-number: #${pull_number}
-->

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>