client/db: Close missing body gaps for non archive nodes by lexnv · Pull Request #11332 · paritytech/polkadot-sdk

lexnv · 2026-03-10T14:59:25Z

This PR closes missing body gaps in the database for non-archive nodes.

In effect, a missing body gap cannot be closed on the DB side if the node is non-archive. Since execution is already skipped, the node closes the gap in memory in the sync engine; however, the gap remains open in the DB.

This leads to wasting resources at every startup:

- client info contains a gap that cannot be filled (since we don't have the state around for execution)
- blocks are fetched from the connected peers
- gap is filled by ignoring blocks in the sync engine
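
The startup behaviour this PR aims for can be sketched roughly as follows. This is a simplified model and not the code in `substrate/client/db`: the `BlockGap` struct, the `is_archive` flag, and the `close_unfillable_gap` function are hypothetical names chosen for illustration, and the `MissingHeaderAndBody` variant is included only as an assumed contrast case.

```rust
/// Kind of gap recorded in the database. Only `MissingBody` is named in this
/// PR's discussion; `MissingHeaderAndBody` is an assumed contrast case.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GapType {
    MissingHeaderAndBody,
    MissingBody,
}

/// An inclusive range of block numbers that the node believes is incomplete.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockGap {
    pub start: u64,
    pub end: u64,
    pub gap_type: GapType,
}

/// Hypothetical startup check: returns the gap that should stay persisted.
///
/// A non-archive node cannot execute the gap's blocks (the parent state may
/// have been pruned), so a `MissingBody` gap is dropped instead of being
/// re-requested from peers on every restart.
pub fn close_unfillable_gap(gap: Option<BlockGap>, is_archive: bool) -> Option<BlockGap> {
    match gap {
        Some(g) if g.gap_type == GapType::MissingBody && !is_archive => None,
        other => other,
    }
}
```

In this sketch an archive node keeps the gap, since it still has the state needed to fill it, and gaps of other types are left untouched.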

Further, for collators on origin master this causes an infinite loop of sync engine restarts that get punished via banning and disconnecting. For more details and root cause check:

aura/import: Skip block execution when collators have no parent block state #11330

Part of:

master/regression: Parachain Gap cannot be filled #11299

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

bkchr · 2026-03-10T22:26:38Z

substrate/client/db/src/lib.rs

+		// Non archive nodes cannot fill the missing block gap with bodies.
+		// If the gap is present, it means that every restart will try to fill the gap:
+		// - a block request is made for each and every block in the gap
+		// - the request is fulfilled putting pressure on the network and other nodes
+		// - upon receiving the block, the block cannot be executed since the state
+		//  of the parent block might have been discarded
+		// - then the sync engine closes the gap in memory, but never in DB.
+		//
+		// This leads to inefficient syncing and high CPU usage on every restart. To mitigate this,
+		// remove the gap from the DB if we detect it and the current node is not an archive.


How did it come to this? Because someone restarted the node using a different pruning mode?

Apparently, it was just the collator functioning "normally" and no params were changed other than switching to the master PR (which now has import_existing: true).

I believe I had a look over the client info a few years ago, reporting gaps that should be filled or that should have never existed, but I dismissed that as stopping the node at the wrong time or switching params.

I think this code was silently filling the memory gap but never informed the DB layer about not caring for the block execution:

polkadot-sdk/substrate/client/service/src/client/client.rs

Lines 1809 to 1812 in 2b9576c

	BlockStatus::InChainPruned if !import_existing => {
		return Ok(ImportResult::AlreadyInChain)
	},
	BlockStatus::InChainPruned => {},
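
A stand-alone sketch of what these two match arms decide may help. The enums and the `decide` function below are simplified stand-ins and not the client's actual types; only the names `InChainPruned`, `import_existing`, and `AlreadyInChain` come from the quoted snippet, everything else is hypothetical.

```rust
/// Simplified block status; only `InChainPruned` mirrors the quoted snippet.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    InChainPruned,
    Unknown,
}

/// Hypothetical outcome of the early status check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    /// Return early: the block is reported as already in the chain.
    AlreadyInChain,
    /// Fall through to the rest of the import path (which may execute the block).
    ContinueImport,
}

/// Mirrors the shape of the two `InChainPruned` arms: the early return is
/// taken only when `import_existing` is false.
pub fn decide(status: Status, import_existing: bool) -> Decision {
    match status {
        Status::InChainPruned if !import_existing => Decision::AlreadyInChain,
        Status::InChainPruned => Decision::ContinueImport,
        Status::Unknown => Decision::ContinueImport,
    }
}
```

With `import_existing` set to true, a pruned block no longer takes the early return, which is how a block without parent state can end up on the execution path.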

And now, with the optimization of not even fetching the BODY, it's assumed that the DB will never close it:

polkadot-sdk/substrate/client/service/src/client/client.rs

Lines 1809 to 1812 in 2b9576c

	BlockStatus::InChainPruned if !import_existing => {
		return Ok(ImportResult::AlreadyInChain)
	},
	BlockStatus::InChainPruned => {},

Not entirely sure what might have caused the gap in the first place, since the nodes were initially running without the optimization of stripping the BODY 🤔 So I would have expected that all bodies were fetched as well.

lrubasze

LGTM!

Just thinking that maybe it would make sense to add a test (or modify the current one) that checks that multi-block gaps are closed as well. That seems like a more real-world scenario.


skunert · 2026-03-12T14:28:55Z

Still unclear to me what happened. IMO this PR is fixing the symptom, not the issue.

It looks like we write this kind of gap only here:

polkadot-sdk/substrate/client/db/src/lib.rs

Line 1914 in 8604871

gap_type: BlockGapType::MissingBody,

But we only get there if create_gap is true and the block is above the best block:

polkadot-sdk/substrate/client/db/src/lib.rs

Line 1895 in 8604871

} else if operation.create_gap {

So this means that this issue occurred at the tip. And at the tip I would expect that we can just request the body? Lukasz's changes were only for gap sync below the warp target.

lexnv · 2026-03-13T17:11:01Z

Versions deployed on the chain:

t0: stable2512-2 clean sync
t1: master-14852d21 import_existing: true, which caused the sync loop
t2: stable260

Before Lukas's changes in #10373, we checked whether the incoming block had a body:

polkadot-sdk/substrate/client/db/src/lib.rs

Line 1520 in 4e02aa2

let existing_body = pending_block.body.is_some();

polkadot-sdk/substrate/client/db/src/lib.rs

Line 1714 in 4e02aa2

let should_check_block_gap = !existing_header || !existing_body;

A header-only import might trick the DB into thinking the body doesn't exist, although the body is already in the DB.

After the changes, the two checks are separate (i.e., we check whether the body exists in the DB, not in the incoming block):

polkadot-sdk/substrate/client/db/src/lib.rs

Lines 1595 to 1599 in d5c5fbe

    
	// Body in DB (not incoming block) - needed to update gap when adding body to existing
	// header.
	let body_exists_in_db = self.blockchain.body(hash)?.is_some();
	// Incoming block has body - used for fast sync gap handling.
	let incoming_has_body = pending_block.body.is_some();
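
To make the difference concrete, here is a minimal sketch that compares the two checks on a header-only re-import of a block whose body is already stored. The `Stored` and `Incoming` structs and both function names are hypothetical. The "old" function follows the two quoted lines from `4e02aa2` (assuming `existing_header` reflects what the DB holds); the "new" function is only an assumption about how a DB-based check would behave, not the actual condition in `d5c5fbe`.

```rust
/// What the DB already holds for a given block hash (simplified).
#[derive(Debug, Clone, Copy)]
pub struct Stored {
    pub header: bool,
    pub body: bool,
}

/// What the incoming import operation carries (simplified).
#[derive(Debug, Clone, Copy)]
pub struct Incoming {
    pub has_body: bool,
}

/// Old shape: "existing body" is derived from the incoming block, so a
/// header-only import looks like a missing body even if the DB has it.
pub fn should_check_gap_old(stored: Stored, incoming: Incoming) -> bool {
    let existing_header = stored.header;
    let existing_body = incoming.has_body;
    !existing_header || !existing_body
}

/// Assumed new shape: the DB is asked whether the body is stored, so the
/// incoming block's lack of a body no longer matters for this check.
pub fn should_check_gap_new(stored: Stored, _incoming: Incoming) -> bool {
    let existing_header = stored.header;
    let body_exists_in_db = stored.body;
    !existing_header || !body_exists_in_db
}
```

For a block stored with header and body and re-imported header-only, the old check triggers gap handling while the assumed new check does not.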

So, before Lukas's change, we opened a gap that was not supposed to be opened (most probably by importing a new block as best):

polkadot-sdk/cumulus/client/consensus/common/src/parachain_consensus.rs

Line 513 in d5c5fbe

    
	async fn import_block_as_new_best<Block, P>(hash: Block::Hash, header: Block::Header, parachain: &P)

polkadot-sdk/cumulus/client/consensus/common/src/parachain_consensus.rs

Lines 531 to 534 in d5c5fbe

    
	// Make it the new best block
	let mut block_import_params = BlockImportParams::new(BlockOrigin::ConsensusBroadcast, header);
	block_import_params.fork_choice = Some(ForkChoiceStrategy::Custom(true));
	block_import_params.import_existing = true;

Then, because the import was header-only, the client/db assumes we need to fill the body via gap sync. This then cascades into a sync restart loop once we switched to Lukas's changes:

1. we request the block to fill the gap
2. now, because import_existing: true, we execute the block
3. execution fails because there's no parent state
4. we go back to step 1, trying to fill the gap
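
The loop in the steps above can be illustrated with a toy model. Everything here is hypothetical (the `Outcome` enum, `try_fill_gap`, `attempts_until_closed`, and the attempt bound); it only encodes the reasoning in this comment and does not reflect the sync engine's real control flow.

```rust
/// Result of one attempt to fill the gap in this toy model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The gap is considered closed (in memory) and sync moves on.
    GapClosed,
    /// Import failed and the sync engine restarts, re-requesting the block.
    Restart,
}

/// One pass over the steps: request the block, then import it.
/// - Without `import_existing`, the block is treated as already in the chain
///   and the gap is closed in memory without execution.
/// - With `import_existing`, the block is executed, which needs parent state.
pub fn try_fill_gap(import_existing: bool, parent_state_available: bool) -> Outcome {
    if !import_existing || parent_state_available {
        Outcome::GapClosed
    } else {
        Outcome::Restart
    }
}

/// Returns the attempt number on which the gap closes, or `None` if it never
/// closes within `max_attempts` (the bound stands in for "forever").
pub fn attempts_until_closed(
    import_existing: bool,
    parent_state_available: bool,
    max_attempts: u32,
) -> Option<u32> {
    (1..=max_attempts)
        .find(|_| try_fill_gap(import_existing, parent_state_available) == Outcome::GapClosed)
}
```

Because nothing changes between attempts when the parent state is pruned and `import_existing` is true, every retry fails the same way, which matches the restart loop described here.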

TLDR;

- stable2512-2 had a bug in the DB which opened a gap when it shouldn't have
- master-14852d21 actually fixes the bug, but because we also changed import_existing to true and had already hit the bug from stable2512-2, we try to fill the gap and enter a sync restart loop
- Since we have data showing this happening on 2 parachains (yap polkadot and yap kusama), I would still consider this fix

Would love to double check this with you guys 🙏 @skunert @lrubasze

skunert · 2026-03-17T21:59:23Z

So you are saying that all nodes that switch from old releases to the new release will have these "accidental gaps" and run into this bug? If yes, then we need to merge this.

lexnv · 2026-03-18T09:40:43Z

> So you are saying that all nodes that switch from old releases to the new release will have these "accidental gaps" and run into this bug? If yes, then we need to merge this.

Yep, this has happened on 6 collators: 3 from polkadot-yap and 3 from kusama-yap parachains

lexnv · 2026-03-18T09:48:00Z

Have double checked AH-Westend which contains the latest 2603-rc4:

- PR aura/import: Skip block execution when collators have no parent block state #11330 ensures we are not looping during the restart process
- however, without this PR, the accidental gap stays in the DB, so the node requests it after every reboot only to close it in memory

lrubasze · 2026-03-18T11:29:16Z

@lexnv amazing analysis and great explanation.
Pretty twisted scenario :)

lexnv · 2026-03-18T12:33:29Z

Thanks a lot Lukas for double checking 🙏
Yep, this scenario took me by surprise as well :D

lexnv · 2026-03-19T14:10:04Z

/cmd prdoc --audience node_dev --bump patch


bkchr · 2026-03-19T22:10:09Z

@lexnv we need to backport this?

paritytech-release-backport-bot · 2026-03-20T12:24:38Z

Created backport PR for stable2512:

[stable2512] Backport #11332 #11451 with remaining conflicts!

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-11332-to-stable2512
git worktree add --checkout .worktree/backport-11332-to-stable2512 backport-11332-to-stable2512
cd .worktree/backport-11332-to-stable2512
git reset --hard HEAD^
git cherry-pick -x 0f64cfcaf9f1665216d2ea5e5f66a8e632ce423c
git push --force-with-lease


paritytech-release-backport-bot · 2026-03-20T12:24:44Z

Successfully created backport PR for stable2603:

[stable2603] Backport #11332 #11452

Backport #11332 into `stable2603` from lexnv. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot.  Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io> Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com> Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

lexnv added 3 commits March 10, 2026 14:22

client/db: Remove missing body gaps for non archive nodes

e34f544

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

client/db: Close missing body gaps for non archive nodes

13fcc0f

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

client/db: Add tests

a6c9cd1

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

lexnv self-assigned this Mar 10, 2026

lexnv added the T0-node This PR/Issue is related to the topic “node”. label Mar 10, 2026

lexnv mentioned this pull request Mar 10, 2026

gap_sync/fix: Close gap and peer banning after warp sync on parachains #11309

Closed

bkchr reviewed Mar 10, 2026


lexnv mentioned this pull request Mar 11, 2026

aura/import: Skip block execution when collators have no parent block state #11330

Merged

lrubasze approved these changes Mar 11, 2026


client/db: Test with multi block gaps

8604871

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>

pgherveou added this to [preview] release tracker Mar 18, 2026

github-project-automation bot moved this to Todo in [preview] release tracker Mar 18, 2026

skunert approved these changes Mar 18, 2026


pgherveou removed this from [preview] release tracker Mar 19, 2026

Merge branch 'master' into lexnv/nuke-gaps

5312a18

Update from github-actions[bot] running command 'prdoc --audience nod…

221cb99

…e_dev --bump patch'

lexnv added this pull request to the merge queue Mar 19, 2026

Merged via the queue into master with commit 0f64cfc Mar 19, 2026
245 of 249 checks passed

lexnv deleted the lexnv/nuke-gaps branch March 19, 2026 16:13

lexnv added the A4-backport-stable2512 Pull request must be backported to the stable2512 release branch label Mar 20, 2026

lexnv added the A4-backport-stable2603 Pull request must be backported to the stable2603 release branch label Mar 20, 2026

paritytech-release-backport-bot bot mentioned this pull request Mar 20, 2026

[stable2512] Backport #11332 #11451

Draft

paritytech-release-backport-bot bot mentioned this pull request Mar 20, 2026

[stable2603] Backport #11332 #11452

Merged

lexnv mentioned this pull request Mar 20, 2026

master/regression: Parachain Gap cannot be filled #11299

Closed


5 participants