Skip to content

Commit 2e63fbc

Browse files
aura/import: Skip block execution when collators have no parent block state (#11330)
This PR skips the execution of blocks when they are propagated to importing via `StateAction::Skip`. There is a bug in the import queue that is affecting collators, which is that they should not execute blocks for non-archive collators that are part of Gap Sync. The bug has surfaced by changing the `import_existing` from false to true in: - #10373 ### Issue The issue manifests for collators that have an unfilled block gap in their DB. During restarting with #10373, a collator would try the following: - client info has detected a gap at block 5800 with length 1 - collator [X] requests the block 5800 with `fields: HEADER | BODY | JUSTIFICATION, from: Number(5800)` - the other 2 collators respond with the full block, including the body, because by default collators will keep around the canonical chain but discard the block state - collator [X] tries to import the block because `import_existing` is true and we continue execution after the following check: https://github.com/paritytech/polkadot-sdk/blob/2b9576c163b1c2408291e2b6c98ae0f2465b4818/substrate/client/service/src/client/client.rs#L1809-L1812 - Before the changes, the code returned `return Ok(ImportResult::AlreadyInChain)` which short-circuited the importing of the block - collator [X] imports the block but fails with `State already discarded` - the error is propagated back to the sync engine that decides to restart the sync process with the same block gap `Restarting sync with client ...` - This results in a vicious cycle where the collator [X] requests the same block again, then restarts the sync engine - Eventually at the 3 request the other collators will notice that this behavior is malicious and ban and disconnect the peers. ### Fix The fix is to skip executing blocks when the gap sync has marked blocks as `StateAction::Skip`. Please note we are still dealing with the following, which should be part of a different PR: - Gap Sync was never closed from the database - When the node starts with a block gap, the node will always initiate a block request over the sync protocol to close the gap - Before the gap was marked as `import_existing: false` which short ciruited the circuit and returned `AlreadyInChain` - Effectively nodes would re-request the gap on reboot wasting networking bandwidth to close the gap "in memory" only, but this was never commited to the DB ### Full Logs ```rust 2026-03-10 13:43:41.138 DEBUG main sync: [Parachain] Restarting sync with client info Info { best_hash: 0xcb03c2aa7dd61f84b27d4c7db42ab848d2eaee9da77ddedc827e070ece063102, best_number: 5883392, genesis_hash: 0x8692fdabb7e55c3347c0f887343e3c0f3fbb560c5f52c9cdc1f7660a1f183c5d, finalized_hash: 0x43664710059a72b37c11db9f99a0f38323b478fbdc82afac058c530c7b002e4d, finalized_number: 5883372, finalized_state: Some((0x43664710059a72b37c11db9f99a0f38323b478fbdc82afac058c530c7b002e4d, 5883372)), number_leaves: 1, block_gap: Some(BlockGap { start: 5800, end: 5800, gap_type: MissingBody }) } 2026-03-10 13:43:41.138 DEBUG main sync: [Parachain] Starting gap sync #5800 - #5800 (old gap best and target: None) 2026-03-10 13:43:41.138 TRACE main sync: [Parachain] Restarted sync at #5883392 (0xcb03c2aa7dd61f84b27d4c7db42ab848d2eaee9da77ddedc827e070ece063102) 2026-03-10 13:45:17.775 TRACE tokio-runtime-worker sync: [Parachain] New gap block request for 12D3KooWRejf1JYYjaaKhHAn28VJJR9ryZqs3wiGPsVjk6eFLLrn, (best:5883362, common:5883362) BlockRequest { id: 0, fields: HEADER | BODY | JUSTIFICATION, from: Number(5800), direction: Descending, max: Some(1) } 2026-03-10 13:45:17.784 DEBUG tokio-runtime-worker sync::import-queue: [Parachain] Starting import of 1 blocks (5800) (origin: GapSync) 2026-03-10 13:45:17.784 TRACE tokio-runtime-worker sync::import-queue: [Parachain] Block 5800 (0x26dc…1cda) has 4 logs (origin: GapSync) 2026-03-10 13:45:17.792 DEBUG tokio-runtime-worker sync::import-queue: [Parachain] Error importing block 5800: 0x26dca166cfefe439262d201b10a8d2679edc4bd98ae59fe12d7f7eef9b871cda: Api called for an unknown Block: State already discarded for 0x4739cf07649d6383bb19d2adccbe9d3f5b1ed91ef5fd6530bc8e69e560b5be91 2026-03-10 13:45:17.792 WARN tokio-runtime-worker sync: [Parachain] 💔 Error importing block 0x26dca166cfefe439262d201b10a8d2679edc4bd98ae59fe12d7f7eef9b871cda: consensus error: Api called for an unknown Block: State already discarded for 0x4739cf07649d6383bb19d2adccbe9d3f5b1ed91ef5fd6530bc8e69e560b5be91 2026-03-10 13:45:17.792 DEBUG tokio-runtime-worker sync: [Parachain] Restarting sync with client info Info { best_hash: 0xcb03c2aa7dd61f84b27d4c7db42ab848d2eaee9da77ddedc827e070ece063102, best_number: 5883392, genesis_hash: 0x8692fdabb7e55c3347c0f887343e3c0f3fbb560c5f52c9cdc1f7660a1f183c5d, finalized_hash: 0xcb03c2aa7dd61f84b27d4c7db42ab848d2eaee9da77ddedc827e070ece063102, finalized_number: 5883392, finalized_state: Some((0xcb03c2aa7dd61f84b27d4c7db42ab848d2eaee9da77ddedc827e070ece063102, 5883392)), number_leaves: 1, block_gap: Some(BlockGap { start: 5800, end: 5800, gap_type: MissingBody }) } 2026-03-10 13:45:17.792 DEBUG tokio-runtime-worker sync: [Parachain] Starting gap sync #5800 - #5800 (old gap best and target: Some((5800, 5800))) 2026-03-10 13:45:17.792 TRACE tokio-runtime-worker sync: [Parachain] Restarted sync at #5883392 (0xcb03c2aa7dd61f84b27d4c7db42ab848d2eaee9da77ddedc827e070ece063102) ``` ### Testing Done - unblocks kusama yap 3392: https://grafana.teleport.parity.io/goto/KBKfuhKDR?orgId=1 - left side of the graph is origin/master, right side is the patch applied with connected peers Closes: - #11299 --------- Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io> Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com> (cherry picked from commit 3c93291)
1 parent f8faa28 commit 2e63fbc

File tree

2 files changed

+153
-3
lines changed

2 files changed

+153
-3
lines changed

cumulus/client/consensus/aura/src/collators/slot_based/block_import.rs

Lines changed: 143 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -97,9 +97,16 @@ where
9797
) -> Result<sc_consensus::ImportResult, Self::Error> {
9898
// If the channel exists and it is required to execute the block, we will execute the block
9999
// here. This is done to collect the storage proof and to prevent re-execution, we push
100-
// downwards the state changes. `StateAction::ApplyChanges` is ignored, because it either
101-
// means that the node produced the block itself or the block was imported via state sync.
102-
if !self.sender.is_closed() && !matches!(params.state_action, StateAction::ApplyChanges(_))
100+
// downwards the state changes.
101+
//
102+
// The following states are ignored:
103+
// - `StateAction::ApplyChanges`: means that the node produced the block itself or the
104+
// block was imported via state sync.
105+
// - `StateAction::Skip`: means that the block should be skipped. This is evident in the
106+
// context of gap-sync with collators running in non-archive modes. The state of the
107+
// parent block has already been discarded and therefore any import would fail.
108+
if !self.sender.is_closed() &&
109+
!matches!(params.state_action, StateAction::ApplyChanges(_) | StateAction::Skip)
103110
{
104111
let mut runtime_api = self.client.runtime_api();
105112

@@ -143,3 +150,136 @@ where
143150
self.inner.import_block(params).await.map_err(Into::into)
144151
}
145152
}
153+
154+
#[cfg(test)]
155+
mod tests {
156+
use super::*;
157+
use codec::Encode;
158+
use cumulus_test_client::{
159+
runtime::Block, DefaultTestClientBuilderExt, InitBlockBuilder, TestClientBuilder,
160+
TestClientBuilderExt,
161+
};
162+
use cumulus_test_relay_sproof_builder::RelayStateSproofBuilder;
163+
use polkadot_primitives::HeadData;
164+
use sc_consensus::{BlockImportParams, ImportResult, StateAction};
165+
use sp_blockchain::HeaderBackend;
166+
use sp_consensus::BlockOrigin;
167+
168+
fn sproof_with_best_parent(client: &cumulus_test_client::Client) -> RelayStateSproofBuilder {
169+
let best_hash = client.info().best_hash;
170+
let header = client.header(best_hash).ok().flatten().expect("No header for best block");
171+
let mut builder = RelayStateSproofBuilder::default();
172+
builder.para_id = cumulus_test_client::runtime::PARACHAIN_ID.into();
173+
builder.included_para_head = Some(HeadData(header.encode()));
174+
builder
175+
}
176+
177+
/// Mock inner block import that always succeeds.
178+
#[derive(Clone)]
179+
struct MockBlockImport;
180+
181+
#[async_trait::async_trait]
182+
impl BlockImport<Block> for MockBlockImport {
183+
type Error = sp_consensus::Error;
184+
185+
async fn check_block(
186+
&self,
187+
_block: sc_consensus::BlockCheckParams<Block>,
188+
) -> Result<ImportResult, Self::Error> {
189+
Ok(ImportResult::imported(false))
190+
}
191+
192+
async fn import_block(
193+
&self,
194+
_block: BlockImportParams<Block>,
195+
) -> Result<ImportResult, Self::Error> {
196+
Ok(ImportResult::imported(true))
197+
}
198+
}
199+
200+
/// Regression test for the gap-sync infinite loop issue.
201+
///
202+
/// When a non-archive collator has a block gap of size 1, gap-sync downloads
203+
/// the block and marks it with `skip_execution: true` (which translates to
204+
/// `StateAction::Skip`). Before the fix, `SlotBasedBlockImport` would attempt
205+
/// to execute such blocks, fail with a consensus error ("State already
206+
/// discarded for parent"), and trigger a chain-sync restart that re-creates
207+
/// the same gap — leading to an infinite retry loop.
208+
///
209+
/// This test verifies that `StateAction::Skip` blocks are forwarded to the
210+
/// inner block import without attempting runtime execution.
211+
#[tokio::test]
212+
async fn gap_sync_block_with_skip_execution_does_not_attempt_runtime_call() {
213+
sp_tracing::try_init_simple();
214+
215+
let client = Arc::new(TestClientBuilder::new().build());
216+
217+
// Build a valid block so we have realistic headers/bodies.
218+
let sproof = sproof_with_best_parent(&client);
219+
let block_builder_data = client.init_block_builder(None, sproof);
220+
let block = block_builder_data.block_builder.build().unwrap().block;
221+
222+
let (slot_based_import, mut handle) =
223+
SlotBasedBlockImport::new(MockBlockImport, client.clone());
224+
225+
// Simulate the gap-sync scenario: a block arrives with StateAction::Skip
226+
// because the parent state has been pruned.
227+
let mut params = BlockImportParams::new(BlockOrigin::NetworkInitialSync, block.header);
228+
params.body = Some(block.extrinsics);
229+
params.state_action = StateAction::Skip;
230+
params.import_existing = true;
231+
232+
// Before the fix, this would fail with a consensus error because
233+
// SlotBasedBlockImport would try to call `runtime_api.execute_block()`
234+
// on the parent hash whose state is no longer available.
235+
//
236+
// After the fix, StateAction::Skip is recognized and the block is
237+
// forwarded directly to the inner import without execution.
238+
let result = slot_based_import.import_block(params).await;
239+
assert!(result.is_ok(), "Gap-sync block with StateAction::Skip must not fail: {result:?}");
240+
241+
// The channel must be empty — execution should have been skipped entirely,
242+
// so no (block, proof) was sent. This is the key assertion: without the
243+
// StateAction::Skip guard, execute_block() would run and send a message.
244+
//
245+
// Drop the sender side so the channel closes, then verify no message was queued.
246+
drop(slot_based_import);
247+
assert!(
248+
handle.receiver.next().await.is_none(),
249+
"No block+proof should be sent through the channel for StateAction::Skip"
250+
);
251+
}
252+
253+
/// Verify that `StateAction::Execute` still triggers runtime execution.
254+
///
255+
/// This complements the gap-sync regression test by ensuring we did not
256+
/// accidentally disable execution for normal blocks.
257+
#[tokio::test]
258+
async fn normal_block_with_execute_action_triggers_runtime_execution() {
259+
sp_tracing::try_init_simple();
260+
261+
let client = Arc::new(TestClientBuilder::new().build());
262+
263+
let sproof = sproof_with_best_parent(&client);
264+
let block_builder_data = client.init_block_builder(None, sproof);
265+
let block = block_builder_data.block_builder.build().unwrap().block;
266+
267+
let (slot_based_import, mut handle) =
268+
SlotBasedBlockImport::new(MockBlockImport, client.clone());
269+
270+
// Normal import with StateAction::Execute should trigger execution
271+
// and send the block + proof through the channel.
272+
let mut params =
273+
BlockImportParams::new(BlockOrigin::NetworkInitialSync, block.header.clone());
274+
params.body = Some(block.extrinsics.clone());
275+
params.state_action = StateAction::Execute;
276+
277+
let result = slot_based_import.import_block(params).await;
278+
assert!(result.is_ok(), "Normal block import should succeed: {result:?}");
279+
280+
// The block and proof should have been sent through the channel,
281+
// confirming that execution actually happened.
282+
let (received_block, _proof) = handle.next().await;
283+
assert_eq!(*received_block.header(), block.header);
284+
}
285+
}

prdoc/pr_11330.prdoc

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
title: 'aura/import: Skip block execution when collators have no parent block state'
2+
doc:
3+
- audience: Node Dev
4+
description: "Skip block execution in `SlotBasedBlockImport` when gap-sync marks
5+
blocks with `StateAction::Skip`. This fixes an infinite retry loop where non-archive
6+
collators failed to import gap-sync blocks because the parent state was already
7+
pruned, causing repeated sync restarts and eventual peer bans."
8+
crates:
9+
- name: cumulus-client-consensus-aura
10+
bump: patch

0 commit comments

Comments
 (0)