gap_sync/fix: Close gap and peer banning after warp sync on parachains#11309
gap_sync/fix: Close gap and peer banning after warp sync on parachains#11309
Conversation
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
|
Can you explain a bit more clearly what exactly leads to this bug? I remember reviewing @lrubasze PR and thought this case was handled where gap sync encounters a sparse warp block 🤔 |
bkchr
left a comment
There was a problem hiding this comment.
This requires a regression test.
My AI assistant told me that this gap already existed before, using the old nodes. I did not yet had verified it. So, there was maybe a bug before, that triggers now something using the latest node. Clearly we need a test that exactly reproduces the problem. |
|
My main point of confusion is whether this bug triggers for freshly warp-synced nodes. If it only triggers when gap sync is in progress and you upgrade mid-sync, then ignoring this is also an option. |
This PR fixes an issue with parachain collators stuck in a ban loop after starting with warp sync. Effectively having 0 connected peers.
The root causes are
Warp Sync
Gap sync fails to advance, causing the sync engine to request the same block multiple times. Considering that the gap is old, no other collator can provide a response and close it. The downstream effect is that after 3 identical requests, the peer gets banned.
For non-archive nodes (ie
block_data.block.body.is_none()), the gap sync requests blocks without bodies.When a warp synced block is provided, the block is already in the chain. That causes the import to return
AlreadyInChainbefore having a chance to advance the gap start.Then, because the gap start is never advanced, the request is duplicated to peers, causing a banning loop and disconnects.
DB never closes the gap
The DB gap stalls across node restarts: the DB gap is never updated. Even when a warp-sync is skipped by advancing now the gap's best queued number, the change is not reflected in the DB.
number == gap sttart + 1number == gap startTo fix this, a while loop is added on the
MissingHeaderAndBodycase, that closes the gap if the block's headers were already imported.The issue has surfaced after introducing the following optimization:
Closes: