
fix: fix SnapSync regression from network layer refactor #10902

Open

kamilchodola wants to merge 8 commits into master from kch/fix-snap-sync-regression

Conversation

kamilchodola (Contributor) commented Mar 20, 2026

Summary

Fixes issues introduced in #10753 that caused SnapSync to regress from 25% in 5 minutes to 1% in 15 minutes on Hoodi:

  • StorageRange deserialization limit too low (critical): MaxResponseSlotsPerAccount was 16,384 (~1.08 MB). Large-storage contracts can return 45,000+ slots per response (up to 3 MB). When the limit was hit, RlpLimitException triggered InitiateDisconnect(MessageLimitsBreached), disconnecting and banning the serving peer for 15 minutes with a -10,000 reputation penalty. This progressively banned all good peers. Raised to 131,072.
  • Session.Handshake NodeId mismatch: Only disconnect outbound peers on identity mismatch for static/bootnode peers (operator-verified). Discovered peers accept the new identity — stale discovery data is common and benign. Also fixes HandshakeComplete firing after InitiateDisconnect on doomed sessions.
  • SemaphoreSlim signal consumption: EnsureAvailableActivePeerSlotAsync polling loop was consuming signals meant for the main peer update loop. Replaced _peerUpdateRequested.WaitAsync with Task.Delay in the polling path.
  • Lock scope reduction: OnDisconnected and OnHandshakeComplete held _sessionLock for their entire body, serializing all session events. Now only guards session bookkeeping; peer processing runs outside the lock.
  • Dedicated thread for peer update loop: Restored TaskCreationOptions.LongRunning with .Unwrap() (fixing the original Task<Task> bug), so the loop gets a dedicated thread instead of competing on the thread pool.

Test plan

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)

Fix four issues introduced in #10753 that caused SnapSync to regress
from 25% in 5 minutes to 1% in 15 minutes on Hoodi:

1. Session.Handshake: only disconnect on NodeId mismatch for static/bootnode
   peers (operator-verified identities). Discovered peers accept the new
   identity as before — stale discovery data is common and benign. Also
   fix HandshakeComplete firing after InitiateDisconnect on doomed sessions.

2. PeerManager: stop EnsureAvailableActivePeerSlotAsync from consuming
   SemaphoreSlim signals meant for the main peer update loop. Use
   Task.Delay for polling instead of WaitAsync on the shared semaphore.

3. PeerManager: reduce _sessionLock scope in OnDisconnected and
   OnHandshakeComplete to only guard session bookkeeping. Peer processing
   runs outside the lock to avoid serializing all session events.

4. PeerManager: restore dedicated thread for peer update loop via
   Task.Factory.StartNew(LongRunning) with .Unwrap() — fixes the
   original Task<Task> bug while keeping the loop off the thread pool.
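The signal-consumption bug in item 2 can be illustrated with an asyncio analogue (a sketch with hypothetical names; the real code uses SemaphoreSlim in C#): a polling loop that waits on the shared semaphore steals the one-shot "peers changed" signal from the main update loop, while polling with a plain delay leaves the signal for its intended consumer.

```python
import asyncio

async def main():
    peer_update_requested = asyncio.Semaphore(0)  # analogue of _peerUpdateRequested
    woke = []

    async def peer_update_loop():
        # The loop the signal is actually meant for.
        await peer_update_requested.acquire()
        woke.append("update-loop")

    async def polling_buggy():
        # Buggy shape: waiting on the shared semaphore consumes
        # the signal, leaving the update loop asleep forever.
        await peer_update_requested.acquire()
        woke.append("poller")

    async def polling_fixed():
        # Fixed shape: poll with a plain delay; the semaphore
        # and its single pending signal are left untouched.
        await asyncio.sleep(0.01)
        woke.append("poller")

    loop_task = asyncio.create_task(peer_update_loop())
    poll_task = asyncio.create_task(polling_fixed())
    peer_update_requested.release()  # one signal: "peers changed"
    await asyncio.gather(loop_task, poll_task)
    return woke

result = asyncio.run(main())
```

With `polling_fixed` both tasks finish; swapping in `polling_buggy` would consume the single release and deadlock the update loop.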

MaxResponseSlotsPerAccount was 16,384, which caps at ~1.08 MB of storage
slots per account. During SnapSync, large-storage contracts (Uniswap,
USDT, etc.) can return 45,000+ slots in a single response (up to 3 MB).

When the limit was hit, RlpLimitException triggered
Session.InitiateDisconnect(MessageLimitsBreached), disconnecting and
banning the serving peer for 15 minutes with -10,000 reputation penalty.
This progressively banned all good peers, collapsing sync throughput.

Raise to 131,072 (128K), which accommodates the 3 MiB max response size
with margin (~66 bytes per slot → ~47K slots at 3 MB).
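The sizing can be sanity-checked from the figures quoted above (~66 bytes per RLP-encoded slot against the 3 MiB response cap); a quick back-of-the-envelope in Python:

```python
MAX_RESPONSE_BYTES = 3 * 1024 * 1024   # 3 MiB snap response cap
APPROX_SLOT_RLP_BYTES = 66             # ~66 bytes per encoded storage slot

OLD_LIMIT = 16_384    # old MaxResponseSlotsPerAccount
NEW_LIMIT = 131_072   # new MaxResponseSlotsPerAccount (128K)

# Slots that fit in a maximum-size response: ~47K, well above the
# old limit (which capped out near 1.08 MB) and comfortably below
# the new one.
slots_at_cap = MAX_RESPONSE_BYTES // APPROX_SLOT_RLP_BYTES
old_cap_bytes = OLD_LIMIT * APPROX_SLOT_RLP_BYTES
```

So a full 3 MiB response (~47,662 slots at these figures) exceeded the old limit roughly threefold, while the new limit leaves over 2.5× headroom.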
github-actions bot added the tools label Mar 20, 2026
…ules

- Replace Task.Factory.StartNew(LongRunning) with async Task + Task.Yield()
  since the async delegate yields the dedicated thread at the first await
  anyway, making LongRunning pointless. Also avoids the potential
  TaskCanceledException from passing the token to StartNew.
- Remove accidentally committed src/nevm and tools/gas-benchmarks submodules.
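The Task&lt;Task&gt; pitfall mentioned above has a close Python analogue (a sketch only; the real code is C#'s Task.Factory.StartNew, which wraps an async delegate's inner Task in an outer Task unless .Unwrap() is called): submitting an async function to an executor yields a future that "completes" as soon as the coroutine object is created, not when the work finishes.

```python
import asyncio
import inspect
from concurrent.futures import ThreadPoolExecutor

async def peer_update_loop():
    # Stands in for the long-running async peer update loop.
    await asyncio.sleep(0)
    return "done"

with ThreadPoolExecutor(max_workers=1) as ex:
    # Bug analogue: the outer future's result is the *coroutine object*
    # itself -- it completes immediately, like an un-Unwrapped Task<Task>,
    # so waiting on it does not wait for the loop's work.
    outer = ex.submit(peer_update_loop)
    inner = outer.result()
    assert inspect.iscoroutine(inner)  # nothing actually ran yet
    inner.close()

    # Fix analogue of .Unwrap(): drive the inner coroutine to completion
    # before treating the outer future as finished.
    value = ex.submit(asyncio.run, peer_update_loop()).result()
```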
github-actions bot removed the tools label Mar 20, 2026
kamilchodola marked this pull request as ready for review March 20, 2026 13:52
- Invariant tests verify response limits (MaxResponseAccounts,
  MaxResponseSlotsPerAccount) are large enough to accommodate the maximum
  item count that fits within the 3 MiB response cap. This catches the
  exact class of bug that caused the SnapSync regression — limits set
  below what real peers send in valid responses.
- Boundary roundtrip tests verify serialization works at realistic sizes
  (40K accounts, 50K storage slots) matching production traffic.
LukaszRozmej requested a review from benaadams March 22, 2026 09:18
LukaszRozmej (Member) commented:

@benaadams @flcl42 can you review?

Copilot AI (Contributor) left a comment


Pull request overview

Fixes SnapSync throughput regression introduced by the network layer refactor by adjusting Snap RLP limits and improving session/peer-manager event handling to avoid unnecessary disconnects, peer bans, and scheduling contention.

Changes:

  • Increase Snap response RLP collection limits (accounts + storage slots) to avoid disconnecting/banning peers on valid large responses.
  • Adjust Session.Handshake identity-mismatch handling to only reject outbound static/bootnode peers; accept updated identities for discovered peers and avoid firing HandshakeComplete after a doomed disconnect.
  • Reduce PeerManager lock scope in disconnect/handshake handlers and prevent _peerUpdateRequested semaphore starvation by switching a polling path to Task.Delay.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • src/Nethermind/Nethermind.Network/PeerManager.cs: Improves peer-update scheduling and reduces session event lock contention; prevents semaphore signal consumption in a polling loop.
  • src/Nethermind/Nethermind.Network/P2P/Subprotocols/Snap/SnapMessageLimits.cs: Raises Snap response item-count limits to align with real-world 3 MiB responses.
  • src/Nethermind/Nethermind.Network/P2P/Session.cs: Refines outbound identity-mismatch behavior to avoid disconnecting discovered peers while keeping strict checks for static/bootnodes.
  • src/Nethermind/Nethermind.Network.Test/P2P/Subprotocols/Snap/Messages/SnapMessageLimitsTests.cs: Adds coverage validating higher Snap response limits and large round-trip serialization/deserialization.
  • src/Nethermind/Nethermind.Network.Test/P2P/SessionTests.cs: Updates and extends tests for the new handshake identity-mismatch behavior (discovered vs static/bootnode).


benaadams (Member) commented:

@copilot open a new pull request to apply changes based on the comments in this thread

Copilot AI (Contributor) commented Mar 24, 2026

@benaadams I've opened a new pull request, #10934, to work on those changes. Once the pull request is ready, I'll request review from you.

benaadams and others added 2 commits March 24, 2026 02:40
#10934)

* Initial plan

* test: dispose AccountRangeMessage instances and hoist Rlp.Encode outside loop

Co-authored-by: benaadams <1142958+benaadams@users.noreply.github.com>
Agent-Logs-Url: https://github.com/NethermindEth/nethermind/sessions/340f6ff2-89a5-4c0d-89a0-dc1e74412382

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: benaadams <1142958+benaadams@users.noreply.github.com>
