fix: fix SnapSync regression from network layer refactor #10902
Open
kamilchodola wants to merge 8 commits into master
Fix four issues introduced in #10753 that caused SnapSync to regress from 25% in 5 minutes to 1% in 15 minutes on Hoodi:

1. Session.Handshake: only disconnect on NodeId mismatch for static/bootnode peers (operator-verified identities). Discovered peers accept the new identity as before, since stale discovery data is common and benign. Also fix HandshakeComplete firing after InitiateDisconnect on doomed sessions.
2. PeerManager: stop EnsureAvailableActivePeerSlotAsync from consuming SemaphoreSlim signals meant for the main peer update loop. Use Task.Delay for polling instead of WaitAsync on the shared semaphore.
3. PeerManager: reduce _sessionLock scope in OnDisconnected and OnHandshakeComplete to only guard session bookkeeping. Peer processing runs outside the lock to avoid serializing all session events.
4. PeerManager: restore dedicated thread for peer update loop via Task.Factory.StartNew(LongRunning) with .Unwrap(), which fixes the original Task<Task> bug while keeping the loop off the thread pool.
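Issue 2 above is a case of two waiters sharing one signal. The following is a minimal sketch in Python `asyncio`, not the Nethermind C# code: `signal_survives_polling` and `wait_for_free_slot` are hypothetical names, and `asyncio.Semaphore` stands in for `SemaphoreSlim`. It shows that a polling path which waits on the shared semaphore swallows a signal, while a plain delay leaves it for the main loop.

```python
import asyncio


async def signal_survives_polling(poll_with_semaphore: bool) -> bool:
    """Return True if a peer-update signal is still available to the
    main loop after one polling iteration has run (hypothetical model)."""
    # Stand-in for the shared "peer update requested" semaphore.
    update_requested = asyncio.Semaphore(0)

    async def wait_for_free_slot() -> None:
        # One iteration of a slot-polling loop.
        if poll_with_semaphore:
            # Old behavior (assumed): wait on the shared semaphore with a
            # timeout. If a signal arrives, this waiter consumes it.
            try:
                await asyncio.wait_for(update_requested.acquire(), timeout=0.05)
            except asyncio.TimeoutError:
                pass
        else:
            # Fixed behavior: a plain delay never touches the semaphore.
            await asyncio.sleep(0.05)

    poller = asyncio.create_task(wait_for_free_slot())
    await asyncio.sleep(0)      # give the poller a chance to start
    update_requested.release()  # someone requests a peer update
    await poller
    # locked() is True when no permit is left for the main loop.
    return not update_requested.locked()


if __name__ == "__main__":
    print("semaphore polling:", asyncio.run(signal_survives_polling(True)))
    print("delay polling:    ", asyncio.run(signal_survives_polling(False)))
```

In this model the main loop would block after the semaphore-based poll even though an update was requested, which matches the starvation described in the commit message.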
MaxResponseSlotsPerAccount was 16,384 which caps at ~1.08 MB of storage slots per account. During SnapSync, large-storage contracts (Uniswap, USDT, etc.) return up to 45,000+ slots in a single response (up to 3 MB). When the limit was hit, RlpLimitException triggered Session.InitiateDisconnect(MessageLimitsBreached), disconnecting and banning the serving peer for 15 minutes with -10,000 reputation penalty. This progressively banned all good peers, collapsing sync throughput. Raise to 131,072 (128K) which accommodates the 3 MiB max response size with margin (~66 bytes per slot → ~47K slots at 3 MB).
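The numbers in this commit message can be checked with a few lines of arithmetic. This is a sketch that takes the ~66 bytes per slot and the 3 MiB cap from the message as given; the constant and function names are illustrative and are not taken from `SnapMessageLimits.cs`.

```python
MIB = 1024 * 1024

# Figures quoted in the commit message (treated here as assumptions).
MAX_RESPONSE_BYTES = 3 * MIB   # response size cap
APPROX_BYTES_PER_SLOT = 66     # rough encoded size of one storage slot

OLD_LIMIT = 16_384
NEW_LIMIT = 131_072


def max_slots_in_response(response_bytes: int = MAX_RESPONSE_BYTES,
                          bytes_per_slot: int = APPROX_BYTES_PER_SLOT) -> int:
    """Largest number of slots that fits in one response of the given size."""
    return response_bytes // bytes_per_slot


def limit_is_safe(limit: int) -> bool:
    """A limit is safe if no size-valid response can exceed it."""
    return limit >= max_slots_in_response()


if __name__ == "__main__":
    print(OLD_LIMIT * APPROX_BYTES_PER_SLOT)  # bytes covered by the old limit
    print(max_slots_in_response())            # slots that fit in 3 MiB
    print(limit_is_safe(OLD_LIMIT), limit_is_safe(NEW_LIMIT))
```

Under these assumptions the old limit covers 1,081,344 bytes (about 1.08 MB), a full 3 MiB response holds 47,662 slots, and only the new limit stays above that count.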
…ules

- Replace Task.Factory.StartNew(LongRunning) with async Task + Task.Yield() since the async delegate yields the dedicated thread at the first await anyway, making LongRunning pointless. Also avoids the potential TaskCanceledException from passing the token to StartNew.
- Remove accidentally committed src/nevm and tools/gas-benchmarks submodules.
- Invariant tests verify response limits (MaxResponseAccounts, MaxResponseSlotsPerAccount) are large enough to accommodate the maximum item count that fits within the 3 MiB response cap. This catches the exact class of bug that caused the SnapSync regression: limits set below what real peers send in valid responses.
- Boundary roundtrip tests verify serialization works at realistic sizes (40K accounts, 50K storage slots) matching production traffic.
smartprogrammer93
approved these changes
Mar 22, 2026
Member
@benaadams @flcl42 can you review?
asdacap
approved these changes
Mar 23, 2026
Contributor
Pull request overview
Fixes SnapSync throughput regression introduced by the network layer refactor by adjusting Snap RLP limits and improving session/peer-manager event handling to avoid unnecessary disconnects, peer bans, and scheduling contention.
Changes:
- Increase Snap response RLP collection limits (accounts + storage slots) to avoid disconnecting/banning peers on valid large responses.
- Adjust `Session.Handshake` identity-mismatch handling to only reject outbound static/bootnode peers; accept updated identities for discovered peers and avoid firing `HandshakeComplete` after a doomed disconnect.
- Reduce `PeerManager` lock scope in disconnect/handshake handlers and prevent `_peerUpdateRequested` semaphore starvation by switching a polling path to `Task.Delay`.
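The lock-scope change in the last item follows a common pattern: hold the lock only while touching shared bookkeeping, then do the slower work after releasing it. Below is a minimal Python sketch of that pattern, not the actual `PeerManager` code; `PeerBook`, `add_session`, and `_process_peer` are hypothetical names, and `threading.Lock` stands in for `_sessionLock`.

```python
import threading


class PeerBook:
    """Hypothetical stand-in for session bookkeeping guarded by a lock."""

    def __init__(self) -> None:
        self._session_lock = threading.Lock()
        self._sessions: dict[str, str] = {}
        # Records whether the lock was held each time a peer was processed.
        self.lock_held_during_processing: list[bool] = []

    def add_session(self, session_id: str, peer: str) -> None:
        with self._session_lock:
            self._sessions[session_id] = peer

    def on_disconnected(self, session_id: str) -> None:
        # Narrow critical section: only the dictionary update is guarded.
        with self._session_lock:
            peer = self._sessions.pop(session_id, None)
        # Peer processing runs after the lock is released, so other
        # session events are not serialized behind it.
        if peer is not None:
            self._process_peer(peer)

    def _process_peer(self, peer: str) -> None:
        self.lock_held_during_processing.append(self._session_lock.locked())

    def session_count(self) -> int:
        with self._session_lock:
            return len(self._sessions)


if __name__ == "__main__":
    book = PeerBook()
    book.add_session("s1", "peer-a")
    book.on_disconnected("s1")
    print(book.lock_held_during_processing, book.session_count())
```

The trade-off is that processing can no longer assume the bookkeeping is frozen while it runs, so anything it needs should be read out while the lock is still held, as `peer` is here.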
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/Nethermind/Nethermind.Network/PeerManager.cs | Improves peer-update scheduling and reduces session event lock contention; prevents semaphore signal consumption in a polling loop. |
| src/Nethermind/Nethermind.Network/P2P/Subprotocols/Snap/SnapMessageLimits.cs | Raises Snap response item-count limits to align with real-world 3 MiB responses. |
| src/Nethermind/Nethermind.Network/P2P/Session.cs | Refines outbound identity-mismatch behavior to avoid disconnecting discovered peers while keeping strict checks for static/bootnodes. |
| src/Nethermind/Nethermind.Network.Test/P2P/Subprotocols/Snap/Messages/SnapMessageLimitsTests.cs | Adds coverage validating higher Snap response limits and large round-trip serialization/deserialization. |
| src/Nethermind/Nethermind.Network.Test/P2P/SessionTests.cs | Updates/extends tests for the new handshake identity-mismatch behavior (discovered vs static/bootnode). |
Member
@copilot open a new pull request to apply changes based on the comments in this thread
Contributor
@benaadams I've opened a new pull request, #10934, to work on those changes. Once the pull request is ready, I'll request review from you.
#10934)

* Initial plan
* test: dispose AccountRangeMessage instances and hoist Rlp.Encode outside loop

Co-authored-by: benaadams <1142958+benaadams@users.noreply.github.com>
Agent-Logs-Url: https://github.com/NethermindEth/nethermind/sessions/340f6ff2-89a5-4c0d-89a0-dc1e74412382

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: benaadams <1142958+benaadams@users.noreply.github.com>
Summary
Fixes issues introduced in #10753 that caused SnapSync to regress from 25% in 5 minutes to 1% in 15 minutes on Hoodi:
- `MaxResponseSlotsPerAccount` was 16,384 (~1.08 MB). Large-storage contracts return up to 45,000+ slots per response (up to 3 MB). When the limit was hit, `RlpLimitException` triggered `InitiateDisconnect(MessageLimitsBreached)`, disconnecting and banning the serving peer for 15 minutes with -10,000 reputation. This progressively banned all good peers. Raised to 131,072.
- Fixed `HandshakeComplete` firing after `InitiateDisconnect` on doomed sessions.
- The `EnsureAvailableActivePeerSlotAsync` polling loop was consuming signals meant for the main peer update loop. Replaced `_peerUpdateRequested.WaitAsync` with `Task.Delay` in the polling path.
- `OnDisconnected` and `OnHandshakeComplete` held `_sessionLock` for their entire body, serializing all session events. The lock now only guards session bookkeeping; peer processing runs outside the lock.
- The peer update loop is started with `TaskCreationOptions.LongRunning` and `.Unwrap()` (fixing the original `Task<Task>` bug), so the loop gets a dedicated thread instead of competing on the thread pool.

Test plan
- `PeerManagerTests` pass (35/35)
- `SessionTests` pass + 3 new tests added (44/44)

What types of changes does your code introduce?