fix(nodebuilder/tests): fix integration test flakes due to slow header syncing#4846
Draft
fix(nodebuilder/tests): fix integration test flakes due to slow header syncing#4846
Conversation
…ader sync flakes Fixes flaky integration tests caused by tight per-test context deadlines expiring before nodes can sync headers to the required block height. - Increase timeouts to DefaultTestTimeout (10m) for TestBlobModule, TestDaModule, TestShareModule, TestShrexFromLightsWithBadFulls, and TestArchivalBlobSync - Add missing WaitForHeight call in TestDaModule GetIDs subtest before calling GetByHeight on light client - Fix WaitTillHeight to t.Fatalf on context cancellation instead of silently falling through Closes #4845 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
walldiss
reviewed
Mar 12, 2026
| func TestBlobModule(t *testing.T) { | ||
| t.Parallel() | ||
| ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second) | ||
| ctx, cancel := context.WithTimeout(context.Background(), swamp.DefaultTestTimeout) |
Member
There was a problem hiding this comment.
I think it is too much for timeout. We should look closer on how much time it takes for test to finish and whether it is reasonable
Same applies to another test timeout change
Comment on lines
+105
to
107
| _, err = lightClient.Header.WaitForHeight(ctx, height) | ||
| require.NoError(t, err) | ||
| header, err := lightClient.Header.GetByHeight(ctx, height) |
Member
There was a problem hiding this comment.
What is the reason to add waitforheight?
Contributor
Author
|
Claude marked this as ready for review prematurely because integration tests weren't running in draft mode. Since they're still failing, this needs more work |
…ossipsub discovery Bridge nodes use FanoutOnly gossipsub mode, meaning they publish headers but don't subscribe to the mesh. For fanout to deliver headers to the light node, the bridge must discover the light node as a subscriber, which requires an actual p2p connection (not just a mocknet link). In tests with complex topologies (bridge + full + light), the light node bootstraps from the full node, so the bridge never establishes a direct connection to the light node. Without this connection, gossipsub fanout from the bridge doesn't reach the light node, causing it to get stuck at an early head height after the initial header exchange sync. Add sw.Connect() calls after starting all nodes to ensure the bridge (or archival FN) has a direct connection to the light node, enabling reliable gossipsub header delivery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…id gossipsub race Bridge nodes use FanoutOnly gossipsub mode, so they publish headers but don't subscribe to the mesh. For fanout to deliver to the light node, the bridge must discover the light node as a subscriber, which requires gossipsub control message exchange. In the mocknet, this discovery can race with header publication, causing the light node's syncer to get stuck at the initial exchange head and never progress. Fix by starting the light node AFTER blobs are submitted and the full node has synced. The light node's initial header exchange sync then reliably fetches all needed headers without depending on gossipsub message delivery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- TestHeaderSubscription: wait for bridge to accumulate blocks before starting light node so initial exchange sync generates subscription events without relying on gossipsub (unreliable with FanoutOnly in mocknet) - TestSyncLightAgainstFull: use fixed numBlocks height instead of full's moving head to avoid race where full advances past light's exchange sync - TestArchivalBlobSync: reset bootstrappers after stopping archivalBN to prevent new nodes from connecting to the stopped node Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestHeaderSubscription: explicitly connect bridge and light nodes via mocknet.ConnectPeers, wait for initial sync, and add a delay for gossipsub heartbeat exchange before subscribing. The bridge uses FanoutOnly mode which requires subscription metadata propagation. TestArchivalBlobSync: use assert.Eventually instead of assert.False for EDS pruning checks. Pruning FNs fetch blocks from core during catch-up and the pruner needs time to remove data outside the availability window. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestArchivalBlobSync: the prunerOpts only replaced time.Duration, but the pruner service uses share.Window for its availability window. Add fx.Replace(modshare.Window(testAvailWindow)) for pruning nodes so the pruner actually uses the 1ms window to prune historical data. TestShrexFromLightsWithBadFulls: setTimeInterval was using testTimeout (10min) as the AdvertiseInterval, preventing peer discovery for the entire test duration. Use a fixed 20s interval instead. TestHeaderSubscription: revert changes since the test already used DefaultTestTimeout and the gossipsub issue with FanoutOnly in mocknet is a pre-existing problem requiring deeper fixes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Start the light node after blocks are filled and the bridge has synced them. This ensures the light's initial header exchange sync fetches all needed headers without relying on gossipsub (unreliable with FanoutOnly bridge nodes in mocknet). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ync flakes TestShrexFromLights: restructure to start light node after blocks are filled so initial header exchange sync fetches all needed headers without relying on gossipsub (same pattern that fixed TestShrexFromLightsWithBadFulls). TestArchivalBlobSync: close the stopped archivalBN's host so that future createPeer/LinkAll calls don't re-link new nodes to it. Without this, new nodes may inadvertently reach archivalBN via shrex and get "protocols not supported" errors since its handlers are gone after Stop(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
DefaultTestTimeout(10m) forTestBlobModule,TestDaModule,TestShareModule,TestShrexFromLightsWithBadFulls, andTestArchivalBlobSync— these were flaking because nodes couldn't sync headers before the deadlineWaitForHeightcall inTestDaModule'sGetIDssubtest before callingGetByHeighton the light clientSwamp.WaitTillHeighttot.Fatalfon context cancellation instead of silently falling through, which caused confusing test failuresTest plan
make test-integration TAGS=blobpassesmake test-integration TAGS=dapassesmake test-integration TAGS=sharepassesmake test-integration TAGS=ndpassesmake test-integration TAGS=pruningpassesCloses #4845
🤖 Generated with Claude Code