Skip to content

fix(nodebuilder/tests): fix integration test flakes due to slow header syncing#4846

Draft
rootulp wants to merge 9 commits intomainfrom
fix/integration-test-flakes-header-sync
Draft

fix(nodebuilder/tests): fix integration test flakes due to slow header syncing#4846
rootulp wants to merge 9 commits intomainfrom
fix/integration-test-flakes-header-sync

Conversation

@rootulp
Copy link
Contributor

@rootulp rootulp commented Mar 12, 2026

Summary

  • Increase per-test context timeouts from 20-30s to DefaultTestTimeout (10m) for TestBlobModule, TestDaModule, TestShareModule, TestShrexFromLightsWithBadFulls, and TestArchivalBlobSync — these were flaking because nodes couldn't sync headers before the deadline
  • Add missing WaitForHeight call in TestDaModule's GetIDs subtest before calling GetByHeight on the light client
  • Fix Swamp.WaitTillHeight to t.Fatalf on context cancellation instead of silently falling through, which caused confusing test failures

Test plan

  • make test-integration TAGS=blob passes
  • make test-integration TAGS=da passes
  • make test-integration TAGS=share passes
  • make test-integration TAGS=nd passes
  • make test-integration TAGS=pruning passes

Closes #4845

🤖 Generated with Claude Code


Open with Devin

…ader sync flakes

Fixes flaky integration tests caused by tight per-test context deadlines
expiring before nodes can sync headers to the required block height.

- Increase timeouts to DefaultTestTimeout (10m) for TestBlobModule,
  TestDaModule, TestShareModule, TestShrexFromLightsWithBadFulls, and
  TestArchivalBlobSync
- Add missing WaitForHeight call in TestDaModule GetIDs subtest before
  calling GetByHeight on light client
- Fix WaitTillHeight to t.Fatalf on context cancellation instead of
  silently falling through

Closes #4845

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rootulp rootulp self-assigned this Mar 12, 2026
@github-actions github-actions bot added the external Issues created by non node team members label Mar 12, 2026
@rootulp rootulp added the kind:fix Attached to bug-fixing PRs label Mar 12, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rootulp rootulp marked this pull request as ready for review March 12, 2026 20:16
Copy link

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

func TestBlobModule(t *testing.T) {
t.Parallel()
ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
ctx, cancel := context.WithTimeout(context.Background(), swamp.DefaultTestTimeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is too much for timeout. We should look closer on how much time it takes for test to finish and whether it is reasonable

Same applies to another test timeout change

Comment on lines +105 to 107
_, err = lightClient.Header.WaitForHeight(ctx, height)
require.NoError(t, err)
header, err := lightClient.Header.GetByHeight(ctx, height)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to add waitforheight?

@rootulp rootulp marked this pull request as draft March 12, 2026 20:48
@rootulp
Copy link
Contributor Author

rootulp commented Mar 12, 2026

Claude marked this as ready for review prematurely because integration tests weren't running in draft mode. Since they're still failing, this needs more work

rootulp and others added 3 commits March 12, 2026 17:41
…ossipsub discovery

Bridge nodes use FanoutOnly gossipsub mode, meaning they publish headers
but don't subscribe to the mesh. For fanout to deliver headers to the
light node, the bridge must discover the light node as a subscriber,
which requires an actual p2p connection (not just a mocknet link).

In tests with complex topologies (bridge + full + light), the light
node bootstraps from the full node, so the bridge never establishes a
direct connection to the light node. Without this connection, gossipsub
fanout from the bridge doesn't reach the light node, causing it to get
stuck at an early head height after the initial header exchange sync.

Add sw.Connect() calls after starting all nodes to ensure the bridge
(or archival FN) has a direct connection to the light node, enabling
reliable gossipsub header delivery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…id gossipsub race

Bridge nodes use FanoutOnly gossipsub mode, so they publish headers but
don't subscribe to the mesh. For fanout to deliver to the light node,
the bridge must discover the light node as a subscriber, which requires
gossipsub control message exchange. In the mocknet, this discovery can
race with header publication, causing the light node's syncer to get
stuck at the initial exchange head and never progress.

Fix by starting the light node AFTER blobs are submitted and the full
node has synced. The light node's initial header exchange sync then
reliably fetches all needed headers without depending on gossipsub
message delivery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- TestHeaderSubscription: wait for bridge to accumulate blocks before
  starting light node so initial exchange sync generates subscription
  events without relying on gossipsub (unreliable with FanoutOnly in
  mocknet)
- TestSyncLightAgainstFull: use fixed numBlocks height instead of
  full's moving head to avoid race where full advances past light's
  exchange sync
- TestArchivalBlobSync: reset bootstrappers after stopping archivalBN
  to prevent new nodes from connecting to the stopped node

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rootulp rootulp marked this pull request as ready for review March 12, 2026 22:43
@github-actions github-actions bot requested a review from walldiss March 12, 2026 22:43
rootulp and others added 4 commits March 12, 2026 19:19
TestHeaderSubscription: explicitly connect bridge and light nodes via
mocknet.ConnectPeers, wait for initial sync, and add a delay for
gossipsub heartbeat exchange before subscribing. The bridge uses
FanoutOnly mode which requires subscription metadata propagation.

TestArchivalBlobSync: use assert.Eventually instead of assert.False
for EDS pruning checks. Pruning FNs fetch blocks from core during
catch-up and the pruner needs time to remove data outside the
availability window.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TestArchivalBlobSync: the prunerOpts only replaced time.Duration, but
the pruner service uses share.Window for its availability window. Add
fx.Replace(modshare.Window(testAvailWindow)) for pruning nodes so the
pruner actually uses the 1ms window to prune historical data.

TestShrexFromLightsWithBadFulls: setTimeInterval was using testTimeout
(10min) as the AdvertiseInterval, preventing peer discovery for the
entire test duration. Use a fixed 20s interval instead.

TestHeaderSubscription: revert changes since the test already used
DefaultTestTimeout and the gossipsub issue with FanoutOnly in mocknet
is a pre-existing problem requiring deeper fixes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Start the light node after blocks are filled and the bridge has synced
them. This ensures the light's initial header exchange sync fetches all
needed headers without relying on gossipsub (unreliable with FanoutOnly
bridge nodes in mocknet).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ync flakes

TestShrexFromLights: restructure to start light node after blocks are
filled so initial header exchange sync fetches all needed headers without
relying on gossipsub (same pattern that fixed TestShrexFromLightsWithBadFulls).

TestArchivalBlobSync: close the stopped archivalBN's host so that future
createPeer/LinkAll calls don't re-link new nodes to it. Without this,
new nodes may inadvertently reach archivalBN via shrex and get "protocols
not supported" errors since its handlers are gone after Stop().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rootulp rootulp marked this pull request as draft March 13, 2026 18:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

external Issues created by non node team members kind:fix Attached to bug-fixing PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test(integration): Widespread flakes due to slow header syncing in Swamp test network

2 participants