Ensure consistent cleanup of range sync request tracking #8890
jimmygchen wants to merge 16 commits into sigp:release-v8.1
Conversation
Force-pushed 7207df2 to 135f086
Force-pushed 135f086 to df917a8
pawanjay176 left a comment
Looks good mostly, just curious what you think about the backfill case.
```rust
.map_err(|e| {
    // Clean up the components_by_range_requests entry before returning error
    self.components_by_range_requests
        .retain(|key, _| key.id != id);
```
Damn, this is hard to reason about
```rust
    pub custody_backfill_batches: usize,
}

#[cfg(test)]
```
For clarity, can we put the cfg(test) on each function to make it extra clear to maintainers that this specific function is for tests? We have 3 impl blocks in this one file.
This is an impl with TestBeaconChainType<E> as the generic parameter, hence the entire block is cfg(test), because the type TestBeaconChainType is also cfg(test).
I'll come back to this later once we get a working fix. I'm not convinced this PR fixes the issue.
lol ignore my comment above, I got confused by the collapsed view too. The test functions have been moved to the impl<E: EthSpec> SyncNetworkContext<TestBeaconChainHarnessType<E>> block
Right, so it looks like there's nothing that needs addressing here?
```rust
            debug!(id = chain.id(), ?sync_type, reason = ?remove_reason, op, "Chain removed");
        }

        network.remove_components_by_range_for_chain(chain.id());
```
Why are components_by_range requests stale in the first place? They should either:
- be dropped if they have no peers, or time out if they're allowed to have 0 peers
- fail after N retries

In no case should they linger there forever.
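The invariant being described can be sketched in a few lines of Rust. This is a toy illustration only; `RequestTracker`, `TrackedRequest`, and `on_failure` are hypothetical names, not Lighthouse's actual types. The point is that a failure has exactly two outcomes, retry or removal, so no entry can linger forever:

```rust
use std::collections::HashMap;

struct TrackedRequest {
    retries: usize,
}

struct RequestTracker {
    max_retries: usize,
    requests: HashMap<u64, TrackedRequest>,
}

impl RequestTracker {
    fn new(max_retries: usize) -> Self {
        Self { max_retries, requests: HashMap::new() }
    }

    fn insert(&mut self, id: u64) {
        self.requests.insert(id, TrackedRequest { retries: 0 });
    }

    // On failure there are exactly two outcomes: retry (the entry stays and
    // the retry count grows) or give up (the entry is removed). There is no
    // third outcome, so entries cannot outlive their retry budget.
    fn on_failure(&mut self, id: u64) -> bool {
        let will_retry = match self.requests.get_mut(&id) {
            Some(req) if req.retries < self.max_retries => {
                req.retries += 1;
                true
            }
            _ => false,
        };
        if !will_retry {
            self.requests.remove(&id);
        }
        will_retry
    }
}

fn main() {
    let mut tracker = RequestTracker::new(2);
    tracker.insert(1);
    assert!(tracker.on_failure(1)); // retry 1
    assert!(tracker.on_failure(1)); // retry 2
    assert!(!tracker.on_failure(1)); // retries exhausted, entry removed
    assert!(tracker.requests.is_empty());
    println!("ok");
}
```

The stale entries discussed in this PR are paths where the failure handler returned without taking either branch.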
Good point, I'll remove remove_components_by_range_for_chain
Just to add: in the normal flow they shouldn't linger (that's what the two fixes in this PR address), and this was just being defensive, but it hides the problem. With the fixes in place this is unnecessary, so I've removed it.
Some required checks have failed. Could you please take a look @jimmygchen? 🙏
Force-pushed 7d8102f to e84e33c
Force-pushed b61561d to 5bad6b5
@dapplion I've pushed a fix for a 3rd issue I found. I know the fix, but I have not yet had the chance to self-review the actual implementation, so I've marked this accordingly. I'm also in the process of adding a test.
```rust
/// no custody peer is available for the retry.
#[test]
fn retry_columns_by_range_cleans_up_on_no_peers() {
    use lighthouse_network::rpc::BlocksByRangeRequest;
```
Why does Claude do these random imports inside functions?
I've actually told Claude NOT to do it and to fix them, but it looks like it either missed this one, or added a new test that followed the old pattern 💀
```rust
    peer_id: block_peer_0,
    beacon_block: None,
    seen_timestamp: D,
});
```
@jimmygchen and you complained about the verbosity of my tests? 😆
I'll manually trim this down now
```rust
// Clean up any orphaned components_by_range entries for backfill.
network.remove_components_by_range_requests(|r| {
    matches!(r, RangeRequestId::BackfillSync { .. })
});
```
I still don't get why this clean up is necessary
Network requests are finite by definition: libp2p enforces timeouts and we (should) have a finite number of retries. So eventually all requests either succeed or fail. If a request completes (ok or not) for a backfill run or forward sync chain that no longer exists, that's fine. But if components_by_range_requests entries never expire, that points to an issue in the design of those requests.
Yeah I agree. This is hiding something in backfill that we should address
I think this is the sequence. Consider two in-flight batches A and B:

1. Batch A fails due to a peer failure and exceeds its retries; fail_sync clears all the batches (including B):
   - lighthouse/beacon_node/network/src/sync/backfill_sync/mod.rs, lines 366 to 369 in 6166ad2
   - lighthouse/beacon_node/network/src/sync/backfill_sync/mod.rs, lines 443 to 444 in 6166ad2
2. Batch B's response arrives; it hasn't exceeded its retries, so its entry was kept to be retried later here:
   - lighthouse/beacon_node/network/src/sync/network_context.rs, lines 790 to 797 in f4a6b8d
3. It then reaches inject_error, which is supposed to perform the retry; however, the batch no longer exists, so it never performs the retry and the request stays in the map:
   - lighthouse/beacon_node/network/src/sync/backfill_sync/mod.rs, lines 373 to 374 in 6166ad2

So I think the correct fix here is to clear the requests when the batches are cleared in step 1, which is exactly what this fix does.
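A minimal Rust sketch of that sequence and fix. All names here (BackfillSketch, fail_sync, on_response) are hypothetical stand-ins, not Lighthouse's actual types; it only illustrates the invariant that request tracking must be cleared in the same step as the batches it references:

```rust
use std::collections::HashMap;

// Hypothetical stand-in types, not Lighthouse's actual API.
struct BackfillSketch {
    batches: HashMap<u64, &'static str>, // batch id -> batch label
    range_requests: HashMap<u64, u64>,   // request id -> batch id
}

impl BackfillSketch {
    // Step 1: a fatal failure clears every batch. The fix: clear the
    // request-tracking map in the same step, so a late response for a
    // cleared batch cannot leave an orphaned entry behind.
    fn fail_sync(&mut self) {
        self.batches.clear();
        self.range_requests.clear();
    }

    // Steps 2-3: a late response arrives. Without the fix, a missing batch
    // meant "no retry" and the request entry stayed in the map forever.
    fn on_response(&mut self, request_id: u64) {
        let batch_exists = self
            .range_requests
            .get(&request_id)
            .is_some_and(|batch_id| self.batches.contains_key(batch_id));
        if !batch_exists {
            self.range_requests.remove(&request_id);
        }
        // else: retry the batch
    }
}

fn main() {
    let mut sync = BackfillSketch {
        batches: HashMap::from([(0, "A"), (1, "B")]),
        range_requests: HashMap::from([(10, 0), (11, 1)]),
    };
    sync.fail_sync(); // batch A exceeded its retries, everything cleared
    sync.on_response(11); // batch B's late response finds nothing to retry
    assert!(sync.range_requests.is_empty());
    println!("ok");
}
```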
```rust
    metrics::set_gauge_vec(&metrics::SYNC_ACTIVE_NETWORK_REQUESTS, &[id], count as i64);
}

// Detect stale components_by_range entries (older than 60s)
```
This is not useful and can be removed
```rust
    Ok(peers) => peers,
    Err(e) => {
        let id = ComponentsByRangeRequestId { id, requester };
        self.components_by_range_requests.remove(&id);
```
Not sure if this is correct? If we can't find peers, we shouldn't just remove the components_by_range request.
🤖 Tracing through the code: when retry_columns_by_range fails here, retry_partial_batch swallows the error and returns Ok(KeepChain). The batch stays in AwaitingDownload. Later, attempt_send_awaiting_download_batches picks it up and calls send_batch → block_components_by_range_request, which creates a new entry with a new ID. So the old entry is orphaned — nobody calls retry_columns_by_range with the old ID again.
That said, removing the entry means send_batch re-downloads blocks that were already successfully received. Ideally the system would keep the entry and route the AwaitingDownload batch back through retry_columns_by_range (column-only retry) when peers become available, rather than through send_batch (full batch retry). But attempt_send_awaiting_download_batches doesn't currently distinguish "needs column-only retry" from "needs full batch download."
So the removal is correct given the current flow, but it points to a gap — there's no path to resume a column-only retry after a transient peer shortage.
```rust
let data_column_requests = match data_column_requests {
    Ok(reqs) => reqs,
    Err(e) => {
        self.components_by_range_requests.remove(&id);
```
Not sure if this is correct. If we fail to send a single column by range request, we probably shouldn't remove the components_by_range_request? Otherwise it won't be able to retry.
🤖 Same situation as above. When this fails, reinsert_failed_column_requests is never called, so the entry's internal requests map still has the old (completed) sub-request IDs, not the new retry ones. Any responses from successfully-sent requests would hit add_custody_columns → requests.get_mut(&req_id) → "unknown data columns by range req_id" → error removal at line 774. And if the first send fails (.collect() short-circuits), no responses arrive at all — permanent leak.
The batch then falls through to AwaitingDownload → send_batch which creates a new entry, same as the peer selection case.
So keeping the entry doesn't help with the current flow — the entry can't be reused because the retry sub-requests were never registered in it. The deeper fix would be the same: handle partial send failures more granularly (register successful sends, track failed columns for later retry) so the entry can be reused.
```rust
        }
        // Request is complete — always remove the entry. On error, the caller
        // will create a new entry for the retry.
        entry.remove();
```
Does this mean that if we previously failed with a coupling error, the request stays in custody_backfill_data_column_batch_requests forever, and a new one gets created on retry?
🤖 Yes. Tracing the flow: when responses() returns Err(DataColumnPeerFailure{exceeded_retries: false}), the entry is kept (is_ok() is false → no removal). The error propagates to custody_backfill_sync::on_data_column_response → download_failed → Continue → send_batch → custody_backfill_data_columns_batch_request, which creates a new entry with a new CustodyBackFillBatchRequestId. The old entry is never referenced again.
So always-remove here is correct — the old entry is dead once the retry creates a fresh one.
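The always-remove-on-completion pattern can be shown with the standard HashMap entry API. A sketch only, with a hypothetical `on_complete` function and simplified types (a Vec of column values standing in for the aggregated response): the entry is removed whether the result is Ok or Err, because any retry gets a fresh entry under a new id:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Hypothetical helper: complete an aggregated request and always drop its
// tracking entry, regardless of outcome.
fn on_complete(
    requests: &mut HashMap<u64, Vec<u32>>,
    id: u64,
) -> Option<Result<Vec<u32>, &'static str>> {
    if let Entry::Occupied(entry) = requests.entry(id) {
        // Remove unconditionally: the caller creates a fresh entry (with a
        // new id) for any retry, so keeping this one on error would leak it.
        let columns = entry.remove();
        if columns.is_empty() {
            return Some(Err("peer failure; caller retries with a new id"));
        }
        return Some(Ok(columns));
    }
    None // unknown id
}

fn main() {
    let mut requests = HashMap::from([(1u64, vec![]), (2u64, vec![3u32, 4])]);
    assert!(on_complete(&mut requests, 1).unwrap().is_err());
    assert!(on_complete(&mut requests, 2).unwrap().is_ok());
    // Both entries are gone regardless of outcome.
    assert!(requests.is_empty());
    println!("ok");
}
```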
Pushing this out of v8.1.1 as this still needs further refinement and testing, and only affects some edge cases.
Issue Addressed
Ensure SyncNetworkContext request-tracking maps are cleaned up on all exit paths.

SyncNetworkContext tracks active range sync requests in two maps: components_by_range_requests and custody_backfill_data_column_batch_requests. Several code paths left stale entries in these maps:

- components_by_range_requests:
  - retry_columns_by_range: when peer selection or request sending fails mid-retry, the entry created for the original request was left behind.
- custody_backfill_data_column_batch_requests:
  - custody_backfill_data_columns_response: when the aggregated response completed with a DataColumnPeerFailure (missing or bad columns), the entry was retained. The caller creates a new entry for the retry, so the old one should always be removed on completion.