Ensure consistent cleanup of range sync request tracking #8890
jimmygchen wants to merge 16 commits into sigp:release-v8.1
Conversation
Force-pushed 7207df2 to 135f086
Force-pushed 135f086 to df917a8
pawanjay176 left a comment
Looks good mostly, just curious what you think about the backfill case.
```rust
.map_err(|e| {
    // Clean up the components_by_range_requests entry before returning error
    self.components_by_range_requests
        .retain(|key, _| key.id != id);
```
Damn, this is hard to reason about
```rust
    pub custody_backfill_batches: usize,
}

#[cfg(test)]
```
For clarity, can we put the cfg(test) on each function to make it extra clear to maintainers that this specific function is for tests? We have 3 impl blocks in this one file.
This is an impl with TestBeaconChainType<E> as the generic parameter, hence the entire block is cfg(test), because the type TestBeaconChainType is also cfg(test).
I'll come back to this later once we get a working fix. I'm not convinced this PR fixes the issue.
lol ignore my comment above, I got confused by the collapsed view too. The test functions have been moved to the impl<E: EthSpec> SyncNetworkContext<TestBeaconChainHarnessType<E>> block
Right, so it looks like there's nothing that needs addressing here?
```rust
            debug!(id = chain.id(), ?sync_type, reason = ?remove_reason, op, "Chain removed");
        }

        network.remove_components_by_range_for_chain(chain.id());
```
Why are components_by_range requests stale in the first place? They should either:
- be dropped if they have no peers, or time out if they're allowed to have 0 peers
- fail after N retries

In no case should they linger there forever.
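The invariant being described can be sketched in a few lines of Rust. This is a toy illustration only; `RequestTracker`, `TrackedRequest`, and `on_failure` are hypothetical names, not Lighthouse's actual types. The point is that a failure has exactly two outcomes, retry or removal, so no entry can linger forever:

```rust
use std::collections::HashMap;

struct TrackedRequest {
    retries: usize,
}

struct RequestTracker {
    max_retries: usize,
    requests: HashMap<u64, TrackedRequest>,
}

impl RequestTracker {
    fn new(max_retries: usize) -> Self {
        Self { max_retries, requests: HashMap::new() }
    }

    fn insert(&mut self, id: u64) {
        self.requests.insert(id, TrackedRequest { retries: 0 });
    }

    // On failure there are exactly two outcomes: retry (the entry stays and
    // the retry count grows) or give up (the entry is removed). There is no
    // third outcome, so entries cannot outlive their retry budget.
    fn on_failure(&mut self, id: u64) -> bool {
        let will_retry = match self.requests.get_mut(&id) {
            Some(req) if req.retries < self.max_retries => {
                req.retries += 1;
                true
            }
            _ => false,
        };
        if !will_retry {
            self.requests.remove(&id);
        }
        will_retry
    }
}

fn main() {
    let mut tracker = RequestTracker::new(2);
    tracker.insert(1);
    assert!(tracker.on_failure(1)); // retry 1
    assert!(tracker.on_failure(1)); // retry 2
    assert!(!tracker.on_failure(1)); // retries exhausted, entry removed
    assert!(tracker.requests.is_empty());
    println!("ok");
}
```

The stale entries discussed in this PR are paths where the failure handler returned without taking either branch.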
Good point, I'll remove remove_components_by_range_for_chain
Just to add: in the normal flow they shouldn't linger (that's what the two fixes in this PR address), and this was just being defensive, but it hides the problem. With the fixes in place this is unnecessary, so I've removed it.
Some required checks have failed. Could you please take a look @jimmygchen? 🙏
Force-pushed 7d8102f to e84e33c
Force-pushed b61561d to 5bad6b5
@dapplion I've pushed a fix for a 3rd issue I found. I know the fix, but I have not yet had the chance to self-review the actual implementation, so I've marked this accordingly. I'm also in the process of adding a test.
```rust
/// no custody peer is available for the retry.
#[test]
fn retry_columns_by_range_cleans_up_on_no_peers() {
    use lighthouse_network::rpc::BlocksByRangeRequest;
```
Why does Claude do these random imports inside functions?
I've actually told Claude NOT to do it and to fix them, but it looks like it either missed this one, or added a new test that followed the old pattern 💀
```rust
    peer_id: block_peer_0,
    beacon_block: None,
    seen_timestamp: D,
});
```
@jimmygchen and you complained about the verbosity of my tests? 😆
I'll manually trim this down now
```rust
// Clean up any orphaned components_by_range entries for backfill.
network.remove_components_by_range_requests(|r| {
    matches!(r, RangeRequestId::BackfillSync { .. })
});
```
I still don't get why this clean up is necessary
Network requests are finite by definition: libp2p enforces timeouts and we (should) have a finite number of retries. So eventually all requests either succeed or fail. If a request completes (ok or not) for a backfill run or forward sync chain that no longer exists, that's fine. But if components_by_range_requests entries never expire, that points to an issue in the design of those requests.
Yeah I agree. This is hiding something in backfill that we should address
I think this is the sequence. Consider two in-flight batches A and B:

1. Batch A fails due to a peer failure and exceeds its retries; fail_sync clears all the batches (including B):
   - lighthouse/beacon_node/network/src/sync/backfill_sync/mod.rs, lines 366 to 369 in 6166ad2
   - lighthouse/beacon_node/network/src/sync/backfill_sync/mod.rs, lines 443 to 444 in 6166ad2
2. Batch B's response arrives; it hasn't exceeded its retries, so its entry was kept to be retried later here:
   - lighthouse/beacon_node/network/src/sync/network_context.rs, lines 790 to 797 in f4a6b8d
3. It then reaches inject_error, which is supposed to perform the retry; however, the batch no longer exists, so it never performs the retry and the request stays in the map:
   - lighthouse/beacon_node/network/src/sync/backfill_sync/mod.rs, lines 373 to 374 in 6166ad2

So I think the correct fix here is to clear the requests when the batches are cleared in step 1, which is exactly what this fix does.
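A minimal Rust sketch of that sequence and fix. All names here (BackfillSketch, fail_sync, on_response) are hypothetical stand-ins, not Lighthouse's actual types; it only illustrates the invariant that request tracking must be cleared in the same step as the batches it references:

```rust
use std::collections::HashMap;

// Hypothetical stand-in types, not Lighthouse's actual API.
struct BackfillSketch {
    batches: HashMap<u64, &'static str>, // batch id -> batch label
    range_requests: HashMap<u64, u64>,   // request id -> batch id
}

impl BackfillSketch {
    // Step 1: a fatal failure clears every batch. The fix: clear the
    // request-tracking map in the same step, so a late response for a
    // cleared batch cannot leave an orphaned entry behind.
    fn fail_sync(&mut self) {
        self.batches.clear();
        self.range_requests.clear();
    }

    // Steps 2-3: a late response arrives. Without the fix, a missing batch
    // meant "no retry" and the request entry stayed in the map forever.
    fn on_response(&mut self, request_id: u64) {
        let batch_exists = self
            .range_requests
            .get(&request_id)
            .is_some_and(|batch_id| self.batches.contains_key(batch_id));
        if !batch_exists {
            self.range_requests.remove(&request_id);
        }
        // else: retry the batch
    }
}

fn main() {
    let mut sync = BackfillSketch {
        batches: HashMap::from([(0, "A"), (1, "B")]),
        range_requests: HashMap::from([(10, 0), (11, 1)]),
    };
    sync.fail_sync(); // batch A exceeded its retries, everything cleared
    sync.on_response(11); // batch B's late response finds nothing to retry
    assert!(sync.range_requests.is_empty());
    println!("ok");
}
```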
```rust
    metrics::set_gauge_vec(&metrics::SYNC_ACTIVE_NETWORK_REQUESTS, &[id], count as i64);
}

// Detect stale components_by_range entries (older than 60s)
```
This is not useful and can be removed
```rust
    Ok(peers) => peers,
    Err(e) => {
        let id = ComponentsByRangeRequestId { id, requester };
        self.components_by_range_requests.remove(&id);
```
Not sure if this is correct? If we can't find peers, we shouldn't just remove the components_by_range request.
🤖 Tracing through the code: when retry_columns_by_range fails here, retry_partial_batch swallows the error and returns Ok(KeepChain). The batch stays in AwaitingDownload. Later, attempt_send_awaiting_download_batches picks it up and calls send_batch → block_components_by_range_request, which creates a new entry with a new ID. So the old entry is orphaned — nobody calls retry_columns_by_range with the old ID again.
That said, removing the entry means send_batch re-downloads blocks that were already successfully received. Ideally the system would keep the entry and route the AwaitingDownload batch back through retry_columns_by_range (column-only retry) when peers become available, rather than through send_batch (full batch retry). But attempt_send_awaiting_download_batches doesn't currently distinguish "needs column-only retry" from "needs full batch download."
So the removal is correct given the current flow, but it points to a gap — there's no path to resume a column-only retry after a transient peer shortage.
```rust
let data_column_requests = match data_column_requests {
    Ok(reqs) => reqs,
    Err(e) => {
        self.components_by_range_requests.remove(&id);
```
Not sure if this is correct. If we fail to send a single column by range request, we probably shouldn't remove the components_by_range_request? Otherwise it won't be able to retry.
🤖 Same situation as above. When this fails, reinsert_failed_column_requests is never called, so the entry's internal requests map still has the old (completed) sub-request IDs, not the new retry ones. Any responses from successfully-sent requests would hit add_custody_columns → requests.get_mut(&req_id) → "unknown data columns by range req_id" → error removal at line 774. And if the first send fails (.collect() short-circuits), no responses arrive at all — permanent leak.
The batch then falls through to AwaitingDownload → send_batch which creates a new entry, same as the peer selection case.
So keeping the entry doesn't help with the current flow — the entry can't be reused because the retry sub-requests were never registered in it. The deeper fix would be the same: handle partial send failures more granularly (register successful sends, track failed columns for later retry) so the entry can be reused.
```rust
        }
        // Request is complete — always remove the entry. On error, the caller
        // will create a new entry for the retry.
        entry.remove();
```
Does this mean that if we previously failed with a coupling error, the request stays in custody_backfill_data_column_batch_requests forever, and a new one gets created on retry?
🤖 Yes. Tracing the flow: when responses() returns Err(DataColumnPeerFailure{exceeded_retries: false}), the entry is kept (is_ok() is false → no removal). The error propagates to custody_backfill_sync::on_data_column_response → download_failed → Continue → send_batch → custody_backfill_data_columns_batch_request, which creates a new entry with a new CustodyBackFillBatchRequestId. The old entry is never referenced again.
So always-remove here is correct — the old entry is dead once the retry creates a fresh one.
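The always-remove-on-completion pattern can be shown with the standard HashMap entry API. A sketch only, with a hypothetical `on_complete` function and simplified types (a Vec of column values standing in for the aggregated response): the entry is removed whether the result is Ok or Err, because any retry gets a fresh entry under a new id:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

// Hypothetical helper: complete an aggregated request and always drop its
// tracking entry, regardless of outcome.
fn on_complete(
    requests: &mut HashMap<u64, Vec<u32>>,
    id: u64,
) -> Option<Result<Vec<u32>, &'static str>> {
    if let Entry::Occupied(entry) = requests.entry(id) {
        // Remove unconditionally: the caller creates a fresh entry (with a
        // new id) for any retry, so keeping this one on error would leak it.
        let columns = entry.remove();
        if columns.is_empty() {
            return Some(Err("peer failure; caller retries with a new id"));
        }
        return Some(Ok(columns));
    }
    None // unknown id
}

fn main() {
    let mut requests = HashMap::from([(1u64, vec![]), (2u64, vec![3u32, 4])]);
    assert!(on_complete(&mut requests, 1).unwrap().is_err());
    assert!(on_complete(&mut requests, 2).unwrap().is_ok());
    // Both entries are gone regardless of outcome.
    assert!(requests.is_empty());
    println!("ok");
}
```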
Pushing this out of v8.1.1 as this still needs further refinement and testing, and only affects some edge cases.
Issue Addressed
Ensure SyncNetworkContext request-tracking maps are cleaned up on all exit paths.

SyncNetworkContext tracks active range sync requests in two maps: components_by_range_requests and custody_backfill_data_column_batch_requests. Several code paths left stale entries in these maps:

- components_by_range_requests:
  - retry_columns_by_range: when peer selection or request sending fails mid-retry, the entry created for the original request was left behind.
- custody_backfill_data_column_batch_requests:
  - custody_backfill_data_columns_response: when the aggregated response completed with a DataColumnPeerFailure (missing or bad columns), the entry was retained. The caller creates a new entry for the retry, so the old one should always be removed on completion.