
Ensure consistent cleanup of range sync request tracking #8890

Open
jimmygchen wants to merge 16 commits into sigp:release-v8.1 from jimmygchen:fix-components-by-range-leak
Conversation

@jimmygchen
Member

@jimmygchen jimmygchen commented Feb 24, 2026

Issue Addressed

Ensure SyncNetworkContext request-tracking maps are cleaned up on all exit paths.

SyncNetworkContext tracks active range sync requests in two maps: components_by_range_requests and custody_backfill_data_column_batch_requests. Several code paths left stale entries in these maps:

components_by_range_requests:

  • retry_columns_by_range: when peer selection or request sending fails mid-retry, the entry created for the original request was left behind.
  • Chain removal: when a range sync chain is removed (peer disconnect, chain failure), its associated entries were not cleaned up.

custody_backfill_data_column_batch_requests:

  • custody_backfill_data_columns_response: when the aggregated response completed with a DataColumnPeerFailure (missing or bad columns), the entry was retained. The caller creates a new entry for the retry, so the old one should always be removed on completion.

@jimmygchen jimmygchen added the work-in-progress PR is a work-in-progress label Feb 24, 2026
@jimmygchen jimmygchen force-pushed the fix-components-by-range-leak branch from 7207df2 to 135f086 Compare February 24, 2026 03:03
@jimmygchen jimmygchen force-pushed the fix-components-by-range-leak branch from 135f086 to df917a8 Compare February 24, 2026 03:06
@jimmygchen jimmygchen changed the base branch from unstable to release-v8.1 February 24, 2026 03:06
@jimmygchen jimmygchen added v8.1.1 Hotfix for v8.1.0 syncing and removed work-in-progress PR is a work-in-progress labels Feb 24, 2026
@jimmygchen jimmygchen marked this pull request as ready for review February 24, 2026 03:11
@jimmygchen jimmygchen requested a review from jxs as a code owner February 24, 2026 03:11
@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Feb 24, 2026
Member

@pawanjay176 pawanjay176 left a comment

Looks good mostly, just curious what you think about the backfill case.

.map_err(|e| {
// Clean up the components_by_range_requests entry before returning error
self.components_by_range_requests
.retain(|key, _| key.id != id);
Collaborator

Damn, this is hard to reason about

pub custody_backfill_batches: usize,
}

#[cfg(test)]
Collaborator

For clarity, can we put the cfg(test) on each function to make it extra clear to maintainers that this specific function is for tests? We have 3 impl blocks in this 1 file

Collaborator

This compacted diff threw me off; the .map_err happens in an impl that's not cfg(test)


Member Author

This is an impl with TestBeaconChainType<E> as the generic parameter, so the entire block is cfg(test), because the type TestBeaconChainType is itself cfg(test).

I'll come back to this later once we get a working fix. I'm not convinced this PR fixed the issue.

Member Author

lol, ignore my comment above; I got confused by the collapsed view too. The test functions have been moved to the impl<E: EthSpec> SyncNetworkContext<TestBeaconChainHarnessType<E>> block

Member Author

Right, so it looks like there's nothing that needs addressing here?

debug!(id = chain.id(), ?sync_type, reason = ?remove_reason, op, "Chain removed");
}

network.remove_components_by_range_for_chain(chain.id());
Collaborator

Why are components_by_range requests stale in the first place? They should either:

  • drop if there are no peers, or time out if allowed to have 0 peers
  • fail after N retries

In no case should they linger there forever

Member Author

Good point, I'll remove remove_components_by_range_for_chain

Member Author

@jimmygchen jimmygchen Feb 25, 2026

Just to add: in the normal flow they shouldn't leak (given the two fixes in this PR), and this was just being defensive, which hides the problem. With the fixes in place it's unnecessary, so I've removed it.

@mergify

mergify bot commented Feb 25, 2026

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

@mergify mergify bot added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed ready-for-review The code is ready for review labels Feb 25, 2026
@jimmygchen jimmygchen force-pushed the fix-components-by-range-leak branch from 7d8102f to e84e33c Compare February 25, 2026 05:14
@jimmygchen jimmygchen force-pushed the fix-components-by-range-leak branch from b61561d to 5bad6b5 Compare February 25, 2026 06:30
@jimmygchen
Member Author

@dapplion I've pushed a fix for a 3rd issue I found. I know the fix, but I haven't yet had the chance to self-review the actual implementation, so I've marked this as work-in-progress; it would be useful if you could look at it.

I'm also in the process of adding a test.

@jimmygchen jimmygchen added the work-in-progress PR is a work-in-progress label Feb 25, 2026
@jimmygchen jimmygchen added ready-for-review The code is ready for review and removed work-in-progress PR is a work-in-progress waiting-on-author The reviewer has suggested changes and awaits thier implementation. labels Feb 25, 2026
jimmygchen pushed a commit to jimmygchen/lighthouse that referenced this pull request Feb 25, 2026
/// no custody peer is available for the retry.
#[test]
fn retry_columns_by_range_cleans_up_on_no_peers() {
use lighthouse_network::rpc::BlocksByRangeRequest;
Collaborator

Why does Claude do these random imports inside functions?

Member Author

I've actually told Claude NOT to do this and to fix them, but it looks like it either missed this one or added a new test that followed the old pattern 💀

peer_id: block_peer_0,
beacon_block: None,
seen_timestamp: D,
});
Collaborator

@jimmygchen and you complained about the verbosity of my tests? 😆

Member Author

you got me 🤣

Member Author

I'll manually trim this down now

// Clean up any orphaned components_by_range entries for backfill.
network.remove_components_by_range_requests(|r| {
matches!(r, RangeRequestId::BackfillSync { .. })
});
Collaborator

I still don't get why this clean up is necessary

Collaborator

Network requests are finite by definition, as libp2p enforces timeouts and we (should) have a finite number of retries. So eventually all requests either succeed or fail. If a request completes (ok/nok) for a backfill run or forward sync chain that no longer exists, then fine. But if components_by_range_requests never expire, that points to an issue in the design of those

Member

@pawanjay176 pawanjay176 Feb 26, 2026

Yeah I agree. This is hiding something in backfill that we should address

Member Author

I think this is the sequence:

Consider two in-flight batches A & B:

  1. Batch A fails due to a peer failure and exceeds its retry limit; fail_sync clears all the batches (including B).
  2. Batch B's response arrives; it hasn't exceeded its retry limit, so its entry was kept to be retried later here:
  3. It then reaches inject_error, which is supposed to perform the retry; however, the batch no longer exists, so the retry never happens and the request stays in the map.
    // this could be an error for an old batch, removed when the chain advances
    Ok(())

So I think the correct fix here is to clear the requests when the batches are cleared in step 1, which is exactly what this fix does.

metrics::set_gauge_vec(&metrics::SYNC_ACTIVE_NETWORK_REQUESTS, &[id], count as i64);
}

// Detect stale components_by_range entries (older than 60s)
Member Author

This is not useful and can be removed

Ok(peers) => peers,
Err(e) => {
let id = ComponentsByRangeRequestId { id, requester };
self.components_by_range_requests.remove(&id);
Member Author

Not sure if this is correct? If we can't find peers, we shouldn't just remove the components_by_range request.

Member Author

🤖 Tracing through the code: when retry_columns_by_range fails here, retry_partial_batch swallows the error and returns Ok(KeepChain). The batch stays in AwaitingDownload. Later, attempt_send_awaiting_download_batches picks it up and calls send_batch → block_components_by_range_request, which creates a new entry with a new ID. So the old entry is orphaned — nobody calls retry_columns_by_range with the old ID again.

That said, removing the entry means send_batch re-downloads blocks that were already successfully received. Ideally the system would keep the entry and route the AwaitingDownload batch back through retry_columns_by_range (column-only retry) when peers become available, rather than through send_batch (full batch retry). But attempt_send_awaiting_download_batches doesn't currently distinguish "needs column-only retry" from "needs full batch download."

So the removal is correct given the current flow, but it points to a gap — there's no path to resume a column-only retry after a transient peer shortage.

let data_column_requests = match data_column_requests {
Ok(reqs) => reqs,
Err(e) => {
self.components_by_range_requests.remove(&id);
Member Author

Not sure if this is correct, if we fail to send a single column by range request - we probably shouldnt' remove the components_by_range_request? otherwise it won't be able to retry.

Member Author

🤖 Same situation as above. When this fails, reinsert_failed_column_requests is never called, so the entry's internal requests map still has the old (completed) sub-request IDs, not the new retry ones. Any responses from successfully-sent requests would hit add_custody_columns → requests.get_mut(&req_id) → "unknown data columns by range req_id" → error removal at line 774. And if the first send fails (.collect() short-circuits), no responses arrive at all — permanent leak.

The batch then falls through to AwaitingDownload → send_batch, which creates a new entry, same as the peer selection case.

So keeping the entry doesn't help with the current flow — the entry can't be reused because the retry sub-requests were never registered in it. The deeper fix would be the same: handle partial send failures more granularly (register successful sends, track failed columns for later retry) so the entry can be reused.

}
// Request is complete — always remove the entry. On error, the caller
// will create a new entry for the retry.
entry.remove();
Member Author

Does this mean that if we previously failed with a coupling error, the request stays in custody_backfill_data_column_batch_requests forever,
and a new one gets created on retry?

Member Author

🤖 Yes. Tracing the flow: when responses() returns Err(DataColumnPeerFailure{exceeded_retries: false}), the entry is kept (is_ok() is false → no removal). The error propagates to custody_backfill_sync::on_data_column_response → download_failed → Continue → send_batch → custody_backfill_data_columns_batch_request, which creates a new entry with a new CustodyBackFillBatchRequestId. The old entry is never referenced again.

So always-remove here is correct — the old entry is dead once the retry creates a fresh one.

@jimmygchen jimmygchen added work-in-progress PR is a work-in-progress and removed ready-for-review The code is ready for review labels Feb 26, 2026
@jimmygchen jimmygchen removed the v8.1.1 Hotfix for v8.1.0 label Feb 27, 2026
@jimmygchen
Member Author

Pushing this out of v8.1.1 as it still needs further refinement and testing, and only affects some edge cases.
Worth considering the alternative fix from @dapplion in dapplion#67

@jimmygchen jimmygchen added the bug Something isn't working label Mar 10, 2026
