
Conversation

sashaodessa

Replaces unsafe port-finding functions with secure bound socket APIs to eliminate race conditions.

Problem

The existing unused_tcp*_port() and unused_udp*_port() functions had a classic Time-of-Check-Time-of-Use (TOCTOU) vulnerability:

  • Functions would bind to port 0, get the assigned port number, then immediately close the socket
  • Between getting the port number and later binding to it, another process could claim the same port
  • The LRU cache provided a false sense of security and didn't prevent the race condition

Solution

Added new safe APIs that return already-bound sockets:

  • bind_tcp4_any() / bind_tcp6_any() → returns TcpListener
  • bind_udp4_any() / bind_udp6_any() → returns UdpSocket
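A minimal sketch of what such a bound-socket API looks like using only `std::net` (the function names mirror the PR description; the real signatures may differ):

```rust
use std::net::{TcpListener, UdpSocket};

// Bind to port 0 so the OS assigns a free ephemeral port, then return the
// still-bound socket. Because the socket stays bound, no other process can
// claim the port, eliminating the TOCTOU race.
fn bind_tcp4_any() -> std::io::Result<TcpListener> {
    TcpListener::bind("127.0.0.1:0")
}

fn bind_udp4_any() -> std::io::Result<UdpSocket> {
    UdpSocket::bind("127.0.0.1:0")
}

fn main() -> std::io::Result<()> {
    let listener = bind_tcp4_any()?;
    // The port stays reserved for as long as `listener` is alive.
    let port = listener.local_addr()?.port();
    assert!(port > 0);
    Ok(())
}
```

The caller hands the bound socket (not just the port number) to whatever needs it, so there is no window in which another process can grab the port.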

* Add co-author to mergify commits.

* Remove unnecessary pull request rules from mergify config.

* Revert automation removals

cla-assistant bot commented Sep 9, 2025

CLA assistant check
All committers have signed the CLA.


@michaelsproul
Member

Thanks!

We are aware of this issue and had been phasing these functions out in favour of using 0 ports explicitly. The hard part is replacing the uses of the old functions in code that assumes it can determine an unused port without binding a socket, e.g. here:

```rust
let port = unused_tcp4_port().expect("Unable to find unused port.");
CommandLineTest::new()
    .flag("port", Some(port.to_string().as_str()))
    .flag("allow-insecure-genesis-sync", None)
    .run()
    .with_config(|config| {
        assert_eq!(
            config.network.listen_addrs().v4().map(|listen_addr| (
                listen_addr.disc_port,
                listen_addr.quic_port,
                listen_addr.tcp_port
            )),
            // quic_port should be (tcp_port + 1) if tcp_port is given as non-zero.
            Some((port, port + 1, port))
        );
    });
```

If you would be interested in coming up with clever fixes for these cases, that would be extremely helpful. That would then allow us to delete rather than deprecate the unsafe functions.
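One hedged sketch of a fix for such tests: bind first, read the assigned port back, and build the expected values from it, dropping the reservation only at the last moment. This narrows the race window rather than closing it completely, since a test like the one above still needs to pass a bare port number on the command line:

```rust
use std::net::TcpListener;

fn main() -> std::io::Result<()> {
    // Reserve a port by binding, then read the OS-assigned number back.
    let reserved = TcpListener::bind("127.0.0.1:0")?;
    let port = reserved.local_addr()?.port() as u32;

    // Build the expected (disc_port, quic_port, tcp_port) tuple up front,
    // as in the assertion above: quic_port = tcp_port + 1.
    let expected = (port, port + 1, port);

    // Release the reservation just before the node binds. The race window
    // shrinks to this instant instead of spanning the whole test setup.
    drop(reserved);

    assert_eq!(expected, (port, port + 1, port));
    Ok(())
}
```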

@michaelsproul
Member

Most of the uses are in tests, but it's still important for us to get rid of them (if possible), as they sometimes contribute to CI flakiness.

@michaelsproul michaelsproul added waiting-on-author The reviewer has suggested changes and awaits their implementation. Networking labels Sep 9, 2025
@sashaodessa sashaodessa requested a review from jxs as a code owner September 10, 2025 08:26
@michaelsproul
Member

@sashaodessa Can you also please rebase on unstable? All PRs need to be merged to unstable.


@michaelsproul
Member

Looks good at a glance. I've requested a review from some of our networking experts on the changes that impact prod.

@michaelsproul michaelsproul changed the base branch from stable to unstable September 14, 2025 12:34
Fulu uses `getPayloadV5`; this PR updates the notifier logging prior to the fork.

Co-Authored-By: Jimmy Chen <[email protected]>
Member

@AgeManning AgeManning left a comment


These changes look good to me.

Agreed, a future PR to remove the deprecated functions would be nice.

Comment on lines +108 to +111
{% for commit in commits | unique(attribute='email_author') %}
Co-Authored-By: {{ commit.author }} <{{ commit.email_author }}>
{% endfor %}
Member


Are these changes relevant?

Member


No this is some leftover stuff from stable that hasn't been backmerged to unstable yet (see https://github.com/sigp/lighthouse/commits/stable/), it should be deleted from this PR.

@michaelsproul
Member

I think we can probably just delete the deprecated functions in this PR, AFAICT all (?) usages have been removed

michaelsproul and others added 11 commits September 16, 2025 04:10
Attempt to address performance issues caused by importing the same block multiple times.


  - Check fork choice "after" obtaining the fork choice write lock in `BeaconChain::import_block`. We actually use an upgradable read lock, but this is semantically equivalent (the upgradable read has the advantage of not excluding regular reads).

The hope is that this change has several benefits:

1. By preventing duplicate block imports we avoid repeating unnecessary work inside `import_block`, e.g. writing the state to disk. Although the store itself now takes some measures to avoid re-writing diffs, it is even better if we avoid a disk write entirely.
2. By returning `DuplicateFullyImported`, we reduce some duplicated work downstream. E.g. if multiple threads importing columns trigger `import_block`, now only _one_ of them will get a notification of the block import completing successfully, and only this one will run `recompute_head`. This should help avoid a situation where multiple beacon processor workers are consumed by threads blocking on the `recompute_head_lock`. However, a similar block-fest is still possible with the upgradable fork choice lock (a large number of threads can be blocked waiting for the first thread to complete block import).
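The duplicate-import check can be sketched with a double-checked pattern. The real code uses an upgradable read lock (e.g. `parking_lot`'s), which `std` lacks, so this sketch re-checks under a plain write lock; the names are illustrative:

```rust
use std::collections::HashSet;
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
enum ImportOutcome {
    Imported,
    DuplicateFullyImported,
}

fn import_block(known_blocks: &RwLock<HashSet<u64>>, block_root: u64) -> ImportOutcome {
    // Fast path: a cheap read-only check that doesn't block other readers.
    if known_blocks.read().unwrap().contains(&block_root) {
        return ImportOutcome::DuplicateFullyImported;
    }
    let mut guard = known_blocks.write().unwrap();
    // Re-check under the write lock: another thread may have imported the
    // block between our read and our write. This is what stops two threads
    // from both doing the expensive import work.
    if !guard.insert(block_root) {
        return ImportOutcome::DuplicateFullyImported;
    }
    // ... expensive import work (state writes, etc.) would happen here ...
    ImportOutcome::Imported
}

fn main() {
    let known = RwLock::new(HashSet::new());
    assert_eq!(import_block(&known, 1), ImportOutcome::Imported);
    assert_eq!(import_block(&known, 1), ImportOutcome::DuplicateFullyImported);
}
```

An upgradable read has the advantage described above: readers are not excluded while the first check runs.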


Co-Authored-By: Michael Sproul <[email protected]>
…nd instrument tracing (sigp#8052)

Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Jimmy Chen <[email protected]>
Lookup sync has a cache of block roots, `failed_chains`. If a peer triggers a lookup for a block in, or a descendant of, a root in `failed_chains`, the lookup is dropped and the peer penalized. However, blocks are inserted into `failed_chains` for a single reason:
- If a chain is longer than 32 blocks, the lookup is dropped to prevent OOM risks. However, the peer is not at fault, since discovering an unknown chain longer than 32 blocks is not malicious. We just drop the lookup and sync the blocks via range forward sync.

This discrepancy is probably an oversight when changing old code. Before we used to add blocks that failed too many times to process to that cache. However, we don't do that anymore. Adding a block that fails too many times to process is an optimization to save resources in rare cases where peers keep sending us invalid blocks. In case that happens, today we keep trying to process the block, downscoring the peers and eventually disconnecting them. _IF_ we found that optimization to be necessary we should merge this PR (_Stricter match of BlockError in lookup sync_) first. IMO we are fine without the failed_chains cache and the ignored_chains cache will be obsolete with [tree sync](sigp#7678) as the OOM risk of long lookup chains does not exist anymore.

Closes sigp#7577


  Rename `failed_chains` to `ignored_chains` and don't penalize peers that trigger lookups for those blocks


Co-Authored-By: dapplion <[email protected]>
* Update issue template

* Delete old issue template
Fix a memory leak in the reprocess queue.


  If the vec of attestation IDs for a block is never evicted from the reprocess queue by a `BlockImported` event, then it stays in the map forever consuming memory. The fix is to remove the entry when its last attestation times out. We do similarly for light client updates.

In practice this will only occur if there is a race between adding an attestation to the queue and processing the `BlockImported` event, or if there are attestations for block roots that we never import (e.g. random block roots, block roots of invalid blocks).
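The eviction rule can be sketched as a simplified model of the reprocess queue (the real types, and the light client update path, are more involved):

```rust
use std::collections::HashMap;

type BlockRoot = u64; // stand-in for a 32-byte root
type AttestationId = u64;

/// Called when a queued attestation times out. If it was the last one queued
/// for its block root, drop the whole entry so the map cannot grow forever
/// for roots that are never imported.
fn on_attestation_timeout(
    queued: &mut HashMap<BlockRoot, Vec<AttestationId>>,
    root: BlockRoot,
    id: AttestationId,
) {
    if let Some(ids) = queued.get_mut(&root) {
        ids.retain(|queued_id| *queued_id != id);
        if ids.is_empty() {
            // Last attestation for this root timed out: evict the entry.
            queued.remove(&root);
        }
    }
}

fn main() {
    let mut queued = HashMap::new();
    queued.insert(7, vec![1, 2]);
    on_attestation_timeout(&mut queued, 7, 1);
    assert!(queued.contains_key(&7)); // one attestation still pending
    on_attestation_timeout(&mut queued, 7, 2);
    assert!(!queued.contains_key(&7)); // entry evicted, no leak
}
```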


Co-Authored-By: Michael Sproul <[email protected]>
michaelsproul and others added 25 commits September 22, 2025 11:03
Ensure that we don't log a warning for HTTP 202s, which are expected on the blinded block endpoints after Fulu.


Co-Authored-By: Michael Sproul <[email protected]>
No issue


  Use HTTPS for dependency


Co-Authored-By: Antonio Viggiano <[email protected]>
Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Eitan Seri-Levi <[email protected]>

Co-Authored-By: Eitan Seri-Levi <[email protected]>
…uled (sigp#8109)

sigp#8105 (to be confirmed)

I noticed a large number of failed discovery requests after deploying the latest `unstable` to some of our testnet and mainnet nodes. This is because of a recent PeerDAS change to attempt to maintain sufficient peers across data column subnets - this shouldn't be enabled on networks without PeerDAS scheduled, otherwise it will keep retrying discovery on these subnets and never succeed.

Also removed some unused files.


  


Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Jimmy Chen <[email protected]>
…igp#8112)

- PR sigp#8045 introduced a regression in how lookup sync interacts with the da_checker.

Now, on unstable, block import from the HTTP API also inserts the block into the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state, it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import in gossip.

I confirmed in our node's logs that 4/4 cases of stuck lookups are caused by this sequence of events:
- Receive block through API, insert into da_checker in fn process_block in put_pre_execution_block
- Create lookup and leave in AwaitingDownload(block in processing cache) state
- Block from HTTP API finishes importing
- Lookup is left stuck

Closes sigp#8104


  - sigp#8110 was my initial solution attempt but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy.

For a given node it's rare that a lookup is created at the same time that a block is being published. This PR solves sigp#8104 by allowing lookup sync to import the block twice in that case.


Co-Authored-By: dapplion <[email protected]>
Holesky - sigp#8096
Hoodi - sigp#8097
Sepolia - sigp#8099


  Testnet configs for Holesky, Hoodi and Sepolia

Holesky - eth-clients/holesky#132
Hoodi - eth-clients/hoodi#21
Sepolia - eth-clients/sepolia#111


Co-Authored-By: Eitan Seri-Levi <[email protected]>
As identified by a researcher during the Fusaka security competition, we were computing the proposer index incorrectly in some places by computing without lookahead.


  - [x] Add "low level" checks to computation functions in `consensus/types` to ensure they error cleanly
- [x] Re-work the determination of proposer shuffling decision roots, which are now fork aware.
- [x] Re-work and simplify the beacon proposer cache to be fork-aware.
- [x] Optimise `with_proposer_cache` to use `OnceCell`.
- [x] All tests passing.
- [x] Resolve all remaining `FIXME(sproul)`s.
- [x] Unit tests for `ProtoBlock::proposer_shuffling_root_for_child_block`.
- [x] End-to-end regression test.
- [x] Test on pre-Fulu network.
- [x] Test on post-Fulu network.


Co-Authored-By: Michael Sproul <[email protected]>
N/A


  In sigp#8101, when we modified the logic to get the proposer index post-Fulu, we seem to have missed advancing the state at the fork boundaries to get the right `Fork` for signature verification.
This led to Lighthouse failing all gossip verification right after transitioning to Fulu, which was observed on the Holesky shadow fork:
```
Sep 26 14:24:00.088 DEBUG Rejected gossip block                         error: "InvalidSignature(ProposerSignature)", graffiti: "grandine-geth-super-1", slot: 640
Sep 26 14:24:00.099 WARN  Could not verify block for gossip. Rejecting the block  error: InvalidSignature(ProposerSignature)
```

I'm not completely sure this is the correct fix, but this fixes the issue with `InvalidProposerSignature` on the holesky shadow fork.

Thanks to @eserilev for helping debug this


Co-Authored-By: Pawan Dhananjay <[email protected]>
Follow-up to the bug fixed in:

- sigp#8121

This fixes the root cause of that bug, which was introduced by me in:

- sigp#8101

Lion identified the issue here:

- sigp#8101 (comment)


  In the methods that compute the proposer shuffling decision root, ensure we don't use lookahead for the Fulu fork epoch itself. This is accomplished by checking if Fulu is enabled at `epoch - 1`, i.e. if `epoch > fulu_fork_epoch`.

I haven't updated the methods that _compute_ shufflings to use these new corrected bounds (e.g. `BeaconState::compute_proposer_indices`), although we could make this change in future. The `get_beacon_proposer_indices` method already gracefully handles the Fulu boundary case by using the `proposer_lookahead` field (if initialised).
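The boundary rule described above reduces to a small predicate (illustrative names, not Lighthouse's actual API):

```rust
/// Lookahead applies only if Fulu was already enabled at `epoch - 1`,
/// i.e. strictly after the Fulu fork epoch itself.
fn shuffling_uses_lookahead(epoch: u64, fulu_fork_epoch: u64) -> bool {
    epoch > fulu_fork_epoch
}

fn main() {
    let fulu_fork_epoch = 100;
    // At the fork epoch itself, Fulu was not enabled at epoch - 1,
    // so the pre-Fulu (no-lookahead) computation applies.
    assert!(!shuffling_uses_lookahead(100, fulu_fork_epoch));
    // One epoch later, lookahead applies.
    assert!(shuffling_uses_lookahead(101, fulu_fork_epoch));
}
```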


Co-Authored-By: Michael Sproul <[email protected]>
Testnet release for the upcoming Fusaka fork.


  


Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Jimmy Chen <[email protected]>
Bump `superstruct` to the latest release `0.10.0`.
This version uses a later version of `darling` which is helpful for sigp#8125


Co-Authored-By: Mac L <[email protected]>
- [x] Remove the unnecessary `_MILLIS` suffix from `MAXIMUM_GOSSIP_CLOCK_DISPARITY`
- [x] Add missing Deneb preset `KZG_COMMITMENT_INCLUSION_PROOF_DEPTH`, not to be confused with `KZG_COMMITMENTS_INCLUSION_PROOF_DEPTH` (plural) from Fulu...


Co-Authored-By: Michael Sproul <[email protected]>
@michaelsproul noticed this warning on a devnet-3 node

```
Oct 01 16:37:29.896 WARN  Error when importing rpc custody columns      error: ParentUnknown { parent_root: 0xe4cc85a2137b76eb083d7076255094a90f10caaec0afc8fd36807db742f6ff13 }, block_hash: 0x43ce63b2344990f5f4d8911b8f14e3d3b6b006edc35bbc833360e667df0edef7
```

We're also seeing similar `WARN` logs for blobs on our live nodes.

It's normal to get parent unknown in lookups and it's handled here
https://github.com/sigp/lighthouse/blob/a134d43446f776fe2a84f420854afbff76ca93d8/beacon_node/network/src/sync/block_lookups/mod.rs#L611-L619

These shouldn't be a `WARN`, and we also log the same error in block lookups at `DEBUG` level here:
https://github.com/sigp/lighthouse/blob/a134d43446f776fe2a84f420854afbff76ca93d8/beacon_node/network/src/sync/block_lookups/mod.rs#L643-L648

So I've removed these extra `WARN` logs.

I've also lowered the level of an `ERROR` log when unable to serve data column root requests - it's unexpected, but unlikely to impact the node's performance, so I think we can downgrade it.


  


Co-Authored-By: Jimmy Chen <[email protected]>
Co-Authored-By: Eitan Seri-Levi <[email protected]>
N/A


  Post-Fulu, we should be calling the v2 API on the relays, which doesn't return blobs/data columns.

However, we decided to start hitting the v2 API as soon as Fulu is scheduled, to avoid unexpected surprises at the fork.
On the ACDT call, it seemed most clients call v2 only after the Fulu fork.
This PR aims for the best of both worlds: we fall back to the v1 API if v2 fails. This way, we know beforehand if relays don't support it and can potentially alert them.
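The fallback can be sketched generically, with closures standing in for the real relay HTTP calls (names are illustrative):

```rust
/// Try the v2 endpoint first; if it errors for any reason, fall back to v1.
/// A real implementation would also log the v2 failure so relay operators
/// can be alerted before the fork makes v2 mandatory.
fn get_payload_with_fallback<T, E>(
    v2: impl FnOnce() -> Result<T, E>,
    v1: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    v2().or_else(|_v2_err| v1())
}

fn main() {
    // Relay that supports v2: v1 is never consulted.
    let r = get_payload_with_fallback(|| Ok::<_, &str>("v2 payload"), || Ok("v1 payload"));
    assert_eq!(r, Ok("v2 payload"));

    // Relay without v2 support: we fall back to v1.
    let r = get_payload_with_fallback(|| Err("unsupported"), || Ok("v1 payload"));
    assert_eq!(r, Ok("v1 payload"));
}
```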


Co-Authored-By: Pawan Dhananjay <[email protected]>
Use quoted integers for `state.proposer_lookahead` when serializing JSON. This is standard for all integer fields, but was missed for the newly added proposer lookahead. I noticed this issue while inspecting the head state on a local devnet.

I'm glad we found this before someone reported it :P


Co-Authored-By: Michael Sproul <[email protected]>
Closes sigp#8131

- [x] Remove deprecated flags from beacon_node/src/cli.rs:
- [x] eth1-purge-cache
- [x] eth1-blocks-per-log-query
- [x] eth1-cache-follow-distance
- [x] disable-deposit-contract-sync
- [x] light-client-server
- [x] Remove deprecated flags from lighthouse/src/main.rs:
- [x] logfile
- [x] terminal-total-difficulty-override
- [x] terminal-block-hash-override
- [x] terminal-block-hash-epoch-override
- [x] safe-slots-to-import-optimistically
- [x] Remove references to deprecated flags in config.rs files
- [x] Remove warning messages for deprecated flags in main.rs
- [x] Update/remove related tests in beacon_node.rs


  


Co-Authored-By: Jimmy Chen <[email protected]>

Co-Authored-By: Jimmy Chen <[email protected]>
Closes:

- sigp#4412

This should reduce Lighthouse's block proposal times on Holesky and prevent us getting reorged.


  - [x] Allow the head state to be advanced further than 1 slot. This lets us avoid epoch processing on hot paths including block production, by having new epoch boundaries pre-computed and available in the state cache.
- [x] Use the finalized state to prune the op pool. We were previously using the head state and trying to infer slashing/exit relevance based on `exit_epoch`. However some exit epochs are far in the future, despite occurring recently.


Co-Authored-By: Michael Sproul <[email protected]>
N/A


  Update c-kzg and rust-eth-kzg to their latest versions. Also removes the patch version hardcoding in Cargo.toml.


Co-Authored-By: Pawan Dhananjay <[email protected]>
* sigp#8085


  


Co-Authored-By: Tan Chee Keong <[email protected]>

Co-Authored-By: chonghe <[email protected]>
…of ignoring it (sigp#8179)

This issue was identified during the fusaka audit competition.

The [`verify_parent_block_and_finalized_descendant`](https://github.com/sigp/lighthouse/blob/62d9302e0f9dd9f94d0325411a3029b36ad90685/beacon_node/beacon_chain/src/data_column_verification.rs#L606-L627) function in data column gossip verification currently loads the parent before checking whether the column descends from the finalized root.

However, the `fork_choice.get_block(&block_parent_root)` function also makes the same check internally:

https://github.com/sigp/lighthouse/blob/8a4f6cf0d5b6b261b2c3439ce7c05383a53d30c5/consensus/fork_choice/src/fork_choice.rs#L1242-L1249

Therefore, if the column does not descend from the finalized root, we return an `UnknownParent` error before hitting the `is_finalized_checkpoint_or_descendant` check just below.

This means we `IGNORE` the gossip message instead of `REJECT`ing it, and the gossip peer is not _immediately_ penalised. This deviates from the spec.

However, it's worth noting that Lighthouse will currently attempt to request the parent from this peer, and if the peer cannot serve the parent, it is penalised with a `LowToleranceError` and will be banned after ~5 occurrences.

https://github.com/sigp/lighthouse/blob/ffa7b2b2b9e3b4e70678e2c749b8bc45234febd7/beacon_node/network/src/sync/network_context.rs#L1530-L1532

This PR will penalise the bad peer immediately instead of performing block lookups before penalising it.


  


Co-Authored-By: Jimmy Chen <[email protected]>
Update `ForkName::latest_stable` to Fulu, reflecting our plan to stabilise Fulu in the immediate future!

This will lead to some more tests running with Fulu rather than Electra.


Co-Authored-By: Michael Sproul <[email protected]>
@michaelsproul
Member

@sashaodessa Can you please update this PR with the latest changes from unstable?

@sashaodessa
Author

sashaodessa commented Oct 10, 2025

@sashaodessa Can you please update this PR with the latest changes from unstable?

@michaelsproul Hi, I water-damaged my MacBook and took it to a repair shop. Unfortunately, I only have my phone now. I'll try from another device.

@michaelsproul
Member

This diff is way too large now

