Remove portpicker race condition#3932

Open
sveitser wants to merge 15 commits into main from ma/remove-portpicker

Conversation

@sveitser
Collaborator

@sveitser sveitser commented Feb 5, 2026

Bind to a socket and then release it, which makes the OS not hand out that socket again for 60s by default on Linux.

Run 1: https://github.com/EspressoSystems/espresso-network/actions/runs/21724072749?pr=3932 ok
Run 2: https://github.com/EspressoSystems/espresso-network/actions/runs/21724930185?pr=3932 2 failures (also on main)
Run 3: https://github.com/EspressoSystems/espresso-network/actions/runs/21724930185?pr=3932 same 2 failures (also on main)

I no longer see any "address already in use" errors.
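
The bind-then-release approach described above can be sketched as follows. This is a minimal illustration using only the standard library; the helper name `reserve_then_release` is hypothetical and not taken from the PR, and the sketch does not itself demonstrate the 60s behavior mentioned in the description.

```rust
use std::io;
use std::net::TcpListener;

/// Hypothetical helper: ask the OS for a free TCP port on localhost,
/// then release it again by dropping the listener before returning.
fn reserve_then_release() -> io::Result<u16> {
    // Port 0 tells the OS to choose any currently unused port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    // Read back the port number the OS actually assigned.
    let port = listener.local_addr()?.port();
    // `listener` goes out of scope here, so the port is released.
    Ok(port)
}
```

The caller only gets a port number back, so whether the port is still free when a service binds to it depends on how the OS allocates ports afterwards.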

@gemini-code-assist

Summary of Changes

Hello @sveitser, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request undertakes a significant refactoring of the project's network port allocation mechanism. It replaces the reliance on the external portpicker crate with internally developed, atomic port binding utilities. This change is primarily driven by the objective to enhance the stability and predictability of network interactions, especially within test suites, by preventing race conditions that could lead to flakiness. The new approach ensures that ports are securely bound and managed, leading to more robust and reliable system behavior.

Highlights

  • Removal of portpicker Crate: The external portpicker crate has been entirely removed from the project's dependencies, including Cargo.lock, crates/builder/Cargo.toml, sequencer/Cargo.toml, and utils/Cargo.toml.
  • Introduction of Custom Port Binding Utilities: New bind_tcp_port and bind_udp_port functions have been introduced in hotshot_types::utils. These functions atomically bind to available TCP and UDP ports, respectively, and keep the listener/socket alive to prevent race conditions, particularly in concurrent test environments.
  • Widespread Adoption of New Utilities: All instances where portpicker::pick_unused_port() was previously used have been updated to utilize the new bind_tcp_port() or bind_udp_port() functions, or direct std::net::TcpListener::bind("127.0.0.1:0") where appropriate. This includes various crates such as crates/builder, crates/cliquenet, crates/hotshot-builder, crates/hotshot, espresso-dev-node, hotshot-events-service, hotshot-query-service, light-client, sequencer, slow-tests, staking-cli, and tests.
  • Enhanced Test Stability: The refactoring aims to improve the reliability and reduce flakiness in tests by ensuring that allocated ports remain exclusively bound until their respective listeners or sockets are explicitly dropped, mitigating issues caused by port reuse or conflicts.
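
As a rough sketch of what the helpers named in the highlights might look like: the review comments refer to keeping "only the port (.1)", which suggests a `(listener, port)` tuple, but the exact signatures below are an assumption and not copied from the PR.

```rust
use std::io;
use std::net::{TcpListener, UdpSocket};

/// Assumed shape: bind a TCP listener to an OS-assigned port and return
/// both the listener (which keeps the port reserved) and the port number.
fn bind_tcp_port() -> io::Result<(TcpListener, u16)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    Ok((listener, port))
}

/// Assumed shape: the UDP counterpart, returning the socket and its port.
fn bind_udp_port() -> io::Result<(UdpSocket, u16)> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    let port = socket.local_addr()?.port();
    Ok((socket, port))
}
```

The port stays reserved only for as long as the returned listener or socket is alive, which is the point the later review comments keep returning to.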

Changelog
  • Cargo.lock
    • Removed portpicker dependency entries from various sections.
  • crates/builder/Cargo.toml
    • Removed portpicker dependency.
  • crates/builder/src/lib.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for event streaming and builder API URLs.
  • crates/builder/src/non_permissioned.rs
    • Removed portpicker import.
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for test builder and query ports.
  • crates/cliquenet/benches/bench1.rs
    • Replaced portpicker::pick_unused_port() with std::net::TcpListener::bind("127.0.0.1:0") for setting up network addresses in benchmarks.
  • crates/hotshot-builder/refactored/src/testing/mod.rs
    • Added utils import from hotshot_types.
    • Replaced portpicker::pick_unused_port() with utils::bind_tcp_port() for test service wrapper port allocation.
  • crates/hotshot-builder/shared/src/utils/event_service_wrapper.rs
    • Replaced portpicker::pick_unused_port() with utils::bind_tcp_port() for event stream wrapper tests.
  • crates/hotshot/examples/combined/all.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for broker private and public ports.
  • crates/hotshot/examples/infra/mod.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for builder address allocation.
  • crates/hotshot/examples/push-cdn/all.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for broker private and public ports.
  • crates/hotshot/hotshot/src/traits/networking/cliquenet_network.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for network party address allocation.
  • crates/hotshot/hotshot/src/traits/networking/libp2p_network.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_udp_port() for libp2p network port allocation.
  • crates/hotshot/hotshot/src/traits/networking/push_cdn_network.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for public, private, and marshal ports.
  • crates/hotshot/testing/src/test_runner.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for builder task port allocation.
  • crates/hotshot/testing/tests/tests_1/block_builder.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Replaced portpicker::pick_unused_port() with bind_tcp_port() for random block builder test port.
  • crates/hotshot/types/src/utils.rs
    • Added TcpListener and UdpSocket imports.
    • Introduced bind_tcp_port() function to atomically bind to an available TCP port.
    • Introduced bind_udp_port() function to atomically bind to an available UDP port.
  • espresso-dev-node/src/main.rs
    • Updated alloy::providers import to include AnvilLayer.
    • Added bind_tcp_port import from hotshot_types::utils.
    • Removed portpicker import.
    • Modified Anvil initialization to use AnvilLayer and handle the listener for the relay server port.
  • espresso-dev-node/tests/dev_node_tests.rs
    • Added TcpListener import.
    • Removed portpicker import.
    • Introduced a PortPicker struct to manage TcpListeners for atomic port binding in tests.
    • Updated slow_dev_node_test and slow_dev_node_multiple_lc_providers_test to use the new PortPicker.
  • hotshot-events-service/src/test.rs
    • Added utils import from hotshot_types.
    • Replaced portpicker::pick_unused_port() with utils::bind_tcp_port() for various test ports.
  • hotshot-query-service/examples/simple-server.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Replaced portpicker::pick_unused_port() with bind_tcp_port() for builder server port allocation.
  • hotshot-query-service/src/availability.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Replaced portpicker::pick_unused_port() with bind_tcp_port() for web server port allocation in tests.
  • hotshot-query-service/src/data_source/storage/sql.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for test database port allocation.
  • hotshot-query-service/src/explorer.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for web server port allocation in tests.
  • hotshot-query-service/src/fetching/provider/any.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for web server port allocation in tests.
  • hotshot-query-service/src/fetching/provider/query_service.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Replaced portpicker::pick_unused_port() with bind_tcp_port() for web server port allocation in tests.
  • hotshot-query-service/src/lib.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Replaced portpicker::pick_unused_port() with bind_tcp_port() for web server port allocation in tests.
  • hotshot-query-service/src/node.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Replaced portpicker::pick_unused_port() with bind_tcp_port() for web server port allocation in tests.
  • hotshot-query-service/src/status.rs
    • Replaced portpicker::pick_unused_port() with hotshot_types::utils::bind_tcp_port() for web server port allocation in tests.
  • hotshot-query-service/src/testing/consensus.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Replaced portpicker::pick_unused_port() with bind_tcp_port() for builder server port allocation.
  • light-client/src/client.rs
    • Replaced portpicker import with hotshot_types::utils.
    • Replaced portpicker::pick_unused_port() with utils::bind_tcp_port() for various test ports.
  • sequencer/Cargo.toml
    • Removed portpicker dependency.
  • sequencer/src/api.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Removed portpicker import.
    • Replaced pick_unused_port() with bind_tcp_port() for various API test helpers and query service ports.
  • sequencer/src/bin/dev-cdn.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Removed portpicker import.
    • Replaced pick_unused_port() with bind_tcp_port() for broker public and private ports.
  • sequencer/src/lib.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Removed portpicker import.
    • Replaced pick_unused_port() with bind_tcp_port() for test builder port allocation.
  • sequencer/src/persistence.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Removed portpicker import.
    • Replaced pick_unused_port() with bind_tcp_port() for query service port allocation in tests.
  • sequencer/src/run.rs
    • Added bind_tcp_port import from hotshot_types::utils.
    • Removed portpicker import.
    • Replaced pick_unused_port() with bind_tcp_port() for sequencer run test port.
  • slow-tests/tests/restart_tests.rs
    • Added TcpListener import.
    • Removed portpicker import and HashSet import.
    • Updated PortPicker struct to use TcpListener for atomic binding.
    • Updated NodeParams and TestNetwork to use the new PortPicker for port allocation.
  • slow-tests/tests/state.rs
    • Removed portpicker import.
    • Replaced pick_unused_port() with direct OS port assignment (port = 0) for slow_test_merklized_state_api.
  • staking-cli/src/deploy.rs
    • Replaced portpicker::pick_unused_port() with TcpListener::bind("127.0.0.1:0") for Anvil port allocation.
  • tests/reward_claims_e2e.rs
    • Updated alloy::providers import to include AnvilLayer.
    • Added bind_tcp_port import from hotshot_types::utils.
    • Removed portpicker import.
    • Replaced pick_unused_port() with bind_tcp_port() for relay server and sequencer API ports, and used AnvilLayer for Anvil initialization.
  • types/src/lib.rs
    • Re-exported bind_tcp_port and bind_udp_port from hotshot_types::utils for broader use.
  • types/src/v0/impls/l1.rs
    • Removed portpicker import.
    • Replaced pick_unused_port() with port(0u16) for Anvil port allocation in tests.
  • utils/Cargo.toml
    • Removed portpicker dependency.
Activity
  • The pull request introduces a significant refactoring to address potential flakiness in network port allocation, especially in test environments.
  • The author, sveitser, has systematically replaced an external dependency (portpicker) with custom, more controlled internal utilities for port management.
  • The changes span across numerous crates and files, indicating a comprehensive effort to standardize port binding practices throughout the codebase.
  • The title 'WIP: see if removing portpicker affects flakyness' suggests an investigative and iterative approach to improving system stability.

sveitser force-pushed the ma/remove-portpicker branch from 7027b32 to 4a37c53 on February 5, 2026 08:57

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors port allocation logic across the codebase, replacing direct calls to portpicker::pick_unused_port() with new utility functions hotshot_types::utils::bind_tcp_port() and hotshot_types::utils::bind_udp_port(). These new functions are intended to atomically bind to an available port and return a listener or socket that keeps the port reserved, preventing race conditions. The portpicker dependency is consequently removed from several Cargo.toml files. However, the review comments consistently point out that in many instances, the returned listener or socket is immediately dropped (e.g., by assigning it to an ignored _listener variable), which defeats the purpose of the atomic binding and reintroduces the very race conditions the change aimed to solve. The reviewer suggests keeping the listener in scope until the service has successfully bound to the port, and in some cases, proposes using a PortPicker struct to manage multiple listeners, or ensuring that AnvilLayer instances are not prematurely dropped.

I am having trouble creating individual review comments. Click here to see my feedback.

crates/builder/src/lib.rs (245-246)

critical

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when it's used later. To fix this, the listener must be kept in scope until the server using this port is started. Since this function only returns a Url, you might need to refactor the calling code to handle port allocation and listener lifecycle.

crates/builder/src/lib.rs (408-409)

critical

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when it's used later. To fix this, the listener must be kept in scope until the server using this port is started. Since this function only returns a Url, you might need to refactor the calling code to handle port allocation and listener lifecycle.

crates/builder/src/non_permissioned.rs (269-276)

critical

Both _query_listener and _builder_listener are dropped immediately after being created. This defeats the purpose of atomically binding the port, as it re-introduces a race condition where the port can be taken by another process before the service starts. To fix this, you should keep the listeners in scope until the respective services have bound to the ports. A good approach is to use a struct that holds the listeners, similar to the PortPicker implemented in other tests in this PR.
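
A struct that holds the listeners, as suggested above, could look roughly like this. The PR's actual PortPicker is not shown in this conversation, so the fields and the `pick` and `release` methods here are hypothetical.

```rust
use std::io;
use std::net::TcpListener;

/// Hypothetical port picker: every picked port stays reserved because
/// its listener is stored in the struct until `release` is called.
#[derive(Default)]
struct PortPicker {
    listeners: Vec<TcpListener>,
}

impl PortPicker {
    /// Reserve a fresh OS-assigned port and keep its listener alive.
    fn pick(&mut self) -> io::Result<u16> {
        let listener = TcpListener::bind("127.0.0.1:0")?;
        let port = listener.local_addr()?.port();
        self.listeners.push(listener);
        Ok(port)
    }

    /// Drop all listeners, freeing the ports right before services bind.
    fn release(&mut self) {
        self.listeners.clear();
    }
}
```

Because all listeners are held at once, two calls to `pick` on the same picker cannot return the same port.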

staking-cli/src/deploy.rs (100-104)

critical

The listener is explicitly dropped before connect_anvil_with_wallet_and_config is called. This re-introduces a race condition, as the port is not guaranteed to be free when Anvil tries to bind to it. The listener should be held until Anvil has successfully started and bound the port.

espresso-dev-node/src/main.rs (311)

critical

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when the relay server starts. The listener must be kept in scope until the server has bound to the port.

espresso-dev-node/src/main.rs (281)

critical

The _anvil_layer variable will be dropped at the end of this statement because of the _ prefix, which will terminate the Anvil instance it holds. To keep the Anvil instance alive for the duration of the scope, you should bind it to a variable that is not ignored (e.g., by removing the _ prefix) and ensure it is not dropped prematurely.
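
One caveat on the comment above, stated as a general Rust fact rather than a claim about this PR's code: a binding with an underscore-prefixed name such as `_anvil_layer` is an ordinary variable and lives until the end of its scope. Only the bare `_` pattern drops the value at the end of the statement. A small sketch with a hypothetical `Guard` type that logs when it is dropped:

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical guard that records its name in a shared log when dropped.
struct Guard {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Guard {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Returns the order in which events happened inside the inner scope.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        // Named binding: kept alive until the closing brace.
        let _named = Guard { name: "named", log: Rc::clone(&log) };
        // Wildcard pattern: the value is dropped at the end of this statement.
        let _ = Guard { name: "wildcard", log: Rc::clone(&log) };
        log.borrow_mut().push("end of scope");
    }
    let result = log.borrow().clone();
    result
}
```

Calling `drop_order()` yields `["wildcard", "end of scope", "named"]`, so whether a `_listener` binding releases its port too early depends on where its enclosing scope ends relative to the service starting.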

crates/hotshot/examples/push-cdn/all.rs (76-79)

high

The _private_listener and _public_listener are dropped immediately, releasing the ports. This re-introduces a race condition. The listeners should be kept alive until the broker service has bound to these ports.

hotshot-query-service/src/fetching/provider/any.rs (237)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the server has bound to the port.

crates/hotshot/examples/infra/mod.rs (1064-1065)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when it's used later to build URLs for services. The listener must be kept in scope until the server has bound to the port.

crates/hotshot-builder/shared/src/utils/event_service_wrapper.rs (295-296)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when run_app is called. The listener must be kept in scope until the server has bound to the port.

crates/hotshot/hotshot/src/traits/networking/cliquenet_network.rs (163-164)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when the network service starts. The listener must be kept alive until the port is used.

crates/hotshot/hotshot/src/traits/networking/libp2p_network.rs (228-229)

high

The _socket is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when the network service starts. The socket must be kept alive until the port is used.

crates/hotshot/hotshot/src/traits/networking/push_cdn_network.rs (343-348)

high

The _listener_1 and _listener_2 are dropped immediately, releasing the ports. This re-introduces a race condition. The listeners should be kept alive until the services have bound to these ports.

crates/hotshot/hotshot/src/traits/networking/push_cdn_network.rs (353-354)

high

The _private_listener is dropped immediately, releasing the port. This re-introduces a race condition. The listener should be kept alive until the service has bound to this port.

crates/hotshot/hotshot/src/traits/networking/push_cdn_network.rs (409-410)

high

The _marshal_listener is dropped immediately, releasing the port. This re-introduces a race condition. The listener should be kept alive until the service has bound to this port.

crates/hotshot/testing/src/test_runner.rs (319-321)

high

The listener returned by bind_tcp_port is immediately dropped because only the port (.1) is being kept. This releases the port, re-introducing a race condition. The listener must be kept in scope until the builder service has bound to the port.

crates/hotshot/testing/tests/tests_1/block_builder.rs (38-40)

high

The listener returned by bind_tcp_port is immediately dropped because only the port (.1) is being kept. This releases the port, re-introducing a race condition. The listener must be kept in scope until the service has bound to the port.

crates/hotshot-builder/shared/src/utils/event_service_wrapper.rs (254-255)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when run_app is called. The listener must be kept in scope until the server has bound to the port.

crates/hotshot-builder/refactored/src/testing/mod.rs (77)

high

The _listener is immediately dropped, which releases the port. The subsequent call to app.serve might fail if another process takes the port in the meantime. The listener should be kept alive until the server has successfully bound the port.

espresso-dev-node/src/main.rs (554)

high

Using unwrap_or(0) to get a prover port is problematic. The prover service will be configured with port 0, but when it binds to an ephemeral port, there is no mechanism to communicate the actual port number back. This will likely cause connection issues. It's better to atomically bind to a port, get the port number, and then configure the service with that specific port.

hotshot-query-service/src/availability.rs (1119)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.

hotshot-query-service/src/data_source/storage/sql.rs (1210)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the Docker container is started and has bound the port.

hotshot-query-service/src/explorer.rs (870)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the server has bound to the port.

crates/hotshot/examples/combined/all.rs (65-68)

high

The _private_listener and _public_listener are dropped immediately, releasing the ports. This re-introduces a race condition. The listeners should be kept alive until the broker service has bound to these ports.

hotshot-query-service/src/fetching/provider/query_service.rs (617)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.

hotshot-query-service/src/lib.rs (940)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the server has bound to the port.

hotshot-query-service/src/node.rs (261)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.

hotshot-query-service/src/status.rs (140)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.

hotshot-query-service/src/testing/consensus.rs (115)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the server has bound to the port.

light-client/src/client.rs (229)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.

sequencer/src/api.rs (1798)

high

The listener returned by bind_tcp_port is immediately dropped because only the port (.1) is being kept. This releases the port, re-introducing a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.

sequencer/src/bin/dev-cdn.rs (60-63)

high

The _public_listener and _private_listener are dropped immediately, releasing the ports. This re-introduces a race condition. The listeners should be kept alive until the broker service has bound to these ports.

sequencer/src/lib.rs (825-828)

high

The _listener is dropped at the end of the closure, which releases the port. This re-introduces a race condition. The listener must be kept alive until the builder service starts.

sequencer/src/lib.rs (873-876)

high

The _listener is dropped at the end of the closure, which releases the port. This re-introduces a race condition. The listener must be kept alive until the builder service starts.

sequencer/src/persistence.rs (1397-1398)

high

The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the query service has bound to the port.

sequencer/src/run.rs (363)

high

The listener returned by bind_tcp_port is immediately dropped because only the port (.1) is being kept. This releases the port, re-introducing a race condition. The listener must be kept in scope until the server has bound to the port.

slow-tests/tests/state.rs (47)

high

Using port 0 here is incorrect. While it allows the service to bind to an ephemeral port, the client that needs to connect to it will be configured with http://localhost:0, which will not work. You should first bind to port 0 to get an available port number, and then use that specific port number to configure both the server and the client.
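
The suggestion above (bind to port 0 first, then configure both sides with the concrete port) can be sketched like this. The helper name `bind_and_url` and the URL format are illustrative assumptions.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

/// Hypothetical helper: bind to port 0, read back the port the OS chose,
/// and build the URL that server and client should both be given.
fn bind_and_url() -> io::Result<(TcpListener, String)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    // `local_addr` reports the real port, never 0.
    let addr: SocketAddr = listener.local_addr()?;
    let url = format!("http://localhost:{}", addr.port());
    Ok((listener, url))
}
```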

crates/cliquenet/benches/bench1.rs (50-61)

high

The listener_a and listener_b are dropped at the end of the setup_cliquenet function, which means the ports they reserved are released. The Retry instances that are returned will later try to bind to these ports, but they are not guaranteed to be available. This re-introduces a race condition. The listeners should be kept alive until the network services are up and running.

tests/reward_claims_e2e.rs (52-56)

high

The _listener1 and _listener2 are dropped immediately, which releases the ports and re-introduces a race condition. The listeners must be kept in scope until the services using these ports have started.

@github-actions
Contributor

github-actions bot commented Feb 5, 2026

Created new PR to fix cargo audit.

PR: #3924

Please merge that PR first to fix cargo-audit.

Create test-utils crate with BoundPort/BoundSocket structs that prevent
port binding race conditions via Rust's borrow checker. The .port() method
returns &u16, making bind_tcp_port()?.port() fail to compile with:
"temporary value dropped while borrowed".

- Create test-utils crate with BoundPort and BoundSocket structs
- Remove portpicker dependency from workspace and 16 crates
- Migrate 33+ files to use test_utils::{bind_tcp_port, bind_udp_port}
- Port method returns &u16 for borrow checker enforcement
- Fix all type mismatches with dereference operator (*)
- Add regression tests proving compile-time safety
- Use listener/socket (not _listener/_socket) as public accessors exist

Fixes race conditions where port was freed before service could bind.
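
A minimal sketch of the guard described in this commit message, assuming a struct with `port()` and `listener()` accessors. The real test-utils crate is not shown here, so the field names and everything beyond `port()` returning `&u16` are assumptions.

```rust
use std::io;
use std::net::TcpListener;

/// Assumed shape of the guard: owns the listener and remembers its port.
struct BoundPort {
    listener: TcpListener,
    port: u16,
}

impl BoundPort {
    /// Returns a reference, so the port borrows from the guard.
    fn port(&self) -> &u16 {
        &self.port
    }

    /// Access to the underlying listener (hypothetical accessor).
    fn listener(&self) -> &TcpListener {
        &self.listener
    }
}

fn bind_tcp_port() -> io::Result<BoundPort> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    Ok(BoundPort { listener, port })
}

// Rejected by the borrow checker once `port` is used in a later statement,
// because the temporary guard is dropped at the end of the `let`:
//     let port = bind_tcp_port()?.port();
```

Note that `*bind_tcp_port()?.port()` in a single statement still compiles and drops the guard right away, so the `&u16` return type is a guard rail rather than a guarantee.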
@sveitser
Collaborator Author

sveitser commented Feb 5, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request replaces the portpicker crate with a more robust method of acquiring free ports for tests, which involves binding to port 0 and letting the OS assign a port. This is a great change that should help reduce test flakiness caused by port allocation race conditions. A new test-utils crate has been introduced to house the port binding utilities.

My review focuses on ensuring the new utility is used consistently across the codebase. I've identified a couple of places where the logic is duplicated instead of using the new test-utils crate. Consolidating this logic will improve maintainability.

Address PR review comments by ensuring BoundPort instances remain
in scope until services actually bind to the ports:

- builder/lib.rs: Return BoundPort from hotshot_builder_url() so
  callers keep it alive, remove unused hotshot_event_streaming_api_url
- builder/non_permissioned.rs: Keep bound ports alive until end of test
- espresso-dev-node/main.rs: Move bound_relay_port into async task
- dev_node_tests.rs: Replace custom PortPicker with test-utils
- restart_tests.rs: Update PortPicker to use test-utils internally

This prevents race conditions where ports could be freed before
services have a chance to bind to them.
@sveitser
Collaborator Author

sveitser commented Feb 5, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request replaces the portpicker crate with a new test-utils crate that provides a more robust way to handle temporary port allocation for tests using an RAII guard pattern. This is a great improvement for test stability. However, in many places, the new utility is used in a way that re-introduces the race conditions it was designed to prevent, by dropping the port guard immediately after creation. I've left several comments pointing out these issues and suggesting fixes. The general solution is to ensure the BoundPort or BoundSocket guard from test-utils lives until the server has successfully bound to the port, which for spawned tasks usually means moving the guard into the task. There are also a few inter-process race conditions in integration tests that are not solved by this change and may require a different approach.
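
For the "different approach" mentioned above, one option is to hand the already-bound listener to the server instead of a port number, so there is no window in which the port is free. This is a different technique from the guard pattern in this PR, and whether the services here can accept a pre-bound listener is not established by this conversation. A standard-library sketch with hypothetical helper names:

```rust
use std::io::{self, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Hypothetical server: takes ownership of a bound listener, accepts one
/// connection, writes a short reply, and closes the connection.
fn serve_once(listener: TcpListener) -> thread::JoinHandle<io::Result<()>> {
    thread::spawn(move || -> io::Result<()> {
        let (mut stream, _peer) = listener.accept()?;
        stream.write_all(b"ok")
    })
}

/// Bind, hand the listener to the server thread, then connect as a client.
fn round_trip() -> io::Result<String> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    // The listener moves into the server; the port is never released.
    let server = serve_once(listener);
    let mut reply = String::new();
    TcpStream::connect(("127.0.0.1", port))?.read_to_string(&mut reply)?;
    server.join().expect("server thread panicked")?;
    Ok(reply)
}
```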

I am having trouble creating individual review comments. Click here to see my feedback.

hotshot-query-service/examples/simple-server.rs (176-177)

high

The bound_builder_port is dropped, but the port is used later when the BuilderTask is started. This is a race condition. The BoundPort guard needs to be kept alive until the builder server has bound to the port. Consider moving the guard into the builder_task.

hotshot-query-service/src/availability.rs (1118-1119)

high

The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file. The fix is to move the guard into the spawned task.

sequencer/src/run.rs (362-363)

high

The bound_port is dropped at the end of the scope, but the port is used later to start a server. This creates a race condition. The BoundPort guard must be kept alive until the server has bound the port.

sequencer/src/persistence.rs (1398-1399)

high

The bound_query_port is dropped at the end of the scope, but the port is used later to start a server. This creates a race condition. The BoundPort guard must be kept alive until the server has bound the port.

crates/cliquenet/benches/bench1.rs (50-61)

high

This manual port picking logic re-introduces the race condition that test-utils is meant to solve. The listener_a and listener_b are dropped at the end of setup_cliquenet, releasing the ports before cliquenet has a chance to bind to them. This can lead to flaky benchmarks.

Consider using test_utils::bind_tcp_port() and restructuring the benchmark setup to keep the returned BoundPort guards alive until the network is fully initialized.

crates/hotshot-builder/refactored/src/testing/mod.rs (76-80)

high

The bound_port is dropped at the end of this function's scope, but app.serve runs in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port.

You can fix this by moving the guard into the spawned task.

        let bound_port = test_utils::bind_tcp_port().unwrap();
        let port = *bound_port.port();
        let url: Url = format!("http://localhost:{port}").parse().unwrap();
        let app = Arc::clone(&global_state).into_app().unwrap();
        spawn(async move {
            // By moving `bound_port` into this task, we ensure the port remains reserved.
            let _guard = bound_port;
            app.serve(url.clone(), StaticVersion::<0, 1> {}).await;
        });

crates/hotshot-builder/shared/src/utils/event_service_wrapper.rs (254-258)

high

The bound_port is dropped at the end of the scope, but run_app (which starts a server) is called right after. This creates a race condition where the port can be reused before the server binds to it. The BoundPort guard should be passed to run_app and kept alive until the server has bound the port.

crates/hotshot/examples/combined/all.rs (65-68)

high

The bound_private_port and bound_public_port guards are dropped at the end of the loop, but the broker is started in a spawned task. This creates a race condition. The guards should be moved into the spawned task to ensure the ports remain reserved until the broker has bound to them.

crates/hotshot/examples/infra/mod.rs (1064-1065)

high

The bound_port is dropped at the end of this match arm, but the port is used later to start a builder task. This creates a race condition. The BoundPort guard needs to be kept alive until the builder server has bound to the port. This might require refactoring initialize_builder to return the guard along with the task.

crates/hotshot/examples/push-cdn/all.rs (76-79)

high

The bound_private_port and bound_public_port guards are dropped at the end of the loop, but the broker is started in a spawned task. This creates a race condition. The guards should be moved into the spawned task to ensure the ports remain reserved until the broker has bound to them.

crates/hotshot/hotshot/src/traits/networking/cliquenet_network.rs (165-167)

high

The bound_port is dropped at the end of the loop iteration, but the port is used later to create Cliquenet instances. This is a race condition. The BoundPort guards should be collected and moved into the returned AsyncGenerator closure to keep them alive.

crates/hotshot/hotshot/src/traits/networking/libp2p_network.rs (228-229)

high

The bound_socket is dropped at the end of the loop iteration, but the port is used later to create Libp2pNetwork instances. This is a race condition. The BoundSocket guards should be collected and moved into the returned AsyncGenerator closure to keep them alive.

crates/hotshot/testing/src/test_runner.rs (319-321)

high

The bound_builder_port is dropped when init_builders returns, but the port is used later when the BuilderTask is started. This is a race condition.

To fix this, init_builders should also return the BoundPort guards, and the caller (run_test) should hold them until the test is complete.

crates/hotshot/testing/tests/tests_1/block_builder.rs (38-40)

high

The bound_port is dropped at the end of the scope, but RandomBuilderImplementation::start spawns a server in a background task. This creates a race condition where the port could be reused before the server binds to it. The bound_port guard should be moved into the task spawned by RandomBuilderImplementation::start.

espresso-dev-node/tests/dev_node_tests.rs (52-57)

high

This test binds ports and passes them as command-line arguments to a child process (espresso-dev-node). The parent process (cargo test) might release these ports before the child process has a chance to bind to them, creating a race condition. This can lead to flaky tests.

A more robust solution for inter-process port management is needed, such as having the child process bind to port 0 and print the chosen port to stdout for the parent to read.
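The port 0 handshake suggested above could look roughly like the following sketch. The `LISTENING_PORT=` line format and both function names are assumptions made for illustration; nothing here reflects how `espresso-dev-node` actually reports its ports.

```rust
use std::io::{self, BufRead, Write};
use std::net::{Ipv4Addr, TcpListener};

/// Child side: bind to port 0 and announce the OS-assigned port on `out`
/// (stdout in a real child process). The line format is hypothetical.
pub fn bind_and_announce<W: Write>(mut out: W) -> io::Result<TcpListener> {
    let listener = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    let port = listener.local_addr()?.port();
    writeln!(out, "LISTENING_PORT={port}")?;
    out.flush()?;
    Ok(listener)
}

/// Parent side: scan the child's output for the announcement line and
/// return the port it names, or `None` if no such line is found.
pub fn read_announced_port<R: BufRead>(input: R) -> io::Result<Option<u16>> {
    for line in input.lines() {
        let line = line?;
        if let Some(rest) = line.strip_prefix("LISTENING_PORT=") {
            return Ok(rest.trim().parse::<u16>().ok());
        }
    }
    Ok(None)
}
```

Because the child never releases the socket between choosing the port and serving on it, there is no window in which another process could take the port. In a test, the parent would wrap the child's piped stdout in a `BufReader` and pass it to `read_announced_port`.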

hotshot-events-service/src/test.rs (45-47)

high

The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. You can achieve this by moving the guard into the spawned task.

sequencer/src/lib.rs (825-828)

high

The bound_port created inside unwrap_or_else is dropped immediately after the port number is extracted. This creates a race condition where the port can be reused before the builder task binds to it. The BoundPort guard needs to be kept alive.

crates/builder/src/lib.rs (118)

high

The _bound_builder_port is unused and dropped at the end of the statement, which means the port is not guaranteed to be reserved when hotshot_builder_url returns. This re-introduces a race condition. The BoundPort guard must be kept alive until the builder server, which is started later, has bound to the port.

hotshot-query-service/src/data_source/storage/sql.rs (1210-1211)

high

This code picks a free port on the host running the test, and then passes it to a Docker container. This is a race condition. Another process on the host could take the port before the Docker container binds to it. Furthermore, if Docker is running on a different host, the port might not be free on that host at all.

A more robust approach would be to let Docker assign a random host port by using -p 5432 and then query Docker for the assigned port.

hotshot-query-service/src/explorer.rs (870-871)

high

The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. You can achieve this by moving the guard into the spawned task.

hotshot-query-service/src/fetching/provider/any.rs (237-238)

high

The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. You can achieve this by moving the guard into the spawned task.

hotshot-query-service/src/fetching/provider/query_service.rs (618-619)

high

The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. This issue appears in multiple tests in this file. The fix is to move the guard into the spawned task.

hotshot-query-service/src/lib.rs (939-940)

high

The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. You can achieve this by moving the guard into the spawned task.

hotshot-query-service/src/node.rs (261-262)

high

The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file. The fix is to move the guard into the spawned task.

hotshot-query-service/src/status.rs (140-141)

high

The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file. The fix is to move the guard into the spawned task.

hotshot-query-service/src/testing/consensus.rs (115-116)

high

The bound_builder_port is dropped, but the port is used later to start a builder task. This is a race condition. The BoundPort guard needs to be kept alive until the builder server has bound to the port. The builder_task should take ownership of the guard.

light-client/src/client.rs (229-230)

high

The bound_port is dropped at the end of the scope, but the TestNetwork constructor starts servers that will use this port. This creates a race condition. The BoundPort guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file.

sequencer/src/api.rs (1798-1799)

high

The bound_port is dropped at the end of the scope, but the port is used later to start a server. This creates a race condition. The BoundPort guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file.

sequencer/src/bin/dev-cdn.rs (58-61)

high

The bound_public_port and bound_private_port guards are dropped before the Broker and Marshal tasks are spawned and have a chance to bind to the ports. This creates a race condition. The guards should be moved into the spawned tasks to ensure the ports remain reserved.

crates/builder/src/lib.rs (242-252)

medium

The hotshot_event_streaming_api_url function, which used portpicker, has been removed. This is a good cleanup, but it's important to ensure that its replacement, which seems to be happening in crates/builder/src/non_permissioned.rs, correctly handles port allocation without race conditions. The new approach should use test-utils and keep the port guard alive.

crates/builder/src/non_permissioned.rs (325-327)

medium

Explicitly dropping the port guards here is a good way to signal intent and manage resource lifetimes clearly. This ensures the ports remain bound for the duration of the test, preventing race conditions with the background server tasks.

crates/builder/src/lib.rs (394-404)

medium

The hotshot_builder_url function now correctly returns the BoundPort guard along with the URL. This is the correct pattern to prevent race conditions. However, the caller at line 118 doesn't seem to be using it correctly.

slow-tests/tests/state.rs (47)

medium

Using port 0 is a great way to let the OS assign an available port, which is robust against race conditions. This is a good pattern for tests.

types/src/v0/impls/l1.rs (1423-1425)

medium

Using port 0 to let the OS assign an available port for the Anvil instance is a robust way to avoid port conflicts in tests. This is a good improvement.

types/src/v0/impls/l1.rs (1684-1685)

medium

Using port 0 to let the OS assign an available port for the Anvil instance is a robust way to avoid port conflicts in tests. This is a good improvement.

- cliquenet_network: Store BoundPort in parties vec, keep alive in async closure
- libp2p_network: Move bound_socket into async closure as guard
- test_runner: Collect all ports upfront before starting any builders
- sequencer/lib.rs: Keep BoundPort guard alive until server binds
- dev_node_tests: Drop ports right before child process spawn
- non_permissioned: Remove unnecessary explicit drops
@sveitser
Collaborator Author

sveitser commented Feb 5, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a work-in-progress to replace the portpicker crate with a new, more robust test-utils crate for reserving ports in tests. This is a commendable effort to reduce test flakiness by preventing port race conditions. The new utility is well-designed, using RAII guards to ensure ports remain reserved.

However, I've identified several instances where the new utility is not used correctly. The BoundPort guard, which keeps the port reserved, is often dropped prematurely, reintroducing the very race conditions this PR aims to fix. I've provided specific comments and suggestions on how to address these issues. Ensuring the BoundPort or BoundSocket instances are kept in scope for the required duration is critical for this refactoring to be effective.

Address Gemini review comments on PR #3932:

- HotShotTestConfig: store _bound_builder_port in struct
- TestConfigBuilder/TestConfig: store _bound_builder_port in struct
- run_test_builder: return BoundPort so callers keep it alive
- cliquenet benchmark: use bind_tcp_port() with proper guard lifetime

Previously guards were dropped prematurely, releasing ports before
the services bound to them.

The test network generators were holding BoundPort guards while the
network tried to bind to the same port, causing "Address already in use"
errors. Drop guards immediately after getting the port number.

Replace bind_tcp_port()/bind_udp_port() with reserve_tcp_port()/reserve_udp_port()
across the codebase. The new functions use the TIME_WAIT trick from Yelp's
ephemeral-port-reserve to provide race-free port allocation without requiring
callers to manage guard lifetimes.

The TIME_WAIT pattern works by completing a TCP handshake then dropping all
sockets, which puts the port into TIME_WAIT state for ~60s. This prevents
the OS from assigning the port to other ephemeral allocations while still
allowing explicit binds.

Changes:
- Add reserve_tcp_port() using TIME_WAIT trick
- Add reserve_udp_port() for UDP (tiny race, acceptable for tests)
- Remove BoundPort/BoundSocket structs (no longer needed)
- Remove bind_tcp_port()/bind_udp_port() functions
- Migrate all callers to simpler reserve_*_port() API

This reduces test-utils from 373 to 168 lines and eliminates the awkward
pattern of keeping guards alive just to hold ports.
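The TIME_WAIT pattern described in this commit message can be sketched with the standard library alone. `reserve_tcp_port_sketch` is a hypothetical name, and this is an illustration of the technique rather than the code in `test-utils`. It assumes Linux-like TIME_WAIT behavior and relies on `TcpListener::bind` setting `SO_REUSEADDR` on Unix, which to my knowledge the Rust standard library does.

```rust
use std::io;
use std::net::{Ipv4Addr, TcpListener, TcpStream};

/// Pick a free port and leave it in TIME_WAIT so the OS should not hand it
/// out as an ephemeral port for a while (assumption: Linux-like behavior).
pub fn reserve_tcp_port_sketch() -> io::Result<u16> {
    // Let the OS choose a free port on loopback.
    let server = TcpListener::bind((Ipv4Addr::LOCALHOST, 0))?;
    let addr = server.local_addr()?;

    // Complete a TCP handshake; loopback connect succeeds via the listen
    // backlog before accept() is called.
    let _client = TcpStream::connect(addr)?;
    let (_accepted, _) = server.accept()?;

    // Locals drop in reverse declaration order, so the accepted socket
    // (whose local port is the reserved one) closes first and is the side
    // that ends up in TIME_WAIT.
    Ok(addr.port())
}
```

A later explicit bind to the returned port is still expected to succeed when the binding socket uses `SO_REUSEADDR`, which is why callers only need the port number and no guard object.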
@sveitser
Collaborator Author

sveitser commented Feb 5, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a significant and well-executed refactoring to improve test stability by replacing the portpicker crate with a custom test-utils crate. The new implementation for reserving TCP ports uses the TIME_WAIT trick, which is a robust method to prevent port allocation race conditions in tests. The changes are applied consistently across the codebase, and the new utility is well-documented and tested.

Additionally, the pull request introduces other good practices, such as using AnvilLayer for better management of the Anvil process lifecycle in tests and leveraging port 0 for OS-assigned ephemeral ports where applicable. These changes should collectively reduce test flakiness. The overall quality of the code is high, and I have no further suggestions for improvement.

Clippy complains about using ::default() on unit structs. Use the struct
directly instead.

PortPicker was simplified to a unit struct that just wrapped reserve_tcp_port().
Remove it entirely and call reserve_tcp_port() directly.

Revert unnecessary splitting of chained method calls into separate
statements. The original chained style is cleaner and there's no need
to keep intermediate variables alive.
@sveitser sveitser changed the title WIP: see if removing portpicker affects flakyness Remove portpicker race condition Feb 6, 2026
…state test

Port 0 was passed to Options::with_port but also used to construct the
client URL, resulting in the client connecting to localhost:0 instead of
the OS-assigned port. Use reserve_tcp_port() to pre-reserve a port that
both the server and client can use.

UDP has no TIME_WAIT state so the function name was misleading -- it
didn't reserve anything. Inline the two-line UdpSocket::bind at the
single call site in libp2p_network.rs.
@sveitser
Collaborator Author

sveitser commented Feb 6, 2026

/gemini review

@sveitser sveitser marked this pull request as ready for review February 6, 2026 14:18
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a race condition in port selection for tests by replacing the portpicker crate with a more robust solution. A new test-utils crate is introduced, which provides a reserve_tcp_port function utilizing the TCP TIME_WAIT trick to reliably reserve ports. This change is applied consistently across the codebase.

The PR also includes several other improvements:

  • In some cases, instead of picking a port beforehand, the code now lets the underlying service (like Anvil) bind to an ephemeral port (port 0) and then queries for the assigned port, which is an even more robust pattern.
  • A critical bug in sequencer/src/lib.rs has been fixed where a server's serve future was not being awaited, causing it to not run.

The changes are well-executed and significantly improve the reliability of tests that require network ports.

Comment on lines +857 to +865
    spawn(async move {
        app.serve(
            format!("http://0.0.0.0:{port}")
                .parse::<Url>()
                .expect("Failed to parse builder listener"),
            EpochVersion::instance(),
        )
        .await
    });


medium

While this change correctly awaits the app.serve future, the Result it returns is being ignored. It would be beneficial to handle the potential error case, for example by logging it. This would improve debuggability if the server fails to start.

Suggested change
    spawn(async move {
        app.serve(
            format!("http://0.0.0.0:{port}")
                .parse::<Url>()
                .expect("Failed to parse builder listener"),
            EpochVersion::instance(),
        )
        .await
    });

    spawn(async move {
        if let Err(e) = app.serve(
            format!("http://0.0.0.0:{port}")
                .parse::<Url>()
                .expect("Failed to parse builder listener"),
            EpochVersion::instance(),
        )
        .await {
            tracing::error!("Legacy builder server failed: {e}");
        }
    });

The cloned AnvilInstance already keeps the Anvil process alive, so
holding the AnvilLayer in _anvil_layer is unnecessary.
}
}

// url for the hotshot event streaming api
Collaborator Author

This seems to be unused, hence removed.

use super::*;
use crate::non_permissioned::BuilderConfig;

#[derive(Clone)]
Collaborator Author

will add this back

Contributor

@twittner twittner left a comment

Looks good to me.

let public = T::SignatureKey::from_private(&secret);
let kpair = derive_keypair::<<T as NodeType>::SignatureKey>(&secret);
let port = portpicker::pick_unused_port().expect("an unused port is available");
let port = test_utils::reserve_tcp_port().expect("Could not reserve TCP port");
Contributor

I would have preferred the "expect as precondition style", but not important.

Collaborator Author

yeah i agree i will change it

Comment on lines +43 to +46
// Force TIME_WAIT by completing TCP handshake
let _client = TcpStream::connect(addr)?;
let (_accepted, _) = server.accept()?;
// All sockets drop here - port enters TIME_WAIT
Contributor

It would perhaps be nice to upstream this to the port picker crate. It would neatly fit into https://github.com/Dentosal/portpicker-rs/blob/912f913ac325278a564d8828d550070db1d79373/src/lib.rs#L15. While under review the repo could be forked.
