Conversation
Summary of Changes
Hello @sveitser, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request undertakes a significant refactoring of the project's network port allocation mechanism, replacing the reliance on the external portpicker crate.
Activity
Force-pushed from 7027b32 to 4a37c53.
Code Review
This pull request refactors port allocation logic across the codebase, replacing direct calls to portpicker::pick_unused_port() with new utility functions hotshot_types::utils::bind_tcp_port() and hotshot_types::utils::bind_udp_port(). These new functions are intended to atomically bind to an available port and return a listener or socket that keeps the port reserved, preventing race conditions. The portpicker dependency is consequently removed from several Cargo.toml files. However, the review comments consistently point out that in many instances, the returned listener or socket is immediately dropped (e.g., by assigning it to an ignored _listener variable), which defeats the purpose of the atomic binding and reintroduces the very race conditions the change aimed to solve. The reviewer suggests keeping the listener in scope until the service has successfully bound to the port, and in some cases, proposes using a PortPicker struct to manage multiple listeners, or ensuring that AnvilLayer instances are not prematurely dropped.
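The anti-pattern the review keeps flagging can be sketched in a few lines. This is a hypothetical reconstruction of the utility (names assumed from the review, not the actual crate source): binding to port 0 lets the OS pick a free port atomically, and the returned listener keeps that port reserved only for as long as it stays alive.

```rust
use std::net::TcpListener;

// Hypothetical sketch of the utility the review describes (names assumed):
// binding to port 0 lets the OS pick a free port atomically, and the returned
// listener keeps that port reserved for as long as it is alive.
fn bind_tcp_port() -> std::io::Result<(TcpListener, u16)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    Ok((listener, port))
}

fn main() -> std::io::Result<()> {
    // Anti-pattern flagged in the review: a bare `_` pattern drops the listener
    // immediately, so the port is free again before any service binds it.
    let (_, racy_port) = bind_tcp_port()?;

    // Correct pattern: keep the listener in scope until the service has bound.
    let (reservation, port) = bind_tcp_port()?;
    // ... start the service on `port` here ...
    drop(reservation); // release only once the service owns the port

    assert_ne!(racy_port, 0);
    assert_ne!(port, 0);
    Ok(())
}
```

Note that a named binding like `let _listener = ...` lives until the end of the enclosing scope, so the race only reopens once that scope ends; the bare `_` pattern releases the port within the same statement.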
I am having trouble creating individual review comments, so my feedback is collected below.
crates/builder/src/lib.rs (245-246)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when it's used later. To fix this, the listener must be kept in scope until the server using this port is started. Since this function only returns a Url, you might need to refactor the calling code to handle port allocation and listener lifecycle.
crates/builder/src/lib.rs (408-409)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when it's used later. To fix this, the listener must be kept in scope until the server using this port is started. Since this function only returns a Url, you might need to refactor the calling code to handle port allocation and listener lifecycle.
crates/builder/src/non_permissioned.rs (269-276)
Both _query_listener and _builder_listener are dropped immediately after being created. This defeats the purpose of atomically binding the port, as it re-introduces a race condition where the port can be taken by another process before the service starts. To fix this, you should keep the listeners in scope until the respective services have bound to the ports. A good approach is to use a struct that holds the listeners, similar to the PortPicker implemented in other tests in this PR.
staking-cli/src/deploy.rs (100-104)
The listener is explicitly dropped before connect_anvil_with_wallet_and_config is called. This re-introduces a race condition, as the port is not guaranteed to be free when Anvil tries to bind to it. The listener should be held until Anvil has successfully started and bound the port.
espresso-dev-node/src/main.rs (311)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when the relay server starts. The listener must be kept in scope until the server has bound to the port.
espresso-dev-node/src/main.rs (281)
If the AnvilLayer is bound with a bare `_` pattern, it is dropped at the end of the statement, terminating the Anvil instance it holds; even a `_anvil_layer` binding only lives until the end of the enclosing scope. To keep the Anvil instance alive for the duration it is needed, bind it to a regular (non-ignored) variable and ensure it is not dropped prematurely.
crates/hotshot/examples/push-cdn/all.rs (76-79)
The _private_listener and _public_listener are dropped immediately, releasing the ports. This re-introduces a race condition. The listeners should be kept alive until the broker service has bound to these ports.
hotshot-query-service/src/fetching/provider/any.rs (237)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the server has bound to the port.
crates/hotshot/examples/infra/mod.rs (1064-1065)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when it's used later to build URLs for services. The listener must be kept in scope until the server has bound to the port.
crates/hotshot-builder/shared/src/utils/event_service_wrapper.rs (295-296)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when run_app is called. The listener must be kept in scope until the server has bound to the port.
crates/hotshot/hotshot/src/traits/networking/cliquenet_network.rs (163-164)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when the network service starts. The listener must be kept alive until the port is used.
crates/hotshot/hotshot/src/traits/networking/libp2p_network.rs (228-229)
The _socket is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when the network service starts. The socket must be kept alive until the port is used.
crates/hotshot/hotshot/src/traits/networking/push_cdn_network.rs (343-348)
The _listener_1 and _listener_2 are dropped immediately, releasing the ports. This re-introduces a race condition. The listeners should be kept alive until the services have bound to these ports.
crates/hotshot/hotshot/src/traits/networking/push_cdn_network.rs (353-354)
The _private_listener is dropped immediately, releasing the port. This re-introduces a race condition. The listener should be kept alive until the service has bound to this port.
crates/hotshot/hotshot/src/traits/networking/push_cdn_network.rs (409-410)
The _marshal_listener is dropped immediately, releasing the port. This re-introduces a race condition. The listener should be kept alive until the service has bound to this port.
crates/hotshot/testing/src/test_runner.rs (319-321)
The listener returned by bind_tcp_port is immediately dropped because only the port (.1) is being kept. This releases the port, re-introducing a race condition. The listener must be kept in scope until the builder service has bound to the port.
crates/hotshot/testing/tests/tests_1/block_builder.rs (38-40)
The listener returned by bind_tcp_port is immediately dropped because only the port (.1) is being kept. This releases the port, re-introducing a race condition. The listener must be kept in scope until the service has bound to the port.
crates/hotshot-builder/shared/src/utils/event_service_wrapper.rs (254-255)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The port is not guaranteed to be available when run_app is called. The listener must be kept in scope until the server has bound to the port.
crates/hotshot-builder/refactored/src/testing/mod.rs (77)
The _listener is immediately dropped, which releases the port. The subsequent call to app.serve might fail if another process takes the port in the meantime. The listener should be kept alive until the server has successfully bound the port.
espresso-dev-node/src/main.rs (554)
Using unwrap_or(0) to get a prover port is problematic. The prover service will be configured with port 0, but when it binds to an ephemeral port, there is no mechanism to communicate the actual port number back. This will likely cause connection issues. It's better to atomically bind to a port, get the port number, and then configure the service with that specific port.
hotshot-query-service/src/availability.rs (1119)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.
hotshot-query-service/src/data_source/storage/sql.rs (1210)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the Docker container is started and has bound the port.
hotshot-query-service/src/explorer.rs (870)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the server has bound to the port.
crates/hotshot/examples/combined/all.rs (65-68)
The _private_listener and _public_listener are dropped immediately, releasing the ports. This re-introduces a race condition. The listeners should be kept alive until the broker service has bound to these ports.
hotshot-query-service/src/fetching/provider/query_service.rs (617)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.
hotshot-query-service/src/lib.rs (940)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the server has bound to the port.
hotshot-query-service/src/node.rs (261)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.
hotshot-query-service/src/status.rs (140)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.
hotshot-query-service/src/testing/consensus.rs (115)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the server has bound to the port.
light-client/src/client.rs (229)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.
sequencer/src/api.rs (1798)
The listener returned by bind_tcp_port is immediately dropped because only the port (.1) is being kept. This releases the port, re-introducing a race condition. This pattern is repeated throughout the tests in this file. The listener must be kept in scope until the server has bound to the port.
sequencer/src/bin/dev-cdn.rs (60-63)
The _public_listener and _private_listener are dropped immediately, releasing the ports. This re-introduces a race condition. The listeners should be kept alive until the broker service has bound to these ports.
sequencer/src/lib.rs (825-828)
The _listener is dropped at the end of the closure, which releases the port. This re-introduces a race condition. The listener must be kept alive until the builder service starts.
sequencer/src/lib.rs (873-876)
The _listener is dropped at the end of the closure, which releases the port. This re-introduces a race condition. The listener must be kept alive until the builder service starts.
sequencer/src/persistence.rs (1397-1398)
The _listener is immediately dropped, which releases the port and re-introduces a race condition. The listener must be kept in scope until the query service has bound to the port.
sequencer/src/run.rs (363)
The listener returned by bind_tcp_port is immediately dropped because only the port (.1) is being kept. This releases the port, re-introducing a race condition. The listener must be kept in scope until the server has bound to the port.
slow-tests/tests/state.rs (47)
Using port 0 here is incorrect. While it allows the service to bind to an ephemeral port, the client that needs to connect to it will be configured with http://localhost:0, which will not work. You should first bind to port 0 to get an available port number, and then use that specific port number to configure both the server and the client.
crates/cliquenet/benches/bench1.rs (50-61)
The listener_a and listener_b are dropped at the end of the setup_cliquenet function, which means the ports they reserved are released. The Retry instances that are returned will later try to bind to these ports, but they are not guaranteed to be available. This re-introduces a race condition. The listeners should be kept alive until the network services are up and running.
tests/reward_claims_e2e.rs (52-56)
The _listener1 and _listener2 are dropped immediately, which releases the ports and re-introduces a race condition. The listeners must be kept in scope until the services using these ports have started.
Created a new PR to fix cargo-audit: #3924. Please merge that PR first.
Create test-utils crate with BoundPort/BoundSocket structs that prevent
port binding race conditions via Rust's borrow checker. The .port() method
returns &u16, making bind_tcp_port()?.port() fail to compile with:
"temporary value dropped while borrowed".
- Create test-utils crate with BoundPort and BoundSocket structs
- Remove portpicker dependency from workspace and 16 crates
- Migrate 33+ files to use test_utils::{bind_tcp_port, bind_udp_port}
- Port method returns &u16 for borrow checker enforcement
- Fix all type mismatches with dereference operator (*)
- Add regression tests proving compile-time safety
- Use listener/socket (not _listener/_socket) as public accessors exist
Fixes race conditions where port was freed before service could bind.
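The commit message above can be sketched as follows. This is an assumed reconstruction of the test-utils API, not the crate's actual source: because `port()` returns `&u16`, the port number borrows from the guard, so `bind_tcp_port()?.port()` fails to compile with "temporary value dropped while borrowed", forcing callers to keep the guard alive.

```rust
use std::net::TcpListener;

// Hypothetical sketch of the BoundPort guard described in the commit above.
// Returning `&u16` from `port()` ties the port number's lifetime to the guard,
// so the borrow checker rejects dropping the guard while the port is in use.
pub struct BoundPort {
    // Held only for its side effect: the bound listener keeps the port reserved.
    #[allow(dead_code)]
    listener: TcpListener,
    port: u16,
}

impl BoundPort {
    pub fn port(&self) -> &u16 {
        &self.port
    }
}

pub fn bind_tcp_port() -> std::io::Result<BoundPort> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    Ok(BoundPort { listener, port })
}

fn main() -> std::io::Result<()> {
    let bound = bind_tcp_port()?;
    // Dereference where a plain u16 is needed, as the commit's `*` fixes do.
    let port: u16 = *bound.port();
    assert_ne!(port, 0);
    Ok(())
}
```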
/gemini review
There was a problem hiding this comment.
Code Review
This pull request replaces the portpicker crate with a more robust method of acquiring free ports for tests, which involves binding to port 0 and letting the OS assign a port. This is a great change that should help reduce test flakiness caused by port allocation race conditions. A new test-utils crate has been introduced to house the port binding utilities.
My review focuses on ensuring the new utility is used consistently across the codebase. I've identified a couple of places where the logic is duplicated instead of using the new test-utils crate. Consolidating this logic will improve maintainability.
Address PR review comments by ensuring BoundPort instances remain in scope until services actually bind to the ports:
- builder/lib.rs: Return BoundPort from hotshot_builder_url() so callers keep it alive; remove unused hotshot_event_streaming_api_url
- builder/non_permissioned.rs: Keep bound ports alive until end of test
- espresso-dev-node/main.rs: Move bound_relay_port into async task
- dev_node_tests.rs: Replace custom PortPicker with test-utils
- restart_tests.rs: Update PortPicker to use test-utils internally

This prevents race conditions where ports could be freed before services have a chance to bind to them.
/gemini review
Code Review
This pull request replaces the portpicker crate with a new test-utils crate that provides a more robust way to handle temporary port allocation for tests using an RAII guard pattern. This is a great improvement for test stability. However, in many places, the new utility is used in a way that re-introduces the race conditions it was designed to prevent, by dropping the port guard immediately after creation. I've left several comments pointing out these issues and suggesting fixes. The general solution is to ensure the BoundPort or BoundSocket guard from test-utils lives until the server has successfully bound to the port, which for spawned tasks usually means moving the guard into the task. There are also a few inter-process race conditions in integration tests that are not solved by this change and may require a different approach.
I am having trouble creating individual review comments, so my feedback is collected below.
hotshot-query-service/examples/simple-server.rs (176-177)
The bound_builder_port is dropped, but the port is used later when the BuilderTask is started. This is a race condition. The BoundPort guard needs to be kept alive until the builder server has bound to the port. Consider moving the guard into the builder_task.
hotshot-query-service/src/availability.rs (1118-1119)
The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file. The fix is to move the guard into the spawned task.
sequencer/src/run.rs (362-363)
The bound_port is dropped at the end of the scope, but the port is used later to start a server. This creates a race condition. The BoundPort guard must be kept alive until the server has bound the port.
sequencer/src/persistence.rs (1398-1399)
The bound_query_port is dropped at the end of the scope, but the port is used later to start a server. This creates a race condition. The BoundPort guard must be kept alive until the server has bound the port.
crates/cliquenet/benches/bench1.rs (50-61)
This manual port picking logic re-introduces the race condition that test-utils is meant to solve. The listener_a and listener_b are dropped at the end of setup_cliquenet, releasing the ports before cliquenet has a chance to bind to them. This can lead to flaky benchmarks.
Consider using test_utils::bind_tcp_port() and restructuring the benchmark setup to keep the returned BoundPort guards alive until the network is fully initialized.
crates/hotshot-builder/refactored/src/testing/mod.rs (76-80)
The bound_port is dropped at the end of this function's scope, but app.serve runs in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port.
You can fix this by moving the guard into the spawned task.
let bound_port = test_utils::bind_tcp_port().unwrap();
let port = *bound_port.port();
let url: Url = format!("http://localhost:{port}").parse().unwrap();
let app = Arc::clone(&global_state).into_app().unwrap();
spawn(async move {
// By moving `bound_port` into this task, we ensure the port remains reserved.
let _guard = bound_port;
app.serve(url.clone(), StaticVersion::<0, 1> {}).await;
});
crates/hotshot-builder/shared/src/utils/event_service_wrapper.rs (254-258)
The bound_port is dropped at the end of the scope, but run_app (which starts a server) is called right after. This creates a race condition where the port can be reused before the server binds to it. The BoundPort guard should be passed to run_app and kept alive until the server has bound the port.
crates/hotshot/examples/combined/all.rs (65-68)
The bound_private_port and bound_public_port guards are dropped at the end of the loop, but the broker is started in a spawned task. This creates a race condition. The guards should be moved into the spawned task to ensure the ports remain reserved until the broker has bound to them.
crates/hotshot/examples/infra/mod.rs (1064-1065)
The bound_port is dropped at the end of this match arm, but the port is used later to start a builder task. This creates a race condition. The BoundPort guard needs to be kept alive until the builder server has bound to the port. This might require refactoring initialize_builder to return the guard along with the task.
crates/hotshot/examples/push-cdn/all.rs (76-79)
The bound_private_port and bound_public_port guards are dropped at the end of the loop, but the broker is started in a spawned task. This creates a race condition. The guards should be moved into the spawned task to ensure the ports remain reserved until the broker has bound to them.
crates/hotshot/hotshot/src/traits/networking/cliquenet_network.rs (165-167)
The bound_port is dropped at the end of the loop iteration, but the port is used later to create Cliquenet instances. This is a race condition. The BoundPort guards should be collected and moved into the returned AsyncGenerator closure to keep them alive.
crates/hotshot/hotshot/src/traits/networking/libp2p_network.rs (228-229)
The bound_socket is dropped at the end of the loop iteration, but the port is used later to create Libp2pNetwork instances. This is a race condition. The BoundSocket guards should be collected and moved into the returned AsyncGenerator closure to keep them alive.
crates/hotshot/testing/src/test_runner.rs (319-321)
The bound_builder_port is dropped when init_builders returns, but the port is used later when the BuilderTask is started. This is a race condition.
To fix this, init_builders should also return the BoundPort guards, and the caller (run_test) should hold them until the test is complete.
crates/hotshot/testing/tests/tests_1/block_builder.rs (38-40)
The bound_port is dropped at the end of the scope, but RandomBuilderImplementation::start spawns a server in a background task. This creates a race condition where the port could be reused before the server binds to it. The bound_port guard should be moved into the task spawned by RandomBuilderImplementation::start.
espresso-dev-node/tests/dev_node_tests.rs (52-57)
This test binds ports and passes them as command-line arguments to a child process (espresso-dev-node). The parent process (cargo test) might release these ports before the child process has a chance to bind to them, creating a race condition. This can lead to flaky tests.
A more robust solution for inter-process port management is needed, such as having the child process bind to port 0 and print the chosen port to stdout for the parent to read.
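The suggested handoff can be sketched as follows (a minimal illustration, not the dev node's actual code): the child process binds port 0 itself and reports the OS-assigned port on stdout, so there is never a window in which the parent holds a port the child has yet to claim.

```rust
use std::net::TcpListener;

// Hypothetical sketch of the inter-process handoff suggested above: the child
// binds port 0 itself and announces the OS-assigned port on stdout.
fn main() -> std::io::Result<()> {
    // In the child (e.g. the dev node): bind first, then announce.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    println!("PORT={port}");
    // ... accept connections on `listener` ...

    // The parent would scan the child's stdout for the "PORT=" line instead of
    // passing a pre-picked port on the command line.
    assert_ne!(port, 0);
    Ok(())
}
```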
hotshot-events-service/src/test.rs (45-47)
The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. You can achieve this by moving the guard into the spawned task.
sequencer/src/lib.rs (825-828)
The bound_port created inside unwrap_or_else is dropped immediately after the port number is extracted. This creates a race condition where the port can be reused before the builder task binds to it. The BoundPort guard needs to be kept alive.
crates/builder/src/lib.rs (118)
The _bound_builder_port is unused and dropped at the end of the statement, which means the port is not guaranteed to be reserved when hotshot_builder_url returns. This re-introduces a race condition. The BoundPort guard must be kept alive until the builder server, which is started later, has bound to the port.
hotshot-query-service/src/data_source/storage/sql.rs (1210-1211)
This code picks a free port on the host running the test, and then passes it to a Docker container. This is a race condition. Another process on the host could take the port before the Docker container binds to it. Furthermore, if Docker is running on a different host, the port might not be free on that host at all.
A more robust approach would be to let Docker assign a random host port by using -p 5432 and then query Docker for the assigned port.
hotshot-query-service/src/explorer.rs (870-871)
The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. You can achieve this by moving the guard into the spawned task.
hotshot-query-service/src/fetching/provider/any.rs (237-238)
The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. You can achieve this by moving the guard into the spawned task.
hotshot-query-service/src/fetching/provider/query_service.rs (618-619)
The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. This issue appears in multiple tests in this file. The fix is to move the guard into the spawned task.
hotshot-query-service/src/lib.rs (939-940)
The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. You can achieve this by moving the guard into the spawned task.
hotshot-query-service/src/node.rs (261-262)
The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file. The fix is to move the guard into the spawned task.
hotshot-query-service/src/status.rs (140-141)
The bound_port is dropped at the end of the scope, but app.serve is running in a spawned task. This creates a race condition where the port could be reused by another process before the server task binds to it. The bound_port guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file. The fix is to move the guard into the spawned task.
hotshot-query-service/src/testing/consensus.rs (115-116)
The bound_builder_port is dropped, but the port is used later to start a builder task. This is a race condition. The BoundPort guard needs to be kept alive until the builder server has bound to the port. The builder_task should take ownership of the guard.
light-client/src/client.rs (229-230)
The bound_port is dropped at the end of the scope, but the TestNetwork constructor starts servers that will use this port. This creates a race condition. The BoundPort guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file.
sequencer/src/api.rs (1798-1799)
The bound_port is dropped at the end of the scope, but the port is used later to start a server. This creates a race condition. The BoundPort guard must be kept alive until the server has bound the port. This issue appears in multiple tests in this file.
sequencer/src/bin/dev-cdn.rs (58-61)
The bound_public_port and bound_private_port guards are dropped before the Broker and Marshal tasks are spawned and have a chance to bind to the ports. This creates a race condition. The guards should be moved into the spawned tasks to ensure the ports remain reserved.
crates/builder/src/lib.rs (242-252)
The hotshot_event_streaming_api_url function, which used portpicker, has been removed. This is a good cleanup, but it's important to ensure that its replacement, which seems to be happening in crates/builder/src/non_permissioned.rs, correctly handles port allocation without race conditions. The new approach should use test-utils and keep the port guard alive.
crates/builder/src/non_permissioned.rs (325-327)
Explicitly dropping the port guards here is a good way to signal intent and manage resource lifetimes clearly. This ensures the ports remain bound for the duration of the test, preventing race conditions with the background server tasks.
crates/builder/src/lib.rs (394-404)
The hotshot_builder_url function now correctly returns the BoundPort guard along with the URL. This is the correct pattern to prevent race conditions. However, the caller at line 118 doesn't seem to be using it correctly.
slow-tests/tests/state.rs (47)
Using port 0 is a great way to let the OS assign an available port, which is robust against race conditions. This is a good pattern for tests.
types/src/v0/impls/l1.rs (1423-1425)
Using port 0 to let the OS assign an available port for the Anvil instance is a robust way to avoid port conflicts in tests. This is a good improvement.
types/src/v0/impls/l1.rs (1684-1685)
Using port 0 to let the OS assign an available port for the Anvil instance is a robust way to avoid port conflicts in tests. This is a good improvement.
- cliquenet_network: Store BoundPort in parties vec, keep alive in async closure
- libp2p_network: Move bound_socket into async closure as guard
- test_runner: Collect all ports upfront before starting any builders
- sequencer/lib.rs: Keep BoundPort guard alive until server binds
- dev_node_tests: Drop ports right before child process spawn
- non_permissioned: Remove unnecessary explicit drops
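The "move the guard into the closure" fixes above can be sketched with a plain thread standing in for an async task (names assumed). Note that the reservation has to be released immediately before the real bind, since two sockets cannot hold the same port at once; this keeps the race window tiny but not zero.

```rust
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    // Reserve a port; the listener doubles as the guard.
    let guard = TcpListener::bind("127.0.0.1:0")?;
    let port = guard.local_addr()?.port();

    // Move the guard into the task that will actually serve.
    let server = thread::spawn(move || -> std::io::Result<u16> {
        // Release the reservation only at the last moment; holding it through
        // the bind below would fail with "Address already in use".
        drop(guard);
        let listener = TcpListener::bind(("127.0.0.1", port))?;
        Ok(listener.local_addr()?.port())
    });

    assert_eq!(server.join().unwrap()?, port);
    Ok(())
}
```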
/gemini review
Code Review
This pull request is a work-in-progress to replace the portpicker crate with a new, more robust test-utils crate for reserving ports in tests. This is a commendable effort to reduce test flakiness by preventing port race conditions. The new utility is well-designed, using RAII guards to ensure ports remain reserved.
However, I've identified several instances where the new utility is not used correctly. The BoundPort guard, which keeps the port reserved, is often dropped prematurely, reintroducing the very race conditions this PR aims to fix. I've provided specific comments and suggestions on how to address these issues. Ensuring the BoundPort or BoundSocket instances are kept in scope for the required duration is critical for this refactoring to be effective.
Address Gemini review comments on PR #3932:
- HotShotTestConfig: store _bound_builder_port in struct
- TestConfigBuilder/TestConfig: store _bound_builder_port in struct
- run_test_builder: return BoundPort so callers keep it alive
- cliquenet benchmark: use bind_tcp_port() with proper guard lifetime

Previously guards were dropped prematurely, releasing ports before the services bound to them.
The test network generators were holding BoundPort guards while the network tried to bind to the same port, causing "Address already in use" errors. Drop guards immediately after getting the port number.
Replace bind_tcp_port()/bind_udp_port() with reserve_tcp_port()/reserve_udp_port() across the codebase. The new functions use the TIME_WAIT trick from Yelp's ephemeral-port-reserve to provide race-free port allocation without requiring callers to manage guard lifetimes.

The TIME_WAIT pattern works by completing a TCP handshake and then dropping all sockets, which puts the port into TIME_WAIT state for ~60s. This prevents the OS from assigning the port to other ephemeral allocations while still allowing explicit binds.

Changes:
- Add reserve_tcp_port() using the TIME_WAIT trick
- Add reserve_udp_port() for UDP (tiny race, acceptable for tests)
- Remove BoundPort/BoundSocket structs (no longer needed)
- Remove bind_tcp_port()/bind_udp_port() functions
- Migrate all callers to the simpler reserve_*_port() API

This reduces test-utils from 373 to 168 lines and eliminates the awkward pattern of keeping guards alive just to hold ports.
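The TIME_WAIT trick can be sketched in std-only Rust (a simplified illustration, not the crate's actual code; Yelp's ephemeral-port-reserve also sets SO_REUSEADDR explicitly, which Rust's std already does for TcpListener on Unix):

```rust
use std::io;
use std::net::{TcpListener, TcpStream};

/// Sketch of the TIME_WAIT trick: bind an ephemeral port, complete a
/// real TCP handshake against it, then drop every socket. The closed
/// connection leaves the port in TIME_WAIT (~60s by default on Linux),
/// so the OS will not hand it out for other ephemeral binds, while an
/// explicit bind to it still succeeds.
fn reserve_tcp_port() -> io::Result<u16> {
    let server = TcpListener::bind("127.0.0.1:0")?;
    let addr = server.local_addr()?;

    // Complete a handshake so there is an established connection to close.
    let _client = TcpStream::connect(addr)?;
    let (_accepted, _) = server.accept()?;

    // All sockets drop here; the port enters TIME_WAIT.
    Ok(addr.port())
}
```

The caller gets back a bare u16 with no guard to keep alive, yet the port stays reserved for roughly a minute.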
/gemini review
Code Review
This pull request is a significant and well-executed refactoring to improve test stability by replacing the portpicker crate with a custom test-utils crate. The new implementation for reserving TCP ports uses the TIME_WAIT trick, which is a robust method to prevent port allocation race conditions in tests. The changes are applied consistently across the codebase, and the new utility is well-documented and tested.
Additionally, the pull request introduces other good practices, such as using AnvilLayer for better management of the Anvil process lifecycle in tests and leveraging port 0 for OS-assigned ephemeral ports where applicable. These changes should collectively reduce test flakiness. The overall quality of the code is high, and I have no further suggestions for improvement.
Clippy complains about using ::default() on unit structs. Use the struct directly instead.
PortPicker was simplified to a unit struct that just wrapped reserve_tcp_port(). Remove it entirely and call reserve_tcp_port() directly.
Revert unnecessary splitting of chained method calls into separate statements. The original chained style is cleaner and there's no need to keep intermediate variables alive.
…state test

Port 0 was passed to Options::with_port but also used to construct the client URL, resulting in the client connecting to localhost:0 instead of the OS-assigned port. Use reserve_tcp_port() to pre-reserve a port that both the server and client can use.
UDP has no TIME_WAIT state so the function name was misleading -- it didn't reserve anything. Inline the two-line UdpSocket::bind at the single call site in libp2p_network.rs.
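The inlined replacement amounts to something like this (a sketch of the two-liner, not the exact code from libp2p_network.rs):

```rust
use std::io;
use std::net::UdpSocket;

// UDP has no TIME_WAIT state, so there is nothing to "reserve": either
// keep the bound socket alive as a guard, or use the port number
// immediately and accept a tiny race window.
fn os_assigned_udp_port() -> io::Result<(UdpSocket, u16)> {
    let socket = UdpSocket::bind("127.0.0.1:0")?;
    let port = socket.local_addr()?.port();
    Ok((socket, port))
}
```

Inlining this at the single call site is clearer than a misleadingly named reserve_udp_port() helper.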
/gemini review
Code Review
This pull request addresses a race condition in port selection for tests by replacing the portpicker crate with a more robust solution. A new test-utils crate is introduced, which provides a reserve_tcp_port function utilizing the TCP TIME_WAIT trick to reliably reserve ports. This change is applied consistently across the codebase.
The PR also includes several other improvements:
- In some cases, instead of picking a port beforehand, the code now lets the underlying service (like Anvil) bind to an ephemeral port (port 0) and then queries for the assigned port, which is an even more robust pattern.
- A critical bug in `sequencer/src/lib.rs` has been fixed where a server's `serve` future was not being awaited, causing it to not run.
The changes are well-executed and significantly improve the reliability of tests that require network ports.
```diff
 spawn(async move {
     app.serve(
         format!("http://0.0.0.0:{port}")
             .parse::<Url>()
             .expect("Failed to parse builder listener"),
         EpochVersion::instance(),
-    ),
-);
+    )
+    .await
 });
```
While this change correctly awaits the app.serve future, the Result it returns is being ignored. It would be beneficial to handle the potential error case, for example by logging it. This would improve debuggability if the server fails to start.
Suggested change:

```rust
spawn(async move {
    if let Err(e) = app.serve(
        format!("http://0.0.0.0:{port}")
            .parse::<Url>()
            .expect("Failed to parse builder listener"),
        EpochVersion::instance(),
    )
    .await
    {
        tracing::error!("Legacy builder server failed: {e}");
    }
});
```
The cloned AnvilInstance already keeps the Anvil process alive, so holding the AnvilLayer in _anvil_layer is unnecessary.
```rust
    }
}

// url for the hotshot event streaming api
```
This seems to be unused, hence removed.
```rust
use super::*;
use crate::non_permissioned::BuilderConfig;

#[derive(Clone)]
```
will add this back
```diff
 let public = T::SignatureKey::from_private(&secret);
 let kpair = derive_keypair::<<T as NodeType>::SignatureKey>(&secret);
-let port = portpicker::pick_unused_port().expect("an unused port is available");
+let port = test_utils::reserve_tcp_port().expect("Could not reserve TCP port");
```
I would have preferred the "expect as precondition style", but not important.
yeah i agree i will change it
```rust
// Force TIME_WAIT by completing TCP handshake
let _client = TcpStream::connect(addr)?;
let (_accepted, _) = server.accept()?;
// All sockets drop here - port enters TIME_WAIT
```
It would perhaps be nice to upstream this to the port picker crate. It would neatly fit into https://github.com/Dentosal/portpicker-rs/blob/912f913ac325278a564d8828d550070db1d79373/src/lib.rs#L15. While under review the repo could be forked.
Bind to a socket, then release it; by default on Linux the OS will not hand out that port again for 60 seconds.
Run 1: https://github.com/EspressoSystems/espresso-network/actions/runs/21724072749?pr=3932 ok
Run 2: https://github.com/EspressoSystems/espresso-network/actions/runs/21724930185?pr=3932 2 failures (also on main)
Run 3: https://github.com/EspressoSystems/espresso-network/actions/runs/21724930185?pr=3932 same 2 failures (also on main)
I didn't see any "address already in use" errors anymore.