feat(multiversion-tests): add cross-version integration test harness #1207
glottologist wants to merge 8 commits into `master`
Conversation
📝 Walkthrough

Adds a new multiversion test suite: workspace manifest and toolchain, fixtures, extensive Rust test utilities (binary resolver, cluster orchestration, config generator, fault injection, process/ports/probe modules), end-to-end and upgrade tests, CI workflow, xtask command, and a .gitignore rule.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User
    participant Resolver as BinaryResolver
    participant Git as Git (worktree)
    participant Cargo as Cargo
    participant Cluster as Cluster
    participant Node as NodeProcess
    participant Probe as HttpProbe
    User->>Resolver: resolve_new()/resolve_old()
    Resolver->>Git: validate_git_ref() / create_worktree()
    Git-->>Resolver: worktree_path / resolved_rev
    Resolver->>Cargo: cargo build (worktree)
    Cargo-->>Resolver: binary_path
    Resolver-->>User: ResolvedBinary
    User->>Cluster: start(ClusterSpec)
    Cluster->>Cluster: allocate_ports()
    Cluster->>Node: spawn nodes with per-node configs
    Node-->>Cluster: running
    Cluster->>Probe: wait_for_convergence(api_urls)
    Probe->>Probe: poll_all_nodes()
    Probe-->>Cluster: converged / timeout
    Cluster-->>User: ready / error
```
```mermaid
sequenceDiagram
    participant User as User
    participant Cluster as Cluster
    participant Resolver as BinaryResolver
    participant Node as NodeProcess
    participant Probe as HttpProbe
    User->>Cluster: upgrade_node(name, new_binary)
    Cluster->>Node: node_mut(name)
    Cluster->>Node: shutdown()/kill()
    Node-->>Cluster: terminated
    Cluster->>Node: set_binary(new_binary) & respawn()
    Node-->>Cluster: running
    Cluster->>Probe: wait_for_ready(node)
    Probe-->>Cluster: node ready
    Cluster->>Probe: wait_for_convergence()
    Probe-->>Cluster: converged
    Cluster-->>User: upgrade complete
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@multiversion-tests/Cargo.toml`:
- Line 11: The tokio dependency in Cargo.toml is missing the "sync" feature
required by multiversion-tests/src/process.rs which uses tokio::sync::mpsc
(UnboundedReceiver and unbounded_channel); update the tokio features array for
the tokio = { version = "1", ... } entry to include "sync" so the mpsc types and
helpers are available at compile time.
In `@multiversion-tests/src/binary.rs`:
- Around line 101-115: The try_env_override function currently calls
self.git_rev("HEAD") and embeds that value in the ResolvedBinary, which is
misleading for environment-provided binaries; change try_env_override to skip
calling self.git_rev and instead set the revision to a clear sentinel such as
"env-override" (or "unknown") when creating the ResolvedBinary via
resolved(label, rev, path), and ensure any error path still returns
BinaryError::NotFound for missing paths rather than attempting to resolve a git
rev.
In `@multiversion-tests/src/cluster.rs`:
- Around line 208-211: The function node_api_url currently indexes port_map with
port_map[name], which will panic if name is missing; change node_api_url to
perform a defensive lookup using port_map.get(name) and return a Result (e.g.,
Result<String, Error>/anyhow::Error) rather than unconditionally returning
String, and on None return an Err with a clear message including the missing
node name and type NodePorts; ensure callers of node_api_url are updated to
propagate or handle the Result and continue to use
crate::node_api_base(ports.api) when the lookup succeeds.
In `@multiversion-tests/src/fault/network.rs`:
- Around line 16-30: The current remove_all_rules() called from
NetworkPartitioner::drop flushes the host's default filter table and can break
unrelated networking; instead, modify the injector to create and use a dedicated
chain (e.g., "MULTIVERSION_TEST_CHAIN") when installing rules and change
remove_all_rules() to only list and delete rules from that dedicated chain and
then delete the chain itself (use iptables -N when creating rules, -F and -X
only on that chain or delete per-rule by comment/identifier), ensuring
NetworkPartitioner::drop calls this scoped cleanup; update any code that
installs rules to add them to the dedicated chain and avoid calling "iptables
-F" without specifying the chain.
- Around line 37-44: When adding the two DROP rules (calls to
add_drop_rule(source_port, dest_port) and add_drop_rule(dest_port,
source_port)), ensure that if the second add_drop_rule fails you undo the first
rule before returning the error: after awaiting the second add_drop_rule, on Err
capture the error, call sync_remove_drop_rule(source_port, dest_port) to roll
back the first insert (propagate any removal error or prefer returning the
original add error), and then return the original error; keep the existing
FaultGuard::new closure that removes both rules for the success path.
In `@multiversion-tests/src/ports.rs`:
- Around line 15-27: NodePorts::allocate() currently binds three TcpListeners
then drops them, so ports aren’t reserved; change the approach so sockets remain
open until the child process is actually listening or let the child bind :0 and
report its chosen ports back. Concretely, modify NodePorts::allocate() to return
and store the TcpListener instances (or a new struct holding listeners and port
numbers) instead of dropping them, update Cluster::start to hold those listeners
and pass/transfer/dup them to the child or wait for the child to accept on them
before dropping, and update the test consecutive_allocations_do_not_collide to
either read reported ports from the child or use the reserved listeners to avoid
races.
In `@multiversion-tests/src/probe.rs`:
- Around line 146-169: In poll_until, move or add the condition evaluation so it
runs before declaring timeout: call the provided check().await at the top of the
loop (or, when start.elapsed() > timeout, perform one final check().await) and
only return ProbeError::Timeout after that final attempt; update the loop in
poll_until to ensure check() is invoked before returning Err(ProbeError::Timeout
{ condition: condition_name.to_owned(), elapsed: timeout }) so a condition that
becomes true at the timeout boundary can still succeed.
In `@multiversion-tests/src/process.rs`:
- Around line 90-94: The kill() method currently sends SIGKILL then drops
self.child, which can leave a zombie; update kill() to take the existing child
handle (e.g. let mut child = self.child.take().ok_or(...)?), send SIGKILL using
signal::kill(self.pid()?, ...), then explicitly wait for the child to terminate
(use child.wait() or a small try_wait loop with a timeout/retries) before
returning Ok(()); ensure you still clear self.child only after wait succeeds or
after timeout. Also update respawn() to await the completion of start_kill() (or
join its handle/future) before spawning a new process so the old PID is reaped
before creating the replacement. Ensure references to pid(), child, kill(),
start_kill(), and respawn() are used to locate the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7e529c48-c3b3-4f5d-aad5-50f3e99c7188
⛔ Files ignored due to path filters (2)
- `Cargo.lock` is excluded by `!**/*.lock`
- `multiversion-tests/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
- .gitignore
- multiversion-tests/Cargo.toml
- multiversion-tests/claude/prompts.md
- multiversion-tests/fixtures/base-config.toml
- multiversion-tests/rust-toolchain.toml
- multiversion-tests/src/binary.rs
- multiversion-tests/src/cluster.rs
- multiversion-tests/src/config.rs
- multiversion-tests/src/fault/crash.rs
- multiversion-tests/src/fault/mod.rs
- multiversion-tests/src/fault/network.rs
- multiversion-tests/src/lib.rs
- multiversion-tests/src/ports.rs
- multiversion-tests/src/probe.rs
- multiversion-tests/src/process.rs
Actionable comments posted: 11
♻️ Duplicate comments (3)
multiversion-tests/src/ports.rs (1)
15-27: ⚠️ Potential issue | 🟠 Major — These ports are only sampled, not reserved.

After line 26 the listeners are gone, so the returned ports are immediately available to other processes again. Because `multiversion-tests/src/cluster.rs::Cluster::start()` allocates the full map before it spawns children, startup can still race into intermittent `EADDRINUSE` here. Keep the listeners alive until each child has bound, or let the child bind `:0` and report the chosen ports back.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/ports.rs` around lines 15 - 27, the allocate() function in multiversion-tests::ports currently drops its TcpListener handles before returning sampled ports, which allows races (EADDRINUSE) when Cluster::start() spawns children; keep the listeners alive until children have bound or change the approach so children bind to ":0" and report back. Fix by changing ports::allocate to return or hold the TcpListener instances (e.g., return a struct with fields api_listener, gossip_listener, reth_network_listener alongside ports) and ensure Cluster::start uses those listeners to accept/transfer the bound port ownership to children (or modify Cluster::start to spawn children that bind to ":0" and send their chosen ports back), updating any code that references Ports::allocate to consume the listener-backed struct or the post-bind reported port values.

multiversion-tests/src/process.rs (1)
98-101: ⚠️ Potential issue | 🟠 Major — Wait for the old child to exit before clearing or replacing it.

`kill()` clears `self.child` immediately after sending SIGKILL, and `respawn()` can launch the replacement immediately after requesting the old kill. In `multiversion-tests/tests/upgrade.rs` lines 174-183, the crash-upgrade path does exactly that sequence, so the new node can race the old one's socket teardown and fail intermittently on rebinding. Reap the old child, or add a `kill_and_wait()` path for restart flows, before treating the slot as free.

Also applies to: 139-149
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/process.rs` around lines 98 - 101, the kill() implementation sends SIGKILL then clears self.child immediately, which allows respawn() to reuse the slot before the old process has exited; change kill() to wait for/reap the killed child before clearing self.child (or add a separate kill_and_wait() used by restart flows), i.e. after signal::kill(self.pid()?, Signal::SIGKILL) perform a blocking or timed wait on the child (using the same Process/Child wait API you use elsewhere) and only set self.child = None once the child has been reaped; ensure proper handling of wait errors (EINTR/ESRCH/timeouts) and return a ProcessError on failure so respawn() sees the slot as free only after the old process is gone.

multiversion-tests/src/probe.rs (1)
165-176: ⚠️ Potential issue | 🟡 Minor — Do one final probe before timing out.

The loop checks `start.elapsed()` before it runs `check().await`, so a condition that flips right at the deadline still returns `Timeout`. That makes the readiness and convergence waiters more flaky than they need to be.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/probe.rs` around lines 165 - 176, The loop currently tests start.elapsed() against timeout before calling check(), which makes conditions that become true exactly at the deadline return ProbeError::Timeout; modify the probe loop so it performs the asynchronous check() at least once after the timeout check (e.g., call check().await before returning Timeout or rearrange to check first then test elapsed), ensuring the final check can succeed and, if it still returns None, then return Err(ProbeError::Timeout { condition: condition_name.to_owned(), elapsed: timeout }). Reference the existing symbols start, timeout, check(), condition_name and ProbeError::Timeout when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/multiversion.yml:
- Around line 43-45: Update the "Run multiversion tests" step: modify the run
command string "cargo test --test '*' -- --ignored --test-threads=1" to append
--nocapture so test output streams to the CI logs for debugging, and add an
explicit timeout (e.g., set timeout-minutes: 30) on the step/job to avoid
indefinite runs; target the step named "Run multiversion tests" and the run
invocation when applying the change.
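An illustrative step only; the step name and command follow the review comment, not the actual workflow file:

```yaml
- name: Run multiversion tests
  timeout-minutes: 30
  run: cargo test --test '*' -- --ignored --test-threads=1 --nocapture
```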
In
`@multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md`:
- Line 76: Replace the hardcoded machine-specific cd command that uses
/home/glottologist/.../multiversion-testing with a portable alternative: either
change it to a relative path (e.g., cd multiversion-testing) or use the
repository root lookup $(git rev-parse --show-toplevel) to compute the project
root and cd into the multiversion-testing directory; update the line containing
the literal path string so it no longer references the developer's home
directory.
In `@multiversion-tests/src/binary.rs`:
- Around line 103-107: The current sequence (create_worktree, build_at,
remove_worktree, result?) can leak the worktree if build_at fails and
remove_worktree also fails; change the flow to always attempt removal of the
worktree regardless of build_at outcome (use a scope guard/defer/Drop-style
guard or explicit match) so remove_worktree(&worktree_path).await is executed in
a finally-like block; capture the build error from build_at (e.g., let build_err
= result.err()) and then call remove_worktree, and if removal fails combine or
log the removal error but return the original build error (or wrap both) so the
primary failure (build_at) is preserved while still ensuring cleanup; reference
functions/idents: create_worktree, build_at, remove_worktree, worktree_path,
cached.
In `@multiversion-tests/src/cluster.rs`:
- Around line 156-164: allocate_ports currently calls NodePorts::allocate while
inserting into port_map, which silently overwrites duplicates; first validate
uniqueness of NodeSpec.name for the entire nodes slice before allocating any
resources and return an error if a duplicate is found. In practice, add a
pre-check in allocate_ports that iterates nodes, uses a HashSet to detect
duplicates of NodeSpec.name, and if a duplicate is seen return an appropriate
ClusterError (include the offending NodeSpec.name in the error message) so
allocation via NodePorts::allocate() only runs when all names are unique.
- Around line 89-100: The convergence check is missing crashed nodes because
api_urls() filters out nodes where is_running() is false; update api_urls() to
return every node's api_url() (i.e., stop filtering by is_running()) so the
probe receives the full set of node endpoints (use values().map(|n|
n.api_url()).collect()), keeping the probe responsible for detecting
unreachable/down endpoints; reference api_urls(), is_running(), and api_url()
when making the change.
- Around line 217-226: The current peers.iter().filter_map(...) silently drops
unknown peer names; change this to validate every peer and return a ClusterError
when a peer name is missing from port_map instead of filtering it out. Locate
the code that builds the trusted peers (the iterator over peers and the creation
of PeerEndpoint using port_map.get) and replace the filter_map logic with a
mapping that yields a Result<PeerEndpoint, ClusterError> (or collects into
Result<Vec<PeerEndpoint>, ClusterError>), producing an explicit ClusterError if
port_map.get(peer_name) is None; ensure the function signature propagates or
returns that ClusterError so missing peer references fail fast.
In `@multiversion-tests/src/fault/crash.rs`:
- Around line 27-29: The extract_pid function must reject non-positive PIDs
before they reach kill: in extract_pid (handling FaultTarget::Process { pid })
check pid.as_raw() > 0 and return an Err(FaultError) (with an explanatory
message) when pid.as_raw() <= 0 instead of returning Ok(*pid); update any
callers/tests accordingly and add a regression test that constructs
FaultTarget::Process with raw 0 and a negative raw PID to assert the function
returns the error (ensuring we never propagate Pid::from_raw(0) or negative
values to kill).
In `@multiversion-tests/src/fault/network.rs`:
- Around line 19-23: The Drop impl for NetworkPartitioner currently calls
sync_cleanup_chain unconditionally, which can remove rules for other active
instances; modify NetworkPartitioner to track per-instance ownership and only
clean up when appropriate (e.g., add an active: bool field set to true when this
instance creates rules and only call sync_cleanup_chain in drop if active is
true), or implement a global AtomicUsize reference counter (increment when rules
are installed, decrement in Drop, and call sync_cleanup_chain only when the
counter reaches zero); update the NetworkPartitioner constructor/installer and
the Drop impl accordingly to use the chosen mechanism.
In `@multiversion-tests/tests/e2e.rs`:
- Around line 7-8: Extract the two timeout constants HEIGHT_TIMEOUT and
CONVERGENCE_TIMEOUT out of multiversion-tests/tests/e2e.rs into the shared test
utilities module (common) as pub consts, then replace the local definitions by
referencing the shared constants (e.g., common::HEIGHT_TIMEOUT and
common::CONVERGENCE_TIMEOUT) so all tests reuse the same values; ensure the
common module is imported where needed and update any docs/comments accordingly.
- Around line 10-12: Replace the manual #[tokio::test(flavor = "multi_thread")]
attribute and any explicit init_tracing() calls in the async tests with the
tracing-aware attribute #[test_log::test(tokio::test)] for
single_genesis_produces_blocks and the other two async tests in this file;
specifically, update the test attributes (e.g., on
single_genesis_produces_blocks) to #[test_log::test(tokio::test)] and remove the
separate init_tracing() invocation so tracing is initialized automatically by
the test macro.
In `@multiversion-tests/tests/upgrade.rs`:
- Around line 22-26: Replace the raw tokio test attributes and manual tracing
initialization in the async tests with the project-standard test_log wrapper:
change the attribute on upgrade_one_node_in_running_cluster (and the other async
test functions in this file) from #[tokio::test(flavor = "multi_thread")] to
#[test_log::test(tokio::test(flavor = "multi_thread"))], and remove the call to
common::init_tracing() at the start of each test so the test_log subscriber
handles tracing output instead.
---
Duplicate comments:
In `@multiversion-tests/src/ports.rs`:
- Around line 15-27: The allocate() function in multiversion-tests::ports
currently drops its TcpListener handles before returning sampled ports, which
allows races (EADDRINUSE) when Cluster::start() spawns children; keep the
listeners alive until children have bound or change the approach so children
bind to ":0" and report back. Fix by changing ports::allocate to return or hold
the TcpListener instances (e.g., return a struct with fields api_listener,
gossip_listener, reth_network_listener alongside ports) and ensure
Cluster::start uses those listeners to accept/transfer the bound port ownership
to children (or modify Cluster::start to spawn children that bind to ":0" and
send their chosen ports back), updating any code that references Ports::allocate
to consume the listener-backed struct or the post-bind reported port values.
In `@multiversion-tests/src/probe.rs`:
- Around line 165-176: The loop currently tests start.elapsed() against timeout
before calling check(), which makes conditions that become true exactly at the
deadline return ProbeError::Timeout; modify the probe loop so it performs the
asynchronous check() at least once after the timeout check (e.g., call
check().await before returning Timeout or rearrange to check first then test
elapsed), ensuring the final check can succeed and, if it still returns None,
then return Err(ProbeError::Timeout { condition: condition_name.to_owned(),
elapsed: timeout }). Reference the existing symbols start, timeout, check(),
condition_name and ProbeError::Timeout when making this change.
In `@multiversion-tests/src/process.rs`:
- Around line 98-101: The kill() implementation sends SIGKILL then clears
self.child immediately, which allows respawn() to reuse the slot before the old
process has exited; change kill() to wait for/reap the killed child before
clearing self.child (or add a separate kill_and_wait() used by restart flows),
i.e. after signal::kill(self.pid()?, Signal::SIGKILL) call perform a blocking or
timed wait on the child (using the same Process/Child wait API you use
elsewhere) and only set self.child = None once the child has been reaped; ensure
proper handling of wait errors (EINTR/ESRCH/timeouts) and return a ProcessError
on failure so respawn() sees the slot as free only after the old process is
gone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: fc73493e-fac4-4ed6-84fc-2644115bee33
📒 Files selected for processing (18)
.github/workflows/multiversion.yml.gitignoremultiversion-tests/Cargo.tomlmultiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.mdmultiversion-tests/fixtures/base-config.tomlmultiversion-tests/src/binary.rsmultiversion-tests/src/cluster.rsmultiversion-tests/src/config.rsmultiversion-tests/src/fault/crash.rsmultiversion-tests/src/fault/network.rsmultiversion-tests/src/lib.rsmultiversion-tests/src/ports.rsmultiversion-tests/src/probe.rsmultiversion-tests/src/process.rsmultiversion-tests/tests/common/mod.rsmultiversion-tests/tests/e2e.rsmultiversion-tests/tests/upgrade.rsxtask/src/main.rs
Actionable comments posted: 2
♻️ Duplicate comments (6)
multiversion-tests/src/cluster.rs (3)
251-264: ⚠️ Potential issue | 🟠 Major — Fail fast on unknown peer references.

`filter_map` silently drops any `NodeSpec.peers` entry that is missing from `port_map`. That starts the node with the wrong topology and turns a simple typo into a later readiness or convergence timeout. Make this return `Result<Vec<PeerEndpoint>, ClusterError>` and error on the first unknown peer name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/cluster.rs` around lines 251 - 264, The build_peer_endpoints function currently drops unknown peer names via filter_map; change its signature to return Result<Vec<PeerEndpoint>, ClusterError> and iterate over peers, looking up each peer_name in port_map; if a lookup fails, return an Err(ClusterError::UnknownPeer(peer_name.clone())) (or appropriate ClusterError variant), otherwise push the constructed PeerEndpoint (using NodePorts fields gossip, api, reth_network) into a Vec and return Ok(vec) at the end so the caller fails fast on missing peer references instead of silently omitting them.
89-99: ⚠️ Potential issue | 🟠 Major — Don't drop crashed nodes from convergence polling.

`wait_for_convergence()` only probes the URLs returned here. Filtering on `is_running()` lets a dead node disappear from the convergence set, so the cluster can look healthy while one member is down. Return every node's `api_url()` and let the probe treat unreachable nodes as non-converged.

Suggested change:

```diff
-    pub fn api_urls(&mut self) -> Vec<String> {
-        self.nodes
-            .values_mut()
-            .filter_map(|n| {
-                if n.is_running() {
-                    Some(n.api_url())
-                } else {
-                    None
-                }
-            })
-            .collect()
+    pub fn api_urls(&self) -> Vec<String> {
+        self.nodes.values().map(|n| n.api_url()).collect()
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/cluster.rs` around lines 89 - 99, The api_urls() helper currently filters out nodes where n.is_running(), causing wait_for_convergence() to never probe crashed members; change api_urls() (the function that iterates self.nodes.values_mut()) to return every node.api_url() unconditionally (i.e., map every node to n.api_url() and collect) so convergence polling sees unreachable nodes as non‑converged; update the iterator logic to remove the is_running() conditional and return all URLs.
156-163: ⚠️ Potential issue | 🟠 Major — Reject duplicate node names before allocating anything.

`HashMap::insert` silently overwrites the earlier entry here. The rest of the module also keys `nodes`, `port_map`, `config-{name}.toml`, and `data-{name}` by `NodeSpec.name`, so a duplicate can leave one process untracked while both instances contend for the same paths and ports. Validate uniqueness first and return a `ClusterError`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/cluster.rs` around lines 156 - 163, the allocate_ports function currently inserts into port_map while allocating per node, allowing later NodeSpec.name duplicates to silently overwrite earlier entries; before any NodePorts::allocate calls, validate that all NodeSpec.name values in the nodes slice are unique (e.g. collect names into a set and detect duplicates) and return a ClusterError if any duplicate is found so no ports or resources are allocated for conflicting names; only if uniqueness passes, proceed to allocate NodePorts and insert into port_map as currently implemented.

multiversion-tests/src/process.rs (1)
141-150: ⚠️ Potential issue | 🟠 Major — Wait for the old child to exit before respawning.

`start_kill()` only initiates termination. Spawning the replacement immediately after that can race with the old node still holding the same ports or data directory, and this path still leaves reaping to a best-effort drop. Make `respawn()` async and await the old child's exit, or route this through the existing `kill()`/`shutdown()` paths first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/process.rs` around lines 141 - 150, the respawn() function currently calls start_kill() (which only initiates termination) and immediately spawns a new child, causing races on ports/data; change respawn() into an async function and ensure the old child is fully exited before spawning the replacement: either await the child's exit after start_kill() (or call the existing async kill()/shutdown() path and await its completion) and only after confirming the old child process has terminated, call spawn_child_with_logs to set self.child and self.log_rx; reference respawn(), start_kill(), kill()/shutdown(), spawn_child_with_logs, and the self.child/self.log_rx fields when applying the change.

multiversion-tests/tests/upgrade.rs (1)
22-26: 🧹 Nitpick | 🔵 Trivial — Use `test_log` instead of manual tracing setup in these async tests.

These tests still pair raw `#[tokio::test(...)]` with `common::init_tracing()`. Switching to `#[test_log::test(tokio::test(...))]` keeps tracing setup consistent and removes the per-test subscriber boilerplate.

As per coding guidelines: "Use `#[test_log::test(tokio::test)]` for async tests with tracing output in Rust".

Also applies to: 62-66, 106-110, 155-159
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/tests/upgrade.rs` around lines 22 - 26, replace the manual tracing setup in async tests by swapping the #[tokio::test(...)] attribute for #[test_log::test(tokio::test(...))] on each async test (e.g., upgrade_one_node_in_running_cluster and the other async tests in this file that call common::init_tracing()), and remove the call to common::init_tracing() at the start of those test bodies so the test_log crate initializes tracing automatically.

multiversion-tests/src/probe.rs (1)
177-188: ⚠️ Potential issue | 🟡 Minor — `poll_until()` can report timeout one iteration too early.

The loop returns `Timeout` before making a final `check().await` on the iteration that crosses the deadline. That turns a boundary success into a false negative.

Suggested change:

```diff
     let start = tokio::time::Instant::now();
     loop {
-        if start.elapsed() > timeout {
-            return Err(ProbeError::Timeout {
-                condition: condition_name.to_owned(),
-                elapsed: timeout,
-            });
-        }
         if let Some(result) = check().await {
             return Ok(result);
         }
+        if start.elapsed() >= timeout {
+            return Err(ProbeError::Timeout {
+                condition: condition_name.to_owned(),
+                elapsed: timeout,
+            });
+        }
         tokio::time::sleep(interval).await;
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/probe.rs` around lines 177 - 188, The poll_until loop returns Timeout before performing a final check when the deadline is crossed; move the timeout check to after the awaited check so the code calls check().await first and only returns ProbeError::Timeout (using ProbeError::Timeout, condition_name, start, timeout) if check() yields None and the elapsed time exceeded the deadline, then sleep and repeat; ensure the reordered logic still sleeps on intervals and returns Ok when check() returns Some.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 195-198: The code currently panics via expect when genesis_api_url
is None while peer_specs is non-empty; change Cluster::start() to return a
proper ClusterError instead of panicking: check genesis_api_url.as_deref() and
if None return Err(ClusterError::InvalidSpec("peers present but no genesis node"
or similar)), updating the match/if branch around peer_specs and genesis_api_url
to propagate that Result; reference the Cluster::start() function, the
peer_specs and genesis_api_url variables, and construct/return an appropriate
ClusterError variant rather than calling expect().
In `@multiversion-tests/src/process.rs`:
- Line 42: log_rx is currently an unbounded receiver created with
mpsc::unbounded_channel(), which retains unlimited log history; change to a
bounded channel (e.g., mpsc::channel(BUFFER_SIZE)) or implement a fixed-size
circular buffer to cap memory. Replace the unbounded creation with a bounded
sender/receiver pair and update the code that forwards stdout/stderr to use
try_send (or async send) and handle Full by either dropping the new message or
evicting the oldest entry into a circular VecDeque held on the struct; also
adjust drain_logs() to read from the bounded receiver and optionally flush from
the in-memory circular buffer so callers still get recent logs. Ensure you
update the log_rx field type and any send sites to match the bounded
sender/receiver API and preserve behavior when the channel is full.
---
Duplicate comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 251-264: The build_peer_endpoints function currently drops unknown
peer names via filter_map; change its signature to return
Result<Vec<PeerEndpoint>, ClusterError> and iterate over peers, looking up each
peer_name in port_map; if a lookup fails, return an
Err(ClusterError::UnknownPeer(peer_name.clone())) (or appropriate ClusterError
variant), otherwise push the constructed PeerEndpoint (using NodePorts fields
gossip, api, reth_network) into a Vec and return Ok(vec) at the end so the
caller fails fast on missing peer references instead of silently omitting them.
- Around line 89-99: The api_urls() helper currently filters out nodes where
n.is_running(), causing wait_for_convergence() to never probe crashed members;
change api_urls() (the function that iterates self.nodes.values_mut()) to return
every node.api_url() unconditionally (i.e., map every node to n.api_url() and
collect) so convergence polling sees unreachable nodes as non‑converged; update
the iterator logic to remove the is_running() conditional and return all URLs.
- Around line 156-163: The allocate_ports function currently inserts into
port_map while allocating per node, allowing later NodeSpec.name duplicates to
silently overwrite earlier entries; before any NodePorts::allocate calls,
validate that all NodeSpec.name values in the nodes slice are unique (e.g.
collect names into a set and detect duplicates) and return a ClusterError if any
duplicate is found so no ports or resources are allocated for conflicting names;
only if uniqueness passes, proceed to allocate NodePorts and insert into
port_map as currently implemented.
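The uniqueness validation above can be sketched in a few lines; the `ClusterError::DuplicateNodeName` variant name is an assumption for illustration:

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum ClusterError {
    DuplicateNodeName(String),
}

/// Sketch: reject duplicate NodeSpec names before any ports are allocated,
/// so nothing is leaked when the spec turns out to be invalid.
fn validate_unique_names(names: &[&str]) -> Result<(), ClusterError> {
    let mut seen = HashSet::new();
    for name in names {
        // HashSet::insert returns false when the value was already present.
        if !seen.insert(*name) {
            return Err(ClusterError::DuplicateNodeName((*name).to_owned()));
        }
    }
    Ok(())
}

fn main() {
    assert_eq!(validate_unique_names(&["genesis", "peer-1", "peer-2"]), Ok(()));
    assert_eq!(
        validate_unique_names(&["genesis", "peer-1", "peer-1"]),
        Err(ClusterError::DuplicateNodeName("peer-1".to_owned()))
    );
    println!("duplicate names rejected before allocation");
}
```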
In `@multiversion-tests/src/probe.rs`:
- Around line 177-188: The poll_until loop returns Timeout before performing a
final check when the deadline is crossed; move the timeout check to after the
awaited check so the code calls check().await first and only returns
ProbeError::Timeout (using ProbeError::Timeout, condition_name, start, timeout)
if check() yields None and the elapsed time exceeded the deadline, then sleep
and repeat; ensure the reordered logic still sleeps on intervals and returns Ok
when check() returns Some.
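The reordered loop can be sketched synchronously (the real `poll_until` is async, but the control flow is the same): the check always runs first, so a condition that becomes true exactly at the deadline is still reported as success.

```rust
use std::time::{Duration, Instant};

/// Sketch of check-before-timeout polling. Timeout is only declared after a
/// check has failed past the deadline, never before a final check.
fn poll_until<T>(
    mut check: impl FnMut() -> Option<T>,
    timeout: Duration,
    interval: Duration,
) -> Result<T, &'static str> {
    let start = Instant::now();
    loop {
        if let Some(value) = check() {
            return Ok(value); // final check happens even past the deadline
        }
        if start.elapsed() >= timeout {
            return Err("timeout");
        }
        std::thread::sleep(interval);
    }
}

fn main() {
    // Succeeds on the third attempt.
    let mut calls = 0;
    let result = poll_until(
        || { calls += 1; (calls >= 3).then_some(calls) },
        Duration::from_secs(5),
        Duration::from_millis(1),
    );
    assert_eq!(result, Ok(3));

    // Never succeeds: times out, but only after at least one check ran.
    let out: Result<(), _> =
        poll_until(|| None, Duration::from_millis(5), Duration::from_millis(1));
    assert!(out.is_err());
    println!("condition is checked before timeout is declared");
}
```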
In `@multiversion-tests/src/process.rs`:
- Around line 141-150: The respawn() function currently calls start_kill()
(which only initiates termination) and immediately spawns a new child, causing
races on ports/data; change respawn() into an async function and ensure the old
child is fully exited before spawning the replacement: either await the child's
exit after start_kill() (or call the existing async kill()/shutdown() path and
await its completion) and only after confirming the old child process has
terminated, call spawn_child_with_logs to set self.child and self.log_rx;
reference respawn(), start_kill(), kill()/shutdown(), spawn_child_with_logs, and
the self.child/self.log_rx fields when applying the change.
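The wait-before-respawn fix can be sketched with std's blocking process API (the harness uses tokio's async equivalents, where `kill()` corresponds to `start_kill()` followed by an awaited `wait()`). The demo assumes a Unix environment with `sleep` and `true` on PATH:

```rust
use std::process::{Child, Command, Stdio};

/// Sketch of the fix: take the old child, request termination, then *wait*
/// for it to exit before spawning the replacement, so the new process never
/// races the old one for ports or the data directory.
fn respawn(old: &mut Option<Child>, program: &str) -> std::io::Result<Child> {
    if let Some(mut child) = old.take() {
        let _ = child.kill(); // analogous to tokio's start_kill(): only initiates
        child.wait()?;        // block until the old process has actually exited
    }
    Command::new(program).stdout(Stdio::null()).spawn()
}

/// Demo helper (assumes Unix `sleep` and `true` binaries).
fn demo() -> std::io::Result<bool> {
    let mut slot = Some(Command::new("sleep").arg("30").spawn()?);
    let mut replacement = respawn(&mut slot, "true")?;
    Ok(replacement.wait()?.success())
}

fn main() {
    assert!(demo().expect("spawn failed"));
    println!("old child fully reaped before respawn");
}
```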
In `@multiversion-tests/tests/upgrade.rs`:
- Around line 22-26: Replace the manual tracing setup in async tests by swapping
the #[tokio::test(...)] attribute for #[test_log::test(tokio::test(...))] on
each async test (e.g., upgrade_one_node_in_running_cluster and the other async
tests in this file that call common::init_tracing()), and remove the call to
common::init_tracing() at the start of those test bodies so the test_log crate
initializes tracing automatically.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e9094e81-cd66-40e8-a4c6-a625e8190659
📒 Files selected for processing (5)
multiversion-tests/src/cluster.rs
multiversion-tests/src/config.rs
multiversion-tests/src/probe.rs
multiversion-tests/src/process.rs
multiversion-tests/tests/upgrade.rs
Actionable comments posted: 8
♻️ Duplicate comments (4)
multiversion-tests/src/process.rs (2)
42-42: ⚠️ Potential issue | 🟠 Major

Log capture still has no memory bound.

`mpsc::unbounded_channel()` retains every stdout/stderr line until `drain_logs()` is called. Successful multiversion runs can keep nodes alive for a long time and never drain, so a chatty binary can grow memory without limit. Use a bounded channel or a fixed-size ring buffer and keep only the recent tail.

Tokio 1.x documentation for `tokio::sync::mpsc::unbounded_channel`: does it impose any capacity limit or backpressure when the receiver is not draining?

Also applies to: 158-164, 197-220
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/process.rs` at line 42, The log receiver field log_rx currently uses an unbounded channel which can grow without limit; change it to a bounded channel (e.g., tokio::sync::mpsc::channel with a reasonable capacity) or implement a fixed-size ring buffer to only retain the most recent N log lines, and adjust all places that create the channel (where unbounded_channel() is called) and where messages are sent/received (including drain_logs() and any spawn/tasks that push logs) to handle TrySend/await send backpressure or drop/overwrite oldest entries; ensure types and error handling in functions interacting with log_rx (e.g., drain_logs, log producer tasks) are updated accordingly.
145-153: ⚠️ Potential issue | 🟠 Major

Wait for the old process to exit before respawning.

`respawn()` calls `start_kill()` and immediately spawns the replacement. The stub test still passes because it never binds ports, but the real node binary reuses the same API/gossip/reth sockets, so this can intermittently fail with address-in-use until the previous child actually exits. Make `respawn()` async and wait for termination, or require callers to stop the process first.

Tokio 1.x documentation for `tokio::process::Child::start_kill`: does it wait for the child to exit, or must callers still await `wait()` / `try_wait()` before reusing resources such as bound ports?

Also applies to: 303-311
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/process.rs` around lines 145 - 153, The respawn() function currently calls start_kill() on the old child and immediately spawns a new child (spawn_child_with_logs), causing port/address conflicts; change respawn to be async (pub async fn respawn(&mut self) -> Result<(), ProcessError>) and after calling start_kill() on the taken child await its termination (call old.wait().await or loop on old.try_wait() until Some) before calling spawn_child_with_logs; ensure you still log start_kill() errors and handle/propagate any wait() errors so the new child is only spawned once the previous child has fully exited (references: respawn, start_kill, spawn_child_with_logs, child / old).

multiversion-tests/src/cluster.rs (2)
93-118: ⚠️ Potential issue | 🟠 Major

Convergence can still look healthy after a node crashes.

`api_urls()` filters out nodes whose `is_running()` flipped false, and `wait_for_convergence()` probes only that reduced set. A node that dies after startup can disappear from the health check instead of making convergence fail. Return every node URL here and let `HttpProbe` surface the unreachable one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/cluster.rs` around lines 93 - 118, api_urls() currently filters out nodes where is_running() is false so wait_for_convergence() only probes running nodes; change api_urls() to return all node URLs (use nodes.values() or values_mut() without the is_running() filter and collect api_url() for every node) so HttpProbe (probe.wait_for_convergence) receives the full set and will surface unreachable/dead nodes; update any callers assuming filtered results if necessary (refer to api_urls, wait_for_convergence, nodes, is_running, api_url, and HttpProbe).
203-206: ⚠️ Potential issue | 🟠 Major

Don't panic on a malformed cluster spec.

Line 206 uses `expect()` when peers exist without any genesis node, which aborts the whole harness even though `Cluster::start()` already returns `Result`. Please add a typed invalid-spec `ClusterError` and return it here instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/cluster.rs` around lines 203 - 206, Replace the panic caused by expect() in Cluster::start when peer_specs is non-empty but genesis_api_url is None: introduce a typed ClusterError variant (e.g., ClusterError::InvalidSpec or ClusterError::MissingGenesis) and return Err(...) from Cluster::start instead of calling expect(); locate the code that computes genesis_url (the genesis_api_url.as_deref().expect(...) line) and change it to use ok_or_else or a manual check to produce and return the new ClusterError with a clear message referencing the malformed cluster spec and peer_specs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md`:
- Around line 93-97: Remove the contradictory "Staging/committing" / "git-write"
next step from the "Remaining Work" section so the document no longer instructs
agents to perform git write operations; edit the "Remaining Work" bullet list
(the line containing "Staging/committing: When ready, stage all new/modified
files for the PR") to delete that bullet or replace it with a non-write action,
ensuring consistency with the prohibition mentioned elsewhere in the doc (the
statement on Line 120 forbidding git write operations).
In `@multiversion-tests/src/binary.rs`:
- Around line 203-215: The current logic blindly retries a failing
try_cargo_build(...) for "debug-release" as "--release", which masks real build
errors; change the Err(e) branch so it inspects the error from try_cargo_build
and only performs the fallback when the error clearly indicates the profile is
missing (e.g., match on a specific error variant returned by try_cargo_build or
check e.to_string().contains(...) for the Cargo message that a profile is
unknown), otherwise return Err(e) directly; update the match in the block that
sets built to call the fallback only on that specific "profile unavailable"
condition and propagate other errors unchanged.
- Around line 103-109: The code currently logs failures from
remove_worktree(...) and proceeds, which can leave a stale deterministic
worktree (worktree-{rev}) and permanently block future create_worktree(...)
calls; change the behavior in the block that calls
remove_worktree(&worktree_path).await so that on error you propagate a failure
(return Err) instead of swallowing it, or perform a robust cleanup (e.g.,
attempt git worktree prune/remove and only continue if cleanup succeeds) before
proceeding to use build_at(...); specifically update the logic surrounding
worktree_path, create_worktree, remove_worktree, and build_at so remove_worktree
failures cause the function to return an error (or retry/force-prune) to avoid
leaving stale registrations that block future builds.
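The profile-fallback item above (only retrying `--release` when the custom profile is genuinely missing) can be sketched as follows. The exact Cargo diagnostic string matched here is an assumption; the real code should match whatever error `try_cargo_build` actually surfaces:

```rust
/// Sketch: fall back from the custom `debug-release` profile to `release`
/// only when the error clearly says the profile is unknown; any other build
/// failure is propagated unchanged instead of being masked by a retry.
fn build_with_fallback<F>(try_build: F) -> Result<String, String>
where
    F: Fn(&str) -> Result<String, String>,
{
    match try_build("debug-release") {
        Ok(bin) => Ok(bin),
        // Hypothetical match on Cargo's "profile ... is not defined" message.
        Err(e) if e.contains("is not defined") => try_build("release"),
        Err(e) => Err(e),
    }
}

fn main() {
    // Profile missing: fallback kicks in.
    let out = build_with_fallback(|profile| {
        if profile == "debug-release" {
            Err("error: profile `debug-release` is not defined".to_owned())
        } else {
            Ok(format!("target/{profile}/node"))
        }
    });
    assert_eq!(out, Ok("target/release/node".to_owned()));

    // Real compile error: propagated, not retried.
    let err = build_with_fallback(|_| Err("error[E0308]: mismatched types".to_owned()));
    assert_eq!(err, Err("error[E0308]: mismatched types".to_owned()));
    println!("fallback only on missing-profile errors");
}
```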
In `@multiversion-tests/src/cluster.rs`:
- Around line 75-76: The port allocation race happens because
NodePorts::allocate() returns bare u16s (ports) and drops the TcpListeners
immediately, so allocate_ports() hands unreserved ports to
spawn_and_wait_for_nodes(), allowing races when spawn_node() later binds them;
fix by preserving bound listeners until the node processes start (e.g., change
NodePorts::allocate()/allocate_ports() to return owned TcpListener instances or
a NodePorts struct that holds listeners, thread those listeners through
spawn_and_wait_for_nodes() into spawn_node() so the sockets remain reserved
until spawn_node() takes ownership and binds/hand-offs), or alternatively move
the port selection into the child process (call the selection logic inside
spawn_node()) so no temporary unreserved ports are returned.
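The first option (keeping the bound listeners alive) can be sketched with std types. `ReservedPorts` is an illustrative name; note that the listener must still be dropped at the exact hand-off moment when the child binds, or handed over via `SO_REUSEPORT`/fd inheritance:

```rust
use std::net::TcpListener;

/// Sketch: the listeners stay bound inside the struct, so the OS cannot give
/// the same ports to another process while the cluster is still starting up.
/// They are released (dropped) only at the hand-off point right before the
/// node process binds them.
struct ReservedPorts {
    api: TcpListener,
    gossip: TcpListener,
    reth_network: TcpListener,
}

impl ReservedPorts {
    fn allocate() -> std::io::Result<Self> {
        Ok(Self {
            api: TcpListener::bind("127.0.0.1:0")?,
            gossip: TcpListener::bind("127.0.0.1:0")?,
            reth_network: TcpListener::bind("127.0.0.1:0")?,
        })
    }

    fn ports(&self) -> (u16, u16, u16) {
        let p = |l: &TcpListener| l.local_addr().expect("bound listener").port();
        (p(&self.api), p(&self.gossip), p(&self.reth_network))
    }
}

fn main() -> std::io::Result<()> {
    let reserved = ReservedPorts::allocate()?;
    let (api, gossip, reth) = reserved.ports();
    // While `reserved` is alive, all three ports stay claimed by this process,
    // so a second allocation cannot collide with them.
    assert!(api != 0 && gossip != 0 && reth != 0);
    assert!(api != gossip && gossip != reth && api != reth);
    println!("reserved ports: {api}, {gossip}, {reth}");
    Ok(())
}
```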
In `@multiversion-tests/src/fault/network.rs`:
- Around line 147-164: Consolidate the two tests into a single parameterized
rstest that exercises extract_ports for both variants of FaultTarget: supply a
case with FaultTarget::Network { source_port: 8080, dest_port: 9090 } expecting
Ok((8080, 9090)) and a case with FaultTarget::Process { pid: Pid::from_raw(1) }
expecting an Err; replace the separate functions
extract_ports_succeeds_for_network_target and
extract_ports_fails_for_process_target with a single #[rstest] function that
enumerates these inputs and asserts the expected result from
extract_ports(&target).
- Around line 36-48: The ACTIVE_INJECTIONS counter must be incremented before
creating drop rules and must always be decremented on any failure or during
cleanup; move ACTIVE_INJECTIONS.fetch_add(1, Ordering::Release) to before
calling add_drop_rule/add_drop_rule(dest...), and if any add_* fails decrement
the counter before returning error; also ensure the FaultGuard closure always
decrements ACTIVE_INJECTIONS even when sync_remove_drop_rule returns Err (i.e.,
perform removals but ignore/remove errors and always call
ACTIVE_INJECTIONS.fetch_sub(1, Ordering::Release) in a finally-style path) so
the counter cannot be left permanently elevated.
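The counter-balancing discipline above can be sketched without any real iptables calls; `inject` and the guard closure are illustrative stand-ins for the fault-injection path and `FaultGuard`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static ACTIVE_INJECTIONS: AtomicUsize = AtomicUsize::new(0);

/// Sketch: the counter is incremented before any rule is created, decremented
/// on the failure path, and the returned cleanup closure decrements it even
/// when rule removal fails (removal errors are ignored here).
fn inject(add_rules: impl Fn() -> Result<(), String>) -> Result<impl FnOnce(), String> {
    ACTIVE_INJECTIONS.fetch_add(1, Ordering::Release);
    if let Err(e) = add_rules() {
        ACTIVE_INJECTIONS.fetch_sub(1, Ordering::Release);
        return Err(e);
    }
    Ok(|| {
        // let _ = sync_remove_drop_rule(...); // hypothetical cleanup; errors ignored
        ACTIVE_INJECTIONS.fetch_sub(1, Ordering::Release);
    })
}

fn main() {
    // Failure path: counter returns to zero.
    assert!(inject(|| Err("iptables missing".to_owned())).is_err());
    assert_eq!(ACTIVE_INJECTIONS.load(Ordering::Acquire), 0);

    // Success path: elevated while active, zero after cleanup runs.
    let guard = inject(|| Ok(())).unwrap();
    assert_eq!(ACTIVE_INJECTIONS.load(Ordering::Acquire), 1);
    guard();
    assert_eq!(ACTIVE_INJECTIONS.load(Ordering::Acquire), 0);
    println!("counter balanced on all paths");
}
```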
In `@multiversion-tests/src/probe.rs`:
- Around line 67-109: Each getter currently calls .send().await? then
.json().await? which treats 4xx/5xx responses as successes; update get_info,
get_peers, get_network_config and get_genesis_hash to call .error_for_status()?
on the response returned by .send().await? before attempting .json() (i.e., let
resp = self.client.get(url).send().await?.error_for_status()?; then
resp.json::<...>().await?), so HTTP error statuses are converted into ProbeError
failures instead of being deserialized as success.
In `@multiversion-tests/tests/common/mod.rs`:
- Around line 44-56: The peer_spec function silently reuses identities by
wrapping the index with modulo on MINING_KEYS/REWARD_ADDRESSES; change it to
fail fast by validating the provided index against MINING_KEYS.len() and
REWARD_ADDRESSES.len() and returning a clear error or panicking rather than
masking the problem—update peer_spec to check that index < MINING_KEYS.len() and
index < REWARD_ADDRESSES.len() before constructing the NodeSpec (or change the
signature to return Result<NodeSpec, Error> and return Err with a descriptive
message referencing peer_spec, mining_key and reward_address when the index is
out of range).
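The bounds-check idea above can be sketched with stand-in fixture tables (the real `MINING_KEYS`/`REWARD_ADDRESSES` live in `tests/common/mod.rs`; the `peer_identity` helper here is hypothetical):

```rust
// Illustrative stand-ins for the fixture tables in tests/common/mod.rs.
const MINING_KEYS: [&str; 3] = ["key-0", "key-1", "key-2"];
const REWARD_ADDRESSES: [&str; 3] = ["addr-0", "addr-1", "addr-2"];

/// Sketch: validate the index instead of wrapping with modulo, so exhausting
/// the identity pool is a loud error rather than silent credential reuse.
fn peer_identity(index: usize) -> Result<(&'static str, &'static str), String> {
    if index >= MINING_KEYS.len() || index >= REWARD_ADDRESSES.len() {
        return Err(format!(
            "peer index {index} out of range: only {} test identities available",
            MINING_KEYS.len()
        ));
    }
    Ok((MINING_KEYS[index], REWARD_ADDRESSES[index]))
}

fn main() {
    assert_eq!(peer_identity(1), Ok(("key-1", "addr-1")));
    // Previously index 3 would wrap back to identity 0; now it fails fast.
    assert!(peer_identity(3).is_err());
    println!("identity pool exhaustion is an error");
}
```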
---
Duplicate comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 93-118: api_urls() currently filters out nodes where is_running()
is false so wait_for_convergence() only probes running nodes; change api_urls()
to return all node URLs (use nodes.values() or values_mut() without the
is_running() filter and collect api_url() for every node) so HttpProbe
(probe.wait_for_convergence) receives the full set and will surface
unreachable/dead nodes; update any callers assuming filtered results if
necessary (refer to api_urls, wait_for_convergence, nodes, is_running, api_url,
and HttpProbe).
- Around line 203-206: Replace the panic caused by expect() in Cluster::start
when peer_specs is non-empty but genesis_api_url is None: introduce a typed
ClusterError variant (e.g., ClusterError::InvalidSpec or
ClusterError::MissingGenesis) and return Err(...) from Cluster::start instead of
calling expect(); locate the code that computes genesis_url (the
genesis_api_url.as_deref().expect(...) line) and change it to use ok_or_else or
a manual check to produce and return the new ClusterError with a clear message
referencing the malformed cluster spec and peer_specs.
In `@multiversion-tests/src/process.rs`:
- Line 42: The log receiver field log_rx currently uses an unbounded channel
which can grow without limit; change it to a bounded channel (e.g.,
tokio::sync::mpsc::channel with a reasonable capacity) or implement a fixed-size
ring buffer to only retain the most recent N log lines, and adjust all places
that create the channel (where unbounded_channel() is called) and where messages
are sent/received (including drain_logs() and any spawn/tasks that push logs) to
handle TrySend/await send backpressure or drop/overwrite oldest entries; ensure
types and error handling in functions interacting with log_rx (e.g., drain_logs,
log producer tasks) are updated accordingly.
- Around line 145-153: The respawn() function currently calls start_kill() on
the old child and immediately spawns a new child (spawn_child_with_logs),
causing port/address conflicts; change respawn to be async (pub async fn
respawn(&mut self) -> Result<(), ProcessError>) and after calling start_kill()
on the taken child await its termination (call old.wait().await or loop on
old.try_wait() until Some) before calling spawn_child_with_logs; ensure you
still log start_kill() errors and handle/propagate any wait() errors so the new
child is only spawned once the previous child has fully exited (references:
respawn, start_kill, spawn_child_with_logs, child / old).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 31cb7c3d-922c-48d2-bf18-ea66d46809ff
⛔ Files ignored due to path filters (1)
multiversion-tests/Cargo.lock is excluded by `!**/*.lock`
📒 Files selected for processing (12)
.github/workflows/multiversion.yml
multiversion-tests/Cargo.toml
multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md
multiversion-tests/src/binary.rs
multiversion-tests/src/cluster.rs
multiversion-tests/src/fault/crash.rs
multiversion-tests/src/fault/network.rs
multiversion-tests/src/probe.rs
multiversion-tests/src/process.rs
multiversion-tests/tests/common/mod.rs
multiversion-tests/tests/e2e.rs
multiversion-tests/tests/upgrade.rs
```markdown
### Remaining Work
- **Compilation verification**: Small — just running checks and fixing any issues
- **prompts.md update**: Small — recording session prompts
- **Staging/committing**: When ready, stage all new/modified files for the PR
```
Remove the git-write next step.
This tells the next agent to stage changes even though the same document says on Line 120 that git write operations are prohibited. That contradiction is easy to follow in the wrong direction.
🔧 Minimal fix
```diff
-- **Staging/committing**: When ready, stage all new/modified files for the PR
+- **Hand-off**: When ready, hand the updated patch back for a human to stage/commit
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### Remaining Work
- **Compilation verification**: Small — just running checks and fixing any issues
- **prompts.md update**: Small — recording session prompts
- **Hand-off**: When ready, hand the updated patch back for a human to stage/commit
```
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 93-93: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below
(MD022, blanks-around-headings)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md`
around lines 93 - 97, Remove the contradictory "Staging/committing" /
"git-write" next step from the "Remaining Work" section so the document no
longer instructs agents to perform git write operations; edit the "Remaining
Work" bullet list (the line containing "Staging/committing: When ready, stage
all new/modified files for the PR") to delete that bullet or replace it with a
non-write action, ensuring consistency with the prohibition mentioned elsewhere
in the doc (the statement on Line 120 forbidding git write operations).
```rust
#[test]
fn extract_ports_succeeds_for_network_target() {
    let target = FaultTarget::Network {
        source_port: 8080,
        dest_port: 9090,
    };
    let (s, d) = extract_ports(&target).unwrap();
    assert_eq!(s, 8080);
    assert_eq!(d, 9090);
}

#[test]
fn extract_ports_fails_for_process_target() {
    let target = FaultTarget::Process {
        pid: Pid::from_raw(1),
    };
    assert!(extract_ports(&target).is_err());
}
```
🧹 Nitpick | 🔵 Trivial
Consider using rstest for parameterized tests.
The tests could be consolidated using rstest for better coverage and consistency with other test files in this crate (e.g., crash.rs). As per coding guidelines, "Use rstest for parameterized tests."
♻️ Parameterized test example
```diff
+use rstest::rstest;
+
-#[test]
-fn extract_ports_succeeds_for_network_target() {
-    let target = FaultTarget::Network {
-        source_port: 8080,
-        dest_port: 9090,
-    };
+#[rstest]
+#[case::typical(8080, 9090)]
+#[case::same_port(443, 443)]
+#[case::max_ports(65535, 65535)]
+fn extract_ports_succeeds_for_network_target(#[case] source: u16, #[case] dest: u16) {
+    let target = FaultTarget::Network {
+        source_port: source,
+        dest_port: dest,
+    };
     let (s, d) = extract_ports(&target).unwrap();
-    assert_eq!(s, 8080);
-    assert_eq!(d, 9090);
+    assert_eq!(s, source);
+    assert_eq!(d, dest);
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@multiversion-tests/src/fault/network.rs` around lines 147 - 164, Consolidate
the two tests into a single parameterized rstest that exercises extract_ports
for both variants of FaultTarget: supply a case with FaultTarget::Network {
source_port: 8080, dest_port: 9090 } expecting Ok((8080, 9090)) and a case with
FaultTarget::Process { pid: Pid::from_raw(1) } expecting an Err; replace the
separate functions extract_ports_succeeds_for_network_target and
extract_ports_fails_for_process_target with a single #[rstest] function that
enumerates these inputs and asserts the expected result from
extract_ports(&target).
Force-pushed e14b184 to b0b65ef
Actionable comments posted: 5
♻️ Duplicate comments (8)
multiversion-tests/src/ports.rs (1)
14-28: ⚠️ Potential issue | 🟠 Major

Port reservation race condition remains unaddressed.

Dropping the listeners at line 26 releases the ports before the node processes bind to them. Between `allocate()` returning and node startup, another process can claim these ports, causing intermittent test failures.

Consider one of:
- Return the `TcpListener` handles and keep them open until child processes are ready
- Have child processes bind to `:0` and report their assigned ports back
- Use `SO_REUSEADDR`/`SO_REUSEPORT` with immediate child binding

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/ports.rs` around lines 14 - 28, The allocate() method currently binds three TcpListener sockets then drops them (drop((l1,l2,l3))), releasing ports and causing a race; modify the design so ports remain reserved until child processes bind: either change NodePorts and allocate() to return or store the TcpListener handles (e.g., add fields like api_listener/gossip_listener/reth_network_listener to the NodePorts struct and return the listeners alongside port numbers) and only drop them after the child processes signal readiness, or implement child-side binding to port 0 with a startup handshake to report assigned ports back; update any callers of NodePorts::allocate and the node startup logic to consume the listeners or perform the handshake so the ports are not released prematurely.

multiversion-tests/src/fault/network.rs (1)
148-171: 🧹 Nitpick | 🔵 Trivial

Consider consolidating tests with `rstest` for consistency.

The tests work correctly but could be consolidated using `rstest` for better coverage and consistency with other test files in this crate (e.g., `crash.rs`).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/fault/network.rs` around lines 148 - 171, Consolidate the two unit tests into a single parameterized rstest to match other files: create a #[rstest] that supplies two cases for FaultTarget (one Network { source_port: 8080, dest_port: 9090 } expecting Ok((8080,9090)) and one Process { pid: Pid::from_raw(1) } expecting Err), and replace the separate test functions with that single parameterized test which calls extract_ports(&target) and asserts the expected result; reference the existing symbols extract_ports and FaultTarget when updating the tests.

multiversion-tests/src/probe.rs (1)
67-109: ⚠️ Potential issue | 🟠 Major

Turn HTTP error statuses into probe failures.

`send().await?` succeeds on 4xx/5xx. A failing `/v1/info`, `/v1/network/config`, or `/v1/genesis` endpoint can still look healthy here if its error payload deserializes. Call `error_for_status()?` before `json()` in each getter.

Suggested fix pattern

```diff
 pub async fn get_info(&self, api_url: &str) -> Result<NodeInfo, ProbeError> {
     let url = crate::api_endpoint(api_url, "v1/info");
-    Ok(self
-        .client
-        .get(url)
-        .send()
-        .await?
-        .json::<NodeInfo>()
-        .await?)
+    let resp = self.client.get(url).send().await?.error_for_status()?;
+    Ok(resp.json::<NodeInfo>().await?)
 }
```

Apply the same pattern to `get_peers`, `get_network_config`, and `get_genesis_hash`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/probe.rs` around lines 67 - 109, Each of the HTTP getters (get_info, get_peers, get_network_config, get_genesis_hash) must treat non-2xx responses as probe failures by calling error_for_status()? on the Response before deserializing; update each method to call .send().await?.error_for_status()? and then .json().await? (for get_genesis_hash keep the subsequent serde_json::Value handling and the existing genesis_block_hash check).

multiversion-tests/src/cluster.rs (2)
172-183: ⚠️ Potential issue | 🟠 Major

These ports are not actually reserved.

`allocate_ports()` keeps only the numbers returned by `NodePorts::allocate()`, and `multiversion-tests/src/ports.rs:14-27` drops the temporary listeners before returning. Another process can claim one of those ports before `spawn_node()` binds it, which makes cluster startup flaky under parallel CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/cluster.rs` around lines 172 - 183, allocate_ports currently records ephemeral port numbers from NodePorts::allocate() while ports.rs drops the temporary listeners, allowing races; fix by making allocation hold the listeners until nodes are bound: modify NodePorts::allocate (in ports.rs) to return reserved TcpListener(s) or a NodePorts struct that stores TcpListener(s) instead of just u16s, update allocate_ports to insert and keep those NodePorts-with-listeners in the HashMap, and update spawn_node to accept/use the preserved listeners (or accept owned NodePorts) so the sockets remain reserved until the process binds them.

95-105: ⚠️ Potential issue | 🟠 Major

Don't let crashed nodes disappear from convergence checks.

`wait_for_convergence()` does a liveness pass first, but `api_urls()` re-checks `is_running()` and drops any node that dies in the gap. That lets the remaining nodes "converge" without the crashed member.

Suggested fix

```diff
-    pub fn api_urls(&mut self) -> Vec<String> {
-        self.nodes
-            .values_mut()
-            .filter_map(|n| {
-                if n.is_running() {
-                    Some(n.api_url())
-                } else {
-                    None
-                }
-            })
-            .collect()
+    pub fn api_urls(&self) -> Vec<String> {
+        self.nodes.values().map(|n| n.api_url()).collect()
     }
```

Also applies to: 116-129
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/cluster.rs` around lines 95 - 105, The api_urls() helper is dropping nodes that died between the liveness pass and the convergence check by calling is_running(); remove that check so api_urls() returns each node's api_url() regardless of is_running() (e.g., replace the filter_map/is_running() logic with an unconditional map over self.nodes.values()/values_mut() to collect api_url()), so wait_for_convergence() can perform its own liveness logic; apply the same change to the other similar helper mentioned in the comment.

multiversion-tests/tests/common/mod.rs (1)
44-56: ⚠️ Potential issue | 🟠 Major

Fail fast when the peer identity pool is exhausted.

`peer_spec()` silently reuses credentials here: `index == 0` collides with genesis, and the fourth peer wraps back to the first fixture. That makes larger scenarios look valid while actually booting duplicate identities.

Suggested fix

```diff
 pub fn peer_spec(
     name: &str,
     binary: &ResolvedBinary,
     index: usize,
     peers: Vec<String>,
 ) -> NodeSpec {
+    assert!(
+        index > 0 && index < MINING_KEYS.len() && index < REWARD_ADDRESSES.len(),
+        "missing test identity for peer index {index}"
+    );
     NodeSpec {
         name: name.to_owned(),
         binary: binary.clone(), // clone: building owned NodeSpec from shared reference
         role: NodeRole::Peer,
         peers,
-        mining_key: MINING_KEYS[index % MINING_KEYS.len()].to_owned(),
-        reward_address: REWARD_ADDRESSES[index % REWARD_ADDRESSES.len()].to_owned(),
+        mining_key: MINING_KEYS[index].to_owned(),
+        reward_address: REWARD_ADDRESSES[index].to_owned(),
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/tests/common/mod.rs` around lines 44 - 56, peer_spec currently masks credential exhaustion by using modulo on MINING_KEYS and REWARD_ADDRESSES; replace the modulo indexing with direct indexing and fail fast: in peer_spec (when constructing NodeSpec), assert or return/panic if index is out of bounds for MINING_KEYS or REWARD_ADDRESSES (e.g., index >= MINING_KEYS.len() || index >= REWARD_ADDRESSES.len()) and include a clear error message referencing the peer name and index so duplicate identities (including collision with genesis at index 0) are not silently produced.

multiversion-tests/src/process.rs (2)
145-155: ⚠️ Potential issue | 🟠 Major

Wait for the old child before respawning.

`respawn()` sends `start_kill()` and immediately launches the replacement against the same ports and data dir. Direct callers can race the old process and hit `EADDRINUSE` or overlapping on-disk state. Either make this path async and wait on the old `Child`, or enforce that callers fully stop the process before `respawn()`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/process.rs` around lines 145 - 155, respawn() currently calls start_kill() on the old Child and immediately spawns a new child, leading to races on ports/dirs; change respawn (or its control flow) to wait for the old process to exit before calling spawn_child_with_logs: after successfully calling old.start_kill(), call old.wait() (or wait_with_output()) on the taken Child and handle/propagate any wait errors (log non-fatal warnings), then proceed to spawn_child_with_logs and set self.child/self.log_rx; alternatively, if start_kill() is async-ish, make respawn async and await the termination future—refer to respawn, start_kill, child, and spawn_child_with_logs when making the change.
42-42: ⚠️ Potential issue | 🟠 Major

Cap the in-memory log backlog.

The log channel is unbounded, and the only way to release memory is `drain_logs()`. In this harness that usually happens only on failure paths, so a noisy node can retain its entire stdout/stderr history in RAM.

Also applies to: 158-163, 184-221
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/process.rs` at line 42, The log channel is currently unbounded (log_rx: mpsc::UnboundedReceiver<String>), which allows unbounded memory growth; change the channel to a bounded one and enforce a reasonable backlog size, replace UnboundedReceiver/UnboundedSender with a bounded mpsc::Receiver/mpsc::Sender (e.g., mpsc::channel(backlog_size)) when creating the channel, update the field log_rx and any senders/creators accordingly, and ensure drain_logs() and any places handling logs (the log reader/collector code referenced around the other ranges) handle a full/closed channel correctly (dropping or truncating old messages instead of retaining forever). Ensure backlog_size is configurable or set to a sane constant and update tests/constructors that create the channel to use the new bounded constructor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@multiversion-tests/src/binary.rs`:
- Around line 137-148: The call to file.lock() in acquire_build_lock uses an
extension method provided by the fs4 crate, so add fs4 = "0.11" (or a compatible
version) to Cargo.toml and import the extension trait in binary.rs with use
fs4::FileExt; so that the File::lock() call compiles; update
multiversion-tests/Cargo.toml to include the dependency and add the use
statement near other imports in binary.rs.
In `@multiversion-tests/src/config.rs`:
- Around line 147-148: The code currently serializes config with
toml::to_string_pretty(&config) and then calls std::fs::write(config_path,
output) which can leave a truncated config if interrupted; instead write the
serialized output to a temporary file in the same directory (e.g., create a file
named config_path.with_extension(".tmp") or use tempfile in the same dir), flush
and sync the temp file, then atomically replace the original by renaming the
temp into config_path (std::fs::rename), ensuring any intermediate fsyncs as
needed; update the code around the variables output and config_path to perform
these steps (create temp, write output, sync, rename) rather than calling
std::fs::write directly.
- Around line 427-448: Replace the manual loop in the test
camel_to_snake_known_consensus_fields with an rstest parameterized test: add use
rstest::rstest, annotate the test with #[rstest], change the signature to fn
camel_to_snake_known_consensus_fields(#[case] camel: &str, #[case] snake: &str)
and provide one #[case] per pair (e.g. #[case("chainId", "chain_id")], ...)
instead of the cases array and for-loop, then keep the assertion
assert_eq!(camel_to_snake(camel), snake, "failed for {camel}"); to run each case
separately.
- Around line 156-161: The JSON number branch in the conversion currently uses
n.as_i64() then n.as_f64(), which can lose precision for large unsigned values;
update the serde_json::Value::Number handling so it first checks n.as_i64(),
then n.as_u64() and if the u64 value is > i64::MAX convert it to
Value::String(n.to_string()) (preserving precision), otherwise convert to
Value::Integer(u64 as i64); only then fall back to n.as_f64().map(Value::Float).
Also add a unit test that converts a JSON number equal to i64::MAX + 1 (as a raw
number) through the conversion and back to ensure it round-trips without
precision loss. Ensure you modify the function handling
serde_json::Value::Number and add the test for the large u64 case.
In `@xtask/src/main.rs`:
- Around line 530-539: The test invocation builds test_args and currently pushes
the literal flags "--test" and "*" which causes Cargo to error because "*" is
not a valid test name; update the construction used by test_args (the vector
that is passed to cmd!(sh, "cargo {test_args...}") ) to use the correct flag
"--tests" instead of the pair "--test" and "*" so Cargo runs all integration
tests (adjust any surrounding separators like "--" if needed to preserve
ignored/tests-thread flags).
---
Duplicate comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 172-183: allocate_ports currently records ephemeral port numbers
from NodePorts::allocate() while ports.rs drops the temporary listeners,
allowing races; fix by making allocation hold the listeners until nodes are
bound: modify NodePorts::allocate (in ports.rs) to return reserved
TcpListener(s) or a NodePorts struct that stores TcpListener(s) instead of just
u16s, update allocate_ports to insert and keep those NodePorts-with-listeners in
the HashMap, and update spawn_node to accept/use the preserved listeners (or
accept owned NodePorts) so the sockets remain reserved until the process binds
them.
- Around line 95-105: The api_urls() helper is dropping nodes that died between
the liveness pass and the convergence check by calling is_running(); remove that
check so api_urls() returns each node's api_url() regardless of is_running()
(e.g., replace the filter_map/is_running() logic with an unconditional map over
self.nodes.values()/values_mut() to collect api_url()), so
wait_for_convergence() can perform its own liveness logic; apply the same change
to the other similar helper mentioned in the comment.
In `@multiversion-tests/src/fault/network.rs`:
- Around line 148-171: Consolidate the two unit tests into a single
parameterized rstest to match other files: create a #[rstest] that supplies two
cases for FaultTarget (one Network { source_port: 8080, dest_port: 9090 }
expecting Ok((8080,9090)) and one Process { pid: Pid::from_raw(1) } expecting
Err), and replace the separate test functions with that single parameterized
test which calls extract_ports(&target) and asserts the expected result;
reference the existing symbols extract_ports and FaultTarget when updating the
tests.
In `@multiversion-tests/src/ports.rs`:
- Around line 14-28: The allocate() method currently binds three TcpListener
sockets then drops them (drop((l1,l2,l3))), releasing ports and causing a race;
modify the design so ports remain reserved until child processes bind: either
change NodePorts and allocate() to return or store the TcpListener handles
(e.g., add fields like api_listener/gossip_listener/reth_network_listener to the
NodePorts struct and return the listeners alongside port numbers) and only drop
them after the child processes signal readiness, or implement child-side binding
to port 0 with a startup handshake to report assigned ports back; update any
callers of NodePorts::allocate and the node startup logic to consume the
listeners or perform the handshake so the ports are not released prematurely.
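The "hold the listeners" fix described above can be sketched with std networking; the field names and the two-port layout are assumptions based on the comment (the real struct reserves three ports):

```rust
use std::net::TcpListener;

/// Sketch of a NodePorts that keeps its listeners alive so the OS cannot hand
/// the ports to another process before the node binds them. The ports stay
/// reserved for as long as this struct is alive.
struct NodePorts {
    api: TcpListener,
    gossip: TcpListener,
}

impl NodePorts {
    fn allocate() -> std::io::Result<Self> {
        Ok(Self {
            api: TcpListener::bind("127.0.0.1:0")?,
            gossip: TcpListener::bind("127.0.0.1:0")?,
        })
    }

    fn api_port(&self) -> u16 {
        self.api.local_addr().map(|a| a.port()).unwrap_or(0)
    }
}

fn main() -> std::io::Result<()> {
    let ports = NodePorts::allocate()?;
    // While `ports` is alive, a second bind to the same address fails,
    // which is exactly the reservation the review asks for.
    let addr = format!("127.0.0.1:{}", ports.api_port());
    assert!(TcpListener::bind(&addr).is_err());
    println!("reserved {addr}");
    Ok(())
}
```

The remaining gap (also noted in the review) is the handoff: the child cannot bind a port the parent still holds, so either the listener fd is passed to the child or the listener is dropped just before spawn, shrinking but not eliminating the race.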
In `@multiversion-tests/src/probe.rs`:
- Around line 67-109: Each of the HTTP getters (get_info, get_peers,
get_network_config, get_genesis_hash) must treat non-2xx responses as probe
failures by calling error_for_status()? on the Response before deserializing;
update each method to call .send().await?.error_for_status()? and then
.json().await? (for get_genesis_hash keep the subsequent serde_json::Value
handling and the existing genesis_block_hash check).
In `@multiversion-tests/src/process.rs`:
- Around line 145-155: respawn() currently calls start_kill() on the old Child
and immediately spawns a new child, leading to races on ports/dirs; change
respawn (or its control flow) to wait for the old process to exit before calling
spawn_child_with_logs: after successfully calling old.start_kill(), call
old.wait() (or wait_with_output()) on the taken Child and handle/propagate any
wait errors (log non-fatal warnings), then proceed to spawn_child_with_logs and
set self.child/self.log_rx; alternatively, if start_kill() is async-ish, make
respawn async and await the termination future—refer to respawn, start_kill,
child, and spawn_child_with_logs when making the change.
- Line 42: The log channel is currently unbounded (log_rx:
mpsc::UnboundedReceiver<String>), which allows unbounded memory growth; change
the channel to a bounded one and enforce a reasonable backlog size, replace
UnboundedReceiver/UnboundedSender with a bounded mpsc::Receiver/mpsc::Sender
(e.g., mpsc::channel(backlog_size)) when creating the channel, update the field
log_rx and any senders/creators accordingly, and ensure drain_logs() and any
places handling logs (the log reader/collector code referenced around the other
ranges) handle a full/closed channel correctly (dropping or truncating old
messages instead of retaining forever). Ensure backlog_size is configurable or
set to a sane constant and update tests/constructors that create the channel to
use the new bounded constructor.
In `@multiversion-tests/tests/common/mod.rs`:
- Around line 44-56: peer_spec currently masks credential exhaustion by using
modulo on MINING_KEYS and REWARD_ADDRESSES; replace the modulo indexing with
direct indexing and fail fast: in peer_spec (when constructing NodeSpec), assert
or return/panic if index is out of bounds for MINING_KEYS or REWARD_ADDRESSES
(e.g., index >= MINING_KEYS.len() || index >= REWARD_ADDRESSES.len()) and
include a clear error message referencing the peer name and index so duplicate
identities (including collision with genesis at index 0) are not silently
produced.
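The fail-fast indexing asked for above can be sketched as follows; the fixture tables and the helper name are hypothetical stand-ins for the real arrays in `tests/common/mod.rs`:

```rust
/// Hypothetical stand-ins for the credential fixtures in tests/common/mod.rs.
const MINING_KEYS: [&str; 2] = ["key-genesis", "key-peer1"];
const REWARD_ADDRESSES: [&str; 2] = ["addr-genesis", "addr-peer1"];

/// Index directly and fail fast instead of wrapping with `%`, so exhausting
/// the fixture credentials panics with a clear message rather than silently
/// reusing the genesis identity.
fn peer_credentials(name: &str, index: usize) -> (&'static str, &'static str) {
    assert!(
        index < MINING_KEYS.len() && index < REWARD_ADDRESSES.len(),
        "peer {name}: index {index} exhausts fixture credentials ({} keys available)",
        MINING_KEYS.len()
    );
    (MINING_KEYS[index], REWARD_ADDRESSES[index])
}

fn main() {
    let (key, addr) = peer_credentials("peer1", 1);
    println!("{key} {addr}");
}
```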
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0d135a76-65d8-4135-a532-5689b3c9b957
⛔ Files ignored due to path filters (1)
- multiversion-tests/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
- .github/workflows/multiversion.yml
- .gitignore
- multiversion-tests/Cargo.toml
- multiversion-tests/fixtures/base-config.toml
- multiversion-tests/src/binary.rs
- multiversion-tests/src/cluster.rs
- multiversion-tests/src/config.rs
- multiversion-tests/src/fault/crash.rs
- multiversion-tests/src/fault/network.rs
- multiversion-tests/src/lib.rs
- multiversion-tests/src/ports.rs
- multiversion-tests/src/probe.rs
- multiversion-tests/src/process.rs
- multiversion-tests/tests/common/mod.rs
- multiversion-tests/tests/e2e.rs
- multiversion-tests/tests/upgrade.rs
- xtask/src/main.rs
```rust
serde_json::Value::Number(n) => {
    if let Some(i) = n.as_i64() {
        Some(Value::Integer(i))
    } else {
        n.as_f64().map(Value::Float)
    }
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Current conversion logic in this file
rg -n --type=rust -C3 'fn json_to_toml|as_i64\(|as_u64\(|as_f64\(' multiversion-tests/src/config.rs
# Search consensus-related numeric fields across the repo
rg -n --type=rust -C2 '\bstruct\s+ConsensusConfig\b|:\s*u64\b|:\s*u128\b|:\s*usize\b'

Repository: Irys-xyz/irys
Length of output: 50372
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where json_to_toml is called
rg -n 'json_to_toml' multiversion-tests/src/ --type rust
# Check the full function context and tests
head -n 250 multiversion-tests/src/config.rs | tail -n +140

Repository: Irys-xyz/irys
Length of output: 4768
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look for consensus config patching or usage
rg -n 'ConsensusConfig|patch|json_to_toml' multiversion-tests/src/config.rs --type rust -B 2 -A 5
# Check what TOML Value types are available
rg -n 'enum Value|struct Value|toml::Value' multiversion-tests/src/config.rs --type rust -B 1 -A 3

Repository: Irys-xyz/irys
Length of output: 3322
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look at what consensus_json contains - check the API response type
rg -n 'network/config|consensus' multiversion-tests/src/ --type rust -B 2 -A 2 | head -50
# Check if there are any test fixtures that show what values come through
fd -e toml -e json multiversion-tests/fixtures/ -type f

Repository: Irys-xyz/irys
Length of output: 3283
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the actual consensus config types and their numeric fields
rg -n 'struct ConsensusConfig|ConsensusType|pub.*u64|pub.*u128' crates/types/src/ --type rust -A 1 | grep -E '(struct|pub|u64|u128|i64)' | head -100

Repository: Irys-xyz/irys
Length of output: 110
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find ConsensusConfig struct definition
rg -n 'struct ConsensusConfig' crates/ --type rust -A 50 | head -100

Repository: Irys-xyz/irys
Length of output: 3955
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look at how the consensus JSON is used - check what v1/network/config returns
rg -n 'network/config\|v1/network' crates/ --type rust -B 3 -A 3

Repository: Irys-xyz/irys
Length of output: 39
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if any consensus fields are u64 or u128 that might exceed i64::MAX
fd -type f -name '*.rs' crates/types/src/ | xargs grep -l 'ConsensusConfig' | head -3 | while read f; do
echo "=== $f ==="
head -n 300 "$f" | tail -n 200
done

Repository: Irys-xyz/irys
Length of output: 229
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the canonical serialization to understand how u64 fields are serialized
rg -n 'serialize_u64|serialize_u128|u64.*serialize|canonical' crates/types/src/canonical.rs -A 3 -B 1 | head -60

Repository: Irys-xyz/irys
Length of output: 2227
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look at the actual genesis/consensus structure to see what fields might be large u64s
rg -n 'struct.*Config|pub.*u64|pub.*u128' crates/types/src/config/consensus.rs | head -50

Repository: Irys-xyz/irys
Length of output: 1481
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check how JSON numbers with large values are handled by serde_json
cat > /tmp/test_json_num.rs << 'EOF'
fn main() {
use serde_json::json;
// Test with a large u64
let large_u64 = u64::MAX;
let json_num = serde_json::Number::from_u64(large_u64).unwrap();
println!("Large u64: {}", large_u64);
println!("as_i64(): {:?}", json_num.as_i64());
println!("as_u64(): {:?}", json_num.as_u64());
println!("as_f64(): {:?}", json_num.as_f64());
// Test with i64::MAX
let i64_max = i64::MAX as u64;
let json_num2 = serde_json::Number::from_u64(i64_max).unwrap();
println!("\ni64::MAX as u64: {}", i64_max);
println!("as_i64(): {:?}", json_num2.as_i64());
println!("as_u64(): {:?}", json_num2.as_u64());
}
EOF
cd /tmp && rustc test_json_num.rs -o test_json_num 2>/dev/null && ./test_json_num || echo "Rust not available in sandbox"

Repository: Irys-xyz/irys
Length of output: 86
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the test that validates integer recovery from strings
rg -n 'json_to_toml' multiversion-tests/src/config.rs -B 5 -A 15 | grep -A 20 'test'

Repository: Irys-xyz/irys
Length of output: 723
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if consensus_json is guaranteed to be canonical (all u64/u128 as strings)
# or if it could be raw JSON numbers
rg -n 'get_network_config\|/v1/network/config' multiversion-tests/ crates/ --type rust -B 5 -A 5

Repository: Irys-xyz/irys
Length of output: 39
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if there's validation or parsing of the consensus JSON
rg -n 'patch_peer_consensus\|consensus_json' multiversion-tests/ --type rust -B 3 -A 10 | head -80

Repository: Irys-xyz/irys
Length of output: 39
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check what type the API returns - is it canonical JSON?
rg -n 'network/config|serde_json|Value' crates/types/src/ --type rust | grep -i 'api\|response\|network' | head -20

Repository: Irys-xyz/irys
Length of output: 39
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look at the probe.rs file where get_network_config is defined
cat multiversion-tests/src/probe.rs | grep -A 30 'get_network_config'

Repository: Irys-xyz/irys
Length of output: 1188
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look at where patch_peer_consensus is called
cat multiversion-tests/src/config.rs | grep -B 10 -A 10 'patch_peer_consensus' | head -50

Repository: Irys-xyz/irys
Length of output: 894
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check what the actual test does with consensus JSON
cat multiversion-tests/src/config.rs | sed -n '400,500p'

Repository: Irys-xyz/irys
Length of output: 2553
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check what type is returned by /v1/network/config endpoint
# Look for the handler/endpoint definition
rg -n 'v1/network/config|network.*config|GET.*config' crates/ --type rust -B 3 -A 5 | head -80

Repository: Irys-xyz/irys
Length of output: 5053
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if ConsensusConfig is serialized with canonical format when returned from API
rg -n 'serde.*canonical|to_canonical|Canonical' crates/ --type rust | grep -i 'consensus\|api\|response' | head -20

Repository: Irys-xyz/irys
Length of output: 1457
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the actual structure of what serde_json::json!() produces
# and verify if canonical JSON format is used for large numbers
cat > /tmp/check_precision.py << 'EOF'
# When serde_json serializes u64, does it produce JSON numbers or strings?
# The canonical module says it serializes u64 as strings for JS safety
# But that's during TO_canonical conversion, not when the API endpoint returns it
# Let's check: are large u64 values actually in consensus config?
# Looking at the fields we found:
# - chain_id: u64 (could be large)
# - chunk_size: u64 = 256 * 1024 = 262,144 (small, fits in i64)
# - block_tree_depth: u64
# - num_chunks_in_partition: u64 = 75,534,400 (small, fits in i64)
# - timestamp_millis: u128 (definitely needs string handling)
# - max_future_timestamp_drift_millis: u128
# The consensus fields are mostly moderate values, but u128 is used
# and u128 definitely won't fit in i64
# Question: does the API use canonical serialization or standard JSON?
print("Need to check: does /v1/network/config endpoint use canonical serialization?")
print("If yes: all u64/u128 are strings, current code works")
print("If no: large values could lose precision via f64 conversion")
EOF
python3 /tmp/check_precision.py

Repository: Irys-xyz/irys
Length of output: 246
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where the consensus config is actually serialized for API response
rg -n 'json!|serde_json::to_value|Serialize' crates/types/src/config/consensus.rs -B 2 -A 2

Repository: Irys-xyz/irys
Length of output: 3124
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check test fixtures to see if they use raw JSON numbers or canonical format
find multiversion-tests/fixtures -type f -name '*.json' -o -name '*.toml' | head -20
# Look for any integration tests that call patch_peer_consensus
rg -n 'patch_peer_consensus|json_to_toml' multiversion-tests/tests/ --type rust -B 3 -A 3

Repository: Irys-xyz/irys
Length of output: 101
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if there are any tests with actual large u64 values
rg -n 'u64::MAX|i64::MAX|precision|float' multiversion-tests/src/config.rs -B 2 -A 2

Repository: Irys-xyz/irys
Length of output: 293
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify: does the test only check i64 range?
sed -n '456,470p' multiversion-tests/src/config.rs

Repository: Irys-xyz/irys
Length of output: 662
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if there's a test for large numbers that might exceed i64::MAX
rg -n 'u64|large|MAX' multiversion-tests/src/config.rs | grep test

Repository: Irys-xyz/irys
Length of output: 39
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify the complete json_to_toml function to see all branches
sed -n '152,184p' multiversion-tests/src/config.rs

Repository: Irys-xyz/irys
Length of output: 1301
Improve handling of large integers in JSON-to-TOML conversion.
The fallback to as_f64() at line 160 can lose precision for large integers. While the consensus config API endpoint returns canonical JSON (where u64/u128 are serialized as strings and handled correctly by the String branch), this function should defensively handle raw JSON numbers.
The suggested fix improves this by checking as_u64() before falling back to float, converting values exceeding i64::MAX to strings. However, the fix is incomplete: it does not address u128 types, which would still convert to Float and lose precision. Consider also adding a test case for large u64 values (e.g., i64::MAX + 1) to ensure they round-trip without precision loss.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@multiversion-tests/src/config.rs` around lines 156 - 161, The JSON number
branch in the conversion currently uses n.as_i64() then n.as_f64(), which can
lose precision for large unsigned values; update the serde_json::Value::Number
handling so it first checks n.as_i64(), then n.as_u64() and if the u64 value is
> i64::MAX convert it to Value::String(n.to_string()) (preserving precision),
otherwise convert to Value::Integer(u64 as i64); only then fall back to
n.as_f64().map(Value::Float). Also add a unit test that converts a JSON number
equal to i64::MAX + 1 (as a raw number) through the conversion and back to
ensure it round-trips without precision loss. Ensure you modify the function
handling serde_json::Value::Number and add the test for the large u64 case.
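The branch order the prompt describes can be isolated in a small std-only sketch; the `Value` enum here is a minimal stand-in for the toml variants, and the function below emulates what the serde_json `as_i64`/`as_u64` accessors return for an unsigned number:

```rust
/// Minimal stand-ins for the toml::Value variants involved.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Value {
    Integer(i64),
    String(String),
    Float(f64),
}

/// Sketch of the suggested order for an unsigned JSON number: values that fit
/// i64 stay integers (as_i64() would succeed); anything larger is stringified
/// so no digits are lost to an f64 fallback.
fn unsigned_json_number_to_toml(u: u64) -> Value {
    match i64::try_from(u) {
        Ok(i) => Value::Integer(i),
        Err(_) => Value::String(u.to_string()), // > i64::MAX: preserve precision
    }
}

fn main() {
    assert_eq!(unsigned_json_number_to_toml(42), Value::Integer(42));
    // i64::MAX + 1 is exactly the round-trip case the review asks to test.
    let big = i64::MAX as u64 + 1;
    assert_eq!(
        unsigned_json_number_to_toml(big),
        Value::String("9223372036854775808".to_string())
    );
    println!("large u64 survives as {:?}", unsigned_json_number_to_toml(big));
}
```

As the review notes, u128 fields still need separate handling, since they fail both `as_i64()` and `as_u64()` and would fall through to the float branch.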
Actionable comments posted: 3
♻️ Duplicate comments (4)
multiversion-tests/src/config.rs (2)
Lines 147-148: 🧹 Nitpick | 🔵 Trivial — Consider atomic writes to prevent partial-file corruption.
Direct `std::fs::write` can leave a truncated config file if the process is interrupted. For test infrastructure this is low-risk, but atomic write-then-rename is more robust.

🛡️ Atomic write pattern

```diff
 let output = toml::to_string_pretty(&config)?;
-std::fs::write(config_path, output)?;
+let tmp_path = config_path.with_extension("toml.tmp");
+std::fs::write(&tmp_path, &output)?;
+std::fs::rename(&tmp_path, config_path)?;
 Ok(())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/config.rs` around lines 147 - 148, The current use of toml::to_string_pretty(&config)? followed by std::fs::write(config_path, output)? can leave a truncated file if interrupted; change to an atomic write: write the serialized output to a temporary file (e.g., use tempfile::NamedTempFile or a config_path.with_extension("tmp")), flush and sync the file, then atomically replace the original via std::fs::rename(temp_path, config_path) and optionally sync the parent directory; update the code around the toml::to_string_pretty, std::fs::write, and use std::fs::rename (or tempfile) to implement this pattern.
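The write-then-rename pattern described above can be sketched with std alone (the `.toml.tmp` extension and helper name are illustrative; the real code could equally use `tempfile::NamedTempFile`):

```rust
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

/// Write contents to a sibling temp file, sync it, then rename over the
/// target. rename() is atomic on the same filesystem, so readers see either
/// the old file or the complete new one, never a truncated mix.
fn atomic_write(path: &Path, contents: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("toml.tmp");
    let mut f = File::create(&tmp)?;
    f.write_all(contents.as_bytes())?;
    f.sync_all()?; // flush to disk before the rename makes it visible
    fs::rename(&tmp, path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("multiversion-demo-config.toml");
    atomic_write(&path, "chain_id = 1\n")?;
    print!("{}", fs::read_to_string(&path)?);
    Ok(())
}
```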
Lines 156-161: 🧹 Nitpick | 🔵 Trivial — Consider checking `as_u64()` before falling back to float.
The current Number handling goes `as_i64()` → `as_f64()`. While the API uses canonical JSON (large integers as strings), defensively checking `as_u64()` before the float fallback would prevent precision loss for raw JSON numbers exceeding `i64::MAX`.

🛡️ Defensive number handling

```diff
 serde_json::Value::Number(n) => {
     if let Some(i) = n.as_i64() {
         Some(Value::Integer(i))
+    } else if let Some(u) = n.as_u64() {
+        // Value exceeds i64::MAX; preserve as string to avoid precision loss
+        Some(Value::String(u.to_string()))
     } else {
         n.as_f64().map(Value::Float)
     }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/config.rs` around lines 156 - 161, In the serde_json::Value::Number(n) match arm, add a check for n.as_u64() before falling back to n.as_f64(): first try n.as_i64().map(Value::Integer), then if that is None try n.as_u64() and map it to the appropriate unsigned/integer variant (or extend Value to include an unsigned variant) to preserve large positive integers, and only then call n.as_f64().map(Value::Float); update the match arm in multiversion-tests/src/config.rs (symbols: serde_json::Value::Number, n.as_u64(), Value::Integer, Value::Float) accordingly.
multiversion-tests/src/cluster.rs (1)
Lines 96-107: ⚠️ Potential issue | 🟡 Minor — Convergence check may miss nodes that crash mid-wait.
`api_urls()` filters out non-running nodes. While `wait_for_convergence` checks all nodes are running upfront (lines 118-126), a node that crashes during the subsequent probe loop will be silently excluded from convergence calculations, potentially reporting success with fewer nodes than expected. Consider either:
- Returning all URLs regardless of running state and letting the probe detect unreachable nodes
- Re-checking running state after `wait_for_convergence` returns

🛡️ Option 1: Return all URLs unconditionally

```diff
 pub fn api_urls(&self) -> Vec<String> {
     self.nodes
-        .values_mut()
-        .filter_map(|n| {
-            if n.is_running() {
-                Some(n.api_url())
-            } else {
-                None
-            }
-        })
+        .values()
+        .map(|n| n.api_url())
         .collect()
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/cluster.rs` around lines 96 - 107, The current api_urls() filters out nodes where Node::is_running() is false which lets a node that crashes during the probe loop be excluded from convergence checks; update api_urls() to return all node URLs unconditionally (iterate over self.nodes.values() and collect n.api_url() for every node) so the probe loop and wait_for_convergence() will detect unreachable/crashed nodes instead of silently skipping them.
multiversion-tests/src/process.rs (1)
Lines 145-156: ⚠️ Potential issue | 🟡 Minor — `respawn()` may leave old process as zombie.
The sync function calls `start_kill()` but cannot await the old process's termination before spawning the new one. The old child handle is dropped without waiting, potentially leaving a zombie until the test process exits. If this is acceptable for the test harness use case, consider adding a brief comment documenting this limitation. Alternatively, convert to async:

♻️ Async alternative with proper cleanup

```diff
-    pub fn respawn(&mut self) -> Result<(), ProcessError> {
+    pub async fn respawn(&mut self) -> Result<(), ProcessError> {
         if let Some(mut old) = self.child.take()
-            && let Err(e) = old.start_kill() {
-                tracing::warn!(error = %e, "failed to kill old process during respawn");
+        {
+            if let Err(e) = old.start_kill() {
+                tracing::warn!(error = %e, "failed to kill old process during respawn");
+            }
+            // Reap the old process
+            let _ = old.wait().await;
         }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/process.rs` around lines 145 - 156, The respawn() implementation drops the old child after calling start_kill(), which can leave a zombie because it never awaits the old process's termination; either (A) convert respawn to async (change signature of respawn to async fn respawn(...) -> Result<..., ProcessError>) and after calling old.start_kill() await the child's termination (e.g., call old.wait().await / old.wait_with_output().await or the appropriate async wait method on the child handle) before calling spawn_child_with_logs, or (B) if synchronous behavior is intentional, add a clear comment above respawn() documenting that start_kill() is not awaited and may leave zombies until process exit; update references to child and log_rx accordingly when making it async.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@multiversion-tests/src/fault/network.rs`:
- Around line 23-29: The current Drop impl for NetworkPartitioner only calls
sync_cleanup_chain() when ACTIVE_INJECTIONS is zero at drop-time, which can
permanently skip cleanup if the atomic goes to zero later (e.g., in the guard
decrement); fix by calling sync_cleanup_chain() at the point where
ACTIVE_INJECTIONS is decremented to zero rather than only in Drop: update the
decrement path (the code that does ACTIVE_INJECTIONS.fetch_sub(...) in the
guard) to check the returned value and call sync_cleanup_chain() when that
fetch_sub indicates the count reached zero, or extract that logic into a helper
(e.g., decrement_active_and_cleanup()) and call it from both the guard and
NetworkPartitioner::drop so cleanup runs exactly when the active count falls to
zero.
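The shared decrement-and-cleanup helper suggested above can be sketched with a std atomic; the static name mirrors the review, while the closure parameter stands in for `sync_cleanup_chain()`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static ACTIVE_INJECTIONS: AtomicUsize = AtomicUsize::new(0);

/// Cleanup runs exactly when the decrement observes the count falling to
/// zero, so both the guard path and NetworkPartitioner::drop can call this
/// instead of racing on a separate load-then-cleanup.
fn decrement_and_maybe_cleanup(cleanup: impl FnOnce()) {
    // fetch_sub returns the previous value; 1 means we just hit zero.
    if ACTIVE_INJECTIONS.fetch_sub(1, Ordering::SeqCst) == 1 {
        cleanup();
    }
}

fn main() {
    ACTIVE_INJECTIONS.store(2, Ordering::SeqCst);
    let mut ran = 0;
    decrement_and_maybe_cleanup(|| ran += 1); // 2 -> 1: no cleanup
    decrement_and_maybe_cleanup(|| ran += 1); // 1 -> 0: cleanup fires once
    println!("cleanup ran {ran} time(s)");
}
```

Because the check-and-act is a single atomic `fetch_sub`, only the caller that actually drops the count to zero runs the cleanup, regardless of interleaving.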
In `@multiversion-tests/src/process.rs`:
- Around line 134-142: When the graceful shutdown times out in the shutdown
path, the code sends signal::kill(pid, Signal::SIGKILL) but returns immediately
leaving self.child populated; update the timeout branch so that after sending
SIGKILL you wait for the child to actually exit (e.g., call
self.child.as_mut().and_then(|c| c.wait().ok()) or poll with try_wait in a short
loop with small sleep) and then clear self.child = None before returning
Err(ProcessError::ShutdownTimeout { timeout }); this ensures the process handle
is released and prevents start_kill() from being invoked on an already-dead
process.
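The kill-then-reap sequence described above can be sketched with std's blocking process API; the real code signals via nix and tokio, so `kill_and_reap` and the polling interval here are illustrative assumptions:

```rust
use std::process::{Child, Command};
use std::thread;
use std::time::Duration;

/// After a forced kill, poll try_wait() until the child is actually reaped,
/// so the handle can be cleared (self.child = None in the harness) without
/// leaving a zombie or later signaling an already-dead process.
fn kill_and_reap(child: &mut Child) -> std::io::Result<()> {
    child.kill()?; // SIGKILL on Unix
    loop {
        if child.try_wait()?.is_some() {
            return Ok(());
        }
        thread::sleep(Duration::from_millis(10));
    }
}

fn main() -> std::io::Result<()> {
    let mut child = Command::new("sleep").arg("30").spawn()?;
    kill_and_reap(&mut child)?;
    println!("child reaped");
    Ok(())
}
```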
In `@xtask/src/main.rs`:
- Around line 530-539: The passthrough args end up after the test-binary
separator (`"--"`) so they are delivered to the test binary instead of Cargo;
fix by adding `args` into `test_args` before you push the `"--"` separator.
Concretely, in the `test_args` construction for `test_args` (and the `cmd!(sh,
"cargo {test_args...}")` invocation) build the vector with the Cargo-level flags
first (e.g. "test", "--test", "*"), then `extend(args)`, and only then push the
`"--"` separator and the test-binary flags like `"--ignored"` and
`"--test-threads=1"` so passthrough `args` are treated as Cargo args.
---
Duplicate comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 96-107: The current api_urls() filters out nodes where
Node::is_running() is false which lets a node that crashes during the probe loop
be excluded from convergence checks; update api_urls() to return all node URLs
unconditionally (iterate over self.nodes.values() and collect n.api_url() for
every node) so the probe loop and wait_for_convergence() will detect
unreachable/crashed nodes instead of silently skipping them.
In `@multiversion-tests/src/config.rs`:
- Around line 147-148: The current use of toml::to_string_pretty(&config)?
followed by std::fs::write(config_path, output)? can leave a truncated file if
interrupted; change to an atomic write: write the serialized output to a
temporary file (e.g., use tempfile::NamedTempFile or a
config_path.with_extension("tmp")), flush and sync the file, then atomically
replace the original via std::fs::rename(temp_path, config_path) and optionally
sync the parent directory; update the code around the toml::to_string_pretty,
std::fs::write, and use std::fs::rename (or tempfile) to implement this pattern.
- Around line 156-161: In the serde_json::Value::Number(n) match arm, add a
check for n.as_u64() before falling back to n.as_f64(): first try
n.as_i64().map(Value::Integer), then if that is None try n.as_u64() and map it
to the appropriate unsigned/integer variant (or extend Value to include an
unsigned variant) to preserve large positive integers, and only then call
n.as_f64().map(Value::Float); update the match arm in
multiversion-tests/src/config.rs (symbols: serde_json::Value::Number,
n.as_u64(), Value::Integer, Value::Float) accordingly.
In `@multiversion-tests/src/process.rs`:
- Around line 145-156: The respawn() implementation drops the old child after
calling start_kill(), which can leave a zombie because it never awaits the old
process's termination; either (A) convert respawn to async (change signature of
respawn to async fn respawn(...) -> Result<..., ProcessError>) and after calling
old.start_kill() await the child's termination (e.g., call old.wait().await /
old.wait_with_output().await or the appropriate async wait method on the child
handle) before calling spawn_child_with_logs, or (B) if synchronous behavior is
intentional, add a clear comment above respawn() documenting that start_kill()
is not awaited and may leave zombies until process exit; update references to
child and log_rx accordingly when making it async.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 55902fd9-5396-4e3b-9a17-5e445c86542c
Actionable comments posted: 2
♻️ Duplicate comments (1)
multiversion-tests/src/process.rs (1)
Lines 149-160: 🧹 Nitpick | 🔵 Trivial — `respawn()` doesn't wait for old process termination.
The `start_kill()` call initiates termination but doesn't wait for the old process to exit before spawning the replacement. While the dropped `Child` handle will eventually be reaped by Tokio's runtime, this could briefly leave a zombie. Consider making `respawn()` async to allow waiting:

♻️ Optional: Make respawn async to ensure clean handoff

```diff
-    pub fn respawn(&mut self) -> Result<(), ProcessError> {
-        if let Some(mut old) = self.child.take()
-            && let Err(e) = old.start_kill()
-        {
-            tracing::warn!(error = %e, "failed to kill old process during respawn");
+    pub async fn respawn(&mut self) -> Result<(), ProcessError> {
+        if let Some(mut old) = self.child.take() {
+            if let Err(e) = old.start_kill() {
+                tracing::warn!(error = %e, "failed to kill old process during respawn");
+            }
+            // Wait for old process to fully exit
+            let _ = old.wait().await;
         }
         let (child, log_rx) = spawn_child_with_logs(&self.config)?;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/process.rs` around lines 149 - 160, the respawn() method currently calls start_kill() on the old Child but returns immediately, risking a brief zombie; change respawn to be async (pub async fn respawn(&mut self) -> Result<(), ProcessError>) and after calling start_kill() await the child's termination (e.g., await old.wait() or old.wait_with_output()) before spawning a new child with spawn_child_with_logs; ensure you still handle and log any error from start_kill() and from waiting on old, and preserve updating self.child and self.log_rx after the old process has fully exited.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@multiversion-tests/src/fault/network.rs`:
- Around line 23-29: The Drop implementation for NetworkPartitioner currently
checks ACTIVE_INJECTIONS.load(...) == 0 and calls sync_cleanup_chain(), which
races with inject() setup; replace that ad-hoc check with the existing
synchronized cleanup path by calling decrement_and_maybe_cleanup() (the same
function noted as the primary cleanup path) so drops participate in the
coordinated decrement/cleanup logic; update both occurrences of the manual
load-and-sync (the Drop impl and the other instance at 45-47) to invoke
decrement_and_maybe_cleanup() instead of reading ACTIVE_INJECTIONS and calling
sync_cleanup_chain() directly.
- Around line 112-116: The stderr matching is case-sensitive and misses
"Permission denied"; update the check in the function (e.g.,
check_iptables_output) to compare case-insensitively — for example normalize
stderr to lowercase (or use a case-insensitive search) before testing for
"permission denied" and "operation not permitted", then return
FaultError::InsufficientPrivileges when matched and
FaultError::InjectionFailed(stderr.into_owned()) otherwise; ensure you still use
the existing symbols stderr, FaultError::InsufficientPrivileges, and
FaultError::InjectionFailed so callers remain unchanged.
---
Duplicate comments:
In `@multiversion-tests/src/process.rs`:
- Around line 149-160: The respawn() method currently calls start_kill() on the
old Child but returns immediately, risking a brief zombie; change respawn to be
async (pub async fn respawn(&mut self) -> Result<(), ProcessError>) and after
calling start_kill() await the child's termination (e.g., await old.wait() or
old.wait_with_output()) before spawning a new child with spawn_child_with_logs;
ensure you still handle and log any error from start_kill() and from waiting on
old, and preserve updating self.child and self.log_rx after the old process has
fully exited.
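The wait-then-reap idea behind this fix can be illustrated with a synchronous, stdlib-only sketch. The real harness uses tokio::process, and shutdown_with_fallback is a hypothetical name, not the crate's API; the point is that the old child is always reaped before the caller proceeds, so no zombie survives.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical sync analogue of a graceful shutdown with a hard-kill fallback:
// poll for voluntary exit until the grace deadline, then kill and reap.
fn shutdown_with_fallback(child: &mut Child, grace: Duration) -> io::Result<ExitStatus> {
    let deadline = Instant::now() + grace;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(status); // process exited on its own
        }
        if Instant::now() >= deadline {
            child.kill()?; // SIGKILL fallback once the grace period lapses
            return child.wait(); // reap: never leave a zombie behind
        }
        thread::sleep(Duration::from_millis(20));
    }
}

fn main() -> io::Result<()> {
    // `sleep 30` stands in for a node process that ignores the grace period.
    let mut child = Command::new("sleep").arg("30").spawn()?;
    let status = shutdown_with_fallback(&mut child, Duration::from_millis(100))?;
    // A killed process reports an unsuccessful exit status.
    println!("success={}", status.success());
    Ok(())
}
```

In the async version the structure is the same, with try_wait/wait replaced by their awaited Tokio counterparts.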
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 751ef567-38eb-4922-a97c-f11bf2d824c5
📒 Files selected for processing (2)
multiversion-tests/src/fault/network.rs
multiversion-tests/src/process.rs
Actionable comments posted: 4
♻️ Duplicate comments (2)
multiversion-tests/src/config.rs (1)
158-163: ⚠️ Potential issue | 🟠 Major
Preserve unsigned integer precision before float fallback.
At Line 162, falling back directly to as_f64() after as_i64() can silently round large raw JSON integers. This is still risky for cross-version inputs and should be handled before float conversion.
💡 Suggested fix

 serde_json::Value::Number(n) => {
     if let Some(i) = n.as_i64() {
         Some(Value::Integer(i))
+    } else if let Some(u) = n.as_u64() {
+        if u <= i64::MAX as u64 {
+            Some(Value::Integer(u as i64))
+        } else {
+            // toml::Value::Integer is i64-backed; preserve exact value as string
+            Some(Value::String(u.to_string()))
+        }
     } else {
         n.as_f64().map(Value::Float)
     }
 }

+ #[test]
+ fn json_to_toml_preserves_large_raw_u64_number() {
+     let large = (i64::MAX as u64) + 1;
+     let json = serde_json::Value::Number(serde_json::Number::from(large));
+     let toml_val = json_to_toml(&json).unwrap();
+     assert_eq!(toml_val.as_str().unwrap(), large.to_string());
+ }

#!/bin/bash
set -euo pipefail
# Verify numeric conversion branch order in json_to_toml
rg -n --type rust -C3 'serde_json::Value::Number|as_i64\(|as_u64\(|as_f64\(' multiversion-tests/src/config.rs
# Verify coverage for large raw u64 numeric input
rg -n --type rust -C2 'i64::MAX\s*as u64|json_to_toml_preserves_large_raw_u64_number|i64::MAX \+ 1' multiversion-tests/src/config.rs

Also applies to: 451-455
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/config.rs` around lines 158 - 163, in the json_to_toml conversion (match arm for serde_json::Value::Number) you must try n.as_u64() before falling back to n.as_f64() to avoid silently rounding large unsigned JSON integers; change the branch order so it checks n.as_i64(), then n.as_u64() and maps the u64 result appropriately (preserving integer precision) to the TOML representation instead of immediately calling as_f64(), and only use as_f64() to produce Value::Float if neither as_i64() nor as_u64() succeed.
multiversion-tests/src/binary.rs (1)
105-107: ⚠️ Potential issue | 🟠 Major
Pre-clean deterministic worktree path before creating a new worktree.
Line 105 uses a fixed worktree-{rev} path, but Line 107 creates it directly without any prior cleanup. After an interrupted run, that path remains on disk with git worktree metadata intact. The subsequent git worktree prune (line 106) only cleans stale metadata, not the actual directory, so the next rebuild attempt will fail when git worktree add rejects the existing path.
🔧 Proposed fix

 let worktree_path = self.cache_dir.join(format!("worktree-{rev}"));
 self.prune_stale_worktrees().await;
+ // Best-effort cleanup in case a previous run crashed before removal.
+ let _ = self.remove_worktree(&worktree_path).await;
+ if worktree_path.exists() {
+     let _ = tokio::fs::remove_dir_all(&worktree_path).await;
+ }
 self.create_worktree(git_ref, &worktree_path).await?;
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/binary.rs` around lines 105 - 107, The deterministic worktree path created as worktree_path = self.cache_dir.join(format!("worktree-{rev}")) must be cleaned before calling create_worktree to avoid git refusing to add an existing directory; update the sequence around prune_stale_worktrees and create_worktree so you first compute worktree_path, then run self.prune_stale_worktrees().await, then if worktree_path.exists() remove it (e.g. via tokio::fs or std::fs remove_dir_all depending on async context) or run the equivalent git worktree removal, and only after successful removal call self.create_worktree(git_ref, &worktree_path).await?; ensure removal errors are handled/propagated consistently with existing error handling in the function.
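The pre-clean step recommended above amounts to "make sure the path is gone before git worktree add runs". A stdlib-only sketch of that guard (ensure_clean_path is a hypothetical helper name; the real code would use tokio::fs in async context):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical pre-clean guard: remove any leftover directory at the
// deterministic worktree path so a later `git worktree add` cannot be
// rejected by debris from an interrupted run.
fn ensure_clean_path(path: &Path) -> io::Result<()> {
    if path.exists() {
        fs::remove_dir_all(path)?; // drop stale contents wholesale
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("worktree-demo");
    // Simulate an interrupted run that left files behind.
    fs::create_dir_all(dir.join("stale"))?;
    ensure_clean_path(&dir)?;
    println!("exists={}", dir.exists());
    Ok(())
}
```

Running the cleanup unconditionally keeps the happy path cheap (a single exists() check) while making the crash-recovery path deterministic.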
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@multiversion-tests/src/binary.rs`:
- Around line 128-132: The code currently treats any existing path as valid for
env override (PathBuf::from(raw) with path.exists()), which allows directories;
change the existence check to ensure it's a regular file (use path.is_file())
and return the same BinaryError::NotFound { path } (or an appropriate error
variant) when that check fails so the env-override path is validated before
calling resolved(label, "env-override", path).
- Around line 163-170: The subprocess .output().await calls for external
commands (e.g., the git rev-parse spawn using Command::new("git") and other git
worktree and cargo build invocations) are currently unbounded; wrap each await
in a tokio::time::timeout with a reasonable Duration (e.g., few seconds/minutes)
and map timeout errors into a clear Err return so the test harness fails fast;
update the places that run "git worktree" and "cargo build" (the other
Command::new("git") and Command::new("cargo") calls in this file) to use the
same timeout pattern and handle TimeoutElapsed separately from command execution
errors.
In `@xtask/src/main.rs`:
- Around line 523-529: The env vars IRYS_BINARY_NEW and IRYS_BINARY_OLD are
being set after calling sh.push_dir("multiversion-tests"), so any relative paths
in binary_new/binary_old will be resolved against the new CWD and can break; fix
by resolving binary_new and binary_old to absolute paths (e.g., using
std::fs::canonicalize or joining with the original current_dir) before calling
sh.push_dir, then call sh.set_var("IRYS_BINARY_NEW", absolute_path) and
sh.set_var("IRYS_BINARY_OLD", absolute_path) (or alternatively move the
sh.set_var calls to before sh.push_dir) while keeping references to binary_new,
binary_old, sh.push_dir and sh.set_var unchanged.
- Around line 530-536: Replace the use of the glob-style flag in the test
argument construction: in the code that builds test_args (the let mut test_args
= ... and subsequent test_args.extend calls), remove the separate "--test" and
"*" entries and instead push the single idiomatic "--tests" flag so the test
runner uses the explicit --tests option when running integration tests.
---
Duplicate comments:
In `@multiversion-tests/src/binary.rs`:
- Around line 105-107: The deterministic worktree path created as worktree_path
= self.cache_dir.join(format!("worktree-{rev}")) must be cleaned before calling
create_worktree to avoid git refusing to add an existing directory; update the
sequence around prune_stale_worktrees and create_worktree so you first compute
worktree_path, then run self.prune_stale_worktrees().await, then if
worktree_path.exists() remove it (e.g. via tokio::fs or std::fs remove_dir_all
depending on async context) or run the equivalent git worktree removal, and only
after successful removal call self.create_worktree(git_ref,
&worktree_path).await?; ensure removal errors are handled/propagated
consistently with existing error handling in the function.
In `@multiversion-tests/src/config.rs`:
- Around line 158-163: In the json_to_toml conversion (match arm for
serde_json::Value::Number) you must try n.as_u64() before falling back to
n.as_f64() to avoid silently rounding large unsigned JSON integers; change the
branch order so it checks n.as_i64(), then n.as_u64() and maps the u64 result
appropriately (preserving integer precision) to the TOML representation instead
of immediately calling as_f64(), and only use as_f64() to produce Value::Float
if neither as_i64() nor as_u64() succeed.
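The i64-then-u64-then-f64 branch order can be modeled without pulling in serde_json or toml. In this sketch, JsonNum stands in for serde_json::Number and TomlNum for toml::Value's numeric arms — both are simplified stand-ins, not the real types — but the ordering logic is the same one the comment asks for:

```rust
// Stand-in for serde_json::Number's accessors (assumption: simplified model).
#[derive(Clone, Copy)]
enum JsonNum {
    I(i64),
    U(u64),
    F(f64),
}

// Stand-in for the TOML-side representation.
#[derive(Debug, PartialEq)]
enum TomlNum {
    Int(i64),
    Str(String), // exact digits for values beyond i64::MAX
    Float(f64),
}

fn as_i64(n: JsonNum) -> Option<i64> {
    match n {
        JsonNum::I(i) => Some(i),
        JsonNum::U(u) => i64::try_from(u).ok(),
        JsonNum::F(_) => None,
    }
}

fn as_u64(n: JsonNum) -> Option<u64> {
    match n {
        JsonNum::U(u) => Some(u),
        JsonNum::I(i) => u64::try_from(i).ok(),
        JsonNum::F(_) => None,
    }
}

fn to_toml(n: JsonNum) -> TomlNum {
    if let Some(i) = as_i64(n) {
        TomlNum::Int(i)
    } else if let Some(u) = as_u64(n) {
        // Beyond i64::MAX: preserve the exact digits instead of rounding.
        TomlNum::Str(u.to_string())
    } else {
        let JsonNum::F(f) = n else { unreachable!() };
        TomlNum::Float(f)
    }
}

fn main() {
    let big = (i64::MAX as u64) + 1;
    println!("{:?}", to_toml(JsonNum::U(big)));
}
```

The failure mode being avoided: i64::MAX + 1 has no i64 representation, so without the as_u64 branch it would fall through to f64 and come back rounded.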
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 70ee288e-a1e7-44d1-bf64-e260f55aa5a9
📒 Files selected for processing (3)
multiversion-tests/src/binary.rs
multiversion-tests/src/config.rs
xtask/src/main.rs
Actionable comments posted: 1
♻️ Duplicate comments (3)
multiversion-tests/src/fault/network.rs (2)
23-29: ⚠️ Potential issue | 🟠 Major
Fix cleanup/setup race by incrementing active injections before chain setup.
Line 45 performs setup_chain() before marking the injection active. During that gap, Line 27 can observe zero and run cleanup from another Drop, removing the chain and causing flaky add_drop_rule failures.
Suggested fix

 impl Drop for NetworkPartitioner {
     fn drop(&mut self) {
-        // Safety cleanup: if no active injections remain, flush the chain.
-        // The primary cleanup path is decrement_and_maybe_cleanup().
-        if ACTIVE_INJECTIONS.load(Ordering::Acquire) == 0 {
-            sync_cleanup_chain();
-        }
+        // No-op: cleanup is coordinated by decrement_and_maybe_cleanup()
+        // on active-injection transitions to zero.
     }
 }
@@
 async fn inject(&self, target: &FaultTarget) -> Result<FaultGuard, FaultError> {
     let (source_port, dest_port) = extract_ports(target)?;
-    setup_chain().await?;
     ACTIVE_INJECTIONS.fetch_add(1, Ordering::AcqRel);
+    if let Err(e) = setup_chain().await {
+        decrement_and_maybe_cleanup();
+        return Err(e);
+    }
     if let Err(e) = add_drop_rule(source_port, dest_port).await {
         decrement_and_maybe_cleanup();
         return Err(e);
     }

Also applies to: 45-47
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/fault/network.rs` around lines 23 - 29, The race occurs because setup_chain() is called before marking the injection active, allowing a concurrent Drop (impl Drop for NetworkPartitioner) to see ACTIVE_INJECTIONS == 0 and run sync_cleanup_chain(); fix by incrementing the active counter before any chain setup: call ACTIVE_INJECTIONS.fetch_add(1, Ordering::AcqRel) (or equivalent increment path used elsewhere) prior to calling setup_chain(), and keep the existing decrement/decrement_and_maybe_cleanup() logic in Drop so cleanup still runs when the injection is removed; apply the same change to the other setup site(s) that call setup_chain() (the paths around setup_chain() at the creation/setup functions referenced in the diff).
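The increment-before-setup ordering can be shown in isolation with an atomic counter. This is a minimal stdlib model of the pattern the fix describes — the setup and cleanup bodies are stubbed out, and setup failure is simulated with a flag — not the crate's actual fault-injection code:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

static ACTIVE_INJECTIONS: AtomicUsize = AtomicUsize::new(0);

// Last injection out flushes shared state (the iptables chain, elided here).
fn decrement_and_maybe_cleanup() {
    if ACTIVE_INJECTIONS.fetch_sub(1, Ordering::AcqRel) == 1 {
        // fetch_sub returned the previous value 1, so the count is now zero:
        // this caller is responsible for cleanup.
    }
}

fn inject(setup_fails: bool) -> Result<(), &'static str> {
    // Mark the injection active *before* any setup, so a concurrent Drop can
    // never observe zero mid-setup and tear the chain down underneath us.
    ACTIVE_INJECTIONS.fetch_add(1, Ordering::AcqRel);
    if setup_fails {
        decrement_and_maybe_cleanup(); // roll the count back on failure
        return Err("setup_chain failed");
    }
    Ok(())
}

fn main() {
    assert!(inject(false).is_ok());
    assert_eq!(ACTIVE_INJECTIONS.load(Ordering::Acquire), 1);
    decrement_and_maybe_cleanup();
    assert_eq!(ACTIVE_INJECTIONS.load(Ordering::Acquire), 0);

    // A failed setup leaves the counter balanced.
    assert!(inject(true).is_err());
    assert_eq!(ACTIVE_INJECTIONS.load(Ordering::Acquire), 0);
    println!("ok");
}
```

The key invariant: the counter is nonzero for the entire window in which chain state may exist, so any observer that sees zero can safely clean up.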
164-181: 🧹 Nitpick | 🔵 Trivial
Consider consolidating extract_ports tests with rstest cases.
These two tests are good, but they can be merged into one parameterized test for less duplication and easier expansion.
As per coding guidelines, "Use rstest for parameterized tests and proptest for property-based tests in Rust".
Verify each finding against the current code and only fix it if needed. In `@multiversion-tests/src/fault/network.rs` around lines 164 - 181, Combine the two tests into a single parameterized rstest that exercises extract_ports with both FaultTarget::Network and FaultTarget::Process cases: create a case with FaultTarget::Network { source_port: 8080, dest_port: 9090 } and assert the returned (s,d) equals (8080,9090) using unwrap, and a case with FaultTarget::Process { pid: Pid::from_raw(1) } that asserts extract_ports returns an Err (is_err); annotate the test with #[rstest] and use #[case] parameters to pass each FaultTarget variant to the single test function referencing extract_ports, FaultTarget::Network, FaultTarget::Process, and Pid::from_raw.
xtask/src/main.rs (1)
523-524: ⚠️ Potential issue | 🟡 Minor
Don't silently ignore canonicalize failures for binary overrides.
At Line 523 and Line 524, unwrap_or(p.into()) can fall back to a relative path and later resolve against the changed cwd (multiversion-tests), producing hard-to-diagnose failures. Fail fast with context instead of swallowing canonicalization errors.
Proposed fix

- let binary_new = binary_new.map(|p| std::fs::canonicalize(&p).unwrap_or(p.into()));
- let binary_old = binary_old.map(|p| std::fs::canonicalize(&p).unwrap_or(p.into()));
+ let binary_new = binary_new
+     .map(|p| {
+         std::fs::canonicalize(&p).map_err(|e| {
+             eyre::eyre!("failed to canonicalize --binary-new '{}': {e}", p)
+         })
+     })
+     .transpose()?;
+ let binary_old = binary_old
+     .map(|p| {
+         std::fs::canonicalize(&p).map_err(|e| {
+             eyre::eyre!("failed to canonicalize --binary-old '{}': {e}", p)
+         })
+     })
+     .transpose()?;
Verify each finding against the current code and only fix it if needed. In `@xtask/src/main.rs` around lines 523 - 524, The code currently swallows canonicalize errors for the binary override maps (binary_new and binary_old) by using unwrap_or(p.into()), which can allow relative paths to leak and later fail; change the mapping to fail fast with context instead of falling back: call std::fs::canonicalize(&p) and propagate or return a descriptive error on failure (e.g. using anyhow::Context or map_err to include which override path failed and the original error), so both binary_new.map(...) and binary_old.map(...) return a canonicalized PathBuf or an early error with clear context about the failing override.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@multiversion-tests/src/binary.rs`:
- Around line 74-75: The cache check treats mere existence of the path
referenced by cached and accepted by resolved("new", rev, cached) as a valid hit
but the writer later writes directly to that final path (line ~224), risking
partial/truncated files being reused; change the cache publishing to write to a
temporary file in the same directory (e.g., cached + ".tmp" or
cached.with_extension(".tmp")), fully complete the write/copy to that temp, then
atomically rename (fs::rename) the temp into cached and only then consider the
artifact valid; update the cache-reader logic to still check cached.exists() (or
optionally a small atomic marker like cached+".complete") but only accept files
that were created by the atomic rename, and ensure the writer cleans up the temp
file on failure so truncated temps are never promoted to cached.
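The write-temp-then-rename pattern described above is small enough to show end to end. This stdlib sketch (publish_atomically is a hypothetical helper, not the crate's code) relies on the fact that a rename within one directory is atomic on POSIX filesystems, so a reader that checks the final path can only ever observe a complete artifact:

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

// Publish a cache artifact atomically: write every byte to a sibling temp
// file, flush it, then rename into place in a single step.
fn publish_atomically(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = dest.with_extension("tmp"); // sibling path, same directory
    let mut f = fs::File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?; // ensure data hits disk before publishing
    drop(f);
    // The rename is the publication point: before it, `dest` does not exist;
    // after it, `dest` is complete. There is no truncated intermediate state.
    fs::rename(&tmp, dest)
}

fn main() -> io::Result<()> {
    let dest = std::env::temp_dir().join("cached-binary-demo");
    publish_atomically(&dest, b"artifact bytes")?;
    println!("len={}", fs::read(&dest)?.len());
    fs::remove_file(&dest)?;
    Ok(())
}
```

On failure, the leftover .tmp file never matches the final path, so the cache-hit check cannot promote it.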
---
Duplicate comments:
In `@multiversion-tests/src/fault/network.rs`:
- Around line 23-29: The race occurs because setup_chain() is called before
marking the injection active, allowing a concurrent Drop (impl Drop for
NetworkPartitioner) to see ACTIVE_INJECTIONS == 0 and run sync_cleanup_chain();
fix by incrementing the active counter before any chain setup: call
ACTIVE_INJECTIONS.fetch_add(1, Ordering::AcqRel) (or equivalent increment path
used elsewhere) prior to calling setup_chain(), and keep the existing
decrement/decrement_and_maybe_cleanup() logic in Drop so cleanup still runs when
the injection is removed; apply the same change to the other setup site(s) that
call setup_chain() (the paths around setup_chain() at the creation/setup
functions referenced in the diff).
- Around line 164-181: Combine the two tests into a single parameterized rstest
that exercises extract_ports with both FaultTarget::Network and
FaultTarget::Process cases: create a case with FaultTarget::Network {
source_port: 8080, dest_port: 9090 } and assert the returned (s,d) equals
(8080,9090) using unwrap, and a case with FaultTarget::Process { pid:
Pid::from_raw(1) } that asserts extract_ports returns an Err (is_err); annotate
the test with #[rstest] and use #[case] parameters to pass each FaultTarget
variant to the single test function referencing extract_ports,
FaultTarget::Network, FaultTarget::Process, and Pid::from_raw.
In `@xtask/src/main.rs`:
- Around line 523-524: The code currently swallows canonicalize errors for the
binary override maps (binary_new and binary_old) by using unwrap_or(p.into()),
which can allow relative paths to leak and later fail; change the mapping to
fail fast with context instead of falling back: call std::fs::canonicalize(&p)
and propagate or return a descriptive error on failure (e.g. using
anyhow::Context or map_err to include which override path failed and the
original error), so both binary_new.map(...) and binary_old.map(...) return a
canonicalized PathBuf or an early error with clear context about the failing
override.
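The fail-fast canonicalization can be reduced to a tiny helper. This sketch uses a plain String error instead of eyre to stay dependency-free, and canonical_override is a hypothetical name — the point is that a bad override path fails immediately with context, before any cwd change can obscure it:

```rust
use std::path::PathBuf;

// Resolve a CLI override to an absolute path up front, naming the flag in
// the error so failures are attributable (assumption: String stands in for
// the real eyre error type).
fn canonical_override(flag: &str, raw: &str) -> Result<PathBuf, String> {
    std::fs::canonicalize(raw)
        .map_err(|e| format!("failed to canonicalize {flag} '{raw}': {e}"))
}

fn main() {
    // "." always exists, so this resolves to an absolute path.
    let ok = canonical_override("--binary-new", ".").unwrap();
    println!("absolute={}", ok.is_absolute());

    // A missing path fails immediately, with the offending flag in the message.
    let err = canonical_override("--binary-old", "/no/such/binary").unwrap_err();
    println!("has_flag={}", err.contains("--binary-old"));
}
```

Because the path is absolute by the time sh.push_dir("multiversion-tests") runs, the later cwd change cannot silently re-resolve it.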
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 99ccdd01-12eb-488d-b1dd-ce3b6c83a17c
📒 Files selected for processing (3)
multiversion-tests/src/binary.rs
multiversion-tests/src/fault/network.rs
xtask/src/main.rs
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@multiversion-tests/src/binary.rs`:
- Around line 117-124: The current flow can let remove_worktree errors
overshadow an earlier build_at error; change the logic so you store the result
from build_at in build_result, then always attempt worktree removal (calling
remove_worktree and prune_stale_worktrees as now), but when both build_result is
Err and removal returns Err you return the original build error (from
build_result) instead of the WorktreeRemoval error; reference the build_at call
(self.build_at), the build_result variable, remove_worktree,
prune_stale_worktrees, and worktree_path to locate and implement this
conditional return behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a3330365-da78-4290-a0e2-e063dfe94907
📒 Files selected for processing (1)
multiversion-tests/src/binary.rs
let build_result = self.build_at(&worktree_path, &cached).await;
if let Err(e) = self.remove_worktree(&worktree_path).await {
    tracing::warn!(error = %e, path = %worktree_path.display(),
        "worktree removal failed, attempting prune and retry");
    self.prune_stale_worktrees().await;
    self.remove_worktree(&worktree_path).await?;
}
build_result?;
Worktree removal error can mask the original build failure.
If build_at fails and remove_worktree also fails (even after prune+retry), line 122 propagates the WorktreeRemoval error, and line 124 is never reached. This masks the actual build error, making debugging harder.
🔧 Proposed fix to preserve build error priority
let build_result = self.build_at(&worktree_path, &cached).await;
if let Err(e) = self.remove_worktree(&worktree_path).await {
tracing::warn!(error = %e, path = %worktree_path.display(),
"worktree removal failed, attempting prune and retry");
self.prune_stale_worktrees().await;
- self.remove_worktree(&worktree_path).await?;
+ if let Err(e2) = self.remove_worktree(&worktree_path).await {
+ tracing::error!(error = %e2, path = %worktree_path.display(),
+ "worktree removal failed after prune, manual cleanup may be needed");
+ // Don't mask build_result - it's more actionable for the user
+ }
}
 build_result?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@multiversion-tests/src/binary.rs` around lines 117 - 124, The current flow
can let remove_worktree errors overshadow an earlier build_at error; change the
logic so you store the result from build_at in build_result, then always attempt
worktree removal (calling remove_worktree and prune_stale_worktrees as now), but
when both build_result is Err and removal returns Err you return the original
build error (from build_result) instead of the WorktreeRemoval error; reference
the build_at call (self.build_at), the build_result variable, remove_worktree,
prune_stale_worktrees, and worktree_path to locate and implement this
conditional return behavior.
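The error-priority rule asked for here fits in a few lines. This sketch models it with String errors standing in for the crate's real error types (finish is a hypothetical name): the cleanup failure is logged but only returned when there is no build error to report.

```rust
// Combine a build result with a cleanup result, giving the build error
// priority: it is the actionable failure, and a worktree-removal error
// should never mask it.
fn finish(build: Result<(), String>, cleanup: Result<(), String>) -> Result<(), String> {
    if let Err(c) = &cleanup {
        // Demoted to a log line; manual cleanup may be needed.
        eprintln!("warn: worktree cleanup failed: {c}");
    }
    match (build, cleanup) {
        (Err(b), _) => Err(b),      // build failure always wins
        (Ok(()), Err(c)) => Err(c), // cleanup error surfaces only on its own
        (Ok(()), Ok(())) => Ok(()),
    }
}

fn main() {
    let r = finish(
        Err("cargo build failed".into()),
        Err("worktree busy".into()),
    );
    println!("{}", r.unwrap_err());
}
```

The same shape generalizes to any "primary operation plus best-effort cleanup" sequence in the resolver.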
Describe the changes
Before:
No mechanism existed to run different Irys node binary versions side-by-side and verify cross-version compatibility (e.g. upgrade/rollback scenarios, mixed-version clusters).
After:
Adds a standalone multiversion-tests crate that builds binaries from arbitrary git refs, orchestrates multi-node clusters, injects faults, and polls for block-height convergence.
Changes

Binary resolution (binary.rs)
- BinaryResolver checks out git refs via worktrees, builds irys-chain at each revision, and caches binaries by commit hash under target/multiversion/
- IRYS_BINARY_NEW / IRYS_BINARY_OLD env overrides to skip builds
- git ref validation (rejects --prefix, .., null bytes)

Process lifecycle (process.rs)
- NodeProcess wraps a tokio::process::Child with SIGTERM/SIGKILL/SIGSTOP/SIGCONT control
- log capture via mpsc::UnboundedReceiver
- graceful shutdown(timeout) with SIGKILL fallback; respawn() for restart-in-place

Cluster orchestration (cluster.rs)
- Cluster::start(ClusterSpec) allocates ephemeral ports, generates per-node TOML configs, spawns genesis nodes first (waiting for readiness), then peers
- wait_for_convergence() polls all running nodes until block heights are within tolerance

Fault injection (fault/)
- FaultInjector trait with FaultTarget (Process or Network) and RAII FaultGuard that auto-undoes on drop
- CrashInjector sends SIGKILL to a target PID
- NetworkPartitioner placeholder for iptables-based port blocking (requires root)

Health probing (probe.rs)
- HttpProbe polls /v1/info on each node's API port, parses block height from JSON
- wait_for_ready(url, timeout) and wait_for_convergence(urls, tolerance, timeout) with configurable poll intervals

Config generation (config.rs)
- per-node TOML config generation with an is_genesis flag

Port allocation (ports.rs)
- NodePorts::allocate() binds three ephemeral TCP listeners (api, gossip, reth_network), extracts assigned ports, and drops the listeners
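The bind-read-drop port allocation scheme can be sketched with std::net alone. This is an illustrative stand-in for NodePorts::allocate(), not the crate's code; binding all three listeners simultaneously guarantees the OS hands out three distinct ephemeral ports.

```rust
use std::io;
use std::net::TcpListener;

// Ask the OS for three distinct ephemeral ports (api, gossip, reth_network).
// Holding all three listeners at once prevents duplicate assignments; the
// ports are freed when the listeners drop, so the node can bind them moments
// later. (Inherently racy: another process could grab a port in the gap.)
fn allocate_node_ports() -> io::Result<(u16, u16, u16)> {
    let api = TcpListener::bind("127.0.0.1:0")?; // port 0 = "pick for me"
    let gossip = TcpListener::bind("127.0.0.1:0")?;
    let reth = TcpListener::bind("127.0.0.1:0")?;
    Ok((
        api.local_addr()?.port(),
        gossip.local_addr()?.port(),
        reth.local_addr()?.port(),
    )) // listeners drop here, freeing the ports
}

fn main() -> io::Result<()> {
    let (api, gossip, reth) = allocate_node_ports()?;
    println!("distinct={}", api != gossip && gossip != reth && api != reth);
    Ok(())
}
```

The race window between drop and the node's own bind is the usual trade-off of this technique; it is acceptable in a test harness where a collision simply fails the run early.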
Please link to the issue(s) that will be closed with this PR.
Checklist
Additional Context
Add any other context about the pull request here.