
feat(multiversion-tests): add cross-version integration test harness#1207

Open
glottologist wants to merge 8 commits into master from jason/multiversion-testing

Conversation

@glottologist
Contributor

@glottologist glottologist commented Mar 11, 2026

Describe the changes

Before:
No mechanism existed to run different Irys node binary versions side-by-side and verify cross-version compatibility (e.g. upgrade/rollback scenarios, mixed-version clusters).

After:
Adds a standalone multiversion-tests crate that builds binaries from arbitrary git refs, orchestrates multi-node clusters, injects faults, and polls for block-height convergence.

Changes

Binary resolution (binary.rs)

  • BinaryResolver checks out git refs via worktrees, builds irys-chain at each revision, and caches binaries by commit hash under target/multiversion/
  • Supports IRYS_BINARY_NEW / IRYS_BINARY_OLD env overrides to skip builds
  • Validates git refs against injection patterns (empty, -- prefix, .., null bytes)
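The ref validation described above can be sketched as follows. This is a minimal illustration based only on the bullet list; the function name and the exact set of rejected patterns are assumptions, not the crate's real API.

```rust
/// Minimal sketch of git-ref validation: reject empty refs, option-like
/// "--" prefixes, ".." sequences, and embedded null bytes.
/// (Illustrative only; the real BinaryResolver may check more.)
fn validate_git_ref(r: &str) -> Result<(), String> {
    if r.is_empty() {
        return Err("empty git ref".into());
    }
    if r.starts_with("--") {
        return Err("ref looks like a command-line option".into());
    }
    if r.contains("..") {
        return Err("ref contains '..'".into());
    }
    if r.contains('\0') {
        return Err("ref contains a null byte".into());
    }
    Ok(())
}

fn main() {
    assert!(validate_git_ref("main").is_ok());
    assert!(validate_git_ref("v1.2.3").is_ok());
    assert!(validate_git_ref("").is_err());
    assert!(validate_git_ref("--upload-pack=evil").is_err());
    assert!(validate_git_ref("a..b").is_err());
    println!("ref validation ok");
}
```

Rejecting `--`-prefixed refs matters because a ref is later passed to `git` as a positional argument, where it could otherwise be parsed as a flag.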

Process lifecycle (process.rs)

  • NodeProcess wraps a tokio::process::Child with SIGTERM/SIGKILL/SIGSTOP/SIGCONT control
  • Async log forwarding (stdout + stderr) via mpsc::UnboundedReceiver
  • Graceful shutdown(timeout) with SIGKILL fallback; respawn() for restart-in-place

Cluster orchestration (cluster.rs)

  • Cluster::start(ClusterSpec) allocates ephemeral ports, generates per-node TOML configs, spawns genesis nodes first (waiting for readiness), then peers
  • wait_for_convergence() polls all running nodes until block heights are within tolerance

Fault injection (fault/)

  • FaultInjector trait with FaultTarget (Process or Network) and RAII FaultGuard that auto-undoes on drop
  • CrashInjector sends SIGKILL to a target PID
  • NetworkPartitioner placeholder for iptables-based port blocking (requires root)
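The RAII `FaultGuard` pattern above can be sketched like this, assuming the guard stores an undo closure and runs it on drop (the struct shape is illustrative, not the crate's actual definition):

```rust
/// Sketch of an RAII undo guard: the stored closure reverses the injected
/// fault automatically when the guard goes out of scope.
struct FaultGuard<F: FnMut()> {
    undo: Option<F>,
}

impl<F: FnMut()> FaultGuard<F> {
    fn new(undo: F) -> Self {
        Self { undo: Some(undo) }
    }
}

impl<F: FnMut()> Drop for FaultGuard<F> {
    fn drop(&mut self) {
        // Auto-undo the fault; take() makes this run at most once.
        if let Some(mut undo) = self.undo.take() {
            undo();
        }
    }
}

fn main() {
    use std::cell::Cell;
    let undone = Cell::new(false);
    {
        let _guard = FaultGuard::new(|| undone.set(true));
        assert!(!undone.get()); // fault still active inside the scope
    }
    assert!(undone.get()); // undo ran automatically on drop
    println!("fault undone");
}
```

The drop-based undo is what keeps a failed test from leaking faults (e.g. a lingering SIGSTOP or iptables rule) into subsequent tests.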

Health probing (probe.rs)

  • HttpProbe polls /v1/info on each node's API port, parses block height from JSON
  • wait_for_ready(url, timeout) and wait_for_convergence(urls, tolerance, timeout) with configurable poll intervals
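The convergence predicate behind `wait_for_convergence` can be sketched as a pure function over the polled heights (a simplified stand-in; the real `HttpProbe` fetches heights over HTTP and retries until a timeout):

```rust
/// Sketch of the convergence check: all reported block heights must be
/// within `tolerance` of each other. (Illustrative; the real probe polls
/// /v1/info on each node's API port.)
fn heights_converged(heights: &[u64], tolerance: u64) -> bool {
    match (heights.iter().min(), heights.iter().max()) {
        (Some(min), Some(max)) => max - min <= tolerance,
        _ => false, // no nodes reporting => not converged
    }
}

fn main() {
    assert!(heights_converged(&[100, 101, 100], 2));
    assert!(!heights_converged(&[100, 104], 2));
    assert!(!heights_converged(&[], 2)); // empty set never converges
    println!("convergence predicate ok");
}
```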

Config generation (config.rs)

  • Template-based TOML config: substitutes ports, data dir, mining key, reward address, peer list, and is_genesis flag
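Template substitution of the kind described above can be sketched with plain string replacement. The `{{key}}` placeholder syntax is an assumption for illustration; the real template lives in fixtures/base-config.toml and may use a different convention.

```rust
/// Sketch of template-based config generation: replace {{key}} placeholders
/// with per-node values. (Placeholder syntax is hypothetical.)
fn render_config(template: &str, substitutions: &[(&str, String)]) -> String {
    let mut out = template.to_owned();
    for (key, value) in substitutions {
        out = out.replace(&format!("{{{{{key}}}}}"), value);
    }
    out
}

fn main() {
    let template = "api_port = {{api_port}}\nis_genesis = {{is_genesis}}\n";
    let rendered = render_config(
        template,
        &[
            ("api_port", 8080.to_string()),
            ("is_genesis", true.to_string()),
        ],
    );
    assert_eq!(rendered, "api_port = 8080\nis_genesis = true\n");
    println!("{rendered}");
}
```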

Port allocation (ports.rs)

  • NodePorts::allocate() binds three ephemeral TCP listeners (api, gossip, reth_network), extracts assigned ports, and drops the listeners
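The ephemeral-port trick works by binding port 0 and reading back the port the OS chose; a minimal std-only sketch (the crate presumably does the async equivalent):

```rust
use std::net::TcpListener;

/// Sketch of ephemeral port allocation: bind to port 0, let the OS pick,
/// read the assigned port, then drop the listener. Note the caveat raised
/// in review below: the port is only sampled, not reserved.
fn allocate_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let port = listener.local_addr()?.port();
    Ok(port) // listener dropped here, freeing the port again
}

fn main() -> std::io::Result<()> {
    // Mirrors the shape of NodePorts::allocate(): three ports per node.
    let (api, gossip, reth_network) =
        (allocate_port()?, allocate_port()?, allocate_port()?);
    assert!(api > 0 && gossip > 0 && reth_network > 0);
    println!("api={api} gossip={gossip} reth={reth_network}");
    Ok(())
}
```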

Related Issue(s)
Please link to the issue(s) that will be closed with this PR.

Checklist

  • Tests have been added/updated for the changes.
  • Documentation has been updated for the changes (if applicable).
  • The code follows Rust's style guidelines.

Additional Context
Add any other context about the pull request here.

Summary by CodeRabbit

  • New Features

    • Added a multiversion testing framework: cluster orchestration, binary resolution/build, config generation, probing, process & port management, and reversible fault injection (process & network).
  • Tests

    • New end-to-end and upgrade integration tests (convergence, upgrades, rollbacks, crashes) and shared test utilities.
  • Chores

    • New multiversion workspace, baseline test config, pinned toolchain, CI workflow, CLI command to run the suite, and updated .gitignore.

@coderabbitai
Contributor

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

Adds a new multiversion test suite: workspace manifest and toolchain, fixtures, extensive Rust test utilities (binary resolver, cluster orchestration, config generator, fault injection, process/ports/probe modules), end-to-end and upgrade tests, CI workflow, xtask command, and a .gitignore rule.

Changes

  • Project & CI config (.gitignore, multiversion-tests/Cargo.toml, multiversion-tests/rust-toolchain.toml, .github/workflows/multiversion.yml): adds the multiversion workspace manifest, pins the Rust toolchain to 1.93.0, adds a CI workflow for multiversion tests, and ignores multiversion-tests/target.
  • Fixtures & templates (multiversion-tests/fixtures/base-config.toml): adds a comprehensive base TOML config template used to generate per-node configs for tests.
  • Library surface & exports (multiversion-tests/src/lib.rs): exports the new modules (binary, cluster, config, fault, ports, probe, process) and small URL helper functions with property tests.
  • Binary resolver (multiversion-tests/src/binary.rs): new BinaryResolver with git worktree resolution, cached builds, inter-process locks, error types, and a ResolvedBinary API.
  • Cluster orchestration (multiversion-tests/src/cluster.rs): new Cluster implementation with port allocation, node spawn/upgrade/shutdown, config generation/patching, readiness/convergence polling, and detailed error handling.
  • Config generation (multiversion-tests/src/config.rs): TOML config generator and peer/consensus patcher with JSON→TOML helpers, key normalization, and tests.
  • Fault injection framework (multiversion-tests/src/fault/mod.rs, multiversion-tests/src/fault/crash.rs, multiversion-tests/src/fault/network.rs): introduces the FaultInjector API, FaultGuard, ProcessFreezer/ProcessKiller, and NetworkPartitioner (iptables) with availability checks and undo semantics.
  • Process, ports, probe utilities (multiversion-tests/src/process.rs, multiversion-tests/src/ports.rs, multiversion-tests/src/probe.rs): adds the NodeProcess subprocess wrapper with log streaming/control, an ephemeral port allocator, and an HttpProbe for node info, polling, and convergence logic.
  • Tests & helpers (multiversion-tests/tests/common/mod.rs, multiversion-tests/tests/e2e.rs, multiversion-tests/tests/upgrade.rs): adds common test helpers/constants and E2E + upgrade test suites (ignored by default) that exercise cluster lifecycle and upgrade flows.
  • Xtask integration (xtask/src/main.rs): adds Commands::MultiversionTest to run multiversion tests with optional binary overrides.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant Resolver as BinaryResolver
    participant Git as Git (worktree)
    participant Cargo as Cargo
    participant Cluster as Cluster
    participant Node as NodeProcess
    participant Probe as HttpProbe

    User->>Resolver: resolve_new()/resolve_old()
    Resolver->>Git: validate_git_ref() / create_worktree()
    Git-->>Resolver: worktree_path / resolved_rev
    Resolver->>Cargo: cargo build (worktree)
    Cargo-->>Resolver: binary_path
    Resolver-->>User: ResolvedBinary

    User->>Cluster: start(ClusterSpec)
    Cluster->>Cluster: allocate_ports()
    Cluster->>Node: spawn nodes with per-node configs
    Node-->>Cluster: running
    Cluster->>Probe: wait_for_convergence(api_urls)
    Probe->>Probe: poll_all_nodes()
    Probe-->>Cluster: converged / timeout
    Cluster-->>User: ready / error
sequenceDiagram
    participant User as User
    participant Cluster as Cluster
    participant Resolver as BinaryResolver
    participant Node as NodeProcess
    participant Probe as HttpProbe

    User->>Cluster: upgrade_node(name, new_binary)
    Cluster->>Node: node_mut(name)
    Cluster->>Node: shutdown()/kill()
    Node-->>Cluster: terminated
    Cluster->>Node: set_binary(new_binary) & respawn()
    Node-->>Cluster: running
    Cluster->>Probe: wait_for_ready(node)
    Probe-->>Cluster: node ready
    Cluster->>Probe: wait_for_convergence()
    Probe-->>Cluster: converged
    Cluster-->>User: upgrade complete

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • glottologist
  • roberts-pumpurs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 52.82%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title 'feat(multiversion-tests): add cross-version integration test harness' accurately summarizes the main change: introducing a comprehensive multiversion testing framework with a clear feature scope.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@multiversion-tests/Cargo.toml`:
- Line 11: The tokio dependency in Cargo.toml is missing the "sync" feature
required by multiversion-tests/src/process.rs which uses tokio::sync::mpsc
(UnboundedReceiver and unbounded_channel); update the tokio features array for
the tokio = { version = "1", ... } entry to include "sync" so the mpsc types and
helpers are available at compile time.

In `@multiversion-tests/src/binary.rs`:
- Around line 101-115: The try_env_override function currently calls
self.git_rev("HEAD") and embeds that value in the ResolvedBinary, which is
misleading for environment-provided binaries; change try_env_override to skip
calling self.git_rev and instead set the revision to a clear sentinel such as
"env-override" (or "unknown") when creating the ResolvedBinary via
resolved(label, rev, path), and ensure any error path still returns
BinaryError::NotFound for missing paths rather than attempting to resolve a git
rev.

In `@multiversion-tests/src/cluster.rs`:
- Around line 208-211: The function node_api_url currently indexes port_map with
port_map[name], which will panic if name is missing; change node_api_url to
perform a defensive lookup using port_map.get(name) and return a Result (e.g.,
Result<String, Error>/anyhow::Error) rather than unconditionally returning
String, and on None return an Err with a clear message including the missing
node name and type NodePorts; ensure callers of node_api_url are updated to
propagate or handle the Result and continue to use
crate::node_api_base(ports.api) when the lookup succeeds.

In `@multiversion-tests/src/fault/network.rs`:
- Around line 16-30: The current remove_all_rules() called from
NetworkPartitioner::drop flushes the host's default filter table and can break
unrelated networking; instead, modify the injector to create and use a dedicated
chain (e.g., "MULTIVERSION_TEST_CHAIN") when installing rules and change
remove_all_rules() to only list and delete rules from that dedicated chain and
then delete the chain itself (use iptables -N when creating rules, -F and -X
only on that chain or delete per-rule by comment/identifier), ensuring
NetworkPartitioner::drop calls this scoped cleanup; update any code that
installs rules to add them to the dedicated chain and avoid calling "iptables
-F" without specifying the chain.
- Around line 37-44: When adding the two DROP rules (calls to
add_drop_rule(source_port, dest_port) and add_drop_rule(dest_port,
source_port)), ensure that if the second add_drop_rule fails you undo the first
rule before returning the error: after awaiting the second add_drop_rule, on Err
capture the error, call sync_remove_drop_rule(source_port, dest_port) to roll
back the first insert (propagate any removal error or prefer returning the
original add error), and then return the original error; keep the existing
FaultGuard::new closure that removes both rules for the success path.

In `@multiversion-tests/src/ports.rs`:
- Around line 15-27: NodePorts::allocate() currently binds three TcpListeners
then drops them, so ports aren’t reserved; change the approach so sockets remain
open until the child process is actually listening or let the child bind :0 and
report its chosen ports back. Concretely, modify NodePorts::allocate() to return
and store the TcpListener instances (or a new struct holding listeners and port
numbers) instead of dropping them, update Cluster::start to hold those listeners
and pass/transfer/dup them to the child or wait for the child to accept on them
before dropping, and update the test consecutive_allocations_do_not_collide to
either read reported ports from the child or use the reserved listeners to avoid
races.

In `@multiversion-tests/src/probe.rs`:
- Around line 146-169: In poll_until, move or add the condition evaluation so it
runs before declaring timeout: call the provided check().await at the top of the
loop (or, when start.elapsed() > timeout, perform one final check().await) and
only return ProbeError::Timeout after that final attempt; update the loop in
poll_until to ensure check() is invoked before returning Err(ProbeError::Timeout
{ condition: condition_name.to_owned(), elapsed: timeout }) so a condition that
becomes true at the timeout boundary can still succeed.

In `@multiversion-tests/src/process.rs`:
- Around line 90-94: The kill() method currently sends SIGKILL then drops
self.child, which can leave a zombie; update kill() to take the existing child
handle (e.g. let mut child = self.child.take().ok_or(...)?), send SIGKILL using
signal::kill(self.pid()?, ...), then explicitly wait for the child to terminate
(use child.wait() or a small try_wait loop with a timeout/retries) before
returning Ok(()); ensure you still clear self.child only after wait succeeds or
after timeout. Also update respawn() to await the completion of start_kill() (or
join its handle/future) before spawning a new process so the old PID is reaped
before creating the replacement. Ensure references to pid(), child, kill(),
start_kill(), and respawn() are used to locate the changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7e529c48-c3b3-4f5d-aad5-50f3e99c7188

📥 Commits

Reviewing files that changed from the base of the PR and between 6722270 and fd7698b.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • multiversion-tests/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • .gitignore
  • multiversion-tests/Cargo.toml
  • multiversion-tests/claude/prompts.md
  • multiversion-tests/fixtures/base-config.toml
  • multiversion-tests/rust-toolchain.toml
  • multiversion-tests/src/binary.rs
  • multiversion-tests/src/cluster.rs
  • multiversion-tests/src/config.rs
  • multiversion-tests/src/fault/crash.rs
  • multiversion-tests/src/fault/mod.rs
  • multiversion-tests/src/fault/network.rs
  • multiversion-tests/src/lib.rs
  • multiversion-tests/src/ports.rs
  • multiversion-tests/src/probe.rs
  • multiversion-tests/src/process.rs

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

♻️ Duplicate comments (3)
multiversion-tests/src/ports.rs (1)

15-27: ⚠️ Potential issue | 🟠 Major

These ports are only sampled, not reserved.

After Line 26 the listeners are gone, so the returned ports are immediately available to other processes again. Because multiversion-tests/src/cluster.rs::Cluster::start() allocates the full map before it spawns children, startup can still race into intermittent EADDRINUSE here. Keep the listeners alive until each child has bound, or let the child bind :0 and report the chosen ports back.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/ports.rs` around lines 15 - 27, The allocate()
function in multiversion-tests::ports currently drops its TcpListener handles
before returning sampled ports, which allows races (EADDRINUSE) when
Cluster::start() spawns children; keep the listeners alive until children have
bound or change the approach so children bind to ":0" and report back. Fix by
changing ports::allocate to return or hold the TcpListener instances (e.g.,
return a struct with fields api_listener, gossip_listener, reth_network_listener
alongside ports) and ensure Cluster::start uses those listeners to
accept/transfer the bound port ownership to children (or modify Cluster::start
to spawn children that bind to ":0" and send their chosen ports back), updating
any code that references Ports::allocate to consume the listener-backed struct
or the post-bind reported port values.
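The listener-holding fix suggested above can be sketched std-only as follows. The `ReservedPorts` struct and its field names are illustrative stand-ins for whatever the crate adopts, not its actual types:

```rust
use std::net::TcpListener;

/// Sketch of the suggested fix: keep the listeners alive alongside the
/// port numbers so the ports stay reserved until handed to a child.
struct ReservedPorts {
    api: TcpListener,
    gossip: TcpListener,
    reth_network: TcpListener,
}

impl ReservedPorts {
    fn allocate() -> std::io::Result<Self> {
        Ok(Self {
            api: TcpListener::bind("127.0.0.1:0")?,
            gossip: TcpListener::bind("127.0.0.1:0")?,
            reth_network: TcpListener::bind("127.0.0.1:0")?,
        })
    }

    fn ports(&self) -> std::io::Result<(u16, u16, u16)> {
        Ok((
            self.api.local_addr()?.port(),
            self.gossip.local_addr()?.port(),
            self.reth_network.local_addr()?.port(),
        ))
    }
}

fn main() -> std::io::Result<()> {
    let reserved = ReservedPorts::allocate()?;
    let (api, gossip, reth) = reserved.ports()?;
    // While `reserved` is alive, rebinding the same port fails (EADDRINUSE),
    // which is exactly the reservation guarantee the original code lacks.
    assert!(TcpListener::bind(("127.0.0.1", api)).is_err());
    println!("reserved api={api} gossip={gossip} reth={reth}");
    Ok(())
}
```

The remaining design question, as the comment notes, is when to release each listener: just before the child binds (a much smaller race window) or by transferring the bound socket to the child outright.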
multiversion-tests/src/process.rs (1)

98-101: ⚠️ Potential issue | 🟠 Major

Wait for the old child to exit before clearing or replacing it.

kill() clears self.child immediately after sending SIGKILL, and respawn() can launch the replacement immediately after requesting the old kill. In multiversion-tests/tests/upgrade.rs Lines 174-183, the crash-upgrade path does exactly that sequence, so the new node can race the old one's socket teardown and fail intermittently on rebinding. Reap the old child, or add a kill_and_wait() path for restart flows, before treating the slot as free.

Also applies to: 139-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/process.rs` around lines 98 - 101, The kill()
implementation sends SIGKILL then clears self.child immediately, which allows
respawn() to reuse the slot before the old process has exited; change kill() to
wait for/reap the killed child before clearing self.child (or add a separate
kill_and_wait() used by restart flows), i.e. after signal::kill(self.pid()?,
Signal::SIGKILL) call perform a blocking or timed wait on the child (using the
same Process/Child wait API you use elsewhere) and only set self.child = None
once the child has been reaped; ensure proper handling of wait errors
(EINTR/ESRCH/timeouts) and return a ProcessError on failure so respawn() sees
the slot as free only after the old process is gone.
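A minimal std-only sketch of the kill-then-reap sequence suggested here (the real code uses tokio::process; `kill_and_wait` is the hypothetical helper named in the comment):

```rust
use std::process::{Child, Command};

/// Sketch: kill a child and reap it before freeing the slot, so no zombie
/// survives and respawn() never races the old process. (Illustrative;
/// the crate's async version would await the wait instead of blocking.)
fn kill_and_wait(child: &mut Option<Child>) -> std::io::Result<()> {
    if let Some(mut c) = child.take() {
        c.kill()?; // sends SIGKILL on Unix
        c.wait()?; // reap: only now is the PID and its sockets truly gone
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut slot = Some(Command::new("sleep").arg("5").spawn()?);
    kill_and_wait(&mut slot)?;
    assert!(slot.is_none()); // slot cleared only after the child was reaped
    println!("reaped");
    Ok(())
}
```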
multiversion-tests/src/probe.rs (1)

165-176: ⚠️ Potential issue | 🟡 Minor

Do one final probe before timing out.

The loop checks start.elapsed() before it runs check().await, so a condition that flips right at the deadline still returns Timeout. That makes the readiness and convergence waiters more flaky than they need to be.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/probe.rs` around lines 165 - 176, The loop currently
tests start.elapsed() against timeout before calling check(), which makes
conditions that become true exactly at the deadline return ProbeError::Timeout;
modify the probe loop so it performs the asynchronous check() at least once
after the timeout check (e.g., call check().await before returning Timeout or
rearrange to check first then test elapsed), ensuring the final check can
succeed and, if it still returns None, then return Err(ProbeError::Timeout {
condition: condition_name.to_owned(), elapsed: timeout }). Reference the
existing symbols start, timeout, check(), condition_name and ProbeError::Timeout
when making this change.
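The corrected loop order can be sketched with a synchronous stand-in for `poll_until` (the real function is async; this only demonstrates the check-before-deadline ordering):

```rust
use std::time::{Duration, Instant};

/// Sketch of the fixed ordering: run check() first, then test the deadline,
/// so a condition that flips right at the boundary still gets one final
/// attempt instead of returning Timeout.
fn poll_until(
    mut check: impl FnMut() -> bool,
    timeout: Duration,
    poll_interval: Duration,
) -> Result<(), String> {
    let start = Instant::now();
    loop {
        if check() {
            return Ok(()); // final attempt happens even at the deadline
        }
        if start.elapsed() >= timeout {
            return Err("timeout".into());
        }
        std::thread::sleep(poll_interval);
    }
}

fn main() {
    let mut calls = 0;
    let result = poll_until(
        || {
            calls += 1;
            calls >= 3 // condition becomes true on the third probe
        },
        Duration::from_millis(200),
        Duration::from_millis(10),
    );
    assert!(result.is_ok());
    let never = poll_until(|| false, Duration::from_millis(20), Duration::from_millis(5));
    assert!(never.is_err());
    println!("poll loop ok");
}
```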
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/multiversion.yml:
- Around line 43-45: Update the "Run multiversion tests" step: modify the run
command string "cargo test --test '*' -- --ignored --test-threads=1" to append
--nocapture so test output streams to the CI logs for debugging, and add an
explicit timeout (e.g., set timeout-minutes: 30) on the step/job to avoid
indefinite runs; target the step named "Run multiversion tests" and the run
invocation when applying the change.

In
`@multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md`:
- Line 76: Replace the hardcoded machine-specific cd command that uses
/home/glottologist/.../multiversion-testing with a portable alternative: either
change it to a relative path (e.g., cd multiversion-testing) or use the
repository root lookup $(git rev-parse --show-toplevel) to compute the project
root and cd into the multiversion-testing directory; update the line containing
the literal path string so it no longer references the developer's home
directory.

In `@multiversion-tests/src/binary.rs`:
- Around line 103-107: The current sequence (create_worktree, build_at,
remove_worktree, result?) can leak the worktree if build_at fails and
remove_worktree also fails; change the flow to always attempt removal of the
worktree regardless of build_at outcome (use a scope guard/defer/Drop-style
guard or explicit match) so remove_worktree(&worktree_path).await is executed in
a finally-like block; capture the build error from build_at (e.g., let build_err
= result.err()) and then call remove_worktree, and if removal fails combine or
log the removal error but return the original build error (or wrap both) so the
primary failure (build_at) is preserved while still ensuring cleanup; reference
functions/idents: create_worktree, build_at, remove_worktree, worktree_path,
cached.

In `@multiversion-tests/src/cluster.rs`:
- Around line 156-164: allocate_ports currently calls NodePorts::allocate while
inserting into port_map, which silently overwrites duplicates; first validate
uniqueness of NodeSpec.name for the entire nodes slice before allocating any
resources and return an error if a duplicate is found. In practice, add a
pre-check in allocate_ports that iterates nodes, uses a HashSet to detect
duplicates of NodeSpec.name, and if a duplicate is seen return an appropriate
ClusterError (include the offending NodeSpec.name in the error message) so
allocation via NodePorts::allocate() only runs when all names are unique.
- Around line 89-100: The convergence check is missing crashed nodes because
api_urls() filters out nodes where is_running() is false; update api_urls() to
return every node's api_url() (i.e., stop filtering by is_running()) so the
probe receives the full set of node endpoints (use values().map(|n|
n.api_url()).collect()), keeping the probe responsible for detecting
unreachable/down endpoints; reference api_urls(), is_running(), and api_url()
when making the change.
- Around line 217-226: The current peers.iter().filter_map(...) silently drops
unknown peer names; change this to validate every peer and return a ClusterError
when a peer name is missing from port_map instead of filtering it out. Locate
the code that builds the trusted peers (the iterator over peers and the creation
of PeerEndpoint using port_map.get) and replace the filter_map logic with a
mapping that yields a Result<PeerEndpoint, ClusterError> (or collects into
Result<Vec<PeerEndpoint>, ClusterError>), producing an explicit ClusterError if
port_map.get(peer_name) is None; ensure the function signature propagates or
returns that ClusterError so missing peer references fail fast.

In `@multiversion-tests/src/fault/crash.rs`:
- Around line 27-29: The extract_pid function must reject non-positive PIDs
before they reach kill: in extract_pid (handling FaultTarget::Process { pid })
check pid.as_raw() > 0 and return an Err(FaultError) (with an explanatory
message) when pid.as_raw() <= 0 instead of returning Ok(*pid); update any
callers/tests accordingly and add a regression test that constructs
FaultTarget::Process with raw 0 and a negative raw PID to assert the function
returns the error (ensuring we never propagate Pid::from_raw(0) or negative
values to kill).

In `@multiversion-tests/src/fault/network.rs`:
- Around line 19-23: The Drop impl for NetworkPartitioner currently calls
sync_cleanup_chain unconditionally, which can remove rules for other active
instances; modify NetworkPartitioner to track per-instance ownership and only
clean up when appropriate (e.g., add an active: bool field set to true when this
instance creates rules and only call sync_cleanup_chain in drop if active is
true), or implement a global AtomicUsize reference counter (increment when rules
are installed, decrement in Drop, and call sync_cleanup_chain only when the
counter reaches zero); update the NetworkPartitioner constructor/installer and
the Drop impl accordingly to use the chosen mechanism.

In `@multiversion-tests/tests/e2e.rs`:
- Around line 7-8: Extract the two timeout constants HEIGHT_TIMEOUT and
CONVERGENCE_TIMEOUT out of multiversion-tests/tests/e2e.rs into the shared test
utilities module (common) as pub consts, then replace the local definitions by
referencing the shared constants (e.g., common::HEIGHT_TIMEOUT and
common::CONVERGENCE_TIMEOUT) so all tests reuse the same values; ensure the
common module is imported where needed and update any docs/comments accordingly.
- Around line 10-12: Replace the manual #[tokio::test(flavor = "multi_thread")]
attribute and any explicit init_tracing() calls in the async tests with the
tracing-aware attribute #[test_log::test(tokio::test)] for
single_genesis_produces_blocks and the other two async tests in this file;
specifically, update the test attributes (e.g., on
single_genesis_produces_blocks) to #[test_log::test(tokio::test)] and remove the
separate init_tracing() invocation so tracing is initialized automatically by
the test macro.

In `@multiversion-tests/tests/upgrade.rs`:
- Around line 22-26: Replace the raw tokio test attributes and manual tracing
initialization in the async tests with the project-standard test_log wrapper:
change the attribute on upgrade_one_node_in_running_cluster (and the other async
test functions in this file) from #[tokio::test(flavor = "multi_thread")] to
#[test_log::test(tokio::test(flavor = "multi_thread"))], and remove the call to
common::init_tracing() at the start of each test so the test_log subscriber
handles tracing output instead.

---

Duplicate comments:
In `@multiversion-tests/src/ports.rs`:
- Around line 15-27: The allocate() function in multiversion-tests::ports
currently drops its TcpListener handles before returning sampled ports, which
allows races (EADDRINUSE) when Cluster::start() spawns children; keep the
listeners alive until children have bound or change the approach so children
bind to ":0" and report back. Fix by changing ports::allocate to return or hold
the TcpListener instances (e.g., return a struct with fields api_listener,
gossip_listener, reth_network_listener alongside ports) and ensure
Cluster::start uses those listeners to accept/transfer the bound port ownership
to children (or modify Cluster::start to spawn children that bind to ":0" and
send their chosen ports back), updating any code that references Ports::allocate
to consume the listener-backed struct or the post-bind reported port values.

In `@multiversion-tests/src/probe.rs`:
- Around line 165-176: The loop currently tests start.elapsed() against timeout
before calling check(), which makes conditions that become true exactly at the
deadline return ProbeError::Timeout; modify the probe loop so it performs the
asynchronous check() at least once after the timeout check (e.g., call
check().await before returning Timeout or rearrange to check first then test
elapsed), ensuring the final check can succeed and, if it still returns None,
then return Err(ProbeError::Timeout { condition: condition_name.to_owned(),
elapsed: timeout }). Reference the existing symbols start, timeout, check(),
condition_name and ProbeError::Timeout when making this change.

In `@multiversion-tests/src/process.rs`:
- Around line 98-101: The kill() implementation sends SIGKILL then clears
self.child immediately, which allows respawn() to reuse the slot before the old
process has exited; change kill() to wait for/reap the killed child before
clearing self.child (or add a separate kill_and_wait() used by restart flows),
i.e. after signal::kill(self.pid()?, Signal::SIGKILL) call perform a blocking or
timed wait on the child (using the same Process/Child wait API you use
elsewhere) and only set self.child = None once the child has been reaped; ensure
proper handling of wait errors (EINTR/ESRCH/timeouts) and return a ProcessError
on failure so respawn() sees the slot as free only after the old process is
gone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: fc73493e-fac4-4ed6-84fc-2644115bee33

📥 Commits

Reviewing files that changed from the base of the PR and between fd7698b and ac363bb.

📒 Files selected for processing (18)
  • .github/workflows/multiversion.yml
  • .gitignore
  • multiversion-tests/Cargo.toml
  • multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md
  • multiversion-tests/fixtures/base-config.toml
  • multiversion-tests/src/binary.rs
  • multiversion-tests/src/cluster.rs
  • multiversion-tests/src/config.rs
  • multiversion-tests/src/fault/crash.rs
  • multiversion-tests/src/fault/network.rs
  • multiversion-tests/src/lib.rs
  • multiversion-tests/src/ports.rs
  • multiversion-tests/src/probe.rs
  • multiversion-tests/src/process.rs
  • multiversion-tests/tests/common/mod.rs
  • multiversion-tests/tests/e2e.rs
  • multiversion-tests/tests/upgrade.rs
  • xtask/src/main.rs

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (6)
multiversion-tests/src/cluster.rs (3)

251-264: ⚠️ Potential issue | 🟠 Major

Fail fast on unknown peer references.

filter_map silently drops any NodeSpec.peers entry that is missing from port_map. That starts the node with the wrong topology and turns a simple typo into a later readiness or convergence timeout. Make this return Result<Vec<PeerEndpoint>, ClusterError> and error on the first unknown peer name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/cluster.rs` around lines 251 - 264, The
build_peer_endpoints function currently drops unknown peer names via filter_map;
change its signature to return Result<Vec<PeerEndpoint>, ClusterError> and
iterate over peers, looking up each peer_name in port_map; if a lookup fails,
return an Err(ClusterError::UnknownPeer(peer_name.clone())) (or appropriate
ClusterError variant), otherwise push the constructed PeerEndpoint (using
NodePorts fields gossip, api, reth_network) into a Vec and return Ok(vec) at the
end so the caller fails fast on missing peer references instead of silently
omitting them.
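The fail-fast shape suggested above can be sketched like this, with simplified stand-ins for `PeerEndpoint` and `ClusterError` (the real types carry more fields and variants):

```rust
use std::collections::HashMap;

/// Sketch of the suggested fix: unknown peer names become an error instead
/// of being silently dropped by filter_map. (Types are simplified.)
#[derive(Debug, PartialEq)]
struct PeerEndpoint {
    gossip_port: u16,
}

#[derive(Debug, PartialEq)]
enum ClusterError {
    UnknownPeer(String),
}

fn build_peer_endpoints(
    peers: &[String],
    port_map: &HashMap<String, u16>,
) -> Result<Vec<PeerEndpoint>, ClusterError> {
    peers
        .iter()
        .map(|name| {
            port_map
                .get(name)
                .map(|&gossip_port| PeerEndpoint { gossip_port })
                .ok_or_else(|| ClusterError::UnknownPeer(name.clone()))
        })
        .collect() // collects into Result, stopping at the first error
}

fn main() {
    let port_map = HashMap::from([("genesis".to_owned(), 9000)]);
    let ok = build_peer_endpoints(&["genesis".to_owned()], &port_map);
    assert_eq!(ok, Ok(vec![PeerEndpoint { gossip_port: 9000 }]));
    let err = build_peer_endpoints(&["typo".to_owned()], &port_map);
    assert_eq!(err, Err(ClusterError::UnknownPeer("typo".to_owned())));
    println!("peer lookup ok");
}
```

Collecting an iterator of `Result`s into `Result<Vec<_>, _>` is the idiomatic way to get the first-error short-circuit the comment asks for.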

89-99: ⚠️ Potential issue | 🟠 Major

Don't drop crashed nodes from convergence polling.

wait_for_convergence() only probes the URLs returned here. Filtering on is_running() lets a dead node disappear from the convergence set, so the cluster can look healthy while one member is down. Return every node's api_url() and let the probe treat unreachable nodes as non-converged.

Suggested change
-    pub fn api_urls(&mut self) -> Vec<String> {
-        self.nodes
-            .values_mut()
-            .filter_map(|n| {
-                if n.is_running() {
-                    Some(n.api_url())
-                } else {
-                    None
-                }
-            })
-            .collect()
+    pub fn api_urls(&self) -> Vec<String> {
+        self.nodes.values().map(|n| n.api_url()).collect()
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/cluster.rs` around lines 89 - 99, The api_urls()
helper currently filters out nodes where n.is_running(), causing
wait_for_convergence() to never probe crashed members; change api_urls() (the
function that iterates self.nodes.values_mut()) to return every node.api_url()
unconditionally (i.e., map every node to n.api_url() and collect) so convergence
polling sees unreachable nodes as non‑converged; update the iterator logic to
remove the is_running() conditional and return all URLs.

156-163: ⚠️ Potential issue | 🟠 Major

Reject duplicate node names before allocating anything.

HashMap::insert silently overwrites the earlier entry here. The rest of the module also keys nodes, port_map, config-{name}.toml, and data-{name} by NodeSpec.name, so a duplicate can leave one process untracked while both instances contend for the same paths and ports. Validate uniqueness first and return a ClusterError.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/cluster.rs` around lines 156 - 163, The allocate_ports
function currently inserts into port_map while allocating per node, allowing
later NodeSpec.name duplicates to silently overwrite earlier entries; before any
NodePorts::allocate calls, validate that all NodeSpec.name values in the nodes
slice are unique (e.g. collect names into a set and detect duplicates) and
return a ClusterError if any duplicate is found so no ports or resources are
allocated for conflicting names; only if uniqueness passes, proceed to allocate
NodePorts and insert into port_map as currently implemented.
multiversion-tests/src/process.rs (1)

141-150: ⚠️ Potential issue | 🟠 Major

Wait for the old child to exit before respawning.

start_kill() only initiates termination. Spawning the replacement immediately after that can race with the old node still holding the same ports or data directory, and this path still leaves reaping to a best-effort drop. Make respawn() async and await the old child's exit, or route this through the existing kill()/shutdown() paths first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/process.rs` around lines 141 - 150, The respawn()
function currently calls start_kill() (which only initiates termination) and
immediately spawns a new child, causing races on ports/data; change respawn()
into an async function and ensure the old child is fully exited before spawning
the replacement: either await the child's exit after start_kill() (or call the
existing async kill()/shutdown() path and await its completion) and only after
confirming the old child process has terminated, call spawn_child_with_logs to
set self.child and self.log_rx; reference respawn(), start_kill(),
kill()/shutdown(), spawn_child_with_logs, and the self.child/self.log_rx fields
when applying the change.
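The ordering being asked for (initiate kill, wait for the child to be reaped, only then spawn the replacement) can be shown with a synchronous `std::process` sketch; the real fix would use the crate's async `kill()`/`shutdown()` paths and `tokio::process`, but the sequencing is the same:

```rust
use std::process::{Child, Command};

// Kill the old child, then *wait* for the OS to reap it before starting
// the replacement, so the new process never races the old one for
// ports or the data directory.
fn respawn(old: &mut Child, program: &str) -> std::io::Result<Child> {
    let _ = old.kill(); // initiate termination (analogous to start_kill())
    old.wait()?; // block until the child has actually exited
    Command::new(program).spawn() // only now is it safe to respawn
}
```

With `tokio::process::Child`, the equivalent is awaiting `child.wait()` after `start_kill()` before calling `spawn_child_with_logs`, which is why the review asks for `respawn()` to become `async`.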
multiversion-tests/tests/upgrade.rs (1)

22-26: 🧹 Nitpick | 🔵 Trivial

Use test_log instead of manual tracing setup in these async tests.

These tests still pair raw #[tokio::test(...)] with common::init_tracing(). Switching to #[test_log::test(tokio::test(...))] keeps tracing setup consistent and removes the per-test subscriber boilerplate.

As per coding guidelines "Use #[test_log::test(tokio::test)] for async tests with tracing output in Rust".

Also applies to: 62-66, 106-110, 155-159

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/tests/upgrade.rs` around lines 22 - 26, Replace the manual
tracing setup in async tests by swapping the #[tokio::test(...)] attribute for
#[test_log::test(tokio::test(...))] on each async test (e.g.,
upgrade_one_node_in_running_cluster and the other async tests in this file that
call common::init_tracing()), and remove the call to common::init_tracing() at
the start of those test bodies so the test_log crate initializes tracing
automatically.
multiversion-tests/src/probe.rs (1)

177-188: ⚠️ Potential issue | 🟡 Minor

poll_until() can report timeout one iteration too early.

The loop returns Timeout before making a final check().await on the iteration that crosses the deadline. That turns a boundary success into a false negative.

Suggested change
     let start = tokio::time::Instant::now();
     loop {
-        if start.elapsed() > timeout {
-            return Err(ProbeError::Timeout {
-                condition: condition_name.to_owned(),
-                elapsed: timeout,
-            });
-        }
         if let Some(result) = check().await {
             return Ok(result);
         }
+        if start.elapsed() >= timeout {
+            return Err(ProbeError::Timeout {
+                condition: condition_name.to_owned(),
+                elapsed: timeout,
+            });
+        }
         tokio::time::sleep(interval).await;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/probe.rs` around lines 177 - 188, The poll_until loop
returns Timeout before performing a final check when the deadline is crossed;
move the timeout check to after the awaited check so the code calls
check().await first and only returns ProbeError::Timeout (using
ProbeError::Timeout, condition_name, start, timeout) if check() yields None and
the elapsed time exceeded the deadline, then sleep and repeat; ensure the
reordered logic still sleeps on intervals and returns Ok when check() returns
Some.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 195-198: The code currently panics via expect when genesis_api_url
is None while peer_specs is non-empty; change Cluster::start() to return a
proper ClusterError instead of panicking: check genesis_api_url.as_deref() and
if None return Err(ClusterError::InvalidSpec("peers present but no genesis node"
or similar)), updating the match/if branch around peer_specs and genesis_api_url
to propagate that Result; reference the Cluster::start() function, the
peer_specs and genesis_api_url variables, and construct/return an appropriate
ClusterError variant rather than calling expect().

In `@multiversion-tests/src/process.rs`:
- Line 42: log_rx is currently an unbounded receiver created with
mpsc::unbounded_channel(), which retains unlimited log history; change to a
bounded channel (e.g., mpsc::channel(BUFFER_SIZE)) or implement a fixed-size
circular buffer to cap memory. Replace the unbounded creation with a bounded
sender/receiver pair and update the code that forwards stdout/stderr to use
try_send (or async send) and handle Full by either dropping the new message or
evicting the oldest entry into a circular VecDeque held on the struct; also
adjust drain_logs() to read from the bounded receiver and optionally flush from
the in-memory circular buffer so callers still get recent logs. Ensure you
update the log_rx field type and any send sites to match the bounded
sender/receiver API and preserve behavior when the channel is full.

---

Duplicate comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 251-264: The build_peer_endpoints function currently drops unknown
peer names via filter_map; change its signature to return
Result<Vec<PeerEndpoint>, ClusterError> and iterate over peers, looking up each
peer_name in port_map; if a lookup fails, return an
Err(ClusterError::UnknownPeer(peer_name.clone())) (or appropriate ClusterError
variant), otherwise push the constructed PeerEndpoint (using NodePorts fields
gossip, api, reth_network) into a Vec and return Ok(vec) at the end so the
caller fails fast on missing peer references instead of silently omitting them.
- Around line 89-99: The api_urls() helper currently filters out nodes where
n.is_running(), causing wait_for_convergence() to never probe crashed members;
change api_urls() (the function that iterates self.nodes.values_mut()) to return
every node.api_url() unconditionally (i.e., map every node to n.api_url() and
collect) so convergence polling sees unreachable nodes as non‑converged; update
the iterator logic to remove the is_running() conditional and return all URLs.
- Around line 156-163: The allocate_ports function currently inserts into
port_map while allocating per node, allowing later NodeSpec.name duplicates to
silently overwrite earlier entries; before any NodePorts::allocate calls,
validate that all NodeSpec.name values in the nodes slice are unique (e.g.
collect names into a set and detect duplicates) and return a ClusterError if any
duplicate is found so no ports or resources are allocated for conflicting names;
only if uniqueness passes, proceed to allocate NodePorts and insert into
port_map as currently implemented.

In `@multiversion-tests/src/probe.rs`:
- Around line 177-188: The poll_until loop returns Timeout before performing a
final check when the deadline is crossed; move the timeout check to after the
awaited check so the code calls check().await first and only returns
ProbeError::Timeout (using ProbeError::Timeout, condition_name, start, timeout)
if check() yields None and the elapsed time exceeded the deadline, then sleep
and repeat; ensure the reordered logic still sleeps on intervals and returns Ok
when check() returns Some.

In `@multiversion-tests/src/process.rs`:
- Around line 141-150: The respawn() function currently calls start_kill()
(which only initiates termination) and immediately spawns a new child, causing
races on ports/data; change respawn() into an async function and ensure the old
child is fully exited before spawning the replacement: either await the child's
exit after start_kill() (or call the existing async kill()/shutdown() path and
await its completion) and only after confirming the old child process has
terminated, call spawn_child_with_logs to set self.child and self.log_rx;
reference respawn(), start_kill(), kill()/shutdown(), spawn_child_with_logs, and
the self.child/self.log_rx fields when applying the change.

In `@multiversion-tests/tests/upgrade.rs`:
- Around line 22-26: Replace the manual tracing setup in async tests by swapping
the #[tokio::test(...)] attribute for #[test_log::test(tokio::test(...))] on
each async test (e.g., upgrade_one_node_in_running_cluster and the other async
tests in this file that call common::init_tracing()), and remove the call to
common::init_tracing() at the start of those test bodies so the test_log crate
initializes tracing automatically.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e9094e81-cd66-40e8-a4c6-a625e8190659

📥 Commits

Reviewing files that changed from the base of the PR and between ac363bb and 19a201d.

📒 Files selected for processing (5)
  • multiversion-tests/src/cluster.rs
  • multiversion-tests/src/config.rs
  • multiversion-tests/src/probe.rs
  • multiversion-tests/src/process.rs
  • multiversion-tests/tests/upgrade.rs

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (4)
multiversion-tests/src/process.rs (2)

42-42: ⚠️ Potential issue | 🟠 Major

Log capture still has no memory bound.

mpsc::unbounded_channel() retains every stdout/stderr line until drain_logs() is called. Successful multiversion runs can keep nodes alive for a long time and never drain, so a chatty binary can grow memory without limit. Use a bounded channel or a fixed-size ring buffer and keep only the recent tail.

Tokio 1.x documentation for `tokio::sync::mpsc::unbounded_channel`: does it impose any capacity limit or backpressure when the receiver is not draining?

Also applies to: 158-164, 197-220

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/process.rs` at line 42, The log receiver field log_rx
currently uses an unbounded channel which can grow without limit; change it to a
bounded channel (e.g., tokio::sync::mpsc::channel with a reasonable capacity) or
implement a fixed-size ring buffer to only retain the most recent N log lines,
and adjust all places that create the channel (where unbounded_channel() is
called) and where messages are sent/received (including drain_logs() and any
spawn/tasks that push logs) to handle TrySend/await send backpressure or
drop/overwrite oldest entries; ensure types and error handling in functions
interacting with log_rx (e.g., drain_logs, log producer tasks) are updated
accordingly.
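The bounded-channel idea can be sketched with `std::sync::mpsc::sync_channel`, which mirrors the tokio pattern (`tokio::sync::mpsc::channel(CAP)` plus `Sender::try_send`) without the async machinery. Dropping the newest line on `Full` is one of the two eviction policies the finding mentions; evicting the oldest into a `VecDeque` is the other:

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

// Illustrative bounded log pipe: try_send drops the newest line when the
// buffer is full, so memory stays capped even if nobody drains the logs.
fn bounded_log_pipe(cap: usize) -> (SyncSender<String>, Receiver<String>) {
    sync_channel(cap)
}

fn forward_line(tx: &SyncSender<String>, line: String) {
    if let Err(TrySendError::Full(_)) = tx.try_send(line) {
        // Buffer full: drop this line rather than grow without bound.
        // An oldest-wins policy would instead pop from a ring buffer here.
    }
}
```

`drain_logs()` would then read from the bounded receiver exactly as it does today; the only behavioral change is that a chatty binary can no longer grow memory indefinitely between drains.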

145-153: ⚠️ Potential issue | 🟠 Major

Wait for the old process to exit before respawning.

respawn() calls start_kill() and immediately spawns the replacement. The stub test still passes because it never binds ports, but the real node binary reuses the same API/gossip/reth sockets, so this can intermittently fail with address-in-use until the previous child actually exits. Make respawn() async and wait for termination, or require callers to stop the process first.

Tokio 1.x documentation for `tokio::process::Child::start_kill`: does it wait for the child to exit, or must callers still await `wait()` / `try_wait()` before reusing resources such as bound ports?

Also applies to: 303-311

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/process.rs` around lines 145 - 153, The respawn()
function currently calls start_kill() on the old child and immediately spawns a
new child (spawn_child_with_logs), causing port/address conflicts; change
respawn to be async (pub async fn respawn(&mut self) -> Result<(),
ProcessError>) and after calling start_kill() on the taken child await its
termination (call old.wait().await or loop on old.try_wait() until Some) before
calling spawn_child_with_logs; ensure you still log start_kill() errors and
handle/propagate any wait() errors so the new child is only spawned once the
previous child has fully exited (references: respawn, start_kill,
spawn_child_with_logs, child / old).
multiversion-tests/src/cluster.rs (2)

93-118: ⚠️ Potential issue | 🟠 Major

Convergence can still look healthy after a node crashes.

api_urls() filters out nodes whose is_running() flag has flipped to false, and wait_for_convergence() probes only that reduced set. A node that dies after startup can disappear from the health check instead of making convergence fail. Return every node URL here and let HttpProbe surface the unreachable one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/cluster.rs` around lines 93 - 118, api_urls()
currently filters out nodes where is_running() is false so
wait_for_convergence() only probes running nodes; change api_urls() to return
all node URLs (use nodes.values() or values_mut() without the is_running()
filter and collect api_url() for every node) so HttpProbe
(probe.wait_for_convergence) receives the full set and will surface
unreachable/dead nodes; update any callers assuming filtered results if
necessary (refer to api_urls, wait_for_convergence, nodes, is_running, api_url,
and HttpProbe).

203-206: ⚠️ Potential issue | 🟠 Major

Don't panic on a malformed cluster spec.

Line 206 uses expect() when peers exist without any genesis node, which aborts the whole harness even though Cluster::start() already returns Result. Please add a typed invalid-spec ClusterError and return it here instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/cluster.rs` around lines 203 - 206, Replace the panic
caused by expect() in Cluster::start when peer_specs is non-empty but
genesis_api_url is None: introduce a typed ClusterError variant (e.g.,
ClusterError::InvalidSpec or ClusterError::MissingGenesis) and return Err(...)
from Cluster::start instead of calling expect(); locate the code that computes
genesis_url (the genesis_api_url.as_deref().expect(...) line) and change it to
use ok_or_else or a manual check to produce and return the new ClusterError with
a clear message referencing the malformed cluster spec and peer_specs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md`:
- Around line 93-97: Remove the contradictory "Staging/committing" / "git-write"
next step from the "Remaining Work" section so the document no longer instructs
agents to perform git write operations; edit the "Remaining Work" bullet list
(the line containing "Staging/committing: When ready, stage all new/modified
files for the PR") to delete that bullet or replace it with a non-write action,
ensuring consistency with the prohibition mentioned elsewhere in the doc (the
statement on Line 120 forbidding git write operations).

In `@multiversion-tests/src/binary.rs`:
- Around line 203-215: The current logic blindly retries a failing
try_cargo_build(...) for "debug-release" as "--release", which masks real build
errors; change the Err(e) branch so it inspects the error from try_cargo_build
and only performs the fallback when the error clearly indicates the profile is
missing (e.g., match on a specific error variant returned by try_cargo_build or
check e.to_string().contains(...) for the Cargo message that a profile is
unknown), otherwise return Err(e) directly; update the match in the block that
sets built to call the fallback only on that specific "profile unavailable"
condition and propagate other errors unchanged.
- Around line 103-109: The code currently logs failures from
remove_worktree(...) and proceeds, which can leave a stale deterministic
worktree (worktree-{rev}) and permanently block future create_worktree(...)
calls; change the behavior in the block that calls
remove_worktree(&worktree_path).await so that on error you propagate a failure
(return Err) instead of swallowing it, or perform a robust cleanup (e.g.,
attempt git worktree prune/remove and only continue if cleanup succeeds) before
proceeding to use build_at(...); specifically update the logic surrounding
worktree_path, create_worktree, remove_worktree, and build_at so remove_worktree
failures cause the function to return an error (or retry/force-prune) to avoid
leaving stale registrations that block future builds.

In `@multiversion-tests/src/cluster.rs`:
- Around line 75-76: The port allocation race happens because
NodePorts::allocate() returns bare u16s (ports) and drops the TcpListeners
immediately, so allocate_ports() hands unreserved ports to
spawn_and_wait_for_nodes(), allowing races when spawn_node() later binds them;
fix by preserving bound listeners until the node processes start (e.g., change
NodePorts::allocate()/allocate_ports() to return owned TcpListener instances or
a NodePorts struct that holds listeners, thread those listeners through
spawn_and_wait_for_nodes() into spawn_node() so the sockets remain reserved
until spawn_node() takes ownership and binds/hand-offs), or alternatively move
the port selection into the child process (call the selection logic inside
spawn_node()) so no temporary unreserved ports are returned.

In `@multiversion-tests/src/fault/network.rs`:
- Around line 147-164: Consolidate the two tests into a single parameterized
rstest that exercises extract_ports for both variants of FaultTarget: supply a
case with FaultTarget::Network { source_port: 8080, dest_port: 9090 } expecting
Ok((8080, 9090)) and a case with FaultTarget::Process { pid: Pid::from_raw(1) }
expecting an Err; replace the separate functions
extract_ports_succeeds_for_network_target and
extract_ports_fails_for_process_target with a single #[rstest] function that
enumerates these inputs and asserts the expected result from
extract_ports(&target).
- Around line 36-48: The ACTIVE_INJECTIONS counter must be incremented before
creating drop rules and must always be decremented on any failure or during
cleanup; move ACTIVE_INJECTIONS.fetch_add(1, Ordering::Release) to before
calling add_drop_rule/add_drop_rule(dest...), and if any add_* fails decrement
the counter before returning error; also ensure the FaultGuard closure always
decrements ACTIVE_INJECTIONS even when sync_remove_drop_rule returns Err (i.e.,
perform removals but ignore/remove errors and always call
ACTIVE_INJECTIONS.fetch_sub(1, Ordering::Release) in a finally-style path) so
the counter cannot be left permanently elevated.

In `@multiversion-tests/src/probe.rs`:
- Around line 67-109: Each getter currently calls .send().await? then
.json().await? which treats 4xx/5xx responses as successes; update get_info,
get_peers, get_network_config and get_genesis_hash to call .error_for_status()?
on the response returned by .send().await? before attempting .json() (i.e., let
resp = self.client.get(url).send().await?.error_for_status()?; then
resp.json::<...>().await?), so HTTP error statuses are converted into ProbeError
failures instead of being deserialized as success.

In `@multiversion-tests/tests/common/mod.rs`:
- Around line 44-56: The peer_spec function silently reuses identities by
wrapping the index with modulo on MINING_KEYS/REWARD_ADDRESSES; change it to
fail fast by validating the provided index against MINING_KEYS.len() and
REWARD_ADDRESSES.len() and returning a clear error or panicking rather than
masking the problem—update peer_spec to check that index < MINING_KEYS.len() and
index < REWARD_ADDRESSES.len() before constructing the NodeSpec (or change the
signature to return Result<NodeSpec, Error> and return Err with a descriptive
message referencing peer_spec, mining_key and reward_address when the index is
out of range).

---

Duplicate comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 93-118: api_urls() currently filters out nodes where is_running()
is false so wait_for_convergence() only probes running nodes; change api_urls()
to return all node URLs (use nodes.values() or values_mut() without the
is_running() filter and collect api_url() for every node) so HttpProbe
(probe.wait_for_convergence) receives the full set and will surface
unreachable/dead nodes; update any callers assuming filtered results if
necessary (refer to api_urls, wait_for_convergence, nodes, is_running, api_url,
and HttpProbe).
- Around line 203-206: Replace the panic caused by expect() in Cluster::start
when peer_specs is non-empty but genesis_api_url is None: introduce a typed
ClusterError variant (e.g., ClusterError::InvalidSpec or
ClusterError::MissingGenesis) and return Err(...) from Cluster::start instead of
calling expect(); locate the code that computes genesis_url (the
genesis_api_url.as_deref().expect(...) line) and change it to use ok_or_else or
a manual check to produce and return the new ClusterError with a clear message
referencing the malformed cluster spec and peer_specs.

In `@multiversion-tests/src/process.rs`:
- Line 42: The log receiver field log_rx currently uses an unbounded channel
which can grow without limit; change it to a bounded channel (e.g.,
tokio::sync::mpsc::channel with a reasonable capacity) or implement a fixed-size
ring buffer to only retain the most recent N log lines, and adjust all places
that create the channel (where unbounded_channel() is called) and where messages
are sent/received (including drain_logs() and any spawn/tasks that push logs) to
handle TrySend/await send backpressure or drop/overwrite oldest entries; ensure
types and error handling in functions interacting with log_rx (e.g., drain_logs,
log producer tasks) are updated accordingly.
- Around line 145-153: The respawn() function currently calls start_kill() on
the old child and immediately spawns a new child (spawn_child_with_logs),
causing port/address conflicts; change respawn to be async (pub async fn
respawn(&mut self) -> Result<(), ProcessError>) and after calling start_kill()
on the taken child await its termination (call old.wait().await or loop on
old.try_wait() until Some) before calling spawn_child_with_logs; ensure you
still log start_kill() errors and handle/propagate any wait() errors so the new
child is only spawned once the previous child has fully exited (references:
respawn, start_kill, spawn_child_with_logs, child / old).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 31cb7c3d-922c-48d2-bf18-ea66d46809ff

📥 Commits

Reviewing files that changed from the base of the PR and between 19a201d and e14b184.

⛔ Files ignored due to path filters (1)
  • multiversion-tests/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/workflows/multiversion.yml
  • multiversion-tests/Cargo.toml
  • multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md
  • multiversion-tests/src/binary.rs
  • multiversion-tests/src/cluster.rs
  • multiversion-tests/src/fault/crash.rs
  • multiversion-tests/src/fault/network.rs
  • multiversion-tests/src/probe.rs
  • multiversion-tests/src/process.rs
  • multiversion-tests/tests/common/mod.rs
  • multiversion-tests/tests/e2e.rs
  • multiversion-tests/tests/upgrade.rs

Comment on lines +93 to +97
### Remaining Work
- **Compilation verification**: Small — just running checks and fixing any issues
- **prompts.md update**: Small — recording session prompts
- **Staging/committing**: When ready, stage all new/modified files for the PR


⚠️ Potential issue | 🟠 Major

Remove the git-write next step.

This tells the next agent to stage changes even though the same document says on Line 120 that git write operations are prohibited. That contradiction is easy to follow in the wrong direction.

🔧 Minimal fix
-- **Staging/committing**: When ready, stage all new/modified files for the PR
+- **Hand-off**: When ready, hand the updated patch back for a human to stage/commit
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### Remaining Work
- **Compilation verification**: Small — just running checks and fixing any issues
- **prompts.md update**: Small — recording session prompts
- **Staging/committing**: When ready, stage all new/modified files for the PR
### Remaining Work
- **Compilation verification**: Small — just running checks and fixing any issues
- **prompts.md update**: Small — recording session prompts
- **Hand-off**: When ready, hand the updated patch back for a human to stage/commit
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 93-93: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@multiversion-tests/claude/2026-03-11-001-learnings-session-handoff-multiversion-phases-implementation.md`
around lines 93 - 97, Remove the contradictory "Staging/committing" /
"git-write" next step from the "Remaining Work" section so the document no
longer instructs agents to perform git write operations; edit the "Remaining
Work" bullet list (the line containing "Staging/committing: When ready, stage
all new/modified files for the PR") to delete that bullet or replace it with a
non-write action, ensuring consistency with the prohibition mentioned elsewhere
in the doc (the statement on Line 120 forbidding git write operations).

Comment on lines +147 to +164
#[test]
fn extract_ports_succeeds_for_network_target() {
let target = FaultTarget::Network {
source_port: 8080,
dest_port: 9090,
};
let (s, d) = extract_ports(&target).unwrap();
assert_eq!(s, 8080);
assert_eq!(d, 9090);
}

#[test]
fn extract_ports_fails_for_process_target() {
let target = FaultTarget::Process {
pid: Pid::from_raw(1),
};
assert!(extract_ports(&target).is_err());
}

🧹 Nitpick | 🔵 Trivial

Consider using rstest for parameterized tests.

The tests could be consolidated using rstest for better coverage and consistency with other test files in this crate (e.g., crash.rs). As per coding guidelines, "Use rstest for parameterized tests."

♻️ Parameterized test example
+    use rstest::rstest;
+
-    #[test]
-    fn extract_ports_succeeds_for_network_target() {
-        let target = FaultTarget::Network {
-            source_port: 8080,
-            dest_port: 9090,
-        };
+    #[rstest]
+    #[case::typical(8080, 9090)]
+    #[case::same_port(443, 443)]
+    #[case::max_ports(65535, 65535)]
+    fn extract_ports_succeeds_for_network_target(#[case] source: u16, #[case] dest: u16) {
+        let target = FaultTarget::Network {
+            source_port: source,
+            dest_port: dest,
+        };
         let (s, d) = extract_ports(&target).unwrap();
-        assert_eq!(s, 8080);
-        assert_eq!(d, 9090);
+        assert_eq!(s, source);
+        assert_eq!(d, dest);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/fault/network.rs` around lines 147 - 164, Consolidate
the two tests into a single parameterized rstest that exercises extract_ports
for both variants of FaultTarget: supply a case with FaultTarget::Network {
source_port: 8080, dest_port: 9090 } expecting Ok((8080, 9090)) and a case with
FaultTarget::Process { pid: Pid::from_raw(1) } expecting an Err; replace the
separate functions extract_ports_succeeds_for_network_target and
extract_ports_fails_for_process_target with a single #[rstest] function that
enumerates these inputs and asserts the expected result from
extract_ports(&target).

@glottologist glottologist force-pushed the jason/multiversion-testing branch from e14b184 to b0b65ef Compare March 13, 2026 11:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (8)
multiversion-tests/src/ports.rs (1)

14-28: ⚠️ Potential issue | 🟠 Major

Port reservation race condition remains unaddressed.

Dropping the listeners at line 26 releases the ports before the node processes bind to them. Between allocate() returning and node startup, another process can claim these ports, causing intermittent test failures.

Consider one of:

  1. Return the TcpListener handles and keep them open until child processes are ready
  2. Have child processes bind to :0 and report their assigned ports back
  3. Use SO_REUSEADDR/SO_REUSEPORT with immediate child binding
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/ports.rs` around lines 14 - 28, The allocate() method
currently binds three TcpListener sockets then drops them (drop((l1,l2,l3))),
releasing ports and causing a race; modify the design so ports remain reserved
until child processes bind: either change NodePorts and allocate() to return or
store the TcpListener handles (e.g., add fields like
api_listener/gossip_listener/reth_network_listener to the NodePorts struct and
return the listeners alongside port numbers) and only drop them after the child
processes signal readiness, or implement child-side binding to port 0 with a
startup handshake to report assigned ports back; update any callers of
NodePorts::allocate and the node startup logic to consume the listeners or
perform the handshake so the ports are not released prematurely.
multiversion-tests/src/fault/network.rs (1)

148-171: 🧹 Nitpick | 🔵 Trivial

Consider consolidating tests with rstest for consistency.

The tests work correctly but could be consolidated using rstest for better coverage and consistency with other test files in this crate (e.g., crash.rs).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/fault/network.rs` around lines 148 - 171, Consolidate
the two unit tests into a single parameterized rstest to match other files:
create a #[rstest] that supplies two cases for FaultTarget (one Network {
source_port: 8080, dest_port: 9090 } expecting Ok((8080,9090)) and one Process {
pid: Pid::from_raw(1) } expecting Err), and replace the separate test functions
with that single parameterized test which calls extract_ports(&target) and
asserts the expected result; reference the existing symbols extract_ports and
FaultTarget when updating the tests.
multiversion-tests/src/probe.rs (1)

67-109: ⚠️ Potential issue | 🟠 Major

Turn HTTP error statuses into probe failures.

send().await? succeeds on 4xx/5xx. A failing /v1/info, /v1/network/config, or /v1/genesis endpoint can still look healthy here if its error payload deserializes. Call error_for_status()? before json() in each getter.

Suggested fix pattern
     pub async fn get_info(&self, api_url: &str) -> Result<NodeInfo, ProbeError> {
         let url = crate::api_endpoint(api_url, "v1/info");
-        Ok(self
-            .client
-            .get(url)
-            .send()
-            .await?
-            .json::<NodeInfo>()
-            .await?)
+        let resp = self.client.get(url).send().await?.error_for_status()?;
+        Ok(resp.json::<NodeInfo>().await?)
     }

Apply the same pattern to get_peers, get_network_config, and get_genesis_hash.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/probe.rs` around lines 67 - 109, Each of the HTTP
getters (get_info, get_peers, get_network_config, get_genesis_hash) must treat
non-2xx responses as probe failures by calling error_for_status()? on the
Response before deserializing; update each method to call
.send().await?.error_for_status()? and then .json().await? (for get_genesis_hash
keep the subsequent serde_json::Value handling and the existing
genesis_block_hash check).
multiversion-tests/src/cluster.rs (2)

172-183: ⚠️ Potential issue | 🟠 Major

These ports are not actually reserved.

allocate_ports() keeps only the numbers returned by NodePorts::allocate(), and multiversion-tests/src/ports.rs:14-27 drops the temporary listeners before returning. Another process can claim one of those ports before spawn_node() binds it, which makes cluster startup flaky under parallel CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/cluster.rs` around lines 172 - 183, allocate_ports
currently records ephemeral port numbers from NodePorts::allocate() while
ports.rs drops the temporary listeners, allowing races; fix by making allocation
hold the listeners until nodes are bound: modify NodePorts::allocate (in
ports.rs) to return reserved TcpListener(s) or a NodePorts struct that stores
TcpListener(s) instead of just u16s, update allocate_ports to insert and keep
those NodePorts-with-listeners in the HashMap, and update spawn_node to
accept/use the preserved listeners (or accept owned NodePorts) so the sockets
remain reserved until the process binds them.

95-105: ⚠️ Potential issue | 🟠 Major

Don't let crashed nodes disappear from convergence checks.

wait_for_convergence() does a liveness pass first, but api_urls() re-checks is_running() and drops any node that dies in the gap. That lets the remaining nodes "converge" without the crashed member.

Suggested fix
-    pub fn api_urls(&mut self) -> Vec<String> {
-        self.nodes
-            .values_mut()
-            .filter_map(|n| {
-                if n.is_running() {
-                    Some(n.api_url())
-                } else {
-                    None
-                }
-            })
-            .collect()
+    pub fn api_urls(&self) -> Vec<String> {
+        self.nodes.values().map(|n| n.api_url()).collect()
     }

Also applies to: 116-129

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/cluster.rs` around lines 95 - 105, The api_urls()
helper is dropping nodes that died between the liveness pass and the convergence
check by calling is_running(); remove that check so api_urls() returns each
node's api_url() regardless of is_running() (e.g., replace the
filter_map/is_running() logic with an unconditional map over
self.nodes.values()/values_mut() to collect api_url()), so
wait_for_convergence() can perform its own liveness logic; apply the same change
to the other similar helper mentioned in the comment.
multiversion-tests/tests/common/mod.rs (1)

44-56: ⚠️ Potential issue | 🟠 Major

Fail fast when the peer identity pool is exhausted.

peer_spec() silently reuses credentials here: index == 0 collides with genesis, and the fourth peer wraps back to the first fixture. That makes larger scenarios look valid while actually booting duplicate identities.

Suggested fix
 pub fn peer_spec(
     name: &str,
     binary: &ResolvedBinary,
     index: usize,
     peers: Vec<String>,
 ) -> NodeSpec {
+    assert!(
+        index > 0 && index < MINING_KEYS.len() && index < REWARD_ADDRESSES.len(),
+        "missing test identity for peer index {index}"
+    );
     NodeSpec {
         name: name.to_owned(),
         binary: binary.clone(), // clone: building owned NodeSpec from shared reference
         role: NodeRole::Peer,
         peers,
-        mining_key: MINING_KEYS[index % MINING_KEYS.len()].to_owned(),
-        reward_address: REWARD_ADDRESSES[index % REWARD_ADDRESSES.len()].to_owned(),
+        mining_key: MINING_KEYS[index].to_owned(),
+        reward_address: REWARD_ADDRESSES[index].to_owned(),
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/tests/common/mod.rs` around lines 44 - 56, peer_spec
currently masks credential exhaustion by using modulo on MINING_KEYS and
REWARD_ADDRESSES; replace the modulo indexing with direct indexing and fail
fast: in peer_spec (when constructing NodeSpec), assert or return/panic if index
is out of bounds for MINING_KEYS or REWARD_ADDRESSES (e.g., index >=
MINING_KEYS.len() || index >= REWARD_ADDRESSES.len()) and include a clear error
message referencing the peer name and index so duplicate identities (including
collision with genesis at index 0) are not silently produced.
multiversion-tests/src/process.rs (2)

145-155: ⚠️ Potential issue | 🟠 Major

Wait for the old child before respawning.

respawn() sends start_kill() and immediately launches the replacement against the same ports and data dir. Direct callers can race the old process and hit EADDRINUSE or overlapping on-disk state. Either make this path async and wait on the old Child, or enforce that callers fully stop the process before respawn().
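A hedged sketch of the wait-before-respawn ordering, written against synchronous `std::process` for brevity (the harness uses `tokio::process::Child`, where the same shape applies with `.kill().await` / `.wait().await`; this `respawn` signature is illustrative, not the PR's):

```rust
use std::process::{Child, Command};

/// Kill the old child and *reap it* before spawning the replacement,
/// so its ports and data directory are actually free.
fn respawn(old: &mut Option<Child>, cmd: &mut Command) -> std::io::Result<Child> {
    if let Some(mut child) = old.take() {
        // Ignore "already exited" errors from kill.
        let _ = child.kill();
        // Without this wait, the new child can race the old one for the
        // listen ports (EADDRINUSE) and overlapping on-disk state.
        let status = child.wait()?;
        eprintln!("old node exited with {status}");
    }
    cmd.spawn()
}
```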

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/process.rs` around lines 145 - 155, respawn()
currently calls start_kill() on the old Child and immediately spawns a new
child, leading to races on ports/dirs; change respawn (or its control flow) to
wait for the old process to exit before calling spawn_child_with_logs: after
successfully calling old.start_kill(), call old.wait() (or wait_with_output())
on the taken Child and handle/propagate any wait errors (log non-fatal
warnings), then proceed to spawn_child_with_logs and set self.child/self.log_rx;
alternatively, if start_kill() is async-ish, make respawn async and await the
termination future—refer to respawn, start_kill, child, and
spawn_child_with_logs when making the change.

42-42: ⚠️ Potential issue | 🟠 Major

Cap the in-memory log backlog.

The log channel is unbounded, and the only way to release memory is drain_logs(). In this harness that usually happens only on failure paths, so a noisy node can retain its entire stdout/stderr history in RAM.

Also applies to: 158-163, 184-221
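A sketch of a bounded backlog with load-shedding, using `std::sync::mpsc::sync_channel` in place of the tokio channel (`LogSink`, `LOG_BACKLOG`, and the drop-newest policy are assumptions for illustration; the tokio equivalent would be `tokio::sync::mpsc::channel(LOG_BACKLOG)` with `try_send`):

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

const LOG_BACKLOG: usize = 1024; // assumed cap; make configurable if needed

struct LogSink {
    tx: SyncSender<String>,
    dropped: usize,
}

impl LogSink {
    fn push(&mut self, line: String) {
        match self.tx.try_send(line) {
            Ok(()) => {}
            // Backlog full: shed the newest line instead of growing RAM.
            Err(TrySendError::Full(_)) => self.dropped += 1,
            // Receiver gone: nothing useful to do with the line.
            Err(TrySendError::Disconnected(_)) => {}
        }
    }
}

fn bounded_log_channel() -> (LogSink, Receiver<String>) {
    let (tx, rx) = sync_channel(LOG_BACKLOG);
    (LogSink { tx, dropped: 0 }, rx)
}
```

Tracking `dropped` lets `drain_logs()` report how much output was shed when a failure is being diagnosed.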

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/process.rs` at line 42, The log channel is currently
unbounded (log_rx: mpsc::UnboundedReceiver<String>), which allows unbounded
memory growth; change the channel to a bounded one and enforce a reasonable
backlog size, replace UnboundedReceiver/UnboundedSender with a bounded
mpsc::Receiver/mpsc::Sender (e.g., mpsc::channel(backlog_size)) when creating
the channel, update the field log_rx and any senders/creators accordingly, and
ensure drain_logs() and any places handling logs (the log reader/collector code
referenced around the other ranges) handle a full/closed channel correctly
(dropping or truncating old messages instead of retaining forever). Ensure
backlog_size is configurable or set to a sane constant and update
tests/constructors that create the channel to use the new bounded constructor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@multiversion-tests/src/binary.rs`:
- Around line 137-148: The call to file.lock() in acquire_build_lock uses an
extension method provided by the fs4 crate, so add fs4 = "0.11" (or a compatible
version) to Cargo.toml and import the extension trait in binary.rs with use
fs4::FileExt; so that the File::lock() call compiles; update
multiversion-tests/Cargo.toml to include the dependency and add the use
statement near other imports in binary.rs.

In `@multiversion-tests/src/config.rs`:
- Around line 147-148: The code currently serializes config with
toml::to_string_pretty(&config) and then calls std::fs::write(config_path,
output) which can leave a truncated config if interrupted; instead write the
serialized output to a temporary file in the same directory (e.g., create a file
named config_path.with_extension(".tmp") or use tempfile in the same dir), flush
and sync the temp file, then atomically replace the original by renaming the
temp into config_path (std::fs::rename), ensuring any intermediate fsyncs as
needed; update the code around the variables output and config_path to perform
these steps (create temp, write output, sync, rename) rather than calling
std::fs::write directly.
- Around line 427-448: Replace the manual loop in the test
camel_to_snake_known_consensus_fields with an rstest parameterized test: add use
rstest::rstest, annotate the test with #[rstest], change the signature to fn
camel_to_snake_known_consensus_fields(#[case] camel: &str, #[case] snake: &str)
and provide one #[case] per pair (e.g. #[case("chainId", "chain_id")], ...)
instead of the cases array and for-loop, then keep the assertion
assert_eq!(camel_to_snake(camel), snake, "failed for {camel}"); to run each case
separately.
- Around line 156-161: The JSON number branch in the conversion currently uses
n.as_i64() then n.as_f64(), which can lose precision for large unsigned values;
update the serde_json::Value::Number handling so it first checks n.as_i64(),
then n.as_u64() and if the u64 value is > i64::MAX convert it to
Value::String(n.to_string()) (preserving precision), otherwise convert to
Value::Integer(u64 as i64); only then fall back to n.as_f64().map(Value::Float).
Also add a unit test that converts a JSON number equal to i64::MAX + 1 (as a raw
number) through the conversion and back to ensure it round-trips without
precision loss. Ensure you modify the function handling
serde_json::Value::Number and add the test for the large u64 case.

In `@xtask/src/main.rs`:
- Around line 530-539: The test invocation builds test_args and currently pushes
the literal flags "--test" and "*" which causes Cargo to error because "*" is
not a valid test name; update the construction used by test_args (the vector
that is passed to cmd!(sh, "cargo {test_args...}") ) to use the correct flag
"--tests" instead of the pair "--test" and "*" so Cargo runs all integration
tests (adjust any surrounding separators like "--" if needed to preserve
ignored/tests-thread flags).

---

Duplicate comments:
In `@multiversion-tests/src/cluster.rs`:
- Around line 172-183: allocate_ports currently records ephemeral port numbers
from NodePorts::allocate() while ports.rs drops the temporary listeners,
allowing races; fix by making allocation hold the listeners until nodes are
bound: modify NodePorts::allocate (in ports.rs) to return reserved
TcpListener(s) or a NodePorts struct that stores TcpListener(s) instead of just
u16s, update allocate_ports to insert and keep those NodePorts-with-listeners in
the HashMap, and update spawn_node to accept/use the preserved listeners (or
accept owned NodePorts) so the sockets remain reserved until the process binds
them.
- Around line 95-105: The api_urls() helper is dropping nodes that died between
the liveness pass and the convergence check by calling is_running(); remove that
check so api_urls() returns each node's api_url() regardless of is_running()
(e.g., replace the filter_map/is_running() logic with an unconditional map over
self.nodes.values()/values_mut() to collect api_url()), so
wait_for_convergence() can perform its own liveness logic; apply the same change
to the other similar helper mentioned in the comment.

In `@multiversion-tests/src/fault/network.rs`:
- Around line 148-171: Consolidate the two unit tests into a single
parameterized rstest to match other files: create a #[rstest] that supplies two
cases for FaultTarget (one Network { source_port: 8080, dest_port: 9090 }
expecting Ok((8080,9090)) and one Process { pid: Pid::from_raw(1) } expecting
Err), and replace the separate test functions with that single parameterized
test which calls extract_ports(&target) and asserts the expected result;
reference the existing symbols extract_ports and FaultTarget when updating the
tests.

In `@multiversion-tests/src/ports.rs`:
- Around line 14-28: The allocate() method currently binds three TcpListener
sockets then drops them (drop((l1,l2,l3))), releasing ports and causing a race;
modify the design so ports remain reserved until child processes bind: either
change NodePorts and allocate() to return or store the TcpListener handles
(e.g., add fields like api_listener/gossip_listener/reth_network_listener to the
NodePorts struct and return the listeners alongside port numbers) and only drop
them after the child processes signal readiness, or implement child-side binding
to port 0 with a startup handshake to report assigned ports back; update any
callers of NodePorts::allocate and the node startup logic to consume the
listeners or perform the handshake so the ports are not released prematurely.

In `@multiversion-tests/src/probe.rs`:
- Around line 67-109: Each of the HTTP getters (get_info, get_peers,
get_network_config, get_genesis_hash) must treat non-2xx responses as probe
failures by calling error_for_status()? on the Response before deserializing;
update each method to call .send().await?.error_for_status()? and then
.json().await? (for get_genesis_hash keep the subsequent serde_json::Value
handling and the existing genesis_block_hash check).

In `@multiversion-tests/src/process.rs`:
- Around line 145-155: respawn() currently calls start_kill() on the old Child
and immediately spawns a new child, leading to races on ports/dirs; change
respawn (or its control flow) to wait for the old process to exit before calling
spawn_child_with_logs: after successfully calling old.start_kill(), call
old.wait() (or wait_with_output()) on the taken Child and handle/propagate any
wait errors (log non-fatal warnings), then proceed to spawn_child_with_logs and
set self.child/self.log_rx; alternatively, if start_kill() is async-ish, make
respawn async and await the termination future—refer to respawn, start_kill,
child, and spawn_child_with_logs when making the change.
- Line 42: The log channel is currently unbounded (log_rx:
mpsc::UnboundedReceiver<String>), which allows unbounded memory growth; change
the channel to a bounded one and enforce a reasonable backlog size, replace
UnboundedReceiver/UnboundedSender with a bounded mpsc::Receiver/mpsc::Sender
(e.g., mpsc::channel(backlog_size)) when creating the channel, update the field
log_rx and any senders/creators accordingly, and ensure drain_logs() and any
places handling logs (the log reader/collector code referenced around the other
ranges) handle a full/closed channel correctly (dropping or truncating old
messages instead of retaining forever). Ensure backlog_size is configurable or
set to a sane constant and update tests/constructors that create the channel to
use the new bounded constructor.

In `@multiversion-tests/tests/common/mod.rs`:
- Around line 44-56: peer_spec currently masks credential exhaustion by using
modulo on MINING_KEYS and REWARD_ADDRESSES; replace the modulo indexing with
direct indexing and fail fast: in peer_spec (when constructing NodeSpec), assert
or return/panic if index is out of bounds for MINING_KEYS or REWARD_ADDRESSES
(e.g., index >= MINING_KEYS.len() || index >= REWARD_ADDRESSES.len()) and
include a clear error message referencing the peer name and index so duplicate
identities (including collision with genesis at index 0) are not silently
produced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0d135a76-65d8-4135-a532-5689b3c9b957

📥 Commits

Reviewing files that changed from the base of the PR and between e14b184 and b0b65ef.

⛔ Files ignored due to path filters (1)
  • multiversion-tests/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • .github/workflows/multiversion.yml
  • .gitignore
  • multiversion-tests/Cargo.toml
  • multiversion-tests/fixtures/base-config.toml
  • multiversion-tests/src/binary.rs
  • multiversion-tests/src/cluster.rs
  • multiversion-tests/src/config.rs
  • multiversion-tests/src/fault/crash.rs
  • multiversion-tests/src/fault/network.rs
  • multiversion-tests/src/lib.rs
  • multiversion-tests/src/ports.rs
  • multiversion-tests/src/probe.rs
  • multiversion-tests/src/process.rs
  • multiversion-tests/tests/common/mod.rs
  • multiversion-tests/tests/e2e.rs
  • multiversion-tests/tests/upgrade.rs
  • xtask/src/main.rs

Comment on lines +156 to +161
serde_json::Value::Number(n) => {
if let Some(i) = n.as_i64() {
Some(Value::Integer(i))
} else {
n.as_f64().map(Value::Float)
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Current conversion logic in this file
rg -n --type=rust -C3 'fn json_to_toml|as_i64\(|as_u64\(|as_f64\(' multiversion-tests/src/config.rs

# Search consensus-related numeric fields across the repo
rg -n --type=rust -C2 '\bstruct\s+ConsensusConfig\b|:\s*u64\b|:\s*u128\b|:\s*usize\b'

Repository: Irys-xyz/irys

Length of output: 50372


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where json_to_toml is called
rg -n 'json_to_toml' multiversion-tests/src/ --type rust

# Check the full function context and tests
head -n 250 multiversion-tests/src/config.rs | tail -n +140

Repository: Irys-xyz/irys

Length of output: 4768


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look for consensus config patching or usage
rg -n 'ConsensusConfig|patch|json_to_toml' multiversion-tests/src/config.rs --type rust -B 2 -A 5

# Check what TOML Value types are available
rg -n 'enum Value|struct Value|toml::Value' multiversion-tests/src/config.rs --type rust -B 1 -A 3

Repository: Irys-xyz/irys

Length of output: 3322


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look at what consensus_json contains - check the API response type
rg -n 'network/config|consensus' multiversion-tests/src/ --type rust -B 2 -A 2 | head -50

# Check if there are any test fixtures that show what values come through
fd -e toml -e json multiversion-tests/fixtures/ -type f

Repository: Irys-xyz/irys

Length of output: 3283


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the actual consensus config types and their numeric fields
rg -n 'struct ConsensusConfig|ConsensusType|pub.*u64|pub.*u128' crates/types/src/ --type rust -A 1 | grep -E '(struct|pub|u64|u128|i64)' | head -100

Repository: Irys-xyz/irys

Length of output: 110


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find ConsensusConfig struct definition
rg -n 'struct ConsensusConfig' crates/ --type rust -A 50 | head -100

Repository: Irys-xyz/irys

Length of output: 3955


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look at how the consensus JSON is used - check what v1/network/config returns
rg -n 'network/config\|v1/network' crates/ --type rust -B 3 -A 3

Repository: Irys-xyz/irys

Length of output: 39


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if any consensus fields are u64 or u128 that might exceed i64::MAX
fd -type f -name '*.rs' crates/types/src/ | xargs grep -l 'ConsensusConfig' | head -3 | while read f; do
  echo "=== $f ==="
  head -n 300 "$f" | tail -n 200
done

Repository: Irys-xyz/irys

Length of output: 229


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the canonical serialization to understand how u64 fields are serialized
rg -n 'serialize_u64|serialize_u128|u64.*serialize|canonical' crates/types/src/canonical.rs -A 3 -B 1 | head -60

Repository: Irys-xyz/irys

Length of output: 2227


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look at the actual genesis/consensus structure to see what fields might be large u64s
rg -n 'struct.*Config|pub.*u64|pub.*u128' crates/types/src/config/consensus.rs | head -50

Repository: Irys-xyz/irys

Length of output: 1481


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check how JSON numbers with large values are handled by serde_json
cat > /tmp/test_json_num.rs << 'EOF'
fn main() {
    use serde_json::json;
    
    // Test with a large u64
    let large_u64 = u64::MAX;
    let json_num = serde_json::Number::from_u64(large_u64).unwrap();
    
    println!("Large u64: {}", large_u64);
    println!("as_i64(): {:?}", json_num.as_i64());
    println!("as_u64(): {:?}", json_num.as_u64());
    println!("as_f64(): {:?}", json_num.as_f64());
    
    // Test with i64::MAX
    let i64_max = i64::MAX as u64;
    let json_num2 = serde_json::Number::from_u64(i64_max).unwrap();
    println!("\ni64::MAX as u64: {}", i64_max);
    println!("as_i64(): {:?}", json_num2.as_i64());
    println!("as_u64(): {:?}", json_num2.as_u64());
}
EOF
cd /tmp && rustc test_json_num.rs -o test_json_num 2>/dev/null && ./test_json_num || echo "Rust not available in sandbox"

Repository: Irys-xyz/irys

Length of output: 86


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the test that validates integer recovery from strings
rg -n 'json_to_toml' multiversion-tests/src/config.rs -B 5 -A 15 | grep -A 20 'test'

Repository: Irys-xyz/irys

Length of output: 723


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if consensus_json is guaranteed to be canonical (all u64/u128 as strings)
# or if it could be raw JSON numbers
rg -n 'get_network_config\|/v1/network/config' multiversion-tests/ crates/ --type rust -B 5 -A 5

Repository: Irys-xyz/irys

Length of output: 39


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if there's validation or parsing of the consensus JSON
rg -n 'patch_peer_consensus\|consensus_json' multiversion-tests/ --type rust -B 3 -A 10 | head -80

Repository: Irys-xyz/irys

Length of output: 39


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check what type the API returns - is it canonical JSON?
rg -n 'network/config|serde_json|Value' crates/types/src/ --type rust | grep -i 'api\|response\|network' | head -20

Repository: Irys-xyz/irys

Length of output: 39


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look at the probe.rs file where get_network_config is defined
cat multiversion-tests/src/probe.rs | grep -A 30 'get_network_config'

Repository: Irys-xyz/irys

Length of output: 1188


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look at where patch_peer_consensus is called
cat multiversion-tests/src/config.rs | grep -B 10 -A 10 'patch_peer_consensus' | head -50

Repository: Irys-xyz/irys

Length of output: 894


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check what the actual test does with consensus JSON
cat multiversion-tests/src/config.rs | sed -n '400,500p'

Repository: Irys-xyz/irys

Length of output: 2553


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check what type is returned by /v1/network/config endpoint
# Look for the handler/endpoint definition
rg -n 'v1/network/config|network.*config|GET.*config' crates/ --type rust -B 3 -A 5 | head -80

Repository: Irys-xyz/irys

Length of output: 5053


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if ConsensusConfig is serialized with canonical format when returned from API
rg -n 'serde.*canonical|to_canonical|Canonical' crates/ --type rust | grep -i 'consensus\|api\|response' | head -20

Repository: Irys-xyz/irys

Length of output: 1457


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the actual structure of what serde_json::json!() produces
# and verify if canonical JSON format is used for large numbers
cat > /tmp/check_precision.py << 'EOF'
# When serde_json serializes u64, does it produce JSON numbers or strings?
# The canonical module says it serializes u64 as strings for JS safety
# But that's during TO_canonical conversion, not when the API endpoint returns it

# Let's check: are large u64 values actually in consensus config?
# Looking at the fields we found:
# - chain_id: u64 (could be large)
# - chunk_size: u64 = 256 * 1024 = 262,144 (small, fits in i64)
# - block_tree_depth: u64
# - num_chunks_in_partition: u64 = 75,534,400 (small, fits in i64)
# - timestamp_millis: u128 (definitely needs string handling)
# - max_future_timestamp_drift_millis: u128

# The consensus fields are mostly moderate values, but u128 is used
# and u128 definitely won't fit in i64

# Question: does the API use canonical serialization or standard JSON?
print("Need to check: does /v1/network/config endpoint use canonical serialization?")
print("If yes: all u64/u128 are strings, current code works")
print("If no: large values could lose precision via f64 conversion")
EOF
python3 /tmp/check_precision.py

Repository: Irys-xyz/irys

Length of output: 246


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where the consensus config is actually serialized for API response
rg -n 'json!|serde_json::to_value|Serialize' crates/types/src/config/consensus.rs -B 2 -A 2

Repository: Irys-xyz/irys

Length of output: 3124


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check test fixtures to see if they use raw JSON numbers or canonical format
find multiversion-tests/fixtures -type f -name '*.json' -o -name '*.toml' | head -20

# Look for any integration tests that call patch_peer_consensus
rg -n 'patch_peer_consensus|json_to_toml' multiversion-tests/tests/ --type rust -B 3 -A 3

Repository: Irys-xyz/irys

Length of output: 101


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if there are any tests with actual large u64 values
rg -n 'u64::MAX|i64::MAX|precision|float' multiversion-tests/src/config.rs -B 2 -A 2

Repository: Irys-xyz/irys

Length of output: 293


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify: does the test only check i64 range?
sed -n '456,470p' multiversion-tests/src/config.rs

Repository: Irys-xyz/irys

Length of output: 662


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if there's a test for large numbers that might exceed i64::MAX
rg -n 'u64|large|MAX' multiversion-tests/src/config.rs | grep test

Repository: Irys-xyz/irys

Length of output: 39


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify the complete json_to_toml function to see all branches
sed -n '152,184p' multiversion-tests/src/config.rs

Repository: Irys-xyz/irys

Length of output: 1301


Improve handling of large integers in JSON-to-TOML conversion.

The fallback to as_f64() at line 160 can lose precision for large integers. While the consensus config API endpoint returns canonical JSON (where u64/u128 are serialized as strings and handled correctly by the String branch), this function should defensively handle raw JSON numbers.

The suggested fix improves this by checking as_u64() before falling back to float, converting values exceeding i64::MAX to strings. However, the fix is incomplete: it does not address u128 types, which would still convert to Float and lose precision. Consider also adding a test case for large u64 values (e.g., i64::MAX + 1) to ensure they round-trip without precision loss.
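The branch ordering can be modeled without pulling in `serde_json`/`toml`: the enum and function below are stand-ins for `toml::Value` and the `Number` arm (names are hypothetical), showing a variant that also keeps `u128`-range values out of the lossy `f64` path by stringifying everything above `i64::MAX`, per the canonical-JSON convention.

```rust
// Minimal stand-in for the subset of toml::Value used by the conversion.
#[derive(Debug, PartialEq)]
enum TomlNum {
    Integer(i64),
    Str(String),
}

// `n` models the magnitude of a non-negative JSON integer (covers u64 and u128).
fn json_uint_to_toml(n: u128) -> TomlNum {
    if let Ok(i) = i64::try_from(n) {
        // Fits TOML's native integer type.
        TomlNum::Integer(i)
    } else {
        // Exceeds i64::MAX (u64 or u128 range): stringify to preserve
        // full precision instead of routing through f64.
        TomlNum::Str(n.to_string())
    }
}
```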

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/config.rs` around lines 156 - 161, The JSON number
branch in the conversion currently uses n.as_i64() then n.as_f64(), which can
lose precision for large unsigned values; update the serde_json::Value::Number
handling so it first checks n.as_i64(), then n.as_u64() and if the u64 value is
> i64::MAX convert it to Value::String(n.to_string()) (preserving precision),
otherwise convert to Value::Integer(u64 as i64); only then fall back to
n.as_f64().map(Value::Float). Also add a unit test that converts a JSON number
equal to i64::MAX + 1 (as a raw number) through the conversion and back to
ensure it round-trips without precision loss. Ensure you modify the function
handling serde_json::Value::Number and add the test for the large u64 case.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
multiversion-tests/src/config.rs (2)

147-148: 🧹 Nitpick | 🔵 Trivial

Consider atomic writes to prevent partial-file corruption.

Direct std::fs::write can leave a truncated config file if the process is interrupted. For test infrastructure this is low-risk, but atomic write-then-rename is more robust.

🛡️ Atomic write pattern
     let output = toml::to_string_pretty(&config)?;
-    std::fs::write(config_path, output)?;
+    let tmp_path = config_path.with_extension("toml.tmp");
+    std::fs::write(&tmp_path, &output)?;
+    std::fs::rename(&tmp_path, config_path)?;
     Ok(())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/config.rs` around lines 147 - 148, The current use of
toml::to_string_pretty(&config)? followed by std::fs::write(config_path,
output)? can leave a truncated file if interrupted; change to an atomic write:
write the serialized output to a temporary file (e.g., use
tempfile::NamedTempFile or a config_path.with_extension("tmp")), flush and sync
the file, then atomically replace the original via std::fs::rename(temp_path,
config_path) and optionally sync the parent directory; update the code around
the toml::to_string_pretty, std::fs::write, and use std::fs::rename (or
tempfile) to implement this pattern.

156-161: 🧹 Nitpick | 🔵 Trivial

Consider checking as_u64() before falling back to float.

The current Number handling goes as_i64() → as_f64(). While the API uses canonical JSON (large integers as strings), defensively checking as_u64() before the float fallback would prevent precision loss for raw JSON numbers exceeding i64::MAX.

🛡️ Defensive number handling
         serde_json::Value::Number(n) => {
             if let Some(i) = n.as_i64() {
                 Some(Value::Integer(i))
+            } else if let Some(u) = n.as_u64() {
+                // Value exceeds i64::MAX; preserve as string to avoid precision loss
+                Some(Value::String(u.to_string()))
             } else {
                 n.as_f64().map(Value::Float)
             }
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/config.rs` around lines 156 - 161, In the
serde_json::Value::Number(n) match arm, add a check for n.as_u64() before
falling back to n.as_f64(): first try n.as_i64().map(Value::Integer), then if
that is None try n.as_u64() and map it to the appropriate unsigned/integer
variant (or extend Value to include an unsigned variant) to preserve large
positive integers, and only then call n.as_f64().map(Value::Float); update the
match arm in multiversion-tests/src/config.rs (symbols:
serde_json::Value::Number, n.as_u64(), Value::Integer, Value::Float)
accordingly.
multiversion-tests/src/cluster.rs (1)

96-107: ⚠️ Potential issue | 🟡 Minor

Convergence check may miss nodes that crash mid-wait.

api_urls() filters out non-running nodes. While wait_for_convergence checks all nodes are running upfront (lines 118-126), a node that crashes during the subsequent probe loop will be silently excluded from convergence calculations, potentially reporting success with fewer nodes than expected.

Consider either:

  1. Returning all URLs regardless of running state and letting the probe detect unreachable nodes
  2. Re-checking running state after wait_for_convergence returns
🛡️ Option 1: Return all URLs unconditionally
     pub fn api_urls(&self) -> Vec<String> {
         self.nodes
-            .values_mut()
-            .filter_map(|n| {
-                if n.is_running() {
-                    Some(n.api_url())
-                } else {
-                    None
-                }
-            })
+            .values()
+            .map(|n| n.api_url())
             .collect()
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/cluster.rs` around lines 96 - 107, The current
api_urls() filters out nodes where Node::is_running() is false which lets a node
that crashes during the probe loop be excluded from convergence checks; update
api_urls() to return all node URLs unconditionally (iterate over
self.nodes.values() and collect n.api_url() for every node) so the probe loop
and wait_for_convergence() will detect unreachable/crashed nodes instead of
silently skipping them.
multiversion-tests/src/process.rs (1)

145-156: ⚠️ Potential issue | 🟡 Minor

respawn() may leave old process as zombie.

The sync function calls start_kill() but cannot await the old process's termination before spawning the new one. The old child handle is dropped without waiting, potentially leaving a zombie until the test process exits.

If this is acceptable for the test harness use case, consider adding a brief comment documenting this limitation. Alternatively, convert to async:

♻️ Async alternative with proper cleanup
-    pub fn respawn(&mut self) -> Result<(), ProcessError> {
+    pub async fn respawn(&mut self) -> Result<(), ProcessError> {
         if let Some(mut old) = self.child.take()
-            && let Err(e) = old.start_kill()
         {
-            tracing::warn!(error = %e, "failed to kill old process during respawn");
+            if let Err(e) = old.start_kill() {
+                tracing::warn!(error = %e, "failed to kill old process during respawn");
+            }
+            // Reap the old process
+            let _ = old.wait().await;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/process.rs` around lines 145 - 156, The respawn()
implementation drops the old child after calling start_kill(), which can leave a
zombie because it never awaits the old process's termination; either (A) convert
respawn to async (change signature of respawn to async fn respawn(...) ->
Result<..., ProcessError>) and after calling old.start_kill() await the child's
termination (e.g., call old.wait().await / old.wait_with_output().await or the
appropriate async wait method on the child handle) before calling
spawn_child_with_logs, or (B) if synchronous behavior is intentional, add a
clear comment above respawn() documenting that start_kill() is not awaited and
may leave zombies until process exit; update references to child and log_rx
accordingly when making it async.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@multiversion-tests/src/fault/network.rs`:
- Around line 23-29: The current Drop impl for NetworkPartitioner only calls
sync_cleanup_chain() when ACTIVE_INJECTIONS is zero at drop-time, which can
permanently skip cleanup if the atomic goes to zero later (e.g., in the guard
decrement); fix by calling sync_cleanup_chain() at the point where
ACTIVE_INJECTIONS is decremented to zero rather than only in Drop: update the
decrement path (the code that does ACTIVE_INJECTIONS.fetch_sub(...) in the
guard) to check the returned value and call sync_cleanup_chain() when that
fetch_sub indicates the count reached zero, or extract that logic into a helper
(e.g., decrement_active_and_cleanup()) and call it from both the guard and
NetworkPartitioner::drop so cleanup runs exactly when the active count falls to
zero.

In `@multiversion-tests/src/process.rs`:
- Around line 134-142: When the graceful shutdown times out in the shutdown
path, the code sends signal::kill(pid, Signal::SIGKILL) but returns immediately
leaving self.child populated; update the timeout branch so that after sending
SIGKILL you wait for the child to actually exit (e.g., call
self.child.as_mut().and_then(|c| c.wait().ok()) or poll with try_wait in a short
loop with small sleep) and then clear self.child = None before returning
Err(ProcessError::ShutdownTimeout { timeout }); this ensures the process handle
is released and prevents start_kill() from being invoked on an already-dead
process.

In `@xtask/src/main.rs`:
- Around line 530-539: The passthrough args end up after the test-binary
separator (`"--"`) so they are delivered to the test binary instead of Cargo;
fix by adding `args` into `test_args` before you push the `"--"` separator.
Concretely, in the `test_args` construction for `test_args` (and the `cmd!(sh,
"cargo {test_args...}")` invocation) build the vector with the Cargo-level flags
first (e.g. "test", "--test", "*"), then `extend(args)`, and only then push the
`"--"` separator and the test-binary flags like `"--ignored"` and
`"--test-threads=1"` so passthrough `args` are treated as Cargo args.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 55902fd9-5396-4e3b-9a17-5e445c86542c

📥 Commits

Reviewing files that changed from the base of the PR and between e14b184 and 94af38d.

⛔ Files ignored due to path filters (1)
  • multiversion-tests/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • .github/workflows/multiversion.yml
  • .gitignore
  • multiversion-tests/Cargo.toml
  • multiversion-tests/fixtures/base-config.toml
  • multiversion-tests/src/binary.rs
  • multiversion-tests/src/cluster.rs
  • multiversion-tests/src/config.rs
  • multiversion-tests/src/fault/crash.rs
  • multiversion-tests/src/fault/network.rs
  • multiversion-tests/src/lib.rs
  • multiversion-tests/src/ports.rs
  • multiversion-tests/src/probe.rs
  • multiversion-tests/src/process.rs
  • multiversion-tests/tests/common/mod.rs
  • multiversion-tests/tests/e2e.rs
  • multiversion-tests/tests/upgrade.rs
  • xtask/src/main.rs

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
multiversion-tests/src/process.rs (1)

149-160: 🧹 Nitpick | 🔵 Trivial

respawn() doesn't wait for old process termination.

The start_kill() call initiates termination but doesn't wait for the old process to exit before spawning the replacement. While the dropped Child handle will eventually be reaped by Tokio's runtime, this could briefly leave a zombie. Consider making respawn() async to allow waiting:

♻️ Optional: Make respawn async to ensure clean handoff
-    pub fn respawn(&mut self) -> Result<(), ProcessError> {
-        if let Some(mut old) = self.child.take()
-            && let Err(e) = old.start_kill()
-        {
-            tracing::warn!(error = %e, "failed to kill old process during respawn");
+    pub async fn respawn(&mut self) -> Result<(), ProcessError> {
+        if let Some(mut old) = self.child.take() {
+            if let Err(e) = old.start_kill() {
+                tracing::warn!(error = %e, "failed to kill old process during respawn");
+            }
+            // Wait for old process to fully exit
+            let _ = old.wait().await;
         }
 
         let (child, log_rx) = spawn_child_with_logs(&self.config)?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/process.rs` around lines 149 - 160, The respawn()
method currently calls start_kill() on the old Child but returns immediately,
risking a brief zombie; change respawn to be async (pub async fn respawn(&mut
self) -> Result<(), ProcessError>) and after calling start_kill() await the
child's termination (e.g., old.wait().await or old.wait_with_output().await) before
spawning a new child with spawn_child_with_logs; ensure you still handle and log
any error from start_kill() and from waiting on old, and preserve updating
self.child and self.log_rx after the old process has fully exited.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@multiversion-tests/src/fault/network.rs`:
- Around line 23-29: The Drop implementation for NetworkPartitioner currently
checks ACTIVE_INJECTIONS.load(...) == 0 and calls sync_cleanup_chain(), which
races with inject() setup; replace that ad-hoc check with the existing
synchronized cleanup path by calling decrement_and_maybe_cleanup() (the same
function noted as the primary cleanup path) so drops participate in the
coordinated decrement/cleanup logic; update both occurrences of the manual
load-and-sync (the Drop impl and the other instance at 45-47) to invoke
decrement_and_maybe_cleanup() instead of reading ACTIVE_INJECTIONS and calling
sync_cleanup_chain() directly.
- Around line 112-116: The stderr matching is case-sensitive and misses
"Permission denied"; update the check in the function (e.g.,
check_iptables_output) to compare case-insensitively — for example normalize
stderr to lowercase (or use a case-insensitive search) before testing for
"permission denied" and "operation not permitted", then return
FaultError::InsufficientPrivileges when matched and
FaultError::InjectionFailed(stderr.into_owned()) otherwise; ensure you still use
the existing symbols stderr, FaultError::InsufficientPrivileges, and
FaultError::InjectionFailed so callers remain unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 751ef567-38eb-4922-a97c-f11bf2d824c5

📥 Commits

Reviewing files that changed from the base of the PR and between 94af38d and ffe490a.

📒 Files selected for processing (2)
  • multiversion-tests/src/fault/network.rs
  • multiversion-tests/src/process.rs

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (2)
multiversion-tests/src/config.rs (1)

158-163: ⚠️ Potential issue | 🟠 Major

Preserve unsigned integer precision before float fallback.

At Line 162, falling back directly to as_f64() after as_i64() can silently round large raw JSON integers. This is still risky for cross-version inputs and should be handled before float conversion.

💡 Suggested fix
         serde_json::Value::Number(n) => {
             if let Some(i) = n.as_i64() {
                 Some(Value::Integer(i))
+            } else if let Some(u) = n.as_u64() {
+                if u <= i64::MAX as u64 {
+                    Some(Value::Integer(u as i64))
+                } else {
+                    // toml::Value::Integer is i64-backed; preserve exact value as string
+                    Some(Value::String(u.to_string()))
+                }
             } else {
                 n.as_f64().map(Value::Float)
             }
         }
+    #[test]
+    fn json_to_toml_preserves_large_raw_u64_number() {
+        let large = (i64::MAX as u64) + 1;
+        let json = serde_json::Value::Number(serde_json::Number::from(large));
+        let toml_val = json_to_toml(&json).unwrap();
+        assert_eq!(toml_val.as_str().unwrap(), large.to_string());
+    }
#!/bin/bash
set -euo pipefail

# Verify numeric conversion branch order in json_to_toml
rg -n --type rust -C3 'serde_json::Value::Number|as_i64\(|as_u64\(|as_f64\(' multiversion-tests/src/config.rs

# Verify coverage for large raw u64 numeric input
rg -n --type rust -C2 'i64::MAX\s*as u64|json_to_toml_preserves_large_raw_u64_number|i64::MAX \+ 1' multiversion-tests/src/config.rs

Also applies to: 451-455

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/config.rs` around lines 158 - 163, In the json_to_toml
conversion (match arm for serde_json::Value::Number) you must try n.as_u64()
before falling back to n.as_f64() to avoid silently rounding large unsigned JSON
integers; change the branch order so it checks n.as_i64(), then n.as_u64() and
maps the u64 result appropriately (preserving integer precision) to the TOML
representation instead of immediately calling as_f64(), and only use as_f64() to
produce Value::Float if neither as_i64() nor as_u64() succeed.
multiversion-tests/src/binary.rs (1)

105-107: ⚠️ Potential issue | 🟠 Major

Pre-clean deterministic worktree path before creating a new worktree

Line 105 uses a fixed worktree-{rev} path, but Line 107 creates it directly without any prior cleanup. After an interrupted run, that path remains on disk with git worktree metadata intact. The subsequent git worktree prune (line 106) only cleans stale metadata, not the actual directory, so the next rebuild attempt will fail when git worktree add rejects the existing path.

🔧 Proposed fix
         let worktree_path = self.cache_dir.join(format!("worktree-{rev}"));
         self.prune_stale_worktrees().await;
+        // Best-effort cleanup in case a previous run crashed before removal.
+        let _ = self.remove_worktree(&worktree_path).await;
+        if worktree_path.exists() {
+            let _ = tokio::fs::remove_dir_all(&worktree_path).await;
+        }
         self.create_worktree(git_ref, &worktree_path).await?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/binary.rs` around lines 105 - 107, The deterministic
worktree path created as worktree_path =
self.cache_dir.join(format!("worktree-{rev}")) must be cleaned before calling
create_worktree to avoid git refusing to add an existing directory; update the
sequence around prune_stale_worktrees and create_worktree so you first compute
worktree_path, then run self.prune_stale_worktrees().await, then if
worktree_path.exists() remove it (e.g. via tokio::fs or std::fs remove_dir_all
depending on async context) or run the equivalent git worktree removal, and only
after successful removal call self.create_worktree(git_ref,
&worktree_path).await?; ensure removal errors are handled/propagated
consistently with existing error handling in the function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@multiversion-tests/src/binary.rs`:
- Around line 128-132: The code currently treats any existing path as valid for
env override (PathBuf::from(raw) with path.exists()), which allows directories;
change the existence check to ensure it's a regular file (use path.is_file())
and return the same BinaryError::NotFound { path } (or an appropriate error
variant) when that check fails so the env-override path is validated before
calling resolved(label, "env-override", path).
- Around line 163-170: The subprocess .output().await calls for external
commands (e.g., the git rev-parse spawn using Command::new("git") and other git
worktree and cargo build invocations) are currently unbounded; wrap each await
in a tokio::time::timeout with a reasonable Duration (e.g., few seconds/minutes)
and map timeout errors into a clear Err return so the test harness fails fast;
update the places that run "git worktree" and "cargo build" (the other
Command::new("git") and Command::new("cargo") calls in this file) to use the
same timeout pattern and handle TimeoutElapsed separately from command execution
errors.

In `@xtask/src/main.rs`:
- Around line 523-529: The env vars IRYS_BINARY_NEW and IRYS_BINARY_OLD are
being set after calling sh.push_dir("multiversion-tests"), so any relative paths
in binary_new/binary_old will be resolved against the new CWD and can break; fix
by resolving binary_new and binary_old to absolute paths (e.g., using
std::fs::canonicalize or joining with the original current_dir) before calling
sh.push_dir, then call sh.set_var("IRYS_BINARY_NEW", absolute_path) and
sh.set_var("IRYS_BINARY_OLD", absolute_path) (or alternatively move the
sh.set_var calls to before sh.push_dir) while keeping references to binary_new,
binary_old, sh.push_dir and sh.set_var unchanged.
- Around line 530-536: Replace the use of the glob-style flag in the test
argument construction: in the code that builds test_args (the let mut test_args
= ... and subsequent test_args.extend calls), remove the separate "--test" and
"*" entries and instead push the single idiomatic "--tests" flag so the test
runner uses the explicit --tests option when running integration tests.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 70ee288e-a1e7-44d1-bf64-e260f55aa5a9

📥 Commits

Reviewing files that changed from the base of the PR and between ffe490a and bf8e36e.

📒 Files selected for processing (3)
  • multiversion-tests/src/binary.rs
  • multiversion-tests/src/config.rs
  • xtask/src/main.rs

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
multiversion-tests/src/fault/network.rs (2)

23-29: ⚠️ Potential issue | 🟠 Major

Fix cleanup/setup race by incrementing active injections before chain setup.

Line 45 performs setup_chain() before marking the injection active. During that gap, Line 27 can observe zero and run cleanup from another Drop, removing the chain and causing flaky add_drop_rule failures.

Suggested fix
 impl Drop for NetworkPartitioner {
     fn drop(&mut self) {
-        // Safety cleanup: if no active injections remain, flush the chain.
-        // The primary cleanup path is decrement_and_maybe_cleanup().
-        if ACTIVE_INJECTIONS.load(Ordering::Acquire) == 0 {
-            sync_cleanup_chain();
-        }
+        // No-op: cleanup is coordinated by decrement_and_maybe_cleanup()
+        // on active-injection transitions to zero.
     }
 }
@@
     async fn inject(&self, target: &FaultTarget) -> Result<FaultGuard, FaultError> {
         let (source_port, dest_port) = extract_ports(target)?;
 
-        setup_chain().await?;
         ACTIVE_INJECTIONS.fetch_add(1, Ordering::AcqRel);
+        if let Err(e) = setup_chain().await {
+            decrement_and_maybe_cleanup();
+            return Err(e);
+        }
 
         if let Err(e) = add_drop_rule(source_port, dest_port).await {
             decrement_and_maybe_cleanup();
             return Err(e);
         }

Also applies to: 45-47

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/fault/network.rs` around lines 23 - 29, The race
occurs because setup_chain() is called before marking the injection active,
allowing a concurrent Drop (impl Drop for NetworkPartitioner) to see
ACTIVE_INJECTIONS == 0 and run sync_cleanup_chain(); fix by incrementing the
active counter before any chain setup: call ACTIVE_INJECTIONS.fetch_add(1,
Ordering::AcqRel) (or equivalent increment path used elsewhere) prior to calling
setup_chain(), and keep the existing decrement/decrement_and_maybe_cleanup()
logic in Drop so cleanup still runs when the injection is removed; apply the
same change to the other setup site(s) that call setup_chain() (the paths around
setup_chain() at the creation/setup functions referenced in the diff).

164-181: 🧹 Nitpick | 🔵 Trivial

Consider consolidating extract_ports tests with rstest cases.

These two tests are good, but they can be merged into one parameterized test for less duplication and easier expansion.

As per coding guidelines, "Use rstest for parameterized tests and proptest for property-based tests in Rust".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@multiversion-tests/src/fault/network.rs` around lines 164 - 181, Combine the
two tests into a single parameterized rstest that exercises extract_ports with
both FaultTarget::Network and FaultTarget::Process cases: create a case with
FaultTarget::Network { source_port: 8080, dest_port: 9090 } and assert the
returned (s,d) equals (8080,9090) using unwrap, and a case with
FaultTarget::Process { pid: Pid::from_raw(1) } that asserts extract_ports
returns an Err (is_err); annotate the test with #[rstest] and use #[case]
parameters to pass each FaultTarget variant to the single test function
referencing extract_ports, FaultTarget::Network, FaultTarget::Process, and
Pid::from_raw.
xtask/src/main.rs (1)

523-524: ⚠️ Potential issue | 🟡 Minor

Don’t silently ignore canonicalize failures for binary overrides.

At Line 523 and Line 524, unwrap_or(p.into()) can fall back to a relative path and later resolve against the changed cwd (multiversion-tests), producing hard-to-diagnose failures. Fail fast with context instead of swallowing canonicalization errors.

Proposed fix
-            let binary_new = binary_new.map(|p| std::fs::canonicalize(&p).unwrap_or(p.into()));
-            let binary_old = binary_old.map(|p| std::fs::canonicalize(&p).unwrap_or(p.into()));
+            let binary_new = binary_new
+                .map(|p| {
+                    std::fs::canonicalize(&p).map_err(|e| {
+                        eyre::eyre!("failed to canonicalize --binary-new '{}': {e}", p)
+                    })
+                })
+                .transpose()?;
+            let binary_old = binary_old
+                .map(|p| {
+                    std::fs::canonicalize(&p).map_err(|e| {
+                        eyre::eyre!("failed to canonicalize --binary-old '{}': {e}", p)
+                    })
+                })
+                .transpose()?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@xtask/src/main.rs` around lines 523 - 524, The code currently swallows
canonicalize errors for the binary override maps (binary_new and binary_old) by
using unwrap_or(p.into()), which can allow relative paths to leak and later
fail; change the mapping to fail fast with context instead of falling back: call
std::fs::canonicalize(&p) and propagate or return a descriptive error on failure
(e.g. using anyhow::Context or map_err to include which override path failed and
the original error), so both binary_new.map(...) and binary_old.map(...) return
a canonicalized PathBuf or an early error with clear context about the failing
override.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@multiversion-tests/src/binary.rs`:
- Around line 74-75: The cache check treats mere existence of the path
referenced by cached and accepted by resolved("new", rev, cached) as a valid hit
but the writer later writes directly to that final path (line ~224), risking
partial/truncated files being reused; change the cache publishing to write to a
temporary file in the same directory (e.g., cached + ".tmp" or
cached.with_extension(".tmp")), fully complete the write/copy to that temp, then
atomically rename (fs::rename) the temp into cached and only then consider the
artifact valid; update the cache-reader logic to still check cached.exists() (or
optionally a small atomic marker like cached+".complete") but only accept files
that were created by the atomic rename, and ensure the writer cleans up the temp
file on failure so truncated temps are never promoted to cached.

---

Duplicate comments:
In `@multiversion-tests/src/fault/network.rs`:
- Around line 23-29: The race occurs because setup_chain() is called before
marking the injection active, allowing a concurrent Drop (impl Drop for
NetworkPartitioner) to see ACTIVE_INJECTIONS == 0 and run sync_cleanup_chain();
fix by incrementing the active counter before any chain setup: call
ACTIVE_INJECTIONS.fetch_add(1, Ordering::AcqRel) (or equivalent increment path
used elsewhere) prior to calling setup_chain(), and keep the existing
decrement/decrement_and_maybe_cleanup() logic in Drop so cleanup still runs when
the injection is removed; apply the same change to the other setup site(s) that
call setup_chain() (the paths around setup_chain() at the creation/setup
functions referenced in the diff).
- Around line 164-181: Combine the two tests into a single parameterized rstest
that exercises extract_ports with both FaultTarget::Network and
FaultTarget::Process cases: create a case with FaultTarget::Network {
source_port: 8080, dest_port: 9090 } and assert the returned (s,d) equals
(8080,9090) using unwrap, and a case with FaultTarget::Process { pid:
Pid::from_raw(1) } that asserts extract_ports returns an Err (is_err); annotate
the test with #[rstest] and use #[case] parameters to pass each FaultTarget
variant to the single test function referencing extract_ports,
FaultTarget::Network, FaultTarget::Process, and Pid::from_raw.

In `@xtask/src/main.rs`:
- Around lines 523-524: The code currently swallows canonicalize errors for the
binary override paths (binary_new and binary_old) by using unwrap_or(p.into()),
which lets relative paths leak through and fail later. Change the mapping to
fail fast with context instead of falling back: call std::fs::canonicalize(&p)
and propagate a descriptive error on failure (e.g. via anyhow::Context or
map_err, naming which override path failed and including the original error),
so both binary_new.map(...) and binary_old.map(...) yield either a canonicalized
PathBuf or an early error with clear context about the failing override.
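The fail-fast mapping can be sketched with std alone (no anyhow). resolve_override is a hypothetical helper, not the actual xtask code; it shows the canonicalize-or-error shape the review asks for in place of unwrap_or(p.into()).

```rust
use std::fs;
use std::path::PathBuf;

// Hypothetical helper: resolve an optional binary-override path, failing fast
// with context instead of silently falling back to the relative path.
fn resolve_override(name: &str, path: Option<PathBuf>) -> Result<Option<PathBuf>, String> {
    path.map(|p| {
        fs::canonicalize(&p)
            // Name the failing override and keep the original I/O error.
            .map_err(|e| format!("failed to canonicalize {name} override {}: {e}", p.display()))
    })
    .transpose() // Option<Result<..>> -> Result<Option<..>>
}

fn main() {
    // An existing directory canonicalizes to an absolute path.
    let ok = resolve_override("binary_new", Some(PathBuf::from("."))).unwrap();
    assert!(ok.unwrap().is_absolute());

    // A missing path yields a descriptive error naming the override.
    let err = resolve_override("binary_old", Some(PathBuf::from("/definitely/not/here"))).unwrap_err();
    assert!(err.contains("binary_old"));
}
```

With anyhow in scope, the map_err call would typically become .with_context(|| ...) and the function would return anyhow::Result, but the control flow is the same.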

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 99ccdd01-12eb-488d-b1dd-ce3b6c83a17c

📥 Commits

Reviewing files that changed from the base of the PR and between bf8e36e and 76a8a1d.

📒 Files selected for processing (3)
  • multiversion-tests/src/binary.rs
  • multiversion-tests/src/fault/network.rs
  • xtask/src/main.rs


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@multiversion-tests/src/binary.rs`:
- Around lines 117-124: The current flow can let remove_worktree errors
overshadow an earlier build_at error. Store the result of build_at
(self.build_at) in build_result, always attempt worktree removal (calling
remove_worktree and prune_stale_worktrees on worktree_path as now), but when
both build_result and the removal are Err, return the original build error from
build_result instead of the WorktreeRemoval error.
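The error-priority rule can be sketched with hypothetical stand-ins (build_at, remove_worktree, resolve_binary are placeholders for the crate's async methods, flattened to sync functions): when both the build and the cleanup fail, the build error is the one propagated.

```rust
// Hypothetical stand-ins, both failing, to exercise the priority rule.
fn build_at() -> Result<(), String> {
    Err("build failed".into())
}

fn remove_worktree() -> Result<(), String> {
    Err("worktree removal failed".into())
}

fn resolve_binary() -> Result<(), String> {
    let build_result = build_at();
    // Always attempt cleanup, but only surface its error when the build
    // itself succeeded; otherwise the build error is the actionable one.
    if let Err(removal_err) = remove_worktree() {
        if build_result.is_ok() {
            return Err(removal_err);
        }
    }
    build_result
}

fn main() {
    // Both fail: the build error wins.
    assert_eq!(resolve_binary().unwrap_err(), "build failed");
}
```

This matches the intent of the proposed fix below: log the removal failure, but never let it mask a build failure.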

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a3330365-da78-4290-a0e2-e063dfe94907

📥 Commits

Reviewing files that changed from the base of the PR and between 76a8a1d and ee179d5.

📒 Files selected for processing (1)
  • multiversion-tests/src/binary.rs

Comment on lines +117 to +124
        let build_result = self.build_at(&worktree_path, &cached).await;
        if let Err(e) = self.remove_worktree(&worktree_path).await {
            tracing::warn!(error = %e, path = %worktree_path.display(),
                "worktree removal failed, attempting prune and retry");
            self.prune_stale_worktrees().await;
            self.remove_worktree(&worktree_path).await?;
        }
        build_result?;

⚠️ Potential issue | 🟡 Minor

Worktree removal error can mask the original build failure.

If build_at fails and remove_worktree also fails (even after prune+retry), line 122 propagates the WorktreeRemoval error, and line 124 is never reached. This masks the actual build error, making debugging harder.

🔧 Proposed fix to preserve build error priority
         let build_result = self.build_at(&worktree_path, &cached).await;
         if let Err(e) = self.remove_worktree(&worktree_path).await {
             tracing::warn!(error = %e, path = %worktree_path.display(),
                 "worktree removal failed, attempting prune and retry");
             self.prune_stale_worktrees().await;
-            self.remove_worktree(&worktree_path).await?;
+            if let Err(e2) = self.remove_worktree(&worktree_path).await {
+                tracing::error!(error = %e2, path = %worktree_path.display(),
+                    "worktree removal failed after prune, manual cleanup may be needed");
+                // Don't mask build_result - it's more actionable for the user
+            }
         }
         build_result?;
