Skip to content

feat(simulation): add deterministic replicated cluster harness#61

Merged
skel84 merged 4 commits intomainfrom
issue-50-deterministic-cluster-harness
Mar 13, 2026
Merged

feat(simulation): add deterministic replicated cluster harness#61
skel84 merged 4 commits intomainfrom
issue-50-deterministic-cluster-harness

Conversation

@skel84
Copy link
Owner

@skel84 skel84 commented Mar 13, 2026

Summary

Build the deterministic three-replica cluster harness for the replicated prototype.

Linked Issue

Closes #50

Changes

  • add a test-only replicated harness that hosts three real ReplicaNodes with independent durable workspaces
  • add an explicit protocol-message queue, connectivity matrix, crash/restart controls, and seeded schedule exploration transcript support
  • expose the small shared simulation helpers needed to reuse the seeded slot driver infrastructure
  • add replicated harness coverage for bootstrap, transcript reproducibility, partition healing, and crash/restart workspace stability
  • align docs/testing.md and docs/status.md with the shipped harness surface and the next milestone focus

Validation

  • scripts/preflight.sh
  • cargo test -p allocdb-node replicated_simulation -- --nocapture

Docs

  • docs updated if behavior, invariants, failure modes, or operator semantics changed

CodeRabbit Triage

  • CodeRabbit status completed
  • requested @coderabbitai summary if no visible review comment or thread appeared
  • applied relevant correctness, safety, recovery, testing, and docs suggestions
  • documented any intentionally rejected suggestions

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

Warning

Rate limit exceeded

@skel84 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 5 minutes and 28 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 95311fe5-4a92-4df6-8b64-8230bbff5d65

📥 Commits

Reviewing files that changed from the base of the PR and between 3c68340 and e220889.

📒 Files selected for processing (1)
  • crates/allocdb-node/src/replicated_simulation.rs

Walkthrough

Adds a deterministic three-replica test harness and integration tests: a ReplicatedSimulationHarness running three real ReplicaNodes with independent durable workspaces, a shared seeded slot driver, explicit protocol-message queue and connectivity matrix, driver actions (queue/deliver/drop, set_connectivity, crash/restart, explore_schedule), and replayable transcripts.

Changes

Cohort / File(s) Summary
Replicated simulation harness & tests
crates/allocdb-node/src/replicated_simulation.rs, crates/allocdb-node/src/replicated_simulation_tests.rs
Adds a deterministic 3-replica harness and extensive integration tests: HarnessReplica, ReplicatedSimulationHarness, ConnectivityMatrix, QueuedProtocolMessage, schedule action/observation enums, deterministic schedule exploration/execution, transcript recording, and lifecycle/message-delivery semantics.
Module registration & visibility
crates/allocdb-node/src/lib.rs, crates/allocdb-node/src/simulation.rs
Registers test-only modules (replicated_simulation, simulation) under #[cfg(test)] and increases visibility of SimulatedSlotDriver::choose_index and simulation_workspace_path to pub(crate) for harness access.
Documentation
docs/status.md, docs/testing.md
Documents the deterministic 3-replica harness, driver actions, connectivity/messaging semantics, adds replicated_simulation test target, and updates testing guidance to reflect the new harness surface.

Sequence Diagram

sequenceDiagram
    participant Test as Test/Driver
    participant Harness as ReplicatedSimulationHarness
    participant SlotDriver as SimulatedSlotDriver
    participant Queue as Message Queue
    participant R1 as Replica 1
    participant R2 as Replica 2
    participant R3 as Replica 3

    Test->>Harness: new(seed, configs) 
    Harness->>R1: open & start
    Harness->>R2: open & start
    Harness->>R3: open & start
    Harness->>SlotDriver: init(seed)

    Test->>Harness: queue_protocol_message(from,to,label)
    Harness->>Queue: enqueue(message)

    Test->>Harness: set_connectivity(a,b,allowed)
    Harness->>Harness: update ConnectivityMatrix

    Test->>Harness: explore_schedule(actions)
    loop for each resolved action (slot ordered)
        Harness->>SlotDriver: resolve_schedule_slot()
        alt QueueProtocolMessage
            Harness->>Queue: push message
        else DeliverProtocolMessage
            Harness->>Queue: pop candidate
            Harness->>Harness: connectivity_allows?
            alt allowed
                Harness->>R?: deliver message
            else blocked
                Harness-->>Queue: retain message
            end
        else DropProtocolMessage
            Harness->>Queue: remove message
        else CrashReplica
            Harness->>R?: stop & mark Crashed
        else RestartReplica
            Harness->>R?: recover & mark Running
        end
    end

    Test->>Harness: replica_observation(id)
    Harness-->>Test: ReplicaObservation
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(simulation): add deterministic replicated cluster harness' clearly and concisely describes the primary change: adding a deterministic replicated cluster harness to the simulation module.
Description check ✅ Passed The PR description follows the template with all required sections: Summary, Linked Issue, Changes, Validation (with checkmarks), Docs (with checkmarks), and CodeRabbit Triage. Content is complete and detailed.
Linked Issues check ✅ Passed The PR implementation fully addresses all requirements from issue #50: hosts three real replicas with independent workspaces, provides explicit message queue and connectivity matrix, makes delivery/partitioning/crashes/restarts controllable, records replayable transcripts, preserves seed-and-transcript discipline, and maintains aligned docs.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #50 requirements: test modules for the harness, simulation helpers exposure, documentation updates, and no unrelated code modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch issue-50-deterministic-cluster-harness
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@skel84
Copy link
Owner Author

skel84 commented Mar 13, 2026

@coderabbitai summary

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

✅ Actions performed

Summary regeneration triggered.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
crates/allocdb-node/src/replicated_simulation.rs (1)

330-359: Consider adding debug logging for message queue operations.

Per coding guidelines, logging materially improves debuggability. Adding debug! or trace! logging for queue/deliver/drop operations would help trace message flow during test debugging, especially when investigating non-determinism or unexpected transcript differences.

Example locations:

  • After successfully queuing a message (line 354)
  • After delivering/dropping a message

This is optional for test harness code but would aid in diagnosing schedule exploration issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/replicated_simulation.rs` around lines 330 - 359, Add
debug logging to trace protocol message queue operations: inside
queue_protocol_message (and similarly in the deliver/drop handlers), log when a
message is successfully queued/delivered/dropped including the
QueuedProtocolMessage fields (message_id, label, from, to), the updated
pending_messages.len() and next_message_id; emit at debug or trace level and
include context like "ProtocolMessageQueued" to align with
ReplicatedScheduleObservationKind::ProtocolMessageQueued so logs map to
observations for easier test debugging.
crates/allocdb-node/src/replicated_simulation_tests.rs (1)

1-167: Consider adding negative-path tests for harness error conditions.

Per coding guidelines favoring negative-path tests, consider adding coverage for:

  • UnknownReplica error when accessing invalid replica ID
  • ReplicaAlreadyCrashed error when queuing from or delivering to a crashed replica
  • ReplicaAlreadyRunning error on double-restart

These would strengthen confidence in the harness error handling, though they are optional given the current coverage of the main acceptance criteria.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/replicated_simulation_tests.rs` around lines 1 - 167,
Add negative-path unit tests in replicated_simulation_tests.rs that exercise
harness error handling: create small schedules (use
run_schedule/transcript_for_seed helpers or directly
ReplicatedSimulationHarness::new and explore_schedule) that trigger and assert
the specific errors UnknownReplica, ReplicaAlreadyCrashed, and
ReplicaAlreadyRunning; e.g., call queue_message or deliver_message with an
invalid replica id to expect UnknownReplica, queue or deliver to a replica after
invoking crash_replica to expect ReplicaAlreadyCrashed, and call restart_replica
twice (or restart a running replica) to expect ReplicaAlreadyRunning; assert the
returned ReplicatedSimulationError variant for each case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/allocdb-node/src/replicated_simulation_tests.rs`:
- Around line 1-167: Add negative-path unit tests in
replicated_simulation_tests.rs that exercise harness error handling: create
small schedules (use run_schedule/transcript_for_seed helpers or directly
ReplicatedSimulationHarness::new and explore_schedule) that trigger and assert
the specific errors UnknownReplica, ReplicaAlreadyCrashed, and
ReplicaAlreadyRunning; e.g., call queue_message or deliver_message with an
invalid replica id to expect UnknownReplica, queue or deliver to a replica after
invoking crash_replica to expect ReplicaAlreadyCrashed, and call restart_replica
twice (or restart a running replica) to expect ReplicaAlreadyRunning; assert the
returned ReplicatedSimulationError variant for each case.

In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 330-359: Add debug logging to trace protocol message queue
operations: inside queue_protocol_message (and similarly in the deliver/drop
handlers), log when a message is successfully queued/delivered/dropped including
the QueuedProtocolMessage fields (message_id, label, from, to), the updated
pending_messages.len() and next_message_id; emit at debug or trace level and
include context like "ProtocolMessageQueued" to align with
ReplicatedScheduleObservationKind::ProtocolMessageQueued so logs map to
observations for easier test debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 53fb0dcd-b357-4707-8841-4f3b1716e2aa

📥 Commits

Reviewing files that changed from the base of the PR and between 722d50c and abbd1d6.

📒 Files selected for processing (6)
  • crates/allocdb-node/src/lib.rs
  • crates/allocdb-node/src/replicated_simulation.rs
  • crates/allocdb-node/src/replicated_simulation_tests.rs
  • crates/allocdb-node/src/simulation.rs
  • docs/status.md
  • docs/testing.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/lib.rs
  • crates/allocdb-node/src/simulation.rs
  • crates/allocdb-node/src/replicated_simulation_tests.rs
  • crates/allocdb-node/src/replicated_simulation.rs
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.

Files:

  • docs/testing.md
  • docs/status.md
docs/status.md

📄 CodeRabbit inference engine (AGENTS.md)

Keep docs/status.md current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.

Files:

  • docs/status.md
🧠 Learnings (1)
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/replicated_simulation_tests.rs
🧬 Code graph analysis (1)
crates/allocdb-node/src/replicated_simulation.rs (3)
crates/allocdb-node/src/simulation.rs (1)
  • simulation_workspace_path (450-456)
crates/allocdb-node/src/replica_tests.rs (1)
  • identity (58-63)
crates/allocdb-node/src/replica.rs (1)
  • paths (601-603)
🔇 Additional comments (13)
crates/allocdb-node/src/simulation.rs (1)

137-144: LGTM!

The visibility changes to pub(crate) for choose_index and simulation_workspace_path are appropriate. This enables the new replicated_simulation harness to reuse the seeded slot driver infrastructure and workspace path generation without exposing these helpers publicly.

Also applies to: 450-456

docs/status.md (1)

37-42: LGTM!

Documentation updates accurately reflect the new replicated simulation harness:

  • The validated chunk description now includes the three-replica harness with message queue and connectivity matrix
  • "What Exists" section properly documents the harness capabilities and real ReplicaNode::recover usage
  • Test target for replicated_simulation is added
  • Current focus correctly references using the M7-T02 harness for quorum work

Also applies to: 137-142, 152-152, 159-160

crates/allocdb-node/src/replicated_simulation.rs (5)

213-255: LGTM!

The new() method properly initializes three real replicas with independent workspaces. Each replica gets a unique identity derived from its index, and paths are correctly generated using the shared simulation_workspace_path helper. The error handling via map_err propagates ReplicaOpenError appropriately.


404-418: LGTM!

The crash_replica method correctly:

  1. Captures the before observation
  2. Validates the replica is currently running
  3. Drops the node to simulate crash (losing volatile state)
  4. Captures the after observation showing Crashed status

The state transition is properly recorded in the observation for transcript replay.


420-439: LGTM!

The restart_replica method properly uses ReplicaNode::recover to restart from durable state, matching the documented behavior of restart through the real recovery path. The before/after observations correctly capture the state transition.


441-481: LGTM!

The explore_schedule method correctly implements deterministic schedule exploration:

  1. Validates unique labels
  2. Resolves candidate slots using the seeded driver
  3. Sorts by (slot, label) for deterministic ordering
  4. Selects among ready actions at each slot using seeded choice
  5. Records transcript for reproducibility

This mirrors the single-node explore_schedule pattern appropriately.


576-604: LGTM!

The Drop implementation properly cleans up all replica artifacts (WAL, snapshot, metadata, and temp metadata files). The remove_if_exists helper correctly handles the NotFound case without panicking, which is appropriate for cleanup of files that may not exist.

docs/testing.md (1)

203-218: LGTM!

The documentation update accurately describes the M7-T02 implementation surface. The enumerated driver actions match what is implemented in replicated_simulation.rs, and the note about protocol payloads being opaque labels (with M7-T03 adding real semantics) correctly sets expectations for the current milestone scope.

crates/allocdb-node/src/replicated_simulation_tests.rs (4)

169-209: LGTM!

This test verifies the core acceptance criterion: three real replicas with independent workspaces. It properly validates:

  • Each replica is Active with correct identity
  • Metadata files exist on disk
  • Paths are distinct across replicas
  • Connectivity matrix is fully connected by default

211-304: LGTM!

This is a strong invariant test that validates the core reproducibility property:

  • Same seed produces identical transcripts
  • Different seed produces different transcripts
  • Slot sequence is deterministic
  • All observation kinds (delivery, connectivity, crash, restart) are properly recorded

The use of first_distinct_seed to find an alternate seed that produces a different transcript is a good testing pattern.


306-376: LGTM!

Good negative-path test for the connectivity matrix:

  • Validates blocked delivery returns the correct error
  • Validates message remains pending after blocked delivery
  • Validates drop removes the message
  • Validates healed connectivity allows subsequent delivery

This covers the partition-and-heal scenario family described in the testing documentation.


378-422: LGTM!

Good recovery test validating workspace stability across crash/restart:

  • Paths remain identical before and after restart
  • State transitions are properly observed (Running → Crashed → Running)
  • Restarted replica returns to Active status with expected view/LSN state
crates/allocdb-node/src/lib.rs (1)

5-8: LGTM!

The module declarations are properly gated with #[cfg(test)] and use pub(crate) visibility. This correctly exposes the simulation harnesses only during test compilation without polluting the public API.

@skel84
Copy link
Owner Author

skel84 commented Mar 13, 2026

Applied the CodeRabbit follow-up items on f825f45:

  • added negative-path replicated harness coverage for UnknownReplica, ReplicaAlreadyCrashed, and ReplicaAlreadyRunning
  • added debug! tracing for protocol message queue, deliver, and drop operations

Reran:

  • cargo test -p allocdb-node replicated_simulation -- --nocapture
  • cargo test -p allocdb-node
  • ./scripts/preflight.sh

@skel84
Copy link
Owner Author

skel84 commented Mar 13, 2026

@coderabbitai summary

@coderabbitai
Copy link

coderabbitai bot commented Mar 13, 2026

✅ Actions performed

Summary regeneration triggered.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
crates/allocdb-node/src/replicated_simulation_tests.rs (1)

306-376: Add a missing-label negative-path regression.

This module covers most harness errors, but neither deliver_protocol_message nor drop_protocol_message is exercised with an unknown message label. That leaves ReplicatedSimulationError::MessageNotFound unguarded in the dedicated harness suite. As per coding guidelines, **/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/replicated_simulation_tests.rs` around lines 306 -
376, Extend the test suite to cover the negative path where a non-existent
protocol message label is used: call
harness.deliver_protocol_message("unknown-label") and
harness.drop_protocol_message("unknown-label") and assert they return
Err(ReplicatedSimulationError::MessageNotFound { label }) (matching the label
string); add this either as a new test or append to
connectivity_matrix_controls_delivery_until_partition_heals after existing
happy-path checks, referencing the harness methods deliver_protocol_message and
drop_protocol_message and the error enum
ReplicatedSimulationError::MessageNotFound to locate the logic to exercise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 47-52: QueuedProtocolMessage currently holds only message_id,
label, from, to and deliver_protocol_message only removes the entry and
snapshots state; change the queued item to carry the real protocol payload or an
executable delivery closure (e.g., include a boxed enum/struct like
ProtocolPayload or Box<dyn FnOnce(&mut ReplicaNode)>) so that when
deliver_protocol_message runs it actually invokes the recipient's handler (apply
payload to the target ReplicaNode identified by to) before recording the
observation; update the queue producer (where outbound messages are enqueued) to
store the actual payload/action and update any code paths that construct
QueuedProtocolMessage (and the consumer logic referenced around
deliver_protocol_message and the ReplicaNode delivery code at lines ~368-402) to
execute the payload against the recipient ReplicaNode, then snapshot/record the
observation after successful delivery.
- Around line 595-640: The harness Drop currently calls remove_if_exists which
panics on unexpected I/O errors; change remove_if_exists so it does not panic
during teardown: in function remove_if_exists(Path) handle Ok and NotFound as
no-ops and on other Err call warn! (or the crate logger) with a clear message
including path.display() and the error instead of panic; ensure the function is
used by the ReplicatedSimulationHarness Drop impl as-is so teardown is
best-effort and does not abort if tests are already unwinding.

---

Nitpick comments:
In `@crates/allocdb-node/src/replicated_simulation_tests.rs`:
- Around line 306-376: Extend the test suite to cover the negative path where a
non-existent protocol message label is used: call
harness.deliver_protocol_message("unknown-label") and
harness.drop_protocol_message("unknown-label") and assert they return
Err(ReplicatedSimulationError::MessageNotFound { label }) (matching the label
string); add this either as a new test or append to
connectivity_matrix_controls_delivery_until_partition_heals after existing
happy-path checks, referencing the harness methods deliver_protocol_message and
drop_protocol_message and the error enum
ReplicatedSimulationError::MessageNotFound to locate the logic to exercise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fb0c66ad-bb27-4d19-acad-2083e29147dc

📥 Commits

Reviewing files that changed from the base of the PR and between abbd1d6 and f825f45.

📒 Files selected for processing (2)
  • crates/allocdb-node/src/replicated_simulation.rs
  • crates/allocdb-node/src/replicated_simulation_tests.rs
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/replicated_simulation.rs
  • crates/allocdb-node/src/replicated_simulation_tests.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Treat CodeRabbit as part of the required review path when it is enabled on the repository. Wait for its status to complete before merge. If it completes without a visible review comment or review thread, request visible output with `coderabbitai summary`. Address every substantive CodeRabbit comment explicitly before merge by either applying the change or documenting why it is not being applied. Apply correctness, safety, recovery, test, and docs-alignment feedback by default; document why you reject suggestions that would weaken determinism, boundedness, or trusted-core discipline.
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/replicated_simulation_tests.rs
🧬 Code graph analysis (2)
crates/allocdb-node/src/replicated_simulation.rs (2)
crates/allocdb-node/src/simulation.rs (1)
  • simulation_workspace_path (450-456)
crates/allocdb-node/src/replica.rs (1)
  • paths (601-603)
crates/allocdb-node/src/replicated_simulation_tests.rs (1)
crates/allocdb-node/src/replicated_simulation.rs (7)
  • replica (266-271)
  • set_connectivity (319-329)
  • crash_replica (423-437)
  • restart_replica (439-458)
  • new (215-256)
  • connectivity_allows (309-317)
  • pending_messages (305-307)
🔇 Additional comments (1)
crates/allocdb-node/src/replicated_simulation_tests.rs (1)

211-304: Strong determinism regression coverage.

Same-seed equality, alternate-seed divergence, and explicit crash/restart/connectivity assertions make this a solid replayability test for the harness.

@skel84
Copy link
Owner Author

skel84 commented Mar 13, 2026

Applied the latest CodeRabbit follow-up on 3c68340:

  • added the missing negative-path regression for ReplicatedSimulationError::MessageNotFound on unknown deliver/drop labels
  • changed replicated-harness teardown cleanup to warn instead of panic on unexpected file-removal errors

Not applied:

  • I did not change QueuedProtocolMessage to execute real protocol payloads during deliver_protocol_message. For M7-T02, the harness is intentionally limited to opaque queued labels and deterministic transcript control, as documented in docs/testing.md; folding real message execution into this PR would collapse M7-T03 quorum/protocol work into the harness milestone and weaken task boundaries rather than fix a correctness bug in the current scope.

Reran:

  • cargo test -p allocdb-node replicated_simulation -- --nocapture
  • ./scripts/preflight.sh

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
crates/allocdb-node/src/replicated_simulation.rs (1)

423-458: Consider adding info-level logging for crash and restart lifecycle events.

crash_replica and restart_replica represent significant replica lifecycle transitions. Adding info! logs would improve operational clarity and debugging when reviewing transcripts, consistent with the debug! logging already added for message queue events.

♻️ Suggested logging additions
     pub(crate) fn crash_replica(
         &mut self,
         replica_id: ReplicaId,
     ) -> Result<ReplicatedScheduleObservationKind, ReplicatedSimulationError> {
         let before = self.replica_observation(replica_id)?;
         {
             let replica = self.replica_entry_mut(replica_id)?;
             if replica.node.is_none() {
                 return Err(ReplicatedSimulationError::ReplicaAlreadyCrashed(replica_id));
             }
             drop(replica.node.take());
         }
         let after = self.replica_observation(replica_id)?;
+        log::info!("ReplicaCrashed: replica_id={}", replica_id.get());
         Ok(ReplicatedScheduleObservationKind::ReplicaCrashed { before, after })
     }

     pub(crate) fn restart_replica(
         &mut self,
         replica_id: ReplicaId,
     ) -> Result<ReplicatedScheduleObservationKind, ReplicatedSimulationError> {
         // ... existing code ...
         self.replica_entry_mut(replica_id)?.node = Some(recovered);
         let after = self.replica_observation(replica_id)?;
+        log::info!("ReplicaRestarted: replica_id={}", replica_id.get());
         Ok(ReplicatedScheduleObservationKind::ReplicaRestarted { before, after })
     }

Based on learnings: "Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: ... info for meaningful lifecycle and state-transition events."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/allocdb-node/src/replicated_simulation.rs` around lines 423 - 458, Add
info-level logging to both crash_replica and restart_replica to record lifecycle
transitions: inside crash_replica, log an info message before/after taking the
node down (include replica_id and the before/after
ReplicatedScheduleObservationKind or key fields) and also log when returning
ReplicaAlreadyCrashed; inside restart_replica, log an info message when starting
recovery and after successful recovery (include replica_id and before/after
observations and any identity/paths metadata helpful for tracing). Use the
existing logging macro style (info!) consistent with debug! usage and ensure
errors still map to ReplicatedSimulationError variants (ReplicaAlreadyCrashed,
ReplicaAlreadyRunning, RecoverReplica) without changing error flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 423-458: Add info-level logging to both crash_replica and
restart_replica to record lifecycle transitions: inside crash_replica, log an
info message before/after taking the node down (include replica_id and the
before/after ReplicatedScheduleObservationKind or key fields) and also log when
returning ReplicaAlreadyCrashed; inside restart_replica, log an info message
when starting recovery and after successful recovery (include replica_id and
before/after observations and any identity/paths metadata helpful for tracing).
Use the existing logging macro style (info!) consistent with debug! usage and
ensure errors still map to ReplicatedSimulationError variants
(ReplicaAlreadyCrashed, ReplicaAlreadyRunning, RecoverReplica) without changing
error flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 80620483-dcc6-4bb6-ba2f-cf934782dfc1

📥 Commits

Reviewing files that changed from the base of the PR and between f825f45 and 3c68340.

📒 Files selected for processing (2)
  • crates/allocdb-node/src/replicated_simulation.rs
  • crates/allocdb-node/src/replicated_simulation_tests.rs
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: error for invariant breaks, corruption, and failed operations that require intervention; warn for degraded but expected conditions such as overload, lag, or rejected requests; info for meaningful lifecycle and state-transition events; debug for detailed execution traces useful in development; trace only for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.

Files:

  • crates/allocdb-node/src/replicated_simulation_tests.rs
  • crates/allocdb-node/src/replicated_simulation.rs
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.

Applied to files:

  • crates/allocdb-node/src/replicated_simulation_tests.rs
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: `error` for invariant breaks, corruption, and failed operations that require intervention; `warn` for degraded but expected conditions such as overload, lag, or rejected requests; `info` for meaningful lifecycle and state-transition events; `debug` for detailed execution traces useful in development; `trace` only for very high-volume diagnostic detail.

Applied to files:

  • crates/allocdb-node/src/replicated_simulation.rs
🔇 Additional comments (18)
crates/allocdb-node/src/replicated_simulation.rs (12)

1-22: LGTM!

Module setup, imports, constants, and test module reference are well organized. The REPLICA_COUNT and ENDPOINT_COUNT constants provide clear configuration for the three-replica harness.


24-44: LGTM!

The ClusterEndpoint, ReplicaRuntimeStatus, and ReplicaObservation types provide a clean abstraction for representing cluster topology and replica state. The observation struct captures all relevant metadata for transcript recording.


46-52: Delivery path is bookkeeping-only; consider adding actual protocol dispatch in a future iteration.

This concern was raised in a previous review: QueuedProtocolMessage captures only labels and replica IDs, and deliver_protocol_message (lines 368-402) removes the entry and snapshots state without invoking the recipient ReplicaNode. The current design provides test scaffolding for scheduling and transcript recording, but protocol handler correctness cannot be validated until real payloads are dispatched.


54-122: LGTM!

Schedule action and observation types are well-designed. The action kinds cover all harness operations (queue/deliver/drop messages, connectivity, crash/restart), and the observation variants capture the relevant before/after state for transcript replay.


124-166: LGTM!

Error types are comprehensive with clear Display implementations. All error variants (OpenReplica, RecoverReplica, UnknownReplica, ReplicaAlreadyCrashed, ReplicaAlreadyRunning, MessageNotFound, MessageDeliveryBlocked) cover the expected failure modes.


168-212: LGTM!

Internal types (HarnessReplica, ConnectivityMatrix, ResolvedReplicatedAction) and the main ReplicatedSimulationHarness struct are well-structured. The connectivity matrix with endpoint indexing supports asymmetric partition modeling.


214-256: LGTM!

Harness initialization correctly creates three replicas with independent workspaces, initializes the seeded slot driver, and starts with a fully-connected connectivity matrix.


258-329: LGTM!

State query methods (current_slot, replica, replica_paths, replica_observation, pending_messages, connectivity_allows) and connectivity mutation (set_connectivity) are clean and correctly validate replica IDs.


331-421: LGTM!

Message queue operations (queue_protocol_message, deliver_protocol_message, drop_protocol_message) correctly validate sender/recipient state, check connectivity for delivery, and include structured debug logging via log_protocol_message_event. The unique-label assertion provides clear invariant enforcement.


460-547: LGTM!

explore_schedule correctly resolves candidate slots deterministically, sorts by (slot, label) for stable ordering, and uses the seeded slot driver for tie-breaking. The unique-label and slot validation assertions enforce harness invariants. resolve_schedule_slot deduplicates candidates before selection.


549-592: LGTM!

Helper methods for message and replica lookup are clean. pending_message_index, endpoint_index, replica_index, replica_entry, and replica_entry_mut provide consistent error handling via ReplicatedSimulationError.


595-643: LGTM!

The Drop implementation properly cleans up replica nodes and workspace artifacts. remove_if_exists now logs warnings instead of panicking on unexpected I/O errors, making teardown best-effort and preventing test-abort scenarios during unwinding.

crates/allocdb-node/src/replicated_simulation_tests.rs (6)

1-167: LGTM!

Well-designed test helpers reduce boilerplate and improve readability. core_config, engine_config, and the action builders (queue_message, deliver_message, drop_message, set_connectivity, crash_replica, restart_replica) provide a clean DSL for test authoring. run_schedule, transcript_for_seed, first_distinct_seed, and delivered_message support transcript-based assertions.


169-209: LGTM!

replicated_harness_bootstraps_three_real_replicas_with_independent_workspaces thoroughly validates the bootstrap invariants: all three replicas are active, have distinct identities, have independent workspace paths, and the connectivity matrix starts fully connected.


211-304: LGTM!

replicated_schedule_transcript_is_reproducible provides comprehensive coverage of the deterministic transcript requirement: identical seeds produce identical transcripts, different seeds produce different transcripts, and individual observation types (delivered, connectivity, crash, restart) are validated. The slot sequence assertion confirms scheduling order.


306-390: LGTM!

connectivity_matrix_controls_delivery_until_partition_heals exercises the partition and healing flow with good negative-path coverage: blocked delivery returns MessageDeliveryBlocked, dropped messages clear the queue, and healed connectivity allows delivery. The trailing assertions for MessageNotFound errors validate missing-label handling.


392-457: LGTM!

Negative-path tests invalid_replica_access_returns_unknown_replica, queue_and_deliver_reject_crashed_replicas, and restart_rejects_running_replica comprehensively validate error conditions. These align with the coding guideline to favor negative-path tests.


459-503: LGTM!

crash_and_restart_keep_replica_workspace_stable validates the recovery invariant: workspace paths remain stable across crash/restart cycles, and the restarted replica returns to Active status with expected initial metadata. This aligns with the coding guideline to favor recovery tests.

@skel84
Copy link
Owner Author

skel84 commented Mar 13, 2026

Applied the remaining lifecycle-logging nitpick in crates/allocdb-node/src/replicated_simulation.rs.

Changed:

  • added info! logging for crash/restart lifecycle transitions
  • logged rejected crash/restart requests with the current replica observation
  • logged recovery start with replica identity and durable path context

Reran:

  • cargo test -p allocdb-node replicated_simulation -- --nocapture
  • ./scripts/preflight.sh

@skel84 skel84 merged commit 1c0a2f1 into main Mar 13, 2026
2 checks passed
@skel84 skel84 deleted the issue-50-deterministic-cluster-harness branch March 13, 2026 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

M7-T02 Build deterministic 3-replica cluster harness

1 participant