feat(simulation): add deterministic replicated cluster harness#61
feat(simulation): add deterministic replicated cluster harness#61
Conversation
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughAdds a deterministic three-replica test harness and integration tests: a ReplicatedSimulationHarness running three real ReplicaNodes with independent durable workspaces, a shared seeded slot driver, explicit protocol-message queue and connectivity matrix, driver actions (queue/deliver/drop, set_connectivity, crash/restart, explore_schedule), and replayable transcripts. Changes
Sequence DiagramsequenceDiagram
participant Test as Test/Driver
participant Harness as ReplicatedSimulationHarness
participant SlotDriver as SimulatedSlotDriver
participant Queue as Message Queue
participant R1 as Replica 1
participant R2 as Replica 2
participant R3 as Replica 3
Test->>Harness: new(seed, configs)
Harness->>R1: open & start
Harness->>R2: open & start
Harness->>R3: open & start
Harness->>SlotDriver: init(seed)
Test->>Harness: queue_protocol_message(from,to,label)
Harness->>Queue: enqueue(message)
Test->>Harness: set_connectivity(a,b,allowed)
Harness->>Harness: update ConnectivityMatrix
Test->>Harness: explore_schedule(actions)
loop for each resolved action (slot ordered)
Harness->>SlotDriver: resolve_schedule_slot()
alt QueueProtocolMessage
Harness->>Queue: push message
else DeliverProtocolMessage
Harness->>Queue: pop candidate
Harness->>Harness: connectivity_allows?
alt allowed
Harness->>R?: deliver message
else blocked
Harness-->>Queue: retain message
end
else DropProtocolMessage
Harness->>Queue: remove message
else CrashReplica
Harness->>R?: stop & mark Crashed
else RestartReplica
Harness->>R?: recover & mark Running
end
end
Test->>Harness: replica_observation(id)
Harness-->>Test: ReplicaObservation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
|
@coderabbitai summary |
✅ Actions performedSummary regeneration triggered. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
crates/allocdb-node/src/replicated_simulation.rs (1)
330-359: Consider adding debug logging for message queue operations.Per coding guidelines, logging materially improves debuggability. Adding
debug!ortrace!logging for queue/deliver/drop operations would help trace message flow during test debugging, especially when investigating non-determinism or unexpected transcript differences.Example locations:
- After successfully queuing a message (line 354)
- After delivering/dropping a message
This is optional for test harness code but would aid in diagnosing schedule exploration issues.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/replicated_simulation.rs` around lines 330 - 359, Add debug logging to trace protocol message queue operations: inside queue_protocol_message (and similarly in the deliver/drop handlers), log when a message is successfully queued/delivered/dropped including the QueuedProtocolMessage fields (message_id, label, from, to), the updated pending_messages.len() and next_message_id; emit at debug or trace level and include context like "ProtocolMessageQueued" to align with ReplicatedScheduleObservationKind::ProtocolMessageQueued so logs map to observations for easier test debugging.crates/allocdb-node/src/replicated_simulation_tests.rs (1)
1-167: Consider adding negative-path tests for harness error conditions.Per coding guidelines favoring negative-path tests, consider adding coverage for:
UnknownReplicaerror when accessing invalid replica IDReplicaAlreadyCrashederror when queuing from or delivering to a crashed replicaReplicaAlreadyRunningerror on double-restartThese would strengthen confidence in the harness error handling, though they are optional given the current coverage of the main acceptance criteria.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/replicated_simulation_tests.rs` around lines 1 - 167, Add negative-path unit tests in replicated_simulation_tests.rs that exercise harness error handling: create small schedules (use run_schedule/transcript_for_seed helpers or directly ReplicatedSimulationHarness::new and explore_schedule) that trigger and assert the specific errors UnknownReplica, ReplicaAlreadyCrashed, and ReplicaAlreadyRunning; e.g., call queue_message or deliver_message with an invalid replica id to expect UnknownReplica, queue or deliver to a replica after invoking crash_replica to expect ReplicaAlreadyCrashed, and call restart_replica twice (or restart a running replica) to expect ReplicaAlreadyRunning; assert the returned ReplicatedSimulationError variant for each case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/allocdb-node/src/replicated_simulation_tests.rs`:
- Around line 1-167: Add negative-path unit tests in
replicated_simulation_tests.rs that exercise harness error handling: create
small schedules (use run_schedule/transcript_for_seed helpers or directly
ReplicatedSimulationHarness::new and explore_schedule) that trigger and assert
the specific errors UnknownReplica, ReplicaAlreadyCrashed, and
ReplicaAlreadyRunning; e.g., call queue_message or deliver_message with an
invalid replica id to expect UnknownReplica, queue or deliver to a replica after
invoking crash_replica to expect ReplicaAlreadyCrashed, and call restart_replica
twice (or restart a running replica) to expect ReplicaAlreadyRunning; assert the
returned ReplicatedSimulationError variant for each case.
In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 330-359: Add debug logging to trace protocol message queue
operations: inside queue_protocol_message (and similarly in the deliver/drop
handlers), log when a message is successfully queued/delivered/dropped including
the QueuedProtocolMessage fields (message_id, label, from, to), the updated
pending_messages.len() and next_message_id; emit at debug or trace level and
include context like "ProtocolMessageQueued" to align with
ReplicatedScheduleObservationKind::ProtocolMessageQueued so logs map to
observations for easier test debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 53fb0dcd-b357-4707-8841-4f3b1716e2aa
📒 Files selected for processing (6)
crates/allocdb-node/src/lib.rscrates/allocdb-node/src/replicated_simulation.rscrates/allocdb-node/src/replicated_simulation_tests.rscrates/allocdb-node/src/simulation.rsdocs/status.mddocs/testing.md
📜 Review details
🧰 Additional context used
📓 Path-based instructions (3)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/lib.rscrates/allocdb-node/src/simulation.rscrates/allocdb-node/src/replicated_simulation_tests.rscrates/allocdb-node/src/replicated_simulation.rs
**/*.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep documentation up to date with the code and design. If a change affects behavior, invariants, failure modes, operational semantics, testing strategy, or implementation sequencing, update the relevant docs in the same task or PR.
Files:
docs/testing.mddocs/status.md
docs/status.md
📄 CodeRabbit inference engine (AGENTS.md)
Keep
docs/status.mdcurrent as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes.
Files:
docs/status.md
🧠 Learnings (1)
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Applied to files:
crates/allocdb-node/src/replicated_simulation_tests.rs
🧬 Code graph analysis (1)
crates/allocdb-node/src/replicated_simulation.rs (3)
crates/allocdb-node/src/simulation.rs (1)
simulation_workspace_path(450-456)crates/allocdb-node/src/replica_tests.rs (1)
identity(58-63)crates/allocdb-node/src/replica.rs (1)
paths(601-603)
🔇 Additional comments (13)
crates/allocdb-node/src/simulation.rs (1)
137-144: LGTM!The visibility changes to
pub(crate)forchoose_indexandsimulation_workspace_pathare appropriate. This enables the newreplicated_simulationharness to reuse the seeded slot driver infrastructure and workspace path generation without exposing these helpers publicly.Also applies to: 450-456
docs/status.md (1)
37-42: LGTM!Documentation updates accurately reflect the new replicated simulation harness:
- The validated chunk description now includes the three-replica harness with message queue and connectivity matrix
- "What Exists" section properly documents the harness capabilities and real
ReplicaNode::recoverusage- Test target for
replicated_simulationis added- Current focus correctly references using the M7-T02 harness for quorum work
Also applies to: 137-142, 152-152, 159-160
crates/allocdb-node/src/replicated_simulation.rs (5)
213-255: LGTM!The
new()method properly initializes three real replicas with independent workspaces. Each replica gets a unique identity derived from its index, and paths are correctly generated using the sharedsimulation_workspace_pathhelper. The error handling viamap_errpropagatesReplicaOpenErrorappropriately.
404-418: LGTM!The
crash_replicamethod correctly:
- Captures the before observation
- Validates the replica is currently running
- Drops the node to simulate crash (losing volatile state)
- Captures the after observation showing
CrashedstatusThe state transition is properly recorded in the observation for transcript replay.
420-439: LGTM!The
restart_replicamethod properly usesReplicaNode::recoverto restart from durable state, matching the documented behavior of restart through the real recovery path. The before/after observations correctly capture the state transition.
441-481: LGTM!The
explore_schedulemethod correctly implements deterministic schedule exploration:
- Validates unique labels
- Resolves candidate slots using the seeded driver
- Sorts by (slot, label) for deterministic ordering
- Selects among ready actions at each slot using seeded choice
- Records transcript for reproducibility
This mirrors the single-node
explore_schedulepattern appropriately.
576-604: LGTM!The
Dropimplementation properly cleans up all replica artifacts (WAL, snapshot, metadata, and temp metadata files). Theremove_if_existshelper correctly handles theNotFoundcase without panicking, which is appropriate for cleanup of files that may not exist.docs/testing.md (1)
203-218: LGTM!The documentation update accurately describes the M7-T02 implementation surface. The enumerated driver actions match what is implemented in
replicated_simulation.rs, and the note about protocol payloads being opaque labels (with M7-T03 adding real semantics) correctly sets expectations for the current milestone scope.crates/allocdb-node/src/replicated_simulation_tests.rs (4)
169-209: LGTM!This test verifies the core acceptance criterion: three real replicas with independent workspaces. It properly validates:
- Each replica is
Activewith correct identity- Metadata files exist on disk
- Paths are distinct across replicas
- Connectivity matrix is fully connected by default
211-304: LGTM!This is a strong invariant test that validates the core reproducibility property:
- Same seed produces identical transcripts
- Different seed produces different transcripts
- Slot sequence is deterministic
- All observation kinds (delivery, connectivity, crash, restart) are properly recorded
The use of
first_distinct_seedto find an alternate seed that produces a different transcript is a good testing pattern.
306-376: LGTM!Good negative-path test for the connectivity matrix:
- Validates blocked delivery returns the correct error
- Validates message remains pending after blocked delivery
- Validates drop removes the message
- Validates healed connectivity allows subsequent delivery
This covers the partition-and-heal scenario family described in the testing documentation.
378-422: LGTM!Good recovery test validating workspace stability across crash/restart:
- Paths remain identical before and after restart
- State transitions are properly observed (Running → Crashed → Running)
- Restarted replica returns to
Activestatus with expected view/LSN statecrates/allocdb-node/src/lib.rs (1)
5-8: LGTM!The module declarations are properly gated with
#[cfg(test)]and usepub(crate)visibility. This correctly exposes the simulation harnesses only during test compilation without polluting the public API.
|
Applied the CodeRabbit follow-up items on
Reran:
|
|
@coderabbitai summary |
✅ Actions performedSummary regeneration triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
crates/allocdb-node/src/replicated_simulation_tests.rs (1)
306-376: Add a missing-label negative-path regression.This module covers most harness errors, but neither
deliver_protocol_messagenordrop_protocol_messageis exercised with an unknown message label. That leavesReplicatedSimulationError::MessageNotFoundunguarded in the dedicated harness suite. As per coding guidelines,**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/replicated_simulation_tests.rs` around lines 306 - 376, Extend the test suite to cover the negative path where a non-existent protocol message label is used: call harness.deliver_protocol_message("unknown-label") and harness.drop_protocol_message("unknown-label") and assert they return Err(ReplicatedSimulationError::MessageNotFound { label }) (matching the label string); add this either as a new test or append to connectivity_matrix_controls_delivery_until_partition_heals after existing happy-path checks, referencing the harness methods deliver_protocol_message and drop_protocol_message and the error enum ReplicatedSimulationError::MessageNotFound to locate the logic to exercise.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 47-52: QueuedProtocolMessage currently holds only message_id,
label, from, to and deliver_protocol_message only removes the entry and
snapshots state; change the queued item to carry the real protocol payload or an
executable delivery closure (e.g., include a boxed enum/struct like
ProtocolPayload or Box<dyn FnOnce(&mut ReplicaNode)>) so that when
deliver_protocol_message runs it actually invokes the recipient's handler (apply
payload to the target ReplicaNode identified by to) before recording the
observation; update the queue producer (where outbound messages are enqueued) to
store the actual payload/action and update any code paths that construct
QueuedProtocolMessage (and the consumer logic referenced around
deliver_protocol_message and the ReplicaNode delivery code at lines ~368-402) to
execute the payload against the recipient ReplicaNode, then snapshot/record the
observation after successful delivery.
- Around line 595-640: The harness Drop currently calls remove_if_exists which
panics on unexpected I/O errors; change remove_if_exists so it does not panic
during teardown: in function remove_if_exists(Path) handle Ok and NotFound as
no-ops and on other Err call warn! (or the crate logger) with a clear message
including path.display() and the error instead of panic; ensure the function is
used by the ReplicatedSimulationHarness Drop impl as-is so teardown is
best-effort and does not abort if tests are already unwinding.
---
Nitpick comments:
In `@crates/allocdb-node/src/replicated_simulation_tests.rs`:
- Around line 306-376: Extend the test suite to cover the negative path where a
non-existent protocol message label is used: call
harness.deliver_protocol_message("unknown-label") and
harness.drop_protocol_message("unknown-label") and assert they return
Err(ReplicatedSimulationError::MessageNotFound { label }) (matching the label
string); add this either as a new test or append to
connectivity_matrix_controls_delivery_until_partition_heals after existing
happy-path checks, referencing the harness methods deliver_protocol_message and
drop_protocol_message and the error enum
ReplicatedSimulationError::MessageNotFound to locate the logic to exercise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fb0c66ad-bb27-4d19-acad-2083e29147dc
📒 Files selected for processing (2)
crates/allocdb-node/src/replicated_simulation.rscrates/allocdb-node/src/replicated_simulation_tests.rs
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/replicated_simulation.rscrates/allocdb-node/src/replicated_simulation_tests.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Treat CodeRabbit as part of the required review path when it is enabled on the repository. Wait for its status to complete before merge. If it completes without a visible review comment or review thread, request visible output with `coderabbitai summary`. Address every substantive CodeRabbit comment explicitly before merge by either applying the change or documenting why it is not being applied. Apply correctness, safety, recovery, test, and docs-alignment feedback by default; document why you reject suggestions that would weaken determinism, boundedness, or trusted-core discipline.
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Applied to files:
crates/allocdb-node/src/replicated_simulation_tests.rs
🧬 Code graph analysis (2)
crates/allocdb-node/src/replicated_simulation.rs (2)
crates/allocdb-node/src/simulation.rs (1)
simulation_workspace_path(450-456)crates/allocdb-node/src/replica.rs (1)
paths(601-603)
crates/allocdb-node/src/replicated_simulation_tests.rs (1)
crates/allocdb-node/src/replicated_simulation.rs (7)
replica(266-271)set_connectivity(319-329)crash_replica(423-437)restart_replica(439-458)new(215-256)connectivity_allows(309-317)pending_messages(305-307)
🔇 Additional comments (1)
crates/allocdb-node/src/replicated_simulation_tests.rs (1)
211-304: Strong determinism regression coverage.Same-seed equality, alternate-seed divergence, and explicit crash/restart/connectivity assertions make this a solid replayability test for the harness.
|
Applied the latest CodeRabbit follow-up on
Not applied:
Reran:
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/allocdb-node/src/replicated_simulation.rs (1)
423-458: Consider adding info-level logging for crash and restart lifecycle events.
crash_replicaandrestart_replicarepresent significant replica lifecycle transitions. Addinginfo!logs would improve operational clarity and debugging when reviewing transcripts, consistent with thedebug!logging already added for message queue events.♻️ Suggested logging additions
pub(crate) fn crash_replica( &mut self, replica_id: ReplicaId, ) -> Result<ReplicatedScheduleObservationKind, ReplicatedSimulationError> { let before = self.replica_observation(replica_id)?; { let replica = self.replica_entry_mut(replica_id)?; if replica.node.is_none() { return Err(ReplicatedSimulationError::ReplicaAlreadyCrashed(replica_id)); } drop(replica.node.take()); } let after = self.replica_observation(replica_id)?; + log::info!("ReplicaCrashed: replica_id={}", replica_id.get()); Ok(ReplicatedScheduleObservationKind::ReplicaCrashed { before, after }) } pub(crate) fn restart_replica( &mut self, replica_id: ReplicaId, ) -> Result<ReplicatedScheduleObservationKind, ReplicatedSimulationError> { // ... existing code ... self.replica_entry_mut(replica_id)?.node = Some(recovered); let after = self.replica_observation(replica_id)?; + log::info!("ReplicaRestarted: replica_id={}", replica_id.get()); Ok(ReplicatedScheduleObservationKind::ReplicaRestarted { before, after }) }Based on learnings: "Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: ...
infofor meaningful lifecycle and state-transition events."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/allocdb-node/src/replicated_simulation.rs` around lines 423 - 458, Add info-level logging to both crash_replica and restart_replica to record lifecycle transitions: inside crash_replica, log an info message before/after taking the node down (include replica_id and the before/after ReplicatedScheduleObservationKind or key fields) and also log when returning ReplicaAlreadyCrashed; inside restart_replica, log an info message when starting recovery and after successful recovery (include replica_id and before/after observations and any identity/paths metadata helpful for tracing). Use the existing logging macro style (info!) consistent with debug! usage and ensure errors still map to ReplicatedSimulationError variants (ReplicaAlreadyCrashed, ReplicaAlreadyRunning, RecoverReplica) without changing error flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@crates/allocdb-node/src/replicated_simulation.rs`:
- Around line 423-458: Add info-level logging to both crash_replica and
restart_replica to record lifecycle transitions: inside crash_replica, log an
info message before/after taking the node down (include replica_id and the
before/after ReplicatedScheduleObservationKind or key fields) and also log when
returning ReplicaAlreadyCrashed; inside restart_replica, log an info message
when starting recovery and after successful recovery (include replica_id and
before/after observations and any identity/paths metadata helpful for tracing).
Use the existing logging macro style (info!) consistent with debug! usage and
ensure errors still map to ReplicatedSimulationError variants
(ReplicaAlreadyCrashed, ReplicaAlreadyRunning, RecoverReplica) without changing
error flow.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 80620483-dcc6-4bb6-ba2f-cf934782dfc1
📒 Files selected for processing (2)
crates/allocdb-node/src/replicated_simulation.rscrates/allocdb-node/src/replicated_simulation_tests.rs
📜 Review details
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
**/*.rs: Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level:errorfor invariant breaks, corruption, and failed operations that require intervention;warnfor degraded but expected conditions such as overload, lag, or rejected requests;infofor meaningful lifecycle and state-transition events;debugfor detailed execution traces useful in development;traceonly for very high-volume diagnostic detail.
Logging must be structured and purposeful. Do not add noisy logs that obscure signal or hide bugs.
Files:
crates/allocdb-node/src/replicated_simulation_tests.rscrates/allocdb-node/src/replicated_simulation.rs
🧠 Learnings (2)
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Write extensive tests for every meaningful behavior change. Favor invariant tests, negative-path tests, recovery tests, and regression tests over shallow happy-path coverage.
Applied to files:
crates/allocdb-node/src/replicated_simulation_tests.rs
📚 Learning: 2026-03-12T15:18:53.046Z
Learnt from: CR
Repo: skel84/allocdb PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-12T15:18:53.046Z
Learning: Applies to **/*.rs : Add extensive logging where it materially improves debuggability or operational clarity. Use the right log level: `error` for invariant breaks, corruption, and failed operations that require intervention; `warn` for degraded but expected conditions such as overload, lag, or rejected requests; `info` for meaningful lifecycle and state-transition events; `debug` for detailed execution traces useful in development; `trace` only for very high-volume diagnostic detail.
Applied to files:
crates/allocdb-node/src/replicated_simulation.rs
🔇 Additional comments (18)
crates/allocdb-node/src/replicated_simulation.rs (12)
1-22: LGTM!Module setup, imports, constants, and test module reference are well organized. The
REPLICA_COUNTandENDPOINT_COUNTconstants provide clear configuration for the three-replica harness.
24-44: LGTM!The
ClusterEndpoint,ReplicaRuntimeStatus, andReplicaObservationtypes provide a clean abstraction for representing cluster topology and replica state. The observation struct captures all relevant metadata for transcript recording.
46-52: Delivery path is bookkeeping-only; consider adding actual protocol dispatch in a future iteration.This concern was raised in a previous review:
QueuedProtocolMessagecaptures only labels and replica IDs, anddeliver_protocol_message(lines 368-402) removes the entry and snapshots state without invoking the recipientReplicaNode. The current design provides test scaffolding for scheduling and transcript recording, but protocol handler correctness cannot be validated until real payloads are dispatched.
54-122: LGTM!Schedule action and observation types are well-designed. The action kinds cover all harness operations (queue/deliver/drop messages, connectivity, crash/restart), and the observation variants capture the relevant before/after state for transcript replay.
124-166: LGTM!Error types are comprehensive with clear
Displayimplementations. All error variants (OpenReplica,RecoverReplica,UnknownReplica,ReplicaAlreadyCrashed,ReplicaAlreadyRunning,MessageNotFound,MessageDeliveryBlocked) cover the expected failure modes.
168-212: LGTM!Internal types (
HarnessReplica,ConnectivityMatrix,ResolvedReplicatedAction) and the mainReplicatedSimulationHarnessstruct are well-structured. The connectivity matrix with endpoint indexing supports asymmetric partition modeling.
214-256: LGTM!Harness initialization correctly creates three replicas with independent workspaces, initializes the seeded slot driver, and starts with a fully-connected connectivity matrix.
258-329: LGTM!State query methods (
current_slot,replica,replica_paths,replica_observation,pending_messages,connectivity_allows) and connectivity mutation (set_connectivity) are clean and correctly validate replica IDs.
331-421: LGTM!Message queue operations (
queue_protocol_message,deliver_protocol_message,drop_protocol_message) correctly validate sender/recipient state, check connectivity for delivery, and include structured debug logging vialog_protocol_message_event. The unique-label assertion provides clear invariant enforcement.
460-547: LGTM!
explore_schedulecorrectly resolves candidate slots deterministically, sorts by(slot, label)for stable ordering, and uses the seeded slot driver for tie-breaking. The unique-label and slot validation assertions enforce harness invariants.resolve_schedule_slotdeduplicates candidates before selection.
549-592: LGTM!Helper methods for message and replica lookup are clean.
pending_message_index,endpoint_index,replica_index,replica_entry, andreplica_entry_mutprovide consistent error handling viaReplicatedSimulationError.
595-643: LGTM!The
Dropimplementation properly cleans up replica nodes and workspace artifacts.remove_if_existsnow logs warnings instead of panicking on unexpected I/O errors, making teardown best-effort and preventing test-abort scenarios during unwinding.crates/allocdb-node/src/replicated_simulation_tests.rs (6)
1-167: LGTM!Well-designed test helpers reduce boilerplate and improve readability.
core_config,engine_config, and the action builders (queue_message,deliver_message,drop_message,set_connectivity,crash_replica,restart_replica) provide a clean DSL for test authoring.run_schedule,transcript_for_seed,first_distinct_seed, anddelivered_messagesupport transcript-based assertions.
169-209: LGTM!
replicated_harness_bootstraps_three_real_replicas_with_independent_workspacesthoroughly validates the bootstrap invariants: all three replicas are active, have distinct identities, have independent workspace paths, and the connectivity matrix starts fully connected.
211-304: LGTM!
replicated_schedule_transcript_is_reproducibleprovides comprehensive coverage of the deterministic transcript requirement: identical seeds produce identical transcripts, different seeds produce different transcripts, and individual observation types (delivered, connectivity, crash, restart) are validated. The slot sequence assertion confirms scheduling order.
306-390: LGTM!
connectivity_matrix_controls_delivery_until_partition_healsexercises the partition and healing flow with good negative-path coverage: blocked delivery returnsMessageDeliveryBlocked, dropped messages clear the queue, and healed connectivity allows delivery. The trailing assertions forMessageNotFounderrors validate missing-label handling.
392-457: LGTM!Negative-path tests
invalid_replica_access_returns_unknown_replica,queue_and_deliver_reject_crashed_replicas, andrestart_rejects_running_replicacomprehensively validate error conditions. These align with the coding guideline to favor negative-path tests.
459-503: LGTM!
crash_and_restart_keep_replica_workspace_stablevalidates the recovery invariant: workspace paths remain stable across crash/restart cycles, and the restarted replica returns toActivestatus with expected initial metadata. This aligns with the coding guideline to favor recovery tests.
|
Applied the remaining lifecycle-logging nitpick in Changed:
Reran:
|
Summary
Build the deterministic three-replica cluster harness for the replicated prototype.
Linked Issue
Closes #50
Changes
ReplicaNodes with independent durable workspacesdocs/testing.mdanddocs/status.mdwith the shipped harness surface and the next milestone focusValidation
scripts/preflight.shcargo test -p allocdb-node replicated_simulation -- --nocaptureDocs
CodeRabbit Triage
@coderabbitai summaryif no visible review comment or thread appeared