Summary
A safety vulnerability exists in the EPaxos implementation where replicas can respond to PreAccept messages before durably persisting the instance state. After a crash and restart, a replica may "forget" its participation in a fast quorum, breaking quorum intersection guarantees. This can allow two conflicting commands to both fast-commit with empty dependency sets, leading to execution order divergence and state inconsistency across replicas.
Bug Location
File: src/epaxos/epaxos.go
Functions: handlePreAccept (lines 900-998) and sync (lines 218-224)
Problematic Code
1. The sync() function may be a no-op:
// sync with the stable store
func (r *Replica) sync() {
	if !r.Durable { // ⚠️ If Durable=false, nothing is persisted!
		return
	}
	r.StableStore.Sync()
}
2. PreAccept handler sends response after sync():
func (r *Replica) handlePreAccept(preAccept *epaxosproto.PreAccept) {
	// ... update instance state in memory ...
	r.InstanceSpace[preAccept.Replica][preAccept.Instance] = &Instance{
		preAccept.Command,
		preAccept.Ballot,
		status,
		seq,
		deps,
		// ...
	}
	r.recordInstanceMetadata(r.InstanceSpace[preAccept.Replica][preAccept.Instance])
	r.recordCommands(preAccept.Command)
	r.sync() // ← This may do nothing if Durable=false!
	// Then send response (lines 982-995)
	if changed || uncommittedDeps || ... {
		r.replyPreAccept(preAccept.LeaderId, &epaxosproto.PreAcceptReply{...})
	} else {
		r.SendMsg(preAccept.LeaderId, r.preAcceptOKRPC, pok)
	}
}
Root Cause
- Default configuration: Durable = false by default, meaning sync() does nothing
- Even with Durable = true: There's still a window between sync() completing and network send
- Memory-only state: If a crash occurs after sending PreAcceptOK but before actual disk persistence, the instance state is lost
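The ordering these points describe can be sketched in isolation. This is a minimal, hypothetical model and not the repository's code: stableStore, memStore, replica, and this handlePreAccept are illustrative names introduced here. It assumes the store reports sync failures, and it shows a handler that hands a reply to the network only after the record has been synced.

```go
package main

import (
	"errors"
	"fmt"
)

// stableStore is a hypothetical stand-in for a replica's stable storage.
type stableStore interface {
	Write(rec string) error
	Sync() error
}

// memStore buffers writes and treats them as durable only after Sync.
type memStore struct {
	buffered []string
	durable  []string
	failSync bool
}

func (m *memStore) Write(rec string) error {
	m.buffered = append(m.buffered, rec)
	return nil
}

func (m *memStore) Sync() error {
	if m.failSync {
		return errors.New("sync failed")
	}
	m.durable = append(m.durable, m.buffered...)
	m.buffered = nil
	return nil
}

// crash drops everything that was written but never synced.
func (m *memStore) crash() { m.buffered = nil }

type replica struct {
	store stableStore
	sent  []string // replies handed to the network
}

// handlePreAccept persists the record and replies only if Sync succeeded.
func (r *replica) handlePreAccept(instance string) error {
	if err := r.store.Write(instance); err != nil {
		return err
	}
	if err := r.store.Sync(); err != nil {
		return fmt.Errorf("not replying to %s: %w", instance, err)
	}
	r.sent = append(r.sent, "PreAcceptOK("+instance+")")
	return nil
}

func main() {
	st := &memStore{}
	r := &replica{store: st}
	if err := r.handlePreAccept("A"); err != nil {
		fmt.Println("error:", err)
	}
	st.crash()
	fmt.Println("replies sent:", r.sent)
	fmt.Println("durable after crash:", st.durable)
}
```

In this model every reply in sent has a matching entry in durable, so a crash cannot leave a vote on the wire that the replica no longer remembers.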
Attack Scenario
Consider N=5 replicas (R0-R4):
- Command A issued to R0:
  - R0 broadcasts PreAccept(A) to fast quorum {R0, R1, R2, R3}
  - All reply PreAcceptOK(A, seq=1, deps=∅)
  - R0 fast-commits A
- R2 crashes after sending PreAcceptOK but before durable persistence:
  - R2's in-memory state of A is lost
  - On restart, R2 has no record of A
- Command B (conflicting with A) issued to R4:
  - R4 broadcasts PreAccept(B) to fast quorum {R1, R2, R3, R4}
  - R2 (having "forgotten" A) replies PreAcceptOK(B, seq=1, deps=∅)
  - R4 fast-commits B with no dependency on A
- Result: Both A and B are committed with seq=1 and deps=∅
  - R0, R1 execute: A then B → final value = B
  - R3, R4 execute: B then A → final value = A
  - State divergence!
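The step where R2 "forgets" A can be reproduced in a toy model. The sketch below is not the repository's code: toyReplica, preAccept, and crashAndRestart are hypothetical names, and dependency tracking is reduced to "every earlier command on the same key". It only compares a single replica's reply to PreAccept(B) after a restart, with and without durable state; it does not model quorums or commit.

```go
package main

import "fmt"

// toyReplica records, per key, the commands it has pre-accepted.
type toyReplica struct {
	durable bool
	memory  map[string][]string // in-memory instance state
	disk    map[string][]string // state that survives a crash
}

func newToyReplica(durable bool) *toyReplica {
	return &toyReplica{
		durable: durable,
		memory:  map[string][]string{},
		disk:    map[string][]string{},
	}
}

// preAccept returns the dependencies the replica would report for cmd
// (earlier commands on the same key), then records cmd.
func (r *toyReplica) preAccept(cmd, key string) []string {
	deps := append([]string(nil), r.memory[key]...)
	r.memory[key] = append(r.memory[key], cmd)
	if r.durable {
		// Persist a copy before the reply would be sent.
		r.disk[key] = append([]string(nil), r.memory[key]...)
	}
	return deps
}

// crashAndRestart wipes memory and reloads whatever reached disk.
func (r *toyReplica) crashAndRestart() {
	r.memory = map[string][]string{}
	for k, v := range r.disk {
		r.memory[k] = append([]string(nil), v...)
	}
}

func main() {
	for _, durable := range []bool{false, true} {
		r2 := newToyReplica(durable)
		r2.preAccept("A", "k")
		r2.crashAndRestart()
		deps := r2.preAccept("B", "k")
		fmt.Printf("durable=%v deps(B)=%v\n", durable, deps)
	}
}
```

Running it prints deps(B)=[] for the non-durable replica and deps(B)=[A] for the durable one, which is the difference between B being ordered after A and B being treated as independent of it.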
Test Case
func TestCrashThenForgetFastQuorumVotes(t *testing.T)
Test Output:
=== RUN TestCrashThenForgetFastQuorumVotes
crash_then_forget_fast_quorum_test.go:167: Both A and B are COMMITTED
with no dependency edges between them on all replicas.
crash_then_forget_fast_quorum_test.go:220: Final values for key k
across replicas: [2 2 2 1 1]
crash_then_forget_fast_quorum_test.go:223: execution-order agreement
violation: replicas disagree on final value of k (min=1, max=2)
--- FAIL: TestCrashThenForgetFastQuorumVotes (0.12s)
Impact
- Severity: Critical
- Type: Safety/Agreement violation
- Impact: Replicas can permanently diverge in state, violating linearizability
Suggested Fix
Option 1: Force durable persistence before responding
func (r *Replica) handlePreAccept(preAccept *epaxosproto.PreAccept) {
	// ... update instance state ...
	r.recordInstanceMetadata(inst)
	r.recordCommands(preAccept.Command)
	// MUST sync before responding, regardless of Durable flag
	r.StableStore.Sync() // Force sync
	// Only then send response
	r.replyPreAccept(...)
}
Option 2: Make Durable=true the default
func NewReplica(...) *Replica {
	r := &Replica{
		// ...
		Durable: true, // Default to durable for safety
		// ...
	}
	return r
}
Option 3: Use write-ahead logging
Ensure all state changes are written to a WAL before any response is sent, and replay the WAL on recovery.
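A rough sketch of what Option 3 could look like, using only the Go standard library. Everything here is an assumption for illustration: the wal type, openWAL, appendSync, and the one-record-per-line format are hypothetical and unrelated to the repository's StableStore. A real log would also need checksums, handling of torn writes at the tail, and an fsync of the parent directory when the file is first created.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
)

// wal is a hypothetical append-only log with one record per line.
// Records must not contain newlines.
type wal struct{ f *os.File }

// openWAL opens (or creates) the log and replays the existing records.
func openWAL(path string) (*wal, []string, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE|os.O_APPEND, 0o644)
	if err != nil {
		return nil, nil, err
	}
	var recs []string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		recs = append(recs, sc.Text())
	}
	if err := sc.Err(); err != nil {
		f.Close()
		return nil, nil, err
	}
	return &wal{f: f}, recs, nil
}

// appendSync writes one record and returns only after fsync succeeds.
func (w *wal) appendSync(rec string) error {
	if _, err := w.f.WriteString(rec + "\n"); err != nil {
		return err
	}
	return w.f.Sync()
}

func (w *wal) close() error { return w.f.Close() }

// demo logs two records, simulates a restart, and returns what replay recovers.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "wal-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "replica.wal")

	w, _, err := openWAL(path)
	if err != nil {
		return nil, err
	}
	for _, rec := range []string{"preaccept A seq=1", "preaccept B seq=2"} {
		if err := w.appendSync(rec); err != nil {
			w.close()
			return nil, err
		}
		// A PreAcceptOK reply would be sent here, after the fsync.
	}
	if err := w.close(); err != nil {
		return nil, err
	}

	// Simulated restart: reopen the log and replay it.
	w2, recs, err := openWAL(path)
	if err != nil {
		return nil, err
	}
	defer w2.close()
	return recs, nil
}

func main() {
	recs, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("replayed:", recs)
}
```

The design point is that appendSync is the only path to a reply: if it returns an error, the handler should not answer the PreAccept at all.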
Notes
This is a known class of vulnerability in distributed consensus systems. The EPaxos paper assumes durable storage semantics, but the implementation allows non-durable mode for testing/performance, which breaks safety guarantees.