[Safety Bug] Non-durable PreAccept responses enable conflicting fast commits #28

## Summary

A safety vulnerability exists in the EPaxos implementation where replicas can respond to `PreAccept` messages before durably persisting the instance state. After a crash and restart, a replica may "forget" its participation in a fast quorum, breaking quorum intersection guarantees. This can allow two conflicting commands to both fast-commit with empty dependency sets, leading to execution order divergence and state inconsistency across replicas.

## Bug Location

**File**: `src/epaxos/epaxos.go`  
**Functions**: `handlePreAccept` (lines 900-998) and `sync` (lines 218-224)

### Problematic Code

#### 1. The `sync()` function may be a no-op:

```go
// sync with the stable store
func (r *Replica) sync() {
    if !r.Durable {     // ⚠️ If Durable=false, nothing is persisted!
        return
    }
    r.StableStore.Sync()
}
```

#### 2. PreAccept handler sends response after `sync()`:

```go
func (r *Replica) handlePreAccept(preAccept *epaxosproto.PreAccept) {
    // ... update instance state in memory ...
    
    r.InstanceSpace[preAccept.Replica][preAccept.Instance] = &Instance{
        preAccept.Command,
        preAccept.Ballot,
        status,
        seq,
        deps,
        // ...
    }
    
    r.recordInstanceMetadata(r.InstanceSpace[preAccept.Replica][preAccept.Instance])
    r.recordCommands(preAccept.Command)
    r.sync()  // ← This may do nothing if Durable=false!
    
    // Then send response (lines 982-995)
    if changed || uncommittedDeps || ... {
        r.replyPreAccept(preAccept.LeaderId, &epaxosproto.PreAcceptReply{...})
    } else {
        r.SendMsg(preAccept.LeaderId, r.preAcceptOKRPC, pok)
    }
}
```

## Root Cause

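1. **Default configuration**: `Durable` is `false` by default, so `sync()` returns without persisting anything.
2. **Even with `Durable = true`**: the reply is safe only if `r.StableStore.Sync()` has actually flushed the instance record to disk before the network send; any reply sent ahead of a completed flush reopens the same window.
3. **Memory-only state**: if a replica crashes after sending `PreAcceptOK` but before the instance reaches disk, the instance state is lost on restart.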

## Attack Scenario

Consider N=5 replicas (R0-R4) with EPaxos's optimized fast quorum of F + ⌊(F+1)/2⌋ = 3 replicas (F=2), so two fast quorums may overlap in a single replica:

1. **Command A issued to R0**:
   - R0 broadcasts `PreAccept(A)` to fast quorum {R0, R1, R2}
   - All reply `PreAcceptOK(A, seq=1, deps=∅)`
   - R0 fast-commits A

2. **R2 crashes after sending PreAcceptOK but before durable persistence**:
   - R2's in-memory state of A is lost
   - On restart, R2 has no record of A

3. **Command B (conflicting with A) issued to R4**:
   - R4 broadcasts `PreAccept(B)` to fast quorum {R2, R3, R4}
   - R2 is the only replica in both quorums; having "forgotten" A, it replies `PreAcceptOK(B, seq=1, deps=∅)`
   - R3 and R4 never saw A, so R4 fast-commits B with no dependency on A

4. **Result**: Both A and B are committed with `seq=1` and `deps=∅`, so nothing orders them and each replica applies them in the order it learns of the commits
   - R0, R1 execute: A then B → final value = B
   - R3, R4 execute: B then A → final value = A
   - **State divergence!**
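
The scenario can be reproduced in a toy model, independent of the real code. The sketch below is not the repository's implementation: the `replica` type, its `mem`/`disk` maps, and the helper names are all hypothetical. It uses fast quorums of three replicas (the EPaxos optimized fast quorum size for N=5) so that the two quorums overlap only at R2, and it treats "fast commit" as "every quorum member replied with empty deps".

```go
package main

import "fmt"

// replica is a hypothetical toy model: it only remembers which
// commands it has seen for each key.
type replica struct {
	durable bool
	mem     map[string][]string // in-memory instance state
	disk    map[string][]string // stands in for stable storage
}

func newReplica(durable bool) *replica {
	return &replica{
		durable: durable,
		mem:     map[string][]string{},
		disk:    map[string][]string{},
	}
}

// preAccept returns the conflicting commands already seen for key,
// then records cmd. State reaches "disk" only when durable is set.
func (r *replica) preAccept(key, cmd string) []string {
	deps := append([]string(nil), r.mem[key]...)
	r.mem[key] = append(r.mem[key], cmd)
	if r.durable {
		r.disk[key] = append([]string(nil), r.mem[key]...)
	}
	return deps
}

// crashAndRestart drops memory and reloads whatever reached "disk".
func (r *replica) crashAndRestart() {
	r.mem = map[string][]string{}
	for k, v := range r.disk {
		r.mem[k] = append([]string(nil), v...)
	}
}

// fastCommitsWithNoDeps reports whether every quorum member replied
// with an empty dependency set for cmd.
func fastCommitsWithNoDeps(rs []*replica, quorum []int, key, cmd string) bool {
	ok := true
	for _, i := range quorum {
		if len(rs[i].preAccept(key, cmd)) != 0 {
			ok = false
		}
	}
	return ok
}

// simulate runs the A / crash R2 / B sequence on five replicas.
func simulate(durable bool) (aFast, bFast bool) {
	rs := make([]*replica, 5)
	for i := range rs {
		rs[i] = newReplica(durable)
	}
	aFast = fastCommitsWithNoDeps(rs, []int{0, 1, 2}, "k", "A")
	rs[2].crashAndRestart()
	bFast = fastCommitsWithNoDeps(rs, []int{2, 3, 4}, "k", "B")
	return aFast, bFast
}

func main() {
	for _, durable := range []bool{false, true} {
		a, b := simulate(durable)
		fmt.Printf("durable=%v: A no-deps fast commit=%v, B no-deps fast commit=%v\n", durable, a, b)
	}
}
```

In the non-durable run both commands fast-commit with empty deps; in the durable run R2 reloads A after the restart and reports it as a dependency of B, so B cannot take the empty-deps fast path.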

## Test Case

```go
func TestCrashThenForgetFastQuorumVotes(t *testing.T)
```

**Test Output**:
```
=== RUN   TestCrashThenForgetFastQuorumVotes
    crash_then_forget_fast_quorum_test.go:167: Both A and B are COMMITTED 
        with no dependency edges between them on all replicas.
    crash_then_forget_fast_quorum_test.go:220: Final values for key k 
        across replicas: [2 2 2 1 1]
    crash_then_forget_fast_quorum_test.go:223: execution-order agreement 
        violation: replicas disagree on final value of k (min=1, max=2)
--- FAIL: TestCrashThenForgetFastQuorumVotes (0.12s)
```

## Impact

- **Severity**: Critical
- **Type**: Safety/Agreement violation
- **Impact**: Replicas can permanently diverge in state, violating linearizability

## Suggested Fix

### Option 1: Force durable persistence before responding

```go
func (r *Replica) handlePreAccept(preAccept *epaxosproto.PreAccept) {
    // ... update instance state ...
    
    r.recordInstanceMetadata(inst)
    r.recordCommands(preAccept.Command)
    
    // MUST sync before responding, regardless of Durable flag
    r.StableStore.Sync()  // Force sync
    
    // Only then send response
    r.replyPreAccept(...)
}
```

### Option 2: Make Durable=true the default

```go
func NewReplica(...) *Replica {
    r := &Replica{
        // ...
        Durable: true,  // Default to durable for safety
        // ...
    }
    // ... remaining initialization ...
    return r
}
```

### Option 3: Use write-ahead logging

Ensure all state changes are written to a WAL before any response is sent, and replay the WAL on recovery.

## Notes

This is a known class of vulnerability in distributed consensus systems. The EPaxos paper assumes durable storage semantics, but the implementation allows non-durable mode for testing/performance, which breaks safety guarantees.

