Skip to content

Commit be3bfa3

Browse files
feat: Deterministic commit race resolution (MIP-03) (#152)
* feat: Deterministic commit race resolution (MIP-03) * docs: Update changelogs for race resolution support * test: Improve MLS SQLite storage coverage * fix(mdk-core): redact group id in error and simplify rollback cleanup * fix(mdk-core): fail commit processing if epoch snapshot creation fails * fix(mdk-core): tighten assertions in race chain rollback test * chore(mdk-memory-storage): remove speculative comments about lru crate * fix(mdk-memory-storage): redact sensitive MLS storage fields in Debug * docs: add Debug bound to MdkCallback in plan documentation * fix(mdk-core): remove unused variables in race chain test * Fix race conditions in memory storage and improve test determinism - Improve thread safety in MdkMemoryStorage group operations by holding locks during existence checks. - Add Debug implementations for snapshot types to redact sensitive information. - Refactor commit race tests to ensure deterministic timestamps and rename test_commit_race_chain_rollback to test_commit_race_simple_rollback. - Remove implemented plan document. * Ensure correct commit processing and rollback support - Conditionally attach epoch info to ProcessMessageWrongEpoch error only for Commit messages. - Add epoch snapshot creation before merging pending commits in the OwnCommitPending path to ensure consistent rollback support (MIP-03). * Prevent GroupId leakage in test logs * Add PR link to CHANGELOG * Replace SQLite savepoints with persistent application-level snapshots - Replace SQLite savepoint mechanism with group_state_snapshots table - Snapshots now persist across restarts (survive connection drops) - Each snapshot is group-scoped (no interference between groups) - Add test verifying group membership is preserved after rollback - Fix clippy type_complexity lint in snapshot restoration The previous savepoint-based approach had issues: 1. Savepoints don't survive connection restarts 2. Nested savepoints can interfere across groups 3. No persistence for crash recovery The new approach stores snapshots in a dedicated table with: - snapshot_name: unique identifier - group_id: scopes snapshot to specific group - table_name: which table the row belongs to - row_key/row_data: serialized row content Co-Authored-By: Claude Opus 4.5 <[email protected]> * Address CodeRabbit review feedback for PR #152 Security fixes: - Remove snapshot name (contains GroupId) from error messages - Redact EpochSnapshotManager inner state in Debug impl - Remove error details from snapshot/rollback logging Atomicity improvements: - Wrap SQLite snapshot_group_state in transaction - Wrap SQLite restore_group_from_snapshot in transaction CHANGELOG fixes: - Correct method names to match trait signatures - Remove duplicate section headers - Fix entry placement Other: - Add Clone and Eq derives to MdkStorageError Co-Authored-By: Claude Opus 4.5 <[email protected]> * Fix formatting to match CI rustfmt Co-Authored-By: Claude Opus 4.5 <[email protected]> * Add comprehensive tests for epoch snapshot and error handling - Add 20 tests for EpochSnapshotManager in epoch_snapshots.rs covering: - MIP-03 ordering rules (timestamp priority, event ID tiebreaker) - Snapshot creation, rollback, and release operations - Debug implementations - Add 10 tests for error variants in error.rs covering: - ProcessMessageWrongEpochWithInfo error formatting - OwnCommitPending error - Storage error conversion from MdkStorageError - Add 7 snapshot tests to mdk-sqlite-storage covering: - Group snapshot and rollback operations - Snapshot release without rollback - Exporter secrets rollback - Group isolation between snapshots - Multiple sequential snapshots Co-Authored-By: Claude Opus 4.5 <[email protected]> * Fix snapshot atomicity docs to match implementation The documentation incorrectly stated that snapshot operations were "not atomic" and acquired "multiple independent locks sequentially". In reality, both create_snapshot() and restore_snapshot() use a single global RwLock on self.inner, making them atomic. Updated docs in four locations: - lib.rs module-level doc - lib.rs MdkMemoryStorage struct doc - snapshot.rs module-level doc - snapshot.rs MemoryStorageSnapshot struct doc Co-Authored-By: Claude Opus 4.5 <[email protected]> * Add epoch-aware message tracking and rollback retry support Implements epoch tracking for messages and processed_messages to enable proper commit race resolution: - Add epoch column to messages and processed_messages tables - Store epoch at decryption time for proper invalidation targeting - Add MessageState::EpochInvalidated for messages from rolled-back commits - Enrich RollbackInfo callback with invalidated_messages and messages_needing_refetch - Add find_failed_messages_for_retry() to find messages that failed to decrypt with wrong keys but can succeed after rollback applies correct commit - Consolidate V002 migration into V001 (nothing deployed yet) The key insight: after rollback, two categories of messages exist: 1. Invalidated messages (processed with wrong commit's keys) - lost forever 2. Failed messages (encrypted with correct keys, couldn't decrypt) - retryable Co-Authored-By: Claude Opus 4.5 <[email protected]> * Add SnapshotCreationFailed error variant for better error handling Addresses PR review feedback to add a proper error variant instead of using the generic Error::Message for snapshot creation failures. - Add Error::SnapshotCreationFailed(String) variant - Update process_commit_message_for_group to use new variant - Include underlying error details in log message and error - Add test for new error variant Co-Authored-By: Claude Opus 4.5 <[email protected]> * Fix group-scoped snapshot isolation in memory storage Previously, memory storage snapshots captured ALL data but rollback restored ALL data, causing rollback of Group A to also affect Group B. This implements proper group-scoped snapshots matching SQLite behavior: - Add GroupScopedSnapshot struct for per-group snapshot isolation - Implement create_group_scoped_snapshot() with JSON-serialized MLS keys - Implement restore_group_scoped_snapshot() for group-only restoration - Add snapshot existence check in SQLite to prevent silent data deletion - Add isolation tests verifying rollback doesn't affect other groups - Fix clippy collapsible-if warnings in messages.rs Co-Authored-By: Claude Opus 4.5 <[email protected]> * Remove implementation comment from SQLite snapshot code Co-Authored-By: Claude Opus 4.5 <[email protected]> * Enhance commit race tests with member and message verification Address PR review feedback to verify more data after rollback: - Add member verification to test_commit_race_simple_better_commit_wins: - Verify original members (Alice, Bob, Carol) still present - Verify correct new member added based on which commit won - Verify total member count is exactly 4 - Add new test_message_invalidation_during_rollback: - Verify messages sent at rolled-back epochs are invalidated - Confirm RollbackInfo contains the invalidated message IDs - Add get_rollback_infos() helper to TestCallback for full access to RollbackInfo including invalidated_messages Co-Authored-By: Claude Opus 4.5 <[email protected]> * Add FK constraint to group_state_snapshots with CASCADE delete - Adds FOREIGN KEY (group_id) REFERENCES groups(mls_group_id) ON DELETE CASCADE to ensure snapshots are cleaned up when a group is deleted - Updates restore_group_from_snapshot to read snapshot data into memory BEFORE deleting the group (to avoid CASCADE deleting the data) - Preserves other snapshots during rollback by re-inserting them after the CASCADE deletion and group restoration - Fixes clippy warnings (type_complexity, needless_borrow) Co-Authored-By: Claude Opus 4.5 <[email protected]> * Remove redundant index and fix restore order for FK constraints - Remove idx_group_state_snapshots_lookup index since SQLite automatically indexes primary keys, and (snapshot_name, group_id) is a prefix of the PK - Fix restore_group_from_snapshot to insert the groups row first before group_relays and group_exporter_secrets (which have FK constraints referencing groups) Co-Authored-By: Claude Opus 4.5 <[email protected]> * Refactor snapshot_group_state into smaller helper methods Break down the 290-line method into 7 focused helper methods: - snapshot_openmls_group_data() - snapshot_openmls_proposals() - snapshot_openmls_own_leaf_nodes() - snapshot_openmls_epoch_key_pairs() - snapshot_groups_table() - snapshot_group_relays() - snapshot_group_exporter_secrets() The main method is now ~65 lines and clearly shows the 7 tables being snapshotted. Each helper is self-contained and easier to understand and maintain. Co-Authored-By: Claude Opus 4.5 <[email protected]> * Add snapshot TTL and hydration support for epoch snapshots This commit adds time-based cleanup and startup hydration for epoch snapshots to prevent unbounded storage growth and ensure proper state recovery after app restart. Changes: - Add snapshot_ttl_seconds config (default 1 week) to MdkConfig - Add list_group_snapshots() and prune_expired_snapshots() to storage trait - Add created_at timestamp to GroupScopedSnapshot - Implement TTL methods in both memory and SQLite storage backends - Add hydration logic to EpochSnapshotManager with ensure_hydrated() - Prune expired snapshots on startup in MDKBuilder::build() - Update is_better_candidate() to accept storage parameter for hydration - Add comprehensive test coverage (14 new tests) The hydration system: - Only activates for persistent storage backends (SQLite) - Parses snapshot names to reconstruct EpochSnapshot metadata - Marks groups as hydrated to prevent redundant loads - Handles missing timestamp gracefully (uses 0 placeholder) Co-Authored-By: Claude Opus 4.5 <[email protected]> * Formatting --------- Co-authored-by: Claude Opus 4.5 <[email protected]>
1 parent 88f99d9 commit be3bfa3

File tree

31 files changed

+6885
-918
lines changed

31 files changed

+6885
-918
lines changed

Plans/epoch-aware-messages.md

Lines changed: 247 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,247 @@
1+
# Epoch-Aware Message System and Snapshot Consistency
2+
3+
## Problem Statement
4+
5+
PR #152 introduces MIP-03 commit race resolution with snapshot/rollback, but has two critical issues:
6+
7+
1. **Messages lack epoch tracking** - Can't identify which messages belong to which epoch
8+
2. **SQLite/Memory snapshot inconsistency** - Memory includes messages in snapshots, SQLite doesn't
9+
3. **No message cleanup on rollback** - Messages from invalidated epochs persist and may be undecryptable
10+
11+
## Design Decisions
12+
13+
### 1. Add Epoch Field
14+
15+
Add `epoch: Option<u64>` to both `Message` and `ProcessedMessage` structs:
16+
- `Option` for backward compatibility (existing rows get NULL)
17+
- Captures the epoch when the message was decrypted/processed
18+
19+
### 2. New State: `EpochInvalidated` (Both Message and ProcessedMessage)
20+
21+
Add `EpochInvalidated` to **both** `MessageState` and `ProcessedMessageState`:
22+
23+
```rust
24+
pub enum MessageState {
25+
Created,
26+
Processed,
27+
Deleted,
28+
EpochInvalidated, // NEW: epoch was rolled back, content may be invalid
29+
}
30+
31+
pub enum ProcessedMessageState {
32+
Created,
33+
Processed,
34+
ProcessedCommit,
35+
Failed,
36+
EpochInvalidated, // NEW: epoch was rolled back, needs reprocessing
37+
}
38+
```
39+
40+
This maintains consistency - when a rollback happens, both the message record and its processing record get the same invalidated state. Applications can then decide whether to display invalidated content with uncertainty indicators or hide it entirely.
41+
42+
### 3. SQLite Snapshot Consistency
43+
44+
Add `messages` and `processed_messages` tables to SQLite snapshots to match memory storage behavior.
45+
46+
### 4. Rollback Message Handling
47+
48+
On rollback to `target_epoch`, mark records as `EpochInvalidated` (don't delete):
49+
50+
| Data | Condition | Action |
51+
|------|-----------|--------|
52+
| Messages | epoch > target_epoch | Mark state = `EpochInvalidated` |
53+
| Messages | epoch <= target_epoch or NULL | KEEP unchanged |
54+
| ProcessedMessage | epoch > target_epoch | Mark state = `EpochInvalidated` |
55+
| ProcessedMessage | epoch <= target_epoch or NULL | KEEP unchanged |
56+
57+
**Rationale**: Keep all records so applications can:
58+
- Show messages with uncertainty indicators ("this message may have changed")
59+
- Allow users to see conversation history even during state transitions
60+
- Re-fetch and reprocess when the correct epoch state is restored
61+
62+
### 5. Enhanced Callback
63+
64+
```rust
65+
pub struct RollbackInfo {
66+
pub group_id: GroupId,
67+
pub target_epoch: u64,
68+
pub new_head_event: EventId,
69+
pub invalidated_messages: Vec<EventId>, // Messages marked EpochInvalidated
70+
pub messages_needing_refetch: Vec<EventId>, // ProcessedMessages needing refetch
71+
}
72+
73+
pub trait MdkCallback: Send + Sync + Debug {
74+
fn on_rollback(&self, info: &RollbackInfo);
75+
}
76+
```
77+
78+
### 6. Group-Scope ProcessedMessages
79+
80+
Add `mls_group_id` column to `processed_messages` table to enable group-scoped epoch queries during rollback.
81+
82+
---
83+
84+
## Implementation Plan
85+
86+
### Phase 1: Storage Types (mdk-storage-traits)
87+
88+
**File: `crates/mdk-storage-traits/src/messages/types.rs`**
89+
90+
1. Add `epoch: Option<u64>` to `Message` struct (after `wrapper_event_id`)
91+
2. Add `epoch: Option<u64>` to `ProcessedMessage` struct (after `processed_at`)
92+
3. Add `mls_group_id: Option<GroupId>` to `ProcessedMessage` struct
93+
4. Add `EpochInvalidated` variant to `ProcessedMessageState`
94+
5. Update `as_str()`, `FromStr`, serialization for new state
95+
96+
**File: `crates/mdk-storage-traits/src/messages/mod.rs`**
97+
98+
Add new trait methods:
99+
```rust
100+
/// Mark messages with epoch > target as EpochInvalidated
101+
/// Returns EventIds of invalidated messages
102+
fn invalidate_messages_after_epoch(&self, group_id: &GroupId, epoch: u64)
103+
-> Result<Vec<EventId>, MessageError>;
104+
105+
/// Mark processed_messages with epoch > target as EpochInvalidated
106+
/// Returns wrapper EventIds of invalidated records
107+
fn invalidate_processed_messages_after_epoch(&self, group_id: &GroupId, epoch: u64)
108+
-> Result<Vec<EventId>, MessageError>;
109+
110+
/// Find messages in EpochInvalidated state (for UI filtering or reprocessing)
111+
fn find_invalidated_messages(&self, group_id: &GroupId)
112+
-> Result<Vec<Message>, MessageError>;
113+
114+
/// Find processed_messages in EpochInvalidated state
115+
fn find_invalidated_processed_messages(&self, group_id: &GroupId)
116+
-> Result<Vec<ProcessedMessage>, MessageError>;
117+
```
118+
119+
### Phase 2: Database Migration (mdk-sqlite-storage)
120+
121+
**New file: `crates/mdk-sqlite-storage/migrations/V002__epoch_awareness.sql`**
122+
123+
```sql
124+
-- Add epoch to messages
125+
ALTER TABLE messages ADD COLUMN epoch INTEGER;
126+
127+
-- Add epoch and group_id to processed_messages
128+
ALTER TABLE processed_messages ADD COLUMN epoch INTEGER;
129+
ALTER TABLE processed_messages ADD COLUMN mls_group_id BLOB;
130+
131+
-- Indexes for rollback queries
132+
CREATE INDEX idx_messages_epoch ON messages(mls_group_id, epoch);
133+
CREATE INDEX idx_processed_messages_epoch ON processed_messages(mls_group_id, epoch);
134+
```
135+
136+
### Phase 3: SQLite Storage Implementation
137+
138+
**File: `crates/mdk-sqlite-storage/src/messages.rs`**
139+
140+
1. Update `save_message()` to include epoch column
141+
2. Update `save_processed_message()` to include epoch and mls_group_id
142+
3. Update row-to-struct conversions to read epoch
143+
4. Implement new trait methods:
144+
- `invalidate_messages_after_epoch()` - UPDATE state = 'epoch_invalidated' WHERE epoch > target
145+
- `invalidate_processed_messages_after_epoch()` - same pattern
146+
- `find_invalidated_messages()` - SELECT WHERE state = 'epoch_invalidated'
147+
- `find_invalidated_processed_messages()` - same pattern
148+
149+
**File: `crates/mdk-sqlite-storage/src/lib.rs`**
150+
151+
In `snapshot_group_state()` (around line 745), add:
152+
- Section 8: Snapshot `messages` table for this group
153+
- Section 9: Snapshot `processed_messages` table for this group
154+
155+
In `restore_group_from_snapshot()` (around line 815), add:
156+
- Delete current messages/processed_messages for group before restore
157+
- Restore from snapshot table
158+
159+
### Phase 4: Memory Storage Implementation
160+
161+
**File: `crates/mdk-memory-storage/src/messages.rs`**
162+
163+
Implement the three new `MessageStorage` trait methods.
164+
165+
### Phase 5: Callback Enhancement (mdk-core)
166+
167+
**File: `crates/mdk-core/src/callback.rs`**
168+
169+
1. Add `RollbackInfo` struct
170+
2. Update `MdkCallback::on_rollback` to take `&RollbackInfo`
171+
172+
### Phase 6: Core Message Processing (mdk-core)
173+
174+
**File: `crates/mdk-core/src/messages.rs`**
175+
176+
1. **Capture epoch when saving messages** (~line 260):
177+
```rust
178+
let message = Message {
179+
epoch: Some(mls_group.epoch().as_u64()),
180+
// ... existing fields
181+
};
182+
```
183+
184+
2. **Capture epoch when saving processed_messages**:
185+
```rust
186+
let processed = ProcessedMessage {
187+
epoch: Some(group.epoch),
188+
mls_group_id: Some(group.mls_group_id.clone()),
189+
// ... existing fields
190+
};
191+
```
192+
193+
3. **Update rollback handler** (~line 1622-1642):
194+
```rust
195+
// After successful rollback_to_epoch()
196+
let invalidated_msgs = storage.invalidate_messages_after_epoch(&group_id, msg_epoch)?;
197+
let invalidated_processed = storage.invalidate_processed_messages_after_epoch(&group_id, msg_epoch)?;
198+
199+
if let Some(cb) = &self.callback {
200+
cb.on_rollback(&RollbackInfo {
201+
group_id: group_id.clone(),
202+
target_epoch: msg_epoch,
203+
new_head_event: event.id,
204+
invalidated_messages: invalidated_msgs,
205+
messages_needing_refetch: invalidated_processed,
206+
});
207+
}
208+
```
209+
210+
4. **Allow reprocessing of EpochInvalidated** (~line 1941-1956):
211+
```rust
212+
if processed.state == ProcessedMessageState::Failed {
213+
return Err(...); // Keep existing rejection
214+
}
215+
// EpochInvalidated and other states continue processing
216+
```
217+
218+
---
219+
220+
## Testing Strategy
221+
222+
### Unit Tests
223+
- Epoch field serialization/deserialization for Message and ProcessedMessage
224+
- EpochInvalidated state for both MessageState and ProcessedMessageState
225+
- Migration with NULL epochs (backward compatibility)
226+
227+
### Integration Tests
228+
- **Rollback with messages**: verify both Message and ProcessedMessage marked EpochInvalidated
229+
- **Reprocessing after rollback**: verify EpochInvalidated messages can be reprocessed
230+
- **Snapshot consistency**: verify SQLite and Memory storage behave identically
231+
- **Callback info**: verify RollbackInfo contains correct invalidated message lists
232+
233+
### Update Existing Tests
234+
- Tests creating Message/ProcessedMessage need epoch field
235+
- Callback tests need RollbackInfo handling
236+
- Commit race tests should verify message invalidation
237+
238+
---
239+
240+
## Verification
241+
242+
1. Run `cargo test -p mdk-storage-traits`
243+
2. Run `cargo test -p mdk-sqlite-storage`
244+
3. Run `cargo test -p mdk-memory-storage`
245+
4. Run `cargo test -p mdk-core`
246+
5. Run full test suite: `cargo test --workspace`
247+
6. Verify commit race tests pass with message cleanup

0 commit comments

Comments
 (0)