# Delayed Reconcile ACK for Storage Reconciliation

## Overview

Today, the server handles a `reconcile` batch by immediately replying with:

1. one `load` per out-of-sync CoValue, then
2. a `reconcile-ack` for the batch.

Because the ack arrives before the batch has actually finished syncing content, the client can send the next batch right away, creating excessive concurrent load on servers.

This design changes the behavior so that the server sends `reconcile-ack` **only after all out-of-sync CoValues in that batch are fully synced**.

## Goal

Apply backpressure across reconciliation batches:

- The client should send the next batch only after the previous batch is actually synced.
- The server should not acknowledge a batch until all reconcile-triggered sync work for that batch is complete.

## Completion Semantics

A CoValue in a reconcile batch is considered complete when the server observes completion of the corresponding load request:

- `known` response: complete immediately.
- `content` response: complete when streaming finishes (existing `trackLoadRequestComplete` behavior).

This avoids relying on a strict known-state equality checkpoint, which can be unstable while writes continue.

## Scope

- **In scope**: server-side batching/ack behavior for `reconcile` in `packages/cojson/src/sync.ts`.
- **In scope**: tests validating delayed-ACK behavior and batch backpressure.
- **Out of scope**: changes to the reconcile message format, the lock scheduling strategy, or client-side batch ordering logic.

## Proposed Changes

### 1. Add server-side reconcile batch state tracking

**File:** `packages/cojson/src/sync.ts`

Track pending CoValues per peer and per batch:

- `reconcileBatchState[peerId][batchId] = { pending: Set<RawCoID>, createdAt }`
- Reverse index for fast completion lookups:
  - `reconcileBatchesByCoValue[peerId][coValueId] -> Set<batchId>`

Requirements:

- Near-O(1) completion updates keyed by CoValue ID.
- Safe cleanup on ack or disconnect.
- No global coupling between peers.
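
The two maps and their invariants can be sketched as a small helper. This is a sketch under the assumptions above: the `ReconcileBatchTracker` class name and the `RawCoID`/`PeerID` aliases are illustrative, not the actual cojson types.

```typescript
// Illustrative sketch only: tracker name and type aliases are assumptions.
type RawCoID = string;
type PeerID = string;

interface BatchState {
  pending: Set<RawCoID>;
  createdAt: number;
}

class ReconcileBatchTracker {
  // Forward map: peer -> batchId -> pending CoValues in that batch.
  private batches = new Map<PeerID, Map<number, BatchState>>();
  // Reverse index: peer -> coValueId -> batchIds still waiting on it.
  private byCoValue = new Map<PeerID, Map<RawCoID, Set<number>>>();

  register(peer: PeerID, batchId: number, coValueId: RawCoID): void {
    let peerBatches = this.batches.get(peer);
    if (!peerBatches) this.batches.set(peer, (peerBatches = new Map()));
    let batch = peerBatches.get(batchId);
    if (!batch) {
      batch = { pending: new Set<RawCoID>(), createdAt: Date.now() };
      peerBatches.set(batchId, batch);
    }
    batch.pending.add(coValueId);

    let peerIndex = this.byCoValue.get(peer);
    if (!peerIndex) this.byCoValue.set(peer, (peerIndex = new Map()));
    let batchIds = peerIndex.get(coValueId);
    if (!batchIds) peerIndex.set(coValueId, (batchIds = new Set()));
    batchIds.add(batchId);
  }

  // Marks one CoValue complete; returns the batch IDs that became empty
  // (i.e. are now ready to ack). Uses the reverse index, so the cost is
  // proportional to the batches containing this CoValue, not all batches.
  complete(peer: PeerID, coValueId: RawCoID): number[] {
    const ready: number[] = [];
    const batchIds = this.byCoValue.get(peer)?.get(coValueId);
    if (!batchIds) return ready;
    for (const batchId of batchIds) {
      const batch = this.batches.get(peer)?.get(batchId);
      if (!batch) continue;
      batch.pending.delete(coValueId);
      if (batch.pending.size === 0) {
        this.batches.get(peer)!.delete(batchId);
        ready.push(batchId);
      }
    }
    this.byCoValue.get(peer)?.delete(coValueId);
    return ready;
  }

  // Per-peer state only, so peer cleanup cannot affect other peers.
  clearPeer(peer: PeerID): void {
    this.batches.delete(peer);
    this.byCoValue.delete(peer);
  }
}
```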

### 2. Update `handleReconcile` to defer the ACK

**File:** `packages/cojson/src/sync.ts`

For each `[coValueId, sessionsHash]` entry:

- If in sync: do nothing.
- If out of sync: send `load` and register `coValueId` in the batch's `pending` set.

After iterating:

- If `pending.size === 0`, send `reconcile-ack` immediately.
- Otherwise, persist the batch tracker and **do not** ack yet.

### 3. Mark reconcile items complete from existing receive paths

**File:** `packages/cojson/src/sync.ts`

After the existing load-complete logic in:

- `handleKnownState`
- `handleNewContent`

call:

- `markReconcileItemComplete(peer.id, msg.id)`

`markReconcileItemComplete` behavior:

- Remove the CoValue from all pending batches for that peer.
- If a batch becomes empty, send `reconcile-ack` for that batch and clear its tracker state.

### 4. Add lifecycle cleanup and safety guards

**File:** `packages/cojson/src/sync.ts`

- On peer close, clear all outstanding reconcile batch trackers for that peer.
- Optionally log and clean up stale batches after a timeout, to avoid memory leaks in pathological cases.
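
One way to sketch both guards, assuming a simple periodic sweep; the timeout value and function names are illustrative, not part of the actual design.

```typescript
// Sketch of lifecycle cleanup; names and the timeout are assumptions.
type PeerID = string;

interface BatchState {
  pending: Set<string>;
  createdAt: number;
}

const STALE_BATCH_TIMEOUT_MS = 60_000; // illustrative threshold

// On peer close: drop all outstanding batch trackers for that peer so no
// stale ack can be emitted after a reconnect.
function cleanupPeer(
  batches: Map<PeerID, Map<number, BatchState>>,
  peerId: PeerID,
): void {
  batches.delete(peerId);
}

// Optional guard: log and drop batches that have been pending too long.
function sweepStaleBatches(
  batches: Map<PeerID, Map<number, BatchState>>,
  now: number,
  log: (msg: string) => void,
): void {
  for (const [peerId, peerBatches] of batches) {
    for (const [batchId, state] of peerBatches) {
      if (now - state.createdAt > STALE_BATCH_TIMEOUT_MS) {
        log(`reconcile batch ${batchId} for peer ${peerId} is stale; dropping`);
        peerBatches.delete(batchId); // safe: Map tolerates delete-while-iterating
      }
    }
    if (peerBatches.size === 0) batches.delete(peerId);
  }
}
```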

### 5. Keep client-side behavior unchanged

`startStorageReconciliation` already waits for `reconcile-ack` before continuing to the next storage batch, via `StorageReconciliationAckTracker.waitForAck`.

Once the server's ACK timing changes, no protocol or client-flow changes are required to gain backpressure.

## Testing Plan

### Unit/Integration tests

**Likely files:** `packages/cojson/src/tests/*sync*.test.ts`

Add cases:

1. The server does not send `reconcile-ack` immediately after receiving a reconcile batch with mismatches.
2. The server sends `reconcile-ack` only after all mismatched CoValues complete (`known` or final `content`).
3. Delayed completion of a single CoValue delays the batch ack.
4. A batch containing only in-sync CoValues is acked immediately.
5. Peer disconnect mid-batch clears tracker state and does not emit a stale ack after reconnect.
6. Client-side next-batch progression remains gated on the ack (effective backpressure).
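
Cases 1 through 4 can be exercised against a toy server model like the one below; this is purely illustrative, and real tests would drive the actual `sync.ts` handlers instead.

```typescript
// Toy model of the delayed-ack server behavior, for illustration only.
type RawCoID = string;

class MockReconcileServer {
  private pending = new Map<number, Set<RawCoID>>();
  acks: number[] = [];

  // A batch arrives with its already-computed out-of-sync CoValues.
  receiveReconcile(batchId: number, outOfSync: RawCoID[]): void {
    if (outOfSync.length === 0) {
      this.acks.push(batchId); // case 4: all in sync, ack immediately
      return;
    }
    this.pending.set(batchId, new Set(outOfSync));
  }

  // A load request for one CoValue finished (known reply or final content).
  receiveLoadComplete(coValueId: RawCoID): void {
    for (const [batchId, ids] of this.pending) {
      ids.delete(coValueId);
      if (ids.size === 0) {
        this.pending.delete(batchId);
        this.acks.push(batchId); // cases 2-3: ack only after the last completion
      }
    }
  }
}
```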

## Rollout Notes

- This is a server-side behavior change but is wire-compatible (no message schema changes).
- Expected result: lower peak server load by reducing overlap between reconciliation batches.