Skip to content

Commit 6c46ef6

Browse files
committed
replication docs
1 parent 33ff9f4 commit 6c46ef6

File tree

1 file changed

+159
-0
lines changed

1 file changed

+159
-0
lines changed

replication/doc.go

Lines changed: 159 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,159 @@
1+
/*
2+
Package replication implements the Chotki distributed synchronization protocol.
3+
4+
# Protocol Overview
5+
6+
The Chotki replication protocol enables peer-to-peer synchronization of CRDT-based
7+
databases between replicas without a master server or consensus algorithm. The protocol
8+
maintains causal consistency and supports efficient differential synchronization.
9+
10+
# Synchronization States
11+
12+
The protocol operates through a finite state machine with the following states:
13+
14+
SendHandshake → SendDiff → SendLive → SendEOF → SendNone
15+
↓ ↕
16+
SendPing ← → SendPong
17+
18+
State descriptions:
19+
- SendHandshake: Initial connection setup and capability negotiation
20+
- SendDiff: Bulk synchronization of historical data differences
21+
- SendLive: Real-time streaming of new changes as they occur
22+
- SendPing/SendPong: Keep-alive mechanism during live sync
23+
- SendEOF: Graceful connection termination
24+
- SendNone: Connection closed state
25+
26+
# Synchronization Modes
27+
28+
The protocol supports different synchronization modes via bitmask flags:
29+
30+
SyncRead (1): Can receive data from peer
31+
SyncWrite (2): Can send data to peer
32+
SyncLive (4): Supports real-time live synchronization
33+
34+
Common combinations:
35+
- SyncRW = SyncRead | SyncWrite (bidirectional batch sync)
36+
- SyncRL = SyncRead | SyncLive (read + live updates)
37+
- SyncRWLive = SyncRead | SyncWrite | SyncLive (full bidirectional)
38+
39+
# Protocol Flow
40+
41+
## 1. Handshake Phase
42+
43+
The synchronization begins with a handshake message:
44+
45+
H(T{snapshot_id} M{mode} V{version_vector} S{trace_id})
46+
47+
Where:
48+
- H: Handshake packet type
49+
- T: Snapshot timestamp ID (current replica state)
50+
- M: Sync mode bitmask (read/write/live capabilities)
51+
- V: Version vector (TLV-encoded replica states)
52+
- S: Session trace ID (for logging and debugging)
53+
54+
The handshake establishes:
55+
- Peer capabilities and sync mode
56+
- Version vectors for differential calculation
57+
- Session trace IDs for request tracking
58+
59+
## 2. Diffsync Phase
60+
61+
After handshake, the protocol sends block-based diffs:
62+
63+
D(T{snapshot_id} R{block_range} F{offset}+)
64+
65+
Where:
66+
- D: Diff packet type
67+
- T: Reference snapshot ID
68+
- R: Block range being synchronized
69+
- F: Frame offset within block followed by operation data
70+
71+
The diff phase:
72+
- Compares version vectors to identify missing data
73+
- Sends data in blocks with maximum size of 100MB (MaxParcelSize)
74+
- Uses CRDT-specific diff algorithms (rdx.Xdiff) for efficiency
75+
- Processes blocks sequentially until all differences are sent
76+
77+
## 3. Version Vector Exchange
78+
79+
After diff completion, version vectors are exchanged:
80+
81+
V(R{block_range} version_vector_data)
82+
83+
This finalizes the differential sync by confirming the new state.
84+
85+
## 4. Live Sync Phase (Optional)
86+
87+
If SyncLive mode is enabled, the protocol enters real-time sync:
88+
- Streams new operations as they occur via the outbound queue
89+
- Maintains connection with periodic ping/pong messages
90+
- Ping interval and timeout are configurable per connection
91+
92+
Ping/Pong messages:
93+
94+
P("ping") - Keep-alive ping
95+
P("pong") - Keep-alive response
96+
97+
## 5. Termination
98+
99+
Connections end with a bye message containing the reason:
100+
101+
B(T{final_snapshot_id} reason_text)
102+
103+
# Versioning
104+
105+
Key component of the protocol is the version vector. Each time we change data on any replica, this change
106+
is stamped by an rdx.ID (autoincrementing sequence number + source replica id). This rdx.ID will then help us to
107+
determine which things we need to sync.
108+
109+
Version vector is maintained in VKey0, so every time we see change from the same replica or from other replica,
110+
we update VKey0
111+
Versions vector or (VV) layout is basically a map: src_id -> last seen rdx.ID. So, we always no which latest event we saw from each replica.
112+
113+
There is a special src_id = 0, which is a bunch of objects created in Log0, so we could have some common system objects
114+
on each replica. So when those objects are updated they we also store them like: 0 -> last seen rdx.ID in the VV.
115+
116+
# Handshake
117+
This is the first thing happens when we connect to a new replica. We init the sync session.
118+
But most importantly we send our VV (also we send trace_id for logging) and we also create pebble snapshot, that
119+
we will use during diffsync. When we just connected, we start to accumulate live events, until we completed diffsync.
120+
We will apply them later, while we do not guarantee that created snapshot do not contain any live events from the queue,
121+
we do not care as all operations in chotki are idempotent.
122+
123+
# Diff sync
124+
As described above, when replicas first connect after handshake they enter diff sync state. The idea is that
125+
before we start live sync, we need to equalize the state between 2 snapshots we made on previous step.
126+
In diff sync we sync data using blocks. Block is basically a range of rdx.ID, so when we update our VV,
127+
we then take this rdx.ID and apply SyncBlockMask to it (basically cutting first N bits from it) and consider it a block.
128+
Then for each block we will also maintain a VV for all updates associated with this block, meaning if DB update occurs,
129+
we update 2 VVs:
130+
- global VV (VKey0)
131+
- block VV (VKey(block_id))
132+
133+
The algorithm of diff sync is as follows:
134+
135+
1. We take a block and look at its VV, then we look at other replica global VV
136+
2. If other replica has not seen something from this block, we start syncing of this block
137+
3. We start iterating all objects inside this block (with rdx.IDs that conform to this block range)
138+
4. If other replica has not seen this object at all (meaning its VV for the src of this object is non existent or smaller than this object rdx.ID), then we just send this object
139+
5. Ohterwise, we use XDiff, however in a majority of cases it will just send whole object anyway for simplicity. It may change in future.
140+
6. When we sync all blocks we then exchange VVs of blocks we synced, so other replica can update them.
141+
7. Each replica accumulate updates in pebble.Batch, and when it receives V packet in the end, it applies it to the DB.\
142+
143+
Important note that during diff sync we also broadcast all 'D' and 'H' packets to all other replicas.
144+
Imagine there are 3 replicas: A <-> B <-> C.
145+
- Let's say A, B are live syncing and C is just connected to B.
146+
- If we do not broadcast 'D' packets, then B and C will sync, however if there is something in C, that A haven't seen it will not be synced, until diff sync between A and B.
147+
- But if we broadcast 'D' and 'H' packets, we basically open syncing session between all upstream replicas and C, so
148+
they will sync all the data from C
149+
150+
Unfortunattely, it can create a lot of excess traffic, especially when we rollout new version and there are a lot of diffsyncs happenings.
151+
But it simplified protocol a lot.
152+
153+
# Live sync
154+
155+
After diff sync, we now can process all updates that were accumulated in the queue and continue process them as we go.
156+
When we receive a bunch of records during live sync, we apply them to the DB immediately and also broadcast then to all other replicas.
157+
Due to the fact that currently chotki only allows tree replication structures, we know that we can safely send events to all connected replicas without fear of loops.
158+
*/
159+
package replication

0 commit comments

Comments
 (0)