Commit 72d36df

update docs
1 parent 7882d79 commit 72d36df

2 files changed

+166
-158
lines changed

replication/README.md

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
1+
# Protocol Overview
2+
3+
The Chotki replication protocol enables peer-to-peer synchronization of CRDT-based
4+
databases between replicas without a master server or consensus algorithm. The protocol
5+
maintains causal consistency and supports efficient differential synchronization.
6+
7+
# Synchronization States
8+
9+
The protocol operates through a finite state machine with the following states:
10+
11+
SendHandshake → SendDiff → SendLive → SendEOF → SendNone
12+
↓ ↕
13+
SendPing ← → SendPong
14+
15+
State descriptions:
16+
17+
- SendHandshake: Initial connection setup and capability negotiation
18+
- SendDiff: Bulk synchronization of historical data differences
19+
- SendLive: Real-time streaming of new changes as they occur
20+
- SendPing/SendPong: Keep-alive mechanism during live sync
21+
- SendEOF: Graceful connection termination
22+
- SendNone: Connection closed state
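
The states above can be sketched as a small Go enum with a transition table. This is only an illustration: the state names come from the list above, but the numeric values, the `transitions` table, and `CanTransition` are assumptions based on one reading of the diagram (ping/pong hanging off `SendLive`, and `SendDiff` going straight to `SendEOF` when live sync is not requested). It is not the actual Chotki implementation.

```go
package main

import "fmt"

// SyncState uses the state names from the list above.
// The numeric values are illustrative only.
type SyncState int

const (
	SendHandshake SyncState = iota
	SendDiff
	SendLive
	SendPing
	SendPong
	SendEOF
	SendNone
)

// transitions is one possible reading of the diagram (an assumption),
// not the transition table used by Chotki itself.
var transitions = map[SyncState][]SyncState{
	SendHandshake: {SendDiff},
	SendDiff:      {SendLive, SendEOF},
	SendLive:      {SendPing, SendPong, SendEOF},
	SendPing:      {SendLive},
	SendPong:      {SendLive},
	SendEOF:       {SendNone},
}

// CanTransition reports whether this sketch allows moving from one state to another.
func CanTransition(from, to SyncState) bool {
	for _, s := range transitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(SendHandshake, SendDiff)) // true
	fmt.Println(CanTransition(SendNone, SendLive))      // false
}
```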
23+
24+
# Synchronization Modes
25+
26+
The protocol supports different synchronization modes via bitmask flags:
27+
28+
SyncRead (1): Can receive data from peer
29+
SyncWrite (2): Can send data to peer
30+
SyncLive (4): Supports real-time live synchronization
31+
32+
Common combinations:
33+
34+
- SyncRW = SyncRead | SyncWrite (bidirectional batch sync)
35+
- SyncRL = SyncRead | SyncLive (read + live updates)
36+
- SyncRWLive = SyncRead | SyncWrite | SyncLive (full bidirectional)
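
As a minimal sketch, the flags and combinations above map naturally onto Go bitmask constants. The flag names and values are taken from the list above; the `SyncMode` type name and the `Has` helper are hypothetical and only serve the illustration.

```go
package main

import "fmt"

// SyncMode is a hypothetical type name for the mode bitmask.
type SyncMode uint8

const (
	SyncRead  SyncMode = 1 // can receive data from peer
	SyncWrite SyncMode = 2 // can send data to peer
	SyncLive  SyncMode = 4 // supports real-time live synchronization

	SyncRW     = SyncRead | SyncWrite
	SyncRL     = SyncRead | SyncLive
	SyncRWLive = SyncRead | SyncWrite | SyncLive
)

// Has reports whether every bit of flag is set in m (illustrative helper).
func (m SyncMode) Has(flag SyncMode) bool {
	return m&flag == flag
}

func main() {
	fmt.Println(SyncRL.Has(SyncLive), SyncRL.Has(SyncWrite)) // true false
	fmt.Println(SyncRWLive)                                  // 7
}
```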
37+
38+
# Protocol Flow
39+
40+
## 1. Handshake Phase
41+
42+
The synchronization begins with a handshake message:
43+
44+
H(T{snapshot_id} M{mode} V{version_vector} S{trace_id})
45+
46+
Where:
47+
48+
- H: Handshake packet type
49+
- T: Snapshot timestamp ID (current replica state)
50+
- M: Sync mode bitmask (read/write/live capabilities)
51+
- V: Version vector (TLV-encoded replica states)
52+
- S: Session trace ID (for logging and debugging)
53+
54+
The handshake establishes:
55+
56+
- Peer capabilities and sync mode
57+
- Version vectors for differential calculation
58+
- Session trace IDs for request tracking
59+
60+
## 2. Diffsync Phase
61+
62+
After handshake, the protocol sends block-based diffs:
63+
64+
D(T{snapshot_id} R{block_range} F{offset}+)
65+
66+
Where:
67+
68+
- D: Diff packet type
69+
- T: Reference snapshot ID
70+
- R: Block range being synchronized
71+
- F: Frame offset within block followed by operation data
72+
73+
The diff phase:
74+
75+
- Compares version vectors to identify missing data
76+
- Sends data in blocks with maximum size of 100MB (MaxParcelSize)
77+
- Uses CRDT-specific diff algorithms (rdx.Xdiff) for efficiency
78+
- Processes blocks sequentially until all differences are sent
79+
80+
## 3. Version Vector Exchange
81+
82+
After diff completion, version vectors are exchanged:
83+
84+
V(R{block_range} version_vector_data)
85+
86+
This finalizes the differential sync by confirming the new state.
87+
88+
## 4. Live Sync Phase (Optional)
89+
90+
If SyncLive mode is enabled, the protocol enters real-time sync:
91+
92+
- Streams new operations as they occur via the outbound queue
93+
- Maintains connection with periodic ping/pong messages
94+
- Ping interval and timeout are configurable per connection
95+
96+
Ping/Pong messages:
97+
98+
P("ping") - Keep-alive ping
99+
P("pong") - Keep-alive response
100+
101+
## 5. Termination
102+
103+
Connections end with a bye message containing the reason:
104+
105+
B(T{final_snapshot_id} reason_text)
106+
107+
# Versioning
108+
109+
A key component of the protocol is the version vector. Each time data changes on any replica, the change
110+
is stamped with an rdx.ID (auto-incrementing sequence number + source replica ID). This rdx.ID then lets us
111+
determine what needs to be synced.
112+
113+
The version vector is maintained in VKey0, so every time we see a change from the same replica or from another replica,
114+
we update VKey0.
115+
The version vector (VV) layout is basically a map: src_id -> last seen rdx.ID. So we always know the latest event we saw from each replica.
116+
117+
There is a special src_id = 0, which covers a set of objects created in Log0, so that we have some common system objects
118+
on each replica. When those objects are updated, we also record them in the VV as: 0 -> last seen rdx.ID.
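
A minimal sketch of the version vector described above, under simplifying assumptions: `ID` is a stand-in for rdx.ID with separate `Src` and `Seq` fields, and the vector stores only the last seen sequence number per source. The type and method names are hypothetical and do not reflect how VKey0 is actually encoded.

```go
package main

import "fmt"

// ID is a simplified stand-in for rdx.ID: source replica plus sequence number.
type ID struct {
	Src uint64
	Seq uint64
}

// VV maps src_id to the highest sequence number seen from that source.
type VV map[uint64]uint64

// Seen reports whether the change identified by id is already covered by the vector.
func (vv VV) Seen(id ID) bool {
	last, ok := vv[id.Src]
	return ok && id.Seq <= last
}

// Record advances the entry for id.Src and reports whether id was new.
func (vv VV) Record(id ID) bool {
	if vv.Seen(id) {
		return false
	}
	vv[id.Src] = id.Seq
	return true
}

func main() {
	vv := VV{}
	fmt.Println(vv.Record(ID{Src: 7, Seq: 1})) // true: first change from replica 7
	fmt.Println(vv.Record(ID{Src: 7, Seq: 1})) // false: already seen
	fmt.Println(vv.Record(ID{Src: 0, Seq: 3})) // true: system objects use src_id 0
	fmt.Println(vv)                            // map[0:3 7:1]
}
```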
119+
120+
# Handshake
121+
122+
This is the first thing that happens when we connect to a new replica: we initialize the sync session.
123+
Most importantly, we send our VV (along with a trace_id for logging) and create a pebble snapshot that
124+
we will use during diffsync. As soon as we connect, we start accumulating live events until diffsync completes.
125+
We apply them later. We do not guarantee that the snapshot excludes live events that are also in the queue, but
126+
that does not matter, because all operations in chotki are idempotent.
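
The buffering described above can be sketched as follows. This is a toy model under stated assumptions: `Op`, `Store`, and `Session` are hypothetical types, and the "newest stamp wins" rule in `Apply` is only a stand-in for Chotki's CRDT merge. The point it illustrates is that an operation present both in the snapshot and in the queue can be applied twice without changing the result.

```go
package main

import "fmt"

// Op is a hypothetical operation: a key, the stamp of the change, and a value.
type Op struct {
	Key   string
	Stamp uint64
	Value string
}

// Store keeps the newest Op per key, so applying the same Op twice is a no-op.
type Store map[string]Op

// Apply is idempotent: re-applying an already applied Op leaves the store unchanged.
func (s Store) Apply(op Op) {
	if cur, ok := s[op.Key]; !ok || op.Stamp > cur.Stamp {
		s[op.Key] = op
	}
}

// Session buffers live ops until diff sync has finished.
type Session struct {
	store    Store
	queue    []Op
	diffDone bool
}

// OnLive queues the op during diff sync and applies it directly afterwards.
func (s *Session) OnLive(op Op) {
	if !s.diffDone {
		s.queue = append(s.queue, op)
		return
	}
	s.store.Apply(op)
}

// FinishDiff marks diff sync as complete and replays the buffered ops.
func (s *Session) FinishDiff() {
	s.diffDone = true
	for _, op := range s.queue {
		s.store.Apply(op)
	}
	s.queue = nil
}

func main() {
	s := &Session{store: Store{}}
	op := Op{Key: "x", Stamp: 2, Value: "new"}
	s.OnLive(op)      // arrives while diff sync is running, so it is buffered
	s.store.Apply(op) // the same op also arrives through the diff
	s.FinishDiff()    // replaying the queue applies it a second time
	fmt.Println(len(s.store), s.store["x"].Value) // 1 new
}
```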
127+
128+
# Diff sync
129+
130+
As described above, when replicas first connect, they enter the diff sync state after the handshake. The idea is that
131+
before we start live sync, we need to equalize the state between the 2 snapshots made in the previous step.
132+
In diff sync we sync data in blocks. A block is basically a range of rdx.IDs: when we update our VV,
133+
we take the rdx.ID and apply SyncBlockMask to it (basically cutting the first N bits from it), and treat the result as the block.
134+
For each block we also maintain a VV covering all updates associated with that block, so whenever a DB update occurs,
135+
we update 2 VVs:
136+
137+
- global VV (VKey0)
138+
- block VV (VKey(block_id))
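
The double bookkeeping above can be sketched like this. Everything numeric here is an assumption made for illustration: the 64-bit ID layout (`seqBits`), the number of masked bits (`blockBits`), and the reading of "cutting the first N bits" as clearing the low-order bits. The real SyncBlockMask and rdx.ID layout may differ.

```go
package main

import "fmt"

const (
	seqBits   = 32 // hypothetical: low 32 bits hold the sequence number
	blockBits = 8  // hypothetical N: bits cleared to obtain the block id

	// syncBlockMask clears the low blockBits bits of an ID (assumed meaning of SyncBlockMask).
	syncBlockMask = ^uint64(1<<blockBits - 1)
)

// VV maps src_id to the last seen sequence number.
type VV map[uint64]uint64

func makeID(src, seq uint64) uint64 { return src<<seqBits | seq }
func srcOf(id uint64) uint64        { return id >> seqBits }
func seqOf(id uint64) uint64        { return id & (1<<seqBits - 1) }
func blockOf(id uint64) uint64      { return id & syncBlockMask }

// Index holds the global VV and one VV per block.
type Index struct {
	Global VV
	Blocks map[uint64]VV
}

// Record updates both the global VV and the VV of the block the ID falls into.
func (ix *Index) Record(id uint64) {
	src, seq := srcOf(id), seqOf(id)
	if seq > ix.Global[src] {
		ix.Global[src] = seq
	}
	b := blockOf(id)
	if ix.Blocks[b] == nil {
		ix.Blocks[b] = VV{}
	}
	if seq > ix.Blocks[b][src] {
		ix.Blocks[b][src] = seq
	}
}

func main() {
	ix := &Index{Global: VV{}, Blocks: map[uint64]VV{}}
	ix.Record(makeID(1, 5))   // lands in the first block of replica 1
	ix.Record(makeID(1, 300)) // lands in the second block of replica 1
	fmt.Println(ix.Global[1], len(ix.Blocks)) // 300 2
}
```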
139+
140+
The algorithm of diff sync is as follows:
141+
142+
1. We take a block and look at its VV, then we look at the other replica's global VV.
143+
2. If the other replica has not seen something from this block, we start syncing this block.
144+
3. We iterate over all objects inside this block (those with rdx.IDs that fall within the block range).
145+
4. If the other replica has not seen this object at all (meaning its VV entry for the object's src is missing or smaller than the object's rdx.ID), we just send the whole object.
146+
5. Otherwise, we use XDiff; however, in the majority of cases it will just send the whole object anyway for simplicity. This may change in the future.
147+
6. Once all blocks are synced, we exchange the VVs of the blocks we synced, so the other replica can update them.
148+
7. Each replica accumulates updates in a pebble.Batch, and when it receives the V packet at the end, it applies the batch to the DB.
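
Steps 1 to 5 of the algorithm can be sketched as a planning function. This is an illustration only: `ID`, `Object`, `Block`, `needsSync`, and `plan` are hypothetical names, version vectors are reduced to a map of src to sequence number, and the actual diffing (XDiff) and packet encoding are left out. The sketch only decides which objects would be sent whole and which would go through a diff.

```go
package main

import "fmt"

// ID is a simplified stand-in for rdx.ID.
type ID struct{ Src, Seq uint64 }

// VV maps src_id to the last seen sequence number.
type VV map[uint64]uint64

// Object is a hypothetical stored object identified by the ID that created it.
type Object struct {
	ID   ID
	Data string
}

// Block groups objects of one ID range together with the block's own VV.
type Block struct {
	VV      VV
	Objects []Object
}

// needsSync covers steps 1-2: the peer lacks something recorded in the block VV.
func needsSync(b Block, peer VV) bool {
	for src, seq := range b.VV {
		if last, ok := peer[src]; !ok || last < seq {
			return true
		}
	}
	return false
}

// plan covers steps 3-5: unseen objects are sent whole, the rest are diffed.
func plan(b Block, peer VV) (whole, diffed []Object) {
	if !needsSync(b, peer) {
		return nil, nil
	}
	for _, o := range b.Objects {
		if last, ok := peer[o.ID.Src]; !ok || last < o.ID.Seq {
			whole = append(whole, o)
		} else {
			diffed = append(diffed, o)
		}
	}
	return whole, diffed
}

func main() {
	b := Block{
		VV: VV{1: 9, 2: 4},
		Objects: []Object{
			{ID{1, 3}, "old"},
			{ID{1, 9}, "new"},
			{ID{2, 4}, "other"},
		},
	}
	peer := VV{1: 5} // the peer has seen replica 1 up to 5 and nothing from replica 2
	whole, diffed := plan(b, peer)
	fmt.Println(len(whole), len(diffed)) // 2 1
}
```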
149+
150+
An important note: during diff sync we also broadcast all 'D' and 'H' packets to all other replicas.
151+
Imagine there are 3 replicas: A <-> B <-> C.
152+
153+
- Let's say A and B are live syncing and C has just connected to B.
154+
- If we do not broadcast 'D' packets, B and C will sync; however, anything in C that A has not seen will not reach A until the next diff sync between A and B.
155+
- But if we broadcast 'D' and 'H' packets, we basically open a syncing session between all upstream replicas and C, so
156+
they will sync all the data from C.
157+
158+
Unfortunately, this can create a lot of excess traffic, especially when we roll out a new version and many diff syncs happen at once.
159+
But it simplifies the protocol a lot.
160+
161+
# Live sync
162+
163+
After diff sync, we can process all updates that accumulated in the queue and keep processing new ones as they arrive.
164+
When we receive a batch of records during live sync, we apply them to the DB immediately and also broadcast them to all other replicas.
165+
Because chotki currently only allows tree-shaped replication topologies, we know that we can safely send events to all connected replicas without fear of loops.
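
A toy model of the broadcast rule, to show why a tree topology makes it loop-free. The `Replica` type, `Connect`, and `Receive` are hypothetical, and the sketch assumes a record is not echoed back to the peer it arrived from, which the text above does not state explicitly. In a tree, each record then reaches every replica exactly once and the recursion terminates.

```go
package main

import "fmt"

// Replica is a toy model of a node: its peers and the records it has applied.
type Replica struct {
	Name    string
	Peers   []*Replica
	Applied []string
}

// Connect links two replicas in both directions.
func Connect(a, b *Replica) {
	a.Peers = append(a.Peers, b)
	b.Peers = append(b.Peers, a)
}

// Receive applies the record and relays it to every peer except the sender.
// Skipping the sender is an assumption of this sketch.
func (r *Replica) Receive(from *Replica, rec string) {
	r.Applied = append(r.Applied, rec)
	for _, p := range r.Peers {
		if p != from {
			p.Receive(r, rec)
		}
	}
}

func main() {
	a, b, c := &Replica{Name: "A"}, &Replica{Name: "B"}, &Replica{Name: "C"}
	Connect(a, b) // A <-> B
	Connect(b, c) // B <-> C
	a.Receive(nil, "op1") // a local change on A
	fmt.Println(len(a.Applied), len(b.Applied), len(c.Applied)) // 1 1 1
}
```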

replication/doc.go

Lines changed: 1 addition & 158 deletions
@@ -1,159 +1,2 @@
1-
/*
2-
Implements the Chotki distributed synchronization protocol.
3-
4-
# Protocol Overview
5-
6-
The Chotki replication protocol enables peer-to-peer synchronization of CRDT-based
7-
databases between replicas without a master server or consensus algorithm. The protocol
8-
maintains causal consistency and supports efficient differential synchronization.
9-
10-
# Synchronization States
11-
12-
The protocol operates through a finite state machine with the following states:
13-
14-
SendHandshake → SendDiff → SendLive → SendEOF → SendNone
15-
↓ ↕
16-
SendPing ← → SendPong
17-
18-
State descriptions:
19-
- SendHandshake: Initial connection setup and capability negotiation
20-
- SendDiff: Bulk synchronization of historical data differences
21-
- SendLive: Real-time streaming of new changes as they occur
22-
- SendPing/SendPong: Keep-alive mechanism during live sync
23-
- SendEOF: Graceful connection termination
24-
- SendNone: Connection closed state
25-
26-
# Synchronization Modes
27-
28-
The protocol supports different synchronization modes via bitmask flags:
29-
30-
SyncRead (1): Can receive data from peer
31-
SyncWrite (2): Can send data to peer
32-
SyncLive (4): Supports real-time live synchronization
33-
34-
Common combinations:
35-
- SyncRW = SyncRead | SyncWrite (bidirectional batch sync)
36-
- SyncRL = SyncRead | SyncLive (read + live updates)
37-
- SyncRWLive = SyncRead | SyncWrite | SyncLive (full bidirectional)
38-
39-
# Protocol Flow
40-
41-
## 1. Handshake Phase
42-
43-
The synchronization begins with a handshake message:
44-
45-
H(T{snapshot_id} M{mode} V{version_vector} S{trace_id})
46-
47-
Where:
48-
- H: Handshake packet type
49-
- T: Snapshot timestamp ID (current replica state)
50-
- M: Sync mode bitmask (read/write/live capabilities)
51-
- V: Version vector (TLV-encoded replica states)
52-
- S: Session trace ID (for logging and debugging)
53-
54-
The handshake establishes:
55-
- Peer capabilities and sync mode
56-
- Version vectors for differential calculation
57-
- Session trace IDs for request tracking
58-
59-
## 2. Diffsync Phase
60-
61-
After handshake, the protocol sends block-based diffs:
62-
63-
D(T{snapshot_id} R{block_range} F{offset}+)
64-
65-
Where:
66-
- D: Diff packet type
67-
- T: Reference snapshot ID
68-
- R: Block range being synchronized
69-
- F: Frame offset within block followed by operation data
70-
71-
The diff phase:
72-
- Compares version vectors to identify missing data
73-
- Sends data in blocks with maximum size of 100MB (MaxParcelSize)
74-
- Uses CRDT-specific diff algorithms (rdx.Xdiff) for efficiency
75-
- Processes blocks sequentially until all differences are sent
76-
77-
## 3. Version Vector Exchange
78-
79-
After diff completion, version vectors are exchanged:
80-
81-
V(R{block_range} version_vector_data)
82-
83-
This finalizes the differential sync by confirming the new state.
84-
85-
## 4. Live Sync Phase (Optional)
86-
87-
If SyncLive mode is enabled, the protocol enters real-time sync:
88-
- Streams new operations as they occur via the outbound queue
89-
- Maintains connection with periodic ping/pong messages
90-
- Ping interval and timeout are configurable per connection
91-
92-
Ping/Pong messages:
93-
94-
P("ping") - Keep-alive ping
95-
P("pong") - Keep-alive response
96-
97-
## 5. Termination
98-
99-
Connections end with a bye message containing the reason:
100-
101-
B(T{final_snapshot_id} reason_text)
102-
103-
# Versioning
104-
105-
Key component of the protocol is the version vector. Each time we change data on any replica, this change
106-
is stamped by an rdx.ID (autoincrementing sequence number + source replica id). This rdx.ID will then help us to
107-
determine which things we need to sync.
108-
109-
Version vector is maintained in VKey0, so every time we see change from the same replica or from other replica,
110-
we update VKey0
111-
Versions vector or (VV) layout is basically a map: src_id -> last seen rdx.ID. So, we always no which latest event we saw from each replica.
112-
113-
There is a special src_id = 0, which is a bunch of objects created in Log0, so we could have some common system objects
114-
on each replica. So when those objects are updated they we also store them like: 0 -> last seen rdx.ID in the VV.
115-
116-
# Handshake
117-
This is the first thing happens when we connect to a new replica. We init the sync session.
118-
But most importantly we send our VV (also we send trace_id for logging) and we also create pebble snapshot, that
119-
we will use during diffsync. When we just connected, we start to accumulate live events, until we completed diffsync.
120-
We will apply them later, while we do not guarantee that created snapshot do not contain any live events from the queue,
121-
we do not care as all operations in chotki are idempotent.
122-
123-
# Diff sync
124-
As described above, when replicas first connect after handshake they enter diff sync state. The idea is that
125-
before we start live sync, we need to equalize the state between 2 snapshots we made on previous step.
126-
In diff sync we sync data using blocks. Block is basically a range of rdx.ID, so when we update our VV,
127-
we then take this rdx.ID and apply SyncBlockMask to it (basically cutting first N bits from it) and consider it a block.
128-
Then for each block we will also maintain a VV for all updates associated with this block, meaning if DB update occurs,
129-
we update 2 VVs:
130-
- global VV (VKey0)
131-
- block VV (VKey(block_id))
132-
133-
The algorithm of diff sync is as follows:
134-
135-
1. We take a block and look at its VV, then we look at other replica global VV
136-
2. If other replica has not seen something from this block, we start syncing of this block
137-
3. We start iterating all objects inside this block (with rdx.IDs that conform to this block range)
138-
4. If other replica has not seen this object at all (meaning its VV for the src of this object is non existent or smaller than this object rdx.ID), then we just send this object
139-
5. Ohterwise, we use XDiff, however in a majority of cases it will just send whole object anyway for simplicity. It may change in future.
140-
6. When we sync all blocks we then exchange VVs of blocks we synced, so other replica can update them.
141-
7. Each replica accumulate updates in pebble.Batch, and when it receives V packet in the end, it applies it to the DB.\
142-
143-
Important note that during diff sync we also broadcast all 'D' and 'H' packets to all other replicas.
144-
Imagine there are 3 replicas: A <-> B <-> C.
145-
- Let's say A, B are live syncing and C is just connected to B.
146-
- If we do not broadcast 'D' packets, then B and C will sync, however if there is something in C, that A haven't seen it will not be synced, until diff sync between A and B.
147-
- But if we broadcast 'D' and 'H' packets, we basically open syncing session between all upstream replicas and C, so
148-
they will sync all the data from C
149-
150-
Unfortunattely, it can create a lot of excess traffic, especially when we rollout new version and there are a lot of diffsyncs happenings.
151-
But it simplified protocol a lot.
152-
153-
# Live sync
154-
155-
After diff sync, we now can process all updates that were accumulated in the queue and continue process them as we go.
156-
When we receive a bunch of records during live sync, we apply them to the DB immediately and also broadcast then to all other replicas.
157-
Due to the fact that currently chotki only allows tree replication structures, we know that we can safely send events to all connected replicas without fear of loops.
158-
*/
1+
// Implements the Chotki distributed synchronization protocol.
1592
package replication
