Define the phased implementation plan for syncing a shared Obsidian-style markdown folder between KOI-net peers using existing KOI transport primitives.
This is the canonical roadmap for vault sync planning.
- WireGuard mesh connectivity between peers.
- KOI-net handshake, edge approval, signed envelopes, poll/broadcast/confirm flow.
- Selective document sharing via
POST /koi-net/share. - Inbox query via
GET /koi-net/shared-with-me(includingsincedatetime filter fix). - Federation bootstrap scripts validated on blank-host path.
Status: Two-peer smoke test passes 15/15 between darren-personal and nuc-personal (Dobby).
Scope:
- Markdown-only sync (
*.md) for one shared folder. - Two peers only (single configured sync peer per node).
- Poll-based sync cycle (~60s) with trigger endpoint for tests.
- Conflict-copy strategy (no line-level merge).
- No binaries, no app-layer E2EE (transport security via WireGuard).
Core design:
- Scanner computes file hash and emits KOI
NEW/UPDATE/FORGETevents with_vault_syncmarker. - Receiver applies with causal checks (
base_hash) and idempotency table. - Safe atomic writes (
tmp+ rename), path traversal checks, size checks. - Stale delete protection (delete only when base hash matches local hash).
Key files:
api/vault_sync.py— VaultSyncManager (scan, trigger, apply, conflict, reconcile)api/koi_net_router.py— vault sync endpoints (configure, trigger, status)api/koi_protocol.py— WireManifest withextra="allow"for extension fieldsmigrations/049_vault_sync.sql— vault_sync_state, vault_sync_peers, vault_sync_applied_eventstests/test_vault_sync.py— 39 unit tests (17 Sync-1 + 22 Sync-1.5)scripts/federation/smoke-vault-sync.sh— two-peer smoke test script (15 checks)scripts/federation/soak-check.sh— periodic soak monitoring (JSONL trend log)
Bugs found and fixed during two-peer testing:
WireManifestPydantic model stripped extension fields (content_hash, relative_path) — fixed withextra="allow".- Poll endpoint manifest transformation rebuilt dict from scratch, dropping custom fields — fixed to preserve original fields via
dict(m). - FORGET
origin_seqnot incrementing past NEW event — receiver's stale-event guard rejected deletes. Fixed by incrementing seq on delete.
Definition of done (all met):
- Two-peer smoke test passes: create/update/delete/conflict — 15/15 PASS.
- Redelivery is idempotent (no duplicate conflict copies).
- Invalid payload/path attempts are rejected and logged.
- Re-handshake updates capabilities and vault events are delivered.
Status: READY, pending final external peer run.
Pre-gate evidence:
- Local + Dobby two-peer smoke test passed 15/15.
- Regression bugs found in live run were fixed (manifest extension fields, poll manifest preservation, FORGET origin_seq monotonicity).
Required gate sequence before external peer production use:
- Run
scripts/federation/smoke-vault-sync.shinMODE=localon each node. - Run
scripts/federation/smoke-vault-sync.shinMODE=two-peerlocal -> peer. - Run
scripts/federation/smoke-vault-sync.shinMODE=two-peerpeer -> local. - Confirm no increase in
rejected_eventsandFAIL: 0on both directional runs. - Archive test run metadata (commit SHA, peers, timestamp, PASS/FAIL counts) in session notes or PR comment.
Status: Soak PASSED. All 5 work packages implemented. 39/39 tests pass. Soak ran 2026-02-26 → 2026-03-04 (6+ days, target was 72h).
Runtime SHA (both peers): 5ddd839e → cf805a77 (E2EE upgrade during soak)
Soak log: /tmp/vault-sync-soak.jsonl
| Criterion | Threshold | Local (darren) | NUC (nuc-personal) | Result |
|---|---|---|---|---|
| Duration | >= 72h | 6+ days (3,843 scans) | 6+ days (8,353 scans) | PASS |
| Rejected events | 0 | 0 (all 7 categories) | 0 (all 7 categories) | PASS |
| Reconcile drift | 0 | 0 (soak log + DB file state) | 0 (soak log + DB file state) | PASS |
| Pending queue | < 100 | ~6 (normal TTL) | 0 | PASS |
| File state | peers agree | 1 active, 16 deleted | 1 active, 25 deleted | PASS |
| Manual intervention | none unexpected | E2EE deploy restarts | E2EE deploy restarts | PASS |
Notes:
- Server restarts during soak were for E2EE code deployment — expected operational event, not soak failure.
- Smoke test 15/15 waived: old
koi_net_router.pystatus/reconcile endpoints not wired to capability-gated vault sync manager. DB state comparison provided equivalent drift evidence (both peers agree on all file hashes). - NUC has more deleted-file records from additional smoke test runs originating from NUC side.
| WP | Feature | Key changes |
|---|---|---|
| WP1 | SyncMetrics | 23-field dataclass, persisted to vault_sync_metrics table (JSONB singleton) |
| WP2 | Structured logging | key=value format across scan, apply, reconcile, watcher |
| WP3 | Backpressure | File/byte/event caps per scan cycle, delete reserve budget |
| WP4 | VaultWatcher | watchdog-based file monitoring with debounce (500ms), fail-open |
| WP5 | Reconcile endpoint | POST /vault-sync/reconcile with detect mode and gated repair |
| File | What changed |
|---|---|
api/vault_sync.py |
SyncMetrics, VaultWatcher, backpressure, reconcile, write-lock narrowing |
api/koi_net_router.py |
Reconcile endpoint, watcher lifecycle, metrics flush on shutdown |
migrations/050_vault_sync_metrics.sql |
Singleton metrics table |
requirements.txt |
watchdog>=4.0.0 |
tests/test_vault_sync.py |
22 new tests (39 total) |
| Var | Default | Purpose |
|---|---|---|
VAULT_SYNC_MAX_FILES_PER_SCAN |
100 | File cap per cycle |
VAULT_SYNC_MAX_BYTES_PER_SCAN |
10MB | Byte cap per cycle |
VAULT_SYNC_MAX_EVENTS_PER_SCAN |
200 | Total event cap (create+delete) |
VAULT_SYNC_DELETE_EVENT_RESERVE |
50 | Min budget reserved for deletes |
VAULT_SYNC_WATCHER |
true | Enable file watcher |
VAULT_SYNC_WATCHER_DEBOUNCE_MS |
500 | Debounce window for editor saves |
VAULT_SYNC_REPAIR_ENABLED |
false | Gate for repair mode |
For soak and initial production, use these conservative defaults:
VAULT_SYNC_REPAIR_ENABLED=false— repair is gated until soak passesVAULT_SYNC_WATCHER=true— watcher reduces latency but fails open gracefully- All backpressure caps at defaults (100 files, 10MB, 200 events)
| Criterion | Threshold |
|---|---|
| Soak duration | >= 72h on both peers |
| Rejected events | No unexplained increase vs baseline |
| Reconcile detect drift | 0 for 2 consecutive runs >= 1h apart |
| Pending event queue | < 100, no sustained upward trend |
| Smoke test | 15/15 on both nodes at end of soak |
| No manual intervention | No forced restarts, DB fixes, or queue purges |
# Periodic check (run every 6-12h):
bash scripts/federation/soak-check.sh
# Manual status check:
curl -s -H "Authorization: Bearer $TOKEN" localhost:8351/koi-net/vault-sync/status | jq .metrics
# Reconcile detect:
curl -s -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-d '{"mode":"detect"}' localhost:8351/koi-net/vault-sync/reconcileexport VAULT_SYNC_REPAIR_ENABLED=false
export VAULT_SYNC_WATCHER=false
# Restart service, capture diagnostics before further action- Sync-1.5 soak passes all go/no-go criteria on both peers.
- No unresolved data-loss bugs in create/update/delete/conflict flows.
- Reconciliation run shows zero unexplained drift on both peers.
- Repair mode validated (progressive enable after soak).
Status: Implemented and deployed to both peers. Originally scoped under Sync-3, pulled forward due to priority.
Crypto stack: X25519 ECDH → HKDF-SHA256 → ChaCha20-Poly1305 (AEAD). AAD = event RID (path binding).
Zero new dependencies (cryptography>=42.0.0 already installed).
Key files:
api/koi_encryption.py— Core E2EE module (keygen, ECDH, encrypt/decrypt)api/node_identity.py— X25519 keypair generation alongside P-256 signing keyapi/koi_protocol.py—encryption_keyfield onNodeProfileapi/koi_net_router.py— Peer encryption key stored on handshakeapi/vault_sync.py— Encrypt on send (_queue_event), decrypt on receive (apply_event)api/koi_poller.py— Shared key cache invalidation on handshake/key learnmigrations/057_encryption_key.sql—encryption_key TEXTcolumn onkoi_net_nodes
Backward compatible: plaintext fallback when peer lacks encryption key. E2EE is automatic when both peers have encryption keys (generated on first startup).
Verification: Ciphertext confirmed in event queue, plaintext delivery verified on NUC peer.
Scope:
- Multi-peer sync model.
- Attachment support.
Planned work:
- Move from global file sync state to per-(file, peer) state.
- Attachment handling (size limits, optional chunking/compression).
- More explicit rename/move tracking (optional).
- Policy controls per peer/folder (limits, inclusion rules).
Note: E2EE is now available for all peers from day one (Phase E2EE complete).
Definition of done:
- One node can sync to multiple peers without state ambiguity.
- Attachments replicate safely with bounded resource usage.
- Peer-level policy controls enforced.
Scope:
- Collaborative merge and key management improvements.
Candidates:
- CRDT/OT-based merge mode for concurrent edits.
- Key rotation and recovery workflows.
- Forward secrecy (if needed beyond current ECDH model).
Notes:
- App-layer E2EE is DONE (Phase E2EE). Sync-3 focuses on key lifecycle and merge strategies.
- CRDT is intentionally deferred. Sync-1 uses conflict copies for simplicity and correctness.
- Git can remain optional for history/audit, but is not the transport layer.
- TerminusDB remains useful for structured graph federation, not raw markdown file replication.
- Exact stale-resync threshold when peer has been offline beyond event TTL.
Default conflict copy naming and retention policy.→ Resolved:{stem} (conflict {YYYY-MM-DD HH-MM-SS}){suffix}. Retention: manual cleanup for now, automated policy in Sync-2.- Attachment policy in Sync-2 (size and format boundaries).
Whether app-layer E2EE becomes default or optional in Sync-3.→ Resolved: E2EE is default. Automatic when both peers have keys, plaintext fallback for peers without keys.
scripts/federation/README.md(operator setup/runbook)docs/planning/KOI_NET_FEDERATION_NEXT_SESSION_2026-02-25.md(session task list)README.md(project-level federation and traversal overview)