Version: v0.2 (plan checkpoint; incorporates locked decisions from discussion)
Scope: local-only, workspace-scoped coordination substrate for AI coding agents
Primary deliverables: SQLite schema + event log, Bun hub daemon (HTTP+WS), stateless CLI, TypeScript SDK, plugin isolation, minimal UI
Out of scope (v1): multi-machine sync, accounts/permissions, Zulip-style unread/reactions/emoji, rich renderer, internet-facing service
- Read Part 0: Executive Blueprint end-to-end—that's the contract
- Treat Section 0.14: ADR Expansions as locked unless explicitly revised
- Implement Phases 0 → 4 in order; use Quality Gates as PR merge requirements
- Track work via Part X: Master TODO Inventory—the execution board
Document note: code and SQL are "shape-accurate" specs, not copy/paste final implementations. Where it matters, query semantics and invariants are exact.
You're building a local-first, durable coordination hub for AI agents inside a workspace. The core promise is a shared local truth that is:
- Durable: state survives crashes/restarts (SQLite WAL)
- Observable: monotonic event stream with replay (`event_id`)
- Addressable: `channel_id` / `topic_id` / `message_id`
- Extensible: isolated TypeScript plugins for enrichment + extraction
- Offline/private: localhost-bound, no internet dependency
The "Zulip-inspired" piece is the channel/topic mental model, with one decisive structural commitment:
Topics are first-class entities with stable IDs. Messages reference `topic_id`.
Additionally (locked from day 1):
- Messages support edits (explicit events with optimistic concurrency)
- "Delete" is a tombstone mutation (rows are never removed)
- No hard deletes ever for `messages` (events are immutable/append-only)
Success looks like: Multiple agents and a human can tail a topic, post, retopic (same-channel only), edit, tombstone-delete, and rely on replay after disconnects—without data loss or divergence.
Stop-ship invariants. If any is violated, the system is untrusted.
The system provides idempotency at multiple layers:
A. Attachment insertion (strong idempotency):
- Same `(topic_id, kind, key, dedupe_key)` inserted twice → second insert returns existing attachment, no new event
- Guaranteed by unique index; safe to retry
B. Message deletion (tombstone; idempotent on retry):
- Delete already-deleted message → 200 OK, no state change, no new event
- Safe to retry; outcome stable
C. Retopic to current topic (idempotent success):
- Retopic message to its current topic → 200 OK, no state change, no new events
- Safe to retry; outcome stable
D. Message creation (NOT idempotent):
- Same content sent twice → two distinct messages created
- v1: no deduplication; client must track sent message IDs to avoid duplicates
- Future: support `client_request_id` for server-side deduplication
E. Message edit (NOT idempotent):
- Edit to same content → still creates new event and increments version
- Rationale: edit is a user action; event log preserves action history regardless of content change
- Client should avoid retrying edits unnecessarily
F. WS event delivery (at-least-once):
- Same event may be delivered multiple times (reconnect, replay)
- Client deduplicates by `event_id` (effectively idempotent)
G. Plugin execution (conditional idempotency):
- Enrichments: no built-in deduplication (rely on staleness guard)
- Attachments: `dedupe_key` ensures idempotency
- Multiple runs on same message may produce duplicate enrichments; future re-enrichment must handle this
H. Schema migration (forward-only):
- Re-running same migration may fail or succeed depending on DDL (use `IF NOT EXISTS` for idempotency)
- Rollback requires restore from backup
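For layer A, a minimal hub-side sketch of insert-or-return-existing, assuming `bun:sqlite` (and a Bun version where `run()` reports `changes`); table and column names follow the schema described in this plan:

```ts
import { Database } from "bun:sqlite";

const db = new Database(".agentlip/db.sqlite3");

// Idempotent attachment insert: the unique index on
// (topic_id, kind, key, dedupe_key) makes retries safe. A duplicate
// insert changes nothing and we return the existing row; the caller
// emits topic.attachment_added only when inserted === true.
function insertAttachment(a: {
  id: string; topicId: string; kind: string;
  key: string | null; valueJson: string; dedupeKey: string;
}): { attachmentId: string; inserted: boolean } {
  const res = db.query(
    `INSERT INTO topic_attachments (id, topic_id, kind, key, value_json, dedupe_key)
     VALUES (?, ?, ?, ?, ?, ?)
     ON CONFLICT DO NOTHING`
  ).run(a.id, a.topicId, a.kind, a.key, a.valueJson, a.dedupeKey);
  if (res.changes === 1) return { attachmentId: a.id, inserted: true };
  const existing = db.query(
    `SELECT id FROM topic_attachments
     WHERE topic_id = ? AND kind = ? AND key IS ? AND dedupe_key = ?`
  ).get(a.topicId, a.kind, a.key, a.dedupeKey) as { id: string };
  return { attachmentId: existing.id, inserted: false };
}
```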
- Single-writer: only the hub writes to `.agentlip/db.sqlite3`.
- Atomic mutation + event: every mutation commits its state change and corresponding `events` row(s) in the same SQLite transaction.
- Monotonic event stream: `events.event_id` is strictly increasing and defines total order of mutations and derived outputs.
- At-least-once delivery over WS; clients dedupe by `event_id`.
- Ordering: for any `message_id`, `message.created` commits before any derived events sourced from that message.
- Stateless reads: CLI can query `.agentlip/db.sqlite3` read-only without hub participation.
- No hard deletes: `messages` rows are never deleted. "Delete" is a tombstone mutation.
- Explicit edit/delete events: edits and tombstone deletes emit durable events (`message.edited`, `message.deleted`).
- Optimistic concurrency for content mutations: edit and delete support `expected_version`; mismatch ⇒ conflict response, no state change, no events.
- Message version discipline: any successful mutation (edit, delete, retopic) increments `messages.version` by 1. Rationale: version tracks mutation history for conflict detection, even for non-content changes like retopic.
- Derived staleness protection: derived jobs must not publish results derived from stale content. When persisting outputs, verify the message's current `content_raw` still matches what was processed (don't gate on `version`, since `move_topic` also bumps it).
- Privacy implication: immutable event log means old message content (before edits) may persist in `message.edited` event payloads. Tombstone deletes do not erase; "deleted" content remains in DB and historical events. This is by design for audit/replay but precludes secure erasure.
- Local-only bind: hub binds to `127.0.0.1` (and optionally `::1`), never `0.0.0.0`.
- Auth token required for mutations and WS connections (cryptographically random token ≥128-bit stored in `server.json` with mode 0600).
- Plugins are isolated (Worker or subprocess). They cannot block ingestion; failures are contained. Plugins must not have write access to `.agentlip/db.sqlite3` or `server.json`.
- Input validation: all endpoints validate and sanitize inputs; reject oversized payloads (message content, attachment metadata, etc.).
- Rate limiting: per-connection and global rate limits prevent DoS (configurable, sensible defaults).
- No secrets in logs: structured logs never include auth tokens, full message content, or other sensitive data.
- Stale server discovery is safe: `server.json` is advisory; `/health` validation is authoritative.
- Backpressure enforced: slow WS clients are disconnected; reconnection + replay is the recovery path.
- Connection limits: max concurrent WS connections enforced to prevent resource exhaustion.
- Migrations are forward-only and must include a rollback story (backup/snapshot + recompute derived tables).
In scope (v1):
- Malicious or buggy plugins (sandboxing, timeouts, resource limits)
- Accidental exposure of auth token (file permissions, log redaction)
- Local DoS via API abuse (rate limits, size limits, connection limits)
- Path traversal during workspace discovery
- SQL injection via user inputs
- Sensitive data leakage in logs or error messages
- Untrusted workspace config (`agentlip.config.ts` executes code)
Out of scope (v1 assumes localhost is trusted):
- Network-level attacks (no TLS; localhost-only)
- Multi-user/multi-tenant isolation (single workspace owner)
- Secure deletion/erasure of message history (tombstones do not erase; events are immutable)
- Supply-chain attacks on npm dependencies (assumed trusted; mitigation: use lockfiles, periodic `npm audit`, consider SRI for plugins in future)
- Workspace config boundary: `agentlip.config.ts` is code execution; only load from trusted workspace root (never traverse upward through untrusted directories).
- Plugin boundary:
  - Plugins run isolated (Worker/subprocess) with no write access to `.agentlip/` directory
  - v1: plugins CAN access network and filesystem (Worker limitations); document this risk
  - v2+: explicit capability grants (network/filesystem/environment)
  - Plugins receive read-only message data; cannot directly mutate DB
  - Plugin outputs (enrichments/attachments) validated before insertion
- Client boundary: CLI/SDK/UI are trusted (same user); auth token in `server.json` is shared secret.
- Data boundary: event log is durable and immutable; "deleted" messages remain in history (tombstoned); UI/clients must respect tombstone semantics.
- Hub binds `127.0.0.1` only (not `0.0.0.0`)
- `server.json` mode 0600
- Rate limits: 100 req/s per connection, 1000 req/s global (configurable)
- Max WS connections: 100 (configurable)
- Max message size: 64KB
- Max attachment metadata: 16KB
- Max WS message: 256KB
- Max event replay batch: 1000 events
- Plugin timeout: 5s (default)
- Plugin memory limit: 128MB (if enforceable)
- Prepared statements for all SQL queries
- Error responses: generic messages (detailed errors in server logs only)
Build a minimal, stable kernel that:
- persists canonical conversation state (channels/topics/messages)
- persists structured grounding (topic attachments)
- exposes a replayable change feed (events)
- is ergonomic for agents (CLI JSONL + SDK async iterator)
- supports deterministic server-side enrichment via isolated plugins
- supports message edit + tombstone delete from day 1 with explicit events and optimistic concurrency
- Multi-machine sync or LAN collaboration
- Users/accounts/permissions
- Zulip-style "unread" model, typing indicators, reactions
- Complex search language (support basic filtering + optional FTS5)
- Full markdown/HTML rendering engine
- Secure erasure / "history wipe" semantics (tombstones do not remove past events)
Strict dependency direction; keep the core small.
- SQLite schema (`schema_v1.sql` + optional `schema_v1_fts.sql`)
- DB invariants + indexes for tail/pagination + event replay
- Versioning fields (`meta`, `schema_version`, `db_id`)
- Bun daemon
- HTTP API (`/api/v1/...`)
- WebSocket feed (`/ws`) with replay
- Derived pipelines (enrichment + attachment extraction) async
- Lock + lifecycle (`server.json`, `writer.lock`)
- Stateless CLI:
- reads DB directly (queries)
- writes via hub (mutations)
- listens via WS (JSONL)
- TypeScript SDK (`@agentlip/client`)
- Minimal UI consuming same APIs
- Plugin system (isolated runtime)
Dependency rule: clients/plugins depend on protocol types; hub depends on protocol + kernel schema; kernel depends on nothing.
.agentlip/
db.sqlite3
server.json
config.json # optional generated snapshot
logs/
locks/
writer.lock
agentlip.config.ts # workspace config (plugins, limits)
packages/
protocol/ # protocol_v1.ts (single source of truth)
client/ # @agentlip/client
cli/ # agentlip
hub/ # agentlipd (Bun server)
ui/ # minimal UI assets
plugins/ # built-in plugins (url extractor, etc.)
migrations/
0001_schema_v1.sql
0001_schema_v1_fts.sql
docs/
plan.md
protocol.md
ops.md
- `channels.id`, `topics.id`, `messages.id` are stable identifiers.
- `topics` are unique by `(channel_id, title)` (human-addressability).
- Messages reference `topic_id`. Topics are first-class.
- `messages` rows are never deleted (tombstone-only).
- `messages.version` starts at 1 and increments on edit/delete/move_topic.
- Tombstone delete sets `deleted_at`, `deleted_by`, and replaces `content_raw` with a canonical tombstone string (e.g. `"[deleted]"`).
- Every mutation inserts exactly one "primary" event row (plus optional derived events).
- `event_id` strictly increases; replay is by `event_id`.
- `events` rows are immutable and append-only (no update/delete).
- `events.scope_*` columns are populated so replay queries are index-backed and correct.
- Retopic updates `messages.topic_id` (not `messages.channel_id`) and emits `message.moved_topic`.
- Fanout correctness:
- deliver to old topic subscribers
- deliver to new topic subscribers
- deliver to channel subscribers
- Derived data (enrichments, auto attachments) is recomputable and must not be required for correctness of ingestion.
- Derived jobs must not publish stale outputs if message content changed mid-flight.
Churn magnets: lock these early.
- Topics are entities with stable IDs (locked).
- Events are the integration surface (WS + replay; additive evolution) (locked).
- Single-writer hub + stateless readers (locked).
- Replay boundary contract: `replay_until` handshake semantics (locked).
- Cross-channel retopic: forbidden in v1 (locked).
- Message mutability model: edits are explicit events with optimistic concurrency; deletes are tombstones; no hard deletes ever (locked).
- Version semantics: `messages.version` increments on edit/delete/move_topic; conflicts enforced when `expected_version` provided (locked).
- Attachment idempotency: `topic_attachments.dedupe_key` + unique index; hub computes if absent; emit event only on new insert (locked).
- Plugin isolation mechanism: Bun Worker by default; subprocess reserved for later (locked).
- FTS optionality: separate schema applied opportunistically; fallback behavior explicit (locked).
Expanded in Section 0.14: ADR Expansions.
- Schema initializes cleanly in empty workspace
- Optional FTS schema applies if supported; failure non-fatal and detectable
- Every mutation endpoint commits state + event in same SQLite transaction
- Verify with failure injection: no state change without corresponding event row(s)
Given subscription set S and last processed event_id = k:
- Replay query returns exactly events matching `S` with `event_id > k` (ascending order)
- Streaming thereafter produces no gaps (duplicates allowed; client dedupes)
When moving message from topic A → B:
- Subscribers to topic A, topic B, and parent channel all receive event
- Event includes old/new topic IDs and mode
- Cross-channel moves rejected (no events, DB unchanged)
- Plugin hangs bounded by timeout; hub continues ingesting messages
- Plugin failures logged; may emit internal error events; do not crash hub
- CLI `--json`/`--jsonl` output is versioned and additive-only
- SDK reconnects indefinitely, making forward progress using stored `event_id`
If expected_version provided and mismatched:
- Return conflict response
- No DB change
- No new events
After successful delete:
- Message row still exists
- `deleted_at != NULL`, `deleted_by` non-empty
- `content_raw` is tombstoned
- `message.deleted` emitted exactly once
If message edited or deleted while enrichment/extraction job running:
- Job must not commit stale derived rows
- Job must not emit derived events for old content
- Auth token ≥128-bit cryptographically random, stored with mode 0600
- Hub binds localhost only (rejects `0.0.0.0` by default)
- All SQL uses prepared statements
- Rate limits enforced (per-connection and global)
- Input size limits enforced (message ≤64KB, attachment ≤16KB, WS ≤256KB)
- Logs never contain auth tokens or full message content
- Plugin isolation: no write access to `.agentlip/` directory
- Workspace config loaded only from discovered workspace root
All API errors return a consistent shape:
{
"error": "human-readable message",
"code": "MACHINE_READABLE_CODE",
"details": {} // optional context
}

Standard error codes:
| Code | HTTP | Meaning | Example |
|---|---|---|---|
| `INVALID_INPUT` | 400 | Validation failed | Missing required field, invalid format |
| `PAYLOAD_TOO_LARGE` | 400 | Size limit exceeded | Message >64KB |
| `NOT_FOUND` | 404 | Entity doesn't exist | Topic/message/channel not found |
| `VERSION_CONFLICT` | 409 | Optimistic lock failed | `expected_version` mismatch; includes `current_version` |
| `CROSS_CHANNEL_MOVE` | 400 | Invalid retopic | Target topic in different channel |
| `UNAUTHORIZED` | 401 | Auth failed | Missing/invalid token |
| `RATE_LIMITED` | 429 | Too many requests | Exceeded per-connection or global limit |
| `SERVICE_UNAVAILABLE` | 503 | Temporary failure | DB lock contention, shutdown in progress |
| `INTERNAL_ERROR` | 500 | Unexpected server error | Log correlation ID for debugging |
Conflict response example (version mismatch):
{
"error": "version conflict",
"code": "VERSION_CONFLICT",
"details": {
"expected": 2,
"current": 4,
"message_id": "msg_456"
}
}

Rate limit response example:
{
"error": "rate limit exceeded",
"code": "RATE_LIMITED",
"details": {
"limit": 100,
"window": "1s",
"retry_after": 0.5
}
}

Global flags:
- `--workspace <path>` - explicit workspace (otherwise auto-discover from cwd)
- `--json` - machine-readable JSON output
- `--jsonl` - newline-delimited JSON (for streaming)
Read-only queries (direct DB access, no hub required):
agentlip channel list [--json]
- Output: table or JSON array of channels
- Example JSON:
[{"id": "ch_123", "name": "general", "description": null, "created_at": "2026-02-04T20:00:00Z"}]
agentlip topic list --channel <name|id> [--json]
- Output: topics in channel, sorted by updated_at DESC
- Example:
agentlip topic list --channel general --json
agentlip msg tail --topic-id <id> [--limit 50] [--json]
- Output: latest N messages in topic (newest first)
- Example JSON:
[{"id": "msg_456", "sender": "agent-1", "content_raw": "Hello", "version": 1, "created_at": "...", "edited_at": null, "deleted_at": null}]
agentlip msg page --topic-id <id> [--before-id <id>] [--after-id <id>] [--limit 50] [--json]
- Bidirectional pagination
- Example:
agentlip msg page --topic-id topic_xyz --before-id msg_100 --limit 20
agentlip search <query> [--channel <name>] [--topic-id <id>] [--limit 100] [--json]
- Basic search (LIKE-based); uses FTS5 if available (faster, better ranking)
- Query syntax:
  - FTS available: `"exact phrase"`, `word1 word2` (AND), `word1 OR word2`
  - FTS unavailable: simple substring match (`WHERE content_raw LIKE '%query%'`)
- Example: `agentlip search "error message" --channel general --limit 10`
- Example phrase: `agentlip search '"connection refused"' --json`
- Response includes `fts_used: boolean` field indicating search method used
agentlip attachment list --topic-id <id> [--kind <kind>] [--json]
- List attachments for a topic
- Example:
agentlip attachment list --topic-id topic_xyz --kind url --json
Mutations (require running hub):
agentlip msg send --topic-id <id> --sender <name> [--content <text>] [--stdin]
- Send message (content from arg or stdin)
- Example: `echo "Hello world" | agentlip msg send --topic-id topic_xyz --sender agent-1 --stdin`
- Response: `{"message_id": "msg_789", "event_id": 42}`
agentlip msg edit <message_id> --content <text> [--expected-version <n>]
- Edit message content with optional optimistic lock
- Example: `agentlip msg edit msg_456 --content "Updated text" --expected-version 2`
- On conflict: exit code 2, stderr: `Error: version conflict (current: 4)`
agentlip msg delete <message_id> --actor <name> [--expected-version <n>]
- Tombstone delete
- Example: `agentlip msg delete msg_456 --actor agent-1`
- Response: `{"deleted": true, "event_id": 43}`
agentlip msg retopic <message_id> --to-topic-id <id> --mode <one|later|all> [--force]
- Move message(s) to different topic (same channel only)
- `--force` required for mode=all (safety guardrail)
- Example: `agentlip msg retopic msg_100 --to-topic-id topic_new --mode later`
- Example all: `agentlip msg retopic msg_50 --to-topic-id topic_archive --mode all --force`
- Error on cross-channel: exit code 1, stderr: `Error: cross-channel move forbidden`
agentlip topic rename <topic_id> --title <new_title>
- Rename topic
- Example: `agentlip topic rename topic_xyz --title "New Title"`
agentlip attachment add --topic-id <id> --kind <kind> --value-json <json> [--key <key>] [--source-message-id <id>] [--dedupe-key <key>]
- Add attachment (manual or scripted)
- Example: `agentlip attachment add --topic-id topic_xyz --kind url --value-json '{"url":"https://example.com","title":"Example"}' --source-message-id msg_123`
- Response on new: `{"attachment_id": "att_999", "event_id": 44}`
- Response on dedupe: `{"attachment_id": "att_888", "event_id": null, "deduplicated": true}`
Listening (WebSocket stream):
agentlip listen [--since <event_id>] [--channel <name|id>...] [--topic-id <id>...] [--format jsonl]
- Stream events to stdout
- Defaults: since=0 (all history), no filters (all events), format=jsonl
- Example: `agentlip listen --since 42 --channel general --format jsonl`
- Output: one JSON envelope per line
- Reconnects automatically on disconnect; resumes from last seen event_id
- Exit: Ctrl+C or SIGTERM
Daemon control:
agentlipd up [--port <port>] [--host 127.0.0.1] [--config <path>]
- Start hub daemon
- Defaults: port from `server.json` or random, host=127.0.0.1
- Writes `server.json` with token + instance_id
- Example: `agentlipd up --port 8080`
agentlipd down
- Graceful shutdown (finds hub via server.json, sends SIGTERM)
agentlipd status
- Check hub health and print info
- Output:
{"status": "running", "instance_id": "...", "db_id": "...", "schema_version": 1, "port": 8080}
agentlip init [--workspace <path>]
- Initialize workspace (create `.agentlip/` and schema)
- Example: `agentlip init` (in repo root)
agentlip doctor
- Run diagnostics (DB integrity, schema version, server health, etc.)
Exit codes:
- `0` - success
- `1` - general error (invalid input, not found, etc.)
- `2` - conflict (version mismatch)
- `3` - hub not running / connection failed
- `4` - authentication failed
Authentication: All mutation endpoints and WS require Authorization: Bearer <token> header. Token from server.json.
Common request headers:
- `Authorization: Bearer <token>` - required for mutations and WS
- `Content-Type: application/json` - for POST/PATCH with body
- `X-Request-ID: <uuid>` - optional; echoed in response for correlation
Common response headers:
- `X-Request-ID: <uuid>` - echoed from request, or server-generated
- `X-RateLimit-Limit: <n>` - requests allowed per window
- `X-RateLimit-Remaining: <n>` - requests remaining in current window
- `X-RateLimit-Reset: <timestamp>` - ISO8601 when limit resets
- `X-Instance-ID: <id>` - hub instance ID (for debugging multi-hub issues)
Common response codes:
- `200 OK` - success
- `400 Bad Request` - invalid input (body includes `{error: string, code: string}`)
- `401 Unauthorized` - missing/invalid auth token
- `404 Not Found` - entity not found
- `409 Conflict` - optimistic concurrency failure (includes `current_version`)
- `429 Too Many Requests` - rate limit exceeded
- `503 Service Unavailable` - DB lock contention or temporary failure
Endpoints:
GET /health
- No auth required
- Response: `{instance_id: string, db_id: string, schema_version: number, protocol_version: string}`
- Example: `{"instance_id": "abc123", "db_id": "def456", "schema_version": 1, "protocol_version": "v1"}`
GET /api/v1/channels
- Response:
{channels: [{id: string, name: string, description: string|null, created_at: string}]}
POST /api/v1/channels
- Request: `{name: string, description?: string}`
- Response: `{channel: {id: string, name: string, ...}, event_id: number}`
GET /api/v1/channels/:channel_id/topics
- Query params: `?limit=50&before_id=...` (pagination)
- Response: `{topics: [{id: string, channel_id: string, title: string, created_at: string, updated_at: string}]}`
POST /api/v1/topics
- Request: `{channel_id: string, title: string}`
- Response: `{topic: {id: string, ...}, event_id: number}`
PATCH /api/v1/topics/:topic_id
- Request: `{title: string}`
- Response: `{topic: {id: string, title: string, ...}, event_id: number}`
GET /api/v1/messages
- Query params: `?channel_id=...&topic_id=...&limit=50&before_id=...&after_id=...`
- At least one of `channel_id` or `topic_id` required
- Pagination: use `before_id` (older messages) or `after_id` (newer messages)
- Response: `{messages: [{id: string, topic_id: string, channel_id: string, sender: string, content_raw: string, version: number, created_at: string, edited_at: string|null, deleted_at: string|null, deleted_by: string|null}], has_more: boolean, cursor?: string}`
- Example: `GET /api/v1/messages?topic_id=topic_xyz&limit=20&before_id=msg_500`
- Returns up to 20 messages older than msg_500, newest first
- `has_more: true` if more messages available in requested direction
POST /api/v1/messages
- Request: `{topic_id: string, sender: string, content_raw: string}`
- Response: `{message: {id: string, version: 1, ...}, event_id: number}`
- Example request: `{"topic_id": "topic_abc", "sender": "agent-1", "content_raw": "Hello world"}`
- Validation: `content_raw` max 64KB; `sender` required non-empty string
PATCH /api/v1/messages/:message_id
- Operations via `op` field:
Edit operation:
{
"op": "edit",
"content_raw": "Updated content",
"expected_version": 2
}

Response on success: `{message: {..., version: 3, edited_at: "..."}, event_id: number}`

Response on conflict: `409 {"error": "version conflict", "code": "VERSION_CONFLICT", "current_version": 4}`
Delete operation (tombstone):
{
"op": "delete",
"actor": "agent-1",
"expected_version": 2
}

Response: `{message: {..., deleted_at: "...", deleted_by: "agent-1", version: 3}, event_id: number}`
Move topic operation:
{
"op": "move_topic",
"to_topic_id": "new_topic_xyz",
"mode": "one"|"later"|"all",
"expected_version": 2
}

Response: `{affected_count: number, event_ids: number[]}`

Error if cross-channel: `400 {"error": "cross-channel move forbidden", "code": "CROSS_CHANNEL_MOVE"}`
GET /api/v1/topics/:topic_id/attachments
- Response:
{attachments: [{id: string, topic_id: string, kind: string, key: string|null, value_json: object, dedupe_key: string, source_message_id: string|null, created_at: string}]}
POST /api/v1/topics/:topic_id/attachments
- Request: `{kind: string, key?: string, value_json: object, dedupe_key?: string, source_message_id?: string}`
- Response on new insert: `{attachment: {...}, event_id: number}`
- Response on dedupe: `{attachment: {...}, event_id: null}` (no new event)
- Example: `{"kind": "url", "value_json": {"url": "https://example.com", "title": "Example"}, "source_message_id": "msg_123"}`
- Validation: `value_json` max 16KB serialized
GET /api/v1/events?after=&limit= (optional fallback for non-WS clients)
- Query params: `after` (event_id), `limit` (default 100, max 1000)
- Response: `{events: [{event_id: number, ts: string, name: string, data_json: object}]}`
Connection: ws://localhost:<port>/ws?token=<auth_token>
Message format: All messages are JSON objects with a type field.
Handshake sequence:
- Client connects and sends `hello`:
{
"type": "hello",
"after_event_id": 42,
"subscriptions": {
"channels": ["channel_abc"],
"topics": ["topic_xyz", "topic_123"]
}
}

- `after_event_id`: last event processed by client (0 for fresh start)
- `subscriptions`: channels and/or topics to follow (omit field or pass empty array for none)
- Server responds with `hello_ok`:
{
"type": "hello_ok",
"replay_until": 100,
"instance_id": "abc123"
}

- `replay_until`: server's `latest_event_id` at handshake time; defines replay boundary
- Server sends replay events (if any):
{
"type": "event",
"event_id": 43,
"ts": "2026-02-04T23:30:00.000Z",
"name": "message.created",
"scope": {
"channel_id": "channel_abc",
"topic_id": "topic_xyz"
},
"data": {
"message": {
"id": "msg_456",
"topic_id": "topic_xyz",
"channel_id": "channel_abc",
"sender": "agent-2",
"content_raw": "Hello",
"version": 1,
"created_at": "2026-02-04T23:30:00.000Z"
}
}
- After replay completes (all events `<= replay_until` sent), server streams live events (`> replay_until`)
Event envelope structure:
{
type: "event",
event_id: number, // strictly increasing, unique
ts: string, // ISO8601 timestamp
name: string, // event type (see event catalog below)
scope: { // routing metadata
channel_id?: string,
topic_id?: string, // primary topic
topic_id2?: string // secondary topic (for moves)
},
data: object // event-specific payload
}

Event catalog (v1):
- `channel.created` - data: `{channel: {...}}`
- `topic.created` - data: `{topic: {...}}`
- `topic.renamed` - data: `{topic_id: string, old_title: string, new_title: string}`
- `message.created` - data: `{message: {...}}`
- `message.edited` - data: `{message_id: string, old_content: string, new_content: string, version: number}`
- `message.deleted` - data: `{message_id: string, deleted_by: string, version: number}`
- `message.moved_topic` - data: `{message_id: string, old_topic_id: string, new_topic_id: string, channel_id: string, mode: string, version: number}`
- `message.enriched` - data: `{message_id: string, enrichments: [{kind: string, span: {start: number, end: number}, data: object}]}`
- `topic.attachment_added` - data: `{attachment: {...}}`
Client responsibilities:
- Deduplicate events by `event_id` (server guarantees at-least-once delivery); see the sketch below
- Store `latest_processed_event_id` durably for reconnection
- Handle backpressure disconnect gracefully (reconnect with last processed id)
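A minimal sketch of that client-side contract, with a hypothetical `store` abstraction standing in for whatever durable storage the client uses:

```ts
// At-least-once handling: drop anything at or below the durable cursor,
// advance the cursor only after the event has been applied locally.
interface Envelope { type: "event"; event_id: number; name: string; data: unknown }

class EventCursor {
  constructor(private store: { get(): number; set(id: number): void }) {}

  handle(ev: Envelope, apply: (ev: Envelope) => void): void {
    const last = this.store.get();     // latest_processed_event_id
    if (ev.event_id <= last) return;   // duplicate from replay or reconnect
    apply(ev);                         // apply to local state first
    this.store.set(ev.event_id);       // then persist the new cursor
  }
}
```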
Server backpressure policy:
- Each connection has bounded outbound queue (default: 1000 events)
- If queue fills, disconnect with close code 1008 (policy violation)
- Client should reconnect with `after_event_id`
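A minimal sketch of that policy, assuming a per-connection queue drained onto the socket (`MAX_QUEUE` and the `Conn` shape are illustrative):

```ts
// Bounded outbound queue: a slow consumer is disconnected with 1008
// rather than buffered unboundedly; recovery is reconnect + replay.
const MAX_QUEUE = 1000;

interface Conn {
  ws: { send(data: string): void; close(code: number, reason: string): void };
  queue: string[];
}

function enqueue(conn: Conn, envelope: object): void {
  if (conn.queue.length >= MAX_QUEUE) {
    conn.ws.close(1008, "backpressure limit exceeded");
    conn.queue.length = 0; // drop buffered events; client will replay
    return;
  }
  conn.queue.push(JSON.stringify(envelope));
}

function drain(conn: Conn): void {
  // Called when the socket signals it can accept more data.
  while (conn.queue.length > 0) conn.ws.send(conn.queue.shift()!);
}
```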
Connection limits:
- Max concurrent connections: 100 (configurable)
- Connection refused with HTTP 503 if limit reached
WebSocket close codes:
- `1000` (Normal Closure): graceful shutdown, client should not auto-reconnect
- `1001` (Going Away): server shutdown in progress, client should reconnect after delay
- `1008` (Policy Violation): backpressure limit exceeded, client should reconnect with last processed event_id
- `1011` (Internal Error): unexpected server error, client should reconnect with exponential backoff
- `4401` (Unauthorized): invalid auth token, client should not reconnect without re-authentication
Connection lifecycle example:
- Client connects: `ws://localhost:8080/ws?token=abc123...`
- Client sends `hello`:
{"type": "hello", "after_event_id": 42, "subscriptions": {"channels": ["general"]}}- Server validates token and subscriptions
- Server responds `hello_ok`:

{"type": "hello_ok", "replay_until": 100, "instance_id": "xyz789"}

- Server sends replay events (43..100)
- Server sends live events (>100) as they occur
- If backpressure: server closes with 1008, client reconnects from last processed event_id
- On shutdown: server sends close 1001, client waits 5s and reconnects
- On auth failure: server sends close 4401, client exits (requires manual intervention)
Client reconnection strategy (recommended):
let reconnectDelay = 1000; // start at 1s
const maxDelay = 30000; // cap at 30s
async function connect() {
try {
const ws = await connectWebSocket();
reconnectDelay = 1000; // reset on success
// ... handle messages
} catch (err) {
if (err.code === 4401) {
console.error('Auth failed, cannot reconnect');
process.exit(1);
}
// Exponential backoff
await sleep(reconnectDelay);
reconnectDelay = Math.min(reconnectDelay * 2, maxDelay);
connect(); // retry
}
}

Client reconnection edge cases:
- Reconnect loop during hub shutdown:
  - Hub sends close 1001 (Going Away) for graceful shutdown
  - Client should wait longer (e.g., 5-10s) before reconnecting (not immediate)
  - If hub doesn't come back after max retries (e.g., 5 attempts): exit or alert user
- Reconnect with stale after_event_id:
  - Client last processed event_id 100, but hub restarted with new DB (events start from 1)
  - Replay query returns no events (none match subscription + `event_id > 100`)
  - Client receives replay_until=50 (current max), waits indefinitely for events >100
  - Mitigation: if `replay_until < after_event_id`, client should reset to `after=0` or `after=replay_until` (fresh start)
- Reconnect during hub migration:
  - Hub offline for 5 minutes during schema migration
  - Client reconnects repeatedly, fails (connection refused)
  - After migration completes: client reconnects, new `instance_id`, resumes from last processed event_id
  - No special handling needed (transparent to client)
- Reconnect with invalid subscription (topic deleted):
  - Client subscribed to topic A, hub restarts, topic A deleted during downtime
  - Client reconnects with subscription to topic A (now invalid/non-existent)
  - Hub accepts subscription (no validation; topic may exist in future)
  - Replay returns no events for topic A (no matching scope_topic_id)
  - Client receives no errors; just no events for deleted topic
- Hub instance_id changed mid-connection (impossible but paranoid check):
  - Client connects, receives `instance_id=abc`
  - Hub restarts mid-connection (connection dropped, but hypothetically...)
  - In practice: connection drops, client reconnects, gets new instance_id
  - No special handling needed (connection drop forces reconnect)
- Multiple clients with same after_event_id:
  - Two clients both last processed event_id 100
  - Both reconnect simultaneously
  - Both receive replay 101-200 (current events)
  - No conflict; replay is idempotent, read-only
  - Hub may serve both from cache (if implemented)
- Client storage corruption (loses after_event_id):
  - Client loses durable state, doesn't know last processed event_id
  - Options:
    a. Reconnect with `after=0` (full replay from beginning)
    b. Reconnect with `after` set to the hub's current latest event_id (skip history, only new events)
  - v1: client decides policy (no hub-side guidance)
  - Future: hub could suggest "reasonable" replay window (e.g., last 1000 events)
server.json (generated by hub, mode 0600):
{
"instance_id": "abc123-def456",
"db_id": "workspace-unique-uuid",
"port": 8080,
"host": "127.0.0.1",
"auth_token": "64-char-hex-string",
"pid": 12345,
"started_at": "2026-02-04T20:00:00.000Z",
"protocol_version": "v1"
}

- Written on hub startup
- `auth_token`: cryptographically random ≥128-bit (e.g., `crypto.randomBytes(32).toString('hex')`)
- `db_id`: must match `meta.db_id` from database
- Clients read this to discover port and token
- Advisory only; `/health` validation is authoritative (see the discovery sketch below)
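A minimal sketch of the discovery flow this implies, assuming `Bun.file` and the `/health` shape above (error handling abbreviated):

```ts
interface ServerInfo {
  instance_id: string; db_id: string; port: number;
  host: string; auth_token: string; protocol_version: string;
}

// Read the advisory file, then let /health confirm the hub is alive
// and is actually the instance (and workspace DB) the file describes.
async function discoverHub(workspace: string): Promise<ServerInfo | null> {
  const file = Bun.file(`${workspace}/.agentlip/server.json`);
  if (!(await file.exists())) return null;            // hub never started
  const info = (await file.json()) as ServerInfo;
  try {
    const res = await fetch(`http://${info.host}:${info.port}/health`);
    if (!res.ok) return null;
    const health = await res.json();
    if (health.instance_id !== info.instance_id) return null; // stale file
    if (health.db_id !== info.db_id) return null;             // wrong workspace
    return info;
  } catch {
    return null;                                      // stale: nothing listening
  }
}
```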
agentlip.config.ts (workspace config, optional):
import type { WorkspaceConfig } from '@agentlip/hub';
const config: WorkspaceConfig = {
// Plugin configuration
plugins: [
{
name: 'url-extractor',
type: 'extractor',
enabled: true,
config: {
allowedDomains: ['example.com', 'github.com'], // optional allowlist
timeout: 5000 // ms
}
},
{
name: 'code-linkifier',
type: 'linkifier',
enabled: true,
module: './custom-plugins/code-links.ts',
config: {
repoRoot: process.env.REPO_ROOT
}
}
],
// Rate limiting
rateLimits: {
perConnection: 100, // requests per second
global: 1000
},
// Resource limits
limits: {
maxMessageSize: 65536, // 64KB
maxAttachmentSize: 16384, // 16KB
maxWsMessageSize: 262144, // 256KB
maxWsConnections: 100,
maxWsQueueSize: 1000,
maxEventReplayBatch: 1000
},
// Plugin execution
pluginDefaults: {
timeout: 5000, // ms
memoryLimit: 134217728 // 128MB (if enforceable)
}
};
export default config;

WorkspaceConfig TypeScript interface:
interface WorkspaceConfig {
plugins?: PluginConfig[];
rateLimits?: {
perConnection?: number;
global?: number;
};
limits?: {
maxMessageSize?: number;
maxAttachmentSize?: number;
maxWsMessageSize?: number;
maxWsConnections?: number;
maxWsQueueSize?: number;
maxEventReplayBatch?: number;
};
pluginDefaults?: {
timeout?: number;
memoryLimit?: number;
};
}
interface PluginConfig {
name: string;
type: 'linkifier' | 'extractor';
enabled: boolean;
module?: string; // path to custom plugin (default: built-in)
config?: Record<string, unknown>; // plugin-specific config
}

Plugin types:
- Linkifier (enrichment): analyzes message content, returns structured enrichments
- Extractor (attachment): analyzes message content, returns topic attachments
Plugin interface (Worker-based):
// Plugin implementation (user-provided or built-in)
export interface LinkifierPlugin {
name: string;
version: string;
// Called for each new/edited message
enrich(input: EnrichInput): Promise<Enrichment[]>;
}
export interface ExtractorPlugin {
name: string;
version: string;
// Called for each new/edited message
extract(input: ExtractInput): Promise<Attachment[]>;
}
// Input types
interface EnrichInput {
message: {
id: string;
content_raw: string;
sender: string;
topic_id: string;
channel_id: string;
created_at: string;
};
config: Record<string, unknown>; // from agentlip.config.ts
}
interface ExtractInput {
message: {
id: string;
content_raw: string;
sender: string;
topic_id: string;
channel_id: string;
created_at: string;
};
config: Record<string, unknown>;
}
// Output types
interface Enrichment {
kind: string; // e.g., 'url', 'code_ref', 'file_path'
span: {
start: number; // character offset
end: number;
};
data: Record<string, unknown>; // enrichment-specific structured data
}
interface Attachment {
kind: string; // e.g., 'url', 'file', 'image'
key?: string; // optional namespace
value_json: Record<string, unknown>;
dedupe_key?: string; // optional (hub will compute if absent)
}
// Example enrichment output
const exampleEnrichment: Enrichment = {
kind: 'url',
span: { start: 10, end: 30 },
data: {
url: 'https://example.com',
title: 'Example Domain',
resolved: true
}
};
// Example attachment output
const exampleAttachment: Attachment = {
kind: 'url',
value_json: {
url: 'https://github.com/owner/repo/issues/42',
title: 'Issue #42',
issue_number: 42,
repo: 'owner/repo'
},
dedupe_key: 'url:https://github.com/owner/repo/issues/42'
};

Plugin isolation contract:
- Plugins run in Bun Worker (separate thread, no shared memory)
- Timeout enforced (default 5s, configurable per plugin)
- If plugin throws or times out: log error, may emit internal error event, do not crash hub
- No write access to `.agentlip/` directory (read-only DB access via RPC if needed in future)
- v1 limitation: plugins CAN access network and filesystem (Worker limitations); documented risk
- Future: explicit capability grants
Plugin lifecycle:
- Hub loads plugins from `agentlip.config.ts` on startup
- For each new/edited message:
- Hub spawns Worker with plugin code
- Passes message + config via RPC
- Waits for result (with timeout)
- Validates output (size, schema)
- Staleness guard: verify message content unchanged before persisting
- Insert enrichments/attachments + emit events
- Close Worker
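For concreteness, a minimal extractor sketch against the `ExtractorPlugin` interface above (shown without imports; the regex and dedupe scheme are illustrative, not the built-in plugin's exact behavior):

```ts
// Minimal URL extractor: find http(s) URLs in message content and
// return them as topic attachments with a stable dedupe_key.
const URL_RE = /https?:\/\/[^\s)>\]]+/g;

export const urlExtractor: ExtractorPlugin = {
  name: "url-extractor",
  version: "1.0.0",
  async extract(input: ExtractInput): Promise<Attachment[]> {
    const urls = input.message.content_raw.match(URL_RE) ?? [];
    return [...new Set(urls)].map((url) => ({
      kind: "url",
      value_json: { url, source_message_id: input.message.id },
      dedupe_key: `url:${url}`, // same URL in same topic deduplicates
    }));
  },
};
```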
Staleness guard (critical for correctness): Before committing plugin outputs, hub must:
// Re-read current message state
const current = await db.get(
'SELECT content_raw, deleted_at FROM messages WHERE id = ?',
[messageId]
);
// Verify content unchanged and not deleted
if (current.content_raw !== originalContent || current.deleted_at !== null) {
// Discard plugin outputs; do not commit or emit events
return;
}
// Safe to commit (the re-check and the inserts should share one transaction)
await db.run('INSERT INTO enrichments ...');
await db.run('INSERT INTO events ...');

packages/protocol/protocol_v1.ts is the source of truth for:
- WS messages
- event envelope + payload types
- HTTP request/response shapes
- plugin interfaces
Protocol versioning and compatibility:
v1 protocol principles:
- Additive evolution only: new optional fields, new event types, new endpoints OK
- Breaking changes forbidden: removing fields, renaming fields, changing types, changing semantics require v2
- Client resilience: clients must ignore unknown event types and unknown fields (forward compatibility)
- Graceful degradation: older clients connecting to newer hub should continue working (within v1 protocol version)
Backward-compatible changes (safe within v1):
- Adding optional fields (HTTP request/response, WS message)
- Adding new event types (old clients ignore)
- Adding new endpoints (old clients unaffected)
- Adding new CLI commands (old scripts unaffected)
Breaking changes (require v2):
- Removing required fields
- Renaming fields
- Changing field types incompatibly (e.g., string→number)
- Changing event payload structure in non-additive way
- Removing endpoints
- Changing WS handshake protocol
- Changing authentication mechanism
Protocol negotiation:
- `GET /health` returns `protocol_version: "v1"`
- Clients check this before connecting
- Future: clients could request specific protocol version via header/query param
Deprecation process (v1 → v2 transition):
- Announce deprecation in v1 release (docs, logs)
- Add v2 endpoints alongside v1
- Mark v1 endpoints deprecated (header: `X-Deprecated: true`)
- Run both protocols in parallel during transition period
- Remove v1 in major version bump (provide migration guide)
Event catalog evolution:
- New event types can be added anytime within v1
- Event type names immutable once published
- Event payload fields additive-only within v1
- Events never deleted from catalog (deprecated events remain documented)
Event ID gap scenarios:
- Transaction rollback within same session:
- Transaction inserts event with ID 100
- Transaction rolls back (constraint violation, conflict, etc.)
- Next successful transaction gets ID 101 (gap at 100)
- SQLite reuses rolled-back IDs in same connection/session
- Result: no gap if same connection; possible gap if connection closed/reopened
- Hub crash mid-transaction:
- Transaction inserts event with ID 100, crashes before commit
- Transaction fully rolled back (WAL recovery)
- Next hub start: next event gets ID 101 (gap at 100, or ID reused)
- SQLite behavior: autoincrement IDs may or may not be reused after crash (depends on internal state)
- Consequence: event_id gaps possible but rare
- Intentional gaps (future: event log compaction):
- v1: no compaction; events never deleted
- Future: if events deleted (admin purge old events): gaps intentional
- Client replay: if gap detected (e.g., request >100, receive 150), no events in range 101-149
Gap detection and handling:
- `agentlip doctor` should scan event log for gaps:

-- Find gaps in event_id sequence
WITH RECURSIVE cnt(id) AS (
  SELECT MIN(event_id) FROM events
  UNION ALL
  SELECT id + 1 FROM cnt WHERE id < (SELECT MAX(event_id) FROM events)
)
SELECT id FROM cnt WHERE id NOT IN (SELECT event_id FROM events);
- If gaps found: log warning; gaps are safe but indicate rollbacks or crashes
- Clients: if replaying and see gap (e.g., last event 100, next event 150), no action needed; simply means events 101-149 don't exist
Event immutability edge cases:
- Attempt to UPDATE event row:
  - Trigger `prevent_event_mutation` fires, aborts transaction
  - Returns error; no state change
  - Hub code should never attempt UPDATE; guard rails in DB layer (see the trigger sketch after this list)
- Attempt to DELETE event row:
  - Trigger `prevent_event_delete` fires, aborts transaction
  - Returns error; no state change
  - Only way to remove events: delete DB file (catastrophic; not supported)
- Event payload size unbounded:
  - `data_json` is TEXT (unlimited in SQLite)
  - Risk: single event with 10MB payload (e.g., message.edited with huge content)
  - Mitigation: enforce max event payload size (e.g., 1MB); reject mutations that would generate oversized events
  - v1: rely on message content size limit (64KB); event payload will be <100KB typically
- Event timestamp in past (clock skew):
  - Hub generates `ts = new Date().toISOString()`
  - If system clock set backward: new events have earlier `ts` than old events
  - Consequence: `ts` ordering violated, but `event_id` ordering preserved
  - Clients should sort by `event_id`, treat `ts` as advisory
- Event timestamp far future (clock skew):
  - System clock set forward (e.g., +1 year)
  - Events have future `ts`
  - Hub later corrected (clock set back to now)
  - New events have earlier `ts` than recent events
  - Consequence: same as above; `event_id` authoritative
- Event scope columns NULL (invalid event):
  - Some events may not have channel/topic scope (e.g., system-level events)
  - v1: all events MUST have at least one scope (channel or topic)
  - Validation: before inserting event, ensure `scope_channel_id` OR `scope_topic_id` is non-NULL
  - Invalid events won't match any subscription; effectively invisible to clients
- Concurrent event inserts (impossible with single writer):
  - Single-writer guarantee prevents concurrent inserts
  - All inserts serialized by SQLite
  - Event IDs strictly increasing (no race)
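A minimal sketch of those guard-rail triggers, assuming `bun:sqlite`; the trigger names follow this plan, the bodies are illustrative:

```ts
import { Database } from "bun:sqlite";

const db = new Database(".agentlip/db.sqlite3");

// Enforce immutability at the DB layer: any UPDATE or DELETE against
// events aborts the surrounding transaction.
db.run(`
  CREATE TRIGGER IF NOT EXISTS prevent_event_mutation
  BEFORE UPDATE ON events
  BEGIN
    SELECT RAISE(ABORT, 'events rows are immutable');
  END;
`);

db.run(`
  CREATE TRIGGER IF NOT EXISTS prevent_event_delete
  BEFORE DELETE ON events
  BEGIN
    SELECT RAISE(ABORT, 'events rows are append-only');
  END;
`);
```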
Event replay correctness (detailed):
- Client sends `after_event_id = 100`
- Hub computes `replay_until = MAX(event_id)` at handshake time (e.g., 200)
- Hub queries: `WHERE event_id > 100 AND event_id <= 200 ORDER BY event_id ASC`
- Events 101-200 replayed
- During replay (takes 1s), new events 201-205 committed
- After replay completes, hub starts live stream: `WHERE event_id > 200`
- Live stream sends 201-205 (and any newer)
- Client dedupes by event_id; sees each event exactly once
Replay boundary race (pathological case):
- Client sends `after=100`
- Hub computes `replay_until=200` (snapshot)
- Before replay query executes, events 201-210 committed
- Replay query executes: returns 101-200
- Live stream starts: sends >200 (i.e., 201-210)
- Result: correct; no gap (client sees 101-200, then 201-210)
Replay timeout (very stale client):
- Client requests replay from `after=0` (all history)
- Event log has 1M events
- Replay query: paginate by `maxEventReplayBatch` (1000)
- Hub sends 1k events, waits for client to ack (or next batch request)
- If client slow: hub enforces WS backpressure (disconnect after queue full)
- Client reconnects with last processed event_id, resumes
- Total replay time: 1M events / 1000 per batch = 1000 batches at ~1s each ≈ 17 minutes (if no backpressure)
- Mitigation: consider rejecting replays older than TTL (e.g., 7 days worth of events)
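A minimal sketch of the batched replay loop, assuming `bun:sqlite` and a per-connection `send` callback (ack/pacing elided):

```ts
import { Database } from "bun:sqlite";

// Replay in bounded batches so one very stale client never pulls a huge
// result set at once; stop early if the socket goes away.
function replay(
  db: Database,
  after: number,
  until: number,                      // replay_until snapshot
  batch: number,                      // maxEventReplayBatch
  send: (row: unknown) => boolean,    // false once the connection is gone
): void {
  let cursor = after;
  while (cursor < until) {
    const rows = db.query(
      `SELECT event_id, ts, name, data_json FROM events
       WHERE event_id > ? AND event_id <= ?
       ORDER BY event_id ASC LIMIT ?`
    ).all(cursor, until, batch) as { event_id: number }[];
    if (rows.length === 0) break;               // nothing left (gaps are fine)
    for (const row of rows) {
      if (!send(row)) return;                   // client disconnected mid-replay
    }
    cursor = rows[rows.length - 1].event_id;    // advance past last sent event
  }
}
```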
Example additive event evolution:
v1.0 message.created:
{
"message": {
"id": "msg_123",
"content_raw": "Hello"
}
}

v1.5 (added optional field):
{
"message": {
"id": "msg_123",
"content_raw": "Hello",
"word_count": 1 // new optional field
}
}

Old clients ignore `word_count`; new clients can use it. Both work.
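The client-side half of this rule is mechanical; a minimal sketch (handler names are illustrative):

```ts
// Forward-compatible dispatch: act on known event names, silently skip
// unknown ones, and read only the payload fields the client knows about.
declare function handleCreated(data: Record<string, unknown>): void;
declare function handleEdited(data: Record<string, unknown>): void;

function onEvent(envelope: { name: string; data: Record<string, unknown> }): void {
  switch (envelope.name) {
    case "message.created":
      handleCreated(envelope.data);
      break;
    case "message.edited":
      handleEdited(envelope.data);
      break;
    default:
      break; // unknown event type from a newer hub: ignore, never throw
  }
}
```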
- `.agentlip/locks/writer.lock` acquired via exclusive create.
- Hub verifies staleness by `/health` (and PID liveness if available).
- DB uses WAL + configured busy timeout.
Mutation transaction must include
- state change
- insert corresponding event row(s) with correct scopes + payload(s)
- commit
Crash safety: if hub crashes between steps 1 and 2 (or before commit), the entire transaction rolls back automatically (SQLite WAL guarantees). No partial state is possible.
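A minimal sketch of that pattern, assuming `bun:sqlite` (column lists abbreviated; ID/timestamp generation elided):

```ts
import { Database } from "bun:sqlite";

const db = new Database(".agentlip/db.sqlite3");

// One transaction = state change + its event row. If anything throws,
// SQLite rolls back both, so no state change exists without an event.
const sendMessage = db.transaction(
  (id: string, topicId: string, channelId: string, sender: string, content: string) => {
    db.query(
      `INSERT INTO messages (id, topic_id, sender, content_raw, version)
       VALUES (?, ?, ?, ?, 1)`
    ).run(id, topicId, sender, content);
    db.query(
      `INSERT INTO events (name, scope_channel_id, scope_topic_id, data_json)
       VALUES ('message.created', ?, ?, ?)`
    ).run(channelId, topicId,
      JSON.stringify({ message: { id, topic_id: topicId, channel_id: channelId, sender } }));
  }
);

sendMessage("msg_1", "topic_abc", "channel_abc", "agent-1", "Hello world");
```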
Edge cases and mitigations:
- Disk full during transaction: SQLite returns `SQLITE_FULL`; transaction auto-rolls back; return 503 to client; log disk space exhaustion; consider WAL checkpoint to reclaim space
- Lock contention timeout: if `busy_timeout` expires, return 503 with `Retry-After` header; client should implement exponential backoff
- WAL checkpoint failure (disk full, I/O error): checkpoint is best-effort; WAL can grow; monitor WAL size; if WAL exceeds threshold (e.g., 100MB), reject new writes with 503 until checkpoint succeeds or admin intervenes
- Power loss mid-transaction: WAL recovery on restart; transaction either fully committed or fully rolled back (atomicity guarantee)
- Corruption detection: on any `SQLITE_CORRUPT` error, immediately stop serving, mark DB as suspect, require `agentlip doctor --repair` before restart
Derived pipelines run in separate transactions after commit. If hub crashes during derived processing, derived data may be incomplete but canonical state (messages/events) is intact and replayable.
Derived pipeline crash recovery:
- On hub restart: scan for messages with no enrichments/attachments but should have them (heuristic: recent messages, or messages modified after last enrichment timestamp)
- Option 1: background re-enrichment job
- Option 2: lazy re-enrichment on read (if enrichments missing, queue job)
- v1: no automatic recovery; manual `agentlip re-enrich --since <event_id>` command for admin
For edit/delete/move_topic:
- If `expected_version` is provided, validate `messages.version == expected_version` inside the transaction.
- On mismatch: rollback and return conflict.
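From the client side, a minimal retry sketch against the PATCH shape above, with a hypothetical `api` helper standing in for the SDK:

```ts
declare const api: {
  getMessage(id: string): Promise<{ content_raw: string; version: number }>;
  patchMessage(id: string, body: object): Promise<{ code?: string }>;
};

// On VERSION_CONFLICT: re-read, recompute the edit against current
// content, and retry with the fresh version (bounded attempts).
async function editWithRetry(messageId: string, rewrite: (cur: string) => string) {
  for (let attempt = 0; attempt < 3; attempt++) {
    const msg = await api.getMessage(messageId);
    const res = await api.patchMessage(messageId, {
      op: "edit",
      content_raw: rewrite(msg.content_raw),
      expected_version: msg.version,
    });
    if (res.code !== "VERSION_CONFLICT") return res; // success or other error
    // else: a concurrent mutation won; loop to re-read and try again
  }
  throw new Error("edit kept conflicting; giving up");
}
```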
Concurrent mutation edge cases:
- Two edits racing (no expected_version):
  - Transaction serialization ensures one commits first (increments version to 2)
  - Second commits after (increments version to 3)
  - Both succeed; both emit events; event_id determines order
  - Last writer wins for content; full edit history in event log
- Two edits racing (both with expected_version=1):
  - First edit commits (version 1→2), emits event
  - Second edit's txn sees version=2, conflicts, rolls back, returns 409
  - Client receives conflict response with `current_version: 2`
  - Client must decide: retry with version 2 (re-read current content, recompute edit), or abort
- Edit vs. delete race:
  - If delete commits first: sets `deleted_at`, tombstones content, version 1→2
  - Subsequent edit sees `deleted_at != NULL`; decision: allow edit of tombstoned message (set `deleted_at=NULL`, restore content, increment version) OR reject edit of deleted message
  - v1 decision: reject edits of tombstoned messages (check `deleted_at IS NULL` before edit; return 400 "cannot edit deleted message")
- Edit vs. retopic race:
  - Retopic increments version (v1→v2), changes topic_id
  - Concurrent edit with expected_version=1 will conflict (version now 2)
  - This is correct behavior: retopic is a mutation; version tracking prevents lost updates
- Delete vs. delete race:
  - First delete commits (sets `deleted_at`, version 1→2)
  - Second delete sees version=2 (if expected_version=1 provided): conflicts
  - If no expected_version: second delete sees `deleted_at != NULL`; decision: idempotent success (return 200, no state change, no new event) OR error
  - v1 decision: idempotent success (deleting already-deleted message is no-op; return success with existing state)
- Rapid successive edits by same client:
  - Each edit commits sequentially (v1→v2→v3...)
  - Each emits `message.edited` event
  - Event log preserves full history
  - UI may coalesce edit events for display (e.g., show "edited 3 times" instead of 3 separate events)
  - No special handling needed; version monotonically increases
- Retopic "all" mode concurrent with new message insert in source topic:
  - Retopic transaction selects all messages in topic A at transaction start
  - New message inserts into topic A after retopic starts but before retopic commits
  - Two outcomes:
    a. New message commits first: retopic includes it (correct)
    b. Retopic commits first: new message remains in topic A (correct; message arrived after retopic started)
  - Both outcomes are correct; no lost messages; serialization guarantees consistency
- Version overflow (2^31-1 edits):
  - SQLite INTEGER is 64-bit signed; practical limit is 2^63-1
  - If version overflows: wrap to negative (unlikely in practice)
  - v1: no overflow handling; document that >2B edits per message is unsupported
  - Future: detect approaching overflow, prevent further edits, require manual intervention
- Maintain per-connection subscriptions (channel/topic)
- On new committed event:
- match by scopes (`scope_channel_id`, `scope_topic_id`, `scope_topic_id2`)
- send envelope
- Backpressure:
- bounded outbound queue per socket
- disconnect when threshold exceeded
- client reconnects using last processed `event_id`
WS delivery edge cases:
- Events committed during replay period:
  - Scenario: client requests replay from `after=100`, hub sets `replay_until=200`, but events 201-205 commit before replay finishes
  - Solution: replay sends `<= replay_until` (101-200), then live stream sends `> replay_until` (201+); no gap; client may receive duplicates at boundary (200/201); client dedupes by `event_id`
- Client disconnect mid-replay:
  - Replay is best-effort; on disconnect, abandon replay
  - Client reconnects with same `after_event_id` (last processed, not last received)
  - New replay boundary computed; may re-send events (client dedupes)
- Send failure mid-batch:
  - If WS send fails partway through sending multiple events: close connection immediately
  - Do NOT attempt partial retry; client reconnects with last processed (ack'd) event_id
  - Server does not track which events were received; relies on client to report `after_event_id` on reconnect
- Replay query returns huge result set:
  - Enforce `maxEventReplayBatch` (default 1000) per query
  - If more events match, send in multiple batches (pagination)
  - After each batch, check if connection still healthy; abort if client disconnected
  - Risk: very stale clients (e.g., `after=0` with 1M events) may take a long time and resources; consider rejecting replays older than threshold (e.g., 7 days) with "too stale, reinitialize" error
- Concurrent event emission during fanout:
  - Events may commit while fanout loop is iterating connections
  - Solution: fanout reads event once, iterates connections, sends same envelope to each
  - New events (committed after fanout started) will be picked up by next fanout cycle
  - No event is dropped; at-most-once per cycle, at-least-once over time
- Clock skew / timestamp ordering:
  - `event_id` is authoritative order, not `ts`
  - If system clock jumps backward, `ts` may be out of order but `event_id` monotonicity is preserved
  - Clients should sort/order by `event_id`, use `ts` for display only
- Hub restart during active WS connections:
  - On graceful shutdown: close all WS with code 1001 (Going Away)
  - Clients reconnect with last processed `event_id`
  - New hub instance has new `instance_id`; clients detect and proceed (no special handling needed)
  - On crash/kill: connections drop; clients detect disconnect, reconnect with backoff
Major "churn magnet" decisions now locked:
- `move_topic` and `edited_at`: Retopic does not set `edited_at` (it's routing metadata, not a content change). Event timestamp is authoritative.
- Attachment behavior on retopic: No automatic attachment migration; attachments stay with the topic they were inserted into.
- Plugin environment: Worker-only in v1; subprocess reserved for v2 (simpler isolation).
- FTS fallback semantics: Basic LIKE-based filtering on message content when FTS5 unavailable; document limitations.
Ship when all true:
- ✅ Workspace init creates `.agentlip/` and schema v1
- ✅ Hub starts, acquires write lock, writes `server.json`, serves `/health`
- ✅ Channels/topics/messages CRUD (as specified)
- ✅ Message edit with optimistic concurrency (emits `message.edited`)
- ✅ Message tombstone delete (emits `message.deleted`; no hard deletes possible)
- ✅ Retopic modes `one|later|all` with CLI guardrails (same-channel only)
- ✅ WS replay + live stream with `after_event_id` correctness (Gates B/C)
- ✅ Topic attachments API + CLI + auto URL extraction with `dedupe_key`
- ✅ Plugin system v1: isolation, timeouts, `message.enriched` events
- ✅ SDK: connect/replay/reconnect; async iterator yields typed envelopes
- ✅ Minimal UI: browse channels/topics/messages/attachments with live updates
- ✅ Test suite covers Gates A-J; CI runs deterministically
Conservative budgets on a typical dev laptop.
- Message insert (excluding enrichment): p50 < 10ms, p99 < 50ms
- Message edit/delete/retopic (excluding derived): p50 < 15ms, p99 < 75ms
- Event fanout (single client): < 5ms overhead per event
- WS replay: 10k events in < 1s (localhost)
- Tail query: latest 50 messages by (channel, topic) in < 20ms @ 100k messages
- Retopic "later": 1k messages in < 200ms (single transaction; index-dependent)
Add a bench command (or integration test mode) that:
- populates N messages/topics
- measures key queries and endpoints
- exercises WS replay
- records metrics to JSON for regression tracking (relaxed CI thresholds)
Decision: topics are entities with stable IDs; messages reference topic_id.
Tests: rename topic doesn't rewrite messages; retopic updates messages.topic_id and emits events.
Decision: durable events with WS + replay by event_id.
Tests: replay equivalence; crash atomicity.
Decision (contract)
- On WS `hello`, server computes snapshot boundary `replay_until = latest_event_id_at_handshake`.
- Server replies `hello_ok` carrying `replay_until` (the `latest_event_id` snapshot).
- Server replays events matching subscriptions where: `after_event_id < event_id <= replay_until`
- After replay completes, server streams new matching events with `event_id > replay_until`.
Reference SQL (shape)
SELECT event_id, ts, name, data_json
FROM events
WHERE event_id > :after
AND event_id <= :until
AND (
scope_channel_id IN (/* channelSubs */)
OR scope_topic_id IN (/* topicSubs */)
OR scope_topic_id2 IN (/* topicSubs */)
)
ORDER BY event_id ASC
LIMIT :limit;

Tests: deterministic replay set/order; boundary test for events inserted during replay.
Decision
- Implement `one|later|all` selection exactly.
- Cross-channel moves are forbidden in v1. `to_topic_id` must belong to the message's channel.
- Retopic increments `messages.version` and emits per-message `message.moved_topic` (plus scopes).
Selection SQL (shape)
- one:

SELECT id FROM messages WHERE id = :msg_id AND topic_id = :old_topic_id;

- later:

SELECT id FROM messages
WHERE topic_id = :old_topic_id AND id >= :msg_id
ORDER BY id ASC;

- all:

SELECT id FROM messages
WHERE topic_id = :old_topic_id
ORDER BY id ASC;

Write pattern
- In one transaction:
- validate channel constraint
- read affected IDs
- update `topic_id`, bump `version`
- insert `message.moved_topic` event per message with:
  - `scope_channel_id = channel_id`
  - `scope_topic_id = old_topic_id`
  - `scope_topic_id2 = new_topic_id`
- commit
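A minimal sketch of this transaction for mode=later, assuming `bun:sqlite` and the schema/event names used in this plan (event `ts`/payload details abbreviated):

```ts
import { Database } from "bun:sqlite";

const db = new Database(".agentlip/db.sqlite3");

// Validate same-channel, select affected IDs, update topic_id + version,
// and append one message.moved_topic event per message, all in one txn.
const retopicLater = db.transaction(
  (msgId: string, oldTopicId: string, newTopicId: string, channelId: string) => {
    const target = db.query(`SELECT channel_id FROM topics WHERE id = ?`)
      .get(newTopicId) as { channel_id: string } | null;
    if (!target || target.channel_id !== channelId) {
      throw new Error("CROSS_CHANNEL_MOVE"); // throwing rolls back everything
    }
    const ids = db.query(
      `SELECT id FROM messages WHERE topic_id = ? AND id >= ? ORDER BY id ASC`
    ).all(oldTopicId, msgId) as { id: string }[];
    for (const { id } of ids) {
      db.query(`UPDATE messages SET topic_id = ?, version = version + 1 WHERE id = ?`)
        .run(newTopicId, id);
      db.query(
        `INSERT INTO events (name, scope_channel_id, scope_topic_id, scope_topic_id2, data_json)
         VALUES ('message.moved_topic', ?, ?, ?, ?)`
      ).run(channelId, oldTopicId, newTopicId,
        JSON.stringify({ message_id: id, old_topic_id: oldTopicId, new_topic_id: newTopicId, mode: "later" }));
    }
    return ids.length;
  }
);
```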
Tests: fanout correctness; cross-channel negative test; version bump.
Detailed retopic example:
Given:
- Channel `general` with topics `bugs` and `archive`
- Messages in `bugs`: msg_1, msg_2, msg_3, msg_4, msg_5
Scenario: agentlip msg retopic msg_3 --to-topic-id archive --mode later
Expected behavior:
- Select messages: msg_3, msg_4, msg_5 (all with `id >= msg_3` in topic `bugs`)
- Update each: `topic_id = 'archive'`, `version += 1`
- Emit 3 events (one per message moved):
{
"event_id": 101,
"name": "message.moved_topic",
"scope": {
"channel_id": "general",
"topic_id": "bugs", // old topic
"topic_id2": "archive" // new topic
},
"data": {
"message_id": "msg_3",
"old_topic_id": "bugs",
"new_topic_id": "archive",
"channel_id": "general",
"mode": "later",
"version": 2 // incremented
}
}
// ... events 102, 103 for msg_4, msg_5

Subscribers affected:
- Subscribed to channel `general`: receive all 3 events (via `scope.channel_id`)
- Subscribed to topic `bugs`: receive all 3 events (via `scope.topic_id`)
- Subscribed to topic `archive`: receive all 3 events (via `scope.topic_id2`)
Cross-channel rejection example:
$ agentlip msg retopic msg_3 --to-topic-id other_channel_topic --mode one
Error: cross-channel move forbidden
Exit code: 1

Retopic edge cases:
- Retopic to same topic (no-op):
  - Message already in target topic
  - Decision: idempotent success (no state change, no events, return 200)
  - Rationale: client intent achieved (message is in target topic)
- Retopic of tombstoned message:
  - Message has `deleted_at != NULL`
  - Decision: allow retopic of deleted messages (tombstone is content state, not routing state)
  - Retopic updates `topic_id`, increments `version`, emits event
  - Deleted message is now in new topic (still deleted)
  - UI should still render as deleted in new location
- Retopic with expected_version on already-moved message:
  - Message was retopiced (v1→v2), now in topic B
  - Client retries retopic with `expected_version=1` (stale)
  - Result: conflict (current version is 2)
  - Client must re-read current state, decide if retopic still needed
- Source topic deleted during retopic "all":
  - Retopic transaction starts, selects all messages in topic A
  - Topic A deleted (CASCADE deletes all messages) before retopic commits
  - Foreign key constraint: messages referencing topic A are deleted
  - Retopic update finds zero rows (messages gone)
  - Decision: return 200 with `affected_count: 0` (no error; topic was deleted)
  - Alternative: topic deletion blocks until retopic completes (lock contention)
  - v1: allow concurrent topic deletion; retopic may affect 0 messages if topic deleted
- Target topic deleted during retopic:
  - Retopic transaction starts, validates target topic exists
  - Target topic deleted before retopic update commits
  - Retopic update sets `topic_id` to deleted topic
  - Foreign key constraint fails (target topic_id does not exist)
  - SQLite returns constraint violation; transaction rolls back
  - Return 400 "target topic not found"
- Retopic "all" mode selects 10k messages:
  - Single transaction updates 10k rows + inserts 10k event rows
  - Risk: long transaction, lock contention, WAL growth
  - Mitigation: enforce `max_retopic_batch` (e.g., 1000 messages)
  - If selection exceeds limit: return 400 "too many messages; use mode=later with smaller anchor, or delete old messages first"
  - v1: no batch limit; document that "all" mode on large topics may be slow
  - Future: chunked retopic (internal pagination, multiple txns)
- Retopic "later" mode anchor message already at end:
  - Anchor message is last (or only) message in topic
  - Selection: only anchor message (nothing "later")
  - Outcome: move only anchor message (correct; mode=later includes anchor)
- Retopic "later" mode with gaps in message IDs:
  - Topic has messages: msg_1, msg_5, msg_10 (IDs are sparse)
  - Retopic anchor: msg_5, mode=later
  - Selection: `WHERE topic_id=X AND id >= 'msg_5'` → msg_5, msg_10
  - Outcome: msg_1 stays, msg_5 and msg_10 move (correct)
- Concurrent retopics on same topic:
  - Two retopic "all" operations on topic A, different targets (B and C)
  - Both start, both select all messages in topic A
  - First commits: all messages now in topic B, event_id 100-110
  - Second commits: updates `topic_id` from B to C (selection was a snapshot; messages are now in B, not A)
  - Outcome: all messages end up in topic C (last writer wins)
  - Problem: first retopic's events show A→B, but final state is C; confusing
  - Mitigation: retopic selection should re-check topic_id inside the transaction before update:
    UPDATE messages SET topic_id = :new_topic, version = version + 1 WHERE id IN (:selected_ids) AND topic_id = :expected_old_topic
  - If topic_id changed, the update affects 0 rows; return 409 "messages moved by concurrent retopic"
- Retopic + edit race on version:
  - Already covered in concurrent mutations; version mismatch causes conflict
  - Retopic increments version; concurrent edit with expected_version will fail
Decision: Bun Worker isolation by default; --unsafe-inproc-plugins for dev; subprocess reserved for future.
Tests: hang timeout; crash containment.
Decision: separate schema_v1_fts.sql applied opportunistically; failure is non-fatal.
Tests: suite runs with FTS on/off.
Decision
- Add required `dedupe_key` to `topic_attachments`.
- Enforce uniqueness with:
UNIQUE(topic_id, kind, COALESCE(key,''), dedupe_key)
- Hub computes a `dedupe_key` if caller doesn't provide one.
- Emit `topic.attachment_added` only if a new row was created.
DDL delta (shape)
dedupe_key TEXT NOT NULL,
CHECK (length(dedupe_key) > 0);
CREATE UNIQUE INDEX IF NOT EXISTS idx_topic_attachments_dedupe
ON topic_attachments(topic_id, kind, COALESCE(key, ''), dedupe_key);

Insert semantics
- Attempt insert
- On unique conflict: fetch existing row and return it
- No event on deduped insert
Tests: retry insert does not duplicate; no phantom events.
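A sketch of these insert semantics, assuming bun:sqlite; the sha256-of-value_json default for the hub-computed `dedupe_key` is an assumption for illustration, not a locked decision:

import { Database } from "bun:sqlite";
import { createHash } from "node:crypto";

function addAttachment(db: Database, a: { topicId: string; kind: string; key?: string; valueJson: string }) {
  // hub-computed default dedupe_key (assumption: hash of the payload)
  const dedupeKey = createHash("sha256").update(a.valueJson).digest("hex");
  try {
    db.query(`INSERT INTO topic_attachments (id, topic_id, kind, key, value_json, dedupe_key, created_at)
              VALUES ($id, $topic, $kind, $key, $value, $dedupe, $now)`)
      .run({ $id: crypto.randomUUID(), $topic: a.topicId, $kind: a.kind, $key: a.key ?? null,
             $value: a.valueJson, $dedupe: dedupeKey, $now: new Date().toISOString() });
    return { created: true }; // new row → caller emits topic.attachment_added in the same txn
  } catch (err: any) {
    // SQLite reports "UNIQUE constraint failed: ..." on a dedupe hit
    if (!String(err?.message).includes("UNIQUE")) throw err;
    // dedupe hit: fetch and return the existing row; no event
    const existing = db.query(`SELECT * FROM topic_attachments
      WHERE topic_id = $topic AND kind = $kind AND COALESCE(key,'') = COALESCE($key,'') AND dedupe_key = $dedupe`)
      .get({ $topic: a.topicId, $kind: a.kind, $key: a.key ?? null, $dedupe: dedupeKey });
    return { created: false, existing };
  }
}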
Decision
- Edits are explicit events with optimistic concurrency.
- Deletes are tombstones; hard deletes are forbidden.
Consequences
- Stable message identity forever
- Attachments referencing `source_message_id` remain valid
- "Delete" is not secure erasure; old content may persist in historical events
Tests
- Edit success increments version + emits event
- Edit conflict ⇒ no state/events
- Delete tombstones row + emits event
- Derived staleness guard prevents stale enrichment/extraction commits
The canonical execution checklist is Part X: Master TODO Inventory. Treat it as the execution board.
This is a workspace-scoped state machine:
- Canonical state: channels/topics/messages/attachments/(derived enrichments)
- Canonical change log: events (monotonic)
- Derived projections: enrichment + extraction (recomputable)
Key insight:
- Agents need shared local truth with stable addresses + deterministic replay + minimal coordination overhead.
Each mutation endpoint is a transition S → S' with corresponding event E.
Invariant: mutation commit implies event commit.
If a message edit commits, a message.edited event exists in the same transaction with event_id reflecting the total order.
Concurrent mutations: SQLite serializes transactions; event_id (autoincrement) defines total order. If two mutations target the same message concurrently:
- optimistic concurrency (`expected_version`) may cause one to fail with conflict
- both cannot succeed with the same `version`; one will see the incremented version and fail or retry
- the event stream reflects whichever transaction committed first
Rapid successive edits: if the same message is edited multiple times in quick succession:
- each edit increments `version` and emits a separate `message.edited` event
- all edits are recorded in the event log (preserving edit history)
- clients see all edit events in order; UI may choose to coalesce or show history

- Server emits a total order by `event_id`.
- Clients store `last_processed_event_id` durably and dedupe.
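A minimal consumer-side sketch of that dedupe loop (`loadCheckpoint`, `saveCheckpoint`, `handle`, and `eventStream` are hypothetical application helpers):

// at-least-once delivery → effectively-once processing via event_id dedupe
let lastProcessed = await loadCheckpoint();      // durable, e.g. a local file

for await (const ev of eventStream) {            // envelopes from WS replay + live stream
  if (ev.event_id <= lastProcessed) continue;    // duplicate or already-seen: skip
  handle(ev);                                    // application logic
  lastProcessed = ev.event_id;
  await saveCheckpoint(lastProcessed);           // persist before processing further events
}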
I1: Single-writer serialization
- Only one hub process writes to DB at a time (enforced by writer.lock)
- All transactions are serialized by SQLite (SERIALIZABLE isolation + WAL)
- Consequence: no lost updates, no write-write conflicts at DB layer
I2: Event ID monotonicity
- `event_id` is INTEGER PRIMARY KEY AUTOINCREMENT
- SQLite guarantees monotonic increase (IDs are never reused)
- Consequence: total order over all events; no gaps (the ID space exhausts at 2^63-1, which is impractical to reach)
I3: Message version monotonicity
- Each mutation (edit/delete/retopic) that commits increments `messages.version` by exactly 1
- Version starts at 1 (on creation)
- Consequence: version reflects mutation count; version N means N-1 mutations since creation
I4: Atomic mutation + event
- State change and event insertion occur in same SQLite transaction
- If crash occurs: both commit or both rollback (atomicity)
- Consequence: event log is complete (no state change without event, no event without state change)
I5: At-least-once WS delivery
- Server may send same event multiple times (e.g., reconnect during replay)
- Server never skips an event matching subscription
- Consequence: clients must dedupe by event_id; guaranteed to see all matching events
I6: Optimistic concurrency correctness
- If `expected_version` provided: txn verifies `messages.version == expected_version` before mutation
- If mismatch: txn rolls back, no state change, no event emitted
- Consequence: lost update prevention; client can detect concurrent modifications
I7: Replay boundary consistency
- Replay sends events in `(after_event_id, replay_until]`
- Live stream sends events in `(replay_until, ∞)`
- No gaps: events committed during replay are > replay_until, sent by live stream
- Possible duplicates: event at boundary (replay_until or replay_until+1) may appear in both replay and live
- Consequence: client dedupes by event_id; sees all events exactly once (after deduplication)
I8: Scope-based routing correctness
- Every event has `scope_channel_id` and/or `scope_topic_id` and/or `scope_topic_id2`
- Replay query matches subscription by scope columns (index-backed)
- Fanout matches subscription by scope columns
- Consequence: clients receive exactly events matching their subscriptions (no false positives/negatives after deduplication)
I9: Foreign key consistency
- `messages.topic_id` references `topics.id` (ON DELETE CASCADE)
- `messages.channel_id` matches `topics.channel_id` for the referenced topic (app-enforced invariant)
- `topic_attachments.topic_id` references `topics.id` (ON DELETE CASCADE)
- Consequence: referential integrity; orphaned messages/attachments prevented by cascade or null
I10: Tombstone immutability
- `messages` rows are never deleted (DELETE trigger prevents it)
- Tombstone delete sets `deleted_at`, tombstones `content_raw`, increments `version`
- Consequence: message identity stable forever; historical references valid; "deleted" is a state, not an operation
I11: Derived data staleness protection
- Plugin reads message at version V, content C
- Before committing derived outputs: re-read message
- If `content_raw != C` OR `version != V` OR `deleted_at IS NOT NULL`: discard outputs
- Consequence: derived data never references stale/deleted content; correctness over availability
I12: Lock-free reads (WAL mode)
- SQLite WAL allows concurrent readers with writer
- CLI queries use `PRAGMA query_only = ON` (read-only snapshot)
- Consequence: CLI can query DB without blocking hub writes; snapshot consistency
matches(event, subs) is OR across:
- `scope_channel_id == sub.channel_id`
- `scope_topic_id == sub.topic_id`
- `scope_topic_id2 == sub.topic_id`

Handshake defines replay_until; replay is (after, replay_until]; live starts > replay_until.
For any message:
- `message.created` precedes any enrichment/attachment event sourced from its content at that time.
- If content changes (edit/delete), derived jobs must not commit outputs computed from older content after the edit/delete commits (staleness guard).
meta table:
CREATE TABLE IF NOT EXISTS meta (
key TEXT PRIMARY KEY NOT NULL,
value TEXT NOT NULL
) STRICT;
-- Required keys:
-- 'db_id': UUIDv4 generated at init, never changes
-- 'schema_version': integer, current version
-- 'created_at': ISO8601 timestamp

channels table:
CREATE TABLE IF NOT EXISTS channels (
id TEXT PRIMARY KEY NOT NULL, -- UUIDv4 or ULID
name TEXT NOT NULL UNIQUE,
description TEXT,
created_at TEXT NOT NULL, -- ISO8601
CHECK (length(name) > 0 AND length(name) <= 100)
) STRICT;

topics table:
CREATE TABLE IF NOT EXISTS topics (
id TEXT PRIMARY KEY NOT NULL,
channel_id TEXT NOT NULL,
title TEXT NOT NULL,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL,
FOREIGN KEY (channel_id) REFERENCES channels(id) ON DELETE CASCADE,
UNIQUE(channel_id, title),
CHECK (length(title) > 0 AND length(title) <= 200)
) STRICT;
CREATE INDEX IF NOT EXISTS idx_topics_channel ON topics(channel_id, updated_at DESC);

messages table:
CREATE TABLE IF NOT EXISTS messages (
id TEXT PRIMARY KEY NOT NULL,
topic_id TEXT NOT NULL,
channel_id TEXT NOT NULL, -- denormalized for fast filtering
sender TEXT NOT NULL,
content_raw TEXT NOT NULL,
version INTEGER NOT NULL DEFAULT 1,
created_at TEXT NOT NULL,
edited_at TEXT,
deleted_at TEXT,
deleted_by TEXT,
FOREIGN KEY (topic_id) REFERENCES topics(id) ON DELETE CASCADE,
CHECK (length(sender) > 0),
CHECK (length(content_raw) <= 65536), -- 64KB limit
CHECK (version >= 1)
) STRICT;
CREATE INDEX IF NOT EXISTS idx_messages_topic ON messages(topic_id, id DESC);
CREATE INDEX IF NOT EXISTS idx_messages_channel ON messages(channel_id, id DESC);
CREATE INDEX IF NOT EXISTS idx_messages_created ON messages(created_at DESC);
-- Trigger: prevent hard deletes
CREATE TRIGGER IF NOT EXISTS prevent_message_delete
BEFORE DELETE ON messages
FOR EACH ROW
BEGIN
SELECT RAISE(ABORT, 'Hard deletes forbidden on messages; use tombstone');
END;

events table:
CREATE TABLE IF NOT EXISTS events (
event_id INTEGER PRIMARY KEY AUTOINCREMENT,
ts TEXT NOT NULL, -- ISO8601
name TEXT NOT NULL, -- event type (e.g., 'message.created')
scope_channel_id TEXT, -- for channel-level routing
scope_topic_id TEXT, -- primary topic
scope_topic_id2 TEXT, -- secondary topic (for retopic)
entity_type TEXT NOT NULL, -- 'channel', 'topic', 'message', etc.
entity_id TEXT NOT NULL,
data_json TEXT NOT NULL, -- JSON payload
CHECK (length(name) > 0)
) STRICT;
CREATE INDEX IF NOT EXISTS idx_events_replay ON events(event_id);
CREATE INDEX IF NOT EXISTS idx_events_scope_channel ON events(scope_channel_id, event_id);
CREATE INDEX IF NOT EXISTS idx_events_scope_topic ON events(scope_topic_id, event_id);
CREATE INDEX IF NOT EXISTS idx_events_scope_topic2 ON events(scope_topic_id2, event_id);
-- Trigger: prevent updates/deletes
CREATE TRIGGER IF NOT EXISTS prevent_event_mutation
BEFORE UPDATE ON events
FOR EACH ROW
BEGIN
SELECT RAISE(ABORT, 'Events are immutable');
END;
CREATE TRIGGER IF NOT EXISTS prevent_event_delete
BEFORE DELETE ON events
FOR EACH ROW
BEGIN
SELECT RAISE(ABORT, 'Events are append-only');
END;

topic_attachments table:
CREATE TABLE IF NOT EXISTS topic_attachments (
id TEXT PRIMARY KEY NOT NULL,
topic_id TEXT NOT NULL,
kind TEXT NOT NULL,
key TEXT, -- optional namespace (e.g., 'url', 'file')
value_json TEXT NOT NULL, -- JSON object
dedupe_key TEXT NOT NULL, -- idempotency key
source_message_id TEXT,
created_at TEXT NOT NULL,
FOREIGN KEY (topic_id) REFERENCES topics(id) ON DELETE CASCADE,
FOREIGN KEY (source_message_id) REFERENCES messages(id) ON DELETE SET NULL,
CHECK (length(kind) > 0),
CHECK (length(dedupe_key) > 0),
CHECK (length(value_json) <= 16384) -- 16KB limit
) STRICT;
CREATE INDEX IF NOT EXISTS idx_attachments_topic ON topic_attachments(topic_id, created_at DESC);
CREATE UNIQUE INDEX IF NOT EXISTS idx_topic_attachments_dedupe
ON topic_attachments(topic_id, kind, COALESCE(key, ''), dedupe_key);

enrichments table (derived data, recomputable):
CREATE TABLE IF NOT EXISTS enrichments (
id TEXT PRIMARY KEY NOT NULL,
message_id TEXT NOT NULL,
kind TEXT NOT NULL,
span_start INTEGER NOT NULL,
span_end INTEGER NOT NULL,
data_json TEXT NOT NULL,
created_at TEXT NOT NULL,
FOREIGN KEY (message_id) REFERENCES messages(id) ON DELETE CASCADE,
CHECK (span_start >= 0),
CHECK (span_end > span_start),
CHECK (length(kind) > 0)
) STRICT;
CREATE INDEX IF NOT EXISTS idx_enrichments_message ON enrichments(message_id, created_at DESC);

Optional FTS5 schema (schema_v1_fts.sql):
CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts USING fts5(
content_raw,
content=messages,
content_rowid=rowid
);
-- Triggers to keep FTS in sync. Note: external-content FTS5 tables must be
-- maintained with the special 'delete' command; plain UPDATE/DELETE against
-- the FTS table corrupts the index.
CREATE TRIGGER IF NOT EXISTS messages_fts_insert AFTER INSERT ON messages
BEGIN
  INSERT INTO messages_fts(rowid, content_raw) VALUES (new.rowid, new.content_raw);
END;
CREATE TRIGGER IF NOT EXISTS messages_fts_update AFTER UPDATE ON messages
BEGIN
  INSERT INTO messages_fts(messages_fts, rowid, content_raw) VALUES ('delete', old.rowid, old.content_raw);
  INSERT INTO messages_fts(rowid, content_raw) VALUES (new.rowid, new.content_raw);
END;
CREATE TRIGGER IF NOT EXISTS messages_fts_delete AFTER DELETE ON messages
BEGIN
  INSERT INTO messages_fts(messages_fts, rowid, content_raw) VALUES ('delete', old.rowid, old.content_raw);
END;

- Denormalizes for fast filtering without joins.
- Enforces same-channel retopic rule cheaply.
- Invariant: `messages.channel_id` matches `topics.channel_id` for its `topic_id` (validated by hub on insert/retopic).
The scope_* pattern avoids joins during replay and keeps replay index-backed.
Example replay query:
SELECT event_id, ts, name, data_json
FROM events
WHERE event_id > :after_event_id
AND event_id <= :replay_until
AND (
scope_channel_id IN (/* subscribed channels */)
OR scope_topic_id IN (/* subscribed topics */)
OR scope_topic_id2 IN (/* subscribed topics */)
)
ORDER BY event_id ASC
LIMIT 1000;

WAL allows CLI reads while hub writes; small txns reduce lock time and failure blast radius.
PRAGMAs (set on connection):
PRAGMA journal_mode = WAL;
PRAGMA foreign_keys = ON;
PRAGMA busy_timeout = 5000;
PRAGMA synchronous = NORMAL; -- balance safety/performance

Lock contention handling:
- Hub sets `busy_timeout` (e.g., 5000ms) to retry on lock contention
- If a transaction fails after retries: return 503 Service Unavailable to the client
- CLI reads use `PRAGMA query_only = ON` to avoid write lock conflicts
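A sketch of a shared open helper applying these PRAGMAs (bun:sqlite; the `openDb` name and options shape are illustrative):

import { Database } from "bun:sqlite";

export function openDb(path: string, opts: { readOnly?: boolean } = {}): Database {
  const db = new Database(path, { readonly: opts.readOnly ?? false });
  db.run("PRAGMA journal_mode = WAL");
  db.run("PRAGMA foreign_keys = ON");
  db.run("PRAGMA busy_timeout = 5000");
  db.run("PRAGMA synchronous = NORMAL");
  if (opts.readOnly) db.run("PRAGMA query_only = ON"); // CLI reads: snapshot, no write locks
  return db;
}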
Timestamps:
- All timestamps stored as TEXT in ISO8601 format with UTC timezone
- Format: `YYYY-MM-DDTHH:MM:SS.sssZ` (e.g., `2026-02-04T23:30:45.123Z`)
- Millisecond precision required
- Always UTC (Z suffix required)
- Generated via `new Date().toISOString()` or equivalent
IDs:
- Entity IDs (channels, topics, messages, attachments, enrichments): TEXT
- Recommended: UUIDv4, UUIDv7, or ULID (sortable)
- Format validation: non-empty, max 64 chars, alphanumeric + hyphen/underscore
- Event IDs: INTEGER AUTOINCREMENT (guarantees monotonicity)
Strings:
- All text fields UTF-8
- JSON payloads: UTF-8 encoded
- Max lengths enforced at application layer and DB constraints (CHECK)
JSON payloads:
- `data_json`, `value_json`: stored as TEXT (serialized JSON)
- Must be valid JSON object (not array or primitive)
- Parsing: strict mode (reject invalid JSON)
- Size limits enforced before insertion
Boolean semantics:
- SQLite STRICT mode: use INTEGER (0/1) for booleans
- Protocol/API: use JSON true/false
- NULL vs. false: explicit NULL for optional fields, never implicit false
Version numbers:
- `messages.version`: INTEGER starting at 1, increments on mutation
- `schema_version`: INTEGER starting at 1
- `protocol_version`: STRING ("v1", "v2", etc.)
Null handling:
- Optional fields: NULL allowed in DB, null in JSON
- Required fields: NOT NULL constraint in DB, field required in JSON
- Empty string vs. NULL: prefer NULL for "absent" (empty string = present but empty)
Schema version tracking:
- `meta.schema_version` (integer) tracks the current schema version
- Hub checks it on startup; refuses to run on version mismatch
- Migrations are forward-only (no downgrades)
Migration process:
- Hub checks `meta.schema_version` against the expected version
- If lower: run migrations sequentially (e.g., `0001_schema_v1.sql` → `0002_add_feature.sql`)
- Before migration: create backup (`db.sqlite3.backup-v1-TIMESTAMP`)
- Apply migration SQL in a transaction
- Update `meta.schema_version`
- Log a migration event to the `events` table (for audit)
Migration naming convention:
- `migrations/NNNN_description.sql`
- e.g., `0001_schema_v1.sql`, `0002_add_enrichments_index.sql`
Migration file structure:
-- Migration: 0002_add_enrichments_index.sql
-- From schema version: 1
-- To schema version: 2
BEGIN TRANSACTION;
-- Create new index
CREATE INDEX IF NOT EXISTS idx_enrichments_kind ON enrichments(kind, message_id);
-- Update schema version
UPDATE meta SET value = '2' WHERE key = 'schema_version';
COMMIT;

Rollback strategy:
- Restore from timestamped backup
- Recompute derived tables (enrichments, attachments can be regenerated from messages)
- Events table is immutable; never modified by migrations (additive only)
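A sketch of a forward-only migration runner under these conventions (assumptions: bun:sqlite, each file carries its own BEGIN/COMMIT and schema_version update, and the driver executes multi-statement scripts; otherwise statements must be split):

import { Database } from "bun:sqlite";
import { readdirSync, readFileSync, copyFileSync } from "node:fs";

export function migrate(db: Database, dbPath: string, dir = "migrations"): void {
  const row = db.query(`SELECT value FROM meta WHERE key = 'schema_version'`).get() as { value: string };
  const current = Number(row.value);
  const pending = readdirSync(dir)
    .filter(f => /^\d{4}_.*\.sql$/.test(f) && Number(f.slice(0, 4)) > current)
    .sort(); // forward-only, sequential
  if (pending.length === 0) return;
  // timestamped backup before touching anything
  copyFileSync(dbPath, `${dbPath}.backup-v${current}-${Date.now()}`);
  for (const file of pending) {
    db.run(readFileSync(`${dir}/${file}`, "utf8")); // each file is one transaction
  }
}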
Breaking schema changes (requiring v2):
- Removing columns
- Renaming columns
- Changing column types incompatibly
- Changing event payload structure in breaking ways
Additive schema changes (v1.x):
- Adding nullable columns
- Adding indexes
- Adding new tables (opt-in features)
- Adding optional fields to event payloads (clients ignore unknown fields)
- Start at `cwd` (or the provided path)
- Walk upward until `.agentlip/db.sqlite3` exists; that directory is the workspace root
- Security: stop traversal at filesystem boundary or user home directory; never load `agentlip.config.ts` from untrusted parent directories
- `server.json` is advisory; validate via `/health`
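A sketch of the upward walk with the home-directory boundary (the `findWorkspaceRoot` name is illustrative):

import { existsSync } from "node:fs";
import { dirname, join, resolve } from "node:path";
import { homedir } from "node:os";

export function findWorkspaceRoot(start = process.cwd()): string | null {
  let dir = resolve(start);
  const boundary = homedir();
  while (true) {
    if (existsSync(join(dir, ".agentlip", "db.sqlite3"))) return dir;
    if (dir === boundary || dirname(dir) === dir) return null; // home dir or fs root: stop
    dir = dirname(dir);
  }
}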
GET /health endpoint:
{
"status": "ok",
"instance_id": "abc123-def456",
"db_id": "workspace-db-uuid",
"schema_version": 1,
"protocol_version": "v1",
"uptime_seconds": 3600,
"pid": 12345
}

- No authentication required (public endpoint)
- Always returns 200 if hub is running and responsive
- `instance_id`: unique per hub process (regenerated on restart)
- `db_id`: stable workspace identifier (from `meta` table)
- Used for staleness detection and validation
Hub startup sequence:
- Validate workspace: `.agentlip/db.sqlite3` exists and is readable
- Open DB, set PRAGMAs (WAL, foreign_keys, busy_timeout)
- Check `meta.schema_version`; run migrations if needed
- Acquire writer lock (`.agentlip/locks/writer.lock`)
  - If lock exists: validate via `/health` on the port from the existing `server.json`
  - If stale (no response or PID dead): remove lock
  - If live: fail with error "hub already running"
- Generate `instance_id` (UUID)
- Load or generate `auth_token` (crypto random ≥128-bit)
- Bind HTTP server to localhost:port
- Write `server.json` (chmod 0600)
- Load `agentlip.config.ts` (if it exists)
- Initialize plugin workers
- Log startup event to `events` table
- Begin serving requests
Hub shutdown sequence (graceful):
- Stop accepting new connections (close listener)
- Finish in-flight requests (with timeout, e.g., 10s)
- Close all WebSocket connections (send close frame)
- Flush WAL checkpoint
- Close DB connection
- Remove writer lock
- Remove `server.json`
- Exit process
Startup failure modes:
- Schema version too new: refuse to start, instruct user to upgrade hub
- Schema version too old: auto-migrate (with backup) or refuse if migration disabled
- DB corrupted: exit with error, recommend `agentlip doctor`
- Lock acquisition failed (live hub): exit with error showing running hub details
- Port bind failed: exit with error (port already in use)
agentlipd status command:
- Read `server.json` (if absent: "no hub running")
- Call `GET /health` on the port from server.json
- Validate:
  - `db_id` matches on-disk DB
  - Response within timeout (5s)
- Print status:
Status: running
Instance ID: abc123-def456
Port: 8080
PID: 12345
Uptime: 1h 23m
Schema version: 1
Protocol version: v1
agentlipd down command:
- Read `server.json` to find the running hub
- Send SIGTERM to PID (if available)
- Wait for graceful shutdown (timeout 10s)
- If timeout: send SIGKILL
- Verify shutdown via `/health` (expect connection refused)
- Clean up stale files if needed
- Validate auth token (constant-time comparison)
- Validate input:
- size limits (message content ≤64KB, attachment metadata ≤16KB, etc.)
- schema/type correctness
- sanitize/escape as needed
- Begin txn (using prepared statements/parameterized queries only)
- Apply state change
- Insert event row(s) with scopes + payload
- Commit
- Trigger async derived pipelines
- Respond `{ok:true}` (on error: generic message, log details server-side without leaking paths/tokens)
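The auth step above is where naive string comparison would leak timing; a minimal sketch of constant-time token validation (node:crypto works under Bun; the `tokenMatches` name is illustrative):

import { timingSafeEqual } from "node:crypto";

export function tokenMatches(authHeader: string | null, expectedToken: string): boolean {
  const presented = Buffer.from((authHeader ?? "").replace(/^Bearer\s+/i, ""), "utf8");
  const expected = Buffer.from(expectedToken, "utf8");
  // lengths must match for timingSafeEqual; length itself is not secret for random tokens
  if (presented.length !== expected.length) return false;
  return timingSafeEqual(presented, expected);
}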
- Validate same-channel constraint
- Optional `expected_version` validation
- Select affected messages by mode
- Update `topic_id`, bump `version`
- Emit `message.moved_topic` events with `scope_channel_id = channel_id`, `scope_topic_id = old_topic_id`, `scope_topic_id2 = new_topic_id`
Edit
- Validate `expected_version` (if provided)
- Update: `content_raw`, `edited_at = now`, `version = version + 1`
- Emit `message.edited`

Delete (tombstone)
- Validate `expected_version` (if provided)
- Update: `deleted_at = now`, `deleted_by = actor`, `content_raw = "[deleted]"`, `edited_at = now` (recommended), `version = version + 1`
- Emit `message.deleted`
When a derived job starts, it reads {message_id, content_raw, deleted_at, version}.
Before committing derived outputs:
- re-read `messages.content_raw` and `messages.deleted_at`
- if `content_raw` changed OR `deleted_at IS NOT NULL`: discard outputs (do not commit derived rows or events)
- if the message was deleted (tombstoned) after the job started: discard

Security note: do not extract or enrich tombstoned content; check deleted_at before processing.
SQL shape (staleness verification):
SELECT content_raw, deleted_at, version, topic_id, channel_id
FROM messages
WHERE id = :message_id;
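A sketch of the guard as a single check-then-insert transaction (bun:sqlite; `insertEvent` as described in the Master TODO Inventory; the job shape is illustrative):

const commitEnrichments = db.transaction((job: {
  messageId: string; content: string; version: number;
  rows: { kind: string; spanStart: number; spanEnd: number; dataJson: string }[];
}) => {
  const msg = db.query(
    `SELECT content_raw, deleted_at, version, topic_id, channel_id FROM messages WHERE id = $id`
  ).get({ $id: job.messageId }) as
    { content_raw: string; deleted_at: string | null; version: number; topic_id: string; channel_id: string } | null;

  // staleness guard, including the ABA fix below: compare version, not just content
  if (!msg || msg.deleted_at !== null || msg.content_raw !== job.content || msg.version !== job.version) {
    return false; // discard outputs: no derived rows, no events
  }
  for (const r of job.rows) {
    db.query(`INSERT INTO enrichments (id, message_id, kind, span_start, span_end, data_json, created_at)
              VALUES ($id, $mid, $kind, $s, $e, $data, $now)`)
      .run({ $id: crypto.randomUUID(), $mid: job.messageId, $kind: r.kind,
             $s: r.spanStart, $e: r.spanEnd, $data: r.dataJson, $now: new Date().toISOString() });
  }
  insertEvent("message.enriched",
    { channel_id: msg.channel_id, topic_id: msg.topic_id },
    { entity_type: "message", entity_id: job.messageId },
    { message_id: job.messageId, enrichment_count: job.rows.length });
  return true; // verification and insert commit atomically
});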
Derived pipeline edge cases:
- ABA problem (edit back to original content):
  - Job starts with content "Hello"
  - Message edited to "Goodbye" (v1→v2)
  - Message edited back to "Hello" (v2→v3)
  - Job finishes, compares content: "Hello" == "Hello" ✓
  - Problem: content matches but version changed; derived output may be stale
  - Solution: compare both content AND version; if version changed, discard even if content matches
  - Updated guard: `if content_raw != original_content OR version != original_version OR deleted_at IS NOT NULL: discard`
- TOC/TOU race (content changes during verification):
  - Job finishes, reads message for verification
  - Edit commits after the read but before the derived insert
  - Mitigation: perform the verification query and derived insert in the same transaction
  - Transaction ensures atomic "check-then-insert"; if the message changes mid-transaction, the next read will see the new version
  - Verification must use the same transaction as the derived row insert
- Multiple plugins processing the same message concurrently:
  - Two plugins (enricher, extractor) both triggered by `message.created`
  - Both read the same initial state, both pass the staleness guard (if content unchanged)
  - Both insert derived rows and emit events
  - Outcome: both succeed (correct); enrichments and attachments are independent
  - Edge case: if both try to insert the same `dedupe_key` attachment: unique constraint; second fails or returns existing; no duplicate events
- Plugin output depends on external state (e.g., URL resolves to title):
  - Message contains URL; extractor fetches URL, gets title "Old Title"
  - URL content changes externally (server updates page title)
  - Re-enrichment fetches URL, gets title "New Title"
  - Outcome: attachment updated? Or duplicate?
  - v1 decision: attachments are immutable once inserted; dedupe_key prevents duplicates; external changes not tracked
  - If URL content changes, manual re-extraction required (`agentlip re-extract --message-id <id>`, future command)
- Message deleted (tombstoned) while plugin running:
  - Plugin reads content, starts processing
  - Message deleted: `deleted_at` set, `content_raw` changed to "[deleted]"
  - Staleness guard checks: `deleted_at IS NOT NULL` → discard
  - No derived rows or events emitted for tombstoned content
  - Existing enrichments/attachments remain (not deleted); tied to the message via foreign key with `ON DELETE CASCADE` (if the message row were deleted) or `ON DELETE SET NULL` (for source_message_id in attachments)
  - v1: existing enrichments persist when a message is tombstoned (enrichments not auto-deleted)
  - Clients should hide enrichments when rendering tombstoned messages
- Plugin timeout vs. staleness:
  - Plugin times out (e.g., 5s limit)
  - Hub kills plugin, logs error
  - No derived rows inserted; no events emitted
  - Message remains un-enriched
  - Should we retry? v1 decision: no automatic retry; log timeout; emit `plugin.timeout` internal event (optional); admin can manually re-enrich
- Plugin emits outputs, then message is edited before commit:
  - Plugin runs on content "Hello", produces enrichments for "Hello"
  - Message edited to "Goodbye" (v1→v2) before plugin commits
  - Staleness guard sees version changed: discard enrichments
  - New `message.edited` event triggers a new plugin job for "Goodbye"
  - Outcome: only "Goodbye" enrichments persist (correct)
- Retopic during plugin execution:
  - Plugin starts on a message in topic A
  - Message retopiced to topic B (version increments)
  - Staleness guard sees version changed OR topic_id changed (should we check topic_id?)
  - Decision: version change is sufficient; retopic bumps version, so the guard will discard
  - Derived rows would be inserted into the wrong topic if the guard didn't catch this
  - For attachments: `topic_id` is denormalized on the attachment row; if the message moves, attachment `topic_id` should NOT auto-update
  - v1: attachments stay with the topic they were inserted into; do not auto-migrate on retopic
- Hub restart during plugin execution:
  - Plugins are in-flight (Worker processes)
  - Hub crashes or restarts
  - Workers detect disconnect or timeout, exit
  - On restart: no in-flight plugin state recovered
  - Messages remain un-enriched; no automatic retry
  - v1: no crash recovery for plugins; require manual re-enrichment if needed
- Concurrent edits triggering multiple plugin jobs:
  - Message edited rapidly: v1→v2→v3→v4
  - Each edit triggers a plugin job
  - Multiple plugin jobs running concurrently on different versions
  - Each job checks against the current version at commit time
  - Only the job matching the current version commits (if content unchanged since the job started)
  - Older jobs see a version mismatch and discard
  - Outcome: at most one set of enrichments persists (for the latest version)
  - Problem: rapid edits may cause a "thundering herd" of plugin jobs
  - Mitigation: debounce plugin triggers (e.g., wait 1s after an edit before triggering; if another edit arrives, reset the timer); a possible shape is sketched below
  - v1: no debouncing; document that rapid edits may waste plugin cycles
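For reference, the suggested debounce could look like the following (not in v1; a sketch only, with `runJob` as a hypothetical job-enqueue hook):

// wait delayMs after the last edit before enqueueing; a newer edit resets the timer
const pending = new Map<string, ReturnType<typeof setTimeout>>();

function schedulePluginJob(messageId: string, runJob: (id: string) => void, delayMs = 1000) {
  const existing = pending.get(messageId);
  if (existing) clearTimeout(existing); // newer edit arrived: reset the timer
  pending.set(messageId, setTimeout(() => {
    pending.delete(messageId);
    runJob(messageId); // job re-reads current content/version when it starts
  }, delayMs));
}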
- Schema init + optional FTS
- Event insertion helper scope correctness
- Retopic selection correctness
- Patch conflict logic (`expected_version`)
- Tombstone constraints and triggers
- Start hub in temp workspace
- WS connect with `after_event_id=0`
- Send message; verify `message.created`
- Edit; verify `message.edited` and conflict behavior
- Delete; verify tombstone state + `message.deleted`
- Retopic; verify fanout to old/new/channel and cross-channel rejection
- Disconnect/reconnect with last id; verify no gaps
- crash during mutation (between state write/event write) cannot produce partial state
- slow WS client triggers backpressure disconnect
- plugin hang timeout doesn't block ingestion
- derived job staleness guard blocks stale commits
Approach 1: Fault injection at SQLite layer
- Mock or wrap the SQLite driver to inject failures:
  - `SQLITE_FULL` during transaction commit
  - `SQLITE_BUSY` after N retries
  - `SQLITE_CORRUPT` on integrity check
- Verify the hub handles each gracefully (503, log error, no crash)
Approach 2: Time manipulation
- Mock `Date.now()` or the system clock:
  - Jump backward 1 hour (test clock skew)
- Jump forward 1 year (test far-future timestamps)
- Freeze time (test timeout enforcement)
- Verify event_id monotonicity preserved, ts may be out of order
Approach 3: Concurrency stress testing
- Spawn N clients (e.g., 50) concurrently:
- All edit same message (rapid fire, no expected_version)
- All retopic same message to different topics
- All insert same attachment (dedupe_key)
- Verify eventual consistency: version correct, no lost events, dedupe works
Approach 4: Network simulation (WS edge cases)
- Drop WS connection mid-replay (client or server side)
- Simulate slow client (don't read from socket; trigger backpressure)
- Simulate rapid reconnects (connect, disconnect, repeat 100x)
- Verify replay correctness, backpressure disconnect, no hub crash
Approach 5: Filesystem simulation
- Fill disk (create large file to consume space)
- Make `.agentlip/` read-only (chmod 555)
- Delete `server.json` while the hub is running
- Create a lock file with a wrong PID
- Verify hub detects conditions, logs errors, fails gracefully
Approach 6: Plugin simulation
- Plugin that sleeps 10s (test timeout)
- Plugin that throws error
- Plugin that returns huge output (100MB enrichment)
- Plugin that accesses network (fetch https://example.com; test timeout)
- Verify timeout enforced, errors contained, huge outputs rejected
Approach 7: Race condition testing (deterministic)
- Use SQLite hooks (e.g., `update_hook`, `commit_hook`) to inject delays:
  - Pause between state write and event write (should be impossible; same txn)
- Pause between read and write (staleness guard)
- Verify transactions are atomic (no pause observable)
Approach 8: Chaos testing (randomized)
- Randomly:
- Kill hub mid-request (SIGKILL)
- Disconnect random WS client
- Inject random SQLite error
- Change system clock randomly
- Fill disk to random percentage
- Run for N iterations (e.g., 1000)
- Verify system recovers, no data loss, no corruption
Example test case (disk full during mutation):
test('disk full during message insert', async () => {
// Setup: create workspace, start hub
const hub = await startTestHub();
// Fill disk (mock or real filesystem limit)
await fillDisk(1024); // leave 1KB free
// Attempt mutation
const res = await fetch('http://localhost:8080/api/v1/messages', {
method: 'POST',
headers: { 'Authorization': `Bearer ${token}` },
body: JSON.stringify({
topic_id: 'topic_abc',
sender: 'agent-1',
content_raw: 'Hello world'
})
});
// Verify: 503 response
expect(res.status).toBe(503);
// Verify: no partial state (no message row, no event row)
const messages = await db.all('SELECT * FROM messages');
const events = await db.all('SELECT * FROM events');
expect(messages).toHaveLength(0);
expect(events).toHaveLength(0);
// Verify: hub still running (health check)
const health = await fetch('http://localhost:8080/health');
expect(health.status).toBe(200);
});

Example test case (concurrent edits with expected_version):
test('concurrent edits with expected_version', async () => {
// Create message (version 1)
const { message_id } = await createMessage();
// Two clients edit concurrently (both expect version 1)
const [res1, res2] = await Promise.all([
editMessage(message_id, 'Edit A', 1),
editMessage(message_id, 'Edit B', 1)
]);
// One succeeds (200, version 2), one conflicts (409, current_version 2)
const success = [res1, res2].find(r => r.status === 200);
const conflict = [res1, res2].find(r => r.status === 409);
expect(success).toBeDefined();
expect(conflict).toBeDefined();
expect(conflict.body.code).toBe('VERSION_CONFLICT');
expect(conflict.body.details.current_version).toBe(2);
// Verify: only one edit persisted
const msg = await getMessage(message_id);
expect(msg.version).toBe(2);
expect(msg.content_raw).toBe(success.body.message.content_raw);
// Verify: only one message.edited event
const editEvents = await getEvents('message.edited');
expect(editEvents).toHaveLength(1);
});

- Linux/macOS matrix
- FTS enabled/disabled where possible
- protocol compatibility lint (additive changes only in v1)
- acquire writer lock
- open DB; set PRAGMAs (WAL, foreign_keys, busy_timeout, etc.)
- apply migrations (backup first)
- generate auth token if missing (cryptographically random ≥128-bit, e.g., `crypto.randomBytes(32).toString('hex')`)
- write `server.json` with token + instance_id (chmod 0600; verify perms)
- validate localhost bind (reject `0.0.0.0` unless `--unsafe-network` flag)
- serve HTTP+WS with rate limiting and input validation
- never log auth token or full message content
- restart; writer lock reacquired after staleness check
- event log continues monotonic (DB-managed ids)
agentlip doctor:
- SQLite integrity check (`PRAGMA integrity_check`)
- WAL checkpoint status (`PRAGMA wal_checkpoint(PASSIVE)`)
- WAL file size (warn if >100MB; suggest checkpoint or investigate lock holders)
- Disk space check (warn if <1GB free)
- Schema version validation (compare `meta.schema_version` to expected)
- Foreign key constraint check (`PRAGMA foreign_key_check`)
- Event log gaps (verify `event_id` is contiguous; warn on gaps)
- Last event ID and timestamp
- server.json validation:
  - File exists and mode is 0600
  - PID is alive (if available)
  - `db_id` matches database `meta.db_id`
  - `/health` reachable and returns matching `instance_id`
- Orphaned lock files (writer.lock exists but no live hub)
- Plugin configuration validation (agentlip.config.ts syntax, plugin modules exist)
- Rate limit configuration sanity (not zero, not too high)
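A sketch of a few of these checks via PRAGMAs (bun:sqlite, read-only connection; the contiguity test assumes event_id starts at 1):

import { Database } from "bun:sqlite";

function runDoctorChecks(db: Database) {
  const integrity = db.query("PRAGMA integrity_check").get() as { integrity_check: string };
  const fkViolations = db.query("PRAGMA foreign_key_check").all(); // rows only for violations
  const events = db.query("SELECT COUNT(*) AS n, MAX(event_id) AS max_id FROM events")
    .get() as { n: number; max_id: number | null };
  return {
    integrityOk: integrity.integrity_check === "ok",
    fkViolationCount: fkViolations.length,
    // contiguous iff count equals the max id (assumes ids start at 1, never deleted)
    eventLogContiguous: events.max_id === null || events.n === events.max_id,
  };
}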
Doctor repair mode:
agentlip doctor --repair:
- Fix file permissions (chmod 0600 on server.json)
- Remove stale lock files (after confirming PID dead or /health unreachable)
- Checkpoint WAL
- Vacuum database (reclaim space)
- Reindex (rebuild indexes for performance)
- Warning: repair mode should not modify data; only fix metadata/locks/perms
Doctor output format:
Agentlip Doctor v1.0
Workspace: /Users/cole/project/.agentlip
Database: db_id abc-123-def-456
[✓] Database integrity: OK
[✓] Schema version: 1 (current)
[✓] Foreign keys: OK (0 violations)
[⚠] WAL size: 120 MB (recommend checkpoint)
[✓] Disk space: 45 GB free
[✓] Event log: 15234 events, no gaps
[✓] Server status: running (instance xyz-789, PID 12345)
[✓] server.json: valid, mode 0600
Warnings:
- WAL file is large; run `agentlip doctor --checkpoint` to reclaim space
Summary: 1 warning, 0 errors
- before migrations: timestamped copy of `db.sqlite3` (and WAL if present)
- derived tables recomputable (enrichments/extracted links)
Key metrics to track:
- Event emission rate (events/sec)
- WS connection count (current, peak)
- API request rate (per endpoint)
- Database size (main + WAL)
- Disk space (free GB, % used)
- Plugin execution time (p50, p95, p99)
- Plugin timeout count
- Lock contention (503 error count)
- Auth failures (401 error count)
- Rate limit hits (429 error count)
- Hub uptime
- Last checkpoint timestamp
Alert thresholds (suggested):
- WAL file >100MB (warn), >500MB (critical)
- Disk space <10% or <1GB (warn), <5% or <500MB (critical)
- 503 error rate >10/min (warn), >50/min (critical; lock contention)
- 429 error rate >100/min (warn; possible DoS)
- Plugin timeout rate >10% (warn; plugin bug or slow external service)
- Event backlog >10k (warn; slow WS clients)
- Hub not responding to /health for 30s (critical)
Monitoring implementation (v1):
- Hub emits structured JSON logs with metrics
- External log aggregator (e.g., Loki, CloudWatch) parses and alerts
agentlip doctor --monitor(future): CLI command to dump current metrics
Example log entry (metrics event):
{
"level": "info",
"ts": "2026-02-04T23:45:00.000Z",
"msg": "metrics",
"metrics": {
"event_rate_1m": 45.2,
"ws_connections": 12,
"db_size_mb": 234,
"wal_size_mb": 15,
"disk_free_gb": 50,
"plugin_timeout_count_1h": 3,
"api_rate_1m": 120,
"lock_contention_count_1h": 0
}
}

Disk space exhaustion:
- Symptom: `SQLITE_FULL` errors, writes fail
- Detection: monitor disk usage; alert if <10% free or <1GB
- Immediate mitigation:
- Stop accepting new messages (return 503)
- Checkpoint WAL to flush committed data to main DB
- Vacuum database (reclaim deleted space)
- Rotate/compress logs
- Prevention:
- WAL auto-checkpoint (default 1000 pages, ~4MB)
- Log rotation policy (e.g., keep 7 days, compress older)
- Message retention policy (future: auto-delete old messages in archived topics)
WAL file growth unbounded:
- Symptom: .wal file grows to hundreds of MB or GB
- Causes:
- Long-running read transaction (CLI holding open read snapshot)
- Checkpoint disabled or failing
- High write rate with no reader commit points
- Detection: monitor WAL size; alert if >100MB
- Mitigation:
  - Identify long-running readers (`PRAGMA wal_checkpoint(TRUNCATE)` shows busy status)
  - Force checkpoint: `agentlip doctor --checkpoint`
  - If CLI is the culprit: close stale connections/queries
  - If the hub is the culprit: restart the hub (WAL flushes on shutdown)
- Prevention:
  - CLI queries use `PRAGMA query_only = ON` and close connections promptly
  - Hub periodically checkpoints (e.g., every 10k events or 10 minutes)
Clock skew / time travel:
- Symptom: `ts` timestamps out of order, future timestamps, or past timestamps
- Impact: `event_id` remains authoritative (monotonic); `ts` is advisory
- Clients should sort by `event_id`, displaying `ts` for human reference only
- NTP sync recommended but not required
- If the clock jumps backward: new events have earlier `ts` than old events (cosmetic issue only)
- If the clock jumps forward: new events have far-future `ts` (cosmetic issue only)
- No correctness impact (event ordering unaffected)
Permission errors:
- `.agentlip/` directory not writable: hub cannot create lock or write server.json → exit with error
- `db.sqlite3` read-only: hub cannot acquire write lock → exit with error
- `server.json` wrong permissions (not 0600): security risk; hub should warn or refuse to start
- Plugin module files not readable: plugin load fails; log error; skip plugin (non-fatal)
File descriptor exhaustion:
- Symptom: "too many open files" error
- Causes: many WS connections, many plugin Workers, leaked file handles
- Mitigation:
  - Enforce `maxWsConnections` (default 100)
  - Close plugin Workers promptly after a job completes
  - Monitor open FDs: `lsof -p <hub_pid> | wc -l`
  - Increase ulimit if needed (OS-level config)
SQLite busy timeout edge cases:
- Transaction retries exhaust `busy_timeout` (5s default)
- Returns `SQLITE_BUSY` → hub returns 503
- Client should retry with exponential backoff
- If persistent: indicates lock contention (long-running txn, or a concurrent writer)
- Debug: check `PRAGMA wal_autocheckpoint` status, identify slow transactions
Hub port already in use:
- Scenario: previous hub crashed, OS hasn't released port yet
- Hub startup tries to bind port, fails
- Mitigation:
  - Try binding with `SO_REUSEADDR` (allows quick rebind)
  - If it still fails: try the next available port (ephemeral), update server.json
  - Or: wait 5s, retry bind (TCP TIME_WAIT delay)
- CLI: if server.json has a stale port, validate via `/health` (connection refused → stale)
Multiple hub instances (lock failure):
- Scenario: two users/processes try to start hub in same workspace
- First acquires lock, writes server.json
- Second sees the lock exists, validates via `/health`
- If the first hub is healthy: second exits with error "hub already running at port X"
- If first hub stale (crashed): second removes lock, starts fresh
- Race: both check simultaneously, both think stale, both remove lock, both start
- Mitigation: atomic lock file creation (open with O_CREAT | O_EXCL)
- If create fails: lock exists; validate staleness
- Prevents race condition
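A sketch of that atomic acquisition ('wx' = O_CREAT | O_EXCL; the `tryAcquireWriterLock` name and lock payload are illustrative):

import { openSync, writeSync, closeSync } from "node:fs";

function tryAcquireWriterLock(path: string, pid: number): boolean {
  try {
    const fd = openSync(path, "wx"); // fails with EEXIST if the lock already exists
    writeSync(fd, JSON.stringify({ pid, acquired_at: new Date().toISOString() }));
    closeSync(fd);
    return true;
  } catch (err: any) {
    if (err.code === "EEXIST") return false; // lock held: validate staleness via /health
    throw err;
  }
}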
Auth token rotation:
- Scenario: admin wants to rotate token (security best practice)
- Challenge: active clients have old token
- Procedure:
- Generate new token
- Write new token to server.json
- Hub serves both old and new tokens for grace period (e.g., 5 min)
- After grace period: reject old token
- Clients detect 401, re-read server.json, reconnect with new token
- v1: no token rotation support; require hub restart for new token
- Future: `/admin/rotate-token` endpoint (requires an existing valid token)
Schema migration failure:
- Migration SQL has syntax error or constraint violation
- Transaction rolls back automatically
- Hub exits with error "migration failed"
- Admin must fix migration SQL or restore from backup
- Backup taken before migration ensures safe rollback
Database corruption:
- Symptom: `SQLITE_CORRUPT` or integrity check fails
- Causes: disk failure, OS crash during write, bug in SQLite (rare)
- Detection: `PRAGMA integrity_check` in doctor command
- Mitigation:
  - Restore from timestamped backup (before the last migration)
  - Replay event log (events table is append-only; may survive corruption)
  - Use the `.recover` command (SQLite 3.40+) to extract data from a corrupt DB
- Prevention:
  - `PRAGMA synchronous = NORMAL` (balance safety/performance)
  - Avoid forceful shutdowns (SIGKILL); use graceful shutdown (SIGTERM)
  - Use a journaling filesystem (ext4, APFS) with barriers enabled
Plugin module not found:
- `agentlip.config.ts` references `./custom-plugins/foo.ts`, but the file doesn't exist
- Hub startup: log error, skip plugin, continue (non-fatal)
- Or: fail fast (exit with error) if plugin loading is critical
- v1 decision: warn and skip missing plugins; hub starts without them
Plugin infinite loop / CPU spike:
- Plugin has bug, uses 100% CPU, doesn't timeout (e.g., busy loop)
- Worker CPU limit: not enforceable in Bun Worker (JS has no preemption)
- Mitigation: timeout is wall-clock time (5s default); Worker killed after timeout regardless of CPU usage
- Monitor: hub tracks plugin execution time, logs slow plugins (>1s)
Plugin memory leak:
- Plugin allocates large objects, doesn't release
- Worker memory limit: `--max-old-space-size` flag (if the Worker runtime supports it)
- v1: no memory limit enforcement; rely on timeout to kill runaway plugins
- Future: track Worker RSS, kill if exceeds threshold (requires OS-level monitoring)
Network partition (localhost unreachable):
- Scenario: firewall blocks 127.0.0.1 (misconfiguration)
- Hub binds successfully but clients cannot connect
- Detection: `curl http://127.0.0.1:<port>/health` from the client machine
- If it fails: check firewall, loopback interface status
- v1: assume localhost always reachable (no special handling)
Build
- workspace discovery + init
- schema apply (core + optional FTS)
- hub `/health`, lock, `server.json`
Exit
- Gate A passes
- `agentlipd status` works
Build
- channel/topic CRUD
- send message
- edit message + tombstone delete + conflict semantics
- events table + WS replay/stream
- CLI: list/tail/page/listen (+ edit/delete)
Exit
- Gates B, C, G, H pass for message mutations
- CLI JSONL listen works with reconnect
Build
- retopic modes + fanout correctness (same-channel only)
- attachments API + CLI
- built-in URL extractor to attachments with dedupe_key
Exit
- Gate D passes
- attachment idempotency tests pass
Build
- `agentlip.config.ts` loading
- Worker isolation + timeouts + circuit breaker
- linkifier → `message.enriched`
- extractor → `topic.attachment_added`
Exit
- Gate E passes
- Gate I passes (staleness tests)
Build
- `/ui` browsing and live updates
- `@agentlip/client` + served bundle if needed
- docs + examples
Exit
- Gate F passes
- end-to-end demo script works
- ADR-0003: Replay boundary codified in docs + tests
- ADR-0005: Plugin isolation finalized (Worker defaults)
- ADR-0007: Attachment idempotency implemented (`dedupe_key` + unique index)
- ADR-0008: Edit + tombstone delete implemented (no hard deletes)
- `schema_v1.sql` with `meta` init (`db_id`, `schema_version`, `created_at`)
- Optional `schema_v1_fts.sql` with graceful fallback
- Migration scaffolding using `meta.schema_version`
- DB open helper sets PRAGMAs (`WAL`, `foreign_keys`, `busy_timeout`)
- Canonical read queries (channels, topics, tail/page, attachments, replay)
- Add columns: `edited_at`, `deleted_at`, `deleted_by`, `version`
- Triggers: forbid hard deletes on `messages`; forbid update/delete on `events`
- Implement PATCH operations: edit, delete (tombstone), retopic
- Conflict responses include `current_version`
- Version increments on edit/delete/retopic
- Writer lock acquisition with staleness handling
- Auth token generation (≥128-bit cryptographically random)
- `server.json` writing (chmod 0600 verification; never log token)
- Localhost-only bind validation (reject `0.0.0.0` by default)
- `/health` endpoint (`instance_id`, `db_id`, `schema_version`, `protocol_version`)
- Auth middleware for mutations + WS (constant-time token comparison)
- Rate limiting middleware (per-connection and global)
- Input validation and size limits (message ≤64KB, attachment ≤16KB, WS ≤256KB)
- Prepared statements for all SQL queries
- HTTP API endpoints v1
- WS endpoint: hello handshake, replay boundary, live fanout, backpressure
- Structured JSON logging (`request_id`, `event_id`; never tokens or full content)
- Graceful shutdown
- Central helper: `insertEvent(name, scopes, entity, data)`
- Scope correctness for all event types
- Dev-mode invariant assertions for scope population
- Selection queries: one/later/all
- CLI guardrails (`--mode all` requires `--force`)
- Emit per-message `message.moved_topic` events
- Enforce same-channel constraint with negative tests
- Implement `dedupe_key` with unique index
- Insert semantics: dedupe returns existing row without new event
- Validate attachment metadata (URL format, size limits, sanitize XSS payloads)
- URL extraction built-in plugin (with configurable allowlist/blocklist)
- `agentlip.config.ts` loader with config schema (workspace root only; path traversal protection)
- Worker runtime harness (RPC, timeouts, circuit breaker)
- Plugin isolation (no write access to `.agentlip/` directory)
- Linkifiers: write derived rows, emit `message.enriched`
- Extractors: insert attachments, emit `topic.attachment_added`
- Staleness guard for derived jobs (verify content + `deleted_at`; discard if tombstoned)
- Workspace discovery + DB read-only open
- Read-only commands (channel/topic/msg/attachments/search)
- Mutations via HTTP (send/edit/delete/retopic/attach)
- `listen` via WS outputting JSONL
- Stable machine-readable error codes and schemas
- Workspace discovery helper
- Read `server.json`, validate via `/health`
- WS connect with replay and reconnect loop
- Async iterator yielding typed event envelopes
- Convenience mutation methods (send/edit/delete/retopic/attach)
Connect and stream events:
import { AgentlipClient } from '@agentlip/client';
const client = new AgentlipClient({
workspacePath: process.cwd(), // auto-discover from here
afterEventId: 0, // or load from persistent storage
subscriptions: {
channels: ['general'],
topics: ['topic_xyz']
}
});
await client.connect();
// Stream events as async iterator
for await (const envelope of client.events()) {
console.log(envelope.event_id, envelope.name, envelope.data);
// Persist last processed event_id for reconnection
await saveCheckpoint(envelope.event_id);
// Handle specific event types
if (envelope.name === 'message.created') {
const msg = envelope.data.message;
console.log(`New message from ${msg.sender}: ${msg.content_raw}`);
}
}

Send message:
const result = await client.sendMessage({
topicId: 'topic_xyz',
sender: 'agent-1',
contentRaw: 'Hello from SDK'
});
console.log(`Sent message ${result.message.id} (event ${result.event_id})`);

Edit message with optimistic locking:
try {
const result = await client.editMessage({
messageId: 'msg_456',
contentRaw: 'Updated content',
expectedVersion: 2
});
console.log(`Edited to version ${result.message.version}`);
} catch (err) {
if (err.code === 'VERSION_CONFLICT') {
console.error(`Conflict: current version is ${err.details.current}`);
// Retry with current version
}
}

Retopic messages:
const result = await client.retopicMessage({
messageId: 'msg_100',
toTopicId: 'topic_archive',
mode: 'later' // or 'one', 'all'
});
console.log(`Moved ${result.affected_count} messages`);

Graceful reconnection:
client.on('disconnect', () => {
console.log('Disconnected, will reconnect...');
});
client.on('reconnect', (afterEventId) => {
console.log(`Reconnected, replaying from ${afterEventId}`);
});
// Client automatically reconnects and resumes from last processed event_id

SDK interface:
interface AgentlipClient {
// Connection lifecycle
connect(): Promise<void>;
disconnect(): Promise<void>;
// Event stream
events(): AsyncIterableIterator<EventEnvelope>;
// Mutations
sendMessage(params: SendMessageParams): Promise<SendMessageResult>;
editMessage(params: EditMessageParams): Promise<EditMessageResult>;
deleteMessage(params: DeleteMessageParams): Promise<DeleteMessageResult>;
retopicMessage(params: RetopicMessageParams): Promise<RetopicResult>;
addAttachment(params: AddAttachmentParams): Promise<AddAttachmentResult>;
renameTopic(params: RenameTopicParams): Promise<RenameTopicResult>;
// Queries (direct DB read)
listChannels(): Promise<Channel[]>;
listTopics(channelId: string): Promise<Topic[]>;
tailMessages(params: TailMessagesParams): Promise<Message[]>;
pageMessages(params: PageMessagesParams): Promise<Message[]>;
listAttachments(topicId: string): Promise<Attachment[]>;
search(query: string, filters?: SearchFilters): Promise<Message[]>;
// Events
on(event: 'disconnect', handler: () => void): void;
on(event: 'reconnect', handler: (afterEventId: number) => void): void;
on(event: 'error', handler: (err: Error) => void): void;
}
interface EventEnvelope {
event_id: number;
ts: string;
name: string;
scope: {
channel_id?: string;
topic_id?: string;
topic_id2?: string;
};
data: Record<string, unknown>;
}

- Channels/topics/messages view
- Tombstone + edit indicators
- Attachments pane (sanitize URLs; validate before rendering)
- Live updates via WS
- Security headers (CSP to prevent XSS; X-Frame-Options; X-Content-Type-Options)
- Escape all user content (message text, attachment metadata) before rendering
- Unit tests for schema + query contracts
- Integration harness (temp workspace + hub + ws client)
- Failure injection tests (plugin hang, WS slow consumer, conflict)
- Security tests:
- Rate limiting (verify 429 responses)
- Input size limits (reject oversized payloads)
- SQL injection attempts (verify prepared statements)
- Auth token leakage (verify not in logs or error responses)
- File permissions (verify server.json is 0600)
- Localhost bind (verify rejects `0.0.0.0` by default)
- Plugin isolation (verify no write access to `.agentlip/`)
- Workspace discovery (verify stops at boundary; no untrusted config loading)
- CI matrix with FTS on/off
Transaction and crash safety:
- Disk full during message insert: verify 503 returned, no partial state, transaction rolled back
- Lock contention timeout: verify 503 with Retry-After header
- WAL checkpoint failure (simulate I/O error): verify hub continues serving, WAL grows, doctor reports issue
- Power loss simulation (kill -9 during transaction): verify DB recovers cleanly, WAL replays, no corruption
- Corruption detection: inject corruption (SQLite debug mode), verify hub refuses to start, doctor detects issue
WebSocket delivery guarantees:
- Client disconnect mid-replay: reconnect with same after_event_id, verify replay restarts, no gaps
- Events committed during replay: verify boundary semantics (replay sends ≤ replay_until, live sends >replay_until), client dedupes
- Send failure mid-batch: close connection, client reconnects, verify no lost events
- Stale client (after=0 with 100k events): verify paginated replay, backpressure enforced if needed
- Clock skew: set system clock backward, emit events, verify event_id monotonic (ts may be out of order)
- Hub restart during active connections: verify graceful close (1001), clients reconnect with last processed event_id
Concurrent mutations:
- Two edits racing (no expected_version): both succeed, version increments twice, both events emitted
- Two edits racing (both expected_version=1): first succeeds, second conflicts (409 with current_version)
- Edit vs. delete race: delete succeeds, subsequent edit rejected (400 "cannot edit deleted message")
- Edit vs. retopic race: retopic increments version, concurrent edit conflicts
- Delete vs. delete race: second delete is idempotent (200, no new event)
- Rapid successive edits (10 edits in 1s): all succeed, version increments to 11, all events emitted
- Retopic "all" concurrent with new message insert: verify serialization (message either included or not, no partial state)
- Version overflow (simulate 2^63 edits): verify overflow handling or rejection
Plugin and derived data:
- ABA problem (edit back to original): verify version-based staleness guard discards outputs
- TOC/TOU race (content changes during verification): verify transactional verification prevents stale commits
- Multiple plugins concurrently: verify both succeed, dedupe_key prevents duplicate attachments
- External state change (URL title changes): verify no automatic update, dedupe prevents duplicate
- Message deleted while plugin running: verify staleness guard checks deleted_at, discards outputs
- Plugin timeout: verify hub continues serving, no enrichments committed, timeout logged
- Plugin emits outputs, message edited before commit: verify version guard discards
- Retopic during plugin execution: verify version guard discards (version changed)
- Hub restart during plugin execution: verify plugins exit, no auto-retry, messages remain un-enriched
- Concurrent edits triggering multiple plugins: verify only latest version's enrichments persist
Retopic edge cases:
- Retopic to same topic: verify idempotent success (200, no events)
- Retopic of tombstoned message: verify allowed, message moves (still deleted)
- Retopic with stale expected_version: verify conflict (409)
- Source topic deleted during retopic: verify 0 affected (200) or constraint error
- Target topic deleted during retopic: verify foreign key constraint error (400)
- Retopic "all" with 10k messages: verify succeeds (or rejected if batch limit enforced)
- Retopic "later" anchor at end: verify only anchor moves
- Retopic with sparse IDs: verify selection uses >= correctly
- Concurrent retopics on same topic: verify topic_id re-check prevents double-move
- Cross-channel retopic attempt: verify 400 error, no state change, no events
Operational edge cases:
- Disk space exhaustion: verify writes fail gracefully (503), checkpoint releases space
- WAL growth to 500MB: verify doctor reports warning, checkpoint truncates
- Clock skew (set clock +1 hour): verify event_id order preserved, ts jumps forward
- Permission error (server.json not writable): verify hub exits with clear error
- File descriptor exhaustion: verify connection limit enforced, new connections rejected (503)
- SQLite busy timeout: simulate long txn, concurrent write, verify 503 after timeout
- Hub port already in use: verify SO_REUSEADDR or port increment, server.json updated
- Multiple hub instances: verify lock file prevents second start (or removes stale lock)
- Schema migration failure: verify rollback, backup preserved, hub exits with error
- Database corruption: verify integrity check fails, doctor detects, hub refuses to start
- Plugin module not found: verify warning logged, hub starts without plugin
- Plugin infinite loop: verify timeout kills Worker (wall-clock, not CPU-based)
- Plugin memory leak: verify timeout eventually kills (no memory limit in v1)
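
A sketch of the lock-file path exercised above, assuming O_CREAT|O_EXCL semantics via the `wx` flag and /health validation before declaring a lock stale (`acquireLock` and the health URL parameter are illustrative):

```ts
import { openSync, writeSync, closeSync, readFileSync, unlinkSync } from "node:fs";

async function acquireLock(path: string, healthUrl: string, retried = false): Promise<void> {
  try {
    const fd = openSync(path, "wx"); // O_CREAT|O_EXCL: fails atomically if the lock exists
    writeSync(fd, String(process.pid));
    closeSync(fd);
  } catch {
    // Lock exists: only treat it as stale if the hub it names is not answering.
    const alive = await fetch(healthUrl).then(r => r.ok).catch(() => false);
    if (alive || retried) {
      throw new Error(`another hub holds ${path} (pid ${readFileSync(path, "utf8").trim()})`);
    }
    unlinkSync(path); // stale lock from a crash: remove and retry exactly once
    return acquireLock(path, healthUrl, true);
  }
}
```
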
Attachment idempotency:
- Insert same attachment twice: verify dedupe (no new event, existing ID returned)
- Concurrent attachment inserts with same dedupe_key: verify unique constraint, one succeeds
- `dedupe_key` computed by hub (not provided by client): verify deterministic computation, idempotent (see the sketch after this list)
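
A sketch of the idempotent insert: lean on the unique index over `(topic_id, kind, key, dedupe_key)` and emit an event only when a row was actually inserted, so retries return the existing ID with no new event (column names follow the schema shapes in this doc):

```ts
import { Database } from "bun:sqlite";

function insertAttachment(db: Database, topicId: number, kind: string,
                          key: string, dedupeKey: string, meta: string) {
  const tx = db.transaction(() => {
    db.query(
      `INSERT INTO attachments (topic_id, kind, key, dedupe_key, meta)
       VALUES (?, ?, ?, ?, ?)
       ON CONFLICT (topic_id, kind, key, dedupe_key) DO NOTHING`
    ).run(topicId, kind, key, dedupeKey, meta);
    const inserted = (db.query("SELECT changes() AS n").get() as { n: number }).n === 1;
    if (inserted) {
      // No-event on dedupe: the event fires only for the first, winning insert.
      db.query("INSERT INTO events (type, payload) VALUES ('attachment_added', ?)")
        .run(JSON.stringify({ topic_id: topicId, kind, key }));
    }
    return db.query(
      "SELECT attachment_id FROM attachments WHERE topic_id=? AND kind=? AND key=? AND dedupe_key=?"
    ).get(topicId, kind, key, dedupeKey); // same ID on first insert and on retry
  });
  return tx();
}
```
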
Event log integrity:
- Event IDs strictly increasing: insert 1000 messages concurrently, verify event_id sequence has no gaps
- Event immutability: attempt UPDATE/DELETE on events table, verify trigger prevents
- Message hard delete prevention: attempt DELETE on messages table, verify trigger prevents (trigger sketch after this list)
- Scope correctness: verify every event has correct scope_channel_id, scope_topic_id, scope_topic_id2 (audit all event types)
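
A shape-accurate (not final) sketch of the triggers these tests probe: events can never be rewritten or removed, and messages can never be hard-deleted (tombstone via UPDATE of `deleted_at` only):

```ts
import { Database } from "bun:sqlite";

const db = new Database(".agentlip/db.sqlite3");

db.run(`CREATE TRIGGER IF NOT EXISTS events_no_update BEFORE UPDATE ON events
        BEGIN SELECT RAISE(ABORT, 'events are immutable'); END`);
db.run(`CREATE TRIGGER IF NOT EXISTS events_no_delete BEFORE DELETE ON events
        BEGIN SELECT RAISE(ABORT, 'events are append-only'); END`);
db.run(`CREATE TRIGGER IF NOT EXISTS messages_no_delete BEFORE DELETE ON messages
        BEGIN SELECT RAISE(ABORT, 'hard delete forbidden; tombstone via UPDATE deleted_at'); END`);
```
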
Rate limiting:
- Per-connection limit (100 req/s): send 200 requests in 1s, verify 429 after 100
- Global limit (1000 req/s): 20 clients send 60 req/s each, verify 429 after 1000 total
- Rate limit reset: wait for window to expire, verify limit resets (see the counter sketch after this list)
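
A minimal fixed-window counter matching the limits above (100 req/s per connection, 1000 req/s global); the window bookkeeping is illustrative, and a token bucket would work equally well:

```ts
const PER_CONN = 100;
const GLOBAL = 1000;
const perConn = new Map<string, number>();
let globalCount = 0;
let windowStart = Date.now();

function allow(connId: string): boolean {
  const now = Date.now();
  if (now - windowStart >= 1000) {
    // Window expired: reset all counters (this is the "rate limit reset" case).
    perConn.clear();
    globalCount = 0;
    windowStart = now;
  }
  const n = (perConn.get(connId) ?? 0) + 1;
  perConn.set(connId, n);
  globalCount++;
  return n <= PER_CONN && globalCount <= GLOBAL; // false → respond 429
}
```
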
Security boundary tests:
- SQL injection in message content: insert `'; DROP TABLE messages; --`, verify no SQL execution
- SQL injection in channel name: create channel with `'; DROP TABLE channels; --`, verify no SQL execution
- Oversized message (100KB): verify 400 rejection
- Oversized attachment (100KB): verify 400 rejection
- Oversized WS message (1MB): verify connection closed
- Auth token in logs: send request with token, verify token not in log output (search for token string)
- Auth token in error response: send invalid request, verify token not echoed in response (see the token-handling sketch after this list)
- server.json permissions: create server.json with mode 0644, verify hub fixes or refuses to start
- Localhost bind check: configure hub with 0.0.0.0, verify rejection (unless --unsafe-network)
- Plugin write attempt: plugin tries to write to `.agentlip/db.sqlite3`, verify permission denied or isolation prevents it
- Workspace discovery upward traversal: create `.agentlip/` in parent dir, run CLI in child, verify discovery stops at workspace root
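
The token checks above assume the comparison itself is constant-time and that token material never reaches logs or error bodies. A minimal sketch (the response shape and `onUnauthorized` helper are illustrative):

```ts
import { timingSafeEqual } from "node:crypto";

function checkToken(presented: string, expected: string): boolean {
  const a = Buffer.from(presented);
  const b = Buffer.from(expected);
  // On length mismatch, compare expected against itself so timing stays uniform.
  if (a.length !== b.length) return timingSafeEqual(b, b) && false;
  return timingSafeEqual(a, b);
}

function onUnauthorized(res: { status: number; body: string }) {
  res.status = 401;
  res.body = JSON.stringify({ error: "unauthorized" }); // generic: never echo the token
  console.warn("auth failure");                          // log the event, not the token
}
```
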
Migration edge cases:
- Upgrade 1→2 with data: apply migration, verify schema_version updated, data intact
- Downgrade attempt (schema_version=2, hub expects 1): verify hub refuses to start
- Migration with constraint violation: simulate a migration that fails, verify rollback, backup preserved (see the runner sketch after this list)
- Concurrent hub start during migration: verify second hub sees lock, waits or exits
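
A sketch of the forward-only migration flow these cases cover: back up, apply inside a transaction, bump schema_version, and keep the backup on failure. It assumes the hub is not yet serving traffic, plus an illustrative `schema_version` single-row table and `Migration` shape:

```ts
import { Database } from "bun:sqlite";
import { copyFileSync } from "node:fs";

type Migration = { version: number; apply: (db: Database) => void };

function migrate(db: Database, dbPath: string, migrations: Migration[]) {
  const { version: current } = db
    .query("SELECT version FROM schema_version").get() as { version: number };
  const pending = migrations
    .filter(m => m.version > current)
    .sort((a, b) => a.version - b.version);
  for (const m of pending) {
    db.run("PRAGMA wal_checkpoint(TRUNCATE)");                 // flush WAL before copying
    copyFileSync(dbPath, `${dbPath}.backup.v${m.version - 1}`); // backup before apply
    try {
      db.transaction(() => {
        m.apply(db); // DDL participates in the transaction in SQLite
        db.query("UPDATE schema_version SET version = ?").run(m.version);
      })();
    } catch (e) {
      // Transaction rolled back automatically; the backup is preserved for recovery.
      throw new Error(`migration to v${m.version} failed: ${e}`);
    }
  }
}
```

Downgrade refusal is the mirror image: if the stored version exceeds what the hub understands, exit before serving.
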
Documentation deliverables:
- Protocol doc (handshake, replay, event types, conflicts)
- Ops doc (startup, recovery, migrations, doctor)
- Security doc:
  - Threat model and trust boundaries
  - Auth token handling and rotation
  - Plugin security model and risks (v1: network/filesystem access)
  - Privacy implications (immutable event log; no secure erasure)
  - Safe defaults and configuration
  - Rate limits and resource constraints
- Examples: multi-agent + human demo script
Glossary:
- Workspace: Repository directory containing `.agentlip/` state
- Channel: Long-lived bucket for project/team scope
- Topic: Thread entity with a stable ID; belongs to a channel
- Message: Stable identity; mutable via explicit edit; deletable via tombstone
- Event: Durable append-only log entry ordered by `event_id`; the integration surface
- Enrichment: Derived structured expansions for tokens in message text
- Attachment: Topic-scoped structured grounding metadata
- Single writer: Only the hub process writes to SQLite
Risk register (mitigation and residual risk per item):
- Duplicate attachments due to retries
  - Mitigation: `dedupe_key` + unique index + no-event on dedupe
  - Residual risk: client-computed dedupe_key may have collisions (hash-based); use the full URL as dedupe_key for v1
- WS clients miss events due to replay/live boundary bug
  - Mitigation: explicit `replay_until` contract + integration tests
  - Residual risk: events committed exactly at the replay_until boundary may cause edge cases; client deduplication handles them
- Two hub instances (lock file race)
  - Mitigation: atomic lock file creation (O_CREAT|O_EXCL) + /health validation + fail fast
  - Residual risk: NFS or network filesystems may not guarantee atomicity; detect via instance_id mismatch
- Plugin hangs (infinite loop, network timeout)
  - Mitigation: Worker isolation, wall-clock timeouts (not CPU-based), circuit breaker after N failures
  - Residual risk: a Worker CPU spike may degrade hub performance (JS is single-threaded); monitor hub CPU
- Schema drift breaks stateless CLI
  - Mitigation: additive evolution + migrations + query contract tests
  - Residual risk: schema_version mismatch between CLI and DB; CLI should check and warn
- Edits cause stale derived outputs
  - Mitigation: version-match + content-match + `deleted_at` staleness guard in the same transaction as the insert; re-enqueue on edit; Gate I
  - Residual risk: ABA problem if only content is compared; version comparison required
- WAL file growth unbounded (reader holds snapshot)
  - Mitigation: monitor WAL size, periodic checkpoint, CLI closes queries promptly
  - Residual risk: a long-running CLI query (e.g., FTS search) may prevent checkpointing; time out CLI queries
- Disk space exhaustion (WAL + logs)
  - Mitigation: monitor disk usage, checkpoint on low space, log rotation, reject writes if <1GB free
  - Residual risk: rapid growth may fill the disk before monitoring detects it; preemptive limits
- Lock contention timeout (busy database)
  - Mitigation: `busy_timeout` 5s, return 503 with Retry-After, client exponential backoff
  - Residual risk: a pathological write pattern (e.g., retopic of 100k messages) may block all writes; enforce batch limits
- Clock skew (NTP failure, manual time change)
  - Mitigation: event_id is the authoritative order, not `ts`; document client sorting behavior
  - Residual risk: `ts` may be confusing in UI (out of order); display a warning if `ts` jumps >1 hour
- Migration failure mid-apply (constraint violation)
  - Mitigation: migrations in a transaction, backup before apply, rollback on error, admin manual intervention
  - Residual risk: the backup may be stale if writes occurred during migration prep; stop the hub before migrating
- Database corruption (disk failure, OS crash)
  - Mitigation: `PRAGMA synchronous=NORMAL`, avoid SIGKILL, journaling filesystem, integrity checks in doctor
  - Residual risk: unrecoverable corruption; restore from backup and replay the event log (events table is append-only)
- Plugin module not found or syntax error
  - Mitigation: warn and skip the plugin; hub starts anyway (graceful degradation)
  - Residual risk: a missing plugin may be critical; option to fail fast if `plugin.required = true`
- Hub port already in use (previous crash)
  - Mitigation: SO_REUSEADDR, retry bind, fall back to an ephemeral port
  - Residual risk: clients may have a stale server.json; validate via /health
- File descriptor exhaustion (many WS connections, leaked handles)
  - Mitigation: enforce maxWsConnections (100), close Workers promptly, monitor open FDs
  - Residual risk: OS-level ulimit may be low; document the requirement (e.g., ulimit -n 1024)
- Auth token leakage (logs, error messages, file perms)
  - Mitigation: chmod 0600 on server.json; never log the token; constant-time comparison; no token in error responses
  - Residual risk: the token may leak via process args if passed as a flag; use a file-based token only
- SQL injection via user inputs
  - Mitigation: prepared statements only; no string concatenation in queries; input validation
  - Residual risk: none if the policy is enforced; audit all queries
- DoS via API abuse (large payloads, rapid requests)
  - Mitigation: rate limits (per-connection + global); size limits on all inputs; backpressure on WS
  - Residual risk: distributed attack (many clients); add IP-based limits (future, requires a reverse proxy)
- Malicious plugin (filesystem access, network abuse, resource exhaustion)
  - Mitigation: Worker isolation; timeouts; no write access to `.agentlip/`; future: explicit capability grants
  - Residual risk: v1 plugins CAN access the network and filesystem (Worker limitations); document the risk
- Path traversal during workspace discovery
  - Mitigation: stop at the filesystem boundary; never load `agentlip.config.ts` from untrusted parent dirs
  - Residual risk: symlink attack (`.agentlip` symlinked to an attacker-controlled dir); resolve symlinks, validate ownership
- Sensitive data in event log (user thinks "deleted" = erased)
  - Mitigation: document clearly that tombstones do not erase; events are immutable; old content may persist in historical events
  - Residual risk: users expect secure deletion; add an "archive-and-purge" workflow (future, requires v2 with event log truncation)
- Untrusted workspace config (code execution via `agentlip.config.ts`)
  - Mitigation: only load from the discovered workspace root; document that the workspace is trusted; consider signature verification (future)
  - Residual risk: a developer clones a malicious repo and runs the CLI; code executes; warn on untrusted workspaces
- XSS or injection via attachment URLs in UI
  - Mitigation: UI must sanitize/escape attachment metadata; CSP headers; URL validation
  - Residual risk: complex URL schemes (javascript:, data:) may bypass filters; whitelist schemes (http, https, file)
- Replay timing attack (infer message content from event timing)
  - Mitigation: v1 none; localhost-only reduces risk
  - Residual risk: a malicious local process could observe timing; future: add jitter to event timestamps
- Auth token brute force (if short token)
  - Mitigation: token is ≥128-bit (32 hex chars = 128 bits of entropy); constant-time comparison prevents timing attacks
  - Residual risk: none if token generation is secure (crypto.randomBytes)
- TOCTOU in staleness guard (content changes between read and insert)
  - Mitigation: perform the verification read and the derived insert in the same transaction
  - Residual risk: none if transaction isolation is correct
- Retopic fanout missing subscriber (topic_id2 not indexed)
  - Mitigation: index on scope_topic_id2; verify fanout logic includes topic_id2 matches
  - Residual risk: a missing index would cause slow fanout, not incorrect fanout
- Event log gaps (event_id skip due to rollback)
  - Mitigation: SQLite autoincrement reuses rolled-back IDs in the same session, but not across restarts; gaps are possible after a crash
  - Residual risk: clients assume contiguous event_id; doctor should detect gaps and warn
- Hub crashes during graceful shutdown (partial cleanup)
  - Mitigation: critical cleanup (lock removal, server.json deletion) should be idempotent; the next start cleans up stale files
  - Residual risk: a stale server.json may confuse clients; validate via /health
- Client storage corruption (loses last processed event_id, replays millions)
  - Mitigation: the client decides the replay policy (full replay or skip history); hub enforces maxEventReplayBatch to paginate (see the client-side sketch after this list)
  - Residual risk: full replay of a large event log (1M+ events) may take minutes; consider a replay TTL (e.g., only replay the last 7 days)
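
Several of the risks above bottom out in the same client obligation: resume from the last processed event_id and drop duplicates, so replay and the live stream can overlap safely under at-least-once delivery. A minimal client-side sketch; `loadCheckpoint`, `saveCheckpoint`, `handle`, and `reconnectWithBackoff` are hypothetical helpers, and the port and `after` query parameter are assumptions standing in for the protocol doc's handshake:

```ts
declare function loadCheckpoint(): number | null;
declare function saveCheckpoint(id: number): void;
declare function handle(ev: { event_id: number }): void;
declare function reconnectWithBackoff(after: () => number): void;

let lastEventId = loadCheckpoint() ?? 0; // persisted by the client

const ws = new WebSocket(`ws://127.0.0.1:8787/ws?after=${lastEventId}`);
ws.onmessage = (msg: MessageEvent) => {
  const ev = JSON.parse(String(msg.data)) as { event_id: number };
  if (ev.event_id <= lastEventId) return; // duplicate (replay/live overlap): drop
  handle(ev);
  lastEventId = ev.event_id;              // advance only after handling succeeds
  saveCheckpoint(lastEventId);            // crash-safe resume point
};
ws.onclose = () => reconnectWithBackoff(() => lastEventId);
```
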
Quality gate checklist:
- Mutation path uses one transaction for state+event
- Event scopes populated correctly
- Replay query is index-backed (EXPLAIN QUERY PLAN in dev; see the sketch after this list)
- WS replay/live boundary tests pass
- Conflict semantics tests pass (expected_version)
- Tombstone delete leaves row intact + emits event
- No hard deletes possible (trigger enforced)
- Plugin timeout tests pass
- Derived staleness guard tests pass (including tombstone check)
- Disk full during mutation: verify 503 returned, no partial state
- Lock timeout during mutation: verify 503 with Retry-After
- WAL checkpoint failure: verify hub continues serving (degraded mode)
- Crash during transaction: verify WAL recovery, atomicity preserved
- Concurrent edits (no expected_version): both succeed, correct version sequence
- Concurrent edits (with expected_version): second conflicts with current_version
- Edit of tombstoned message: verify rejection (400)
- Delete of already-deleted message: verify idempotent success (200)
- Retopic to same topic: verify idempotent success (200, no events)
- Retopic with concurrent topic deletion: verify handles gracefully (0 affected or constraint error)
- Retopic with concurrent retopic: verify topic_id re-check prevents anomalies
- Plugin staleness (ABA problem): verify version-based guard discards
- Plugin staleness (TOCTOU): verify transactional check-then-insert
- Plugin timeout: verify hub continues, no stale commits
- Message deleted during plugin run: verify deleted_at guard discards
- Attachment dedupe: verify unique constraint, no duplicate events
- WS events during replay: verify boundary semantics, client dedupes
- WS disconnect mid-replay: verify reconnect resumes correctly
- Clock skew: verify event_id monotonicity preserved (ts may be out of order)
- Rapid successive edits: verify all succeed, no lost events, version correct
- Retopic "all" with 10k messages: verify succeeds or batch limit enforced
- Multiple hub instances: verify lock prevents concurrent start
- Schema migration failure: verify rollback, backup preserved
- Database corruption: verify doctor detects, hub refuses to start
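
For the index-backed replay gate, a dev-time assertion can parse SQLite's EXPLAIN QUERY PLAN output; the query shape and batch size here are assumptions, and the regex only needs to match SQLite's "USING ... INDEX" / "USING INTEGER PRIMARY KEY" detail strings:

```ts
import { Database } from "bun:sqlite";

const db = new Database(".agentlip/db.sqlite3");

const plan = (db
  .query("EXPLAIN QUERY PLAN SELECT * FROM events WHERE event_id > ? ORDER BY event_id LIMIT 500")
  .all(0) as { detail: string }[])
  .map(r => r.detail)
  .join("\n");

// Fail loudly in dev if the replay path would fall back to a full table scan.
if (!/USING (COVERING )?INDEX|USING INTEGER PRIMARY KEY/.test(plan)) {
  throw new Error(`replay query is not index-backed:\n${plan}`);
}
```
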
Security audit checklist:
- All SQL uses prepared statements (audit for string concatenation)
- Auth token never appears in logs or error responses
- `server.json` has mode 0600 (verify programmatically)
- Hub rejects `0.0.0.0` bind by default
- Rate limits enforced (test with burst requests)
- Input size limits enforced (test with oversized payloads)
- Plugin isolation verified (cannot write to `.agentlip/`)
- Workspace discovery stops at boundary (test with untrusted parent)
- Error responses are generic (no path/token leakage)
- SQL injection in all text fields: verify prepared statements prevent
- Auth token in logs (search for token literal): verify not present
- Auth token in error response (test invalid request): verify not echoed
- server.json wrong permissions: verify hub fixes or refuses to start
- Localhost bind with 0.0.0.0: verify rejection (unless --unsafe-network flag)
- Plugin filesystem write: verify isolation prevents or permission denied
- Plugin network abuse: verify timeout limits duration (v1: no network blocking)
- Rate limit bypass (multiple connections): verify global limit enforced
- Oversized payload (message, attachment, WS): verify size limits enforced at all layers
- XSS in attachment URL (UI): verify sanitization before rendering
Operational checklist:
- Disk space monitoring: verify doctor reports low disk space
- WAL size monitoring: verify doctor reports large WAL (>100MB)
- WAL checkpoint: verify `agentlip doctor --checkpoint` succeeds (see the sketch after this list)
- File descriptor limit: verify connection limit prevents exhaustion
- Hub graceful shutdown: verify closes WS (1001), flushes WAL, removes lock
- Hub crash cleanup: verify stale lock removed on next start
- Hub port conflict: verify SO_REUSEADDR or port increment
- Plugin module missing: verify warning logged, hub starts
- Plugin infinite loop: verify timeout enforced (wall-clock)
- Long-running CLI query: verify doesn't block hub writes (WAL)
- Multiple simultaneous CLI queries: verify all succeed (read concurrency)
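
A sketch of what `doctor --checkpoint` can do, assuming the standard SQLite `wal_checkpoint` pragma; the WAL path, the `doctorCheckpoint` name, and the 100MB threshold mirror the checks above and are illustrative:

```ts
import { Database } from "bun:sqlite";
import { statSync } from "node:fs";

function doctorCheckpoint(db: Database, walPath = ".agentlip/db.sqlite3-wal") {
  const before = statSync(walPath).size; // throws if no WAL file exists yet
  if (before > 100 * 1024 * 1024) console.warn(`WAL is large: ${before} bytes`);
  // wal_checkpoint returns one row: busy (blocked?), log (WAL frames), checkpointed.
  const row = db.query("PRAGMA wal_checkpoint(TRUNCATE)").get() as
    { busy: number; log: number; checkpointed: number };
  if (row.busy) console.warn("checkpoint blocked by a reader (e.g., long CLI query); retry later");
  return row;
}
```

A blocked checkpoint here is the expected interaction with the long-running CLI query case above: WAL readers never block hub writes, but they can defer truncation.
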