# MosaicDB

### A Distributed, Federated Semantic Search Engine Built on SQLite Shards

MosaicDB is an experimental distributed query engine that performs **hybrid vector + metadata search** across many **immutable SQLite shard files**. Each shard contains:

* Document text or metadata
* Vector embeddings (`sqlite-vss`)
* PageRank or other ranking signals

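As an illustration, a single shard's layout can be sketched with Python's built-in `sqlite3` module. The table and column names below are assumptions for the sketch, not MosaicDB's actual schema (see `docs/SHARD_FORMAT.md` for the real layout); production shards store vectors in a `sqlite-vss` virtual table, while this sketch packs them into a plain BLOB so it runs without the extension:

```python
import sqlite3
import struct

# Illustrative shard schema; names are assumptions, not the real format.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id       INTEGER PRIMARY KEY,
    text     TEXT NOT NULL,
    metadata TEXT,              -- JSON-encoded metadata
    pagerank REAL DEFAULT 0.0   -- ranking signal
);
CREATE TABLE embeddings (
    doc_id    INTEGER REFERENCES documents(id),
    embedding BLOB NOT NULL     -- packed float32 vector
);
""")

vec = struct.pack("3f", 0.1, 0.2, 0.3)  # toy 3-dimensional embedding
conn.execute("INSERT INTO documents VALUES (1, 'hello world', '{}', 0.42)")
conn.execute("INSERT INTO embeddings VALUES (1, ?)", (vec,))

row = conn.execute(
    "SELECT d.text, d.pagerank FROM documents d "
    "JOIN embeddings e ON e.doc_id = d.id"
).fetchone()
print(row)  # ('hello world', 0.42)
```

Because shards are immutable, files like this can be built offline and shipped to nodes as plain files.
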
Elixir acts as the **coordinator and control plane**, orchestrating fan-out queries, retries, merges, caching, and ranking.

---

# Features

* Federated search across multiple SQLite shards
* Vector similarity search using `sqlite-vss`
* Metadata-aware filtering
* PageRank-based reranking
* LRU embedding cache
* Distributed coordinator architecture
* HTTP API for search
* Metrics via Prometheus/Grafana

MosaicDB combines **SQLite simplicity with Erlang/Elixir scale**. Each node is a lightweight SQLite database capable of storing both vector embeddings and structured metadata. Distributed across multiple nodes, MosaicDB provides fault-tolerant, scalable storage **without the overhead of managed clusters**.

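Of the features above, the LRU embedding cache is easy to illustrate in isolation. This is a language-agnostic sketch in Python (in MosaicDB the cache sits behind the Elixir coordinator and Redis); the capacity bound and keys are invented:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least-recently-used entry at capacity."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the oldest entry

cache = LRUCache(2)
cache.put("query: cats", [0.1, 0.2])
cache.put("query: dogs", [0.3, 0.4])
cache.get("query: cats")                # touch "cats" so "dogs" is oldest
cache.put("query: fish", [0.5, 0.6])    # evicts "dogs"
print(cache.get("query: dogs"))  # None
```
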
---

# Feature Comparison

| Feature         | PostgreSQL                    | Pinecone      | Weaviate      | MosaicDB (SQLite nodes)           |
| --------------- | ----------------------------- | ------------- | ------------- | --------------------------------- |
| SQL support     | Yes                           | No            | No            | Yes, native SQLite queries        |
| Vector search   | Extension needed (pgvector)   | Yes           | Yes           | Yes, exact or approximate         |
| Distribution    | Manual (sharding/replication) | Managed       | Managed       | Built-in via Elixir/Erlang        |
| Fault tolerance | Manual / HA setups            | Cloud-managed | Cloud-managed | Erlang/Elixir supervision trees   |
| Lightweight     | Moderate                      | No            | No            | Each node is a single SQLite file |
| Edge-ready      | No                            | No            | No            | Yes, nodes are self-contained     |

**Developer Pitch:**
MosaicDB gives developers a **lightweight, distributed vector + relational database** where each node is just a SQLite file. Fully SQL-capable, fault-tolerant via Erlang/Elixir, and easy to deploy at the edge: you get vector search + relational queries in one place, without complex cluster management or cloud lock-in. It's **SQLite simplicity with Erlang reliability**.

---

# Why Elixir?

MosaicDB uses Elixir for its coordination layer because it naturally fits **federated query execution**:

### Concurrency for fan-out search

Each shard query runs as an isolated BEAM process: no thread pools, no shared state, no locks.

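The fan-out pattern can be sketched in Python with a thread pool standing in for BEAM processes (threads lack BEAM's isolation and preemptive scheduling, so this is only an analogy; the shard names and latencies are invented):

```python
import concurrent.futures
import time

# Simulated shards: name -> response latency in seconds (made-up values).
SHARDS = {"shard-a": 0.01, "shard-b": 0.02, "shard-c": 1.0}

def query_shard(name: str, latency: float) -> str:
    time.sleep(latency)  # stand-in for a SQLite vector query
    return f"{name}: results"

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {pool.submit(query_shard, n, lat): n for n, lat in SHARDS.items()}
    # Wait up to 200 ms; slow shards are reported, not allowed to block.
    done, pending = concurrent.futures.wait(futures, timeout=0.2)
    results = sorted(f.result() for f in done)
    timed_out = [futures[f] for f in pending]

print(results)    # ['shard-a: results', 'shard-b: results']
print(timed_out)  # ['shard-c']
```

The fast shards answer within the deadline while the slow one is merely noted, which is the behavior the coordinator relies on.
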
### Supervisor-based fault tolerance

Shard errors, timeouts, or node failures are isolated and automatically recovered.

### Predictable under load

The BEAM scheduler ensures slow shards do not block others.

### Built-in distribution

Elixir nodes auto-discover and form a cluster, enabling multi-node coordination without external registries.

### Clean pipeline composition

Query planning, merging, and reranking are expressed using functional pipelines and pattern matching.

### Observability

LiveDashboard, telemetry, and introspection tools simplify distributed debugging.

**In short:** Elixir is the resilient, concurrent **control plane** around fast SQLite shards.

---

# Quick Start

## Build and run

```bash
make build
make up
```

Check health:

```bash
curl http://localhost/health
```

---

# API

### Health

```bash
curl http://localhost/health
```

### Search (placeholder API)

```bash
curl -X POST http://localhost/api/search \
  -H "Content-Type: application/json" \
  -d '{"query": "test"}'
```

---

# Components

| Service     | Port | Description                |
| ----------- | ---- | -------------------------- |
| Coordinator | 4040 | Elixir-based query router  |
| Nginx       | 80   | Load balancer / entrypoint |
| Redis       | 6379 | Metadata + embedding cache |
| Prometheus  | 9090 | Metrics                    |
| Grafana     | 3000 | Dashboards                 |

---

# Development

Install dependencies:

```bash
mix deps.get
```

Run the system:

```bash
mix run --no-halt
```

---

# Basic Architecture

```
      Client Query
           │
         Nginx
           │
   Coordinator (Elixir)
  ┌────────┴─────────┐
  │ fan-out async RPC│
  └────────┬─────────┘
           │
    Many SQLite Shards
           │
  Vector + metadata search
           │
  Coordinator merges + ranks
           │
        Response
```
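
The final merge-and-rank stage can be illustrated with a small Python sketch. The scoring formula, a weighted blend of vector similarity and PageRank with weight `alpha`, is an assumption for illustration, not MosaicDB's actual ranking function, and the shard results are invented:

```python
import heapq

# Hypothetical per-shard results: (doc_id, cosine_similarity, pagerank).
shard_results = [
    [("a1", 0.92, 0.10), ("a2", 0.75, 0.80)],  # from shard A
    [("b1", 0.88, 0.50)],                      # from shard B
]

def merge_and_rank(per_shard, k=2, alpha=0.7):
    """Blend similarity and PageRank, then keep the global top-k doc ids."""
    scored = [
        (alpha * sim + (1 - alpha) * pr, doc_id)
        for results in per_shard
        for doc_id, sim, pr in results
    ]
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]

print(merge_and_rank(shard_results))  # ['b1', 'a2']
```

Note how `a2` and `b1` outrank the highest-similarity hit `a1` once PageRank is blended in; the coordinator performs this step after the fan-out responses arrive.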

---

# Scaling

To scale horizontally, edit `docker-compose.yml` and increase the number of coordinator workers:

```yaml
services:
  coordinator:
    # ... existing service definition ...
    scale: 4   # run four coordinator containers (service name is illustrative)
```

Then:

```bash
make restart
```

Elixir nodes will auto-discover each other (via libcluster) and share load.

---

# Documentation

* `docs/ARCHITECTURE.md` – data flow, shard layout, search pipeline
* `docs/DEPLOYMENT_GUIDE.md` – running MosaicDB in production
* `docs/SHARD_FORMAT.md` – SQLite schema, embeddings, PageRank structure

---

# License

MIT