Proposed
2026-03-23
Samyama currently handles ~7.7M nodes on a single Mac Mini M4 (16GB) and demonstrated a 3-node cluster with 426K nodes across AWS spot instances (2026-03-22). For production healthcare, financial, and social graph workloads, we need to scale to 1 billion+ nodes on a small cluster (3-5 commodity nodes) without vertical scaling.
| System | Architecture | Scale Demonstrated | Notes |
|---|---|---|---|
| TigerGraph | Native MPP, shared-nothing | 73B vertices, 534B edges (36TB raw) | Hybrid memory-disk, distributed GSQL |
| Neo4j | Leader-follower replication | ~100M per node, no distributed writes | Cannot maintain ACID on 2 nodes (Dan McCreary) |
| NebulaGraph | Shared-nothing, Raft per partition | 100B+ edges claimed | Storage-compute separation |
| Kuzu | Embedded, columnar | 280M nodes, 1.7B edges (LDBC SF100) | Single-node only, fastest embedded |
| Memgraph | In-memory, ACID | Single-node focus, replication only | No distributed writes |
Key insight: No open-source graph database does distributed ACID writes at billion scale. TigerGraph (proprietary) is the only system that has demonstrated this.
| Instance | vCPU | RAM | Network | Spot $/hr | Monthly |
|---|---|---|---|---|---|
| r8g.2xlarge (Graviton4) | 8 | 64 GB | 15 Gbps | ~$0.15 | ~$108 |
| r8g.4xlarge | 16 | 128 GB | 15 Gbps | ~$0.30 | ~$216 |
| r7i.4xlarge (Intel) | 16 | 128 GB | 12.5 Gbps | ~$0.35 | ~$252 |
| r8g.8xlarge | 32 | 256 GB | 15 Gbps | ~$0.60 | ~$432 |
| i4i.2xlarge (NVMe) | 8 | 64 GB | 12.5 Gbps | ~$0.20 | ~$144 |
Target config: 5× r8g.4xlarge = 80 vCPU, 640 GB RAM, 75 Gbps aggregate bandwidth. Cost: ~$1,080/month on spot ($36/day).
1B node math: 1B nodes × ~320 bytes/node (avg with properties) = ~300 GB. Fits in 640 GB with room for edges, indexes, and working memory.
We will implement a partition-based distributed architecture inspired by VoltDB's deterministic execution model, adapted for graph workloads with modern hardware capabilities.
- Partition to eliminate coordination — Single-partition queries (90%+ of workload) execute locally with zero network hops
- Community-aware partitioning — Use graph structure (WCC/METIS) to co-locate densely connected nodes, minimizing edge cuts
- Replicate hot boundaries — Edges crossing partition boundaries are replicated to both sides (read-local, write-coordinate)
- Async persistence, sync replication — Writes replicate to K replicas synchronously (VoltDB K-safety), persist to disk asynchronously
- io_uring for disk I/O — Eliminate syscall overhead for persistence (60% improvement over epoll, proven by Neon/Tonbo in production)
┌─────────────────────────┐
│ Query Router │
│ (partition-aware plan) │
└────────┬────────────────┘
│
┌──────────────┼──────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ │ │ │ │ │
│ Partition │ │ Partition │ │ Partition │
│ 0, 3 │ │ 1, 4 │ │ 2, 5 │
│ │ │ │ │ │
│ Boundary │ │ Boundary │ │ Boundary │
│ Edge Cache │ │ Edge Cache │ │ Edge Cache │
│ │ │ │ │ │
│ RocksDB │ │ RocksDB │ │ RocksDB │
│ (NVMe) │ │ (NVMe) │ │ (NVMe) │
└────────────┘ └────────────┘ └────────────┘
│ │ │
└───────── Raft Consensus ──────┘
(metadata + partition map)
Phase 1: Hash partitioning (simple, even distribution)
partition(node_id) = hash(node_id) % num_partitions
- Even distribution, simple routing
- Works well for label scans and property lookups
- Edge cuts are random — ~70% of edges cross partitions
Phase 2: Community-aware partitioning (CUTTANA/METIS)
- Use WCC to identify connected components
- Assign entire communities to the same partition
- Reduces edge cuts to ~10-20% (vs 70% with hash)
- Rebalance when communities grow unevenly
- Based on CUTTANA (VLDB 2024) — streaming partitioner that prevents premature vertex assignment
Phase 3: Hybrid with boundary replication
- Partition nodes by community
- Replicate boundary edges to both partitions (1-hop cache)
- 1-hop traversals from any node are always local
- 2+ hop traversals may cross partitions (scatter-gather)
Single-partition query (most common):
MATCH (p:Person {id: 123})-[:KNOWS]->(friend)
→ Router determines: Person 123 is on Partition 2 (Node 2)
→ Execute entirely on Node 2, return result
→ Zero coordination, zero network hops
Multi-partition query (cross-community):
MATCH (p:Person)-[:WORKS_AT]->(c:Company {name: 'Acme'})
→ Router determines: Company 'Acme' is on Partition 0 (Node 1)
→ Phase 1: Node 1 scans Company 'Acme', expands WORKS_AT (boundary cache)
→ Phase 2: If employees span partitions, scatter to relevant nodes
→ Phase 3: Gather results, merge, return
Write coordination (K-safety model):
CREATE (p:Person {name: 'Alice'})
→ Router determines: hash('Alice') → Partition 3 (Node 1)
→ Node 1 executes locally
→ Synchronous replicate to K=1 replica (Node 2)
→ Ack to client after K+1 confirmations
→ Async persist to RocksDB via io_uring
What changed since VoltDB (2012):
| 2012 (VoltDB era) | 2026 (Now) |
|---|---|
| 10 Gbps Ethernet | 100-200 Gbps (ENA, EFA) |
| Spinning disks / early SSDs | NVMe (i4i: 3.75 TB, 400K IOPS) |
| epoll / select | io_uring (60% less overhead) |
| TCP/IP networking | RDMA / kernel bypass available |
| 4-8 cores typical | 32-96 cores typical |
| 16-64 GB RAM typical | 128-768 GB RAM commodity |
| No cloud spot instances | 70-90% spot discount |
io_uring for Samyama:
- Replace RocksDB's default epoll-based I/O with io_uring
tokio-uringormonoioas async runtime for disk operations- Neon (Postgres storage) proved this works: Rust + tokio + io_uring + O_DIRECT
- 44x improvement on 4KB random reads vs standard tokio (Tonbo benchmark)
Network: AWS Elastic Fabric Adapter (EFA) provides kernel-bypass networking with ~2μs latency between instances in the same placement group. Not full RDMA but close enough for graph traversal scatter-gather.
CUTTANA (VLDB 2024):
- Streaming graph partitioner — processes vertices in one pass
- Buffering technique prevents premature partition assignment
- Coarsening + refinement for complete graph view
- Scales to billion-node web graphs
- Open source
METIS (classic, proven):
- Offline multilevel k-way partitioning
- Near-optimal edge cut minimization
- Minutes for billion-node graphs
- Well-understood, production-proven
Samyama's WCC as pre-partitioner:
- Already implemented, O(V+E)
- Identifies natural community boundaries
- WCC components become partition candidates
- Assign small components to same partition; split large components with METIS
- Partition metadata service — Partition map (node → partition), stored in Raft consensus
- Partition-aware query router — Extend query planner to annotate operators with partition info
- Hash partitioning — Simple hash(node_id) % N for even distribution
- Inter-node RPC — Arrow Flight (already in ADR-006 plan) or gRPC for scatter-gather
- Benchmark: 100M nodes across 3 nodes, LDBC queries
- WCC-based community detection as pre-partitioner
- CUTTANA/METIS integration for optimal partition assignment
- Boundary edge replication — 1-hop cache at partition boundaries
- Partition rebalancing — Move partitions when nodes join/leave
- Benchmark: 500M nodes, measure edge-cut reduction vs hash
- K-safety replication (VoltDB model) — sync to K replicas, async persist
- io_uring persistence — Replace epoll I/O in RocksDB path
- Snapshot import persistence (HA-08) — Background flush after bulk load
- Jepsen testing — Distributed consistency validation
- Benchmark: 1B nodes, 5-node cluster, LDBC + failure scenarios
- 1B+ nodes on 5 commodity nodes ($36/day on spot)
- Single-partition queries at local speed (sub-millisecond)
- Community-aware partitioning minimizes network hops (10-20% edge cuts vs 70%)
- K-safety without Raft overhead per write (VoltDB proven model)
- io_uring eliminates 60% of I/O syscall overhead
- Only open-source graph DB with distributed ACID at scale
- Multi-partition queries add latency (~1-5ms per partition hop)
- Partition rebalancing is complex (online migration without downtime)
- io_uring requires Linux 5.1+ (not available on macOS — dev/test divergence)
- Boundary edge replication increases storage (~10-20% overhead)
| Risk | Impact | Mitigation |
|---|---|---|
| Hot partition (skewed data) | High | Adaptive split + merge |
| Network partition (split brain) | Critical | Raft consensus for metadata, K-safety for data |
| io_uring compatibility | Medium | Fallback to epoll on older kernels / macOS |
| Partition rebalancing complexity | High | Start with static partitioning, add online rebalancing in Phase 3 |
- Compute-storage separation, shared NVMe pool
- Pros: Simpler consistency, elastic compute
- Cons: Storage becomes bottleneck for graph traversals (random access pattern)
- Verdict: Bad fit for graph — adjacency list lookups need local memory
- Every write goes through Raft consensus
- Pros: Strong consistency, simple model
- Cons: Raft latency on every write (~2-5ms), consensus bottleneck
- Verdict: Fine for metadata, too slow for data writes at scale
- Full copy on every node, leader handles writes
- Pros: Simple, all reads local
- Cons: Doesn't scale beyond single-node RAM
- Verdict: Can't reach 1B nodes without massive vertical scaling
- CUTTANA: Scalable Graph Partitioning (VLDB 2024)
- TigerGraph: Native MPP Graph Database (arXiv)
- io_uring for High-Performance DBMSs (VLDB)
- Neon: Rust + tokio + io_uring + O_DIRECT
- METIS Graph Partitioning
- VoltDB Architecture
- How to Partition a Billion-Node Graph (Microsoft Research)
- Jepsen Distributed Systems Testing
- A Wake-Up Call for Kernel-Bypass on Modern Hardware (DaMoN 2025)
- Distributed Shared-Memory Databases (VLDB)
- Tonbo: Exploring Better Async Rust Disk I/O
- Sandeep Kunkunuru (Architecture)
- Madhulatha Mandarapu (Engineering)
Proposed: 2026-03-23
Last Updated: 2026-03-23 Status: Proposed