Purpose: Document the physical, logical, and operational constraints that shape LineraDB's design.
Last Updated: December 2025
- Why Document Constraints?
- Physical Constraints
- CAP Theorem Trade-offs
- Consensus Constraints
- Storage Constraints
- Network Constraints
- Operational Constraints
- Security Constraints
Distributed systems are constrained by physics and mathematics.
Many design decisions in LineraDB stem from unavoidable limits:
- Speed of light limits latency
- CAP theorem limits consistency guarantees
- Quorum math limits fault tolerance
By making these constraints explicit, we can:
- Justify design decisions - "We chose X because constraint Y"
- Set realistic expectations - "We can't do Z because of constraint W"
- Avoid impossible goals - "Strong consistency + 100% availability = impossible"
Constraint: Information cannot travel faster than ~300,000 km/s in vacuum (~200,000 km/s in fiber).
Impact on LineraDB:
| Distance | One-Way Latency | Round-Trip (RTT) | Example |
|---|---|---|---|
| Same rack | 0.1 ms | 0.2 ms | Node-to-node in same data center |
| Same DC | 0.5-2 ms | 1-4 ms | Different racks, same building |
| Cross-AZ | 2-5 ms | 4-10 ms | AWS us-west-2a → us-west-2b |
| Cross-region (US) | 30-50 ms | 60-100 ms | us-west-2 → us-east-1 (4,000 km) |
| Transatlantic | 40-80 ms | 80-160 ms | US East → EU West (6,000 km) |
| Transpacific | 80-120 ms | 160-240 ms | US West → Tokyo (8,000 km) |
Design Implications:
- ✅ Single-region linearizable writes: <10ms (1 RTT for Raft quorum)
⚠️ Cross-region linearizable writes: >100ms (unavoidable)- ❌ Global linearizable writes <50ms: Physically impossible
LineraDB's Approach:
- Phase 1-4: Single-region deployment (minimize latency)
- Phase 5: Multi-region with follower reads (eventual consistency, <10ms)
- Phase 5: Multi-region linearizable reads (>100ms, opt-in only)
Quote from Google Spanner Paper:
"Spanner's use of TrueTime allows linearizable reads at a timestamp without blocking writes, but cross-region commits still require WAN round-trips (~100ms)."
Constraint: WAN bandwidth is limited and expensive compared to LAN.
| Link Type | Typical Bandwidth | Cost |
|---|---|---|
| Same rack | 10-100 Gbps | Free (backplane) |
| Same DC | 10-40 Gbps | Free (intra-DC) |
| Cross-region | 1-10 Gbps | ~$0.01-0.05/GB (AWS/GCP) |
Impact:
- Large transactions spanning regions incur latency + cost
- Bulk replication (initial sync, backups) must be throttled
LineraDB's Approach:
- Compress Raft log entries before cross-region replication
- Batch small writes to reduce overhead
- Snapshot transfers for new replicas (not full log replay)
Constraint: Disk writes are slow compared to memory.
| Medium | Random IOPS | Sequential Throughput |
|---|---|---|
| DRAM | 10M+ | 50+ GB/s |
| NVMe SSD | 100K-1M | 3-7 GB/s |
| SATA SSD | 10K-100K | 500 MB/s |
| HDD | 100-200 | 100-200 MB/s |
Impact:
- Every
fsync()call adds 1-10ms latency (SSD) or 10-20ms (HDD) - Durability vs. throughput trade-off
LineraDB's Approach:
- Write-Ahead Log (WAL): Sequential writes (fast)
- Group Commit: Batch multiple transactions into one
fsync() - Async Replication: Acknowledge to client before followers persist (configurable)
CAP Theorem: In the presence of network Partitions (P), you must choose between Consistency (C) and Availability (A).
What this means:
- ✅ Strong consistency: Linearizable reads/writes (Raft quorum)
- ✅ Partition tolerance: Survives network splits
- ❌ 100% availability: If majority of nodes unreachable, system rejects writes
Why not AP (eventual consistency)?
- LineraDB is transactional - banking use cases require strong consistency
- Eventual consistency leads to anomalies (lost updates, dirty reads)
Alternative (Phase 6+): Allow tunable consistency:
-- Strong consistency (default)
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Weak consistency (faster, may read stale data)
BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;3 nodes: N1 (Leader), N2, N3
Result: ✅ Consistent ✅ Available
3 nodes: N1 | N2, N3 (partition)
Result:
- N2 or N3 becomes new leader (majority)
- ✅ Consistent (N2/N3 partition)
- ❌ Not Available (N1 partition - rejects writes)
5 nodes: N1, N2 | N3, N4, N5 (partition)
Result:
- N3/N4/N5 elect new leader (majority)
- ✅ Consistent (N3/N4/N5 partition)
- ❌ Not Available (N1/N2 partition - rejects writes)
Key Insight: LineraDB sacrifices availability (of the minority partition) to maintain consistency.
Constraint: Raft requires strict majority (⌈N/2⌉ + 1) for safety.
| Cluster Size | Quorum Size | Tolerated Failures | Availability Risk |
|---|---|---|---|
| 1 | 1 | 0 | 🔴 Single point of failure |
| 3 | 2 | 1 | 🟢 Good (can lose 1 node) |
| 5 | 3 | 2 | 🟢 Better (can lose 2 nodes) |
| 7 | 4 | 3 | 🟡 Diminishing returns (latency) |
Why Odd Numbers?
- 3-node cluster: Tolerates 1 failure
- 4-node cluster: Still only tolerates 1 failure (quorum = 3)
- No benefit to even numbers (same fault tolerance, higher cost)
LineraDB's Recommendation:
- Dev/Test: 1 node (fast, but no fault tolerance)
- Production (single-region): 3 nodes
- Production (multi-region): 5 nodes (2-2-1 split across regions)
Problem: Linearizable reads require contacting the leader, but how does leader know it's still leader?
Constraint: Leader must prove it's still leader before serving reads.
Options:
| Approach | Latency | Consistency | LineraDB Uses |
|---|---|---|---|
| Read Index | 1 RTT to quorum | Linearizable | ✅ Phase 2 |
| Lease | 0 RTT (if lease valid) | Linearizable (with caveats) | ✅ Phase 5 |
| Follower Reads | 0 RTT | Eventual consistency | ✅ Phase 5 (opt-in) |
Leader Lease Algorithm:
- Leader acquires lease from quorum (duration:
election_timeout) - Leader serves reads locally (no quorum check)
- Lease expires → leader must renew before serving reads
Caveats:
- Requires bounded clock skew (use NTP + HLC)
- If clocks drift too much, lease safety breaks
LineraDB's Approach:
- Use Read Index (Phase 2-4) - always safe
- Add Leases (Phase 5) - faster, requires clock sync
Problem: LSM trees write data multiple times (WAL, memtable, SSTables, compaction).
Constraint:
Write Amplification = Bytes Written to Disk / Bytes Written by User
Typical Values:
- RocksDB: 10-30x
- LevelDB: 10-20x
Impact:
- SSD wear (limited write cycles)
- I/O bandwidth consumption
LineraDB's Approach:
- Use Leveled Compaction (RocksDB default) - lower write amplification
- Tune compaction aggressiveness vs. read performance
Problem: LSM trees must check multiple SSTables for a key.
Constraint:
Read Amplification = Number of SSTables Checked
Mitigation:
- Bloom Filters: Skip SSTables that don't contain key (99% accurate)
- Compaction: Merge SSTables to reduce levels
LineraDB's Approach:
- Bloom filters in every SSTable
- Monitor read latency, trigger compaction if too many levels
Problem: How to distinguish slow node from crashed node?
Constraint: Impossible to distinguish perfectly (see FLP impossibility).
Timeouts:
- Too short → False positives (unnecessary leader elections)
- Too long → Slow recovery (user-visible downtime)
LineraDB's Approach:
| Phase | Heartbeat Interval | Election Timeout | Rationale |
|---|---|---|---|
| Single-region | 50ms | 150-300ms | Low latency, LAN is reliable |
| Multi-region | 200ms | 600-1200ms | High latency, WAN less reliable |
Adaptive Timeouts (Future):
- Measure actual RTT, adjust timeouts dynamically
- Inspiration: CockroachDB's "liveness" subsystem
Types:
-
Complete Partition:
[N1, N2] | [N3, N4, N5]- Majority partition continues (N3/N4/N5)
- Minority partition rejects writes (N1/N2)
-
Asymmetric Partition:
N1 → N2 ✅ N1 → N3 ❌ N2 → N1 ✅ N2 → N3 ✅- Raft handles this (majority-based decisions)
-
Flapping Network:
Partition forms → heals → forms → heals- Causes repeated leader elections (thundering herd)
- Mitigation: Exponential backoff on election timeout
LineraDB's Approach:
- Assume crash-recovery model (nodes crash and restart)
- Do not handle Byzantine faults (malicious nodes) - trust infrastructure
- Test partitions extensively with Jepsen (Phase 6)
Constraint: LineraDB requires at least 3 nodes for fault tolerance.
Minimum Hardware (Production):
- 3 nodes (different availability zones)
- 4 vCPU + 16 GB RAM per node (for Raft + storage)
- 100 GB SSD per node (for LSM tree)
Cloud Costs (AWS example):
- 3x
m5.xlargeinstances: ~$500/month - 3x 100 GB gp3 SSDs: ~$30/month
- Cross-region data transfer: Variable (~$0.01/GB)
Total: ~$530+/month (single-region)
Constraint: Distributed systems fail in complex ways - observability is non-negotiable.
Metrics LineraDB Must Track:
- Raft: Leader election rate, log replication lag, commit latency
- Storage: Compaction duration, SSTable count, disk usage
- SQL: Query latency (p50, p99), slow queries
- System: CPU, memory, disk I/O, network I/O
LineraDB's Approach:
- Prometheus metrics (Phase 6)
- Grafana dashboards (Phase 6)
- OpenTelemetry tracing (Phase 6)
Constraint: LineraDB must verify client identity before accepting connections.
Options:
- Client Certificates (mTLS): Strong, but requires PKI
- Username + Password: Weak, but easy
- JWT Tokens: Scalable, but requires key management
LineraDB's Approach:
- Start with client certificates (inspired by FoundationDB)
- Add JWT later for easier integration
Constraint: SSTable files on disk must be encrypted to prevent offline attacks.
Options:
- Application-level encryption: Encrypt before writing to storage engine
- OS-level encryption: Use LUKS / dm-crypt (Linux)
- Cloud-level encryption: Use AWS EBS encryption
LineraDB's Approach:
- Phase 7: Application-level encryption (AES-256-GCM)
- Key management via AWS KMS or HashiCorp Vault
| Constraint | LineraDB's Approach | Trade-off |
|---|---|---|
| Speed of light | Single-region by default | Latency vs. disaster recovery |
| CAP theorem | CP (consistency + partition tolerance) | Consistency vs. availability |
| Quorum math | 3-node clusters (production) | Fault tolerance vs. cost |
| Disk I/O | WAL + group commit | Durability vs. throughput |
| Write amplification | LSM with leveled compaction | Write throughput vs. SSD wear |
| Network partitions | Raft's majority quorum | Availability in minority partition |
| Clock skew | Hybrid Logical Clocks | Leader leases safety |
When proposing features, please check if they violate any constraints:
- "Can we achieve <10ms cross-continent latency?" → ❌ Physics
- "Can we tolerate 3 failures with 5 nodes?" → ❌ Quorum math
- "Can we have strong consistency + 100% availability?" → ❌ CAP theorem
See TRADEOFFS.md for how LineraDB navigates these constraints.
Constraints shape design. Embrace them.