Commit 2a0bccd

docs: improvements
1 parent 64702d8 commit 2a0bccd

1 file changed

+78
-18
lines changed

DESIGN.md

Lines changed: 78 additions & 18 deletions
@@ -394,6 +394,25 @@ Default batch configuration:
394394

395395
Divergence indicates a bug or corruption, not consensus failure. The committing quorum is authoritative.
396396

397+
### Degraded Operation Modes
398+
399+
When failures occur, Ledger degrades gracefully rather than failing completely:
400+
401+
| Failure Scenario | Write Availability | Read Availability | Recovery Action |
402+
| ----------------------------- | ------------------ | ----------------- | ----------------------------------- |
403+
| Single node down (3-node) | ✓ Available | ✓ Available | Automatic failover |
404+
| Minority nodes down | ✓ Available | ✓ Available | Reduced redundancy, monitor closely |
405+
| Majority nodes down | ✗ Unavailable | ✓ Stale reads | Manual intervention required |
406+
| Leader network isolated | ✓ After election | ✓ Available | New leader elected (~10s) |
407+
| State root divergence (vault) | ✗ Vault halted | ✓ Available | Rebuild vault from snapshot |
408+
| Disk full | ✗ Unavailable | ✓ Available | Expand storage, compact logs |
409+
410+
**Partial availability**: When writes are unavailable, Ledger continues serving eventually consistent reads from any healthy replica. Applications can implement read-only degraded modes.
411+
412+
**Vault-level isolation**: A diverged vault does not affect other vaults in the same namespace. Only the affected vault halts writes pending recovery.
413+
414+
**Automatic recovery scope**: Ledger automatically recovers from transient failures (network blips, brief partitions). Persistent failures (disk corruption, state divergence) require operator intervention to prevent data loss.
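
The node-count rows of the table above can be sketched as a small classifier. This is an illustrative sketch only: `Availability` and `classify` are hypothetical names, not part of Ledger's API, and the sketch assumes availability depends solely on how many nodes are healthy (it ignores disk-full and vault-divergence cases).

```rust
/// Coarse availability of a Raft group (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum Availability {
    /// A majority is healthy: writes and reads both succeed.
    Full,
    /// No majority: writes are rejected, healthy replicas serve possibly stale reads.
    ReadOnlyStale,
    /// No healthy replica remains.
    Unavailable,
}

/// Classify availability from the cluster size and the number of healthy nodes.
/// ASSUMPTION: only node health matters here; other failure modes are ignored.
pub fn classify(total_nodes: usize, healthy_nodes: usize) -> Availability {
    let quorum = total_nodes / 2 + 1; // strict majority
    if healthy_nodes >= quorum {
        Availability::Full
    } else if healthy_nodes > 0 {
        Availability::ReadOnlyStale
    } else {
        Availability::Unavailable
    }
}
```

An application could branch on a value like this to switch into a read-only degraded mode when a majority of nodes is lost.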
415+
397416
---
398417

399418
## Scaling Architecture: Shard Groups
@@ -654,9 +673,7 @@ This section consolidates all design trade-offs for decision archaeology.
654673
| State verification | O(log n) proof | Replay from snapshot |
655674
| Query latency | O(log n) | O(1) |
656675

657-
**Trade-off accepted**: We sacrifice instant per-key proofs for 10x lower write amplification and O(1) query latency. Verification via replay is acceptable for audit scenarios (not real-time).
658-
659-
**When to reconsider**: If instant per-key proofs become a hard requirement.
676+
We sacrifice instant per-key proofs for 10x lower write amplification and O(1) query latency. Verification via replay is acceptable for audit scenarios, which are not real-time.
660677

661678
### gRPC/HTTP/2 vs. Custom Protocol
662679

@@ -668,9 +685,7 @@ This section consolidates all design trade-offs for decision archaeology.
668685
| Operational complexity | Low (standard load balancers) | High (custom tooling) |
669686
| Per-request overhead | ~0.5ms | ~0.1-0.2ms |
670687

671-
**Trade-off accepted**: ~0.3ms overhead per request is negligible compared to consensus latency (~2-5ms). Development velocity wins.
672-
673-
**When to reconsider**: If throughput exceeds 100K ops/sec per shard or serialization exceeds 5% of request latency.
688+
The extra ~0.3-0.4ms per request that gRPC adds over a custom protocol is small compared to consensus latency (~2-5ms). Development velocity wins.
674689

675690
### Raft vs. Byzantine Consensus
676691

@@ -683,9 +698,7 @@ This section consolidates all design trade-offs for decision archaeology.
683698
| Fault model | Crash faults only | Malicious nodes |
684699
| Complexity | Moderate | High |
685700

686-
**Trade-off accepted**: We assume trusted operators. Byzantine tolerance would 3x latency for a threat model that doesn't apply.
687-
688-
**When to reconsider**: Multi-party deployments where no single operator is trusted.
701+
We assume trusted operators. Byzantine tolerance would roughly triple latency for a threat model that doesn't apply.
689702

690703
### Per-Vault Chains vs. Single Chain
691704

@@ -702,7 +715,7 @@ This section consolidates all design trade-offs for decision archaeology.
702715
- No cross-vault transactions
703716
- More complex routing
704717

705-
**Trade-off accepted**: Isolation is more important than cross-vault atomicity for authorization workloads.
718+
Isolation is more important than cross-vault atomicity for authorization workloads.
706719

707720
### Bucket-Based State Roots vs. Traditional Merkle Trees
708721

@@ -714,15 +727,64 @@ This section consolidates all design trade-offs for decision archaeology.
714727
| Proof size | O(log n) | O(256) = O(1) fixed |
715728
| Range proofs | Efficient | Not supported |
716729

717-
**Trade-off accepted**: O(k) updates independent of database size. Larger proof size acceptable for audit scenarios.
718-
719-
**When to reconsider**: If range proofs become necessary.
730+
Updates are O(k), independent of database size. The larger proof size is acceptable for audit scenarios.
720731

721732
### Single Leader vs. Multi-Leader
722733

723734
**Decision**: Single Raft leader per shard handles all writes.
724735

725-
**Trade-off accepted**: Single leader is a bottleneck but simplifies consistency. Horizontal scaling via sharding.
736+
A single leader is a write bottleneck but simplifies consistency. Horizontal scaling comes from sharding.
737+
738+
### SHA-256 vs. Alternative Hash Functions
739+
740+
**Decision**: SHA-256 for all cryptographic commitments (state roots, block hashes, Merkle proofs).
741+
742+
| Factor | SHA-256 | BLAKE3 | SHA-3 |
743+
| -------------------- | ---------------------- | ----------- | --------- |
744+
| Performance | ~500 MB/s | ~6 GB/s | ~300 MB/s |
745+
| Hardware accel | Widespread (SHA-NI) | Limited | Growing |
746+
| Standardization | FIPS 180-4, ubiquitous | Not FIPS | FIPS 202 |
747+
| Tooling/verification | Excellent | Growing | Good |
748+
| Audit familiarity | Universal | Less common | Growing |
749+
750+
SHA-256's universal recognition and hardware acceleration outweigh BLAKE3's raw speed. For authorization audits, auditors must be able to verify hashes with standard tools, and SHA-256 is understood everywhere. Cryptographic operations are not the bottleneck (<5% of request latency).
751+
752+
### seahash vs. Alternative Non-Cryptographic Hashes
753+
754+
**Decision**: seahash for bucket assignment and internal indexing (non-security-critical paths).
755+
756+
| Factor | seahash | xxhash | FNV-1a | SipHash |
757+
| ------------- | -------- | -------------- | -------------- | ----------------- |
758+
| Speed | ~15 GB/s | ~30 GB/s | ~5 GB/s | ~2 GB/s |
759+
| Pure Rust | Yes | Requires C FFI | Yes | Yes (std default) |
760+
| Distribution | Good | Excellent | Poor for short | Good |
761+
| DoS resistant | No | No | No | Yes |
762+
763+
seahash provides excellent speed in pure Rust without FFI complexity. For bucket assignment from already-authenticated data, DoS resistance is unnecessary: we're hashing internal keys, not untrusted input.
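
A minimal sketch of bucket assignment follows. It rests on stated assumptions: the standard library's `DefaultHasher` (SipHash) stands in for seahash so the example builds without external crates, the bucket count of 256 is taken from the fixed proof size quoted in the trade-off table, and `bucket_for` and `dirty_buckets` are hypothetical names rather than Ledger functions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::Hasher;

/// ASSUMPTION: 256 buckets, matching the fixed proof size quoted in this document.
pub const BUCKET_COUNT: u64 = 256;

/// Map a key to a bucket index in `0..BUCKET_COUNT`.
/// ASSUMPTION: `DefaultHasher` is a stand-in for seahash in this sketch.
pub fn bucket_for(key: &[u8]) -> u8 {
    let mut hasher = DefaultHasher::new();
    hasher.write(key);
    (hasher.finish() % BUCKET_COUNT) as u8
}

/// Buckets whose hashes would need recomputing after `keys` change.
/// The set can never be larger than the number of changed keys.
pub fn dirty_buckets(keys: &[&[u8]]) -> BTreeSet<u8> {
    keys.iter().map(|key| bucket_for(key)).collect()
}
```

Because only the buckets returned by `dirty_buckets` are touched, the update cost in a scheme like this scales with the number of changed keys rather than with the total database size.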
764+
765+
---
766+
767+
## Threat Model
768+
769+
### Trusted Operator Assumption
770+
771+
Ledger assumes a **trusted operator model**: the organization running the cluster controls all nodes and does not act maliciously. This is distinct from permissionless blockchains where nodes may be adversarial.
772+
773+
**What Ledger protects against**:
774+
775+
- **Crash failures**: Nodes may crash, lose power, or experience hardware failures. An n-node Raft group tolerates up to ⌊(n-1)/2⌋ simultaneous failures.
776+
- **Network partitions**: Nodes may become temporarily unreachable. Raft maintains safety (no conflicting commits) and makes progress whenever a majority is reachable.
777+
- **Disk corruption**: State root verification detects corruption. Recovery is via snapshot plus log replay.
778+
- **Accidental misconfiguration**: Sequence numbers and idempotency prevent duplicate operations.
779+
- **Post-hoc tampering**: Cryptographic chain linking makes undetected modification computationally infeasible.
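
The crash-failure bound in the list above is integer arithmetic, which a short sketch makes concrete. `max_tolerated_failures` is a hypothetical helper name, not part of Ledger.

```rust
/// Largest number of simultaneous crash failures an `n`-node Raft group
/// can tolerate while still keeping a majority: floor((n - 1) / 2).
pub fn max_tolerated_failures(n: usize) -> usize {
    // saturating_sub avoids underflow for the degenerate n == 0 case.
    n.saturating_sub(1) / 2
}
```

Note that a 4-node group tolerates one failure, the same as a 3-node group, which is why odd cluster sizes are the usual choice.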
780+
781+
**What Ledger does NOT protect against**:
782+
783+
- **Malicious operator**: A compromised operator with access to a majority of nodes can forge state. Raft is not Byzantine fault tolerant.
784+
- **Compromised leader**: A Byzantine leader can propose invalid blocks. Followers verify state roots but cannot prevent a malicious majority from accepting invalid state.
785+
- **Side-channel attacks**: Memory inspection and timing attacks on cryptographic operations are out of scope.
786+
787+
**Mitigation for untrusted environments**: Organizations requiring Byzantine fault tolerance should evaluate Tendermint-based systems or PBFT variants and accept roughly 3x latency overhead.
726788

727789
---
728790

@@ -763,11 +825,9 @@ This section consolidates all design trade-offs for decision archaeology.
763825

764826
### Future Considerations
765827

766-
1. **Hardware acceleration**: Can cryptographic operations benefit from GPU/FPGA offload?
767-
768-
2. **Zero-knowledge proofs**: Could ZK-SNARKs enable private verification without revealing data?
828+
1. **Zero-knowledge proofs**: Could ZK-SNARKs enable private verification without revealing data?
769829

770-
3. **Tiered storage**: Hot data in memory, warm on SSD, cold in object storage?
830+
2. **Tiered storage**: Hot data in memory, warm on SSD, cold in object storage?
771831

772832
---
773833
