A high-performance, fault-tolerant distributed key-value store built from scratch in C++ using the Raft Consensus Algorithm.
This project implements a distributed banking system powered by the Raft consensus protocol. It provides strong (linearizable) consistency and remains available as long as a majority of nodes are up, tolerating individual node failures. The system is designed to handle network partitions, leader crashes, and log-replication lag using industry-standard techniques such as log compaction and snapshotting.
Key engineering challenges solved in this implementation include preventing "Thread Explosion" during high-load replication and implementing robust Snapshot Recovery for lagging nodes.
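The "Busy Flag" idea can be sketched as a per-peer guard (illustrative names, not the project's actual code): at most one replication RPC per follower is ever in flight, and triggers that arrive in the meantime set a pending bit instead of spawning new threads.

```cpp
#include <atomic>
#include <cassert>

// Per-follower replication state (sketch).
struct PeerState {
    std::atomic<bool> busy{false};     // an RPC to this peer is in flight
    std::atomic<bool> pending{false};  // a trigger arrived while busy
};

// Returns true if the caller should start a replication RPC now; if one
// is already in flight, the request is recorded instead of spawning a
// new thread (preventing "thread explosion" under load).
bool tryBeginReplication(PeerState& p) {
    bool expected = false;
    if (p.busy.compare_exchange_strong(expected, true)) return true;
    p.pending.store(true);
    return false;
}

// Called when the in-flight RPC completes: returns true if a trigger
// arrived meanwhile (the worker should run one more round), otherwise
// releases the busy flag.
bool finishReplication(PeerState& p) {
    if (p.pending.exchange(false)) return true;  // loop once more
    p.busy.store(false);
    return false;
}
```

The net effect is that a burst of N client writes costs at most one extra replication round per peer, not N threads.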
This project leverages high-performance C++ libraries to ensure low latency and scalability:
| Category | Library / Tool | Usage |
|---|---|---|
| RPC Framework | gRPC | Handles synchronous and asynchronous communication between nodes. |
| Serialization | Protobuf | Efficient binary serialization for log entries and snapshots. |
| Storage Engine | RocksDB | High-performance embedded key-value store for log persistence and state machine data. |
| Networking | Boost.Asio | Asynchronous TCP handling for the Client-Gateway interaction. |
| Logging | spdlog | Fast, asynchronous structured logging for debugging distributed states. |
| JSON Parsing | nlohmann/json | Parsing cluster configuration and node discovery files. |
| Build System | CMake & vcpkg | Cross-platform build automation and dependency management. |
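A cluster configuration file parsed with nlohmann/json might look like the following. This is a hypothetical sketch — the field names are illustrative, not taken from this repository's actual schema:

```json
{
  "node_id": 1,
  "listen_address": "127.0.0.1:5001",
  "peers": [
    { "id": 2, "address": "127.0.0.1:5002" },
    { "id": 3, "address": "127.0.0.1:5003" }
  ],
  "election_timeout_ms": [150, 300],
  "snapshot_threshold_entries": 500
}
```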
The system was stress-tested with 10,000 requests (80% writes / 20% reads) on a 5-node cluster running in a WSL2 environment.
| Metric | Result |
|---|---|
| Throughput | 875.76 ops/sec |
| Requests | 10,000 |
| Failed Requests | 0 (100% success) |
| Latency (P50) | 51.22 ms |
| Latency (P99) | 124.60 ms |
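The P50/P99 figures are order statistics over the per-request latency samples. A minimal nearest-rank sketch (the benchmark's exact percentile convention isn't documented here, so small samples may differ at the margins):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Nearest-rank percentile over a non-empty latency sample (sketch).
double percentile(std::vector<double> samples, double p) {
    std::sort(samples.begin(), samples.end());
    std::size_t rank =
        static_cast<std::size_t>(std::ceil(p / 100.0 * samples.size()));
    if (rank > 0) --rank;  // 1-based rank -> 0-based index
    return samples[std::min(rank, samples.size() - 1)];
}
```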
- Leader Election: Randomized election timeouts to prevent split votes.
- Log Replication: Asynchronous replication with batched AppendEntries for network efficiency.
- Commit Safety: Quorum-based commits ensure data is never lost once acknowledged.
- Snapshotting: Implemented Log Compaction to truncate logs and reduce disk usage.
- Disaster Recovery: Nodes that fall too far behind (e.g., after a crash) automatically fetch a full snapshot from the Leader to catch up, bypassing thousands of log entries.
- Persistence: All state (Logs, Term, Vote) is persisted to RocksDB, allowing nodes to survive restarts.
- Thread Management: Custom "Busy Flag" logic prevents thread explosions during high-load replication.
- Non-Blocking I/O: Decoupled Disk I/O from the main event loop to prevent Leader stalling.
- Smart Backtracking: Conflict hints let the leader skip over whole inconsistent log ranges instead of decrementing nextIndex one entry at a time.
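The commit-safety rule above can be sketched in a few lines (illustrative names, not the project's actual code): the leader may only advance its commit index to the highest log index known to be replicated on a majority, so an acknowledged write survives any minority of failures.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Quorum-based commit (sketch): given each follower's matchIndex plus the
// leader's own last log index, return the highest index replicated on a
// majority of the cluster. (Raft additionally requires the entry at that
// index to carry the leader's current term before committing it.)
int quorumCommitIndex(std::vector<int> matchIndex, int leaderLastIndex) {
    matchIndex.push_back(leaderLastIndex);  // the leader counts toward quorum
    std::sort(matchIndex.begin(), matchIndex.end(), std::greater<int>());
    // With N nodes, the (N/2)-th highest match index (0-based) is held by
    // a majority: that node plus every node sorted before it.
    return matchIndex[matchIndex.size() / 2];
}
```

For a 5-node cluster with follower match indices {5, 4, 3, 1} and a leader at index 5, the quorum index is 4: three of five nodes hold entry 4, so it is safe to commit.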
The codebase is modularized into distinct layers handling consensus, networking, and storage:
```
src/
├── benchmark/   # Performance testing suite and metrics collection
├── client/      # Asynchronous client implementation for load generation
├── gateway/     # Entry point for client requests (TCP/HTTP)
├── network/     # gRPC service definitions and transport layer
├── raft/        # Core consensus logic (election, replication, state machine)
├── rpc/         # Protobuf-generated files and RPC handlers
├── storage/     # RocksDB wrapper, log management, and snapshotting
└── utils/       # Logging (spdlog), configuration parsers, and globals
```
- vcpkg (C++ Package Manager)
- CMake 3.10+
- C++17 Compiler
Ensure vcpkg is installed and install the required packages (gRPC, Protobuf, RocksDB, spdlog, etc.):
```bash
~/vcpkg/vcpkg install
```

Configure the project using the vcpkg toolchain and compile:
```bash
# Configure
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=~/vcpkg/scripts/buildsystems/vcpkg.cmake -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build
```

Start the Raft cluster:
```bash
./build/raft_kv
```

Once the cluster starts, an interactive menu lets you verify system behavior:
- Run CSV Workload: Verifies correctness with a deterministic set of transactions.
- Run Benchmark: Stress tests the system to measure throughput and latency.
- Run Fault Tolerance Test: Automatically stops a node, generates load, restarts the node, and verifies Snapshot Recovery.
The project includes a robust testing suite used to verify correctness:
- Leader Crash Test: Kills the Leader process mid-operation to verify a new Leader is elected and no data is lost.
- Partition Test: Isolates the Leader from the network to ensure split-brain scenarios are handled correctly.
- Snapshot Recovery Test: Stops a follower until it falls 500+ log entries behind, restarts it, and verifies it catches up via snapshot transfer rather than log replay.
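The snapshot path this test exercises reduces to one check on the leader (a sketch with illustrative names): once log compaction has discarded the entries a lagging follower still needs, AppendEntries can no longer help, so a full snapshot is sent instead.

```cpp
#include <cassert>

// Sketch: entries below firstRetainedIndex were compacted into the latest
// snapshot, so a follower whose next expected entry falls in that range
// can only catch up via a snapshot transfer (InstallSnapshot in Raft).
bool followerNeedsSnapshot(long nextIndex, long firstRetainedIndex) {
    return nextIndex < firstRetainedIndex;
}
```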
MIT License.
