A production-grade, high-frequency trading (HFT) Limit Order Book (LOB) and Matching Engine written in C++20. Designed for extreme throughput, deterministic latency, and cache efficiency using modern lock-free techniques.
Performance Benchmark: ~160,000,000 orders/second (Average) | ~171,000,000 (Peak) on Apple M1 Pro.
- Architecture: Sharded "Share-by-Communicating" design avoids global locks.
- Memory: `std::pmr` monotonic buffers on the stack = 0 heap allocations on the hot path.
- Optimization: `alignas(128)` (vs 64) reduced false sharing on M1/M2 by ~5%.
- Latency: Sub-microsecond matching latency within the engine core.
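The `alignas(128)` choice can be illustrated with a minimal sketch (the struct and constant names below are hypothetical, not the project's actual code). Apple M1/M2 cores use 128-byte cache lines, so padding each hot atomic to 128 bytes keeps producer and consumer counters on separate lines:

```cpp
#include <atomic>
#include <cstddef>

// Apple M1/M2 use 128-byte cache lines; x86 typically uses 64.
// Padding each hot counter to a full 128-byte line means two threads
// updating adjacent counters never invalidate each other's cache line.
constexpr std::size_t kCacheLine = 128;  // hypothetical constant name

struct alignas(kCacheLine) PaddedIndex {
    std::atomic<std::size_t> value{0};
};

// alignas on the struct rounds its size up to a whole cache line.
static_assert(sizeof(PaddedIndex) == kCacheLine);
static_assert(alignof(PaddedIndex) == kCacheLine);
```

On a 64-byte-line CPU the same trick with `alignas(64)` would suffice; using 128 everywhere trades a little memory for portability across both line sizes.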
- Ultra-High Throughput: Capable of processing >160 million distinct order operations per second on a single machine.
- Zero-Allocation Hot Path:
  - PMR (Polymorphic Memory Resources): Uses `std::pmr::monotonic_buffer_resource` with a pre-allocated 512MB stack buffer for nanosecond-level allocations.
  - Soft Limits: Gracefully falls back to heap allocation if the static buffer is exhausted (no crashes).
- Lock-Free Architecture:
  - SPSC Ring Buffer: Custom cache-line aligned (`alignas(128)`) ring buffer for thread-safe, lock-free communication between producer and consumer.
  - Shard-per-Core: "Share by Communicating" design. Each CPU core owns a dedicated shard, eliminating mutex contention entirely.
  - Thread Pinning: `pthread_setaffinity_np` / `thread_policy_set` guarantee exclusive core usage for worker threads (experimentally verified).
- Cache Optimizations:
  - Flat OrderBook: Replaces node-based maps with `std::vector` for linear memory access.
  - Bitset Scanning: Uses CPU intrinsics (`__builtin_ctzll`) to "teleport" to the next active price level, skipping empty levels instantly. Optimized to avoid "ghost level" checks.
  - Compact Storage: Order objects are PODs (Plain Old Data) optimized for `memcpy`.
- Smart Batching:
  - Producer-Side: Thread-local accumulation of orders reduces atomic contention on the RingBuffer tail.
  - Consumer-Side: Workers pop commands in batches (up to 256) to amortize cache-line invalidations.
- Verification & Safety:
  - Deterministic: `--verify` mode runs a mathematically verifiable sequence (Matches == Min(Buys, Sells)).
  - Instrumentation: `--latency` mode enables wall-clock end-to-end latency tracking.
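The zero-allocation path described above can be illustrated with a small `std::pmr` sketch. This is a simplified stand-in, not the project's actual code: the buffer size, the `Order` layout, and the function name are illustrative only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <vector>

// Illustrative POD order record (layout is an assumption).
struct Order {
    std::uint64_t id;
    std::int64_t  price;
    std::uint32_t qty;
};

int count_orders() {
    // A monotonic arena over a pre-allocated buffer: each allocation is
    // just a pointer bump. If the buffer runs out, the upstream resource
    // (the heap, via new_delete_resource) takes over instead of crashing --
    // the "soft limit" behaviour described above.
    static std::array<std::byte, 1 << 20> buffer;  // 1 MiB for the sketch
    std::pmr::monotonic_buffer_resource arena{
        buffer.data(), buffer.size(), std::pmr::new_delete_resource()};

    std::pmr::vector<Order> book{&arena};  // all pushes draw from the arena
    for (std::uint64_t i = 0; i < 1000; ++i)
        book.push_back({i, 15000, 100});
    return static_cast<int>(book.size());
}
```

Destroying (or `release()`-ing) the arena frees everything at once, which is what makes per-run resets effectively instant.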
The system moves away from the traditional "Central Limit Order Book with Global Lock" to a Partitioned/Sharded Model.
```mermaid
graph LR
    %% Styles for high visibility and contrast
    classDef client fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c
    classDef core fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
    classDef memory fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100

    User([User / Network]) -->|TCP Input| Gateway{Symbol Mapping}
    class User client
    class Gateway infra

    %% Shard 0 Pipeline
    subgraph Core0 [Core 0: Pinned]
        direction TB
        RB0[Ring Buffer<br/>SPSC Lock-Free]:::infra
        Matcher0[Matching Engine<br/>Shard 0]:::core
        Mem0[PMR Arena<br/>Stack Buffer]:::memory
        RB0 --> Matcher0
        Matcher0 --> Mem0
    end

    %% Shard 1 Pipeline
    subgraph Core1 [Core 1: Pinned]
        direction TB
        RB1[Ring Buffer<br/>SPSC Lock-Free]:::infra
        Matcher1[Matching Engine<br/>Shard 1]:::core
        Mem1[PMR Arena<br/>Stack Buffer]:::memory
        RB1 --> Matcher1
        Matcher1 --> Mem1
    end

    %% Routing (No crossing)
    Gateway -->|Shard 0: GOOG, MSFT| RB0
    Gateway -->|Shard 1: AAPL, TSLA| RB1
```
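The SPSC ring buffers in the diagram can be sketched as below. This is a simplified illustration under stated assumptions (class name, power-of-two capacity, and the specific memory orders are mine, not the project's actual implementation); head and tail sit on separate 128-byte cache lines, matching the `alignas(128)` point above.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal single-producer / single-consumer ring: one thread calls
// push(), one thread calls pop(), and no locks are needed because each
// index has exactly one writer. N must be a power of two so the index
// wrap is a cheap bitmask.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::array<T, N> slots_;
    alignas(128) std::atomic<std::size_t> head_{0};  // consumer-owned
    alignas(128) std::atomic<std::size_t> tail_{0};  // producer-owned
public:
    bool push(const T& v) {  // producer thread only
        const auto t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;  // full
        slots_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    std::optional<T> pop() {  // consumer thread only
        const auto h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return std::nullopt;  // empty
        T v = slots_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // free the slot
        return v;
    }
};
```

The acquire/release pairing is the whole trick: the producer's release store on `tail_` makes the written slot visible before the consumer's acquire load observes the new index, and symmetrically for `head_`.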
- Ingestion (Exchange):
  - Orders are received and hashed by `SymbolID`.
  - "Smart Gateway" logic routes the order to the specific Shard owning that symbol.
- Transport (Ring Buffer):
  - Orders are pushed into a lock-free Single-Producer Single-Consumer (SPSC) ring buffer.
  - Union-Based Commands: Uses a `union` structure to overlay `Add` and `Cancel` commands, saving memory and fitting more commands per cache line.
- Matching (Core):
  - Flat OrderBook: Bids and Asks are simple `std::pmr::vector`s indexed directly by price (O(1) lookup).
  - Matcher: Iterates linearly over the vector for maximum hardware prefetching efficiency. Active orders are tracked via a `Bitset`.
- Memory Management:
  - A Monotonic Buffer (Arena) provides memory for new orders. It resets instantly (`release()`) between benchmark runs, preventing fragmentation.
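The bitset scan used by the matcher can be sketched as follows (the function name and 64-level word size are illustrative; the project already states it uses `__builtin_ctzll`):

```cpp
#include <cstdint>

// Each bit of `levels` marks an active price level (bit i == level i).
// __builtin_ctzll counts trailing zeros, i.e. jumps straight to the
// lowest set bit in one instruction instead of walking empty levels.
// `from` must be in [0, 63]; shifting by 64 would be undefined.
inline int next_active_level(std::uint64_t levels, int from) {
    const std::uint64_t masked = levels & (~0ULL << from);  // drop levels below `from`
    if (masked == 0) return -1;        // no active level at or above `from`
    return __builtin_ctzll(masked);    // index of lowest remaining set bit
}
```

A real book would chain several 64-bit words to cover its full price range; the per-word scan is the same. `__builtin_ctzll` is a GCC/Clang intrinsic (MSVC has `_BitScanForward64`).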
- C++20 Compiler (GCC 10+ / Clang 12+)
- CMake 3.14+
```bash
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native"
make -j$(nproc)
```

To replicate the >160M ops/s performance:
```bash
./build/src/benchmark
```

To measure end-to-end latency (approximate P50/P99):
```bash
./build/src/benchmark --latency
```

Note: Latency mode adds instrumentation overhead and usually runs at ~15-20M ops/sec on M1 Pro due to timestamp calls.
To ensure the engine is actually matching correctly and not just dropping frames:
```bash
./build/src/benchmark --verify
```

Test the engine against live Binance L3 data (Trade + Depth updates) to verify handling of realistic price clustering and bursty order flow.
- Record Data (requires Python 3 + `websocket-client`):

  ```bash
  # Record 60 seconds of live BTCUSDT data
  python3 scripts/record_l3_data.py 60
  ```

- Run Replay:

  ```bash
  ./build/src/benchmark --replay data/market_data.csv
  ```
Result: ~132,000,000 orders/sec (M1 Pro) on real-world data.
Start the engine networking layer (listens on port 8080):
```bash
./build/src/OrderMatchingEngine
```

Populate the Book (Optional): To quickly seed the book with Bids, Asks, and Trades for testing/visualization:

```bash
python3 scripts/seed_orders.py
```

Connect using netcat or any TCP client.
Submit Order:
```text
BUY AAPL 100 15000
> ORDER_ACCEPTED_ASYNC 1
```

(Format: SIDE SYMBOL QTY PRICE_INT)
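A client can build this line however it likes; as a hypothetical sketch (the helper name is mine, and the interpretation of `PRICE_INT` as a fixed-point integer price is an assumption the engine's docs above don't spell out):

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: renders one order in the wire format shown above,
// SIDE SYMBOL QTY PRICE_INT, newline-terminated. Sending the price as an
// integer keeps floating point off the wire.
std::string format_order(const std::string& side, const std::string& symbol,
                         std::uint32_t qty, std::int64_t price_int) {
    return side + " " + symbol + " " + std::to_string(qty) + " " +
           std::to_string(price_int) + "\n";
}
```

For example, `format_order("BUY", "AAPL", 100, 15000)` yields the exact line shown in the session above.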
Subscribe to Market Data:
```text
SUBSCRIBE AAPL
> SUBSCRIBED AAPL
> TRADE AAPL 15000 50
```
The project includes a comprehensive GoogleTest suite covering matching logic, cancellation scenarios, and partial fills.
```bash
./build/tests/unit_tests
```

- Kernel Bypass: Integration with DPDK/Solarflare for sub-microsecond wire latency.
- Market Data Distribution: Multicast UDP feed for quote dissemination.