Skip to content

PIYUSH-KUMAR1809/order-matching-engine

Repository files navigation

High-Performance Order Matching Engine

A production-grade, high-frequency trading (HFT) Limit Order Book (LOB) and Matching Engine written in C++20. Designed for extreme throughput, deterministic latency, and cache efficiency using modern lock-free techniques.

C++ License CMake Build Awesome Quant

Performance Benchmark: ~160,000,000 orders/second (Average) | ~171,000,000 (Peak) on Apple M1 Pro.

⚡️ Key Takeaways

  • Architecture: Sharded "Share-by-Communicating" design avoids global locks.
  • Memory: std::pmr monotonic buffers on the stack = 0 heap allocations on hot path.
  • Optimization: alignas(128) (vs 64) reduced M1/M2 false sharing by ~5%.
  • Latency: Sub-microsecond matching latency within the engine core.

🚀 Key Features

  • Ultra-High Throughput: Capable of processing >160 million distinct order operations per second on a single machine.
  • Zero-Allocation Hot Path:
    • PMR (Polymorphic Memory Resources): Uses std::pmr::monotonic_buffer_resource with a pre-allocated 512MB stack buffer for nanosecond-level allocations.
    • Soft Limits: Gracefully falls back to heap allocation if the static buffer is exhausted (no crashes).
  • Lock-Free Architecture:
    • SPSC Ring Buffer: Custom cache-line aligned (alignas(128)) ring buffer for thread-safe, lock-free communication between producer and consumer.
    • Shard-per-Core: "Share by Communicating" design. Each CPU core owns a dedicated shard, eliminating mutex contention entirely.
    • Thread Pinning: Experimentally verified pthread_setaffinity_np / thread_policy_set guarantees exclusive core usage for worker threads.
  • Cache Optimizations:
    • Flat OrderBook: Replaces node-based maps with std::vector for linear memory access.
    • Bitset Scanning: Uses CPU intrinsics (__builtin_ctzll) to "teleport" to the next active price level, skipping empty levels instantly. Optimized to avoid "ghost level" checks.
    • Compact Storage: Order objects are PODs (Plain Old Data) optimized for memcpy.
  • Smart Batching:
    • Producer-Side: Thread-local accumulation of orders reduces atomic contention on the RingBuffer tail.
    • Consumer-Side: Workers pop commands in batches (up to 256) to amortize cache line invalidations.
  • Verification & Safety:
    • Deterministic: --verify mode runs a mathematically verifiable sequence (Matches == Min(Buys, Sells)).
    • Instrumentation: --latency mode enables wall-clock end-to-end latency tracking.

🏗 Architecture

The system moves away from the traditional "Central Limit Order Book with Global Lock" to a Partitioned/Sharded Model.

graph LR
    %% Styles for high visibility and contrast
    classDef client fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:#0d47a1
    classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#4a148c
    classDef core fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:#1b5e20
    classDef memory fill:#fff3e0,stroke:#e65100,stroke-width:2px,color:#e65100

    User([User / Network]) -->|TCP Input| Gateway{Symbol Mapping}
    class User client
    class Gateway infra

    %% Shard 0 Pipeline
    subgraph Core0 [Core 0: Pinned]
        direction TB
        RB0[Ring Buffer<br/>SPSC Lock-Free]:::infra
        Matcher0[Matching Engine<br/>Shard 0]:::core
        Mem0[PMR Arena<br/>Stack Buffer]:::memory
        
        RB0 --> Matcher0
        Matcher0 --> Mem0
    end

    %% Shard 1 Pipeline
    subgraph Core1 [Core 1: Pinned]
        direction TB
        RB1[Ring Buffer<br/>SPSC Lock-Free]:::infra
        Matcher1[Matching Engine<br/>Shard 1]:::core
        Mem1[PMR Arena<br/>Stack Buffer]:::memory
        
        RB1 --> Matcher1
        Matcher1 --> Mem1
    end

    %% Routing (No crossing)
    Gateway -->|Shard 0: GOOG, MSFT| RB0
    Gateway -->|Shard 1: AAPL, TSLA| RB1
Loading
  1. Ingestion (Exchange):
    • Orders are received and hashed by SymbolID.
    • "Smart Gateway" logic routes the order to the specific Shard owning that symbol.
  2. Transport (Ring Buffer):
    • Orders are pushed into a lock-free Single-Producer Single-Consumer (SPSC) ring buffer.
    • Union-Based Commands: Uses a union structure to overlay Add and Cancel commands, saving memory and fitting more commands per cache line.
  3. Matching (Core):
    • Flat OrderBook: Bids and Asks are simple std::pmr::vectors indexed directly by price (O(1) lookup).
    • Matcher: Iterates linearly over the vector for maximum hardware prefetching efficiency. Active orders are tracked via a Bitset.
  4. Memory Management:
    • A Monotonic Buffer (Arena) provides memory for new orders. It resets instantly (release()) between benchmark runs, preventing fragmentation.

🛠 Build & Run

Prerequisites

  • C++20 Compiler (GCC 10+ / Clang 12+)
  • CMake 3.14+

Compiling

mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-march=native"
make -j$(nproc)

Running Benchmarks

To replicate the >160M ops/s performance:

./build/src/benchmark

To measure End-to-End Latency (approximate P50/P99):

./build/src/benchmark --latency

Note: Latency mode adds instrumentation overhead and runs usually at ~15-20M ops/sec on M1 Pro due to timestamp calls.

Running Verification

To ensure the engine is actually matching correctly and not just dropping frames:

./build/src/benchmark --verify

Real-World Market Replay

Test the engine against live Binance L3 data (Trade + Depth updates) to verify handling of realistic price clustering and bursty order flow.

  1. Record Data (Requires Python 3 + websocket-client):

    # Record 60 seconds of live BTCUSDT data
    python3 scripts/record_l3_data.py 60
  2. Run Replay:

    ./build/src/benchmark --replay data/market_data.csv

    Result: ~132,000,000 orders/sec (M1 Pro) on real-world data.

Running the Server

Start the engine networking layer (listens on port 8080):

./build/src/OrderMatchingEngine

Populate the Book (Optional): To quickly seed the book with Bids, Asks, and Trades for testing/visualization:

python3 scripts/seed_orders.py

📊 Client Usage (TCP)

Connect using netcat or any TCP client.

Submit Order:

BUY AAPL 100 15000
> ORDER_ACCEPTED_ASYNC 1

(Format: SIDE SYMBOL QTY PRICE_INT)

Subscribe to Market Data:

SUBSCRIBE AAPL
> SUBSCRIBED AAPL
> TRADE AAPL 15000 50

🧪 Testing

The project includes a comprehensive GoogleTest suite covering matching logic, cancellation scenarios, and partial fills.

./build/tests/unit_tests

📈 Roadmap (Future)

  • Kernel Bypass: Integration with DPDK/Solarflare for sub-microsecond wire latency.
  • Market Data Distribution: Multicast UDP feed for quote dissemination.