My attempt at a complete high-frequency trading (HFT) pipeline, from synthetic tick generation to order execution and trade publishing. It’s designed to demonstrate how networking, clock synchronization, and hardware limits affect end-to-end latency in distributed systems.
Built using C++, Go, and Python, all services communicate via ZeroMQ using PUB/SUB and PUSH/PULL patterns. The stack is fully containerized with Docker Compose and can scale under K8s. No specialized hardware was used in this demo (e.g., FPGAs, RDMA NICs); the idea was to explore what I could achieve with commodity hardware and software optimizations alone.
| Step | Service (From) | Service (To) | Protocol | Purpose |
|---|---|---|---|---|
| 1️⃣ | Synthetic Feeder | Market Data Ingestor | WebSocket (port 8765) | Stream synthetic market data |
| 2️⃣ | Market Data Ingestor | Trading Algorithm | ZeroMQ PUB/SUB (port 5555) | Broadcast tick data |
| 3️⃣ | Trading Algorithm | Matching Engine | ZeroMQ PUSH/PULL (port 5557) | Send generated orders based on algorithm |
| 4️⃣ | Matching Engine | Order Submission / Latency Harness | ZeroMQ PUB/SUB (port 5558) | Publish executed trades |
| 5️⃣ | Order Submission | Mock Exchange | HTTP POST (port 5678) | Send trade confirmations |
| 6️⃣ | Latency Harness | (Self-contained) | Local timing | Measure E2E pipeline latency |
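The step-3 hand-off (Trading Algorithm → Matching Engine) is a plain ZeroMQ PUSH/PULL pair. A minimal sketch with pyzmq, collapsed into a single process over the `inproc` transport so it runs standalone; in the actual pipeline the two sides live in separate containers and connect over `tcp://` on port 5557, and the order fields shown here are illustrative assumptions:

```python
import zmq

# Single-process sketch of the algorithm -> matching-engine hand-off.
# The real services use tcp://:5557 across containers; inproc keeps the
# demo self-contained within one ZeroMQ context.
ctx = zmq.Context.instance()

pull = ctx.socket(zmq.PULL)          # matching-engine side
pull.bind("inproc://orders")

push = ctx.socket(zmq.PUSH)          # trading-algorithm side
push.connect("inproc://orders")

# Hypothetical order payload; field names are an assumption.
order = {"side": "BUY", "symbol": "SYN", "price": 100.42, "qty": 10}
push.send_json(order)                # PUSH load-balances across connected PULLers

received = pull.recv_json()
print(received["side"], received["price"])
```

PUSH/PULL (rather than PUB/SUB) fits this hop because orders must be delivered exactly once to one consumer, and a late-joining puller does not silently drop messages the way a late subscriber would.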
- Feeder generates synthetic tick data (e.g., price = 100.4202132) every 1ms.
- Ingestor reads the WebSocket and publishes the tick to all subscribers.
- Algorithm consumes ticks, generates BUY/SELL orders, and sends them to the matching engine.
- Matching Engine pairs bids/asks, timestamps the match, and publishes trade results.
- Order Submission forwards matched trades to a mock exchange via HTTP; the exchange responds with a confirmation (just a `200 OK` in this case, for simplicity).
- Latency Harness subscribes to the published trade stream (port 5558) and measures end-to-end latency across the pipeline.
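The matching step can be sketched as a tiny single-symbol matcher. The book layout below is an illustrative assumption; the pipeline only specifies that the engine pairs bids/asks, timestamps the match, and publishes the result:

```python
import time
from collections import deque

bids: deque = deque()   # resting BUY orders as (price, qty)
asks: deque = deque()   # resting SELL orders as (price, qty)

def submit(side: str, price: float, qty: int):
    """Return a trade dict if the order crosses the book, else rest it."""
    book, opposite = (bids, asks) if side == "BUY" else (asks, bids)
    if opposite:
        best_price, best_qty = opposite[0]
        crosses = price >= best_price if side == "BUY" else price <= best_price
        if crosses:
            fill = min(qty, best_qty)
            if best_qty - fill == 0:
                opposite.popleft()                      # resting order fully filled
            else:
                opposite[0] = (best_price, best_qty - fill)
            # Timestamp the match, as the real engine does before publishing.
            return {"price": best_price, "qty": fill, "ts_ns": time.time_ns()}
    book.append((price, qty))                           # no cross: rest on the book
    return None

submit("SELL", 100.40, 10)           # rests on the ask side
trade = submit("BUY", 100.41, 10)    # crosses the resting ask
print(trade["price"], trade["qty"])
```

Trades execute at the resting order's price, which is the usual price-time-priority convention.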
With this setup I measured an end-to-end latency of ~10 ms over a run of ~100 trades (on my machine, an ASUS VivoBook with a Ryzen 7 4700U and 16 GB of RAM), from synthetic tick generation through order execution and trade publishing. The figure comes from a simple Python latency harness that reads the timestamps attached to each order and computes the total time a message takes to traverse the system.
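The measurement itself is just timestamp arithmetic. A sketch of the harness calculation with made-up timestamps (the real script's field names and sample values are not shown in this write-up, so treat these as assumptions):

```python
import statistics

def e2e_latency_ms(ingress_ns: int, egress_ns: int) -> float:
    """Latency of one message through the pipeline, in milliseconds."""
    return (egress_ns - ingress_ns) / 1e6

# Hypothetical (ingress, egress) pairs, as captured with time.time_ns()
# at tick generation and again at trade publication.
samples = [
    (1_000_000_000, 1_010_500_000),   # 10.5 ms
    (2_000_000_000, 2_009_200_000),   #  9.2 ms
    (3_000_000_000, 3_010_300_000),   # 10.3 ms
]
latencies = [e2e_latency_ms(a, b) for a, b in samples]
print(f"mean: {statistics.mean(latencies):.2f} ms")
```

One caveat worth noting: this only works because all services here share one host clock. Across machines, the same subtraction would require synchronized clocks (PTP in real HFT setups).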
From my observations and research, the main bottlenecks in the system were:
- Network latency: Even on a local network, the time taken for messages to travel between services contributed significantly. Using ZeroMQ helped reduce some of this overhead due to its efficient messaging patterns.
- Containerization overhead: In real HFT environments, the network interface card (NIC) bypasses the kernel entirely using technologies like DPDK, RDMA, or Onload, allowing network packets to be DMA'd directly into user space. The NIC can provide hardware timestamping with nanosecond accuracy, and the trading engine typically runs pinned to a CPU core, reading packets directly, enabling sub-microsecond tick-to-trade latency. Docker, or any virtualization layer, breaks this model: it adds network abstraction layers (bridges, veth pairs, etc.), preventing direct kernel bypass, and cannot expose NIC clock synchronization inside containers. This explains why my containerized setup achieved latencies on the order of milliseconds rather than the sub-microsecond latencies seen in production HFT systems.
- Processing overhead: Serialization, deserialization, and order book management all contributed minor additional delays, though these seemed to be dwarfed by network and containerization effects.
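As a rough illustration of the serialization point, a fixed binary layout is far more compact than JSON for a single tick. The field set here is a guess at what a tick message carries, not the pipeline's actual schema:

```python
import json
import struct

# Hypothetical tick message; field names are assumptions for illustration.
tick = {"symbol": "SYN", "price": 100.4202132, "ts_ns": 1_700_000_000_000_000_000}

# Text encoding, roughly what the demo pipeline does today.
as_json = json.dumps(tick).encode()

# Fixed binary layout: 8-byte symbol, 8-byte double, 8-byte unsigned timestamp.
# This approximates what a schema-based format (Protobuf, FlatBuffers) buys you.
as_struct = struct.pack("<8sdQ", tick["symbol"].encode(), tick["price"], tick["ts_ns"])

print(len(as_json), len(as_struct))   # the binary form is a fixed 24 bytes
```

Beyond size, the binary form also skips float-to-text parsing on the receive path, which is where most of the JSON CPU cost sits.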
Improvements I'm considering exploring in future iterations:
- Investigate more efficient serialization formats (e.g., Protobuf, FlatBuffers) to reduce message size and parsing overhead.
- This implementation assumes a single stock symbol, as it's common to maintain one order book per symbol. Expanding to multiple symbols would require more complex order book management and could introduce additional latency, but would make the system more realistic, so it's worth exploring.
- I also want to explore more advanced trading algorithms, such as statistical arbitrage or machine learning-based strategies, to see how they perform in this low-latency environment; right now the algorithm is as trivial as it gets (BUY if the current price is below the last price, SELL if it is above).
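For reference, that trivial strategy reduces to a few lines. This is a restatement of the rule described above, not the repo's exact code:

```python
def signal(last_price, price):
    """BUY on a down-tick, SELL on an up-tick, else do nothing."""
    if last_price is None or price == last_price:
        return None          # first tick, or price unchanged
    return "BUY" if price < last_price else "SELL"

# Walk a short hypothetical price path through the rule.
prices = [100.40, 100.38, 100.41, 100.41, 100.39]
last = None
signals = []
for p in prices:
    signals.append(signal(last, p))
    last = p
print(signals)   # [None, 'BUY', 'SELL', None, 'BUY']
```

This mean-reversion-flavored rule trades on every price change, which is exactly why it stresses the pipeline's latency path well even though it has no real alpha.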