Skip to content

alex-lt-kong/order-book-programming-problem

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Order Book Programming Problem

Table of contents

1. The Solution

1.1 Design

  • Three implementations of the OrderBook:
    • OrderBookArray: bid_prices[p] stores the bid orders at price p in cents, i.e., bid_prices[4412] stores the bid orders at price 44.12. The same applies to ask_prices.
      • Fast (O(1)) in order addition/amendment/removal;
      • Slow (O(N) where N is the depth of the order book) in order book traversal (as there are many holes);
      • Performs better if the price levels are close to each other (e.g., the product is liquid).
    • OrderBookBst: Each binary search tree node stores one bid/ask price level.
      • Relatively slow (O(logN)) in order addition/amendment/removal
      • Relatively fase (O(logN)) in order book traversal
      • Perform better if the pricer levels are sparse (e.g., the product is illiquid).
    • OrderBookStdMap: Uses std::map, which is STL's implementation of a binary search tree, mostly the same as OrderBookBst just slower lol
  • Three implementations of price level (i.e., a collection of orders with the same price): std::list (doubly-linked list), std::unordered_map(hash table) and std::vector

1.2 Build and run

  • Build

    mkdir build && cd build
    cmake ../  -DCMAKE_BUILD_TYPE=Release
    cmake --build . --config Release -j 10
  • Unzip pricer.in from ./assets/test-data/pricer.in.gz

  • Run (results are written to stdout and debug info is written to stderr)

    ./pricer-array 1 < ./pricer.in 1>./stdout1.log 2>./stderr1.log
    ./pricer-bst 200 < ./pricer.in 1>./stdout200.log 2>./stderr200.log
    ./pricer-std-map 10000 < ./pricer.in 1>./stdout10000.log 2>./stderr10000.log
  • Check outputs against test cases:

    • stdout1.log vs pricer.out.1 (perfectly matching)
    • stdout200.log vs pricer.out.200 (perfectly matching)
    • stdout10000.log vs pricer.out.10000 (there seems to be a precision issue that causes output to be off by ~0.001% for some lines)
  • The string representation of the order book (showing ten levels only)

Ask:
....
Level: 10, Price: 44.37, Size:   100, AccuSize:  1510, AccuVolume:  6680730, Orders: { { id:  pghe, size:  100 } }
Level:  9, Price: 44.31, Size:   100, AccuSize:  1410, AccuVolume:  6237030, Orders: { { id:  hwte, size:  100 } }
Level:  8, Price: 44.29, Size:   100, AccuSize:  1310, AccuVolume:  5793930, Orders: { { id:  qbgg, size:  100 } }
Level:  7, Price: 44.28, Size:   100, AccuSize:  1210, AccuVolume:  5351030, Orders: { { id:  plfg, size:  100 } }
Level:  6, Price: 44.25, Size:   400, AccuSize:  1110, AccuVolume:  4908230, Orders: { { id:  lfse, size:  100 },  { id:  ozaf, size:  100 },  { id:  mplg, size:  200 } }
Level:  5, Price: 44.23, Size:    10, AccuSize:   710, AccuVolume:  3138230, Orders: { { id:  ymye, size:   10 } }
Level:  4, Price: 44.22, Size:    50, AccuSize:   700, AccuVolume:  3094000, Orders: { { id:  bckg, size:   50 } }
Level:  3, Price: 44.21, Size:   100, AccuSize:   650, AccuVolume:  2872900, Orders: { { id:  vhcg, size:  100 } }
Level:  2, Price: 44.20, Size:   350, AccuSize:   550, AccuVolume:  2430800, Orders: { { id:  lbbf, size:  100 },  { id:  hwlg, size:  250 } }
Level:  1, Price: 44.19, Size:   200, AccuSize:   200, AccuVolume:   883800, Orders: { { id:  izlg, size:  100 },  { id:  xzlg, size:  100 } }
Bid:
Level:  1, Price: 44.17, Size:   156, AccuSize:   156, AccuVolume:   689052, Orders: { { id:  qzlg, size:   14 },  { id:  rzlg, size:   14 },  { id:  szlg, size:    7 },  { id:  tzlg, size:   21 },  { id:  wzlg, size:  100 } }
Level:  2, Price: 44.16, Size:   400, AccuSize:   556, AccuVolume:  2455452, Orders: { { id:  mxlg, size:  100 },  { id:  ezlg, size:  100 },  { id:  qwlg, size:  200 } }
Level:  3, Price: 44.15, Size:   550, AccuSize:  1106, AccuVolume:  4883702, Orders: { { id:  zrlg, size:  300 },  { id:  pzlg, size:  250 } }
Level:  4, Price: 44.14, Size:  1000, AccuSize:  2106, AccuVolume:  9297702, Orders: { { id:  urlg, size: 1000 } }
Level:  5, Price: 44.12, Size:    48, AccuSize:  2154, AccuVolume:  9509478, Orders: { { id:  ellg, size:   48 } }
Level:  6, Price: 44.05, Size:   100, AccuSize:  2254, AccuVolume:  9949978, Orders: { { id:  rtkg, size:  100 } }
Level:  7, Price: 44.04, Size:    10, AccuSize:  2264, AccuVolume:  9994018, Orders: { { id:  xrlg, size:   10 } }
Level:  8, Price: 44.03, Size:   200, AccuSize:  2464, AccuVolume: 10874618, Orders: { { id:  wvlg, size:  200 } }
Level:  9, Price: 43.99, Size:    29, AccuSize:  2493, AccuVolume: 11002189, Orders: { { id:  phcg, size:   29 } }
Level: 10, Price: 43.92, Size:   960, AccuSize:  3453, AccuVolume: 15218509, Orders: { { id:  sczf, size:  100 },  { id:  odzf, size:   10 },  { id:  yyyf, size:  100 },  { id:  yxzf, size:  150 },  { id:  umag, size:  500 },  { id:  pbgg, size:  100 } }
...

1.3 Performance benchmark

  • Prepare ramdisk and input data

    sudo mkdir /tmp/tmpfs
    sudo mount -o size=128M -t tmpfs none /tmp/tmpfs
    cp ./assets/test-data/pricer.in.gz /tmp/tmpfs/
    gzip -d /tmp/tmpfs/pricer.in.gz
  • Build with stdout/stderr turned off

    mkdir build && cd build
    cmake ../ -DBENCHMARK_PERFORMANCE=1 -DCMAKE_BUILD_TYPE=Release
    cmake --build . --config Release -j 10
  • Time the execution with minimal output

    > time ./pricer-bst 1 < /tmp/tmpfs/pricer.in
    ./pricer-bst exited gracefully, price changed 75335 time(s)
    
    real    0m0.666s
    user    0m0.630s
    sys     0m0.036s
    # ~1.8M orders/sec
    
    >  time ./pricer-array 200 < /tmp/tmpfs/pricer.in
    ./pricer-array exited gracefully, price changed 211539 time(s)
    
    real    0m0.551s
    user    0m0.526s
    sys     0m0.024s
    # ~2.2M orders/sec 
    
    > time ./pricer-bst 10000 < /tmp/tmpfs/pricer.in
    ./pricer-bst exited gracefully, price changed 777063 time(s)
    
    real    0m1.680s
    user    0m1.647s
    sys     0m0.032s
    # ~711K orders/sec

2. The Problem

  • The problem, originally called "Order Book Programming Problem", was organized by RGM Advisors (now part of DRW).

  • The problem's official website has been taken down since a while ago, the below duplicate is sourced from web.archive.org/

Background

Suppose your great aunt Gertrude dies and leaves you $3000 which you decide you want to invest in the Acme Internet Widget Company (stock symbol:AIWC). You are willing to pay up to $30 per share of AIWC. So you log in to your online trading account and enter a limit order: "BUY 100 AIWC @ $30". It's a limit order because that's most you're willing to pay. You'd be willing to pay less than $30, but not more than $30. Your order will sit around in the market until you get your 100 shares. A limit order to buy is called a "bid".

But you're not the only prospective buyer for AIWC stock. Others want to buy it too. One is bidding $31/share for 200 shares, while another is bidding $29/share for 300 shares. When Warren Buffett wants to sell 225 shares, he's obviously going to take the highest price he can get for each share. So he hits the $31 bid first, selling 200 shares. Then he sells his remaining 25 shares to you at $30/share. Your bid size reduced by 25, leaving 75 shares still to be bought.

Suppose you eventually get the full 100 shares at some price. Next year, you decide to buy a new computer and you need $4500 for it, and luckily the value of AIWC has appreciated by 50%. So you want to sell your 100 shares of AIWC stock for at least $45/share. So you enter this limit order: "SELL 100 AIWC @ $45". A limit order to sell is called an "ask".

But you're not the only prospective seller of AIWC stock. There's also an ask for $44/share and an ask for $46/share. If Alan Greenspan wants to buy AIWC, he's obviously going to pay as little as possible. So he'll take the $44 offer first, and only buy from you at $45 if he can't buy as much as he wants at $44.

The set of all standing bids and asks is called a "limit order book", or just a "book". You can buy a data feed from the stock market, and they will send you messages in real time telling you about changes to the book. Each message either adds an order to the book, or reduces the size of an order in the book (possibly removing the order entirely). You can record these messages in a log file, and later you can go back and analyze your log file.

Problem

Your task is to write a program, Pricer, that analyzes such a log file. Pricer takes one command-line argument: target-size. Pricer then reads a market data log on standard input. As the book is modified, Pricer prints (on standard output) the total expense you would incur if you bought target-size shares (by taking as many asks as necessary, lowest first), and the total income you would receive if you sold target-size shares (by hitting as many bids as necessary, highest first). Each time the income or expense changes, it prints the changed value.

Editor's Note: It is slightly confusing whether my Pricer will cause a change to the order book being maintained (i.e., if Pricer sells 200 shares, does it reduce the best bid's size by 200?). If you read through the Example Input and Output section below you can draw the conclusion that the answer is negative, Priceronly calculates such hypothetical price, but it does not impact the order book in any manner.

Input Format

The market data log contains one message per line (terminated by a single linefeed character, '\n'), and each message is a series of fields separated by spaces.

An "Add Order to Book" message looks like this:

timestamp A order-id side price size

Field Meaning
timestamp The time when this message was generated by the market, as milliseconds since midnight.
A A literal string identifying this as an "Add Order to Book" message.
order-id A unique string that subsequent "Reduce Order" messages will use to modify this order.
side A 'B' if this is a buy order (a bid), and a 'S' if this is a sell order (an ask).
price The limit price of this order.
size The size in shares of this order, when it was initially sent to the market by some stock trader.

A "Reduce Order" message looks like this:

timestamp R order-id size

Field Meaning
timestamp The time when this message was generated by the market, as milliseconds since midnight.
R A literal string identifying this as an "Reduce Order" message.
order-id The unique string that identifies the order to be reduced.
size The amount by which to reduce the size of the order. This is not the new size of the order. If size is equal to or greater than the existing size of the order, the order is removed from the book.

The log file messages are sorted by timestamp by the time Pricer receives them.

If Pricer encounters an error in an input message, it must print a warning to standard error and proceed to the next message.

Output Format

Pricer's output consists of one message per line in this format:

timestamp action total

Field Meaning
timestamp The timestamp from the input message that caused this output message to be generated.
action A string: 'B' if this message contains the new expense to buy target-size shares, and 'S' if this message contains the new income for selling target-size shares.
total

The total expense (if action is 'B') to buy target-size shares, or the total income (if action is 'S') for selling target-size shares.

If the book does not contain target-size shares in the appropriate type of order (asks for expense; bids for income), the total field contains the string 'NA'.

Example Input and Output

Here is an example run of Pricer with a target-size of 200. Input messages are in the left column. Notes and output messages are in the right column.

           Standard Input            Standard Output/Notes
28800538 A b S 44.26 100 No output yet because neither the bids nor the asks in the book have a total of 200 shares yet.
28800562 A c B 44.10 100 Still not enough shares on either side of the book.
28800744 R b 100 This reduces order 'b' to zero shares, which removes it from the book, so now the book contains no asks. But there's still no change to the total income or expense on 200 shares.
28800758 A d B 44.18 157 The bid sizes now total 257, which is more than the target size of 200. To sell 200 shares, you would first hit the bid at 44.18 for 157 shares, spending $6936.26. Then you would hit the bid at 44.10 for the remaining 43 shares, spending another $1896.30. Your total income would be $8832.56, so Pricer emits this message:
28800758 S 8832.56
28800773 A e S 44.38 100 The book now contains a single ask of size 100, which is still not enough to change the target size expense from 'NA'.
28800796 R d 157 This removes bid 'd' from the book, leaving just one bid with a size of 100 on the book, so the income from selling changes to 'NA':
28800796 S NA
28800812 A f B 44.18 157 This new bid brings the total bid size back over 200, so the selling income is no longer 'NA':
28800812 S 8832.56
28800974 A g S 44.27 100 This ask brings the total ask size up to 200, exactly the target size. The total expense for buying 200 shares would be 100 * $44.27 + 100 * $44.38:
28800974 B 8865.00
28800975 R e 100 Removing ask 'e' from the book leaves less than 200 shares on the ask side, so the buying expense changes back to 'NA':
28800975 B NA
28812071 R f 100 Reducing bid 'f' by 100 shares leaves only 157 shares on the bid side, so the selling income changes to 'NA':
28812071 S NA
28813129 A h B 43.68 50 This new bid makes it possible to sell 200 shares: 57 at $44.18, 100 at $44.10, and the last 43 at $43.68.
28813129 S 8806.50
28813300 R f 57 This removes bid 'f' from the book, so it is no longer possible to sell 200 shares:
28813300 S NA
28813830 A i S 44.18 100 This ask makes it possible to buy 200 shares again: 100 at $44.18 and 100 at $44.27.
28813830 B 8845.00
28814087 A j S 44.18 1000 This ask has the same price as an existing ask, and these two asks are tied for the best asking price. This means you could now buy all 200 shares at $44.18 instead of buying half of them at $44.27, so the buying expense decreases:
28814087 B 8836.00
28814834 R c 100 This leaves only 50 shares on the bid side (all in order 'h'), so it is still not possible to sell 200 shares. The selling income is therefore unchanged from 'NA' and Pricer prints no output message.
28814864 A k B 44.09 100 Only 150 shares on the bid side, so no output needed.
28815774 R k 100 Back to 50 shares on the bid side; still no output needed.
28815804 A l B 44.07 175 There are now more than 200 shares on the bid side. You could sell 175 shares at $44.07 each, and the remaining 25 shares at $43.68 each:
28815804 S 8804.25
28815937 R j 1000 After ask 'j' is removed from the book, you can still buy 200 shares: 100 at $44.18 each, and 100 at the worse price of $44.27.
28815937 B 8845.00
28816245 A m S 44.22 100 Since $44.22 is a better price than $44.27, the buying expense decreases:
28816245 B 8840.00

Note that the book initially contains no orders, and that the buying expense and selling income are both considered to start at 'NA'. Since Pricer only produces output when the income or expense changes, it does not print anything until the total size of all bids or the total size of all asks meets or exceeds target-size.

What We're Looking For

Please write the Pricer program in one of C, C++, Java, or Scala, and send us the source code. You do not need to send us any compiler output. We encourage you to take advantage of the language's standard libraries and latest versions. Please do not use other code that you would have to download separately from your chosen language's normal distribution package.

We're looking for evidence that you can produce code that others would be able to understand, fix, and extend. At the same time, we do have real-time requirements for production code, so we frown on gratuitous inefficiency. Here are some qualities we look for:

  • Correctness. Obviously, the fewer bugs you write, the fewer you'll have to fix.
  • Clarity. If only you can understand your code, then only you can maintain it. Good code speaks for itself - a good programmer should be able to understand your implementation details without extensive comments.
  • Conciseness. The less code you write, the less code your coworkers need to puzzle out.
  • Coefficiency. OK, just efficiency, but wouldn't it be cool if all of these qualities started with a 'C'? Anyway, using less time and space is generally better than otherwise.

Of course, there are often trade-offs between each of these properties (except correctness, which is pretty much an absolute).

Also, we are a Linux shop and we like Unix-style tools: programs that process the output of other programs. So make sure your implementation of Pricer is suitable for use in a shell pipeline. Follow the I/O specifications: don't mix prompts in with your output or demand that the input come from a disk file.

In addition to supplying us with your source code, please answer these questions:

  • How did you choose your implementation language?
  • What is the time complexity for processing an Add Order message?
  • What is the time complexity for processing a Reduce Order message?
  • If your implementation were put into production and found to be too slow, what ideas would you try out to improve its performance? (Other than reimplementing it in a different language such as C or C++.)

Test Data

You can download a large amount of test input data here: pricer.in.gz. This file is compressed using gzip. You should uncompress it before feeding it to your program. This is real market data, collected from a live system, so you might want to analyze it to decide what algorithms are most appropriate to use in Pricer.

You can also download the corresponding output of our reference implementation of Pricer with various target sizes: target size 1, target size 200, target size 10000. These files are also compressed with gzip.

In case you want to test your implementation on a smaller sample of data, here is a snippet in easy-to-cut-and-paste format:

28800538 A b S 44.26 100
28800562 A c B 44.10 100
28800744 R b 100
28800758 A d B 44.18 157
28800773 A e S 44.38 100
28800796 R d 157
28800812 A f B 44.18 157
28800974 A g S 44.27 100
28800975 R e 100
28812071 R f 100
28813129 A h B 43.68 50
28813300 R f 57
28813830 A i S 44.18 100
28814087 A j S 44.18 1000
28814834 R c 100
28814864 A k B 44.09 100
28815774 R k 100
28815804 A l B 44.07 175
28815937 R j 1000
28816245 A m S 44.22 100

And here is the corresponding output:

28800758 S 8832.56
28800796 S NA
28800812 S 8832.56
28800974 B 8865.00
28800975 B NA
28812071 S NA
28813129 S 8806.50
28813300 S NA
28813830 B 8845.00
28814087 B 8836.00
28815804 S 8804.25
28815937 B 8845.00
28816245 B 8840.00

3. Limited order book basics

3.1 What is a limited order book (LOB)

  • A limit order book is a record of outstanding limit orders. A limit order is a type of order to buy or sell a security at a specific price or better.[What Is a Limit Order Book?]

2.2 L1/L2/L3 market data

  • L1 market data: Provides more comprehensive by displaying the full depth of the order book. This includes multiple bid and ask prices at different levels, not just the best available prices [Market Data]

  • L2 market data: Provides more comprehensive by displaying the full depth of the order book. This includes multiple bid and ask prices at different levels, not just the best available prices. [Market Data] L2 data is also sometimes referred to as market-by-price (MBP), since the updates to book depth are usually keyed by price or price level. It may also be called market depth or depth of market ( DoM). [L2]

  • L3 market data: Level 3 (L3) refers to market data that provides every individual buy and sell order at every price level. This is often also the highest granularity of data available. L3 data is also called market by order or full order book data. [L3]

4. Factors to consider before implementing an LOB

  • To give some idea of the data volumes, the Nasdaq TotalView ITCH feed, which is every event in every instrument traded on the Nasdaq, can have data rates of 20+ gigabytes/day with spikes of 3 megabytes/second or more. The individual messages average about 20 bytes each so this means handling 100,000-200,000 messages per second during high volume periods. [How to Build a Fast Limit Order Book]

4.1 The business use case

  • Consider the below factors:

  • Markets/products to trade/liquidity impacts engineering decisions:

    • In illiquid options, there may be very few resting orders on the book, so it may be cheaper to just store everything in arrays and linearly walk through them.
    • In liquid futures, most events only affect a few hundred price levels, and price bands might give you a bound on levels that you actually care about, so it is possible to preallocate the levels in an array and represent index prices as an offset from some initial state in number of ticks. [Red Black Trees for Limit Order Book]
  • LOB's behavior can differ depending on whether you are implementing an LOB for equities (order-based, a.k.a. Market by Order, MBO or L3) [Market by Order (MBO)], [L3] or futures (level-based, a.k.a., Market by Price or MBP) [What is an efficient data structure to model order book?]

4.2 Engineering considerations

  • There are three main operations that an LOB has to implement: add, cancel, and execute. The goal is to implement these operations in O(1) time while making it possible for the trading model to efficiently ask questions like:

  • If you can, you should also optimize for the particular exchange. For instance, it turns out that, last I checked, Nasdaq generates order IDs incrementally starting from a small number, so you can store all the orders in a giant array instead of a hashtable. This is really cache- and TLB-friendly compared to a hashtable because most updates tend to happen to recently-dereferenced orders. [What is an efficient data structure to model order book?]

  • https://github.com/da-bao-jian/fast_limit_orderbook

5. Candidate data structures for LOB implementation

5.1 Binary search tree

  • A Binary Search Tree (BST) is a type of Binary Tree data structure, where the following properties must be true for any node "X" in the tree [1]:

    • The X node's left child and all of its descendants (children, children's children, and so on) have lower values than X's value.

    • The right child, and all its descendants have higher values than X's value.

    • Left and right subtrees must also be Binary Search Trees.

Common operations

  • Insert

  • Search (omitted for being too simple)

  • Delete

    • If the node to be deleted is a leaf (i.e., it has no children): omitted

    • If the node to be deleted has one child: omitted

    • If the node to be deleted has two children:

5.2 Heap

  • A Min-Heap is a Data Structure with the following properties.[2]
    • It is a complete Complete Binary Tree.
    • The value of the root node must be the smallest among all its descendant nodes and the same thing must be done for its left and right sub-tree also.

Common operations

  • Insert

  • Delete

About

My solution to RGM Advisors's Order Book Programming Problem

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published