This repository focuses on transforming OKX BTC-USDT Level 2 book dumps and trade archives into aligned, fixed-interval feature datasets ready for research or downstream model training. The heavy lifting is handled by the Rust binary `okx-features`, while the accompanying notebook (`feature_lab.ipynb`) is used for inspection and lightweight analysis.
- Rust toolchain: Install via rustup.rs. Once installed, run `rustup update` so that `cargo run` is available in your shell.
- OKX historical data: Download daily L2 archives (`BTC-USDT-L2orderbook-400lv-YYYY-MM-DD.tar.gz`) and trade archives (`BTC-USDT-trades-YYYY-MM-DD.zip`) from the OKX Historical Data portal. Place them under `btcusdt_l2/` and `btcusdt_trade/` respectively, one file per day (already the default layout in this repo).
- Python environment (optional): Only required if you plan to open `feature_lab.ipynb`. Install the packages listed in `requirements.txt`.
```
btcusdt_l2/
    BTC-USDT-L2orderbook-400lv-2025-10-01.tar.gz
    ...
btcusdt_trade/
    BTC-USDT-trades-2025-10-01.zip
    ...
rust/
    okx-features/   # Rust crate with the CLI described below
features/           # Suggested output directory (created automatically)
```
The order book archives contain NDJSON snapshots/updates, while the trade ZIP files contain CSV rows where each file spans 16:00 UTC (prev day) to 15:59:59 UTC (current day). The okx-features CLI understands this layout and stitches the correct windows automatically.
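The window arithmetic implied by that layout can be sketched in Python. The helper below is illustrative only (it is not part of the CLI) and assumes the file naming shown above:

```python
from datetime import date, timedelta


def trade_archives_for_day(day: date) -> list[str]:
    """Name the trade ZIPs whose rows can fall inside `day` (UTC).

    An archive named for day D covers 16:00 UTC on D-1 through
    15:59:59 UTC on D, so one full UTC day spans two archives:
    the one named for the day itself (00:00-15:59:59) and the one
    named for the following day (16:00-23:59:59).
    """
    pattern = "btcusdt_trade/BTC-USDT-trades-{:%Y-%m-%d}.zip"
    return [pattern.format(day), pattern.format(day + timedelta(days=1))]


print(trade_archives_for_day(date(2025, 10, 1)))
```

This is also why the last day in a requested range needs the following day's trade archive to be present on disk.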
Compile and run directly with Cargo:
```sh
cargo run -p okx-features -- \
  --start 2025-10-01 --end 2025-10-03 \
  --freq 1s --depth 20 --include-price \
  --l2-dir btcusdt_l2 --trade-dir btcusdt_trade \
  --format parquet --days-per-file 1 \
  --output-dir features
```

Key options:
- `--start`/`--end` (UTC days) define the inclusive date range.
- `--freq` controls the sampling interval (e.g., `1s`, `500ms`).
- `--depth` sets how many book levels per side are exported.
- `--include-price` also writes `ask_price_n`/`bid_price_n` columns for each level (sizes are always exported).
- `--format` chooses between `csv` and `parquet`. Both formats are written into the `--output-dir` directory.
- `--days-per-file` groups multiple days into a single chunk. `1` means one file per day; `3` would produce rolling 3-day files.
- The tool automatically handles trade warm-ups/carry-over to ensure VWAP/buy/sell volumes are continuous at day boundaries, and emits the final snapshot at exactly `t+1 00:00:00`.
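As a rough Python rendering of the VWAP convention above (the Rust implementation is the source of truth; this sketch only mirrors the documented `-1` sentinel):

```python
def interval_vwap(trades: list[tuple[float, float]]) -> float:
    """VWAP over one sampling interval, given (price, qty) trades.

    Follows the documented convention of returning -1 when no
    trades occurred in the window.
    """
    volume = sum(qty for _, qty in trades)
    if volume <= 0.0:
        return -1.0
    return sum(px * qty for px, qty in trades) / volume
```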
Every output row contains:
- `timestamp` (ms) and an ISO 8601 string
- Rolling VWAP, buy volume, and sell volume over the last sampling interval (VWAP is `-1` if no trades occurred in the window)
- Bid/ask sizes for the requested depth (`ask_size_1 ... ask_size_n`, `bid_size_1 ... bid_size_n`)
- When `--include-price` is set, bid/ask prices for each level (`ask_price_1 ... ask_price_n`, `bid_price_1 ... bid_price_n`)
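For example, with `--include-price` enabled, simple quote features can be derived from a single output row. The helper below is hypothetical (only the column names come from the schema above):

```python
def level1_features(row: dict) -> dict:
    """Derive mid-price, spread, and top-of-book imbalance from one row."""
    mid = (row["ask_price_1"] + row["bid_price_1"]) / 2.0
    spread = row["ask_price_1"] - row["bid_price_1"]
    depth = row["ask_size_1"] + row["bid_size_1"]
    imbalance = (row["bid_size_1"] - row["ask_size_1"]) / depth if depth else 0.0
    return {"mid": mid, "spread": spread, "imbalance_1": imbalance}
```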
CSV outputs create `features-YYYY-MM-DD-YYYY-MM-DD.csv` files inside the chosen directory. Parquet outputs follow the same naming convention but are Arrow-native `.parquet` files, so the directory can be treated as a partitioned dataset.
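Because each chunk's date range is embedded in the file name, a plain lexicographic sort is already chronological, so enumerating the dataset is a one-liner (sketch; the function name is illustrative):

```python
from pathlib import Path


def feature_chunks(output_dir: str, fmt: str = "parquet") -> list[Path]:
    """List per-chunk feature files in chronological order.

    Works because features-YYYY-MM-DD-YYYY-MM-DD.<ext> file names
    sort lexicographically in date order.
    """
    return sorted(Path(output_dir).glob(f"features-*.{fmt}"))
```

With pandas + pyarrow installed, something like `pd.concat(pd.read_parquet(p) for p in feature_chunks("features"))` then yields a single frame.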
Use the `okx-backtest` binary to stream raw L2 + trade archives into an `.npz` that matches hftbacktest's event dtype (`data[ev, exch_ts, local_ts, px, qty, order_id, ival, fval]`, ns timestamps):
```sh
cargo run -p okx-backtest -- \
  --start 2025-10-01 --end 2025-10-02 \
  --depth 50 --l2-dir btcusdt_l2 --trade-dir btcusdt_trade \
  --output-dir backtests --instrument BTC-USDT \
  --local-latency-ns 0ns --days-per-file 1
```

Notes:
- Trades are stitched the same way as in `okx-features`: the `end+1` trade archive is read so that the 16:00–23:59 slice of the last day is included; data before `start 00:00` is dropped.
- `--depth` truncates each message to the first N ask/bid levels (as published by OKX) to keep event volume manageable.
- `--local-latency-ns` shifts `local_ts` relative to `exch_ts` for latency/queue modelling; `--skip-trades` emits depth-only feeds if you only want book events.
- Output files are compressed NPZ by default; add `--uncompressed` if you prefer plain ZIP storage.
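A minimal sketch for loading the exported events back into NumPy. The 8-byte field widths below are an assumption (check hftbacktest's own dtype definition), and the archive is assumed to hold a single array:

```python
import numpy as np

# Assumed 8-byte field widths; verify against hftbacktest's event dtype.
EVENT_DTYPE = np.dtype([
    ("ev", "u8"), ("exch_ts", "i8"), ("local_ts", "i8"),
    ("px", "f8"), ("qty", "f8"), ("order_id", "u8"),
    ("ival", "i8"), ("fval", "f8"),
])


def load_events(path: str) -> np.ndarray:
    """Return the (single) event array stored in one .npz archive."""
    with np.load(path) as npz:
        return npz[npz.files[0]]
```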
- `feature_lab.ipynb` – Explore, visualize, or post-process the generated feature files. Update the file paths in the first cell if you saved the features somewhere other than `features/`.
- Running with `cargo run --release` (or building once with `cargo build --release`) significantly speeds up multi-day processing.
- If you only need CSV output, use `--format csv` and point `--output-dir` to a clean directory; the CLI will create it on demand.
- Keep the raw archives compressed: the `okx-features` binary streams from the `.tar.gz`/`.zip` files directly, so no intermediate unpacking is required.