This project was conducted as part of our participation in the EPFL course Financial Big Data (FIN-252), under the supervision of Prof. Damien Challet. It aims to demonstrate the application of big data techniques to high-frequency financial markets by implementing a complete pipeline from raw limit order book data to an executable pairs trading strategy. We process roughly 40 GB of nanosecond-level S&P100 data and transform it into minute-frequency panels (addressing computational efficiency, missing values, and estimation noise). We then perform and compare several time-series correlation clustering techniques (Leiden, Louvain, Marsili-Giada). Finally, we use our findings to build a functional pairs trading strategy, carefully avoiding any look-ahead bias or data leakage.
We have meticulously documented our entire analysis and results, along with interpretations and improvement suggestions, in the report FBD2025_report.pdf. Guidelines to replicate our results are stated at the end of this ReadMe file.
Our workflow comprises four phases:
Purpose: Transform raw high-frequency data into clean, analysis-ready format.
Key Steps:
- Load raw parquet files (~40 GB of tick data)
- Filter by date range (default: Sep-Dec 2008)
- Remove duplicates via VWAP aggregation
- Identify and exclude assets with incomplete coverage
- Resample to 1-minute intervals using VWAP
- Create unified panel data (timestamp × ticker format)
Main Files:
- main_preprocessing.py - Preprocessing execution script
- utils/preprocessing_utils/preprocessing_utils.py - Core preprocessing functions
Output: Clean panel data with 79 assets at 1-minute frequency
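The deduplication and 1-minute VWAP resampling steps above can be sketched as follows. This is a minimal illustration written in pandas rather than the project's Polars code; the column names (timestamp, price, volume) and the function resample_vwap are assumptions made for this example, not names taken from preprocessing_utils.py.

```python
import pandas as pd


def resample_vwap(ticks: pd.DataFrame, freq: str = "1min") -> pd.Series:
    """Aggregate ticks into volume-weighted average prices per interval.

    Assumes hypothetical columns: 'timestamp' (datetime), 'price', 'volume'.
    """
    df = ticks.set_index("timestamp").sort_index()
    # Sum notional (price * volume) and volume inside each interval.
    notional = (df["price"] * df["volume"]).resample(freq).sum()
    volume = df["volume"].resample(freq).sum()
    # Intervals without any traded volume have no defined VWAP and become NaN.
    return (notional / volume.where(volume > 0)).rename("vwap")


ticks = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2008-09-02 09:30:00.100",
                "2008-09-02 09:30:00.100",  # duplicated timestamp
                "2008-09-02 09:30:40.000",
                "2008-09-02 09:32:10.000",
            ]
        ),
        "price": [10.0, 12.0, 11.0, 20.0],
        "volume": [100, 100, 200, 50],
    }
)
vwap = resample_vwap(ticks)
print(vwap)
```

Because VWAP is a ratio of sums, ticks sharing a timestamp are merged by the same aggregation, so no separate deduplication pass is needed in this sketch. The empty 09:31 interval is left as NaN and is meant to be handled by the missing-value treatment of the next phase.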
Purpose: Handle missing values and prepare data structures for graph-based clustering algorithms.
Key Steps:
- Format stock prices
- Missing value treatment and NaN proportion visualizations
- Compute stock returns as percentage changes
Main Files:
- main_formatting.py - Formatting execution script
- utils/formatting_utils/formatting_utils.py - Data transformation utilities
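As a rough illustration of this phase, the snippet below computes the NaN proportion per ticker and simple returns as percentage changes on a toy panel. The tickers, the prices, and the choice of a one-step forward fill are assumptions for the example; the treatment actually used is described in the report and in formatting_utils.py.

```python
import numpy as np
import pandas as pd

# Toy minute-frequency price panel (timestamp x ticker); values are illustrative.
index = pd.date_range("2008-09-02 09:30", periods=5, freq="1min")
prices = pd.DataFrame(
    {
        "AAA": [100.0, 101.0, np.nan, 103.02, 103.02],
        "BBB": [50.0, np.nan, np.nan, np.nan, 51.0],
    },
    index=index,
)

# Share of missing observations per ticker.
nan_share = prices.isna().mean()

# Assumption: fill gaps of at most one minute with the last known price.
# A forward fill only uses past values, so it cannot leak future information.
filled = prices.ffill(limit=1)

# Simple returns as percentage changes. fill_method=None prevents pandas
# from forward-filling the remaining NaNs before differencing.
returns = filled.pct_change(fill_method=None)

print(nan_share)
print(returns)
```

Longer gaps stay NaN in the returns, which keeps stale prices from producing artificial zero returns that would bias the correlation estimates.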
Purpose: Identify pairs of co-moving stocks using community detection on correlation graphs.
Methods:
- Leiden Clustering - High-quality community detection
- Louvain Clustering - Fast modularity optimization
- Marsili-Giada Clustering - Correlation-based filtering method
Main Files:
- main_clustering.py - Clustering execution script
- utils/clustering_utils/Leiden_clustering.py - Leiden algorithm implementation
- utils/clustering_utils/Louvain_clustering.py - Louvain algorithm implementation
- utils/clustering_utils/Marsili_Giada_clustering.py - Marsili-Giada implementation
- utils/clustering_utils/Utils.py - Helper functions
- utils/clustering_utils/plots.py - Visualization utilities
- utils/clustering_utils/clustering_analysis_report.ipynb - Clustering analysis notebook
Output: Asset pairs with high statistical co-movement
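To make the idea of community detection on a correlation graph concrete, here is a small sketch using NetworkX's built-in Louvain routine on synthetic returns. It is not the code in Louvain_clustering.py: the synthetic two-factor data, the ticker names, and the 0.3 correlation threshold used to build edges are all assumptions chosen for the example.

```python
from itertools import combinations

import numpy as np
from networkx import Graph
from networkx.algorithms.community import louvain_communities

rng = np.random.default_rng(0)
n_obs = 2000
tickers = ["A1", "A2", "A3", "B1", "B2", "B3"]

# Two latent factors, each driving one group of three synthetic assets.
factors = rng.standard_normal((n_obs, 2))
loadings = np.repeat(np.eye(2), 3, axis=0)  # shape (6, 2)
returns = factors @ loadings.T + 0.5 * rng.standard_normal((n_obs, 6))

corr = np.corrcoef(returns, rowvar=False)

# Weighted graph: one node per asset, one edge per sufficiently correlated pair.
threshold = 0.3  # hypothetical cut-off, not taken from the project
graph = Graph()
graph.add_nodes_from(tickers)
for i, j in combinations(range(len(tickers)), 2):
    if corr[i, j] > threshold:
        graph.add_edge(tickers[i], tickers[j], weight=corr[i, j])

communities = louvain_communities(graph, weight="weight", seed=0)

# Within each community, keep the most correlated pair as a trading candidate.
position_of = {t: k for k, t in enumerate(tickers)}
pairs = [
    max(
        combinations(sorted(members), 2),
        key=lambda p: corr[position_of[p[0]], position_of[p[1]]],
    )
    for members in communities
    if len(members) > 1
]
print(communities, pairs)
```

Thresholding is only one way to sparsify the graph; working on the full weighted correlation matrix, or on a noise-filtered version of it, is equally possible and changes the communities found.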
Purpose: Implement statistical arbitrage strategies on periodically identified pairs and evaluate financial performance.
Main Files:
- main_trading.py - Trading execution script
- utils/trading_utils/trading_utils.py - Trading strategy logic
- utils/trading_utils/trading_visuals.py - Performance visualization and analysis
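The sketch below shows one common way to trade a pair without look-ahead bias: a rolling z-score on the log-price spread, with positions lagged by one bar before computing P&L. The function name backtest_pair, the 60-bar window, and the entry threshold of 2 are assumptions for illustration; the strategy actually implemented is in trading_utils.py and may differ.

```python
import numpy as np
import pandas as pd


def backtest_pair(log_a, log_b, window=60, entry=2.0):
    """Return (position, pnl) for a simple z-score spread strategy.

    Position is +1 (long A, short B), -1 (the reverse) or 0 (flat).
    """
    spread = log_a - log_b
    # Rolling statistics at bar t only use bars up to and including t.
    mean = spread.rolling(window).mean()
    std = spread.rolling(window).std()
    z = (spread - mean) / std

    # Short the spread when it is unusually high, long when unusually low.
    position = pd.Series(0.0, index=spread.index)
    position[z > entry] = -1.0
    position[z < -entry] = 1.0

    # A position decided at bar t earns the spread change from t to t + 1,
    # so the position is lagged by one bar before multiplying.
    pnl = (position.shift(1) * spread.diff()).fillna(0.0)
    return position, pnl


rng = np.random.default_rng(1)
idx = pd.date_range("2008-09-02 09:30", periods=500, freq="1min")
common = np.cumsum(rng.standard_normal(500)) * 0.001
log_b = pd.Series(common, index=idx)
log_a = pd.Series(common + 0.001 * rng.standard_normal(500), index=idx)

position, pnl = backtest_pair(log_a, log_b)
print(pnl.sum())
```

A quick sanity check for look-ahead is to perturb the data after some bar and confirm that positions and P&L before that bar are unchanged. Transaction costs and the bid-ask spread are ignored here.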
To reproduce our entire analysis and results, please follow the guidelines below closely.
- Download our code folder FBD2025. Place it in some high root (possibly the Desktop).
- Download the required initial data from Google Drive: https://drive.google.com/drive/folders/1xProHPN1YtKKkLh8917-R50KtgXmy_rO. This should download a folder named Data_parquet, containing 85 parquet files (1 per asset).
- Create a folder named FBD_local_data and place it in the same root as FBD2025 (again, possibly the Desktop). FBD2025 and FBD_local_data must be at the same hierarchy level.
- Place the downloaded Data_parquet folder inside FBD_local_data.
- You are ready to go! Additional outputs will be generated in FBD_local_data during execution.
To summarize, here is what your set-up should look like:
Desktop/                      # (or any high root)
├── FBD2025/                  # Folder with code files
│   ├── main.py
│   ├── utils.py
│   └── ...
│
└── FBD_local_data/           # Folder with initial parquet data
    └── Data_parquet/
        ├── AA.N.parquet
        ├── ...
        └── XRX.N.parquet
conda create -n finbigdata python=3.14
conda activate finbigdata
pip install polars numpy pandas matplotlib seaborn scipy networkx jupyter scikit-network community
jupyter notebook master_notebook.ipynb
FBD2025/
├── README.md
├── master_notebook.ipynb # Main pipeline execution
├── main_preprocessing.py # Preprocessing entry point
├── main_formatting.py # Formatting entry point
├── main_clustering.py # Clustering entry point
├── main_trading.py # Trading strategy entry point
└── utils/
    ├── preprocessing_utils/
    │   ├── preprocessing_utils.py
    │   └── datapreprocessing_pandas.py
    ├── formatting_utils/
    │   └── formatting_utils.py
    ├── clustering_utils/
    │   ├── Leiden_clustering.py
    │   ├── Louvain_clustering.py
    │   ├── Marsili_Giada_clustering.py
    │   ├── Utils.py
    │   ├── plots.py
    │   └── clustering_analysis_report.ipynb
    └── trading_utils/
        ├── trading_utils.py
        └── trading_visuals.py
- Assets: S&P100 constituents
- Period: January 2004 - December 2008 (full coverage for 79 assets)
- Frequency: Original tick data (~microsecond) → Resampled to 1-minute
- Size: ~41.48 GB raw data
- Format: Parquet files (one per asset)
- Timestamp coverage validation (2004-01-02 to 2008-12-31)
- Missing data analysis per asset per year
- Automatic exclusion of 7 assets with incomplete coverage (AA, MA, MS, NOV, PM, V, DVN)
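The coverage check above can be pictured with the following sketch, which flags tickers whose first or last observation falls inside the required window. The table of first and last dates and the ticker names are invented for the example; only the window bounds come from this README.

```python
import pandas as pd

# Required coverage window, as stated above.
START = pd.Timestamp("2004-01-02")
END = pd.Timestamp("2008-12-31")

# Hypothetical first/last observation dates per ticker.
coverage = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC"],
        "first": pd.to_datetime(["2004-01-02", "2006-05-25", "2004-01-02"]),
        "last": pd.to_datetime(["2008-12-31", "2008-12-31", "2007-06-29"]),
    }
)

# An asset is kept only if its data spans the whole window.
complete = (coverage["first"] <= START) & (coverage["last"] >= END)
kept = coverage.loc[complete, "ticker"].tolist()
excluded = coverage.loc[~complete, "ticker"].tolist()
print(kept, excluded)
```

Checking only the end points does not detect gaps in the middle of the sample, which is why a per-asset, per-year missing-data analysis is a useful complement.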
- Bid/ask prices and volumes
- Spread (ask - bid)
- Mid-price
- Volume imbalance (bid-ask pressure)
- High-frequency returns
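For reference, the quantities listed above can be derived from top-of-book quotes as in the sketch below. The column names and the sample quotes are assumptions, and the normalized definition of volume imbalance used here, (bid volume - ask volume) / (bid volume + ask volume), is one common convention rather than necessarily the one used in the project.

```python
import pandas as pd

# Illustrative top-of-book quotes; column names are hypothetical.
quotes = pd.DataFrame(
    {
        "bid": [99.0, 99.5, 100.0],
        "ask": [101.0, 100.5, 100.5],
        "bid_volume": [300, 100, 200],
        "ask_volume": [100, 100, 600],
    }
)

features = pd.DataFrame(
    {
        # Quoted spread: ask minus bid.
        "spread": quotes["ask"] - quotes["bid"],
        # Mid-price: average of best bid and best ask.
        "mid": (quotes["bid"] + quotes["ask"]) / 2,
        # Normalized imbalance in [-1, 1]; positive means more bid-side volume.
        "imbalance": (quotes["bid_volume"] - quotes["ask_volume"])
        / (quotes["bid_volume"] + quotes["ask_volume"]),
    }
)
# Returns computed on the mid-price.
features["mid_return"] = features["mid"].pct_change(fill_method=None)
print(features)
```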
- Data Processing: Polars (high-performance DataFrames)
- Clustering: NetworkX, python-igraph
- Visualization: Matplotlib, Seaborn
- Environment: Python 3.14, Jupyter notebooks
Dard Timothé, Pécaut Marius, Tordo Cyprien.
EPFL - Financial Big Data Course, 2025