
Commit f6a96b1

Add Hunting Anomalies in the Stock Market scripts
1 parent 53808f8 commit f6a96b1

5 files changed: +478 -0 lines changed

README.md (50 additions, 0 deletions)

# Hunting Anomalies in the Stock Market

This repository contains the scripts and data directories used in the [Hunting Anomalies in the Stock Market](https://polygon.io/blog/hunting-anomalies-in-stock-market/) tutorial, hosted on Polygon.io's blog. The tutorial demonstrates how to detect statistical anomalies in historical US stock market data through a workflow that downloads the data, builds a lookup table, queries it for anomalies, and visualizes them through a web interface.

### Prerequisites

- Python 3.8+
- Access to Polygon.io's historical data via Flat Files
- An active Polygon.io API key, obtainable by signing up for a Stocks paid plan

### Repository Contents

- `README.md`: This file, outlining setup and execution instructions.
- `aggregates_day`: Directory where downloaded CSV data files are stored.
- `build-lookup-table.py`: Python script that builds a lookup table from the historical data.
- `query-lookup-table.py`: Python script that queries the lookup table for anomalies.
- `gui-lookup-table.py`: Python script for a browser-based interface to explore anomalies visually.
### Running the Tutorial

1. **Ensure Python 3.8+ is installed:** Check your Python version and make sure the required libraries are available; see the install example below. Of the modules the scripts use, only `polygon-api-client` and `pandas` are third-party packages, while `pickle` and `argparse` ship with Python's standard library.
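A typical installation of the two third-party packages with pip:

```bash
pip install polygon-api-client pandas
```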
2. **Set up your API key:** Make sure you have an active paid Polygon.io Stocks subscription for accessing Flat Files. Set your API key in your environment (see the example below) or directly in the scripts where required.
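For instance, to keep the key out of the source, export it as an environment variable before running the scripts (the variable name `POLYGON_API_KEY` is an assumption here; use whichever name the scripts actually read):

```bash
export POLYGON_API_KEY="your_api_key_here"
```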
3. **Download Historical Data:** Use the MinIO client (`mc`) to download historical stock market data:

```bash
mc alias set s3polygon https://files.polygon.io YOUR_ACCESS_KEY YOUR_SECRET_KEY
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/08/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/09/ ./aggregates_day/
mc cp --recursive s3polygon/flatfiles/us_stocks_sip/day_aggs_v1/2024/10/ ./aggregates_day/
gunzip ./aggregates_day/*.gz
```

Adjust the commands and paths based on the data you're interested in.
4. **Build the Lookup Table:** This script processes the downloaded data and builds a lookup table, saving it both as `lookup_table.pkl` and as a human-readable `lookup_table.json`:

```bash
python build-lookup-table.py
```
5. **Query Anomalies:** Replace `2024-10-18` with the date you want to analyze for anomalies:

```bash
python query-lookup-table.py 2024-10-18
```
6. **Run the GUI:** Start the web interface, then visit `http://localhost:8888` in your browser to explore the anomalies visually:

```bash
python gui-lookup-table.py
```

For a complete step-by-step guide on each phase of the anomaly detection process, including additional configurations and troubleshooting, refer to the detailed [tutorial on our blog](https://polygon.io/blog/hunting-anomalies-in-stock-market).

Placeholder file in `aggregates_day` (1 addition, 0 deletions)

Download flat files into here.

build-lookup-table.py (94 additions, 0 deletions)

```python
import os
import pandas as pd
from collections import defaultdict
import pickle
import json

# Directory containing the daily CSV files
data_dir = './aggregates_day/'

# Initialize a dictionary to hold trades data
trades_data = defaultdict(list)

# List all CSV files in the directory
files = sorted([f for f in os.listdir(data_dir) if f.endswith('.csv')])

print("Starting to process files...")

# Process each file (assuming files are named in order)
for file in files:
    print(f"Processing {file}")
    file_path = os.path.join(data_dir, file)
    df = pd.read_csv(file_path)
    # For each stock, store the date and relevant data
    for _, row in df.iterrows():
        ticker = row['ticker']
        date = pd.to_datetime(row['window_start'], unit='ns').date()
        trades = row['transactions']
        close_price = row['close']  # Ensure 'close' column exists in your CSV
        trades_data[ticker].append({
            'date': date,
            'trades': trades,
            'close_price': close_price
        })

print("Finished processing files.")
print("Building lookup table...")

# Now, build the lookup table with rolling averages and percentage price change
lookup_table = defaultdict(dict)  # Nested dict: ticker -> date -> stats

for ticker, records in trades_data.items():
    # Convert records to DataFrame
    df_ticker = pd.DataFrame(records)
    # Sort records by date
    df_ticker.sort_values('date', inplace=True)
    df_ticker.set_index('date', inplace=True)

    # Calculate the percentage change in close_price
    df_ticker['price_diff'] = df_ticker['close_price'].pct_change() * 100  # Multiply by 100 for percentage

    # Shift trades to exclude the current day from rolling calculations
    df_ticker['trades_shifted'] = df_ticker['trades'].shift(1)
    # Calculate rolling average and standard deviation over the previous 5 days
    df_ticker['avg_trades'] = df_ticker['trades_shifted'].rolling(window=5).mean()
    df_ticker['std_trades'] = df_ticker['trades_shifted'].rolling(window=5).std()
    # Store the data in the lookup table
    for date, row in df_ticker.iterrows():
        # Convert date to string for JSON serialization
        date_str = date.strftime('%Y-%m-%d')
        # Ensure rolling stats are available
        if pd.notnull(row['avg_trades']) and pd.notnull(row['std_trades']):
            lookup_table[ticker][date_str] = {
                'trades': row['trades'],
                'close_price': row['close_price'],
                'price_diff': row['price_diff'],
                'avg_trades': row['avg_trades'],
                'std_trades': row['std_trades']
            }
        else:
            # Store data without rolling stats if not enough data points
            lookup_table[ticker][date_str] = {
                'trades': row['trades'],
                'close_price': row['close_price'],
                'price_diff': row['price_diff'],
                'avg_trades': None,
                'std_trades': None
            }

print("Lookup table built successfully.")

# Convert defaultdict to regular dict for JSON serialization
lookup_table = {k: v for k, v in lookup_table.items()}

# Save the lookup table to a JSON file
with open('lookup_table.json', 'w') as f:
    json.dump(lookup_table, f, indent=4)

print("Lookup table saved to 'lookup_table.json'.")

# Save the lookup table to a file for later use
with open('lookup_table.pkl', 'wb') as f:
    pickle.dump(lookup_table, f)

print("Lookup table saved to 'lookup_table.pkl'.")
```
