Given a dataset with:
- Multiple symbols (stocks/assets)
- Features: `f0` to `f25` (mostly constant intraday, varying by date)
- Target variable: `y` (intraday-varying, representing returns)
- Time dimensions: `date_id` and `time_id`

Objective: Predict future values of `y` with maximum accuracy and directional correctness.
- Target Variable Nature: `y` exhibits near-random-walk characteristics with values concentrated in [-0.5, 0.5]
- Price Reconstruction: A synthetic price can be constructed as `P_t = P_{t-1}(1 + y_t)` with base price `P_0 = 100`
- Feature Correlation: Feature `f25` shows strong correlation (ρ ≈ 0.8) with the constructed price series
- Direct Regression Limitations: Simple regression models overfit severely (train RMSE: 0.0022, test RMSE: 0.00917)
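The price reconstruction above fits in a few lines; `reconstruct_price` is an illustrative helper, not a function from the scripts:

```python
import numpy as np

def reconstruct_price(returns, base_price=100.0):
    """Build a synthetic price series P_t = P_{t-1} * (1 + y_t) from returns."""
    return base_price * np.cumprod(1.0 + np.asarray(returns, dtype=float))

# e.g. three periods of returns
prices = reconstruct_price([0.01, -0.005, 0.002])
```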
File: lstm_train_test.py
A deep learning approach using Long Short-Term Memory networks to capture temporal dependencies.
- Multi-layer LSTM (4 layers, 64 hidden units)
- Dropout: 0.2
- Sequence length: 45
- Output: Price predictions (returns derived post-hoc)
- Incremental Learning: Models are trained sequentially across symbols
- Parallel Data Loading: Utilizes multiprocessing for efficient data processing
- Lagged Features: 5 time-lagged versions of time-varying features
- Validation RMSE (Price): ~1.41
- Validation RMSE (Returns): ~0.00192
- Directional Accuracy: ~57%
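A minimal PyTorch sketch of the stated architecture (4 LSTM layers, 64 hidden units, dropout 0.2, sequence length 45); the class and attribute names are illustrative, not taken from lstm_train_test.py:

```python
import torch
import torch.nn as nn

class PriceLSTM(nn.Module):
    """Sketch: 4-layer LSTM, 64 hidden units, dropout 0.2, single price output."""
    def __init__(self, n_features, hidden_size=64, num_layers=4, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)
        self.head = nn.Linear(hidden_size, 1)  # price prediction; returns derived post-hoc

    def forward(self, x):              # x: (batch, seq_len=45, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # predict from the last time step
```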
python lstm_train_test.py

Requirements:
- PyTorch
- NumPy, Pandas
- scikit-learn
- tqdm, joblib
File: kalman_train_test.py
A two-stage probabilistic approach combining state-space modeling with residual learning.
Models the latent trend as a random walk with Gaussian noise:
z_t = A·z_{t-1} + w_t, w_t ~ N(0, Q)
y_t = C·z_t + v_t, v_t ~ N(0, R)
- State dimension: 3
- Transition matrix: Identity (A = I₃)
- Observation matrix: C = [1/3, 1/3, 1/3]
- Process noise: Q = 0.01·I₃
- Observation noise: R = 0.05
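With these parameters, one predict/update cycle of the filter looks like the sketch below (an illustrative helper, not the script's implementation):

```python
import numpy as np

# Stated parameters: 3-dim latent state, A = I, C = [1/3, 1/3, 1/3], Q = 0.01*I, R = 0.05
A = np.eye(3)
C = np.full((1, 3), 1.0 / 3.0)
Q = 0.01 * np.eye(3)
R = np.array([[0.05]])

def kalman_step(z, P, y):
    """One predict/update cycle; returns filtered state, covariance, and prediction."""
    # Predict: z_t = A z_{t-1} + w_t
    z_pred = A @ z
    P_pred = A @ P @ A.T + Q
    # Update against observation y_t = C z_t + v_t
    y_pred = C @ z_pred
    S = C @ P_pred @ C.T + R              # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    z_new = z_pred + K @ (np.array([[y]]) - y_pred)
    P_new = (np.eye(3) - K @ C) @ P_pred
    return z_new, P_new, float(y_pred[0, 0])
```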
LightGBM gradient boosting model predicts residuals:
ε_t = y_t - y_kalman_t
ŷ_t = y_kalman_t + ε̂_t
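A minimal sketch of this two-stage combination, written against any scikit-learn-compatible regressor in place of the scripts' LightGBM model; the function names are illustrative:

```python
import numpy as np

def fit_residual_model(model, X, y_true, y_kalman):
    """Stage 2: fit a regressor (e.g. lightgbm.LGBMRegressor) on Kalman residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_kalman)  # eps_t = y_t - y_kalman_t
    model.fit(X, residuals)
    return model

def predict_combined(model, X, y_kalman):
    """Final prediction: y_hat_t = y_kalman_t + eps_hat_t."""
    return np.asarray(y_kalman) + model.predict(X)
```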
- Parallel Processing: Multi-core processing for symbol-wise training
- State Persistence: Kalman states maintained across dates within symbols
- No Lookahead Bias: Predictions use only past information
- Validation RMSE: ~0.00085
- Directional Accuracy: ~63%
- Sign Accuracy: More consistent sign predictions, better suited to market-neutral strategies
python kalman_train_test.py

Requirements:
- LightGBM
- NumPy, Pandas
- statsmodels
- scikit-learn
- matplotlib, tqdm
Both approaches utilize extensive technical indicators derived from price series:
- KAMA (Kaufman Adaptive Moving Average): Efficiency Ratio for trend detection
- CHOP (Choppiness Index): Identifies sideways markets (threshold: 43.2)
- Johnny Ribbon: Multi-timeframe moving average regimes (5, 10, 15, 25, 40, 65, 105, 180 periods)
- RSI (Relative Strength Index): Overbought/oversold conditions (period: 10)
- CCI (Commodity Channel Index): Cyclical turning points (period: 20)
- CMO (Chande Momentum Oscillator): Centered momentum measure (period: 14)
- ATR (Average True Range): Volatility normalization (period: 14)
- Aroon Up/Down: Trend strength and direction (period: 20)
- EMA (Exponential Moving Average): Multiple timeframes for trend analysis
Note: All technical features are calculated from a price series constructed by treating `f0` as returns: `P_t = P_{t-1}(1 + f0_t)`
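As one example, RSI can be computed from the f0-derived price roughly as follows (a simple rolling-mean variant; Wilder's smoothing, which implementations often use, differs slightly):

```python
import numpy as np
import pandas as pd

def rsi(price: pd.Series, period: int = 10) -> pd.Series:
    """Relative Strength Index over `period` bars (rolling-mean variant)."""
    delta = price.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gain / loss.replace(0, np.nan)   # avoid division by zero
    return 100 - 100 / (1 + rs)

# Price built from f0 returns, as in the note above
f0 = pd.Series([0.01, -0.005, 0.002] * 10)
price = 100 * (1 + f0).cumprod()
```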
Both scripts split train.csv and test.csv into individual symbol files:
train_csvs/symbol_0.csv
train_csvs/symbol_1.csv
...
- Within time-slice ranking: Features ranked within each (date_id, time_id) group
- Standard scaling: Applied after ranking transformation
- Global scaler fitting: Trained on sample from multiple symbols
- Temporal split: 80% earliest dates for training, 20% latest for validation
- No shuffling: Maintains temporal ordering to prevent lookahead bias
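The ranking and temporal-split steps can be sketched as below (illustrative helpers; the standard scaling applied after ranking is omitted for brevity):

```python
import pandas as pd

def rank_within_slice(df: pd.DataFrame, feature_cols) -> pd.DataFrame:
    """Percentile-rank features within each (date_id, time_id) group."""
    out = df.copy()
    out[feature_cols] = (df.groupby(["date_id", "time_id"])[feature_cols]
                           .rank(pct=True))   # ranks in (0, 1]; scale afterwards
    return out

def temporal_split(df: pd.DataFrame, frac: float = 0.8):
    """Earliest `frac` of dates for training, remainder for validation; no shuffling."""
    cutoff = df["date_id"].quantile(frac)
    return df[df["date_id"] <= cutoff], df[df["date_id"] > cutoff]
```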
- submission.csv: Predicted returns for test set
- Model artifacts: Saved scalers and trained network
- kalman_model.pkl: Contains:
- Trained LightGBM model
- Feature scaler
- Symbol-wise Kalman states
- Feature column definitions
- Kalman parameters
- submission.csv: Final predictions
- prediction_distribution.png: Visualization of prediction distribution
.
├── train.csv # Original training data
├── test.csv # Original test data
├── train_csvs/ # Split training symbols
│ ├── symbol_0.csv
│ ├── symbol_1.csv
│ └── ...
├── test_csvs/ # Split test symbols
│ ├── symbol_0.csv
│ └── ...
├── lstm_train_test.py # LSTM approach
├── kalman_train_test.py # Kalman + GBM approach
├── submission.csv # Final predictions
├── kalman_model.pkl # Saved Kalman-GBM model
├── prediction_distribution.png # Prediction visualization
└── README.md # This file
| Parameter | Value |
|---|---|
| Sequence Length | 45 |
| Hidden Size | 64 |
| Number of Layers | 4 |
| Dropout | 0.2 |
| Learning Rate | 0.001 |
| Epochs | 40 |
| Batch Size | 512 |
| Lag Features | 5 |
| Parameter | Value |
|---|---|
| Latent Dimension | 3 |
| Process Noise (Q) | 0.01 |
| Observation Noise (R) | 0.05 |
| GBM Learning Rate | 0.01 |
| GBM Max Depth | 6 |
| GBM Num Leaves | 31 |
| Early Stopping Rounds | 300 |
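These settings map to a LightGBM parameter dict roughly as follows (a hedged sketch; key names follow LightGBM's documented parameters, and the early-stopping callback in the comment is one common way to apply the 300-round setting):

```python
# Sketch of the tabled Kalman-GBM hyperparameters as LightGBM params
gbm_params = {
    "objective": "regression",
    "learning_rate": 0.01,
    "max_depth": 6,
    "num_leaves": 31,
    "metric": "rmse",
}
# Early stopping after 300 rounds without validation improvement, e.g.:
# lgb.train(gbm_params, dtrain, valid_sets=[dvalid],
#           callbacks=[lgb.early_stopping(300)])
```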
| Metric | LSTM | Kalman-GBM |
|---|---|---|
| Validation RMSE (Returns) | 0.00192 | 0.00085 |
| Directional Accuracy | 57% | 63% |
| Approach | End-to-end deep learning | Probabilistic + residual learning |
| Training Time | Longer (GPU recommended) | Moderate (CPU parallel) |
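The two tabled metrics can be computed as below (illustrative helpers matching the usual definitions; exact-zero values compare via `np.sign`):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between realized and predicted returns."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def directional_accuracy(y_true, y_pred):
    """Fraction of predictions whose sign matches the realized return."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.sign(y_true) == np.sign(y_pred)))
```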
- Strengths:
- Captures complex temporal patterns
- End-to-end learning
- Good for long sequences
- Weaknesses:
- Requires more data
- Longer training time
- Less interpretable
- Strengths:
- Probabilistically grounded
- Better directional accuracy
- Interpretable components
- Handles random-walk nature explicitly
- Weaknesses:
- Assumes linear latent dynamics
- Two-stage training complexity
pip install numpy pandas scikit-learn matplotlib tqdm
pip install lightgbm statsmodels # For Kalman approach
pip install torch joblib # For LSTM approach

LSTM:
python lstm_train_test.py

Kalman-GBM:
python kalman_train_test.py

Both scripts will:
- Split raw data into symbol-specific files
- Engineer features
- Train models
- Generate `submission.csv`
The target variable y exhibits near-random-walk behavior:
y_t = y_{t-1} + η_t, η_t ~ N(0, σ²)
The Kalman approach models the latent state evolution:
p(y_t | y_{1:t-1}) = ∫ p(y_t | z_t) p(z_t | y_{1:t-1}) dz_t
This provides a smoothed estimate of returns while filtering high-frequency noise.
The gradient boosting stage models systematic deviations:
ε_t = y_t - ŷ_kalman_t
E[ε_t | F] = G(F)
where F represents engineered features capturing regime shifts and market conditions.