This repository contains two distinct approaches for forecasting market returns. The challenge involves predicting an intraday-varying target variable `y` (hypothesized to represent log returns) across multiple symbols using time-series data with 26 features.

who-else-but-arjun/market-forecasting

Problem Statement

Given a dataset with:

  • Multiple symbols (stocks/assets)
  • Features f0 to f25 (most constant intraday, varying by date)
  • Target variable y (intraday-varying, representing returns)
  • Time dimensions: date_id and time_id

Objective: Predict future values of y with maximum accuracy and directional correctness.

Key Insights

  1. Target Variable Nature: y exhibits near-random-walk characteristics with values concentrated in [-0.5, 0.5]
  2. Price Reconstruction: Synthetic price can be constructed as P_t = P_{t-1}(1 + y_t) with base price P_0 = 100
  3. Feature Correlation: Feature f25 shows strong correlation (ρ ≈ 0.8) with constructed price series
  4. Direct Regression Limitations: Simple regression models overfit severely (train RMSE: 0.0022, test RMSE: 0.00917)
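
The price reconstruction in insight 2 is a one-liner; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def reconstruct_price(y, base_price=100.0):
    """Build the synthetic price series P_t = P_{t-1} * (1 + y_t)
    with P_0 = base_price; returns P_1..P_T (P_0 is not included)."""
    return base_price * np.cumprod(1.0 + np.asarray(y, dtype=float))

prices = reconstruct_price([0.01, -0.005, 0.02])
```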

Approaches

Approach 1: LSTM-Based Time Series Modeling

File: lstm_train_test.py

A deep learning approach using Long Short-Term Memory networks to capture temporal dependencies.

Architecture

  • Multi-layer LSTM (4 layers, 64 hidden units)
  • Dropout: 0.2
  • Sequence length: 45
  • Output: Price predictions (returns derived post-hoc)
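
A minimal PyTorch sketch consistent with the architecture above (the class name, input size, and single-output head are assumptions, not the repository's exact code):

```python
import torch
import torch.nn as nn

class PriceLSTM(nn.Module):
    """4-layer LSTM, 64 hidden units, dropout 0.2, as listed above."""
    def __init__(self, n_features, hidden=64, layers=4, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, seq_len=45, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # predict the next-step price
```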

Key Features

  • Incremental Learning: Models are trained sequentially across symbols
  • Parallel Data Loading: Utilizes multiprocessing for efficient data processing
  • Lagged Features: 5 time-lagged versions of time-varying features
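
The lagged-feature construction could look like the following pandas sketch (the column-naming scheme is illustrative):

```python
import pandas as pd

def add_lagged_features(df, cols, n_lags=5):
    """Append n_lags shifted copies of each time-varying column
    (e.g. 'f0_lag1'), dropping the rows that lack full history."""
    out = df.copy()
    for col in cols:
        for lag in range(1, n_lags + 1):
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna().reset_index(drop=True)
```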

Performance

  • Validation RMSE (Price): ~1.41
  • Validation RMSE (Returns): ~0.00192
  • Directional Accuracy: ~57%
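
Directional accuracy here is the fraction of predictions whose sign matches the realized return; a sketch (the treatment of exact zeros is an assumption):

```python
import numpy as np

def directional_accuracy(y_true, y_pred):
    """Fraction of predictions with the correct sign."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.sign(y_true) == np.sign(y_pred)))
```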

Usage

python lstm_train_test.py

Requirements:

  • PyTorch
  • NumPy, Pandas
  • scikit-learn
  • tqdm, joblib

Approach 2: Kalman Filter + Gradient Boosting (Hybrid Model)

File: kalman_train_test.py

A two-stage probabilistic approach combining state-space modeling with residual learning.

Stage 1: Kalman Filter

Models the latent trend as a random walk with Gaussian noise:

z_t = A·z_{t-1} + w_t,  w_t ~ N(0, Q)
y_t = C·z_t + v_t,      v_t ~ N(0, R)

  • State dimension: 3
  • Transition matrix: Identity (A = I₃)
  • Observation matrix: C = [1/3, 1/3, 1/3]
  • Process noise: Q = 0.01·I₃
  • Observation noise: R = 0.05

Stage 2: Residual Learning

LightGBM gradient boosting model predicts residuals:

ε_t = y_t - y_kalman_t
ŷ_t = y_kalman_t + ε̂_t
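
The two-stage correction can be illustrated on toy data; here scikit-learn's GradientBoostingRegressor stands in for LightGBM, and all data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for LightGBM

# Toy data: y deviates from the Kalman estimate in a feature-dependent way
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y_kalman = rng.normal(scale=0.01, size=500)   # stage-1 estimates (synthetic)
y = y_kalman + 0.005 * X[:, 0]                # systematic residual

# Stage 2: learn the residual eps_t = y_t - y_kalman_t from features
resid_model = GradientBoostingRegressor(learning_rate=0.01, max_depth=6)
resid_model.fit(X, y - y_kalman)

# Final prediction: Kalman estimate plus predicted residual
y_hat = y_kalman + resid_model.predict(X)
```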

Key Features

  • Parallel Processing: Multi-core processing for symbol-wise training
  • State Persistence: Kalman states maintained across dates within symbols
  • No Lookahead Bias: Predictions use only past information

Performance

  • Validation RMSE: ~0.00085
  • Directional Accuracy: ~63%
  • Sign Accuracy: Better-balanced sign predictions, suited to market-neutral strategies

Usage

python kalman_train_test.py

Requirements:

  • LightGBM
  • NumPy, Pandas
  • statsmodels
  • scikit-learn
  • matplotlib, tqdm

Feature Engineering

Both approaches utilize extensive technical indicators derived from price series:

Regime Identification

  • KAMA (Kaufman Adaptive Moving Average): Efficiency Ratio for trend detection
  • CHOP (Choppiness Index): Identifies sideways markets (threshold: 43.2)
  • Johnny Ribbon: Multi-timeframe moving average regimes (5, 10, 15, 25, 40, 65, 105, 180 periods)

Momentum & Oscillators

  • RSI (Relative Strength Index): Overbought/oversold conditions (period: 10)
  • CCI (Commodity Channel Index): Cyclical turning points (period: 20)
  • CMO (Chande Momentum Oscillator): Centered momentum measure (period: 14)

Volatility

  • ATR (Average True Range): Volatility normalization (period: 14)

Trend

  • Aroon Up/Down: Trend strength and direction (period: 20)
  • EMA (Exponential Moving Average): Multiple timeframes for trend analysis

Note: All technical features are calculated from a synthetic price series that treats f0 as a return: P_t = P_{t-1}(1 + f0_t)
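
As an example of one such indicator, a Wilder-smoothed RSI (period 10, as above) computed on the synthetic price series; this is a common RSI variant and may differ in detail from the scripts' implementation:

```python
import pandas as pd

def rsi(price: pd.Series, period: int = 10) -> pd.Series:
    """RSI with Wilder-style exponential smoothing of gains and losses."""
    delta = price.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / period, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / period, adjust=False).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)
```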


Data Preprocessing

Symbol Splitting

Both scripts split train.csv and test.csv into individual symbol files:

train_csvs/symbol_0.csv
train_csvs/symbol_1.csv
...
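
The symbol split amounts to a pandas groupby; a sketch (the 'symbol' column name is an assumption):

```python
from pathlib import Path
import pandas as pd

def split_by_symbol(csv_path, out_dir):
    """Write one CSV per symbol into out_dir (e.g. train_csvs/symbol_0.csv)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    df = pd.read_csv(csv_path)
    for sym, group in df.groupby("symbol"):
        group.to_csv(Path(out_dir) / f"symbol_{sym}.csv", index=False)
```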

Feature Normalization

  1. Within time-slice ranking: Features ranked within each (date_id, time_id) group
  2. Standard scaling: Applied after ranking transformation
  3. Global scaler fitting: Trained on sample from multiple symbols
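
A sketch of steps 1–2 (for brevity the scaler is fit on the same frame, whereas the scripts fit a global scaler on a sample drawn from multiple symbols):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def rank_normalize(df, feature_cols):
    """Rank each feature within its (date_id, time_id) slice,
    then standard-scale the ranked values."""
    ranked = df.groupby(["date_id", "time_id"])[feature_cols].rank(pct=True)
    scaler = StandardScaler().fit(ranked)
    out = df.copy()
    out[feature_cols] = scaler.transform(ranked)
    return out, scaler
```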

Train-Test Split

  • Temporal split: 80% earliest dates for training, 20% latest for validation
  • No shuffling: Maintains temporal ordering to prevent lookahead bias
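
The temporal split can be sketched as follows (assumes a 'date_id' column, per the problem statement):

```python
import pandas as pd

def temporal_split(df, frac=0.8):
    """Earliest frac of date_ids for training, the rest for validation.
    No shuffling, so no lookahead leakage across the boundary."""
    dates = sorted(df["date_id"].unique())
    cutoff = dates[int(len(dates) * frac)]
    return df[df["date_id"] < cutoff], df[df["date_id"] >= cutoff]
```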

Model Outputs

LSTM Model

  • submission.csv: Predicted returns for test set
  • Model artifacts: Saved scalers and trained network

Kalman-GBM Model

  • kalman_model.pkl: Contains:
    • Trained LightGBM model
    • Feature scaler
    • Symbol-wise Kalman states
    • Feature column definitions
    • Kalman parameters
  • submission.csv: Final predictions
  • prediction_distribution.png: Visualization of prediction distribution

Directory Structure

.
├── train.csv                          # Original training data
├── test.csv                           # Original test data
├── train_csvs/                        # Split training symbols
│   ├── symbol_0.csv
│   ├── symbol_1.csv
│   └── ...
├── test_csvs/                         # Split test symbols
│   ├── symbol_0.csv
│   └── ...
├── lstm_train_test.py                 # LSTM approach
├── kalman_train_test.py               # Kalman + GBM approach
├── submission.csv                     # Final predictions
├── kalman_model.pkl                   # Saved Kalman-GBM model
├── prediction_distribution.png        # Prediction visualization
└── README.md                          # This file

Hyperparameters

LSTM Model

| Parameter        | Value |
|------------------|-------|
| Sequence Length  | 45    |
| Hidden Size      | 64    |
| Number of Layers | 4     |
| Dropout          | 0.2   |
| Learning Rate    | 0.001 |
| Epochs           | 40    |
| Batch Size       | 512   |
| Lag Features     | 5     |

Kalman-GBM Model

| Parameter             | Value |
|-----------------------|-------|
| Latent Dimension      | 3     |
| Process Noise (Q)     | 0.01  |
| Observation Noise (R) | 0.05  |
| GBM Learning Rate     | 0.01  |
| GBM Max Depth         | 6     |
| GBM Num Leaves        | 31    |
| Early Stopping Rounds | 300   |

Performance Comparison

| Metric                    | LSTM                     | Kalman-GBM                       |
|---------------------------|--------------------------|----------------------------------|
| Validation RMSE (Returns) | 0.00192                  | 0.00085                          |
| Directional Accuracy      | 57%                      | 63%                              |
| Approach                  | End-to-end deep learning | Probabilistic + residual learning |
| Training Time             | Longer (GPU recommended) | Moderate (CPU parallel)          |

Key Differences Between Approaches

LSTM Approach

  • Strengths:
    • Captures complex temporal patterns
    • End-to-end learning
    • Good for long sequences
  • Weaknesses:
    • Requires more data
    • Longer training time
    • Less interpretable

Kalman-GBM Approach

  • Strengths:
    • Probabilistically grounded
    • Better directional accuracy
    • Interpretable components
    • Handles random-walk nature explicitly
  • Weaknesses:
    • Assumes linear latent dynamics
    • Two-stage training complexity

Running the Code

Prerequisites

pip install numpy pandas scikit-learn matplotlib tqdm
pip install lightgbm statsmodels  # For Kalman approach
pip install torch joblib           # For LSTM approach

Training and Prediction

LSTM:

python lstm_train_test.py

Kalman-GBM:

python kalman_train_test.py

Both scripts will:

  1. Split raw data into symbol-specific files
  2. Engineer features
  3. Train models
  4. Generate submission.csv

Theoretical Foundation

Random Walk Hypothesis

The target variable y exhibits near-random-walk behavior:

y_t = y_{t-1} + η_t,  η_t ~ N(0, σ²)

Kalman Filter Formulation

The Kalman approach models the latent state evolution:

p(y_t | y_{1:t-1}) = ∫ p(y_t | z_t) p(z_t | y_{1:t-1}) dz_t

This provides a smoothed estimate of returns while filtering high-frequency noise.

Residual Learning

The gradient boosting stage models systematic deviations:

ε_t = y_t - ŷ_kalman_t
E[ε_t | F] = G(F)

where F represents engineered features capturing regime shifts and market conditions.

