Given a dataset with:
- Multiple symbols (stocks/assets)
- Features: `f0` to `f25` (mostly constant intraday, varying by date)
- Target variable: `y` (intraday-varying, representing returns)
- Time dimensions: `date_id` and `time_id`

Objective: Predict future values of `y` with maximum accuracy and directional correctness.
- Target Variable Nature: `y` exhibits near-random-walk characteristics with values concentrated in [-0.5, 0.5]
- Price Reconstruction: A synthetic price can be constructed as `P_t = P_{t-1}(1 + y_t)` with base price `P_0 = 100`
- Feature Correlation: Feature `f25` shows strong correlation (ρ ≈ 0.8) with the constructed price series
- Direct Regression Limitations: Simple regression models overfit severely (train RMSE: 0.0022, test RMSE: 0.00917)
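The price reconstruction above fits in a few lines; `reconstruct_price` is an illustrative helper, not a function from the scripts:

```python
import numpy as np

def reconstruct_price(returns, base_price=100.0):
    """Build a synthetic price series P_t = P_{t-1} * (1 + y_t) from returns."""
    return base_price * np.cumprod(1.0 + np.asarray(returns, dtype=float))

# e.g. three periods of returns
prices = reconstruct_price([0.01, -0.005, 0.002])
```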
File: lstm_train_test.py
A deep learning approach using Long Short-Term Memory networks to capture temporal dependencies.
- Multi-layer LSTM (4 layers, 64 hidden units)
- Dropout: 0.2
- Sequence length: 45
- Output: Price predictions (returns derived post-hoc)
- Incremental Learning: Models are trained sequentially across symbols
- Parallel Data Loading: Utilizes multiprocessing for efficient data processing
- Lagged Features: 5 time-lagged versions of time-varying features
- Validation RMSE (Price): ~1.41
- Validation RMSE (Returns): ~0.00192
- Directional Accuracy: ~57%
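A minimal PyTorch sketch of the stated architecture (4 LSTM layers, 64 hidden units, dropout 0.2, sequence length 45); the class and attribute names are illustrative, not taken from lstm_train_test.py:

```python
import torch
import torch.nn as nn

class PriceLSTM(nn.Module):
    """Sketch: 4-layer LSTM, 64 hidden units, dropout 0.2, single price output."""
    def __init__(self, n_features, hidden_size=64, num_layers=4, dropout=0.2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers,
                            batch_first=True, dropout=dropout)
        self.head = nn.Linear(hidden_size, 1)  # price prediction; returns derived post-hoc

    def forward(self, x):              # x: (batch, seq_len=45, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # predict from the last time step
```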
python lstm_train_test.py

Requirements:
- PyTorch
- NumPy, Pandas
- scikit-learn
- tqdm, joblib
File: kalman_train_test.py
A two-stage probabilistic approach combining state-space modeling with residual learning.
Models the latent trend as a random walk with Gaussian noise:
z_t = A·z_{t-1} + w_t, w_t ~ N(0, Q)
y_t = C·z_t + v_t, v_t ~ N(0, R)
- State dimension: 3
- Transition matrix: Identity (A = I₃)
- Observation matrix: C = [1/3, 1/3, 1/3]
- Process noise: Q = 0.01·I₃
- Observation noise: R = 0.05
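With these parameters, one predict/update cycle of the filter looks like the sketch below (an illustrative helper, not the script's implementation):

```python
import numpy as np

# Stated parameters: 3-dim latent state, A = I, C = [1/3, 1/3, 1/3], Q = 0.01*I, R = 0.05
A = np.eye(3)
C = np.full((1, 3), 1.0 / 3.0)
Q = 0.01 * np.eye(3)
R = np.array([[0.05]])

def kalman_step(z, P, y):
    """One predict/update cycle; returns filtered state, covariance, and prediction."""
    # Predict: z_t = A z_{t-1} + w_t
    z_pred = A @ z
    P_pred = A @ P @ A.T + Q
    # Update against observation y_t = C z_t + v_t
    y_pred = C @ z_pred
    S = C @ P_pred @ C.T + R              # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    z_new = z_pred + K @ (np.array([[y]]) - y_pred)
    P_new = (np.eye(3) - K @ C) @ P_pred
    return z_new, P_new, float(y_pred[0, 0])
```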
LightGBM gradient boosting model predicts residuals:
ε_t = y_t - y_kalman_t
ŷ_t = y_kalman_t + ε̂_t
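A minimal sketch of this two-stage combination, written against any scikit-learn-compatible regressor in place of the scripts' LightGBM model; the function names are illustrative:

```python
import numpy as np

def fit_residual_model(model, X, y_true, y_kalman):
    """Stage 2: fit a regressor (e.g. lightgbm.LGBMRegressor) on Kalman residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_kalman)  # eps_t = y_t - y_kalman_t
    model.fit(X, residuals)
    return model

def predict_combined(model, X, y_kalman):
    """Final prediction: y_hat_t = y_kalman_t + eps_hat_t."""
    return np.asarray(y_kalman) + model.predict(X)
```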
- Parallel Processing: Multi-core processing for symbol-wise training
- State Persistence: Kalman states maintained across dates within symbols
- No Lookahead Bias: Predictions use only past information
- Validation RMSE: ~0.00085
- Directional Accuracy: ~63%
- Sign Accuracy: More consistent sign predictions, better suited to market-neutral strategies
python kalman_train_test.py

Requirements:
- LightGBM
- NumPy, Pandas
- statsmodels
- scikit-learn
- matplotlib, tqdm
Both approaches utilize extensive technical indicators derived from price series:
- KAMA (Kaufman Adaptive Moving Average): Efficiency Ratio for trend detection
- CHOP (Choppiness Index): Identifies sideways markets (threshold: 43.2)
- Johnny Ribbon: Multi-timeframe moving average regimes (5, 10, 15, 25, 40, 65, 105, 180 periods)
- RSI (Relative Strength Index): Overbought/oversold conditions (period: 10)
- CCI (Commodity Channel Index): Cyclical turning points (period: 20)
- CMO (Chande Momentum Oscillator): Centered momentum measure (period: 14)
- ATR (Average True Range): Volatility normalization (period: 14)
- Aroon Up/Down: Trend strength and direction (period: 20)
- EMA (Exponential Moving Average): Multiple timeframes for trend analysis
Note: All technical features are calculated from a price series constructed by treating `f0` as returns: `P_t = P_{t-1}(1 + f0_t)`
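As one example, RSI can be computed from the f0-derived price roughly as follows (a simple rolling-mean variant; Wilder's smoothing, which implementations often use, differs slightly):

```python
import numpy as np
import pandas as pd

def rsi(price: pd.Series, period: int = 10) -> pd.Series:
    """Relative Strength Index over `period` bars (rolling-mean variant)."""
    delta = price.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gain / loss.replace(0, np.nan)   # avoid division by zero
    return 100 - 100 / (1 + rs)

# Price built from f0 returns, as in the note above
f0 = pd.Series([0.01, -0.005, 0.002] * 10)
price = 100 * (1 + f0).cumprod()
```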
Both scripts split train.csv and test.csv into individual symbol files:
train_csvs/symbol_0.csv
train_csvs/symbol_1.csv
...
- Within time-slice ranking: Features ranked within each (date_id, time_id) group
- Standard scaling: Applied after ranking transformation
- Global scaler fitting: Trained on sample from multiple symbols
- Temporal split: 80% earliest dates for training, 20% latest for validation
- No shuffling: Maintains temporal ordering to prevent lookahead bias
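The ranking and temporal-split steps can be sketched as below (illustrative helpers; the standard scaling applied after ranking is omitted for brevity):

```python
import pandas as pd

def rank_within_slice(df: pd.DataFrame, feature_cols) -> pd.DataFrame:
    """Percentile-rank features within each (date_id, time_id) group."""
    out = df.copy()
    out[feature_cols] = (df.groupby(["date_id", "time_id"])[feature_cols]
                           .rank(pct=True))   # ranks in (0, 1]; scale afterwards
    return out

def temporal_split(df: pd.DataFrame, frac: float = 0.8):
    """Earliest `frac` of dates for training, remainder for validation; no shuffling."""
    cutoff = df["date_id"].quantile(frac)
    return df[df["date_id"] <= cutoff], df[df["date_id"] > cutoff]
```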
- submission.csv: Predicted returns for test set
- Model artifacts: Saved scalers and trained network
- kalman_model.pkl: Contains:
- Trained LightGBM model
- Feature scaler
- Symbol-wise Kalman states
- Feature column definitions
- Kalman parameters
- submission.csv: Final predictions
- prediction_distribution.png: Visualization of prediction distribution
.
├── train.csv # Original training data
├── test.csv # Original test data
├── train_csvs/ # Split training symbols
│ ├── symbol_0.csv
│ ├── symbol_1.csv
│ └── ...
├── test_csvs/ # Split test symbols
│ ├── symbol_0.csv
│ └── ...
├── lstm_train_test.py # LSTM approach
├── kalman_train_test.py # Kalman + GBM approach
├── submission.csv # Final predictions
├── kalman_model.pkl # Saved Kalman-GBM model
├── prediction_distribution.png # Prediction visualization
└── README.md # This file
| Parameter | Value |
|---|---|
| Sequence Length | 45 |
| Hidden Size | 64 |
| Number of Layers | 4 |
| Dropout | 0.2 |
| Learning Rate | 0.001 |
| Epochs | 40 |
| Batch Size | 512 |
| Lag Features | 5 |
| Parameter | Value |
|---|---|
| Latent Dimension | 3 |
| Process Noise (Q) | 0.01 |
| Observation Noise (R) | 0.05 |
| GBM Learning Rate | 0.01 |
| GBM Max Depth | 6 |
| GBM Num Leaves | 31 |
| Early Stopping Rounds | 300 |
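These settings map to a LightGBM parameter dict roughly as follows (a hedged sketch; key names follow LightGBM's documented parameters, and the early-stopping callback in the comment is one common way to apply the 300-round setting):

```python
# Sketch of the tabled Kalman-GBM hyperparameters as LightGBM params
gbm_params = {
    "objective": "regression",
    "learning_rate": 0.01,
    "max_depth": 6,
    "num_leaves": 31,
    "metric": "rmse",
}
# Early stopping after 300 rounds without validation improvement, e.g.:
# lgb.train(gbm_params, dtrain, valid_sets=[dvalid],
#           callbacks=[lgb.early_stopping(300)])
```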
| Metric | LSTM | Kalman-GBM |
|---|---|---|
| Validation RMSE (Returns) | 0.00192 | 0.00085 |
| Directional Accuracy | 57% | 63% |
| Approach | End-to-end deep learning | Probabilistic + residual learning |
| Training Time | Longer (GPU recommended) | Moderate (CPU parallel) |
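The two tabled metrics can be computed as below (illustrative helpers matching the usual definitions; exact-zero values compare via `np.sign`):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between realized and predicted returns."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def directional_accuracy(y_true, y_pred):
    """Fraction of predictions whose sign matches the realized return."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.sign(y_true) == np.sign(y_pred)))
```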
- Strengths:
- Captures complex temporal patterns
- End-to-end learning
- Good for long sequences
- Weaknesses:
- Requires more data
- Longer training time
- Less interpretable
- Strengths:
- Probabilistically grounded
- Better directional accuracy
- Interpretable components
- Handles random-walk nature explicitly
- Weaknesses:
- Assumes linear latent dynamics
- Two-stage training complexity
pip install numpy pandas scikit-learn matplotlib tqdm
pip install lightgbm statsmodels # For Kalman approach
pip install torch joblib # For LSTM approach

LSTM:
python lstm_train_test.py

Kalman-GBM:
python kalman_train_test.py

Both scripts will:
- Split raw data into symbol-specific files
- Engineer features
- Train models
- Generate `submission.csv`
The target variable y exhibits near-random-walk behavior:
y_t = y_{t-1} + η_t, η_t ~ N(0, σ²)
The Kalman approach models the latent state evolution:
p(y_t | y_{1:t-1}) = ∫ p(y_t | z_t) p(z_t | y_{1:t-1}) dz_t
This provides a smoothed estimate of returns while filtering high-frequency noise.
The gradient boosting stage models systematic deviations:
ε_t = y_t - ŷ_kalman_t
E[ε_t | F] = G(F)
where F represents engineered features capturing regime shifts and market conditions.