This repository presents an end‑to‑end research pipeline for stock trend prediction that fuses price/technical indicators with NLP signals derived from financial tweets. It supports data preprocessing, feature engineering (including sector and meta‑features), model benchmarking (binary, regression, and multi‑class/ordinal), and investment simulation with an optional early‑exit filter under realistic constraints.
Goal: Predict short-/medium-horizon price direction/returns and evaluate decision‑making performance using a constrained trading simulation.
Core idea: Combine market micro‑signals (technical indicators) with textual sentiment/emotion/stance information to improve predictive signal quality.
- Reliability‑aware meta‑features improved trading outcomes, not just predictive accuracy.
- Longer‑horizon classification models produced the most useful trading signals in simulation.
- Best reported investment result: a 12.47% return over 100 trading days using the binary meta‑feature system with MPT allocation.
- The work argues that economic performance and calibration are essential evaluation criteria, not accuracy alone.
- A structured multi‑stage pipeline from raw data to portfolio simulation.
- Multiple NLP feature streams (sentiment, emotion, stance, FinBERT) aligned to daily ticker data.
- Comparative benchmarking across feature sets and model families.
- Meta‑feature reliability modeling and an early‑exit mechanism to improve decision stability.
- Simulation framework enforcing allocation limits, diversification, and probabilistic sizing.
- data-pre-processing/: data ingestion and cleaning notebooks.
- feat-engineering/: NLP feature generation, technical indicators, sector/meta features.
- meta-features/: auxiliary meta-feature and early-exit signal notebooks.
- benchmarking/: model training and evaluation notebooks (binary, regression, multi-class).
- simulation/: investment simulation notebooks.
- data/: raw datasets and intermediate parquet outputs.
- results/: saved benchmarking outputs.
- documentation/: dissertation and supporting material.
Stages
- Preprocessing: parse StockNet price/tweet data and output cleaned parquet files.
- Feature Engineering: NLP features, technical indicators, sector features, and meta‑features.
- Benchmarking: train/evaluate sequence models on fixed time splits.
- Meta‑Features & Early‑Exit: generate model‑reliability signals and optional early‑exit predictions for simulation.
- Simulation: portfolio construction using calibrated predictions and constrained risk rules.
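The feature-engineering stage aligns tweet-level NLP scores with daily per-ticker price rows. A minimal sketch of that alignment (column names here are hypothetical, not the notebooks' actual schema): aggregate tweet-level scores to one row per (date, ticker), then left-join onto the price table so days without tweets survive as missing values.

```python
import pandas as pd

# Hypothetical daily price table for one ticker.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2015-10-01", "2015-10-02"]),
    "ticker": ["AAPL", "AAPL"],
    "close": [110.3, 111.0],
})

# Hypothetical tweet-level sentiment scores (several tweets per day).
tweets = pd.DataFrame({
    "date": pd.to_datetime(["2015-10-01", "2015-10-01", "2015-10-02"]),
    "ticker": ["AAPL", "AAPL", "AAPL"],
    "sentiment": [3, 5, 2],
})

# Aggregate tweet-level scores to one row per (date, ticker) ...
daily_sent = tweets.groupby(["date", "ticker"], as_index=False)["sentiment"].mean()
# ... then left-join onto the price table so tweet-free days stay as NaN.
features = prices.merge(daily_sent, on=["date", "ticker"], how="left")
```

The left join (rather than inner) matters: dropping tweet-free days would silently shrink the price series.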
- How well do deep sequence models predict short‑horizon returns using price + text features?
- Do reliability‑aware meta‑features improve predictive stability and downstream profitability?
- Which allocation strategy (Kelly, MPT, or hybrid) yields better risk‑adjusted performance?
- How do horizon length and model class (binary, regression, ordinal) interact with economic outcomes?
Start here: raw StockNet data in data/stocknet-dataset/
End here: benchmark results in results/benchmarking/ and simulation outputs/logs in simulation/ and trained_models*
Recommended execution path
- Preprocess
  Run: data-pre-processing/Data_PreProcessing_1.ipynb → Data_PreProcessing_2.ipynb → Data_PreProcessing_3.ipynb
  Output: cleaned parquet files in data/dataset/
- NLP Features
  Run: NLP_1_Sentiment_Scoring.ipynb → NLP_2_0_Emotion_Scoring.ipynb → NLP_2_1_Emotion_Engineering.ipynb → NLP_3_Stance_Scoring.ipynb → NLP_4_FinBert_Sentiment.ipynb
- Technical + Sector Features
  Run: Technical_Indicators.ipynb → Sector_Features.ipynb
- Meta-Features & Early-Exit
  Run: feat-engineering/Meta_Features.ipynb → meta-features/Early_Exit.ipynb
- Benchmarking
  Run: benchmarking/Benchmarking.ipynb and/or benchmarking/MultiClass_Benchmarking.ipynb; either can be run on parquet files with or without meta-features.
- Simulation (end of pipeline)
  Run: simulation/Invesment_Simulation_System.ipynb and/or simulation/MultiClass_Invesment_Simulation_System.ipynb
Preprocessing
- data-pre-processing/Data_PreProcessing_1.ipynb: tweet parsing/cleaning and parquet output.
- data-pre-processing/Data_PreProcessing_2.ipynb: stock table cleaning (tab-separated) to parquet.
- data-pre-processing/Data_PreProcessing_3.ipynb: tweet normalization/cleaning (non-merged).
NLP Features
- feat-engineering/NLP_1_Sentiment_Scoring.ipynb: sentiment scoring (1–5).
- feat-engineering/NLP_2_0_Emotion_Scoring.ipynb: emotion scores + percentiles.
- feat-engineering/NLP_2_1_Emotion_Engineering.ipynb: unified emotion features.
- feat-engineering/NLP_3_Stance_Scoring.ipynb: stance label/score.
- feat-engineering/NLP_4_FinBert_Sentiment.ipynb: FinBERT sentiment features.
Technical + Sector Features
- feat-engineering/Technical_Indicators.ipynb: TA indicators + NLP merge.
- feat-engineering/Sector_Features.ipynb: sector-level aggregates and indicators.
Meta‑Features + Early Exit
- feat-engineering/Meta_Features.ipynb: meta-model signals and reliability features.
- meta-features/Early_Exit.ipynb: early-exit signals for filtering simulation trades.
Benchmarking
- benchmarking/Benchmarking.ipynb: binary classification & regression pipeline.
- benchmarking/MultiClass_Benchmarking.ipynb: multi-class/ordinal pipeline.
- benchmarking/*_GPU.ipynb: GPU-optimized variants for cluster runs.
Simulation
- simulation/Invesment_Simulation_System.ipynb: binary simulation engine.
- simulation/MultiClass_Invesment_Simulation_System.ipynb: ordinal simulation engine.
- meta-features/Early_Exit.ipynb: early-exit features for simulation filtering.
- StockNet dataset under data/stocknet-dataset/.
- Intermediate parquet outputs written to data/dataset/.
Used across benchmarking notebooks:
- Train: 2014‑01‑01 to 2015‑08‑01
- Validation: 2015‑08‑01 to 2015‑10‑01
- Test: 2015‑10‑01 to 2016‑01‑01
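These splits can be applied as simple date masks. Whether each boundary is inclusive follows the notebooks; the sketch below assumes the common convention of an inclusive start and exclusive end, so the boundary dates (2015-08-01, 2015-10-01) fall into exactly one split.

```python
import pandas as pd

# One sample date inside each of the three windows.
dates = pd.Series(pd.to_datetime(["2014-06-01", "2015-09-15", "2015-11-20"]))

# Assumed convention: start date inclusive, end date exclusive.
train_mask = (dates >= "2014-01-01") & (dates < "2015-08-01")
val_mask   = (dates >= "2015-08-01") & (dates < "2015-10-01")
test_mask  = (dates >= "2015-10-01") & (dates < "2016-01-01")
```

Splitting by fixed dates rather than random sampling avoids look-ahead leakage between train and evaluation windows.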
The simulation uses walk‑forward training, per‑ticker models, calibrated probabilities, and constrained allocation. It enforces:
- total capital utilization caps,
- per‑ticker caps,
- optional sector diversification,
- drawdown‑based throttling,
- fractional Kelly sizing.
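Fractional Kelly sizing, combined with the per-ticker cap, can be sketched as follows. This is a generic illustration, not the simulation's actual code; the `cap` default and the win/loss-ratio input are assumptions.

```python
def fractional_kelly(p_up, win_loss_ratio, fraction=0.5, cap=0.10):
    """Position size as a share of capital.

    p_up: calibrated probability that the trade wins.
    win_loss_ratio: average gain on a win divided by average loss on a loss.
    fraction: Kelly multiplier (< 1 trades growth for lower variance).
    cap: hypothetical per-ticker allocation cap from the simulation rules.
    """
    edge = p_up - (1.0 - p_up) / win_loss_ratio  # full-Kelly fraction
    # Never short on a negative edge; never exceed the per-ticker cap.
    return max(0.0, min(fraction * edge, cap))
```

For example, a calibrated 60% win probability with symmetric payoffs gives a full-Kelly fraction of 0.2, halved to 0.1 of capital; a 50% probability yields no position at all.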
- Place the raw dataset in data/stocknet-dataset/.
- Run the preprocessing notebooks in data-pre-processing/.
- Run the NLP + TA notebooks in feat-engineering/ in the order listed above.
- Run the meta-feature and early-exit notebooks (feat-engineering/Meta_Features.ipynb, meta-features/Early_Exit.ipynb).
- Run the benchmarking notebooks in benchmarking/.
- Run the simulation notebooks in simulation/.
Python: 3.10+
Common dependencies: pandas, numpy, scikit-learn, torch, transformers, optuna, pandas_ta, pyarrow, tqdm, matplotlib, seaborn
GPU: supported via CUDA or Apple MPS in the benchmarking/simulation notebooks.
- Intermediate parquet files: data/dataset/
- Benchmark results: results/benchmarking/
- Simulation logs and artifacts: simulation/ and trained_models*
Most notebooks set fixed random seeds and use deterministic options where possible. Exact reproducibility may still vary across GPU devices and driver versions.
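The seeding pattern is likely of this shape (a sketch, not the notebooks' exact code; the PyTorch calls are shown as comments since they apply only where torch is installed):

```python
import os
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness used across the notebooks."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    # With PyTorch installed, the notebooks would additionally call:
    #   torch.manual_seed(seed)
    #   torch.backends.cudnn.deterministic = True
```

Even with all of the above, kernel-level nondeterminism on GPUs means bit-identical results across devices are not guaranteed, as noted above.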
- StockNet coverage and tweet noise may introduce data sparsity or bias.
- Results are sensitive to feature selection and horizon choice.
- Simulated trading does not include all real‑world frictions unless explicitly modeled in notebooks.
For academic use only.
