This project leverages Natural Language Processing (NLP) and Machine Learning to analyze the language of S&P 500 earnings call transcripts and predict future stock volatility. It features an end-to-end pipeline that automates data collection, feature engineering, model training, and performance evaluation.
The key finding of this project is the successful creation of a Volatility Prediction Model that demonstrated a strong, verifiable ability to predict future risk based on the linguistic patterns in corporate earnings calls.
- Automated Data Pipeline: Downloads and processes years of earnings call transcripts from the `kurry/sp500_earnings_transcripts` dataset on Hugging Face.
- Advanced Feature Engineering: Calculates dozens of linguistic features, including sentiment, complexity, and risk keyword density, and normalizes them using Z-scores to measure deviation from a company's historical average.
- Dual Predictive Models: Trains two separate ensemble machine learning models:
- Volatility Prediction Model (Successful): Predicts whether the next quarter will be a high or low volatility period with high accuracy.
- Return Direction Model (Experimental): Predicts whether the next quarter's stock return will be positive or negative.
- Rigorous Backtesting: Utilizes a strict temporal hold-out set (2024 data) to evaluate model performance on completely unseen data.
- Interactive Dashboard: A multi-page Streamlit application to visualize the linguistic features and the final model performance metrics.
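The risk keyword density feature mentioned above can be sketched as a simple token-level ratio. The keyword list and function name below are illustrative placeholders, not the project's actual lexicon or implementation:

```python
import re

# Illustrative risk lexicon -- the project's real keyword list is not shown here.
RISK_KEYWORDS = {"uncertainty", "risk", "litigation", "headwind", "volatile", "impairment"}

def risk_keyword_density(transcript: str) -> float:
    """Fraction of transcript tokens that are risk-related keywords."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in RISK_KEYWORDS)
    return hits / len(tokens)
```

In practice the project computes dozens of such features per transcript; this shows only the general shape of one of them.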
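The Z-score normalization step, measuring how far each call deviates from a company's own historical average, can be sketched with a pandas group-by. Column names (`ticker`) are assumptions, and for simplicity this uses each company's full history rather than an expanding window:

```python
import pandas as pd

def zscore_vs_history(df: pd.DataFrame, feature_cols: list[str]) -> pd.DataFrame:
    """Normalize each feature against the same company's historical mean and std.

    Assumes one row per earnings call with a 'ticker' column identifying the company.
    """
    out = df.copy()
    grouped = out.groupby("ticker")[feature_cols]
    # Subtract the per-company mean and divide by the per-company std (sample std).
    out[feature_cols] = (out[feature_cols] - grouped.transform("mean")) / grouped.transform("std")
    return out
```

A value of +2 for, say, risk keyword density then means the call used unusually risk-heavy language relative to that company's own baseline, which is the signal the models consume.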
```
linguistic-alpha/
├── analysis/
│   ├── data_loader.py                       # Downloads and processes transcript data
│   ├── transcript_feature_engineering.py    # Calculates linguistic features
│   ├── model_training.py                    # Trains and saves the ML models
│   └── backtest.py                          # Evaluates models on hold-out data
├── dashboard/
│   ├── app.py                               # Main Streamlit app
│   └── pages/                               # Dashboard pages for analysis and model performance
├── output/                                  # Stores generated data and trained models
├── run_pipeline.py                          # Main script to run the entire pipeline
├── requirements.txt
└── README.md
```
1. Clone the Repository:

   ```bash
   git clone <your-repo-url>
   cd linguistic-alpha
   ```

2. Create and Activate a Virtual Environment:

   ```bash
   python -m venv venv
   source venv/bin/activate
   ```

3. Install Dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Run the Full Pipeline: This command will download the data, perform feature engineering, train the models, and run the backtest. It will take some time, as it processes over a decade of data.

   ```bash
   python run_pipeline.py
   ```

5. Launch the Dashboard: After the pipeline completes, you can view the results.

   ```bash
   streamlit run dashboard/app.py
   ```
The primary success of this project is the Volatility Prediction Model, which achieved an accuracy of 81.25% and an AUC of 0.85 on the 2024 hold-out set. This demonstrates a strong predictive signal in linguistic data for forecasting future market risk. The Return Direction Model did not show significant predictive power, confirming the difficulty of predicting stock direction.
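Metrics like these can be computed from a model's hold-out predictions with scikit-learn; the function and variable names below are illustrative, not the project's actual backtest code:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_holdout(y_true, y_prob, threshold=0.5):
    """Score predicted high/low-volatility probabilities on a temporal hold-out set.

    y_true: actual labels (1 = high volatility) for the hold-out quarters.
    y_prob: model-predicted probabilities of the positive class.
    """
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),  # threshold-free ranking quality
    }
```

Because the hold-out set is strictly later in time (2024) than all training data, these numbers estimate genuine out-of-sample performance rather than in-sample fit.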