This analysis investigates how tweet sentiment affects stock prices using a dataset of stock returns and tweet sentiment scores. The dataset was cleaned, processed, and visualized to understand the relationships between sentiment polarity, stock returns, trading volume, and tweet counts. New features were then engineered before applying machine learning techniques such as Linear Regression and Random Forest Regression.
- Dropped unnecessary columns.
- Converted date columns to appropriate datetime formats.
- Handled missing values.
- Ensured numeric columns were correctly formatted for analysis.
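The cleaning steps above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the helper `clean_tweet_stock_data`, the toy frame, and the column names `unused_id`, `date`, and `return_1d` are all assumptions, and dropping incomplete rows is only one of several ways to handle missing values.

```python
import pandas as pd


def clean_tweet_stock_data(df, drop_cols, date_cols, numeric_cols):
    """Hypothetical helper applying the four cleaning steps to a copy of df."""
    # 1. Drop unnecessary columns (ignore names that are not present).
    out = df.drop(columns=drop_cols, errors="ignore").copy()
    # 2. Parse date columns; unparseable entries become NaT.
    for col in date_cols:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    # 3. Coerce numeric columns; malformed entries become NaN.
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # 4. Handle missing values, here by dropping incomplete rows.
    return out.dropna(subset=date_cols + numeric_cols).reset_index(drop=True)


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "unused_id": [1, 2, 3, 4],
    "date": ["2020-01-02", "2020-01-03", "not a date", "2020-01-06"],
    "lstm_polarity": ["0.4", "-0.1", "0.2", "oops"],
    "return_1d": [0.01, -0.02, 0.005, 0.0],
})
clean = clean_tweet_stock_data(
    raw,
    drop_cols=["unused_id"],
    date_cols=["date"],
    numeric_cols=["lstm_polarity", "return_1d"],
)
print(clean.dtypes)
```

Coercing with `errors="coerce"` turns bad entries into missing values first, so a single `dropna` call (or an imputation step, if preferred) can deal with them in one place.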
The lstm_polarity column represents the sentiment polarity of tweets as determined by a Long Short-Term Memory (LSTM) neural network model. The LSTM model reads the text of each tweet and derives a sentiment score, typically denoting positive, negative, or neutral sentiment.
The textblob_polarity column represents the sentiment polarity of tweets as analyzed by the TextBlob library. TextBlob is a Python library for processing textual data, with APIs for common language-processing tasks such as sentiment analysis. The polarity score returned by TextBlob ranges from -1 (most negative) to 1 (most positive), with 0 indicating neutral sentiment.
- LSTM and TextBlob sentiment polarities show different distributions.
- The polarity scores generally center around neutral values.
- 7-day returns are more spread out than 1-day returns.
- The data suggests that short-term fluctuations do not always translate into long-term movement.
- Weak correlation observed between sentiment polarity and 1-day stock returns.
- Some extreme sentiment scores correspond to higher volatility in returns.
- Sentiment polarity shows a similarly weak correlation with 7-day returns.
- Longer-term returns may be influenced by additional market factors beyond sentiment.
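The kind of check behind these observations can be sketched with a Pearson correlation matrix and a comparison of standard deviations. The data below is synthetic and the column names `return_1d` and `return_7d` are assumptions, so the numbers only illustrate the method and do not reproduce the project's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
polarity = rng.uniform(-1, 1, n)
# Synthetic daily noise; the 7-day return sums seven daily draws,
# which is why its spread is wider than the 1-day return.
daily_noise = rng.normal(0, 0.02, (n, 7))

df = pd.DataFrame({
    "lstm_polarity": polarity,
    # Assumed, deliberately tiny sentiment effect on top of the noise.
    "return_1d": 0.002 * polarity + daily_noise[:, 0],
    "return_7d": 0.002 * polarity + daily_noise.sum(axis=1),
})

corr = df.corr(method="pearson")
corr_1d = corr.loc["lstm_polarity", "return_1d"]
corr_7d = corr.loc["lstm_polarity", "return_7d"]
spread = df[["return_1d", "return_7d"]].std()

print(f"corr(polarity, 1-day return) = {corr_1d:.3f}")
print(f"corr(polarity, 7-day return) = {corr_7d:.3f}")
print(spread)
```

A correlation near zero only rules out a linear association; it says nothing about non-linear effects, which is one reason to try a tree-based model later.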
Linear Regression was chosen as a baseline model to measure how well sentiment and stock-related attributes predict stock returns. It is simple to interpret and shows whether these attributes are linearly related to stock price movement. Because financial data often follow complex patterns, it serves as a reference point before trying more flexible techniques such as Random Forest.
- Predictions are widely scattered around the actual returns, indicating low predictive power.
- The low R² value (0.0063) indicates that the selected features (sentiment, volume, volatility) explain almost none of the variance in stock returns under a linear model.
- Stock returns do not necessarily have a strictly linear relationship with sentiment data.
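A linear baseline of this kind might look like the following scikit-learn sketch. The features and target are synthetic stand-ins (an assumption, not the project's data): the target depends on sentiment only through a squared term, which is the sort of relationship a linear model cannot pick up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
# Hypothetical features standing in for sentiment, volume, and volatility.
X = np.column_stack([
    rng.uniform(-1, 1, n),       # sentiment polarity
    rng.lognormal(0, 1, n),      # trading volume (arbitrary units)
    rng.uniform(0.01, 0.05, n),  # volatility
])
# Assumed target: symmetric in sentiment, so it has no linear trend.
y = 0.03 * X[:, 0] ** 2 + rng.normal(0, 0.005, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

linear_r2 = r2_score(y_test, pred)
linear_mse = mean_squared_error(y_test, pred)
print(f"R2 = {linear_r2:.4f}, MSE = {linear_mse:.6f}")
print("coefficients:", model.coef_)
```

On data like this the R² stays close to zero even though the target is strongly driven by sentiment, which shows why a low linear R² by itself does not mean the features are uninformative.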
The poor performance of Linear Regression motivated trying Random Forest Regression, a more flexible model that can capture non-linear patterns in stock price movement. Random Forest uses an ensemble of decision trees, which makes it better suited to datasets with many interacting variables. The model was far more accurate, with an R² of 0.8670 compared to Linear Regression's 0.0063. This suggests that sentiment, tweet activity, and volatility relate to returns in ways that a linear model cannot represent. It also suggests that non-linear relationships, and possibly other contextual variables such as market conditions and broader economic indicators, should be considered when assessing the impact of sentiment on stocks.
- Predictions are much closer to actual returns compared to Linear Regression.
- Random Forest captures some of the nonlinear relationships and thus has greater accuracy.
- Although the R² value is already high, adding more financial or fundamental indicators may further boost performance.
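To illustrate why an ensemble of trees can succeed where a linear fit fails, the sketch below trains both models on synthetic data with an assumed interaction between sentiment strength and volatility. The feature names, the target formula, and the hyperparameters are illustrative assumptions, so the scores will not match the reported 0.8670.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 1000
sentiment = rng.uniform(-1, 1, n)
tweet_count = rng.poisson(50, n).astype(float)  # unrelated to the target here
volatility = rng.uniform(0.01, 0.05, n)
X = np.column_stack([sentiment, tweet_count, volatility])
# Assumed target: strong sentiment in either direction matters more
# when volatility is high (a non-linear interaction).
y = volatility * np.abs(sentiment) + rng.normal(0, 0.003, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)
forest = RandomForestRegressor(n_estimators=200, random_state=7)
forest.fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

rf_r2 = r2_score(y_test, forest.predict(X_test))
lin_r2 = r2_score(y_test, linear.predict(X_test))
importances = dict(
    zip(["sentiment", "tweet_count", "volatility"], forest.feature_importances_)
)
print(f"Random Forest R2 = {rf_r2:.3f}, Linear R2 = {lin_r2:.3f}")
print(importances)
```

The `feature_importances_` attribute gives a rough view of which inputs the forest relies on, which can help decide which additional indicators are worth adding.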
| Model | R² Score | Mean Squared Error (MSE) |
|---|---|---|
| Linear Regression | 0.0063 | 0.000511 |
| Random Forest Regression | 0.8670 | 0.000068 |
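For reference, both metrics in the table can be computed by hand: MSE is the mean squared residual, and R² is one minus the ratio of residual to total sum of squares. The sketch below uses made-up return values for the two helper functions (both hypothetical names), then applies the identity variance = MSE / (1 - R²) to the two table rows as a consistency check.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up returns and predictions, for illustration only.
y_true = [0.01, -0.02, 0.03, 0.00]
y_pred = [0.008, -0.015, 0.025, 0.002]
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.8f}, R2 = {r2:.4f}")

# Consistency check on the table: both rows should imply roughly the
# same variance of the test-set returns, since MSE = (1 - R2) * variance.
implied_var_linear = 0.000511 / (1 - 0.0063)
implied_var_forest = 0.000068 / (1 - 0.8670)
print(f"{implied_var_linear:.6f} vs {implied_var_forest:.6f}")
```

The two implied variances agree to within rounding, which is what one would expect if both models were scored on the same test set.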
- Future studies should include macroeconomic variables. Combining sentiment with other factors such as interest rates, earnings announcements, and global economic conditions may enhance predictability and provide a more comprehensive overview of the market.
- Sentiment may affect volatility more than direction. While the models could not predict exact price movements, the results suggest that surges in sentiment correlate with higher trading volumes and market activity.
- Short-Term Trading Strategies: The findings from Random Forest show that non-obvious patterns in the direction of stock movement can be detected by machine learning algorithms. Short-term traders can potentially use sentiment information with technical indicators like volatility and momentum.
- Risk Management: Volatility analysis shows that highly volatile stocks tend to remain volatile in the longer term. Investors can use sentiment trends to anticipate changes and hedge against them in advance.
- Optimize the Random Forest model by adjusting tree depth, number of trees, and features.
- Try other Machine Learning models like Gradient Boosting to enhance predictive power.
- Employ LSTM models to identify sequential patterns between sentiment shifts and stock price movements.
- Investigate whether deep learning techniques are superior to tree-based models for this application.
- Include wider financial indicators (S&P 500 price movement, economic news, interest rate changes).
- Test if adding sentiment together with macroeconomic features improves predictions.
- Engineer sentiment "momentum" features that capture how fast sentiment is changing, not just its raw value.
- Include tweet engagement features (likes, retweets) to give weight to how impactful high-visibility tweets are.
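The sentiment momentum and engagement-weighting ideas from the list above could be prototyped as follows. This is a sketch under stated assumptions: the columns `polarity`, `likes`, and `retweets`, the weight formula `1 + likes + retweets`, and the use of a one-step difference as "momentum" are all illustrative choices, not part of the existing notebook.

```python
import pandas as pd

# Made-up tweets for three trading days.
tweets = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-02"] * 2 + ["2020-01-03"] * 2 + ["2020-01-06"] * 2
    ),
    "polarity": [0.2, 0.6, -0.1, 0.3, 0.5, 0.9],
    "likes": [10, 30, 5, 15, 0, 20],
    "retweets": [0, 10, 5, 5, 0, 0],
})

# Assumed weighting: the +1 keeps zero-engagement tweets from vanishing.
tweets["weight"] = 1 + tweets["likes"] + tweets["retweets"]
tweets["weighted"] = tweets["polarity"] * tweets["weight"]

daily = tweets.groupby("date").agg(
    weighted_sum=("weighted", "sum"),
    weight_sum=("weight", "sum"),
    mean_polarity=("polarity", "mean"),
)
# Engagement-weighted average polarity per day.
daily["weighted_polarity"] = daily["weighted_sum"] / daily["weight_sum"]
# Momentum as the day-over-day change; the first day has no prior value.
daily["sentiment_momentum"] = daily["weighted_polarity"].diff()

print(daily[["mean_polarity", "weighted_polarity", "sentiment_momentum"]])
```

Longer windows (for example a rolling mean of the differences) would smooth the momentum signal, at the cost of reacting more slowly to sudden sentiment shifts.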
- CU_Boulder_Module7_SupervisedLearning.ipynb: Jupyter Notebook with the full EDA and ML workflow.
- Images/: Folder containing images of generated plots.
- Data/: Folder containing the dataset.
- README.md: Full EDA write-up.