I analyze whether online sentiment (e.g. WallStreetBets) has predictive power over short-term price movements.
Goal for the semester 2 project: test whether WallStreetBets sentiment and activity can predict short-term stock price movements.
What I am predicting
A stock’s next-day direction:
Up (Green Day) or Down (Red Day)
Target Variable: Next-day return direction
𝑦 = 1 if next-day return > 0, else 0 (a flat or negative day is labeled 0)
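The target definition above can be sketched with pandas. This is a minimal illustration, assuming a daily price table with a `close` column indexed by trading date; the function name `add_next_day_direction` and the column names are hypothetical, not part of the project yet.

```python
import pandas as pd

def add_next_day_direction(prices: pd.DataFrame) -> pd.DataFrame:
    """Append the next-day return and the binary label y to a daily price table."""
    out = prices.sort_index().copy()
    # Return from today's close to the next trading day's close.
    out["next_day_return"] = out["close"].shift(-1) / out["close"] - 1
    # The last row has no next day yet, so drop it instead of labeling it 0.
    out = out.dropna(subset=["next_day_return"])
    # y = 1 only for a strictly positive return; flat days fall into class 0.
    out["y"] = (out["next_day_return"] > 0).astype(int)
    return out

# Made-up closing prices for illustration only.
prices = pd.DataFrame(
    {"close": [100.0, 102.0, 101.0, 101.0, 105.0]},
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"]
    ),
)
labeled = add_next_day_direction(prices)
print(labeled[["close", "next_day_return", "y"]])
```

Dropping the final row avoids silently labeling an unknown future day as "down".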
This project aims to develop a machine learning model that leverages social media sentiment from WallStreetBets (WSB) to predict short-term stock price movements. The core question is: Can online retail investor sentiment serve as a predictive signal for next-day stock returns?
The model will provide:
- Binary Classification: Predict whether a stock will close up (green day) or down (red day) the following trading session
- Sentiment Signal Strength: Quantify the relationship between WSB discussion volume, sentiment scores, and actual price movements
- Trading Insights: Identify which sentiment metrics (upvotes, comment volume, positive/negative language) have the strongest predictive power
- Risk Assessment: Understand the reliability and limitations of social media sentiment as a trading indicator
Success metrics:
- Classification accuracy > 55% (baseline: roughly 50% from random guessing)
- Precision and recall for both up/down predictions
- Feature importance analysis showing significant sentiment variables
- Backtested strategy performance vs. buy-and-hold benchmark
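The accuracy, precision, and recall metrics listed above could be computed as in the following sketch. It uses only the standard library; the helper name `direction_report` and the sample labels are made up for illustration. It also reports the majority-class rate, since the share of up days in real data may not be exactly 50%.

```python
def direction_report(y_true, y_pred):
    """Accuracy, majority-class baseline, and per-class precision/recall."""
    n = len(y_true)
    up_share = sum(y_true) / n
    report = {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        # Accuracy of always predicting the more frequent class.
        "majority_baseline": max(up_share, 1 - up_share),
    }
    for label, name in ((1, "up"), (0, "down")):
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        report[name] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return report

# Illustrative labels only: 1 = up (green day), 0 = down (red day).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
report = direction_report(y_true, y_pred)
print(report)
```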
Goals:
- Set up Python environment with required libraries (PRAW, pandas, scikit-learn, yfinance)
- Create Reddit API credentials and test WSB data access
- Define initial stock universe (e.g., top 20 most-mentioned WSB stocks)
- Research and document relevant sentiment analysis approaches
- Set up GitHub repository structure with proper documentation
Deliverables:
- Working Reddit API connection
- List of target stocks to analyze
- Initial project structure and documentation
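One possible way to pick the initial stock universe is to count ticker mentions across post titles. The sketch below is an assumption about how that could work: the whitelist `KNOWN_TICKERS`, the regex, and the sample titles are all hypothetical, and a real run would read titles from the Reddit API rather than a hard-coded list.

```python
import re
from collections import Counter

# Hypothetical whitelist; in practice this could come from an exchange listing.
KNOWN_TICKERS = {"GME", "AMC", "TSLA", "NVDA", "AAPL"}

# 1-5 capital letters as a whole word. "$GME" also matches, because "$" is
# not a word character and so a word boundary sits right after it.
TICKER_PATTERN = re.compile(r"\b[A-Z]{1,5}\b")

def count_ticker_mentions(texts, known_tickers=KNOWN_TICKERS):
    """Count how many posts mention each known ticker."""
    counts = Counter()
    for text in texts:
        # Count a ticker once per post so a single repetitive post cannot dominate.
        mentioned = {m for m in TICKER_PATTERN.findall(text) if m in known_tickers}
        counts.update(mentioned)
    return counts

# Made-up titles for illustration.
titles = [
    "$GME to the moon, GME GME",
    "Buying TSLA and NVDA calls",
    "I like GME more than AMC",
    "YOLO on nvda",
]
mention_counts = count_ticker_mentions(titles)
top = mention_counts.most_common(1)
print(top)
```

Filtering against a whitelist matters because uppercase words such as "YOLO" or "I" would otherwise be counted as tickers. Lowercase mentions are ignored in this sketch.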
Goals:
- Scrape historical WSB posts and comments for selected stocks (ideally 6-12 months of data)
- Collect corresponding historical stock price data using yfinance
- Build data pipeline for ongoing data collection
- Store data in structured format (CSV/SQLite/Parquet)
- Perform initial data quality checks
Deliverables:
- Historical WSB dataset (posts, comments, timestamps, scores)
- Historical price dataset aligned with sentiment data
- Data collection scripts for automated updates
- Initial data quality report
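As a sketch of the storage and quality-check steps, the example below writes posts to SQLite and reports row counts and empty bodies. The table layout, field names, and helper names are assumptions made for illustration; the sample rows are fabricated placeholders, not real WSB data.

```python
import sqlite3

# Hypothetical schema; field names mirror common Reddit post attributes.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    post_id      TEXT PRIMARY KEY,
    created_utc  INTEGER NOT NULL,
    title        TEXT NOT NULL,
    body         TEXT,
    score        INTEGER,
    num_comments INTEGER
)
"""

def save_posts(conn, posts):
    """Insert posts, skipping any post_id that is already stored."""
    conn.execute(SCHEMA)
    # INSERT OR IGNORE makes reruns of the collector idempotent.
    conn.executemany(
        "INSERT OR IGNORE INTO posts VALUES "
        "(:post_id, :created_utc, :title, :body, :score, :num_comments)",
        posts,
    )
    conn.commit()

def quality_report(conn):
    """Return the row count and the number of posts with no body text."""
    total, missing_body = conn.execute(
        "SELECT COUNT(*), SUM(body IS NULL OR body = '') FROM posts"
    ).fetchone()
    return {"rows": total, "missing_body": missing_body or 0}

sample_posts = [
    {"post_id": "a1", "created_utc": 1704294000, "title": "GME DD",
     "body": "Long write-up", "score": 120, "num_comments": 45},
    {"post_id": "a2", "created_utc": 1704321000, "title": "TSLA calls",
     "body": "", "score": 8, "num_comments": 2},
    # Duplicate of a1, as might happen when a scrape window overlaps.
    {"post_id": "a1", "created_utc": 1704294000, "title": "GME DD",
     "body": "Long write-up", "score": 120, "num_comments": 45},
]
conn = sqlite3.connect(":memory:")
save_posts(conn, sample_posts)
report = quality_report(conn)
print(report)
```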
Goals:
- Implement sentiment analysis (VADER, TextBlob, or FinBERT)
- Engineer features from WSB data:
  - Daily post/comment volume per stock
  - Aggregate sentiment scores
  - Upvote ratios and engagement metrics
  - Trending indicators (velocity of mentions)
- Create target variable (next-day return direction)
- Handle data alignment and time-zone issues
- Address missing data and outliers
Deliverables:
- Feature engineering pipeline
- Sentiment-labeled WSB dataset
- Combined dataset with features and target variable
- Exploratory data analysis (EDA) notebook
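The alignment and aggregation steps in this phase might look like the sketch below. It assumes each post already has a numeric `sentiment` column (for example a VADER compound score), a `ticker`, a `score`, and a `created_utc` epoch timestamp; these column names and both helper functions are hypothetical. Posts made at or after the 4 p.m. Eastern close are assigned to the next calendar day, because they cannot influence that day's close. Weekends and market holidays are not handled here.

```python
import pandas as pd

MARKET_TZ = "America/New_York"
MARKET_CLOSE_HOUR = 16  # regular session close, 4 p.m. Eastern

def assign_signal_date(created_utc: pd.Series) -> pd.Series:
    """Map UTC epoch seconds to the date whose close the post can still inform."""
    local = pd.to_datetime(created_utc, unit="s", utc=True).dt.tz_convert(MARKET_TZ)
    # Posts at or after the close are rolled forward one calendar day.
    after_close = local.dt.hour >= MARKET_CLOSE_HOUR
    dates = local.dt.tz_localize(None).dt.normalize()
    return dates + pd.to_timedelta(after_close.astype(int), unit="D")

def daily_features(posts: pd.DataFrame) -> pd.DataFrame:
    """Aggregate post-level rows into one row per ticker and signal date."""
    posts = posts.assign(signal_date=assign_signal_date(posts["created_utc"]))
    return (
        posts.groupby(["ticker", "signal_date"])
        .agg(
            post_count=("sentiment", "size"),
            mean_sentiment=("sentiment", "mean"),
            total_score=("score", "sum"),
        )
        .reset_index()
    )

# Made-up posts: 10:00 ET on Jan 3, 17:30 ET on Jan 3, 09:00 ET on Jan 4.
posts = pd.DataFrame(
    {
        "ticker": ["GME", "GME", "GME"],
        "created_utc": [1704294000, 1704321000, 1704376800],
        "sentiment": [0.6, -0.2, 0.4],
        "score": [10, 5, 20],
    }
)
features = daily_features(posts)
print(features)
```

The after-close post lands in the January 4 bucket, which keeps the features free of information that arrived after the price they are paired with.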
Goals:
- Conduct comprehensive EDA:
  - Correlation between sentiment and returns
  - Distribution of features and target variable
  - Temporal patterns and trends
- Visualize key relationships
- Split data into train/validation/test sets (temporal split)
- Build baseline logistic regression model
- Establish performance benchmarks
Deliverables:
- EDA report with visualizations
- Train/validation/test datasets
- Baseline model with performance metrics
- Initial insights on feature-target relationships
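A temporal split can be sketched with fixed cutoff dates, so that every row for a given day lands in the same set regardless of ticker. The function name `temporal_split`, the row layout, and the cutoff dates below are illustrative assumptions.

```python
from datetime import date, timedelta

def temporal_split(rows, val_start, test_start):
    """Split rows by date: train < val_start <= validation < test_start <= test."""
    train = [r for r in rows if r["date"] < val_start]
    val = [r for r in rows if val_start <= r["date"] < test_start]
    test = [r for r in rows if r["date"] >= test_start]
    return train, val, test

# Ten made-up daily rows, January 1 through January 10.
rows = [{"date": date(2024, 1, 1) + timedelta(days=i), "y": i % 2} for i in range(10)]
train, val, test = temporal_split(
    rows, val_start=date(2024, 1, 7), test_start=date(2024, 1, 9)
)
print(len(train), len(val), len(test))
```

Unlike a random shuffle, this keeps all validation and test rows strictly later than the training rows, so the model is never evaluated on days that precede its training data.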
Goals:
- Experiment with multiple algorithms:
  - Random Forest
  - Gradient Boosting (XGBoost/LightGBM)
  - Neural Networks (if time permits)
- Perform hyperparameter tuning
- Implement cross-validation strategies
- Address class imbalance if present
- Feature selection and importance analysis
Deliverables:
- Multiple trained models with comparison metrics
- Hyperparameter tuning results
- Feature importance rankings
- Model selection justification
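For time-ordered data, one cross-validation strategy worth considering is walk-forward validation with an expanding training window. The generator below is a hand-written sketch of that idea, with a hypothetical name and parameters; scikit-learn's `TimeSeriesSplit` offers similar behavior and is likely the better choice in the actual pipeline.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with an expanding training window.

    Assumes samples are already sorted in chronological order.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("Not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        # Train on everything before the test block, test on the block itself.
        yield list(range(test_start)), list(range(test_start, test_start + test_size))

folds = list(walk_forward_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Each fold trains only on observations that precede its test block, which mirrors how the model would be used in practice.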
Goals:
- Evaluate final model on test set
- Conduct detailed error analysis
- Test model robustness across different market conditions
- Implement backtesting framework for trading strategy
- Calculate risk-adjusted returns
- Identify model limitations and failure cases
Deliverables:
- Final model performance report
- Backtesting results with strategy metrics (Sharpe ratio, max drawdown)
- Error analysis and case studies
- Documented limitations and assumptions
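The strategy metrics named above (Sharpe ratio, max drawdown) can be sketched with the standard library. This example assumes a long-or-flat strategy that holds the stock only on days the model predicts up, ignores transaction costs and slippage, and annualizes with 252 trading days; the function names and return figures are invented for illustration.

```python
import math
import statistics

TRADING_DAYS_PER_YEAR = 252

def strategy_returns(predictions, next_day_returns):
    """Long when the model predicts up (1), flat (zero return) otherwise."""
    return [r if p == 1 else 0.0 for p, r in zip(predictions, next_day_returns)]

def sharpe_ratio(daily_returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio from daily returns."""
    excess = [r - risk_free_daily for r in daily_returns]
    std = statistics.stdev(excess)
    if std == 0:
        return float("nan")
    return statistics.mean(excess) / std * math.sqrt(TRADING_DAYS_PER_YEAR)

def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the equity curve, as a negative fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1)
    return worst

# Made-up predictions and realized next-day returns.
predictions = [1, 0, 1, 1]
buy_and_hold = [0.02, -0.01, -0.05, 0.03]
strategy = strategy_returns(predictions, buy_and_hold)

print("strategy max drawdown:", max_drawdown(strategy))
print("buy-and-hold max drawdown:", max_drawdown(buy_and_hold))
print("strategy Sharpe:", sharpe_ratio(strategy))
```

Running the same two functions on the buy-and-hold series gives the benchmark comparison on equal terms.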
Goals:
- Create comprehensive project documentation
- Develop final presentation/report with:
  - Problem statement and motivation
  - Methodology and approach
  - Results and key findings
  - Limitations and future work
- Clean and organize code repository
- Create visualizations for presentation
- Prepare a live demonstration (if applicable)
Deliverables:
- Final project report (written)
- Presentation slides
- Clean, documented GitHub repository
- README with project overview and instructions
- Reflections on learnings and challenges
A completed machine learning system that:
- Successfully predicts next-day stock price direction with statistically significant accuracy above baseline
- Identifies which WallStreetBets sentiment metrics are most predictive of price movements
- Demonstrates understanding of the relationship between social media sentiment and financial markets
- Provides actionable insights on the viability of sentiment-based trading strategies
- Includes comprehensive documentation of methodology, results, and limitations
Key Deliverable: A reproducible, well-documented project showcasing the full data science workflow from data collection through model evaluation and backtesting, with clear evidence of predictive power (or lack thereof) from WSB sentiment analysis.
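One way to check whether test accuracy is statistically significant relative to a baseline is a one-sided exact binomial test. The sketch below is an assumption about how that check could be done, not a settled part of the methodology. It treats each prediction as an independent trial, which is only an approximation when several stocks move together on the same day, so the resulting p-value should be read as optimistic.

```python
from math import comb

def binomial_p_value(n_correct, n_total, baseline=0.5):
    """One-sided P(X >= n_correct) for X ~ Binomial(n_total, baseline)."""
    return sum(
        comb(n_total, k) * baseline**k * (1 - baseline) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )

# Hypothetical outcomes: the same kind of hit rate on small and large test sets.
p_small = binomial_p_value(8, 10)
p_large = binomial_p_value(550, 1000)
print(f"8/10 correct: p = {p_small:.4f}")
print(f"550/1000 correct: p = {p_large:.6f}")
```

The comparison illustrates why test set size matters: a high hit rate on a handful of days can still be consistent with guessing, while a modest edge over many predictions is much harder to attribute to chance.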
- Data Quality: Reddit data can be noisy; plan for extensive cleaning
- Market Events: Consider excluding major market events (earnings, acquisitions) that could skew results
- Survivorship Bias: Be aware of stocks that were popular on WSB but are no longer traded (e.g., delisted or acquired); leaving them out would bias the results
- Ethical Considerations: This is for academic purposes; be cautious about real-world trading applications
- Adaptability: Timeline may need adjustments based on data availability and computational constraints