A comprehensive analysis of Wall Street Journal articles to investigate relationships between article sentiment, reader engagement, and financial market movements.
- Project Overview
- Research Questions
- Data Collection
- Natural Language Processing
- Research Findings
- Project Structure
- Getting Started
- Docker Setup
- Results & Applications
This project investigates the relationship between Wall Street Journal article sentiment and two key metrics:
- User engagement (measured by comment count)
- S&P 500 market returns
The analysis leverages web scraping and natural language processing to extract insights from 22,772 WSJ articles published between January 2019 and July 2020.
- Comment Engagement Analysis: Can a statistically significant relationship be demonstrated between a WSJ article's degree of subjectivity/objectivity and positivity/negativity in its writing and the number of online comments posted by readers?
- Market Prediction Analysis: Can a statistically significant relationship be demonstrated between WSJ articles' sentiment polarity on day t and S&P 500 Index movements on day t + n (where 0 ≤ n ≤ 3)?
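The day t versus day t + n pairing in the second question can be sketched as follows. This is a minimal illustration, not the project's code: the function name, the dictionary inputs, and the choice to count the lag in trading days (ignoring articles published on non-trading days) are all assumptions, and the numbers are made up.

```python
from datetime import date


def pair_sentiment_with_lagged_returns(sentiment_by_day, returns_by_day, n):
    """Pair sentiment on trading day t with the index return on trading day t + n.

    Both arguments are dicts keyed by date. The lag n is counted in trading
    days, i.e. in days that appear in returns_by_day (an assumption).
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    trading_days = sorted(returns_by_day)
    pairs = []
    for i, day in enumerate(trading_days):
        j = i + n
        # Keep the pair only if sentiment exists for day t and day t + n is in range.
        if day in sentiment_by_day and j < len(trading_days):
            pairs.append((sentiment_by_day[day], returns_by_day[trading_days[j]]))
    return pairs


# Hypothetical values, for illustration only (not real WSJ or S&P 500 data).
sentiment = {date(2020, 3, 2): 0.10, date(2020, 3, 3): -0.25, date(2020, 3, 4): 0.05}
returns = {date(2020, 3, 2): 0.01, date(2020, 3, 3): -0.02, date(2020, 3, 4): 0.03}

next_day_pairs = pair_sentiment_with_lagged_returns(sentiment, returns, n=1)
```

Calling the function once per lag (n = 0, 1, 2, 3) yields one set of pairs per regression.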
- Source: Wall Street Journal news archives
- Method: Python Selenium web scraping
- Dataset: 22,772 full-text articles (Jan 2019 - July 2020)
- Data Points per Article:
  - Article text and headline
  - Sub-headline and publication date
  - Author name and rubric category
  - Number of comments
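One way to hold the per-article data points listed above is a small record type that can be written out as CSV. The sketch below is hypothetical: `ArticleRecord`, its field names, and `records_to_csv` are illustrative and are not taken from `scrape.py`, and the sample values are invented.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class ArticleRecord:
    """One scraped article; field names are illustrative assumptions."""
    headline: str
    sub_headline: str
    text: str
    published: date
    author: str
    rubric: str
    comment_count: int


def records_to_csv(records):
    """Serialize records to CSV text with one column per data point."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ArticleRecord)])
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["published"] = record.published.isoformat()  # dates as YYYY-MM-DD
        writer.writerow(row)
    return buffer.getvalue()


# Invented sample record, for illustration only.
sample = ArticleRecord(
    headline="Example headline",
    sub_headline="Example sub-headline",
    text="Body text.",
    published=date(2019, 1, 2),
    author="Jane Doe",
    rubric="Markets",
    comment_count=12,
)
csv_text = records_to_csv([sample])
```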
- VADER (Valence Aware Dictionary and sEntiment Reasoner)
  - Purpose: Polarity and emotion intensity scoring
  - Output Variables: `negative`, `neutral`, `positive`, `compound`
  - Documentation: VADER Sentiment
- TextBlob
  - Purpose: Sentiment analysis and subjectivity scoring
  - Output Variables: `polarity`, `subjectivity`
  - Documentation: TextBlob Documentation
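A minimal sketch of how the six output variables could be produced for one piece of text, assuming the `vaderSentiment` and `textblob` packages are installed. VADER's `polarity_scores` returns the keys `neg`, `neu`, `pos`, and `compound`; mapping them to the names used in this README is an assumption, as are the helper names `rename_vader_scores` and `score_text`.

```python
def rename_vader_scores(raw):
    """Map VADER's native keys to the variable names used in this README (assumed mapping)."""
    return {
        "negative": raw["neg"],
        "neutral": raw["neu"],
        "positive": raw["pos"],
        "compound": raw["compound"],
    }


def score_text(text):
    """Return VADER and TextBlob scores for one text as a single dict."""
    # Imported lazily so rename_vader_scores works without the NLP packages.
    from textblob import TextBlob
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    scores = rename_vader_scores(SentimentIntensityAnalyzer().polarity_scores(text))
    sentiment = TextBlob(text).sentiment
    scores["polarity"] = sentiment.polarity          # TextBlob: -1.0 to 1.0
    scores["subjectivity"] = sentiment.subjectivity  # TextBlob: 0.0 to 1.0
    return scores
```

In practice the analyzer would be created once and reused across all articles rather than rebuilt on every call.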
- Comment Engagement Analysis:
  - Model Performance: Simple linear regression shows poor predictive power (Adj R² = 0.014)
  - Statistical Significance: Cannot reject null hypothesis (p-value = 0.2045)
  - Key Finding: VADER negativity scores are statistically significant at 1% level
  - Interpretation: Higher negativity may correlate with events generating public response (e.g., public figure deaths)
- Market Prediction Analysis:
  - Model Performance: Low predictive power across all models (Adj R² ≈ 0.01)
  - Analysis Scope: Four regression models testing same-day and next-day S&P 500 movements
  - Key Finding: TextBlob polarity shows significance at 10% level
  - Conclusion: WSJ sentiment has limited predictive power for market movements
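For readers unfamiliar with the adjusted R² figures quoted above, the sketch below fits a one-predictor least-squares line in plain Python and applies the standard adjustment 1 - (1 - R²)(n - 1)/(n - p - 1). It illustrates the metric only; the README does not show the project's actual regression specification, and the function names and toy data here are assumptions.

```python
def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R² for the number of predictors relative to the sample size."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Toy data, for illustration only.
slope, intercept, r2 = simple_ols([1, 2, 3, 4], [2, 4, 6, 8])
adj_r2 = adjusted_r_squared(r2, n_obs=4, n_predictors=1)
```

An adjusted R² near 0.01, as reported above, means the sentiment variables explain roughly 1% of the variance in the outcome after this penalty.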
WSJ_WebScraping_NLP/
├── app/ # R Shiny application
│ ├── global.R
│ ├── server.R
│ └── ui.R
├── data/ # Processed datasets
│ └── wsjsections.csv
├── notebooks/ # Jupyter analysis notebooks
│ └── WSJ_Scraping NLP_Analysis.ipynb
├── scraping/ # Web scraping scripts
│ └── scrape.py
├── Dockerfile # Container configuration
├── README.md # Project documentation
└── objectives.md # Detailed project objectives
- Python 3.9+
- R (for Shiny app)
- Docker (optional)
- Clone the repository
- Install Python dependencies: `pip install -r requirements.txt`
- Run the Jupyter notebook for analysis
- Launch R Shiny app for interactive visualization
- Build the image: `docker build -t wsj-nlp-analysis .`
- Run the container: `docker run -p 8888:8888 wsj-nlp-analysis`
- Open your browser and navigate to `http://localhost:8888`
- The Jupyter notebook interface will be available
- Use the provided token for authentication
- Stop the container: `docker stop <container_id>`
- R Shiny App: Live Application
- Features: Interactive sentiment analysis visualization and data exploration
- Blog Post: Detailed Analysis
- Objectives: See `objectives.md` for detailed project goals and methodology