This assignment focuses on conversations in Reddit's r/ChangeMyView community, where you will apply your NLP skills to analyze discourse patterns across linked posts and comments. By the end of the assignment, you will:
- Link and analyze multiple related text documents
- Compare language patterns between posts (arguments) and comments (responses)
- Apply word embeddings to analyze social discourse dynamics
- Investigate conversation patterns and discussion quality
- Connect NLP findings to theories of online deliberation and persuasion
This analysis uses two linked datasets from Reddit's r/ChangeMyView community:
**Posts Dataset** (`data/changemyview_posts.csv`): 5,000 top-ranked CMV submissions
- `title`: The CMV post title (the opinion to be changed)
- `selftext`: The post body/content with arguments
- `score`: Reddit upvotes (engagement measure)
- `num_comments`: Number of comments (discussion level)
- `id`: Unique post identifier for linking to comments

**Comments Dataset** (`data/cmv_comments.csv`): 12,106 top-rated comments
- `body`: Comment text content
- `score`: Comment upvotes
- `link_id`: Links each comment to its parent post
- Additional metadata for conversation analysis
r/ChangeMyView is a subreddit where people post views they're willing to have challenged, creating an ideal environment for studying reasoned discourse and opinion change.
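To get oriented, here is a minimal loading-and-exploration sketch with pandas, assuming the file paths and column names described above:

```python
import pandas as pd

# Load both datasets (paths as described above)
posts = pd.read_csv("data/changemyview_posts.csv")
comments = pd.read_csv("data/cmv_comments.csv")

print(posts.shape, comments.shape)  # expect ~5,000 posts and ~12,106 comments
print(posts.isna().sum())           # check for missing values before analysis

# A simple statistic: average post length in words, ignoring empty bodies
posts["n_words"] = posts["selftext"].fillna("").str.split().str.len()
print(posts["n_words"].mean())
```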
```
nlp-analysis-social-text/
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── data/
│   ├── changemyview_posts.csv # CMV posts dataset
│   └── cmv_comments.csv       # CMV comments dataset
├── notebooks/
│   └── nlp_analysis.ipynb
└── output/
    └── (your output files will go here)
```
1. Clone this repository from GitHub Classroom.

2. Create a conda environment (recommended):

   ```bash
   conda create -n nlp-hw5 python=3.9
   conda activate nlp-hw5
   ```

   Note: If you don't have conda installed, you can get it from Anaconda or Miniconda (lighter weight).

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Open the notebook:
   - Jupyter: `jupyter notebook notebooks/nlp_analysis.ipynb`
   - VSCode: open the notebook file directly
This assignment provides a structured approach to analyzing social discourse patterns in Reddit's r/ChangeMyView community using fundamental NLP techniques.
1. **Data Loading & Basic Exploration** (see the loading sketch in the dataset section above)
   - Successfully load both datasets (posts and comments)
   - Perform basic data exploration (shape, columns, missing values)
   - Create simple statistics (average post length, comment counts)
   - Visualize basic distributions (post scores, comment lengths)

2. **Text Preprocessing** (see the sketch after this list)
   - Clean text data (remove special characters, lowercase)
   - Tokenize posts and comments
   - Remove stopwords using NLTK
   - Create and compare word frequency distributions

3. **Comparative Analysis**
   - Find the top 20 most common words in posts vs. comments
   - Create word clouds for posts and comments separately
   - Calculate basic text statistics (average word length, vocabulary size)
   - Identify words that appear only in posts or only in comments

4. **Sentiment Analysis** (also covered in the sketch after this list)
   - Apply a pre-built sentiment analyzer (VADER or TextBlob)
   - Compare sentiment distributions between posts and comments
   - Find the most positive and negative posts/comments
   - Create visualizations of sentiment patterns

5. **Interpretation**
   - Write a 2-paragraph summary of your findings
   - Discuss one interesting pattern you discovered
   - Suggest one way these findings could be useful in a social science setting
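The sketch below illustrates the preprocessing and sentiment steps, assuming `posts` and `comments` are loaded as in the earlier snippet. The regex tokenizer is a deliberately simple choice (it avoids downloading an NLTK tokenizer model); stopwords and VADER come from NLTK as the tasks suggest:

```python
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time resource downloads (the notebook also handles this automatically)
nltk.download("stopwords")
nltk.download("vader_lexicon")

stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop NLTK stopwords."""
    tokens = re.findall(r"[a-z]+", str(text).lower())
    return [t for t in tokens if t not in stop_words]

post_tokens = [t for doc in posts["selftext"].fillna("") for t in preprocess(doc)]
comment_tokens = [t for doc in comments["body"].fillna("") for t in preprocess(doc)]

# Top 20 most common words in each corpus
print(Counter(post_tokens).most_common(20))
print(Counter(comment_tokens).most_common(20))

# VADER compound sentiment per document (empty text scores 0.0)
sia = SentimentIntensityAnalyzer()
posts["sentiment"] = posts["selftext"].fillna("").map(
    lambda t: sia.polarity_scores(t)["compound"])
comments["sentiment"] = comments["body"].fillna("").map(
    lambda t: sia.polarity_scores(t)["compound"])
print(posts["sentiment"].describe())
print(comments["sentiment"].describe())
```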
Deliverables:
- Completed notebook with all code cells executed
- At least 4 visualizations
- Written interpretation of findings
- Documentation of any challenges you faced
For students who complete the basic analysis and want additional challenges:
1. **Advanced Text Analysis** (see the linking/TF-IDF sketch after this list)
   - Link posts to their comments using the ID columns
   - Implement TF-IDF to find distinctive vocabulary between posts and comments
   - Apply named entity recognition to identify key topics

2. **Conversation Dynamics**
   - Calculate semantic similarity between posts and their comments
   - Analyze response patterns (agreement vs. disagreement language)
   - Identify characteristics of high-engagement conversations

3. **Word Embeddings**
   - Load and apply pre-trained word embeddings (GloVe or Word2Vec)
   - Calculate semantic distances between key concepts
   - Visualize word relationships in semantic space

4. **Machine Learning Applications**
   - Build a classifier to predict comment engagement levels
   - Implement topic modeling (LDA) to discover conversation themes
   - Explore which linguistic features correlate with successful persuasion
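For the linking and TF-IDF items, here is one possible sketch. It assumes `posts` and `comments` are loaded as above and that `link_id` values carry Reddit's usual `t3_` prefix; verify both assumptions against your data:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Link comments to their parent posts; Reddit link_ids are typically "t3_<post id>"
comments["post_id"] = comments["link_id"].astype(str).str.replace("t3_", "", regex=False)
linked = comments.merge(posts, left_on="post_id", right_on="id",
                        suffixes=("_comment", "_post"))
print(len(linked), "comments matched to posts")

# TF-IDF over all documents, then compare mean term weights per group
post_texts = posts["selftext"].fillna("").tolist()
comment_texts = comments["body"].fillna("").tolist()

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(post_texts + comment_texts)
terms = np.array(vectorizer.get_feature_names_out())

post_mean = np.asarray(X[:len(post_texts)].mean(axis=0)).ravel()
comment_mean = np.asarray(X[len(post_texts):].mean(axis=0)).ravel()
diff = post_mean - comment_mean

print("Distinctively post-like:", terms[np.argsort(diff)[::-1][:20]])
print("Distinctively comment-like:", terms[np.argsort(diff)[:20]])
```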
In the `example_approach/` folder, you'll find an example approach to analyzing CMV conversations using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). This example is entirely optional and provided for inspiration only.
What the example demonstrates:
- Data Collection via API: Using PRAW (Python Reddit API Wrapper) to collect fresh CMV data with delta tracking
- Delta Analysis: Identifying which comments successfully changed views (marked by delta awards)
- RAG Implementation: Using TF-IDF retrieval to find relevant conversation snippets (a simplified, LLM-free version is sketched after this list)
- LLM Analysis: Employing Qwen 2.5 to analyze persuasion patterns and rhetorical strategies
- Structured Output: Generating JSON-formatted insights about what makes arguments persuasive
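As a flavor of the retrieval step without any LLM, here is a simplified sketch using TF-IDF vectors and cosine similarity to pull the comments most similar to a given post (`post_row` is purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit a vectorizer on the comments, then score one post against all of them
vectorizer = TfidfVectorizer(stop_words="english")
comment_vecs = vectorizer.fit_transform(comments["body"].fillna(""))

post_row = posts.iloc[0]  # illustrative: pick any post of interest
post_vec = vectorizer.transform([str(post_row["selftext"])])

sims = cosine_similarity(post_vec, comment_vecs).ravel()
top_idx = sims.argsort()[::-1][:5]
print(comments.iloc[top_idx][["body", "score"]])  # the 5 most similar comments
```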
Key Concepts You Could Adapt (without LLMs):
- Success Metrics: Analyzing differences between comments that changed views vs those that didn't
- Conversation Threading: Following argument chains from initial post to resolution
- Persuasion Patterns: Identifying linguistic features of successful arguments
- Rhetorical Analysis: Examining ethos, pathos, logos in argumentation
Important Notes:
- This approach requires additional dependencies (PRAW, transformers, torch)
- LLM inference benefits greatly from GPU access (Google Colab)
- The focus includes traditional NLP and prompt engineering
- You can extract ideas without implementing the full stack
If You're Interested:
- Advanced pathway students might incorporate LLM-based analysis
- You could use simpler methods to explore similar research questions
- Consider the delta concept in r/changemyview: What language patterns correlate with view changes?
- Think about how traditional NLP methods could answer similar questions
Remember: This is one possible approach among many. Your creativity in applying NLP techniques to understand online discourse is what matters most.
- Start with exploration: Understand your data before diving into analysis
- Document as you go: Explain your thinking and choices
- Visualize findings: Good visualizations tell the story better than numbers
- Think critically: What do these patterns really mean for online discourse?
- Word embeddings loading: The first time you load GloVe embeddings, expect a few minutes' wait and a ~200MB download (see the sketch after this list)
- Dataset linking errors: Ensure you're using the correct ID columns (`posts['id']` and `comments['link_id']`)
- Memory issues with large datasets: Work with samples if needed, or use the provided subsets
- Missing NLTK data: The notebook will download required NLTK resources automatically
- Import errors: Make sure you've installed all packages from `requirements.txt`, including `gensim`
- Empty conversations: Some posts may have no comments in the dataset; handle these cases gracefully
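For the word-embeddings items (and the loading caveat above), one low-friction option is gensim's downloader; `glove-wiki-gigaword-100` is one of the standard gensim-data models, and the first call is a sizeable one-time download:

```python
import gensim.downloader as api

# First call downloads and caches the vectors under ~/gensim-data
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("persuade", topn=5))    # nearest neighbors in embedding space
print(glove.similarity("argument", "evidence"))  # cosine similarity between concepts
print(glove.distance("agree", "disagree"))       # 1 - cosine similarity
```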
- You may use AI tools (ChatGPT, Copilot, etc.) to help with coding, but you must:
  - Document any AI assistance in code comments
  - Understand and be able to explain all code you submit
  - Write your own interpretation and analysis
- Collaboration is encouraged for understanding concepts, but each student must submit their own work
- Clearly indicate which pathway you chose at the top of your notebook
- Complete all required tasks for your chosen pathway
- Push to GitHub Classroom with:
  - Your completed notebook
  - Any additional output files in the `output/` folder
  - A clear indication of which pathway you followed