A command-line Python scraper that uses PRAW to extract Reddit comments for NLP and GPT model analysis. It collects submissions from a specified subreddit, processes the data, and saves the output into batched JSON files.
- Conda: This project uses `conda` for environment management. You can install it via Anaconda or Miniconda.
- Reddit & Google API Credentials: You will need API keys for both Reddit and any Google services you intend to use.
Follow these steps to set up the project environment using conda.
1. Clone the Repository

```bash
git clone <your-repository-url>
cd ask_reddit
```

2. Create and Activate the Conda Environment
The repository includes an `environment.yml` file that contains all the necessary dependencies. Run the following command from your terminal to create the environment:

```bash
conda env create -f environment.yml
```

Once the process is complete, activate the new environment:

```bash
conda activate ask_reddit
```

3. Configure Environment Variables
Create a file named `.env` in the project's root directory. This file stores your API keys and configuration outside of source control. Copy the following template into the `.env` file and add your credentials.
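For reference, a minimal sketch of how these variables can be consumed in Python. This assumes the variables are already in the process environment (for example, loaded from `.env` with the `python-dotenv` package); the project's actual configuration code may differ.

```python
import os

def load_reddit_config() -> dict:
    # Sketch only: reads the variables defined in the .env template.
    # In practice you would first populate the environment, e.g. with
    # `from dotenv import load_dotenv; load_dotenv()` (python-dotenv).
    return {
        "client_id": os.getenv("REDDIT_CLIENT_ID"),
        "client_secret": os.getenv("REDDIT_CLIENT_SECRET"),
        "user_agent": os.getenv("REDDIT_USER_AGENT"),
        "password": os.getenv("REDDIT_PASSWORD"),
    }

# These values would typically be handed to praw.Reddit(...) to build
# the API client; see the PRAW documentation for the exact auth flow.
```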
```ini
# .env file

# --- Reddit API Credentials ---
REDDIT_CLIENT_ID="YOUR_CLIENT_ID_HERE"
REDDIT_CLIENT_SECRET="YOUR_CLIENT_SECRET_HERE"
REDDIT_USER_AGENT="A_DESCRIPTIVE_USER_AGENT_STRING"
REDDIT_PASSWORD="YOUR_REDDIT_PASSWORD"

# --- Google Generative AI Configuration ---
GOOGLE_API_KEY="YOUR_GOOGLE_API_KEY_HERE"
GENAI_MODEL="gemini-2.5-flash"

# --- File & Data Configuration ---
FILE_LOCATION="data/"
SOURCE="reddit"
```

Run the module from your terminal with the required arguments.
```bash
python -m ask_reddit --subreddit <name> --days <number> --batch <M|D>
```

- `--subreddit`: (Required) The name of the subreddit to scrape (e.g., `python`).
- `--days`: (Required) The number of days back from today to collect submissions.
- `--batch`: (Required) The batching mode for output files (`D` for daily, `M` for monthly).
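The interface above could be defined with `argparse` roughly as follows. This is a hypothetical sketch of the documented flags, not the project's actual parser:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the CLI described above; the
    # project's real argument handling may differ.
    parser = argparse.ArgumentParser(prog="ask_reddit")
    parser.add_argument("--subreddit", required=True,
                        help="Name of the subreddit to scrape, e.g. 'python'.")
    parser.add_argument("--days", type=int, required=True,
                        help="Number of days back from today to collect submissions.")
    parser.add_argument("--batch", choices=["M", "D"], required=True,
                        help="Batching mode: D for daily, M for monthly output files.")
    return parser
```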
```bash
python -m ask_reddit --subreddit dataisbeautiful --days 30 --batch M
```

The script will generate JSON files inside the `data/` directory, which is created automatically if it does not exist.
- Daily Batching (`D`): `r_subreddit_YYYY-MM-DD.json`
- Monthly Batching (`M`): `r_subreddit_YYYY-MM.json`
This project is licensed under the terms of the MIT license. See LICENSE for more details.
```bibtex
@misc{ask-reddit,
  author       = {john-james-ai},
  title        = {A Python scraper using PRAW to extract Reddit comments for NLP and GPT model analysis},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/ask-reddit/ask-reddit}}
}
```