
Commit 7b238ed

initial (0 parents)

File tree

19 files changed: 921 additions, 0 deletions

.github/workflows/ci.yml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Versions are quoted so YAML does not parse 3.10 as the number 3.1
        python-version: ["3.9", "3.10", "3.11"]

    steps:
      - uses: actions/checkout@v3

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run pre-commit hooks
        run: |
          pre-commit install
          pre-commit run --all-files

      - name: Run tests
        run: |
          pytest --cov=. --cov-report=xml

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
          fail_ci_if_error: true
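
The workflow installs dependencies only from `requirements.txt`, so for the pre-commit and coverage steps to pass, that file presumably pins at least the tools invoked above. The file is part of this commit but not expanded in this view; a plausible minimal set (an assumption, not the actual file contents):

```
click
python-dotenv
pydantic
pytest
pytest-cov
pre-commit
```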

.pre-commit-config.yaml

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
      - id: check-ast
      - id: check-json
      - id: check-merge-conflict
      - id: detect-private-key

  - repo: https://github.com/psf/black
    rev: 23.3.0
    hooks:
      - id: black
        language_version: python3

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: ["--profile", "black"]

  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        additional_dependencies: [flake8-docstrings]
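
A note on usage: beyond `pre-commit run --all-files`, individual hooks from this config can be run by their id, which helps when iterating on a single tool:

```bash
pre-commit run black --all-files
```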

README.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Reddit Scraper

A simple tool to scrape posts and comments from Reddit subreddits.

## What it does

- Scrapes top posts and their comments from specified subreddits
- Supports monthly or yearly time periods
- Can limit the number of posts scraped per subreddit
- Saves data in JSON format for easy analysis

## Installation

### From source

```bash
git clone https://github.com/yourusername/reddit-scraper.git
cd reddit-scraper
pip install -e .
```

### Using pip

```bash
pip install reddit-scraper
```

## Usage

1. Create a `subreddits.json` file with your target subreddits:

   ```json
   [
     "programming",
     "python",
     "physics",
     "biology"
   ]
   ```

2. Run the scraper:

   ```bash
   reddit-scraper -d month -s subreddits.json
   ```

3. To limit the number of posts scraped per subreddit:

   ```bash
   reddit-scraper -d month -s subreddits.json -l 50
   ```

## Options

- `-d, --duration`: Time period to scrape (`month` or `year`)
- `-l, --post-limit`: Maximum number of posts per subreddit
- `-s, --subreddits-file`: Path to the subreddits config file

## Output

Data is saved as JSON files under the `data/` directory, one file per subreddit.
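
Each file contains the scraped posts serialized from the `Post` and `Comment` models, with timestamps rendered as ISO 8601 strings by the project's custom JSON encoder. Roughly (a sketch of the shape, with placeholder values):

```json
[
  {
    "post_body": "Post title",
    "post_user": "author_name",
    "post_time": "2024-01-15T12:34:56",
    "comments": [
      {
        "body": "Comment text",
        "user": "commenter_name",
        "time": "2024-01-15T13:00:00",
        "replies": []
      }
    ]
  }
]
```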
## Development

### Project Structure

```
reddit-scraper/
├── reddit_scraper/            # Main package
│   ├── core/                  # Core functionality
│   │   ├── models.py          # Data models
│   │   ├── scraper.py         # Scraping logic
│   │   └── data_processor.py  # Data processing
│   ├── utils/                 # Utilities
│   │   └── config.py          # Configuration
│   ├── __init__.py            # Package initialization
│   └── __main__.py            # Entry point
├── tests/                     # Test suite
├── setup.py                   # Package setup
├── requirements.txt           # Dependencies
└── README.md                  # Documentation
```

### Pre-commit Hooks

This project uses pre-commit hooks to ensure code quality. To set them up:

```bash
pre-commit install
```

The hooks run automatically on commit, or you can run them manually:

```bash
pre-commit run --all-files
```

### Testing

Run the tests with:

```bash
pytest
```

For coverage information:

```bash
pytest --cov=. --cov-report=term-missing
```

pytest.ini

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts = --cov=. --cov-report=term-missing

reddit_scraper/__init__.py

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
"""Reddit Scraper - A simple tool to scrape posts and comments from Reddit subreddits."""

__version__ = "0.1.0"

reddit_scraper/__main__.py

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
"""Reddit Scraper - A simple tool to scrape posts and comments from Reddit subreddits."""

from typing import Optional

import click
from dotenv import load_dotenv

from reddit_scraper.core.data_processor import DataProcessor
from reddit_scraper.core.models import SubredditConfig
from reddit_scraper.core.scraper import RedditScraper

load_dotenv()


def process_subreddit(
    subreddit_config: SubredditConfig, post_limit: Optional[int] = None
) -> None:
    """Process a single subreddit."""
    print(f"Scraping subreddit: {subreddit_config.name}")
    scraper = RedditScraper(subreddit=subreddit_config.url, post_limit=post_limit)
    processor = DataProcessor()

    try:
        scraper.get_posts()
        # Keep fetching post details until the scraper reports no more checkpoints.
        while True:
            checkpointed = scraper.get_post_details()
            if not checkpointed:
                break

        processor.save_to_json(scraper.posts, subreddit_config.name)
    except Exception as e:
        print(f"Error processing subreddit {subreddit_config.name}: {e}")
    finally:
        scraper.destroy()


@click.command()
@click.option(
    "-d",
    "--duration",
    prompt="Scrape Duration",
    help="Duration to scrape for (month/year)",
)
@click.option(
    "-s",
    "--subreddits-file",
    default="subreddits.json",
    help="Path to the subreddits JSON file",
)
@click.option(
    "-l",
    "--post-limit",
    type=int,
    help="Maximum number of posts to scrape per subreddit",
)
def main(duration: str, subreddits_file: str, post_limit: Optional[int] = None) -> None:
    """Main entry point for the Reddit scraper."""
    if duration not in ["month", "year"]:
        raise ValueError("Duration must be either 'month' or 'year'")

    processor = DataProcessor()
    subreddits = processor.read_subreddits_from_json(subreddits_file, duration)

    for subreddit_config in subreddits:
        process_subreddit(subreddit_config, post_limit)


if __name__ == "__main__":
    main()
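
Because the package ships a `__main__.py`, the CLI can also be invoked directly as a module, equivalent to the `reddit-scraper` console script described in the README:

```bash
python -m reddit_scraper --duration month --subreddits-file subreddits.json --post-limit 50
```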

reddit_scraper/core/__init__.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
"""Core functionality for the Reddit Scraper."""

from reddit_scraper.core.data_processor import DataProcessor, DateTimeEncoder
from reddit_scraper.core.models import Comment, Post, ScraperState, SubredditConfig
from reddit_scraper.core.scraper import RedditScraper

__all__ = [
    "Comment",
    "Post",
    "SubredditConfig",
    "ScraperState",
    "RedditScraper",
    "DataProcessor",
    "DateTimeEncoder",
]
reddit_scraper/core/data_processor.py

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
import json
import os
from datetime import datetime
from typing import Any, Dict, List

from reddit_scraper.core.models import Comment, Post, SubredditConfig
from reddit_scraper.utils.config import get_scraper_config


class DateTimeEncoder(json.JSONEncoder):
    """Custom JSON encoder for datetime objects."""

    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)


class DataProcessor:
    """Handles data processing and storage operations."""

    def __init__(self):
        """Initialize the data processor."""
        self.config = get_scraper_config()

    def parse_post_data(self, json_data: Dict[str, Any]) -> Post:
        """Parse raw JSON data into a Post model."""
        # A Reddit post endpoint returns a two-element listing:
        # the post itself, followed by its comment tree.
        post = json_data[0]["data"]["children"][0]["data"]
        comments_data = json_data[1]["data"]["children"]

        return Post(
            post_body=post["title"],
            post_user=post["author"],
            post_time=datetime.fromtimestamp(post["created_utc"]),
            comments=self._parse_comments(comments_data),
        )

    def _parse_comments(self, comment_data: List[Dict[str, Any]]) -> List[Comment]:
        """Parse comment data into Comment models, recursing into replies."""
        comments = []
        for comment in comment_data:
            # Skip non-comment nodes (only kind "t1" is a comment).
            if comment["kind"] != "t1":
                continue

            comment_dict = comment["data"]
            comments.append(
                Comment(
                    body=comment_dict["body"],
                    user=comment_dict["author"],
                    time=datetime.fromtimestamp(comment_dict["created_utc"]),
                    replies=self._parse_comments(
                        comment_dict["replies"]["data"]["children"]
                    )
                    if comment_dict.get("replies")
                    else [],
                )
            )
        return comments

    def save_to_json(self, data: List[Post], subreddit: str) -> str:
        """Save processed posts to one JSON file per subreddit; return the path."""
        directory = self.config.data_dir
        os.makedirs(directory, exist_ok=True)
        filename = f"{directory}/{subreddit}.json"

        with open(filename, "w") as f:
            json.dump([post.dict() for post in data], f, cls=DateTimeEncoder)
        return filename

    def read_subreddits_from_json(
        self, filename: str, duration: str
    ) -> List[SubredditConfig]:
        """Read subreddit configurations from a JSON file."""
        with open(filename) as f:
            subreddits_list = json.load(f)

        return [
            SubredditConfig(
                name=subreddit,
                url=f"https://www.reddit.com/r/{subreddit}/top/?t={duration}",
            )
            for subreddit in subreddits_list
        ]
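
`models.py` is among the 19 files in this commit but is not expanded in this view. A minimal sketch of what it plausibly contains, with field names inferred from how `data_processor.py` and `__main__.py` construct these objects, and Pydantic v1 assumed from the `post.dict()` call above; `ScraperState` is re-exported by `core/__init__.py` but its fields are not visible here:

```python
# Hypothetical reconstruction of reddit_scraper/core/models.py -- field names
# come from call sites in this commit; Pydantic v1 is an assumption.
from datetime import datetime
from typing import List

from pydantic import BaseModel


class Comment(BaseModel):
    body: str
    user: str
    time: datetime
    replies: List["Comment"] = []


# Resolve the self-referencing "Comment" annotation (Pydantic v1 idiom).
Comment.update_forward_refs()


class Post(BaseModel):
    post_body: str
    post_user: str
    post_time: datetime
    comments: List[Comment] = []


class SubredditConfig(BaseModel):
    name: str
    url: str
```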
