
Commit e45a6e2

initial

1 parent 7b238ed commit e45a6e2

File tree

.gitignore
README.md

2 files changed: +117 -0 lines changed


.gitignore

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environment
venv/
env/
ENV/
.env
.venv
env.bak/
venv.bak/

# IDE - VSCode
.vscode/
*.code-workspace
.history

# IDE - PyCharm
.idea/
*.iml
*.iws
.idea_modules/

# IDE - Jupyter Notebook
.ipynb_checkpoints
*.ipynb

# Coverage and Testing
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Project specific
data/
*.json
!reddit_scraper/data/subreddits.json
logs/
*.log

# OS specific
# macOS
.DS_Store
.AppleDouble
.LSOverride
._*

# Windows
Thumbs.db
Thumbs.db:encryptable
ehthumbs.db
ehthumbs_vista.db
*.stackdump
[Dd]esktop.ini

# Linux
*~
.fuse_hidden*
.directory
.Trash-*
.nfs*

data/

README.md

Lines changed: 26 additions & 0 deletions
@@ -9,6 +9,32 @@ A simple tool to scrape posts and comments from Reddit subreddits.
- Can limit the number of posts scraped per subreddit
- Saves data in JSON format for easy analysis

## How it works

### The Smart Way to Scrape Reddit

Instead of brute-forcing our way through Reddit's pages, we use a clever approach that's both efficient and respectful of Reddit's servers:

First, we visit the subreddit's top posts page and scroll through it just like a human would. As we scroll, we collect the IDs of all the posts we want to save. This is quick and lightweight - we're just gathering a list of what we want to look at later.
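
As a rough illustration of that first step, the sketch below drives a browser with Selenium and pulls post IDs out of the permalinks it finds. The browser choice, scroll count, and regex are assumptions made for the example, not this scraper's actual code:

```python
import re
import time

from selenium import webdriver


def collect_post_ids(subreddit: str, scrolls: int = 5) -> list[str]:
    """Scroll a subreddit's top listing and harvest post IDs from permalinks."""
    driver = webdriver.Firefox()  # any Selenium-supported browser works
    try:
        driver.get(f"https://www.reddit.com/r/{subreddit}/top/")
        for _ in range(scrolls):
            # Scroll to the bottom so more posts load, roughly like a human would.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # give the page a moment to load new posts
        # Post permalinks contain /comments/<post_id>/, so pull the IDs out of the HTML.
        ids = re.findall(r"/comments/([a-z0-9]+)/", driver.page_source)
        return list(dict.fromkeys(ids))  # de-duplicate while preserving order
    finally:
        driver.quit()
```
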
Then, for each post we're interested in, we use Reddit's own API to get all the data in a clean, structured format. We do this by adding `.json` to the end of any Reddit URL, which gives us everything we need in one go - the post itself, all its comments, and all the metadata.
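
Here is a minimal sketch of that `.json` trick using the `requests` library; the function name, User-Agent string, and returned fields are illustrative assumptions rather than this project's API:

```python
import requests


def fetch_post(post_id: str) -> dict:
    """Fetch a single post plus its comment tree as structured JSON."""
    url = f"https://www.reddit.com/comments/{post_id}.json"
    # Reddit is stricter with generic User-Agent strings, so send a descriptive one.
    headers = {"User-Agent": "reddit-scraper-example/0.1"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    listing = response.json()
    # The response is a two-element list: [post listing, comment listing].
    post = listing[0]["data"]["children"][0]["data"]
    comments = listing[1]["data"]["children"]
    return {"title": post["title"], "score": post["score"], "comments": comments}
```
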
This approach has several advantages:

- It's much faster than downloading and parsing entire HTML pages
- We get clean, structured data instead of messy HTML
- We can easily resume if something goes wrong
- We're less likely to trigger Reddit's rate limits

### Handling Reddit's Limits

Reddit doesn't like it when people make too many requests too quickly. Our scraper is smart about this:

- If Reddit tells us we're asking for too much too quickly, we save our progress and wait
- We can pick up right where we left off when we run the scraper again
- We add small delays between requests to be nice to Reddit's servers

This makes our scraper more reliable and less likely to get blocked.
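
The sketch below shows what that wait-and-retry behaviour could look like; the status-code check, delays, and retry count are illustrative, and the progress-saving/resume step is left out:

```python
import time

import requests


def polite_get(url: str, delay: float = 2.0, max_retries: int = 5) -> requests.Response:
    """GET with a courtesy delay between calls and a wait-and-retry on HTTP 429."""
    headers = {"User-Agent": "reddit-scraper-example/0.1"}
    for attempt in range(max_retries):
        time.sleep(delay)  # small pause before every request
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code != 429:  # 429 = Too Many Requests
            response.raise_for_status()
            return response
        # Rate limited: back off exponentially before trying again.
        time.sleep(2 ** attempt)
    raise RuntimeError(f"Gave up on {url} after {max_retries} rate-limited attempts")
```
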
## Installation

### From source
