Commit 937f2f8

Add comprehensive scraper documentation
Co-authored-by: willtheorangeguy <[email protected]>
1 parent d9658b2 commit 937f2f8

1 file changed

+238
-0
lines changed

SCRAPER_README.md

Lines changed: 238 additions & 0 deletions
@@ -0,0 +1,238 @@
1+
# Price Scraper Documentation
2+
3+
## Overview
4+
5+
This repository includes an automated price scraper that updates PC component prices daily. The scraper runs via GitHub Actions and rewrites the price tables in the repository's Markdown build files with current pricing, which PCPartPicker aggregates from major retailers.
6+
7+
## How It Works
8+
9+
### Architecture
10+
11+
1. **scraper.py**: Python script that performs the actual scraping
12+
2. **.github/workflows/update-prices.yml**: GitHub Actions workflow that runs the scraper daily
13+
3. **requirements.txt**: Python dependencies
14+
15+
### Process Flow
16+
17+
```
18+
┌─────────────────────────────────────────────────────────────┐
19+
│ 1. GitHub Actions triggers daily at 2 AM UTC │
20+
└────────────────────┬────────────────────────────────────────┘
21+
22+
23+
┌─────────────────────────────────────────────────────────────┐
24+
│ 2. Install Python dependencies (beautifulsoup4, requests) │
25+
└────────────────────┬────────────────────────────────────────┘
26+
27+
28+
┌─────────────────────────────────────────────────────────────┐
29+
│ 3. Run scraper.py │
30+
│ • Find all 75 markdown files │
31+
│ • Extract PCPartPicker list URLs │
32+
│ • Scrape current prices from PCPartPicker │
33+
│ • Update markdown tables with new prices │
34+
└────────────────────┬────────────────────────────────────────┘
35+
36+
37+
┌─────────────────────────────────────────────────────────────┐
38+
│ 4. Commit and push changes (if any prices updated) │
39+
└─────────────────────────────────────────────────────────────┘
40+
```
41+
42+
### Data Source
43+
44+
The scraper uses **PCPartPicker** as the data source because:
45+
- PCPartPicker already aggregates prices from multiple retailers (Newegg, Amazon, Best Buy, etc.)
46+
- Each build in the repository already has a PCPartPicker list URL
47+
- PCPartPicker handles the complexity of tracking product availability across retailers
48+
- Scraping a single aggregator is more reliable than scraping individual retailer sites directly
49+
50+
### Security Features
51+
52+
- **URL Validation**: Uses proper URL parsing with a whitelist of trusted retailer domains
53+
- **Error Handling**: `try`/`except` blocks around network and parsing steps keep a single failure from crashing the whole run
54+
- **Rate Limiting**: 2-second delay between requests to be respectful to servers
55+
- **No Secrets Required**: No API keys or credentials needed
56+
- **CodeQL Scanned**: CodeQL security scanning reported no vulnerabilities
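
The URL validation described above can be sketched as follows. This is a minimal illustration, not the code in `scraper.py`: the function name `retailer_for` and the contents of `TRUSTED_RETAILERS` are assumptions, and only the domain-to-name mapping shape is taken from this document. The point is that the hostname is parsed and compared exactly, so a trusted domain appearing elsewhere in the URL does not pass.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical allowlist for illustration; the real mapping lives in scraper.py.
TRUSTED_RETAILERS = {
    "amazon.ca": "Amazon Canada",
    "www.amazon.ca": "Amazon Canada",
    "newegg.ca": "Newegg Canada",
    "www.newegg.ca": "Newegg Canada",
}


def retailer_for(url: str) -> Optional[str]:
    """Return the retailer name for a trusted http(s) URL, else None."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. a broken IPv6 host.
        return None
    if parsed.scheme not in ("http", "https"):
        return None
    # Compare the parsed hostname exactly; substring checks are unsafe.
    host = (parsed.hostname or "").lower()
    return TRUSTED_RETAILERS.get(host)
```

An exact hostname match is why both the bare and the `www.` form of each domain are listed.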
57+
58+
## Manual Usage
59+
60+
### Prerequisites
61+
62+
```bash
63+
# Python 3.12+ recommended
64+
python --version
65+
66+
# Install dependencies
67+
pip install -r requirements.txt
68+
```
69+
70+
### Running the Scraper
71+
72+
```bash
73+
# Run from repository root
74+
python scraper.py
75+
```
76+
77+
The script will:
78+
1. Find all markdown files in year directories (2018/, 2020/, 2021/, etc.)
79+
2. Extract PCPartPicker URLs from each file
80+
3. Scrape current prices
81+
4. Update markdown files with new prices
82+
5. Log progress and any errors
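
Steps 1 and 2 above can be sketched as follows. This is a hedged illustration rather than the actual implementation: the helper names `find_build_files` and `extract_list_url`, the four-digit year-directory pattern, and the URL regular expression are all assumptions based on the examples shown in this document.

```python
import re
from pathlib import Path
from typing import List, Optional

# Assumption: year directories are named with exactly four digits (2018/, 2020/, ...).
YEAR_DIR = re.compile(r"^\d{4}$")
# Assumption: list URLs look like https://ca.pcpartpicker.com/list/<id>,
# with an optional two-letter country subdomain.
LIST_URL = re.compile(r"https://(?:[a-z]{2}\.)?pcpartpicker\.com/list/[A-Za-z0-9]+")


def find_build_files(root: Path) -> List[Path]:
    """Collect every .md file under the year directories of the repository root."""
    files: List[Path] = []
    for child in sorted(root.iterdir()):
        if child.is_dir() and YEAR_DIR.match(child.name):
            files.extend(sorted(child.rglob("*.md")))
    return files


def extract_list_url(markdown: str) -> Optional[str]:
    """Return the first PCPartPicker list URL in the text, or None if absent."""
    match = LIST_URL.search(markdown)
    return match.group(0) if match else None
```

Restricting the search to year directories keeps files such as this README out of the scrape.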
83+
84+
### Output
85+
86+
```
87+
2025-11-18 14:00:00 - INFO - Starting PC Parts Price Scraper
88+
2025-11-18 14:00:00 - INFO - Found 75 markdown files to process
89+
2025-11-18 14:00:00 - INFO - Processing: 2018/January/Budget.md
90+
2025-11-18 14:00:02 - INFO - Scraping prices from: https://ca.pcpartpicker.com/list/8gGn9r
91+
2025-11-18 14:00:04 - INFO - Scraped 8 prices from URL
92+
2025-11-18 14:00:04 - INFO - Updated 2018/January/Budget.md
93+
...
94+
2025-11-18 14:15:00 - INFO - Price scraping complete!
95+
2025-11-18 14:15:00 - INFO - Files updated: 42
96+
2025-11-18 14:15:00 - INFO - Files failed: 0
97+
2025-11-18 14:15:00 - INFO - Total files processed: 75
98+
```
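
The timestamp, level, and message layout shown above matches what Python's standard `logging` module can produce. A minimal sketch, assuming that module is used; the helper names `make_formatter` and `log_summary` are hypothetical and not taken from `scraper.py`:

```python
import logging


def make_formatter() -> logging.Formatter:
    """Build a formatter producing 'YYYY-MM-DD HH:MM:SS - LEVEL - message'."""
    return logging.Formatter(
        fmt="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )


def log_summary(logger: logging.Logger, updated: int, failed: int, total: int) -> None:
    """Emit the end-of-run summary lines."""
    logger.info("Price scraping complete!")
    logger.info("Files updated: %d", updated)
    logger.info("Files failed: %d", failed)
    logger.info("Total files processed: %d", total)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(make_formatter())
    log = logging.getLogger("price_scraper")  # hypothetical logger name
    log.setLevel(logging.INFO)
    log.addHandler(handler)
    log_summary(log, updated=42, failed=0, total=75)
```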
99+
100+
## GitHub Actions Workflow
101+
102+
### Automatic Execution
103+
104+
The workflow runs automatically:
105+
- **Schedule**: Daily at 2:00 AM UTC
106+
- **Manual trigger**: Can also be run on demand from the GitHub Actions UI
107+
108+
### Manual Triggering
109+
110+
1. Go to the repository on GitHub
111+
2. Click the "Actions" tab
112+
3. Select the "Update PC Part Prices" workflow
113+
4. Click "Run workflow"
114+
5. Select the branch and click the "Run workflow" button
115+
116+
### Workflow Configuration
117+
118+
```yaml
119+
# .github/workflows/update-prices.yml
120+
schedule:
121+
- cron: '0 2 * * *' # Daily at 2 AM UTC
122+
```
123+
124+
To change the schedule, modify the cron expression:
125+
- `'0 */6 * * *'` - Every 6 hours
126+
- `'0 0 * * 1'` - Every Monday at midnight UTC
127+
- `'0 12 * * *'` - Daily at noon UTC
128+
129+
## Markdown Format
130+
131+
The scraper expects markdown files with this table format:
132+
133+
```markdown
134+
# January 2018 - Budget
135+
136+
[PCPartPicker Part List](https://ca.pcpartpicker.com/list/8gGn9r)
137+
138+
| Type | Item | Price | Print Price |
139+
| :--- | :--- | :--- | :--- |
140+
| **CPU** | [AMD Ryzen 3 1200...](https://ca.pcpartpicker.com/product/...) | $276.90 @ Amazon Canada | $110.00 |
141+
| **Memory** | [Patriot Viper Elite...](https://ca.pcpartpicker.com/product/...) | - | $77.00 |
142+
```
143+
144+
The scraper:
145+
- Extracts the PCPartPicker list URL from the link directly below the title
146+
- Parses the table to find product names
147+
- Updates the "Price" column with current prices
148+
- Preserves the "Print Price" column (historical data)
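
The column handling described above can be sketched as follows. This is an illustration under stated assumptions, not the scraper's actual parser: the helper names `is_data_row` and `update_price_cell` are hypothetical, rows are assumed to have exactly the four columns shown, and cell text is assumed to contain no `|` characters.

```python
from typing import List


def is_data_row(cells: List[str]) -> bool:
    """True for a four-column row that is neither the header nor the separator."""
    if len(cells) != 4:
        return False
    first = cells[0]
    # Header row starts with "Type"; separator cells contain only ':', '-' or spaces.
    return first != "Type" and not set(first) <= set(":- ")


def update_price_cell(row: str, new_price: str) -> str:
    """Replace the "Price" cell (third column) and keep every other cell as-is."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if not is_data_row(cells):
        return row  # headings, header rows, and separators pass through untouched
    cells[2] = new_price
    return "| " + " | ".join(cells) + " |"
```

Because only index 2 is overwritten, the "Print Price" column in the fourth cell is left untouched.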
149+
150+
## Supported Retailers
151+
152+
The scraper recognizes these retailers:
153+
- Amazon Canada (amazon.ca, amazon.com)
154+
- Newegg Canada (newegg.ca, newegg.com)
155+
- Best Buy Canada (bestbuy.ca, bestbuy.com)
156+
- Vuugo (vuugo.com)
157+
- Canada Computers (canadacomputers.com)
158+
159+
## Troubleshooting
160+
161+
### No Prices Found
162+
163+
If the scraper reports "No prices scraped":
164+
1. Check that the PCPartPicker URL is valid
165+
2. Verify the PCPartPicker page loads in a browser
166+
3. Check GitHub Actions logs for detailed error messages
167+
168+
### Prices Not Updating
169+
170+
Common causes:
171+
1. Products are out of stock (shows as "-")
172+
2. PCPartPicker page structure changed (may need scraper update)
173+
3. Network issues during GitHub Actions run
174+
175+
### GitHub Actions Failed
176+
177+
1. Check the Actions tab for error logs
178+
2. Verify requirements.txt dependencies are compatible
179+
3. Check if PCPartPicker website is accessible
180+
181+
## Maintenance
182+
183+
### Updating Dependencies
184+
185+
```bash
186+
# Check for outdated packages
187+
pip list --outdated
188+
189+
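# Update requirements.txt (run inside a clean virtual environment so that pip freeze captures only the scraper's dependencies)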
190+
pip install --upgrade beautifulsoup4 requests lxml
191+
pip freeze > requirements.txt
192+
```
193+
194+
### Adding New Retailers
195+
196+
To add support for a new retailer:
197+
198+
1. Edit `scraper.py`
199+
2. Add the domain to the `trusted_retailers` dictionary
200+
3. Test with a sample build
201+
4. Commit and push
202+
203+
Example:
204+
```python
205+
trusted_retailers = {
206+
# ... existing retailers ...
207+
'www.memoryexpress.com': 'Memory Express',
208+
'memoryexpress.com': 'Memory Express',
209+
}
210+
```
211+
212+
## Limitations
213+
214+
- **Internet Required**: The scraper needs internet access to reach PCPartPicker
215+
- **Rate Limiting**: 2-second delay between requests (takes ~3-5 minutes for all 75 files)
216+
- **PCPartPicker Dependency**: If PCPartPicker changes its HTML structure, the scraper will need to be updated
217+
- **Canadian Prices**: Currently configured for Canadian pricing (ca.pcpartpicker.com)
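
The rate-limiting behaviour can be sketched as a loop that pauses between requests and tolerates individual failures. This is a minimal sketch under assumptions: `fetch_all` and its parameters are hypothetical names, and the `fetch` callable stands in for whatever the scraper uses to download and parse a list page. With 75 URLs, the 74 pauses alone add 148 seconds, before any request time is counted.

```python
import logging
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Optional[str]]:
    """Fetch each URL in turn, pausing between requests; failures map to None."""
    results: Dict[str, Optional[str]] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: one bad page must not stop the run
            logging.warning("Failed to fetch %s: %s", url, exc)
            results[url] = None
    return results
```

Passing `sleep` as a parameter lets a test substitute a no-op and run instantly.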
218+
219+
## Future Improvements
220+
221+
Potential enhancements:
222+
- [ ] Support for US pricing (pcpartpicker.com)
223+
- [ ] Price history tracking
224+
- [ ] Email notifications when prices drop significantly
225+
- [ ] Support for more retailers
226+
- [ ] Parallel processing for faster execution
227+
- [ ] Website (gh-pages) automatic updates
228+
229+
## Support
230+
231+
For issues or questions:
232+
1. Check existing Issues on GitHub
233+
2. Review GitHub Actions logs for errors
234+
3. Open a new Issue with detailed information
235+
236+
## License
237+
238+
Same as repository license (see LICENSE.md)
