Commit a826fd7

Merge pull request #48 from willtheorangeguy/copilot/add-web-scraper-for-prices-again
Add automated daily price scraper for PC component pricing
2 parents ceafde9 + 0035b97 commit a826fd7

6 files changed: +728 -0 lines changed

.github/workflows/update-prices.yml

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
+name: Update PC Part Prices
+
+on:
+  schedule:
+    # Run daily at 2 AM UTC (adjust as needed)
+    - cron: '0 2 * * *'
+  workflow_dispatch: # Allow manual triggering
+    inputs:
+      debug:
+        description: 'Enable debug logging'
+        required: false
+        default: false
+        type: boolean
+
+jobs:
+  update-prices:
+    runs-on: ubuntu-latest
+
+    permissions:
+      contents: write
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v5
+        with:
+          fetch-depth: 0
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: '3.12'
+          cache: 'pip'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+
+      - name: Run price scraper
+        run: |
+          python scraper.py
+        env:
+          DEBUG: ${{ inputs.debug && '1' || '' }}
+
+      - name: Check for changes
+        id: git-check
+        run: |
+          git diff --exit-code || echo "changes=true" >> $GITHUB_OUTPUT
+
+      - name: Commit and push changes
+        if: steps.git-check.outputs.changes == 'true'
+        run: |
+          git config --local user.email "github-actions[bot]@users.noreply.github.com"
+          git config --local user.name "github-actions[bot]"
+          git add .
+          git commit -m "chore: update PC part prices [automated]"
+          git push
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
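
The "Run price scraper" step sets `DEBUG` to `'1'` when the manual `debug` input is checked and to an empty string otherwise, which should also cover scheduled runs where the input is unset. How `scraper.py` consumes the variable is not visible in this diff. A minimal sketch, assuming the script treats any non-empty value as "on"; the helper name `log_level_from_env` is hypothetical:

```python
import logging
import os


def log_level_from_env(environ=os.environ):
    """Map the workflow's DEBUG variable to a logging level.

    Hypothetical helper: the real scraper.py may read DEBUG differently.
    """
    # The workflow passes '1' or '', so any non-empty value enables debug.
    debug_enabled = bool(environ.get("DEBUG", ""))
    level = logging.DEBUG if debug_enabled else logging.INFO
    logging.getLogger().setLevel(level)
    return level
```

Treating the empty string as "off" matters here, because the variable is always defined by the workflow even when debug logging was not requested.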

.gitignore

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+env/
+venv/
+ENV/
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+
+# Logs
+*.log

README.md

Lines changed: 20 additions & 0 deletions
@@ -50,6 +50,7 @@
 - Available as PCPartPicker lists, Markdown files, or on a website.
 - Markdown files and website show the original printed price.
 - Current prices are available in United States Dollar or Canadian Dollar.
+- **Automated daily price updates** - Prices are scraped from PCPartPicker, which aggregates retailers such as Newegg, Amazon, and Best Buy, and are updated automatically via GitHub Actions.
 - Cross platform.
 
 ## Download
@@ -157,6 +158,25 @@ Each of the issues has its builds listed in three different places, with either
 | October 2021 | AMD Turbo | [PCPartPicker](https://pcpartpicker.com/user/willtheornageguy/saved/VBjMFT) | [Markdown](/2021/October/AMD%20Turbo.md) | [Web](https://willtheorangeguy.github.io/Maximum-PC-Builds-Archive/2021/october/) |
 | October 2021 | Intel Turbo | [PCPartPicker](https://pcpartpicker.com/user/willtheornageguy/saved/F4s7wP) | [Markdown](/2021/October/Intel%20Turbo.md) | [Web](https://willtheorangeguy.github.io/Maximum-PC-Builds-Archive/2021/october/) |
 
+## Automated Price Updates
+
+This repository includes an automated price scraper that runs daily to keep component prices up-to-date. The scraper:
+
+- Runs automatically every day at 2 AM UTC via GitHub Actions
+- Scrapes current prices from PCPartPicker (which aggregates prices from retailers like Newegg, Amazon, and Best Buy)
+- Updates the markdown files with the latest pricing information
+- Can be manually triggered using the "Update PC Part Prices" workflow in the Actions tab
+
+The price scraper is implemented in Python and uses BeautifulSoup to parse PCPartPicker's build lists. If you want to run it manually:
+
+```bash
+# Install dependencies
+pip install -r requirements.txt
+
+# Run the scraper
+python scraper.py
+```
+
 ## Contributing
 
 Please contribute using [GitHub Flow](https://guides.github.com/introduction/flow). Create a branch, add commits, and [open a pull request](https://github.com/willtheorangeguy/PyWorkout/compare).

SCRAPER_README.md

Lines changed: 249 additions & 0 deletions
@@ -0,0 +1,249 @@
+# Price Scraper Documentation
+
+## Overview
+
+This repository includes an automated price scraper that updates PC component prices daily. The scraper runs via GitHub Actions and updates all markdown files with current pricing from major retailers.
+
+## How It Works
+
+### Architecture
+
+1. **scraper.py**: Python script that performs the actual scraping
+2. **.github/workflows/update-prices.yml**: GitHub Actions workflow that runs the scraper daily
+3. **requirements.txt**: Python dependencies
+
+### Process Flow
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│  1. GitHub Actions triggers daily at 2 AM UTC               │
+└────────────────────┬────────────────────────────────────────┘
+
+
+┌─────────────────────────────────────────────────────────────┐
+│  2. Install Python dependencies (beautifulsoup4, requests)  │
+└────────────────────┬────────────────────────────────────────┘
+
+
+┌─────────────────────────────────────────────────────────────┐
+│  3. Run scraper.py                                          │
+│     • Find all 75 markdown files                            │
+│     • Extract PCPartPicker list URLs                        │
+│     • Scrape current prices from PCPartPicker               │
+│     • Update markdown tables with new prices                │
+└────────────────────┬────────────────────────────────────────┘
+
+
+┌─────────────────────────────────────────────────────────────┐
+│  4. Commit and push changes (if any prices updated)         │
+└─────────────────────────────────────────────────────────────┘
+```
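
Step 3 of the flow depends on pulling the PCPartPicker list URL out of each markdown file. The extraction code in `scraper.py` is not shown in this diff, so the following is only a sketch of one way to do it; the pattern `LIST_URL_RE` and the function `extract_list_urls` are illustrative names, and the assumption is that list URLs look like the `https://ca.pcpartpicker.com/list/8gGn9r` example used in this documentation:

```python
import re

# Assumed URL shape: optional two-letter regional subdomain, then /list/<id>.
LIST_URL_RE = re.compile(
    r"https://(?:[a-z]{2}\.)?pcpartpicker\.com/list/[A-Za-z0-9]+"
)


def extract_list_urls(markdown_text):
    """Return PCPartPicker list URLs in order of appearance, without duplicates."""
    urls = []
    for url in LIST_URL_RE.findall(markdown_text):
        if url not in urls:
            urls.append(url)
    return urls
```

Product links (`/product/...`) are deliberately not matched, since only the list page is needed to look up the whole build.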
+
+### Data Source
+
+The scraper uses **PCPartPicker** as the data source because:
+
+- PCPartPicker already aggregates prices from multiple retailers (Newegg, Amazon, Best Buy, etc.)
+- Each build in the repository already has a PCPartPicker list URL
+- PCPartPicker handles the complexity of tracking product availability across retailers
+- More reliable than scraping individual retailer sites directly
+
+### Security Features
+
+- **URL Validation**: Uses proper URL parsing with a whitelist of trusted retailer domains
+- **Error Handling**: Comprehensive try/except blocks prevent crashes
+- **Rate Limiting**: 2-second delay between requests to be respectful to servers
+- **No Secrets Required**: No API keys or credentials needed
+- **CodeQL Verified**: Passed CodeQL security scanning with no vulnerabilities reported
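
The URL validation bullet can be illustrated with the standard library's `urllib.parse`. This is a sketch under assumptions, not the code in `scraper.py`: the `trusted_retailers` name comes from the "Adding New Retailers" section of this document, but the entries below are an illustrative subset and `retailer_for_url` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Illustrative subset only; the real mapping lives in scraper.py.
trusted_retailers = {
    "www.amazon.ca": "Amazon Canada",
    "amazon.ca": "Amazon Canada",
    "www.newegg.ca": "Newegg Canada",
    "newegg.ca": "Newegg Canada",
}


def retailer_for_url(url):
    """Return the retailer name for a trusted URL, or None otherwise."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    # Compare the exact hostname. A substring check would accept
    # look-alike hosts such as "amazon.ca.evil.example".
    host = (parsed.hostname or "").lower()
    return trusted_retailers.get(host)
```

Matching on the parsed hostname rather than searching the raw string is what makes the whitelist meaningful: a trusted domain appearing in a path or query string does not count.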
+
+## Manual Usage
+
+### Prerequisites
+
+```bash
+# Python 3.12+ recommended
+python --version
+
+# Install dependencies
+pip install -r requirements.txt
+```
+
+### Running the Scraper
+
+```bash
+# Run from repository root
+python scraper.py
+```
+
+The script will:
+
+1. Find all markdown files in year directories (2018/, 2020/, 2021/, etc.)
+2. Extract PCPartPicker URLs from each file
+3. Scrape current prices
+4. Update markdown files with new prices
+5. Log progress and any errors
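
Step 1 of the list above, finding markdown files under year directories, might look like the sketch below. It assumes year directories are exactly four digits at the repository root; `find_build_files` is a hypothetical name and the real discovery logic in `scraper.py` may differ.

```python
import re
from pathlib import Path

# Assumption: year directories are named with exactly four digits.
YEAR_DIR_RE = re.compile(r"^\d{4}$")


def find_build_files(repo_root):
    """Return markdown files under four-digit year directories, sorted."""
    root = Path(repo_root)
    files = []
    for entry in root.iterdir():
        if entry.is_dir() and YEAR_DIR_RE.match(entry.name):
            # rglob descends into month subdirectories such as 2018/January/.
            files.extend(entry.rglob("*.md"))
    return sorted(files)
```

Restricting the search to year directories keeps top-level files such as `README.md` and `SCRAPER_README.md` out of the update pass.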
+
+### Output
+
+```
+2025-11-18 14:00:00 - INFO - Starting PC Parts Price Scraper
+2025-11-18 14:00:00 - INFO - Found 75 markdown files to process
+2025-11-18 14:00:00 - INFO - Processing: 2018/January/Budget.md
+2025-11-18 14:00:02 - INFO - Scraping prices from: https://ca.pcpartpicker.com/list/8gGn9r
+2025-11-18 14:00:04 - INFO - Scraped 8 prices from URL
+2025-11-18 14:00:04 - INFO - Updated 2018/January/Budget.md
+...
+2025-11-18 14:15:00 - INFO - Price scraping complete!
+2025-11-18 14:15:00 - INFO - Files updated: 42
+2025-11-18 14:15:00 - INFO - Files failed: 0
+2025-11-18 14:15:00 - INFO - Total files processed: 75
+```
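
Log lines in the `timestamp - LEVEL - message` shape shown above can be produced with Python's standard `logging` module. The configuration actually used by `scraper.py` is not part of this diff; the sketch below assumes a plain stream handler, and `make_logger` is a hypothetical helper.

```python
import logging


def make_logger(name="scraper"):
    """Build a logger whose output matches the sample format above."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # Avoid duplicate handlers on repeated calls.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter(
                fmt="%(asctime)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",  # No milliseconds, as in the sample.
            )
        )
        logger.addHandler(handler)
    return logger
```

For example, `make_logger().info("Starting PC Parts Price Scraper")` writes one line in that format to standard error.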
+
+## GitHub Actions Workflow
+
+### Automatic Execution
+
+The workflow runs automatically:
+
+- **Schedule**: Daily at 2:00 AM UTC
+- **Trigger**: Can also be manually triggered via GitHub Actions UI
+
+### Manual Triggering
+
+1. Go to the repository on GitHub
+2. Click "Actions" tab
+3. Select "Update PC Part Prices" workflow
+4. Click "Run workflow"
+5. Select branch and click "Run workflow" button
+
+### Workflow Configuration
+
+```yaml
+# .github/workflows/update-prices.yml
+schedule:
+  - cron: "0 2 * * *" # Daily at 2 AM UTC
+```
+
+To change the schedule, modify the cron expression:
+
+- `'0 */6 * * *'` - Every 6 hours
+- `'0 0 * * 1'` - Every Monday at midnight
+- `'0 12 * * *'` - Daily at noon
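
The expressions above use the standard five-field cron layout (minute, hour, day of month, month, day of week), and GitHub Actions interprets them in UTC. As a reading aid only, this small sketch labels the fields of an expression; `describe_cron` is a hypothetical helper and does not validate the values themselves.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))
```

Applied to `'0 0 * * 1'`, the `day_of_week` field is `1` (Monday) and both `minute` and `hour` are `0`, which is why that schedule fires at midnight on Mondays.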
+
+## Markdown Format
+
+The scraper expects markdown files with this table format:
+
+```markdown
+# January 2018 - Budget
+
+[PCPartPicker Part List](https://ca.pcpartpicker.com/list/8gGn9r)
+
+| Type | Item | Price | Print Price |
+| :--------- | :---------------------------------------------------------------- | :---------------------- | :---------- |
+| **CPU** | [AMD Ryzen 3 1200...](https://ca.pcpartpicker.com/product/...) | $276.90 @ Amazon Canada | $110.00 |
+| **Memory** | [Patriot Viper Elite...](https://ca.pcpartpicker.com/product/...) | - | $77.00 |
+```
+
+The scraper:
+
+- Extracts the PCPartPicker list URL from the header
+- Parses the table to find product names
+- Updates the "Price" column with current prices
+- Preserves the "Print Price" column (historical data)
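
Updating the "Price" column while leaving "Print Price" untouched comes down to rewriting one cell per row. The sketch below is an assumption about how that could be done, not the method in `scraper.py`: it relies on the four-column layout shown above, on data rows starting with a bold type such as `| **CPU**`, and on cells containing no literal `|`. The name `update_price_cell` is hypothetical.

```python
def update_price_cell(row, new_price):
    """Replace the Price cell (third column) of a four-column table row.

    Header, separator, and non-table lines are returned unchanged.
    """
    # Assumption: data rows begin with a bold component type.
    if not row.lstrip().startswith("| **"):
        return row
    # "| a | b | c | d |".split("|") -> ["", " a ", " b ", " c ", " d ", ""]
    cells = row.split("|")
    if len(cells) != 6:
        return row
    cells[3] = f" {new_price} "
    return "|".join(cells)
```

Because only index 3 is rewritten, the "Print Price" cell at index 4 passes through byte for byte, which is the property the last bullet above describes.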
+
+## Supported Retailers
+
+The scraper recognizes these retailers:
+
+- Amazon Canada (amazon.ca, amazon.com)
+- Newegg Canada (newegg.ca, newegg.com)
+- Best Buy Canada (bestbuy.ca, bestbuy.com)
+- Vuugo (vuugo.com)
+- Canada Computers (canadacomputers.com)
+
+## Troubleshooting
+
+### No Prices Found
+
+If the scraper reports "No prices scraped":
+
+1. Check that the PCPartPicker URL is valid
+2. Verify the PCPartPicker page loads in a browser
+3. Check GitHub Actions logs for detailed error messages
+
+### Prices Not Updating
+
+Common causes:
+
+1. Products are out of stock (shows as "-")
+2. PCPartPicker page structure changed (may need scraper update)
+3. Network issues during GitHub Actions run
+
+### GitHub Actions Failed
+
+1. Check the Actions tab for error logs
+2. Verify requirements.txt dependencies are compatible
+3. Check if PCPartPicker website is accessible
+
+## Maintenance
+
+### Updating Dependencies
+
+```bash
+# Check for outdated packages
+pip list --outdated
+
+# Update requirements.txt
+pip install --upgrade beautifulsoup4 requests lxml
+pip freeze > requirements.txt
+```
+
+### Adding New Retailers
+
+To add support for a new retailer:
+
+1. Edit `scraper.py`
+2. Add the domain to the `trusted_retailers` dictionary
+3. Test with a sample build
+4. Commit and push
+
+Example:
+
+```python
+trusted_retailers = {
+    # ... existing retailers ...
+    'www.memoryexpress.com': 'Memory Express',
+    'memoryexpress.com': 'Memory Express',
+}
+```
+
+## Limitations
+
+- **Internet Required**: Scraper needs internet access to reach PCPartPicker
+- **Rate Limiting**: 2-second delay between requests (takes ~3-5 minutes for all 75 files)
+- **PCPartPicker Dependency**: If PCPartPicker changes their HTML structure, scraper needs updates
+- **Canadian Prices**: Currently configured for Canadian pricing (ca.pcpartpicker.com)
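
The rate-limiting figure can be sanity-checked with simple arithmetic. Assuming one request per file, 75 files with a 2-second pause between consecutive requests spend about 148 seconds (roughly 2.5 minutes) waiting, which fits the 3-5 minute estimate once download time is added. The sketch below shows that calculation and one way to apply the delay; both function names are hypothetical and `fetch` stands in for whatever performs the request.

```python
import time

REQUEST_DELAY_SECONDS = 2  # Delay stated in the documentation.


def minimum_wait_seconds(request_count, delay=REQUEST_DELAY_SECONDS):
    """Total time spent sleeping between consecutive requests."""
    return max(request_count - 1, 0) * delay


def polite_fetch_all(urls, fetch, delay=REQUEST_DELAY_SECONDS, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index:  # No pause is needed before the first request.
            sleep(delay)
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter keeps the pause testable without actually waiting.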
+
+## Future Improvements
+
+Potential enhancements:
+
+- [ ] Support for US pricing (pcpartpicker.com)
+- [ ] Price history tracking
+- [ ] Email notifications when prices drop significantly
+- [ ] Support for more retailers
+- [ ] Parallel processing for faster execution
+- [ ] Website (gh-pages) automatic updates
+
+## Support
+
+For issues or questions:
+
+1. Check existing Issues on GitHub
+2. Review GitHub Actions logs for errors
+3. Open a new Issue with detailed information
+
+## License
+
+Same as repository license (see LICENSE.md)

requirements.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+beautifulsoup4==4.12.3
+requests==2.32.3
+lxml==5.3.0
