| 1 | +# Price Scraper Documentation |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This repository includes an automated price scraper that updates PC component prices daily. The scraper runs via GitHub Actions and updates all markdown files with current pricing from major retailers. |
| 6 | + |
| 7 | +## How It Works |
| 8 | + |
| 9 | +### Architecture |
| 10 | + |
| 11 | +1. **scraper.py**: Python script that performs the actual scraping |
| 12 | +2. **.github/workflows/update-prices.yml**: GitHub Actions workflow that runs the scraper daily |
| 13 | +3. **requirements.txt**: Python dependencies |
| 14 | + |
| 15 | +### Process Flow |
| 16 | + |
| 17 | +``` |
| 18 | +┌─────────────────────────────────────────────────────────────┐ |
| 19 | +│  1. GitHub Actions triggers daily at 2 AM UTC               │ |
| 20 | +└────────────────────┬────────────────────────────────────────┘ |
| 21 | +                     │ |
| 22 | +                     ▼ |
| 23 | +┌─────────────────────────────────────────────────────────────┐ |
| 24 | +│  2. Install Python dependencies (beautifulsoup4, requests)  │ |
| 25 | +└────────────────────┬────────────────────────────────────────┘ |
| 26 | +                     │ |
| 27 | +                     ▼ |
| 28 | +┌─────────────────────────────────────────────────────────────┐ |
| 29 | +│  3. Run scraper.py                                          │ |
| 30 | +│     • Find all 75 markdown files                            │ |
| 31 | +│     • Extract PCPartPicker list URLs                        │ |
| 32 | +│     • Scrape current prices from PCPartPicker               │ |
| 33 | +│     • Update markdown tables with new prices                │ |
| 34 | +└────────────────────┬────────────────────────────────────────┘ |
| 35 | +                     │ |
| 36 | +                     ▼ |
| 37 | +┌─────────────────────────────────────────────────────────────┐ |
| 38 | +│  4. Commit and push changes (if any prices updated)         │ |
| 39 | +└─────────────────────────────────────────────────────────────┘ |
| 40 | +``` |
| 41 | + |
| 42 | +### Data Source |
| 43 | + |
| 44 | +The scraper uses **PCPartPicker** as the data source because: |
| 45 | +- PCPartPicker already aggregates prices from multiple retailers (Newegg, Amazon, Best Buy, etc.) |
| 46 | +- Each build in the repository already has a PCPartPicker list URL |
| 47 | +- PCPartPicker handles the complexity of tracking product availability across retailers |
| 48 | +- More reliable than scraping individual retailer sites directly |
| 49 | + |
| 50 | +### Security Features |
| 51 | + |
| 52 | +- **URL Validation**: Uses proper URL parsing with a whitelist of trusted retailer domains |
| 53 | +- **Error Handling**: `try`/`except` blocks around network and parsing steps keep a single failure from crashing the run |
| 54 | +- **Rate Limiting**: 2-second delay between requests to be respectful to servers |
| 55 | +- **No Secrets Required**: No API keys or credentials needed |
| 56 | +- **CodeQL Verified**: Passed CodeQL security scanning with no vulnerabilities reported |
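
The URL validation described above could look roughly like the following. This is a minimal sketch, not the code in `scraper.py`: the helper name `retailer_for_url` is hypothetical, and the dictionary entries are an assumed subset based on the retailers named in this document.

```python
from urllib.parse import urlparse

# Assumed subset of the allowlist; the real dictionary lives in scraper.py.
trusted_retailers = {
    "amazon.ca": "Amazon Canada",
    "www.amazon.ca": "Amazon Canada",
    "newegg.ca": "Newegg Canada",
    "www.newegg.ca": "Newegg Canada",
}


def retailer_for_url(url):
    """Return the retailer name for a trusted URL, or None otherwise."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    # Compare the exact hostname: a substring check would also accept
    # hosts such as "amazon.ca.evil.example".
    hostname = (parsed.hostname or "").lower()
    return trusted_retailers.get(hostname)
```

Matching on the parsed hostname rather than searching the raw string is what makes the check resistant to look-alike domains.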
| 57 | + |
| 58 | +## Manual Usage |
| 59 | + |
| 60 | +### Prerequisites |
| 61 | + |
| 62 | +```bash |
| 63 | +# Python 3.12+ recommended |
| 64 | +python --version |
| 65 | + |
| 66 | +# Install dependencies |
| 67 | +pip install -r requirements.txt |
| 68 | +``` |
| 69 | + |
| 70 | +### Running the Scraper |
| 71 | + |
| 72 | +```bash |
| 73 | +# Run from repository root |
| 74 | +python scraper.py |
| 75 | +``` |
| 76 | + |
| 77 | +The script will: |
| 78 | +1. Find all markdown files in year directories (2018/, 2020/, 2021/, etc.) |
| 79 | +2. Extract PCPartPicker URLs from each file |
| 80 | +3. Scrape current prices |
| 81 | +4. Update markdown files with new prices |
| 82 | +5. Log progress and any errors |
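
Steps 1 and 2 above can be sketched as follows. The function names and the regular expression are illustrative assumptions, not taken from `scraper.py`; the sketch assumes year directories have four-digit names and that each file links to its list as `https://<region>.pcpartpicker.com/list/<id>`.

```python
import re
from pathlib import Path

# Matches list URLs such as https://ca.pcpartpicker.com/list/8gGn9r
LIST_URL_RE = re.compile(r"https://(?:[a-z]{2}\.)?pcpartpicker\.com/list/\w+")


def find_build_files(root):
    """Return markdown files under four-digit year directories, sorted."""
    files = []
    for entry in Path(root).iterdir():
        if entry.is_dir() and re.fullmatch(r"\d{4}", entry.name):
            files.extend(entry.rglob("*.md"))
    return sorted(files)


def extract_list_url(markdown_text):
    """Return the first PCPartPicker list URL in the text, or None."""
    match = LIST_URL_RE.search(markdown_text)
    return match.group(0) if match else None
```

Restricting discovery to year directories keeps documentation files elsewhere in the repository out of the scrape.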
| 83 | + |
| 84 | +### Output |
| 85 | + |
| 86 | +``` |
| 87 | +2025-11-18 14:00:00 - INFO - Starting PC Parts Price Scraper |
| 88 | +2025-11-18 14:00:00 - INFO - Found 75 markdown files to process |
| 89 | +2025-11-18 14:00:00 - INFO - Processing: 2018/January/Budget.md |
| 90 | +2025-11-18 14:00:02 - INFO - Scraping prices from: https://ca.pcpartpicker.com/list/8gGn9r |
| 91 | +2025-11-18 14:00:04 - INFO - Scraped 8 prices from URL |
| 92 | +2025-11-18 14:00:04 - INFO - Updated 2018/January/Budget.md |
| 93 | +... |
| 94 | +2025-11-18 14:15:00 - INFO - Price scraping complete! |
| 95 | +2025-11-18 14:15:00 - INFO - Files updated: 42 |
| 96 | +2025-11-18 14:15:00 - INFO - Files failed: 0 |
| 97 | +2025-11-18 14:15:00 - INFO - Total files processed: 75 |
| 98 | +``` |
| 99 | + |
| 100 | +## GitHub Actions Workflow |
| 101 | + |
| 102 | +### Automatic Execution |
| 103 | + |
| 104 | +The workflow runs automatically: |
| 105 | +- **Schedule**: Daily at 2:00 AM UTC |
| 106 | +- **Manual trigger**: Can also be started from the GitHub Actions UI |
| 107 | + |
| 108 | +### Manual Triggering |
| 109 | + |
| 110 | +1. Go to the repository on GitHub |
| 111 | +2. Click "Actions" tab |
| 112 | +3. Select "Update PC Part Prices" workflow |
| 113 | +4. Click "Run workflow" |
| 114 | +5. Select branch and click "Run workflow" button |
| 115 | + |
| 116 | +### Workflow Configuration |
| 117 | + |
| 118 | +```yaml |
| 119 | +# .github/workflows/update-prices.yml (excerpt; `schedule` sits under the top-level `on:` key) |
| 120 | +schedule: |
| 121 | +  - cron: '0 2 * * *'  # Daily at 2 AM UTC |
| 122 | +``` |
| 123 | + |
| 124 | +To change the schedule, modify the cron expression (GitHub Actions evaluates cron schedules in UTC): |
| 125 | +- `'0 */6 * * *'` - Every 6 hours |
| 126 | +- `'0 0 * * 1'` - Every Monday at midnight |
| 127 | +- `'0 12 * * *'` - Daily at noon |
| 128 | + |
| 129 | +## Markdown Format |
| 130 | + |
| 131 | +The scraper expects markdown files with this table format: |
| 132 | + |
| 133 | +```markdown |
| 134 | +# January 2018 - Budget |
| 135 | + |
| 136 | +[PCPartPicker Part List](https://ca.pcpartpicker.com/list/8gGn9r) |
| 137 | + |
| 138 | +| Type | Item | Price | Print Price | |
| 139 | +| :--- | :--- | :--- | :--- | |
| 140 | +| **CPU** | [AMD Ryzen 3 1200...](https://ca.pcpartpicker.com/product/...) | $276.90 @ Amazon Canada | $110.00 | |
| 141 | +| **Memory** | [Patriot Viper Elite...](https://ca.pcpartpicker.com/product/...) | - | $77.00 | |
| 142 | +``` |
| 143 | + |
| 144 | +The scraper: |
| 145 | +- Extracts the PCPartPicker list URL from the link below the title |
| 146 | +- Parses the table to find product names |
| 147 | +- Updates the "Price" column with current prices |
| 148 | +- Preserves the "Print Price" column (historical data) |
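
As an illustration of the last two points, a single table row can be rewritten so that only the "Price" cell changes. This is a hedged sketch: `update_price_row` and the `prices` mapping (product name to new price text) are hypothetical names, and the approach assumes the four-column layout shown above with no literal `|` inside a cell.

```python
import re

# Captures the link text of a markdown link such as [name](url).
ITEM_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")


def update_price_row(row, prices):
    """Return the row with its Price cell replaced when a new price is known."""
    cells = row.split("|")
    # A four-column row splits into six pieces: two empty edges plus four cells.
    if len(cells) != 6:
        return row
    item = ITEM_RE.search(cells[2])
    if item is None or item.group(1) not in prices:
        return row  # header, separator, or unknown product: leave untouched
    cells[3] = f" {prices[item.group(1)]} "
    return "|".join(cells)
```

Because only the third data cell is reassigned, the "Print Price" cell passes through unchanged, as do rows whose product is missing from `prices`.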
| 149 | + |
| 150 | +## Supported Retailers |
| 151 | + |
| 152 | +The scraper recognizes these retailers: |
| 153 | +- Amazon Canada (amazon.ca, amazon.com) |
| 154 | +- Newegg Canada (newegg.ca, newegg.com) |
| 155 | +- Best Buy Canada (bestbuy.ca, bestbuy.com) |
| 156 | +- Vuugo (vuugo.com) |
| 157 | +- Canada Computers (canadacomputers.com) |
| 158 | + |
| 159 | +## Troubleshooting |
| 160 | + |
| 161 | +### No Prices Found |
| 162 | + |
| 163 | +If the scraper reports "No prices scraped": |
| 164 | +1. Check that the PCPartPicker URL is valid |
| 165 | +2. Verify the PCPartPicker page loads in a browser |
| 166 | +3. Check GitHub Actions logs for detailed error messages |
| 167 | + |
| 168 | +### Prices Not Updating |
| 169 | + |
| 170 | +Common causes: |
| 171 | +1. Products are out of stock (shows as "-") |
| 172 | +2. PCPartPicker page structure changed (may need scraper update) |
| 173 | +3. Network issues during GitHub Actions run |
| 174 | + |
| 175 | +### GitHub Actions Failed |
| 176 | + |
| 177 | +1. Check the Actions tab for error logs |
| 178 | +2. Verify requirements.txt dependencies are compatible |
| 179 | +3. Check if PCPartPicker website is accessible |
| 180 | + |
| 181 | +## Maintenance |
| 182 | + |
| 183 | +### Updating Dependencies |
| 184 | + |
| 185 | +```bash |
| 186 | +# Check for outdated packages |
| 187 | +pip list --outdated |
| 188 | + |
| 189 | +# Upgrade, then regenerate requirements.txt (use a clean virtualenv: pip freeze lists every installed package) |
| 190 | +pip install --upgrade beautifulsoup4 requests lxml |
| 191 | +pip freeze > requirements.txt |
| 192 | +``` |
| 193 | + |
| 194 | +### Adding New Retailers |
| 195 | + |
| 196 | +To add support for a new retailer: |
| 197 | + |
| 198 | +1. Edit `scraper.py` |
| 199 | +2. Add the domain to the `trusted_retailers` dictionary |
| 200 | +3. Test with a sample build |
| 201 | +4. Commit and push |
| 202 | + |
| 203 | +Example: |
| 204 | +```python |
| 205 | +trusted_retailers = { |
| 206 | +    # ... existing retailers ... |
| 207 | +    'www.memoryexpress.com': 'Memory Express', |
| 208 | +    'memoryexpress.com': 'Memory Express', |
| 209 | +} |
| 210 | +``` |
| 211 | + |
| 212 | +## Limitations |
| 213 | + |
| 214 | +- **Internet Required**: Scraper needs internet access to reach PCPartPicker |
| 215 | +- **Rate Limiting**: 2-second delay between requests (takes ~3-5 minutes for all 75 files) |
| 216 | +- **PCPartPicker Dependency**: If PCPartPicker changes its HTML structure, the scraper must be updated |
| 217 | +- **Canadian Prices**: Currently configured for Canadian pricing (ca.pcpartpicker.com) |
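
To make the rate-limiting figure concrete, here is a small sketch of a throttled loop, assuming one request per file. The names `throttled` and `minimum_delay_seconds` are hypothetical and not part of `scraper.py`. With 75 requests and a 2-second delay there are 74 pauses, so at least 148 seconds (about 2.5 minutes) are spent waiting before any network time is counted.

```python
import time


def throttled(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one by one, pausing delay_seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item


def minimum_delay_seconds(request_count, delay_seconds=2.0):
    """Total time spent sleeping for request_count throttled requests."""
    return max(request_count - 1, 0) * delay_seconds
```

Passing `sleep` as a parameter is only a convenience for testing the loop without actually waiting.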
| 218 | + |
| 219 | +## Future Improvements |
| 220 | + |
| 221 | +Potential enhancements: |
| 222 | +- [ ] Support for US pricing (pcpartpicker.com) |
| 223 | +- [ ] Price history tracking |
| 224 | +- [ ] Email notifications when prices drop significantly |
| 225 | +- [ ] Support for more retailers |
| 226 | +- [ ] Parallel processing for faster execution |
| 227 | +- [ ] Website (gh-pages) automatic updates |
| 228 | + |
| 229 | +## Support |
| 230 | + |
| 231 | +For issues or questions: |
| 232 | +1. Check existing Issues on GitHub |
| 233 | +2. Review GitHub Actions logs for errors |
| 234 | +3. Open a new Issue with detailed information |
| 235 | + |
| 236 | +## License |
| 237 | + |
| 238 | +Same as repository license (see LICENSE.md) |