Merged
59 changes: 59 additions & 0 deletions .github/workflows/update-prices.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
name: Update PC Part Prices

on:
  schedule:
    # Run daily at 2 AM UTC (adjust as needed)
    - cron: '0 2 * * *'
  workflow_dispatch: # Allow manual triggering
    inputs:
      debug:
        description: 'Enable debug logging'
        required: false
        default: false
        type: boolean

jobs:
  update-prices:
    runs-on: ubuntu-latest

    permissions:
      contents: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v5
        with:
          fetch-depth: 0

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run price scraper
        run: |
          python scraper.py
        env:
          DEBUG: ${{ inputs.debug && '1' || '' }}

      - name: Check for changes
        id: git-check
        run: |
          git diff --exit-code || echo "changes=true" >> $GITHUB_OUTPUT

      - name: Commit and push changes
        if: steps.git-check.outputs.changes == 'true'
        run: |
          git config --local user.email "github-actions[bot]@users.noreply.github.com"
          git config --local user.name "github-actions[bot]"
          git add .
          git commit -m "chore: update PC part prices [automated]"
          git push
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
43 changes: 43 additions & 0 deletions .gitignore
@@ -0,0 +1,43 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
env/
venv/
ENV/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Testing
.pytest_cache/
.coverage
htmlcov/

# Logs
*.log
20 changes: 20 additions & 0 deletions README.md
@@ -50,6 +50,7 @@
- Available as PCPartPicker lists, Markdown files, or on a website.
- Markdown files and website show the original printed price.
- Current prices are available in United States Dollar or Canadian Dollar.
- **Automated daily price updates** - Prices are scraped from retailers (Newegg, Amazon, Best Buy) and updated automatically via GitHub Actions.
- Cross platform.

## Download
@@ -157,6 +158,25 @@ Each of the issues has its builds listed in three different places, with either
| October 2021 | AMD Turbo | [PCPartPicker](https://pcpartpicker.com/user/willtheornageguy/saved/VBjMFT) | [Markdown](/2021/October/AMD%20Turbo.md) | [Web](https://willtheorangeguy.github.io/Maximum-PC-Builds-Archive/2021/october/) |
| October 2021 | Intel Turbo | [PCPartPicker](https://pcpartpicker.com/user/willtheornageguy/saved/F4s7wP) | [Markdown](/2021/October/Intel%20Turbo.md) | [Web](https://willtheorangeguy.github.io/Maximum-PC-Builds-Archive/2021/october/) |

## Automated Price Updates

This repository includes an automated price scraper that runs daily to keep component prices up to date. The scraper:

- Runs automatically every day at 2 AM UTC via GitHub Actions
- Scrapes current prices from PCPartPicker (which aggregates prices from retailers like Newegg, Amazon, and Best Buy)
- Updates the markdown files with the latest pricing information
- Can be manually triggered using the "Update PC Part Prices" workflow in the Actions tab

The price scraper is implemented in Python and uses BeautifulSoup to parse PCPartPicker's build lists. If you want to run it manually:

```bash
# Install dependencies
pip install -r requirements.txt

# Run the scraper
python scraper.py
```

## Contributing

Please contribute using [GitHub Flow](https://guides.github.com/introduction/flow). Create a branch, add commits, and [open a pull request](https://github.com/willtheorangeguy/Maximum-PC-Builds-Archive/compare).
249 changes: 249 additions & 0 deletions SCRAPER_README.md
@@ -0,0 +1,249 @@
# Price Scraper Documentation

## Overview

This repository includes an automated price scraper that updates PC component prices daily. The scraper runs via GitHub Actions and updates all markdown files with current pricing from major retailers.

## How It Works

### Architecture

1. **scraper.py**: Python script that performs the actual scraping
2. **.github/workflows/update-prices.yml**: GitHub Actions workflow that runs the scraper daily
3. **requirements.txt**: Python dependencies

### Process Flow

```
┌─────────────────────────────────────────────────────────────┐
│ 1. GitHub Actions triggers daily at 2 AM UTC                │
└────────────────────┬────────────────────────────────────────┘
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. Install Python dependencies (beautifulsoup4, requests)   │
└────────────────────┬────────────────────────────────────────┘
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. Run scraper.py                                           │
│    • Find all 75 markdown files                             │
│    • Extract PCPartPicker list URLs                         │
│    • Scrape current prices from PCPartPicker                │
│    • Update markdown tables with new prices                 │
└────────────────────┬────────────────────────────────────────┘
                     ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. Commit and push changes (if any prices updated)          │
└─────────────────────────────────────────────────────────────┘
```

### Data Source

The scraper uses **PCPartPicker** as the data source because:

- PCPartPicker already aggregates prices from multiple retailers (Newegg, Amazon, Best Buy, etc.)
- Each build in the repository already has a PCPartPicker list URL
- PCPartPicker handles the complexity of tracking product availability across retailers
- More reliable than scraping individual retailer sites directly

### Security Features

- **URL Validation**: Uses proper URL parsing with a whitelist of trusted retailer domains
- **Error Handling**: Comprehensive try/except blocks prevent crashes
- **Rate Limiting**: 2-second delay between requests to be respectful to servers
- **No Secrets Required**: No API keys or credentials needed
- **CodeQL Verified**: Passed security scanning with no vulnerabilities
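
The URL-validation idea can be sketched as follows. Note that the `TRUSTED_RETAILERS` set and `is_trusted` helper here are hypothetical illustrations; the actual whitelist and validation logic live in `scraper.py`:

```python
from urllib.parse import urlparse

# Hypothetical whitelist mirroring the trusted-retailer idea described above.
TRUSTED_RETAILERS = {
    "ca.pcpartpicker.com",
    "pcpartpicker.com",
    "www.amazon.ca",
    "www.newegg.ca",
    "www.bestbuy.ca",
}

def is_trusted(url: str) -> bool:
    """Return True only if the URL's hostname exactly matches a whitelisted domain.

    Parsing with urlparse (rather than substring matching) prevents bypasses
    like https://evil.example.com/ca.pcpartpicker.com.
    """
    host = urlparse(url).hostname or ""
    return host.lower() in TRUSTED_RETAILERS

print(is_trusted("https://ca.pcpartpicker.com/list/8gGn9r"))       # True
print(is_trusted("https://evil.example.com/ca.pcpartpicker.com"))  # False
```

Comparing the parsed hostname against an exact-match set is what distinguishes this from naive substring checks, which CodeQL commonly flags.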

## Manual Usage

### Prerequisites

```bash
# Python 3.12+ recommended
python --version

# Install dependencies
pip install -r requirements.txt
```

### Running the Scraper

```bash
# Run from repository root
python scraper.py
```

The script will:

1. Find all markdown files in year directories (2018/, 2020/, 2021/, etc.)
2. Extract PCPartPicker URLs from each file
3. Scrape current prices
4. Update markdown files with new prices
5. Log progress and any errors
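
Step 2 above, extracting the PCPartPicker list URL from each build file, can be sketched with a small regex helper (the names here are hypothetical; the real extraction logic is in `scraper.py`):

```python
import re

# Matches the "[PCPartPicker Part List](...)" link at the top of each build file,
# allowing an optional regional subdomain such as "ca.".
LIST_URL_RE = re.compile(
    r"\[PCPartPicker Part List\]\((https://(?:[a-z]{2}\.)?pcpartpicker\.com/list/\w+)\)"
)

def extract_list_url(markdown_text: str):
    """Return the first PCPartPicker list URL found in a markdown file, or None."""
    match = LIST_URL_RE.search(markdown_text)
    return match.group(1) if match else None

sample = (
    "# January 2018 - Budget\n\n"
    "[PCPartPicker Part List](https://ca.pcpartpicker.com/list/8gGn9r)\n"
)
print(extract_list_url(sample))  # https://ca.pcpartpicker.com/list/8gGn9r
```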

### Output

```
2025-11-18 14:00:00 - INFO - Starting PC Parts Price Scraper
2025-11-18 14:00:00 - INFO - Found 75 markdown files to process
2025-11-18 14:00:00 - INFO - Processing: 2018/January/Budget.md
2025-11-18 14:00:02 - INFO - Scraping prices from: https://ca.pcpartpicker.com/list/8gGn9r
2025-11-18 14:00:04 - INFO - Scraped 8 prices from URL
2025-11-18 14:00:04 - INFO - Updated 2018/January/Budget.md
...
2025-11-18 14:15:00 - INFO - Price scraping complete!
2025-11-18 14:15:00 - INFO - Files updated: 42
2025-11-18 14:15:00 - INFO - Files failed: 0
2025-11-18 14:15:00 - INFO - Total files processed: 75
```

## GitHub Actions Workflow

### Automatic Execution

The workflow runs automatically:

- **Schedule**: Daily at 2:00 AM UTC
- **Trigger**: Can also be manually triggered via GitHub Actions UI

### Manual Triggering

1. Go to the repository on GitHub
2. Click "Actions" tab
3. Select "Update PC Part Prices" workflow
4. Click "Run workflow"
5. Select branch and click "Run workflow" button

### Workflow Configuration

```yaml
# .github/workflows/update-prices.yml
schedule:
- cron: "0 2 * * *" # Daily at 2 AM UTC
```

To change the schedule, modify the cron expression:

- `'0 */6 * * *'` - Every 6 hours
- `'0 0 * * 1'` - Every Monday at midnight
- `'0 12 * * *'` - Daily at noon

## Markdown Format

The scraper expects markdown files with this table format:

```markdown
# January 2018 - Budget

[PCPartPicker Part List](https://ca.pcpartpicker.com/list/8gGn9r)

| Type | Item | Price | Print Price |
| :--------- | :---------------------------------------------------------------- | :---------------------- | :---------- |
| **CPU** | [AMD Ryzen 3 1200...](https://ca.pcpartpicker.com/product/...) | $276.90 @ Amazon Canada | $110.00 |
| **Memory** | [Patriot Viper Elite...](https://ca.pcpartpicker.com/product/...) | - | $77.00 |
```

The scraper:

- Extracts the PCPartPicker list URL from the header
- Parses the table to find product names
- Updates the "Price" column with current prices
- Preserves the "Print Price" column (historical data)
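
Updating the Price cell while leaving the Print Price column untouched can be sketched like this (a simplified, hypothetical version of the table-rewriting step in `scraper.py`):

```python
def update_price(row: str, new_price: str) -> str:
    """Replace the Price cell (3rd column) of a markdown table row.

    Splitting on '|' yields: ['', Type, Item, Price, Print Price, ''],
    so index 3 is the Price column; Type, Item, and Print Price are preserved.
    Assumes cell contents contain no literal '|' characters.
    """
    cells = row.split("|")
    if len(cells) >= 6:
        cells[3] = f" {new_price} "
    return "|".join(cells)

row = (
    "| **CPU** | [AMD Ryzen 3 1200](https://ca.pcpartpicker.com/product/x) "
    "| $276.90 @ Amazon Canada | $110.00 |"
)
print(update_price(row, "$199.99 @ Newegg Canada"))
```

The historical Print Price cell passes through unchanged, which is how the archive keeps the original printed price alongside the live one.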

## Supported Retailers

The scraper recognizes these retailers:

- Amazon Canada (amazon.ca, amazon.com)
- Newegg Canada (newegg.ca, newegg.com)
- Best Buy Canada (bestbuy.ca, bestbuy.com)
- Vuugo (vuugo.com)
- Canada Computers (canadacomputers.com)

## Troubleshooting

### No Prices Found

If the scraper reports "No prices scraped":

1. Check that the PCPartPicker URL is valid
2. Verify the PCPartPicker page loads in a browser
3. Check GitHub Actions logs for detailed error messages

### Prices Not Updating

Common causes:

1. Products are out of stock (shows as "-")
2. PCPartPicker page structure changed (may need scraper update)
3. Network issues during GitHub Actions run

### GitHub Actions Failed

1. Check the Actions tab for error logs
2. Verify requirements.txt dependencies are compatible
3. Check if PCPartPicker website is accessible

## Maintenance

### Updating Dependencies

```bash
# Check for outdated packages
pip list --outdated

# Update requirements.txt
pip install --upgrade beautifulsoup4 requests lxml
pip freeze > requirements.txt
```

### Adding New Retailers

To add support for a new retailer:

1. Edit `scraper.py`
2. Add the domain to the `trusted_retailers` dictionary
3. Test with a sample build
4. Commit and push

Example:

```python
trusted_retailers = {
# ... existing retailers ...
'www.memoryexpress.com': 'Memory Express',
'memoryexpress.com': 'Memory Express',
}
```

## Limitations

- **Internet Required**: Scraper needs internet access to reach PCPartPicker
- **Rate Limiting**: 2-second delay between requests (takes ~3-5 minutes for all 75 files)
- **PCPartPicker Dependency**: If PCPartPicker changes their HTML structure, scraper needs updates
- **Canadian Prices**: Currently configured for Canadian pricing (ca.pcpartpicker.com)

## Future Improvements

Potential enhancements:

- [ ] Support for US pricing (pcpartpicker.com)
- [ ] Price history tracking
- [ ] Email notifications when prices drop significantly
- [ ] Support for more retailers
- [ ] Parallel processing for faster execution
- [ ] Website (gh-pages) automatic updates

## Support

For issues or questions:

1. Check existing Issues on GitHub
2. Review GitHub Actions logs for errors
3. Open a new Issue with detailed information

## License

Same as the repository license (see LICENSE.md).
3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
beautifulsoup4==4.12.3
requests==2.32.3
lxml==5.3.0