Add automated daily price scraper for PC component pricing #48
Co-authored-by: willtheorangeguy <[email protected]>
@copilot I get the following log output, which makes it look like the scraper is not working correctly:
2025-11-18 17:50:04,771 - INFO - Scraping prices from: https://ca.pcpartpicker.com/list/4n3mNc
…s suffixes Co-authored-by: willtheorangeguy <[email protected]>
Fixed in commit 0035b97. PCPartPicker updated their HTML structure to use versioned class names, and the scraper now handles them.
Tested with the actual PCPartPicker HTML structure you provided: the scraper now successfully parses products, prices, and retailers.
Describe the Pull Request
Implements automated price scraping from PCPartPicker (aggregates Newegg, Amazon, Best Buy) to keep all 75 PC build markdown files current. Runs daily via GitHub Actions at 2 AM UTC.
Implementation
scraper.py (370 lines): Parses markdown files, extracts PCPartPicker URLs, scrapes prices, updates tables with current pricing and retailer info. Includes rate limiting (2s), error handling, whitelisted domain validation, and flexible HTML parsing to handle PCPartPicker's versioned class names (e.g., td__component-2025). Supports debug mode via the DEBUG environment variable.
.github/workflows/update-prices.yml: Scheduled workflow with manual trigger option and debug mode checkbox. Auto-commits price changes using GitHub Actions bot.
requirements.txt: beautifulsoup4, requests, lxml (all verified secure, zero vulnerabilities)
SCRAPER_README.md: Architecture, usage, troubleshooting, maintenance guide
.gitignore: Python artifacts excluded
README.md: Feature documentation added
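The "extracts PCPartPicker URLs" step listed for scraper.py could look roughly like the sketch below. The function name `extract_list_urls` and the regular expression are illustrative assumptions, not the code from this PR. The only detail taken from the thread is the URL shape seen in the log output (https://ca.pcpartpicker.com/list/4n3mNc).

```python
import re

# Assumed URL shape: optional two-letter regional subdomain, then /list/<id>.
# Based only on the URL seen in the log output; the real scraper may differ.
LIST_URL_RE = re.compile(
    r"https://(?:[a-z]{2}\.)?pcpartpicker\.com/list/[A-Za-z0-9]+"
)


def extract_list_urls(markdown_text: str) -> list[str]:
    """Return unique PCPartPicker list URLs in order of first appearance."""
    seen = set()
    urls = []
    for match in LIST_URL_RE.finditer(markdown_text):
        url = match.group(0)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical markdown snippet that links the same list twice.
sample = (
    "[PCPartPicker list](https://ca.pcpartpicker.com/list/4n3mNc)\n"
    "Same list again: https://ca.pcpartpicker.com/list/4n3mNc\n"
)
print(extract_list_urls(sample))
```

Deduplicating before scraping keeps the number of requests down when a build file links the same list more than once.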
Technical Approach
Uses PCPartPicker as data source rather than scraping individual retailers directly - more reliable and maintainable since PCPartPicker already aggregates multi-retailer pricing. Updates prices in-place preserving historical "Print Price" columns.
The scraper uses flexible class name matching to handle PCPartPicker's evolving HTML structure, supporting both legacy class names and versioned suffixes (e.g., -2025). Implements 4 fallback strategies to locate the parts table, ensuring robustness against HTML changes.
Expected Behavior
PC component prices across all markdown files update automatically daily. Manual workflow trigger available from Actions tab with optional debug mode. Changes auto-committed when prices update. The scraper correctly handles PCPartPicker's current HTML structure with versioned class names and extracts prices from the new link format.
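The flexible class name matching described under Technical Approach can be sketched with a small predicate. This is a minimal illustration, assuming the versioned suffix is always a hyphen followed by digits (as in td__component-2025). The helper name `has_versioned_class` is hypothetical and is not taken from scraper.py.

```python
import re


def has_versioned_class(class_attr: str, base: str) -> bool:
    """Return True if any class token equals `base` or `base` plus a
    numeric suffix such as `-2025` (assumed suffix format)."""
    pattern = re.compile(rf"{re.escape(base)}(?:-\d+)?")
    return any(pattern.fullmatch(token) for token in class_attr.split())


# Legacy and versioned names both match; a different class name does not.
print(has_versioned_class("td__component", "td__component"))
print(has_versioned_class("td__component-2025", "td__component"))
print(has_versioned_class("td__components", "td__component"))
```

Matching whole tokens with `fullmatch` avoids accidentally accepting unrelated classes that merely start with the same prefix.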
Fixes the Following Issues
Addresses the requirement for automated daily price updates from major retailers (Newegg, Best Buy, Amazon) with GitHub Actions integration. Resolves HTML parsing issues caused by PCPartPicker's updated structure with versioned class suffixes.
Screenshots
N/A - Backend automation only
Additional Context
Security: CodeQL verified, proper URL parsing with domain whitelist. All dependencies scanned clean. Rate-limited requests prevent server abuse.
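As an illustration of URL parsing with a domain whitelist, the sketch below parses the URL and compares the hostname instead of doing a substring check. The names `ALLOWED_DOMAINS` and `is_allowed_url`, the whitelist contents, and the HTTPS-only rule are assumptions for this example, not the PR's actual code.

```python
from urllib.parse import urlparse

# Assumed whitelist; the real scraper's list is not shown in this PR text.
ALLOWED_DOMAINS = {"pcpartpicker.com"}


def is_allowed_url(url: str) -> bool:
    """Accept only HTTPS URLs whose host is a whitelisted domain
    or a subdomain of one (e.g. ca.pcpartpicker.com)."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return False
    host = (parsed.hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in ALLOWED_DOMAINS
    )


print(is_allowed_url("https://ca.pcpartpicker.com/list/4n3mNc"))
print(is_allowed_url("https://pcpartpicker.com.evil.example/list/x"))
```

Comparing the parsed hostname rejects look-alike hosts that merely contain the whitelisted domain as a substring.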
The scraper has been updated to handle PCPartPicker's new HTML structure (2025 format), which uses versioned class names. It now handles <a class="pp_async_mr"> links and the td__where column with image alt text.
Testing confirms successful parsing of the new HTML format while maintaining compatibility with legacy structures.