fix: being blocked by Cloudflare Turnstile #150
Conversation
Walkthrough
The pull request introduces significant updates to the web scraping mechanism in the project. The primary change is a transition from plain HTTP requests to a Puppeteer-driven headless browser with the stealth plugin for fetching forum pages, which is what allows the scraper to get past the Cloudflare Turnstile check.
Changes
Sequence Diagram
sequenceDiagram
participant Scraper
participant Puppeteer
participant Browser
participant WebPage
Scraper->>Puppeteer: Launch browser
Puppeteer->>Browser: Create headless instance
Scraper->>Browser: Navigate to URL
Browser->>WebPage: Load page
WebPage-->>Browser: Page loaded
Browser-->>Scraper: Return HTML content
Scraper->>Puppeteer: Close browser
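For orientation, here is a minimal sketch of that flow in code, assuming puppeteer-extra with the stealth plugin as introduced by this PR; `fetchHtml` is a placeholder name, and the actual launch arguments and error handling in build/scraper.js differ (see the diffs below):

```js
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Patch the headless browser so common automation fingerprints are hidden.
puppeteer.use(StealthPlugin());

// Fetch a page's HTML through a headless browser instead of a plain HTTP request.
const fetchHtml = async (url) => {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content(); // serialized HTML, ready for cheerio's load()
  } finally {
    await browser.close();
  }
};
```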
Actionable comments posted: 3
🧹 Nitpick comments (2)
data/patchlogs.json (2)
6-6: Empty "additions" fields. All entries have empty "additions" fields. Consider removing this field if it's not being used, or document why it's being kept empty.

```diff
 {
   "name": "Warframe: 1999: Hotfix 38.0.7",
   "url": "https://forums.warframe.com/topic/1434747-warframe-1999-hotfix-3807/",
   "date": "2025-01-14T20:03:08Z",
-  "additions": "",
   "changes": "...",
   "fixes": "...",
   "type": "Hotfix"
 }
```

Also applies to: 15-15, 24-24, 33-33
7-8: Consider splitting long text fields. The "changes" and "fixes" fields contain very long text blocks with multiple items. Consider structuring these as arrays for better readability and easier parsing.
Example restructuring:
{ "name": "Warframe: 1999: Hotfix 38.0.7", "changes": [ "The Overpower Hex Override is now capped at +150% Ability Strength and -75% Efficiency.", "Reduced the distance at which Legacytes can spawn in Legacyte Harvest missions." ], "fixes": [ "Fixed Caliban's Conculysts ending their whirlwind attack early for Clients.", "Fixed the bonus damage duration from Hydroid's Plunder not being reset when recast." ] }Also applies to: 16-17, 25-26, 34-35
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`package-lock.json` is excluded by `!**/package-lock.json`
📒 Files selected for processing (3)
- build/scraper.js (2 hunks)
- data/patchlogs.json (1 hunks)
- package.json (2 hunks)
🧰 Additional context used
🪛 GitHub Check: Lint
build/scraper.js
[failure] 7-7:
There should be at least one empty line between import groups
[failure] 8-8:
puppeteer-extra-plugin-stealth import should occur before import of ../data/patchlogs.json
[failure] 9-9:
puppeteer-extra import should occur before import of ../data/patchlogs.json
🪛 eslint
build/scraper.js
[error] 7-7: There should be at least one empty line between import groups
(import/order)
[error] 8-8: puppeteer-extra-plugin-stealth import should occur before import of ../data/patchlogs.json
(import/order)
[error] 9-9: puppeteer-extra import should occur before import of ../data/patchlogs.json
(import/order)
🪛 GitHub Actions: Pull Request Checks
build/scraper.js
[error] 7-7: There should be at least one empty line between import groups (import/order)
🔇 Additional comments (4)
data/patchlogs.json (3)
5-5: Verify dates in patch logs. The patch log entries contain future dates (December 2024 and January 2025). This might indicate:
- Incorrect date entries
- System clock issues during data collection
- Test data that wasn't meant for production
Run this script to check the dates of recent entries and their chronological order:
Also applies to: 14-14, 23-23, 32-32
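The verification script itself is collapsed in this view; a rough Node.js equivalent of the intended check (parseable dates, nothing in the future, chronological order) might look like the following, assuming entries are stored oldest-first:

```js
import { readFile } from 'node:fs/promises';

const logs = JSON.parse(await readFile('data/patchlogs.json', 'utf8'));
const now = new Date();

logs.forEach((entry, i) => {
  const date = new Date(entry.date);
  if (Number.isNaN(date.getTime())) console.log(`Unparseable date: ${entry.name} (${entry.date})`);
  if (date > now) console.log(`Future date: ${entry.name} (${entry.date})`);
  // Flag entries that break the assumed oldest-first ordering.
  if (i > 0 && date < new Date(logs[i - 1].date)) console.log(`Out of order: ${entry.name}`);
});
```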
4-4: Verify forum URLs are accessible. Let's ensure the forum URLs are valid and accessible.
Also applies to: 13-13, 22-22, 31-31
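That script is collapsed as well; a simple spot check could issue HEAD requests against the entries touched in this PR. This is a hypothetical sketch using Node's built-in fetch, and such plain requests may themselves be blocked by the Cloudflare Turnstile this PR works around:

```js
import { readFile } from 'node:fs/promises';

const logs = JSON.parse(await readFile('data/patchlogs.json', 'utf8'));

// Check only the first four entries, assumed to be the ones added in this diff.
for (const { name, url } of logs.slice(0, 4)) {
  try {
    const res = await fetch(url, { method: 'HEAD' });
    console.log(`${res.status} ${name} -> ${url}`);
  } catch (err) {
    console.log(`FAILED ${name} -> ${url}: ${err.message}`);
  }
}
```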
2-37: Verify JSON schema consistency. Let's ensure all entries follow a consistent schema and contain required fields.
✅ Verification successful
JSON schema is consistent and valid
The schema validation confirms:
- All required fields (name, url, date, type) are present
- All dates follow the correct ISO format
- Optional fields (additions, changes, fixes) are consistent
Note: While older entries contain a legacy "description" field, this doesn't impact the current schema or functionality.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify JSON schema consistency across all entries

# Check for required fields and their types
jq '
def check_entry:
  . as $entry |
  ["name", "url", "date", "type"] as $required |
  $required - ($entry | keys) |
  if length > 0 then "Missing required fields: \($entry.name): \(.)" else empty end;
.[] | check_entry
' data/patchlogs.json

# Verify all dates are in ISO format
jq -r '.[] | select(.date | test("^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z$") | not) | .name + ": " + .date' data/patchlogs.json
```

Length of output: 419
Script:
```bash
#!/bin/bash
# First check if file exists and show content
echo "=== File content ==="
if [ -f "data/patchlogs.json" ]; then
  cat data/patchlogs.json
else
  echo "File does not exist"
fi

echo -e "\n=== JSON validation ==="
if [ -f "data/patchlogs.json" ]; then
  if jq empty data/patchlogs.json 2>/dev/null; then
    echo "Valid JSON syntax"
  else
    echo "Invalid JSON syntax"
  fi
else
  echo "File does not exist"
fi
```

Length of output: 129903
Script:
```bash
#!/bin/bash
echo "=== Schema Validation ==="
# Check for required fields and unexpected fields
jq -r '
def check_entry:
  . as $entry |
  ["name", "url", "date", "type"] as $required |
  ($entry | keys - ["name", "url", "date", "type", "additions", "changes", "fixes"]) as $extra |
  if $extra | length > 0 then "Entry \($entry.name): Unexpected fields: \($extra)"
  elif ($required - ($entry | keys)) | length > 0 then "Entry \($entry.name): Missing required fields: \($required - ($entry | keys))"
  else empty end;
.[] | check_entry
' data/patchlogs.json

echo -e "\n=== Date Format Validation ==="
# Verify ISO date format
jq -r '.[] | select(.date | test("^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z$") | not) | "\(.name): Invalid date format: \(.date)"' data/patchlogs.json
```

Length of output: 7468
package.json (1)
90-93: Consider security implications and resource requirements. The addition of Puppeteer and its stealth plugins introduces several considerations:
- Security: Browser automation can be flagged by security tools. Ensure your deployment environment allows this.
- Resources: Puppeteer bundles Chromium (~300MB) and requires significant memory for browser instances.
- Version Stability: Consider pinning exact versions to prevent breaking changes.
Run this script to check the impact on package size and verify version compatibility:
Consider pinning exact versions:
- "puppeteer": "^24.1.0", - "puppeteer-extra": "^3.3.6", - "puppeteer-extra-plugin-stealth": "^2.11.2" + "puppeteer": "24.1.0", + "puppeteer-extra": "3.3.6", + "puppeteer-extra-plugin-stealth": "2.11.2"✅ Verification successful
Implementation review shows responsible usage of Puppeteer
The codebase demonstrates a well-structured implementation with:
- Proper browser instance management and cleanup
- Resource-efficient headless mode usage
- Built-in caching to minimize requests
- Controlled URL access and clean browser sessions
The current version constraints (^) are acceptable given the stable implementation, though you may still consider pinning versions if deployment consistency is critical.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Analyze package sizes and check for breaking changes

# Check package sizes
npm show puppeteer dist-size
npm show puppeteer-extra dist-size
npm show puppeteer-extra-plugin-stealth dist-size

# Check for breaking changes in recent versions
npm view puppeteer versions --json | jq -r '.[] | select(startswith("24."))'
npm view puppeteer-extra versions --json | jq -r '.[] | select(startswith("3."))'
npm view puppeteer-extra-plugin-stealth versions --json | jq -r '.[] | select(startswith("2."))'
```

Length of output: 1098
Script:
```bash
#!/bin/bash
# Create a temporary directory to check actual disk space impact
TEMP_DIR=$(mktemp -d)
cd "$TEMP_DIR"

# Create a minimal package.json
echo '{
  "dependencies": {
    "puppeteer": "^24.1.0",
    "puppeteer-extra": "^3.3.6",
    "puppeteer-extra-plugin-stealth": "^2.11.2"
  }
}' > package.json

# Install packages and check size
npm install --quiet
du -sh node_modules/

# Search for Puppeteer usage patterns in the codebase
cd -
rg -l "puppeteer" --type js --type ts
rg "new puppeteer\." -A 2 --type js --type ts
rg "browser\.(new)?page\(\)" -A 2 --type js --type ts
```

Length of output: 1017
Script:
```bash
#!/bin/bash
# Examine the scraper implementation
cat build/scraper.js

# Look for browser launch configurations
rg "launch\(" -A 3 build/scraper.js

# Look for browser cleanup patterns
rg "browser\.close" build/scraper.js
rg "page\.close" build/scraper.js
```

Length of output: 6290
Actionable comments posted: 2
🧹 Nitpick comments (1)
build/scraper.js (1)
Line range hint 74-76: Improve error handling message for Cloudflare blocking. The current error message doesn't provide enough context or resolution steps.
Apply this diff to improve the error message:
```diff
 if (text.length < 2) {
-  throw new Error('Connection blocked by Cloudflare.');
+  throw new Error(
+    'Connection blocked by Cloudflare Turnstile. ' +
+    'This could happen if the request was flagged as automated. ' +
+    'Please try again later or contact support if the issue persists.'
+  );
 }
```

🧰 Tools
🪛 GitHub Actions: Pull Request Checks
[error] 44: Chrome browser not found (ver. 132.0.6834.83). Puppeteer installation is required before running the script. Run 'npx puppeteer browsers install chrome' to resolve.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
build/scraper.js(2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pull Request Checks
build/scraper.js
[error] 44: Chrome browser not found (ver. 132.0.6834.83). Puppeteer installation is required before running the script. Run 'npx puppeteer browsers install chrome' to resolve.
🔇 Additional comments (2)
build/scraper.js (2)
Line range hint 1-9: Fix import order to comply with coding standards. The import statements need to be reordered to follow the coding standards: external modules first, followed by internal modules, with empty lines between groups.
🧰 Tools
🪛 GitHub Actions: Pull Request Checks
[error] 44: Chrome browser not found (ver. 132.0.6834.83). Puppeteer installation is required before running the script. Run 'npx puppeteer browsers install chrome' to resolve.
54-54: Await browser.close() to properly close the browser instance. The browser.close() method returns a Promise and should be awaited.
🧰 Tools
🪛 GitHub Actions: Pull Request Checks
[error] 44: Chrome browser not found (ver. 132.0.6834.83). Puppeteer installation is required before running the script. Run 'npx puppeteer browsers install chrome' to resolve.
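To make the suggestion concrete, a minimal before/after sketch (the exact placement in build/scraper.js, e.g. inside a finally block, may differ):

```js
// Before: the Promise returned by close() is discarded, so the script can move on
// (or exit) before Chromium has actually shut down.
browser.close();

// After: wait for the browser instance to shut down cleanly.
await browser.close();
```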
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .github/workflows/build.yaml (2 hunks)
- .github/workflows/static.yaml (1 hunks)
- .nvmrc (1 hunks)
- build/scraper.js (2 hunks)
✅ Files skipped from review due to trivial changes (1)
- .nvmrc
🔇 Additional comments (2)
build/scraper.js (2)
2-5: Fix import order to comply with coding standards. The import statements need to be reordered: external modules should be imported before internal modules.
```diff
 import { load } from 'cheerio';
+import puppeteer from 'puppeteer-extra';
+import StealthPlugin from 'puppeteer-extra-plugin-stealth';
+
+import cache from '../data/patchlogs.json' with { type: 'json' };
-import puppeteer from 'puppeteer-extra';
-import StealthPlugin from 'puppeteer-extra-plugin-stealth';
-
-import cache from '../data/patchlogs.json' with { type: 'json' };
-
 import ProgressBar from './progress.js';
 import sleep from './sleep.js';
 import title from './title.js';
```
44-61: Add Chrome browser installation requirement. The pipeline failure indicates that the Chrome browser is required but not found.
Add this step to your CI pipeline:
```yaml
- name: Install Chrome browser
  run: npx puppeteer browsers install chrome
```
Actionable comments posted: 1
♻️ Duplicate comments (1)
build/scraper.js (1)
44-64: 🛠️ Refactor suggestion: Enhance browser launch configuration and error handling.
The browser launch configuration could be improved for better reliability:
- Add specific error types for different failure scenarios
- Configure longer timeout for slow connections
- Add proxy support for better reliability
Apply this diff to enhance the implementation:
```diff
 let browser;
 try {
   browser = await puppeteer
     .use(StealthPlugin())
-    .launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox'] });
+    .launch({
+      headless: true,
+      args: [
+        '--no-sandbox',
+        '--disable-setuid-sandbox',
+        '--disable-dev-shm-usage',
+      ],
+      timeout: 60000, // Increase timeout to 60 seconds
+    });
   const page = await browser.newPage();
+
+  // Configure longer timeouts for navigation
+  await page.setDefaultNavigationTimeout(60000);
+  await page.setDefaultTimeout(60000);
   await page.goto(url, {
     waitUntil: ['networkidle0', 'domcontentloaded'],
-    timeout: 30000, // 30 second timeout
+    timeout: 60000, // 60 second timeout
   });
   return await page.content();
 } catch (err) {
-  console.error('Failed to fetch page:', err);
+  if (err.name === 'TimeoutError') {
+    console.error('Page load timed out:', url);
+  } else if (err.name === 'ProtocolError') {
+    console.error('Browser disconnected while loading:', url);
+  } else {
+    console.error('Failed to fetch page:', err);
+  }
   throw err;
 } finally {
-  await browser.close();
+  if (browser) {
+    await browser?.close().catch(console.error);
+  }
 }
```
🧹 Nitpick comments (1)
build/scraper.js (1)
Line range hint 82-85: Improve error handling for Cloudflare blocking. The error message could be more descriptive and include troubleshooting steps.
```diff
 if (text.length < 2) {
-  throw new Error('Connection blocked by Cloudflare.');
+  throw new Error(
+    'Connection blocked by Cloudflare Turnstile. ' +
+    'This could happen if the request was flagged as suspicious. ' +
+    'Try using a different IP address or waiting before retrying.'
+  );
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- build/scraper.js (2 hunks)
- index.js (1 hunks)
🔇 Additional comments (1)
build/scraper.js (1)
1-5: Fix import order to comply with coding standards. External modules should be imported before internal modules, with empty lines between different import groups.
```diff
 import { load } from 'cheerio';
 import puppeteer from 'puppeteer-extra';
 import StealthPlugin from 'puppeteer-extra-plugin-stealth';
-
-import cache from '../data/patchlogs.json' with { type: 'json' };
 import ProgressBar from './progress.js';
 import sleep from './sleep.js';
 import title from './title.js';
+
+import cache from '../data/patchlogs.json' with { type: 'json' };
```
TobiTenno left a comment
just something small w/ dependencies
corrected
🎉 This PR is included in version 2.55.1 🎉
The release is available on:
Your semantic-release bot 📦🚀
What did you fix?
Reproduction steps
Evidence/screenshot/link to line
Considerations
Summary by CodeRabbit
New Features
Dependency Updates
Data Updates
Workflow Improvements
Node.js Version Update
Updated .nvmrc to the latest LTS version.
Import Syntax Update
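The "Import Syntax Update" item is collapsed here; judging by the review diffs above, it presumably refers to the import-attribute form used for the JSON cache in build/scraper.js:

```js
// JSON module imported with an import attribute, as shown in the review diffs above.
import cache from '../data/patchlogs.json' with { type: 'json' };
```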