
Conversation

@SlayerOrnstein SlayerOrnstein (Member) commented Jan 22, 2025

What did you fix?

  • Use Puppeteer to get past Cloudflare Turnstile before passing the rendered HTML to Cheerio (see the sketch after this list)
  • Update the syntax to work with newer Node.js versions
  • Update Cheerio
  • Add new changelog entries
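
A minimal sketch of that flow, for orientation only. It assumes the puppeteer-extra, puppeteer-extra-plugin-stealth, and cheerio packages added in this PR; the helper name and example URL are illustrative and not the actual build/scraper.js code.

import { load } from 'cheerio';
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Render the page in a stealth-configured headless browser so Cloudflare
// Turnstile sees a realistic client, then return the fully rendered HTML.
const fetchRendered = async (url) => {
  const browser = await puppeteer.use(StealthPlugin()).launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: ['networkidle0', 'domcontentloaded'] });
    return await page.content();
  } finally {
    await browser.close();
  }
};

// Cheerio then parses the rendered HTML exactly as it did before.
const html = await fetchRendered('https://forums.warframe.com/topic/1434747-warframe-1999-hotfix-3807/');
const $ = load(html);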

Reproduction steps


Evidence/screenshot/link to line

Considerations

  • Does this contain a new dependency? Yes
  • Does this introduce opinionated data formatting or manual data entry? Yes
  • Does this PR include updated data files in a separate commit that can be reverted for a clean code-only PR? No
  • Have I run the linter? Yes
  • Is this a bug fix, feature request, or enhancement? Bug fix

Summary by CodeRabbit

  • New Features

    • Enhanced web scraping functionality using Puppeteer for more robust content retrieval.
    • Added stealth capabilities to the web scraping process.
  • Dependency Updates

    • Updated Cheerio to stable version.
    • Added Puppeteer and related plugins for improved web scraping.
  • Data Updates

    • Added multiple Warframe hotfix patch notes (38.0.4 - 38.0.7).
    • Included detailed gameplay adjustments and bug fixes for Warframe: 1999.
  • Workflow Improvements

    • Updated GitHub Actions workflows to streamline dependency installation and restructured job steps.
  • Node.js Version Update

    • Updated Node.js version in .nvmrc to the latest LTS version.
  • Import Syntax Update

    • Modified import syntax for patchlogs to improve module interpretation.

@SlayerOrnstein SlayerOrnstein requested a review from a team as a code owner January 22, 2025 22:35
@SlayerOrnstein SlayerOrnstein requested a review from AyAyEm January 22, 2025 22:35
@coderabbitai coderabbitai bot commented Jan 22, 2025

Walkthrough

The pull request introduces significant updates to the project's web scraping mechanism. The primary change is a transition from the standard fetch API to Puppeteer, an approach that allows more robust content retrieval, particularly for dynamic web pages. The changes span multiple files, including build/scraper.js, data/patchlogs.json, and package.json, and consist of new dependencies, an updated scraping method, and additional Warframe hotfix information.

Changes

File Change Summary
build/scraper.js - Replaced fetch with Puppeteer for web scraping
- Added imports for puppeteer-extra and puppeteer-extra-plugin-stealth
- Enhanced error handling for connection issues
data/patchlogs.json - Added four new Warframe hotfix entries (38.0.4 to 38.0.7)
- Included detailed information about game updates and fixes
package.json - Updated cheerio to stable version
- Added Puppeteer-related dependencies:
- puppeteer
- puppeteer-extra
- puppeteer-extra-plugin-stealth
.github/workflows/build.yaml - Added npm i step after npm ci in workflow
- Updated cron schedule format from double quotes to single quotes
.github/workflows/static.yaml - Reordered steps in lint, build, and test jobs
- Replaced npm ci with npm i in the build job
.nvmrc - Updated Node.js version from lts/iron to lts/jod
index.js - Changed import syntax for patchlogs.json from assert to with
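
For context on the index.js row above: Node's JSON import syntax is moving from import assertions (assert) to import attributes (with), which current LTS releases prefer. A before/after sketch (the specifier path and binding name are assumptions for illustration; the with form matches the diffs quoted later in this conversation):

// Before: import assertion keyword, deprecated in newer Node.js releases
import patchlogs from './data/patchlogs.json' assert { type: 'json' };

// After: import attribute keyword, supported on current LTS (lts/jod)
import patchlogs from './data/patchlogs.json' with { type: 'json' };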

Sequence Diagram

sequenceDiagram
    participant Scraper
    participant Puppeteer
    participant Browser
    participant WebPage

    Scraper->>Puppeteer: Launch browser
    Puppeteer->>Browser: Create headless instance
    Scraper->>Browser: Navigate to URL
    Browser->>WebPage: Load page
    WebPage-->>Browser: Page loaded
    Browser-->>Scraper: Return HTML content
    Scraper->>Puppeteer: Close browser

Poem

🐰 A Scraper's Tale of Web Delight 🌐

With Puppeteer, our code takes flight,
Browsing pages with stealth so bright,
No more simple fetch, we're going deep,
Scraping secrets while browsers sleep!

Hop, hop, hurray for tech so neat! 🚀


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4bb663 and ed9018c.

📒 Files selected for processing (1)
  • package.json (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • package.json

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
data/patchlogs.json (2)

6-6: Empty "additions" fields.

All entries have empty "additions" fields. Consider removing this field if it's not being used, or document why it's being kept empty.

 {
   "name": "Warframe: 1999: Hotfix 38.0.7",
   "url": "https://forums.warframe.com/topic/1434747-warframe-1999-hotfix-3807/",
   "date": "2025-01-14T20:03:08Z",
-  "additions": "",
   "changes": "...",
   "fixes": "...",
   "type": "Hotfix"
 }

Also applies to: 15-15, 24-24, 33-33


7-8: Consider splitting long text fields.

The "changes" and "fixes" fields contain very long text blocks with multiple items. Consider structuring these as arrays for better readability and easier parsing.

Example restructuring:

 {
   "name": "Warframe: 1999: Hotfix 38.0.7",
   "changes": [
     "The Overpower Hex Override is now capped at +150% Ability Strength and -75% Efficiency.",
     "Reduced the distance at which Legacytes can spawn in Legacyte Harvest missions."
   ],
   "fixes": [
     "Fixed Caliban's Conculysts ending their whirlwind attack early for Clients.",
     "Fixed the bonus damage duration from Hydroid's Plunder not being reset when recast."
   ]
 }

Also applies to: 16-17, 25-26, 34-35

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 424db4c and e10ec15.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (3)
  • build/scraper.js (2 hunks)
  • data/patchlogs.json (1 hunks)
  • package.json (2 hunks)
🧰 Additional context used
🪛 GitHub Check: Lint
build/scraper.js

[failure] 7-7:
There should be at least one empty line between import groups


[failure] 8-8:
puppeteer-extra-plugin-stealth import should occur before import of ../data/patchlogs.json


[failure] 9-9:
puppeteer-extra import should occur before import of ../data/patchlogs.json

🪛 eslint
build/scraper.js

[error] 7-7: There should be at least one empty line between import groups

(import/order)


[error] 8-8: puppeteer-extra-plugin-stealth import should occur before import of ../data/patchlogs.json

(import/order)


[error] 9-9: puppeteer-extra import should occur before import of ../data/patchlogs.json

(import/order)

🪛 GitHub Actions: Pull Request Checks
build/scraper.js

[error] 7-7: There should be at least one empty line between import groups (import/order)

🔇 Additional comments (4)
data/patchlogs.json (3)

5-5: Verify dates in patch logs.

The patch log entries contain future dates (December 2024 and January 2025). This might indicate:

  1. Incorrect date entries
  2. System clock issues during data collection
  3. Test data that wasn't meant for production

Run this script to check the dates of recent entries and their chronological order:

Also applies to: 14-14, 23-23, 32-32


4-4: Verify forum URLs are accessible.

Let's ensure the forum URLs are valid and accessible.

Also applies to: 13-13, 22-22, 31-31


2-37: Verify JSON schema consistency.

Let's ensure all entries follow a consistent schema and contain required fields.

✅ Verification successful

JSON schema is consistent and valid

The schema validation confirms:

  • All required fields (name, url, date, type) are present
  • All dates follow the correct ISO format
  • Optional fields (additions, changes, fixes) are consistent

Note: While older entries contain a legacy "description" field, this doesn't impact the current schema or functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify JSON schema consistency across all entries

# Check for required fields and their types
jq '
  def check_entry:
    . as $entry
    | ["name", "url", "date", "type"] as $required
    | $required - ($entry | keys)
    | if length > 0 then
        "Missing required fields: \($entry.name): \(.)"
      else
        empty
      end;
  .[] | check_entry
' data/patchlogs.json

# Verify all dates are in ISO format
jq -r '.[] | select(.date | test("^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z$") | not) | .name + ": " + .date' data/patchlogs.json

Length of output: 419


Script:

#!/bin/bash
# First check if file exists and show content
echo "=== File content ==="
if [ -f "data/patchlogs.json" ]; then
    cat data/patchlogs.json
else
    echo "File does not exist"
fi

echo -e "\n=== JSON validation ==="
if [ -f "data/patchlogs.json" ]; then
    if jq empty data/patchlogs.json 2>/dev/null; then
        echo "Valid JSON syntax"
    else
        echo "Invalid JSON syntax"
    fi
else
    echo "File does not exist"
fi

Length of output: 129903


Script:

#!/bin/bash
echo "=== Schema Validation ==="
# Check for required fields and unexpected fields
jq -r '
  def check_entry:
    . as $entry
    | ["name", "url", "date", "type"] as $required
    | ($entry | keys - ["name", "url", "date", "type", "additions", "changes", "fixes"]) as $extra
    | if $extra | length > 0 then
        "Entry \($entry.name): Unexpected fields: \($extra)"
      elif ($required - ($entry | keys)) | length > 0 then
        "Entry \($entry.name): Missing required fields: \($required - ($entry | keys))"
      else
        empty
      end;
  .[] | check_entry
' data/patchlogs.json

echo -e "\n=== Date Format Validation ==="
# Verify ISO date format
jq -r '.[] | select(.date | test("^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z$") | not) | "\(.name): Invalid date format: \(.date)"' data/patchlogs.json

Length of output: 7468

package.json (1)

90-93: Consider security implications and resource requirements.

The addition of Puppeteer and its stealth plugins introduces several considerations:

  1. Security: Browser automation can be flagged by security tools. Ensure your deployment environment allows this.
  2. Resources: Puppeteer bundles Chromium (~300MB) and requires significant memory for browser instances.
  3. Version Stability: Consider pinning exact versions to prevent breaking changes.

Run this script to check the impact on package size and verify version compatibility:

Consider pinning exact versions:

-    "puppeteer": "^24.1.0",
-    "puppeteer-extra": "^3.3.6",
-    "puppeteer-extra-plugin-stealth": "^2.11.2"
+    "puppeteer": "24.1.0",
+    "puppeteer-extra": "3.3.6",
+    "puppeteer-extra-plugin-stealth": "2.11.2"
✅ Verification successful

Implementation review shows responsible usage of Puppeteer

The codebase demonstrates a well-structured implementation with:

  • Proper browser instance management and cleanup
  • Resource-efficient headless mode usage
  • Built-in caching to minimize requests
  • Controlled URL access and clean browser sessions

The current version constraints (^) are acceptable given the stable implementation, though you may still consider pinning versions if deployment consistency is critical.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Analyze package sizes and check for breaking changes

# Check package sizes
npm show puppeteer dist-size
npm show puppeteer-extra dist-size
npm show puppeteer-extra-plugin-stealth dist-size

# Check for breaking changes in recent versions
npm view puppeteer versions --json | jq -r '.[] | select(startswith("24."))'
npm view puppeteer-extra versions --json | jq -r '.[] | select(startswith("3."))'
npm view puppeteer-extra-plugin-stealth versions --json | jq -r '.[] | select(startswith("2."))'

Length of output: 1098


Script:

#!/bin/bash
# Create a temporary directory to check actual disk space impact
TEMP_DIR=$(mktemp -d)
cd "$TEMP_DIR"

# Create a minimal package.json
echo '{
  "dependencies": {
    "puppeteer": "^24.1.0",
    "puppeteer-extra": "^3.3.6",
    "puppeteer-extra-plugin-stealth": "^2.11.2"
  }
}' > package.json

# Install packages and check size
npm install --quiet
du -sh node_modules/

# Search for Puppeteer usage patterns in the codebase
cd -
rg -l "puppeteer" --type js --type ts
rg "new puppeteer\." -A 2 --type js --type ts
rg "browser\.(new)?page\(\)" -A 2 --type js --type ts

Length of output: 1017


Script:

#!/bin/bash
# Examine the scraper implementation
cat build/scraper.js

# Look for browser launch configurations
rg "launch\(" -A 3 build/scraper.js

# Look for browser cleanup patterns
rg "browser\.close" build/scraper.js
rg "page\.close" build/scraper.js

Length of output: 6290

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
build/scraper.js (1)

Line range hint 74-76: Improve error handling message for Cloudflare blocking

The current error message doesn't provide enough context or resolution steps.

Apply this diff to improve the error message:

     if (text.length < 2) {
-      throw new Error('Connection blocked by Cloudflare.');
+      throw new Error(
+        'Connection blocked by Cloudflare Turnstile. ' +
+        'This could happen if the request was flagged as automated. ' +
+        'Please try again later or contact support if the issue persists.'
+      );
     }
🧰 Tools
🪛 GitHub Actions: Pull Request Checks

[error] 44: Chrome browser not found (ver. 132.0.6834.83). Puppeteer installation is required before running the script. Run 'npx puppeteer browsers install chrome' to resolve.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e10ec15 and 92792f4.

📒 Files selected for processing (1)
  • build/scraper.js (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Pull Request Checks
build/scraper.js

[error] 44: Chrome browser not found (ver. 132.0.6834.83). Puppeteer installation is required before running the script. Run 'npx puppeteer browsers install chrome' to resolve.

🔇 Additional comments (2)
build/scraper.js (2)

Line range hint 1-9: Fix import order to comply with coding standards

The import statements need to be reordered to follow the coding standards: external modules first, followed by internal modules, with empty lines between groups.

🧰 Tools
🪛 GitHub Actions: Pull Request Checks

[error] 44: Chrome browser not found (ver. 132.0.6834.83). Puppeteer installation is required before running the script. Run 'npx puppeteer browsers install chrome' to resolve.


54-54: Await browser.close() to properly close the browser instance

The browser.close() method returns a Promise and should be awaited.

🧰 Tools
🪛 GitHub Actions: Pull Request Checks

[error] 44: Chrome browser not found (ver. 132.0.6834.83). Puppeteer installation is required before running the script. Run 'npx puppeteer browsers install chrome' to resolve.
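
A minimal sketch of the close-and-await pattern the comment above is asking for; the wrapper shape is an assumption for illustration, not the repository's actual scraper code.

import puppeteer from 'puppeteer';

async function withBrowser(work) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    return await work(browser);
  } finally {
    // close() returns a Promise; awaiting it ensures the Chromium
    // child process has fully exited before this function returns.
    await browser.close();
  }
}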

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 92792f4 and c868f20.

📒 Files selected for processing (4)
  • .github/workflows/build.yaml (2 hunks)
  • .github/workflows/static.yaml (1 hunks)
  • .nvmrc (1 hunks)
  • build/scraper.js (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .nvmrc
🔇 Additional comments (2)
build/scraper.js (2)

2-5: Fix import order to comply with coding standards

The import statements need to be reordered to comply with coding standards. External modules should be imported before internal modules.

 import { load } from 'cheerio';
+import puppeteer from 'puppeteer-extra';
+import StealthPlugin from 'puppeteer-extra-plugin-stealth';
+
+import cache from '../data/patchlogs.json' with { type: 'json' };
 
-import puppeteer from 'puppeteer-extra';
-import StealthPlugin from 'puppeteer-extra-plugin-stealth';
-
-import cache from '../data/patchlogs.json' with { type: 'json' };
-
 import ProgressBar from './progress.js';
 import sleep from './sleep.js';
 import title from './title.js';

44-61: Add Chrome browser installation requirement

The pipeline failure indicates that Chrome browser is required but not found.

Add this step to your CI pipeline:

- name: Install Chrome browser
  run: npx puppeteer browsers install chrome

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
build/scraper.js (1)

44-64: 🛠️ Refactor suggestion

Enhance browser launch configuration and error handling.

The browser launch configuration could be improved for better reliability:

  1. Add specific error types for different failure scenarios
  2. Configure longer timeout for slow connections
  3. Add proxy support for better reliability

Apply this diff to enhance the implementation:

     let browser;
 
     try {
       browser = await puppeteer
         .use(StealthPlugin())
-        .launch({ headless: true, args: ['--no-sandbox', '--disable-setuid-sandbox'] });
+        .launch({
+          headless: true,
+          args: [
+            '--no-sandbox',
+            '--disable-setuid-sandbox',
+            '--disable-dev-shm-usage',
+          ],
+          timeout: 60000, // Increase timeout to 60 seconds
+        });
       const page = await browser.newPage();
+      
+      // Configure longer timeouts for navigation
+      await page.setDefaultNavigationTimeout(60000);
+      await page.setDefaultTimeout(60000);
 
       await page.goto(url, {
         waitUntil: ['networkidle0', 'domcontentloaded'],
-        timeout: 30000, // 30 second timeout
+        timeout: 60000, // 60 second timeout
       });
 
       return await page.content();
     } catch (err) {
-      console.error('Failed to fetch page:', err);
+      if (err.name === 'TimeoutError') {
+        console.error('Page load timed out:', url);
+      } else if (err.name === 'ProtocolError') {
+        console.error('Browser disconnected while loading:', url);
+      } else {
+        console.error('Failed to fetch page:', err);
+      }
       throw err;
     } finally {
-      await browser.close();
+      if (browser) {
+        await browser?.close().catch(console.error);
+      }
     }
🧹 Nitpick comments (1)
build/scraper.js (1)

Line range hint 82-85: Improve error handling for Cloudflare blocking.

The error message could be more descriptive and include troubleshooting steps.

     if (text.length < 2) {
-      throw new Error('Connection blocked by Cloudflare.');
+      throw new Error(
+        'Connection blocked by Cloudflare Turnstile. ' +
+        'This could happen if the request was flagged as suspicious. ' +
+        'Try using a different IP address or waiting before retrying.'
+      );
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c868f20 and c4bb663.

📒 Files selected for processing (2)
  • build/scraper.js (2 hunks)
  • index.js (1 hunks)
🔇 Additional comments (1)
build/scraper.js (1)

1-5: Fix import order to comply with coding standards.

External modules should be imported before internal modules, with empty lines between different import groups.

 import { load } from 'cheerio';
 import puppeteer from 'puppeteer-extra';
 import StealthPlugin from 'puppeteer-extra-plugin-stealth';
-
-import cache from '../data/patchlogs.json' with { type: 'json' };

 import ProgressBar from './progress.js';
 import sleep from './sleep.js';
 import title from './title.js';
+
+import cache from '../data/patchlogs.json' with { type: 'json' };

@TobiTenno TobiTenno (Member) left a comment

just something small w/ dependencies

@SlayerOrnstein SlayerOrnstein (Member, Author) commented

just something small w/ dependencies

corrected

@TobiTenno TobiTenno enabled auto-merge (squash) January 23, 2025 01:18
@TobiTenno TobiTenno merged commit 1a3611d into WFCD:master Jan 23, 2025
6 checks passed
@wfcd-bot-boi (Collaborator) commented

🎉 This PR is included in version 2.55.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

@SlayerOrnstein SlayerOrnstein deleted the fix-cloudflare branch January 24, 2025 13:23
@coderabbitai coderabbitai bot mentioned this pull request Nov 29, 2025