Conversation

@michelle0927 (Collaborator) commented on Jul 11, 2025

Resolves #17451

Summary by CodeRabbit

  • New Features

    • Introduced a "Scrape Website" action for WebScrapeAI, enabling data extraction from websites with customizable parameters including URL, commands, schema, pagination, headers, and JavaScript instructions.
    • Provides a summary message showing the scraped URL and the number of results obtained.
  • Refactor

    • Enhanced WebScrapeAI integration with a standardized API client for improved reliability.
  • Chores

    • Updated package version and added dependencies.

@vercel vercel bot commented on Jul 11, 2025

The latest updates on your projects.

3 Skipped Deployments

| Name | Status | Updated (UTC) |
| --- | --- | --- |
| docs-v2 | ⬜️ Ignored | Jul 11, 2025 3:59pm |
| pipedream-docs | ⬜️ Ignored | Jul 11, 2025 3:59pm |
| pipedream-docs-redirect-do-not-edit | ⬜️ Ignored | Jul 11, 2025 3:59pm |

@coderabbitai coderabbitai bot (Contributor) commented on Jul 11, 2025

Walkthrough

A new "Scrape Website" action was added to the WebScrapeAI integration, allowing users to specify a URL, extraction command, schema, pagination, headers, and custom JavaScript instructions. The app logic was refactored to support real API requests, and package dependencies were updated to include the platform library.

Changes

| File(s) | Change Summary |
| --- | --- |
| components/webscrape_ai/actions/scrape-website/... | Added new "Scrape Website" action module for WebScrapeAI, supporting multiple input parameters. |
| components/webscrape_ai/webscrape_ai.app.mjs | Refactored app logic: removed placeholder, added API request methods, implemented scrapeWebsite. |
| components/webscrape_ai/package.json | Updated version to 0.1.0; added @pipedream/platform dependency. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Action as scrape-website.mjs
    participant App as webscrape_ai.app.mjs
    participant WebScrapeAI_API

    User->>Action: Provide URL, command, schema, etc.
    Action->>App: Call scrapeWebsite(opts)
    App->>WebScrapeAI_API: POST /scrapeWebSite with params
    WebScrapeAI_API-->>App: Return scraped data
    App-->>Action: Return response
    Action-->>User: Export summary & return data
```
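The flow in the diagram can be sketched as a minimal action run method. This is a hedged illustration only: the `app` object shape, the prop names (`url`, `command`, `schema`), and the `scrapeWebsite` signature are assumptions, not the actual component source.

```javascript
// Hypothetical sketch of the action's run flow from the sequence diagram above.
// The app wrapper and scrapeWebsite signature are assumed for illustration.
async function run({ $, app, props }) {
  // Delegate the HTTP call to the app-level client (POST /scrapeWebSite)
  const response = await app.scrapeWebsite({
    $,
    params: {
      url: props.url,
      command: props.command,
      schema: props.schema,
    },
  });

  // Export a human-readable summary, then return the raw data to the workflow
  $.export("$summary", `Scraped ${props.url} and got ${response.length} results`);
  return response;
}
```

In a real Pipedream component this logic lives in the action module's `run()` method, with the app client injected via props.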

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Implement "run-a-task" action to scrape a provided URL and store results (#17451) | ✅ | |
| Integrate with WebScrapeAI API for web scraping (#17451) | ✅ | |
| Support input parameters: URL, schema, extraction command, etc. (#17451) | ✅ | |

Poem

A bunny hopped to scrape a site,
With code so clean, the task took flight!
URLs and schemas, commands in tow,
The data fetched in one swift go.
Now WebScrapeAI’s ready—
For every page, fast and steady!
🐇✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

components/webscrape_ai/actions/scrape-website/scrape-website.mjs

Oops! Something went wrong! :(

ESLint: 8.57.1

Error [ERR_MODULE_NOT_FOUND]: Cannot find package 'jsonc-eslint-parser' imported from /eslint.config.mjs
at Object.getPackageJSONURL (node:internal/modules/package_json_reader:255:9)
at packageResolve (node:internal/modules/esm/resolve:767:81)
at moduleResolve (node:internal/modules/esm/resolve:853:18)
at defaultResolve (node:internal/modules/esm/resolve:983:11)
at ModuleLoader.defaultResolve (node:internal/modules/esm/loader:801:12)
at #cachedDefaultResolve (node:internal/modules/esm/loader:725:25)
at ModuleLoader.resolve (node:internal/modules/esm/loader:708:38)
at ModuleLoader.getModuleJobForImport (node:internal/modules/esm/loader:309:38)
at #link (node:internal/modules/esm/module_job:202:49)


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2337ef8 and b665763.

📒 Files selected for processing (1)
  • components/webscrape_ai/actions/scrape-website/scrape-website.mjs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/webscrape_ai/actions/scrape-website/scrape-website.mjs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint Code Base
  • GitHub Check: pnpm publish
  • GitHub Check: Verify TypeScript components
  • GitHub Check: Publish TypeScript components

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
components/webscrape_ai/actions/scrape-website/scrape-website.mjs (2)

26-30: Clarify the schema prop description.

The description shows an object example but the prop type is string, which could confuse users about the expected input format.

Consider updating the description to clarify both formats are supported:

```diff
-      description: "Schema representing the fields you want to scrape. E.g. `{\"author\":\"string\",\"comments_count\":\"integer\",\"points\":\"integer\",\"posted_time\":\"string\",\"title\":\"string\",\"url\":\"url\"}`",
+      description: "Schema representing the fields you want to scrape. Can be a JSON string or object. E.g. `{\"author\":\"string\",\"comments_count\":\"integer\",\"points\":\"integer\",\"posted_time\":\"string\",\"title\":\"string\",\"url\":\"url\"}`",
```

37-42: Clarify the headers format.

The description mentions "key-value pairs" but doesn't specify the exact format expected by the API.

Consider providing a clearer format example:

```diff
-      description: "List of headers in key-value pairs. i.e `Accept: application/json`",
+      description: "HTTP headers as key-value pairs, one per line. E.g. `Accept: application/json`",
```
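Whichever wording lands, the action (or a consumer of it) ultimately has to turn those `Key: value` lines into an object. A minimal sketch of that conversion, assuming one header per line; the helper name is hypothetical and not part of the component:

```javascript
// Hypothetical helper: convert "Key: value" lines into a headers object.
// Splits on the FIRST colon only, so values containing colons survive intact.
function parseHeaderLines(lines) {
  return Object.fromEntries(
    lines.map((line) => {
      const idx = line.indexOf(":");
      return [
        line.slice(0, idx).trim(),      // header name
        line.slice(idx + 1).trim(),     // header value (may itself contain colons)
      ];
    }),
  );
}
```

Splitting on the first colon rather than using `line.split(":")` is the key design choice: values such as timestamps or media-type parameters often contain colons themselves.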
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c81a3c and 2337ef8.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (3)
  • components/webscrape_ai/actions/scrape-website/scrape-website.mjs (1 hunks)
  • components/webscrape_ai/package.json (2 hunks)
  • components/webscrape_ai/webscrape_ai.app.mjs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
components/webscrape_ai/package.json (1)
Learnt from: jcortes
PR: PipedreamHQ/pipedream#14935
File: components/sailpoint/package.json:15-18
Timestamp: 2024-12-12T19:23:09.039Z
Learning: When developing Pipedream components, do not add built-in Node.js modules like `fs` to `package.json` dependencies, as they are native modules provided by the Node.js runtime.
components/webscrape_ai/webscrape_ai.app.mjs (1)
Learnt from: GTFalcao
PR: PipedreamHQ/pipedream#16954
File: components/salesloft/salesloft.app.mjs:14-23
Timestamp: 2025-06-04T17:52:05.780Z
Learning: In the Salesloft API integration (components/salesloft/salesloft.app.mjs), the _makeRequest method returns response.data which directly contains arrays for list endpoints like listPeople, listCadences, listUsers, and listAccounts. The propDefinitions correctly call .map() directly on these responses without needing to destructure a nested data property.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pnpm publish
  • GitHub Check: Verify TypeScript components
  • GitHub Check: Lint Code Base
  • GitHub Check: Publish TypeScript components
🔇 Additional comments (8)
components/webscrape_ai/package.json (2)

3-3: LGTM: Appropriate version bump for new functionality.

The minor version increment correctly reflects the addition of new functionality (the scrape website action).


15-16: LGTM: Platform dependency correctly added.

The @pipedream/platform dependency is properly added to support the axios import used in the app client.

components/webscrape_ai/webscrape_ai.app.mjs (3)

1-1: LGTM: Proper platform import for HTTP client.

The axios import from the platform is correctly implemented to support API requests.


8-22: LGTM: Standard Pipedream app client pattern.

The implementation follows the established pattern with:

  • _baseUrl() method for API endpoint
  • _makeRequest() method with automatic authentication
  • Proper parameter handling and request options spreading
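The pattern described in these bullets can be illustrated with a small sketch. The base URL, the `api_key` query-parameter name, and the helper names below are assumptions for illustration only, not the actual app file:

```javascript
// Hypothetical sketch of the standard Pipedream app-client pattern:
// a base-URL helper plus a request builder that injects auth automatically.
const BASE_URL = "https://api.example-webscrapeai.test"; // assumed, for illustration

function buildRequestOptions({ path, params = {}, apiKey, ...opts }) {
  return {
    url: `${BASE_URL}${path}`,        // _baseUrl() equivalent
    params: {
      ...params,
      api_key: apiKey,                // auth merged in, as _makeRequest does with $auth
    },
    ...opts,                          // caller options (method, headers, ...) spread last
  };
}
```

In the real app file, an endpoint wrapper like `scrapeWebsite` would simply call `_makeRequest` with a fixed `path`, so every method shares the same URL construction and authentication logic.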

23-28: LGTM: Clean API method wrapper.

The scrapeWebsite method provides a clean interface for the specific endpoint while leveraging the shared request infrastructure.

components/webscrape_ai/actions/scrape-website/scrape-website.mjs (3)

11-15: LGTM: Helpful timeout alert for users.

The alert about potential timeout issues is valuable user guidance for synchronous API operations.


56-58: LGTM: Proper schema handling.

The conditional JSON.stringify for object schemas is well implemented to handle both string and object inputs.
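The conditional handling praised here boils down to a one-liner; a sketch under the assumption that the API expects the schema as JSON text (the function name is hypothetical):

```javascript
// Hypothetical sketch of the schema handling: objects are stringified,
// strings pass through untouched, so both input forms reach the API as JSON text.
function normalizeSchema(schema) {
  return typeof schema === "object" && schema !== null
    ? JSON.stringify(schema)
    : schema;
}
```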


64-67: Ensure the response is an array (or unwrap the data property).
The current summary uses response.length directly, but the scrapeWebsite call may return a full HTTP response object or wrap the array under a data property. Please confirm the exact shape of the JSON returned by /scrapeWebSite and update the code accordingly. For example, if the array is nested in response.data, you can:

• Destructure and return only the array:

```diff
- const response = await this.webscrapeAi.scrapeWebsite({ … });
- $.export("$summary", `Scraped ${this.url} and got ${response.length} result${response.length === 1 ? "" : "s"}`);
- return response;
+ const { data } = await this.webscrapeAi.scrapeWebsite({ … });
+ $.export("$summary", `Scraped ${this.url} and got ${data.length} result${data.length === 1 ? "" : "s"}`);
+ return data;
```

• Or handle both cases in one go:

```js
const response = await this.webscrapeAi.scrapeWebsite({});
const results = Array.isArray(response) ? response : response.data;
$.export("$summary", `Scraped ${this.url} and got ${results.length} result${results.length === 1 ? "" : "s"}`);
return results;
```

@luancazarine luancazarine (Collaborator) left a comment

Hi @michelle0927, LGTM! Ready for QA!

@vunguyenhung vunguyenhung merged commit 1813f38 into master Jul 17, 2025
11 checks passed
@vunguyenhung vunguyenhung deleted the issue-17451 branch July 17, 2025 02:51

Development

Successfully merging this pull request may close these issues.

[Components] webscrape_ai

4 participants