Conversation

@michelle0927 (Collaborator) commented on Jul 11, 2025

Resolves #17451

Summary by CodeRabbit

  • New Features

    • Introduced a "Scrape Website" action for WebScrapeAI, enabling data extraction from websites with customizable parameters including URL, commands, schema, pagination, headers, and JavaScript instructions.
    • Provides a summary message showing the scraped URL and the number of results obtained.
  • Refactor

    • Enhanced WebScrapeAI integration with a standardized API client for improved reliability.
  • Chores

    • Updated package version and added dependencies.

@vercel vercel bot commented on Jul 11, 2025

The latest updates on your projects.

3 Skipped Deployments

| Name | Status | Updated (UTC) |
| --- | --- | --- |
| docs-v2 | ⬜️ Ignored | Jul 11, 2025 3:59pm |
| pipedream-docs | ⬜️ Ignored | Jul 11, 2025 3:59pm |
| pipedream-docs-redirect-do-not-edit | ⬜️ Ignored | Jul 11, 2025 3:59pm |

@coderabbitai coderabbitai bot (Contributor) commented on Jul 11, 2025

Walkthrough

A new "Scrape Website" action was added to the WebScrapeAI integration, allowing users to specify a URL, extraction command, schema, pagination, headers, and custom JavaScript instructions. The app logic was refactored to support real API requests, and package dependencies were updated to include the platform library.

Changes

| File(s) | Change Summary |
| --- | --- |
| components/webscrape_ai/actions/scrape-website/... | Added new "Scrape Website" action module for WebScrapeAI, supporting multiple input parameters. |
| components/webscrape_ai/webscrape_ai.app.mjs | Refactored app logic: removed placeholder, added API request methods, implemented scrapeWebsite. |
| components/webscrape_ai/package.json | Updated version to 0.1.0; added @pipedream/platform dependency. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Action as scrape-website.mjs
    participant App as webscrape_ai.app.mjs
    participant WebScrapeAI_API

    User->>Action: Provide URL, command, schema, etc.
    Action->>App: Call scrapeWebsite(opts)
    App->>WebScrapeAI_API: POST /scrapeWebSite with params
    WebScrapeAI_API-->>App: Return scraped data
    App-->>Action: Return response
    Action-->>User: Export summary & return data
```
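The flow in the diagram can be sketched as a minimal action run method. This is a hedged illustration only: the `app` object shape, the prop names (`url`, `command`, `schema`), and the `scrapeWebsite` signature are assumptions, not the actual component source.

```javascript
// Hypothetical sketch of the action's run flow from the sequence diagram above.
// The app wrapper and scrapeWebsite signature are assumed for illustration.
async function run({ $, app, props }) {
  // Delegate the HTTP call to the app-level client (POST /scrapeWebSite)
  const response = await app.scrapeWebsite({
    $,
    params: {
      url: props.url,
      command: props.command,
      schema: props.schema,
    },
  });

  // Export a human-readable summary, then return the raw data to the workflow
  $.export("$summary", `Scraped ${props.url} and got ${response.length} results`);
  return response;
}
```

In a real Pipedream component this logic lives in the action module's `run()` method, with the app client injected via props.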

Assessment against linked issues

| Objective | Addressed | Explanation |
| --- | --- | --- |
| Implement "run-a-task" action to scrape a provided URL and store results (#17451) | ✅ | |
| Integrate with WebScrapeAI API for web scraping (#17451) | ✅ | |
| Support input parameters: URL, schema, extraction command, etc. (#17451) | ✅ | |

Poem

A bunny hopped to scrape a site,
With code so clean, the task took flight!
URLs and schemas, commands in tow,
The data fetched in one swift go.
Now WebScrapeAI’s ready—
For every page, fast and steady!
🐇✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

components/webscrape_ai/actions/scrape-website/scrape-website.mjs

Oops! Something went wrong! :(

ESLint: 8.57.1

Error [ERR_MODULE_NOT_FOUND]: Cannot find package 'jsonc-eslint-parser' imported from /eslint.config.mjs
at Object.getPackageJSONURL (node:internal/modules/package_json_reader:255:9)
at packageResolve (node:internal/modules/esm/resolve:767:81)
at moduleResolve (node:internal/modules/esm/resolve:853:18)
at defaultResolve (node:internal/modules/esm/resolve:983:11)
at ModuleLoader.defaultResolve (node:internal/modules/esm/loader:801:12)
at #cachedDefaultResolve (node:internal/modules/esm/loader:725:25)
at ModuleLoader.resolve (node:internal/modules/esm/loader:708:38)
at ModuleLoader.getModuleJobForImport (node:internal/modules/esm/loader:309:38)
at #link (node:internal/modules/esm/module_job:202:49)


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2337ef8 and b665763.

📒 Files selected for processing (1)
  • components/webscrape_ai/actions/scrape-website/scrape-website.mjs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/webscrape_ai/actions/scrape-website/scrape-website.mjs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint Code Base
  • GitHub Check: pnpm publish
  • GitHub Check: Verify TypeScript components
  • GitHub Check: Publish TypeScript components

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
components/webscrape_ai/actions/scrape-website/scrape-website.mjs (2)

26-30: Clarify the schema prop description.

The description shows an object example but the prop type is string, which could confuse users about the expected input format.

Consider updating the description to clarify both formats are supported:

```diff
-      description: "Schema representing the fields you want to scrape. E.g. `{\"author\":\"string\",\"comments_count\":\"integer\",\"points\":\"integer\",\"posted_time\":\"string\",\"title\":\"string\",\"url\":\"url\"}`",
+      description: "Schema representing the fields you want to scrape. Can be a JSON string or object. E.g. `{\"author\":\"string\",\"comments_count\":\"integer\",\"points\":\"integer\",\"posted_time\":\"string\",\"title\":\"string\",\"url\":\"url\"}`",
```

37-42: Clarify the headers format.

The description mentions "key-value pairs" but doesn't specify the exact format expected by the API.

Consider providing a clearer format example:

```diff
-      description: "List of headers in key-value pairs. i.e `Accept: application/json`",
+      description: "HTTP headers as key-value pairs, one per line. E.g. `Accept: application/json`",
```
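Whichever wording lands, the action (or a consumer of it) ultimately has to turn those `Key: value` lines into an object. A minimal sketch of that conversion, assuming one header per line; the helper name is hypothetical and not part of the component:

```javascript
// Hypothetical helper: convert "Key: value" lines into a headers object.
// Splits on the FIRST colon only, so values containing colons survive intact.
function parseHeaderLines(lines) {
  return Object.fromEntries(
    lines.map((line) => {
      const idx = line.indexOf(":");
      return [
        line.slice(0, idx).trim(),      // header name
        line.slice(idx + 1).trim(),     // header value (may itself contain colons)
      ];
    }),
  );
}
```

Splitting on the first colon rather than using `line.split(":")` is the key design choice: values such as timestamps or media-type parameters often contain colons themselves.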
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9c81a3c and 2337ef8.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (3)
  • components/webscrape_ai/actions/scrape-website/scrape-website.mjs (1 hunks)
  • components/webscrape_ai/package.json (2 hunks)
  • components/webscrape_ai/webscrape_ai.app.mjs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
components/webscrape_ai/package.json (1)
Learnt from: jcortes
PR: PipedreamHQ/pipedream#14935
File: components/sailpoint/package.json:15-18
Timestamp: 2024-12-12T19:23:09.039Z
Learning: When developing Pipedream components, do not add built-in Node.js modules like `fs` to `package.json` dependencies, as they are native modules provided by the Node.js runtime.
components/webscrape_ai/webscrape_ai.app.mjs (1)
Learnt from: GTFalcao
PR: PipedreamHQ/pipedream#16954
File: components/salesloft/salesloft.app.mjs:14-23
Timestamp: 2025-06-04T17:52:05.780Z
Learning: In the Salesloft API integration (components/salesloft/salesloft.app.mjs), the _makeRequest method returns response.data which directly contains arrays for list endpoints like listPeople, listCadences, listUsers, and listAccounts. The propDefinitions correctly call .map() directly on these responses without needing to destructure a nested data property.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pnpm publish
  • GitHub Check: Verify TypeScript components
  • GitHub Check: Lint Code Base
  • GitHub Check: Publish TypeScript components
🔇 Additional comments (8)
components/webscrape_ai/package.json (2)

3-3: LGTM: Appropriate version bump for new functionality.

The minor version increment correctly reflects the addition of new functionality (the scrape website action).


15-16: LGTM: Platform dependency correctly added.

The @pipedream/platform dependency is properly added to support the axios import used in the app client.

components/webscrape_ai/webscrape_ai.app.mjs (3)

1-1: LGTM: Proper platform import for HTTP client.

The axios import from the platform is correctly implemented to support API requests.


8-22: LGTM: Standard Pipedream app client pattern.

The implementation follows the established pattern with:

  • _baseUrl() method for API endpoint
  • _makeRequest() method with automatic authentication
  • Proper parameter handling and request options spreading
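The pattern described in these bullets can be illustrated with a small sketch. The base URL, the `api_key` query-parameter name, and the helper names below are assumptions for illustration only, not the actual app file:

```javascript
// Hypothetical sketch of the standard Pipedream app-client pattern:
// a base-URL helper plus a request builder that injects auth automatically.
const BASE_URL = "https://api.example-webscrapeai.test"; // assumed, for illustration

function buildRequestOptions({ path, params = {}, apiKey, ...opts }) {
  return {
    url: `${BASE_URL}${path}`,        // _baseUrl() equivalent
    params: {
      ...params,
      api_key: apiKey,                // auth merged in, as _makeRequest does with $auth
    },
    ...opts,                          // caller options (method, headers, ...) spread last
  };
}
```

In the real app file, an endpoint wrapper like `scrapeWebsite` would simply call `_makeRequest` with a fixed `path`, so every method shares the same URL construction and authentication logic.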

23-28: LGTM: Clean API method wrapper.

The scrapeWebsite method provides a clean interface for the specific endpoint while leveraging the shared request infrastructure.

components/webscrape_ai/actions/scrape-website/scrape-website.mjs (3)

11-15: LGTM: Helpful timeout alert for users.

The alert about potential timeout issues is valuable user guidance for synchronous API operations.


56-58: LGTM: Proper schema handling.

The conditional JSON.stringify for object schemas is well implemented to handle both string and object inputs.
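The conditional handling praised here boils down to a one-liner; a sketch under the assumption that the API expects the schema as JSON text (the function name is hypothetical):

```javascript
// Hypothetical sketch of the schema handling: objects are stringified,
// strings pass through untouched, so both input forms reach the API as JSON text.
function normalizeSchema(schema) {
  return typeof schema === "object" && schema !== null
    ? JSON.stringify(schema)
    : schema;
}
```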


64-67: Ensure the response is an array (or unwrap the data property).
The current summary uses response.length directly, but the scrapeWebsite call may return a full HTTP response object or wrap the array under a data property. Please confirm the exact shape of the JSON returned by /scrapeWebSite and update the code accordingly. For example, if the array is nested in response.data, you can:

• Destructure and return only the array:

```diff
- const response = await this.webscrapeAi.scrapeWebsite({ … });
- $.export("$summary", `Scraped ${this.url} and got ${response.length} result${response.length === 1 ? "" : "s"}`);
- return response;
+ const { data } = await this.webscrapeAi.scrapeWebsite({ … });
+ $.export("$summary", `Scraped ${this.url} and got ${data.length} result${data.length === 1 ? "" : "s"}`);
+ return data;
```

• Or handle both cases in one go:

```js
const response = await this.webscrapeAi.scrapeWebsite({});
const results = Array.isArray(response) ? response : response.data;
$.export("$summary", `Scraped ${this.url} and got ${results.length} result${results.length === 1 ? "" : "s"}`);
return results;
```

@luancazarine luancazarine (Collaborator) left a comment

Hi @michelle0927, LGTM! Ready for QA!

@vunguyenhung vunguyenhung merged commit 1813f38 into master Jul 17, 2025
11 checks passed
@vunguyenhung vunguyenhung deleted the issue-17451 branch July 17, 2025 02:51

Development

Successfully merging this pull request may close these issues.

[Components] webscrape_ai

4 participants