
Conversation

@luancazarine
Collaborator

@luancazarine luancazarine commented Feb 26, 2025

Resolves #15137.

Summary by CodeRabbit

  • New Features

    • Upgraded to version 0.1.0 with an enhanced ScrapeNinja application, offering customizable options for configuring scraping requests such as timeouts, headers, proxy settings, and viewport preferences.
    • Introduced two dedicated scraping actions: one optimized for handling JavaScript-rendered pages and another designed for high-performance scraping without JavaScript execution.
  • Chores

    • Updated package configurations and dependencies to improve stability and compatibility.

@vercel

vercel bot commented Feb 26, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

3 Skipped Deployments

  • docs-v2: ⬜️ Ignored (Inspect); Visit Preview; updated Mar 3, 2025 11:46pm (UTC)
  • pipedream-docs: ⬜️ Ignored (Inspect); updated Mar 3, 2025 11:46pm (UTC)
  • pipedream-docs-redirect-do-not-edit: ⬜️ Ignored (Inspect); updated Mar 3, 2025 11:46pm (UTC)

@luancazarine luancazarine added the ai-assisted Content generated by AI, with human refinement and modification label Feb 26, 2025
@coderabbitai
Contributor

coderabbitai bot commented Feb 26, 2025

Walkthrough

The pull request removes legacy files and introduces new modules that redefine the ScrapeNinja application. The outdated .gitignore and TypeScript app file are deleted. A new utility module and a new JavaScript application file are added with methods for handling API requests and scraping. Two new scraping actions are introduced to handle pages with and without JavaScript rendering. Additionally, package metadata is updated to reflect the changes in module structure and dependencies.

Changes

File(s) and change summary:

  • components/scrapeninja/.gitignore: Deleted; removed rules for ignoring .js, .mjs, and the dist directory.
  • components/scrapeninja/app/scrapeninja.app.ts: Deleted; legacy app definition removed along with the authKeys method.
  • components/scrapeninja/common/utils.mjs: New module; exports parseObject, parseError, and clearObj for robust data parsing and cleaning.
  • components/scrapeninja/package.json: Version updated from "0.0.2" to "0.1.0"; main entry point changed; files field removed; dependency on "@pipedream/platform" added.
  • components/scrapeninja/scrapeninja.app.mjs: New app module; defines the _baseUrl, _headers, _makeRequest, scrapeNonJs, and scrapeJs methods for interfacing with the ScrapeNinja API.
  • components/scrapeninja/actions/scrape-with-js-rendering/scrape-with-js-rendering.mjs and components/scrapeninja/actions/scrape-without-js/scrape-without-js.mjs: New action modules; provide scraping for pages that require JavaScript rendering and those that do not, respectively.

Sequence Diagram(s)

sequenceDiagram
  participant A as "Action: Scrape with JS Rendering"
  participant B as "ScrapeNinja App (scrapeJs)"
  participant C as "API Request (_makeRequest)"
  
  A->>B: Prepare scraping config (viewport, parameters)
  B->>C: Call _makeRequest for JS rendering
  C-->>B: Return API response
  B-->>A: Return scraping result
sequenceDiagram
  participant A as "Action: Scrape without JS"
  participant B as "ScrapeNinja App (scrapeNonJs)"
  participant C as "API Request (_makeRequest)"
  
  A->>B: Assemble scraping parameters and data
  B->>C: Call _makeRequest for non-JS scraping
  C-->>B: Return API response
  B-->>A: Return scraping result

Suggested reviewers

  • michelle0927

Poem

I'm a rabbit, quick on my feet,
Hopping through code with a joyful beat.
Old files are gone, new functions in play,
ScrapeNinja's reborn in a bright new way.
🐰💻 May our code always hop smooth and free!
Cheers to changes and a path clear as can be.

Warning

There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

components/scrapeninja/actions/scrape-without-js/scrape-without-js.mjs

Oops! Something went wrong! :(

ESLint: 8.57.1

Error [ERR_MODULE_NOT_FOUND]: Cannot find package 'jsonc-eslint-parser' imported from /eslint.config.mjs
at packageResolve (node:internal/modules/esm/resolve:839:9)
at moduleResolve (node:internal/modules/esm/resolve:908:18)
at defaultResolve (node:internal/modules/esm/resolve:1038:11)
at ModuleLoader.defaultResolve (node:internal/modules/esm/loader:557:12)
at ModuleLoader.resolve (node:internal/modules/esm/loader:525:25)
at ModuleLoader.getModuleJob (node:internal/modules/esm/loader:246:38)
at ModuleJob._link (node:internal/modules/esm/module_job:126:49)

components/scrapeninja/actions/scrape-with-js-rendering/scrape-with-js-rendering.mjs
components/scrapeninja/scrapeninja.app.mjs

Both files failed with the same ESLint 8.57.1 error shown above (ERR_MODULE_NOT_FOUND: cannot find package 'jsonc-eslint-parser' imported from /eslint.config.mjs).


Actions
 - Non JS Scraping
 - Scraping With JS Rendering
Collaborator

@GTFalcao GTFalcao left a comment


I left a few suggestions regarding component naming and prop descriptions.

@luancazarine luancazarine marked this pull request as ready for review February 28, 2025 18:34
@luancazarine
Collaborator Author

/approve

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
components/scrapeninja/actions/non-js-scraping/non-js-scraping.mjs (1)

83-104: Consider more robust error handling.

The current error handling assumes a specific error structure with response.data. This could fail if the error doesn't have this structure (e.g., network errors, timeouts).

-  } catch ({ response: { data } }) {
-    throw new ConfigurationError(data.message || data.stderr);
+  } catch (error) {
+    if (error.response && error.response.data) {
+      throw new ConfigurationError(error.response.data.message || error.response.data.stderr);
+    }
+    throw new ConfigurationError(error.message || "An unknown error occurred");
+  }
components/scrapeninja/actions/scraping-with-js-rendering/scraping-with-js-rendering.mjs (1)

184-229: Improve error handling for more robustness.

Similar to the non-JS scraping action, the error handling here could be more robust to handle various error scenarios.

-  } catch ({ response: { data } }) {
-    throw new ConfigurationError(parseError(data));
+  } catch (error) {
+    if (error.response && error.response.data) {
+      throw new ConfigurationError(parseError(error.response.data));
+    }
+    throw new ConfigurationError(error.message || "An unknown error occurred");
+  }
components/scrapeninja/common/utils.mjs (2)

32-47: Optimize the clearObj function to avoid O(n²) complexity.

The use of spread operator (...) in the reducer can lead to O(n²) time complexity as flagged by the static analysis tool. This could impact performance with larger objects.

export const clearObj = (obj) => {
-  return Object.entries(obj)
-    .filter(([
-      _,
-      v,
-    ]) => (v != null && v != "" && _ != "$emit"))
-    .reduce((acc, [
-      k,
-      v,
-    ]) => ({
-      ...acc,
-      [k]: (!Array.isArray(v) && v === Object(v))
-        ? clearObj(v)
-        : v,
-    }), {});
+  const result = {};
+  for (const [key, value] of Object.entries(obj)) {
+    if (value != null && value !== "" && key !== "$emit") {
+      result[key] = (!Array.isArray(value) && value === Object(value))
+        ? clearObj(value)
+        : value;
+    }
+  }
+  return result;
};

This approach avoids the spread operator in the reducer, achieving the same result more efficiently.
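As a quick sanity check, the loop-based version can be exercised standalone. The sample input below (prop names like `geo` and `retryNum`) is illustrative only, not taken from the component:

```javascript
// Loop-based clearObj as suggested above: drops null/empty values and the
// internal "$emit" key, recursing into plain (non-array) objects.
const clearObj = (obj) => {
  const result = {};
  for (const [key, value] of Object.entries(obj)) {
    if (value != null && value !== "" && key !== "$emit") {
      result[key] = (!Array.isArray(value) && value === Object(value))
        ? clearObj(value)
        : value;
    }
  }
  return result;
};

// Illustrative input: empty strings and nulls are stripped, nested objects cleaned.
const cleaned = clearObj({
  url: "https://example.com",
  geo: "",
  retryNum: null,
  extractor: {
    fn: "function (input) {}",
    note: "",
  },
});
// cleaned is { url: "https://example.com", extractor: { fn: "function (input) {}" } }
```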

🧰 Tools
🪛 Biome (1.9.4)

[error] 42-42: Avoid the use of spread (...) syntax on accumulators.

Spread syntax should be avoided on accumulators (like those in .reduce) because it causes a time complexity of O(n^2).
Consider methods such as .splice or .push instead.

(lint/performance/noAccumulatingSpread)


26-30: Enhance parseError to handle all possible error cases.

The parseError function doesn't have a fallback if none of the expected properties are present.

export const parseError = (data) => {
  if (data.message) return data.message;
  if (data.stderr) return data.stderr;
  if (data.errors) return Object.entries(data.errors[0])[0][1];
+  return JSON.stringify(data) || "Unknown error occurred";
};

This ensures that some meaningful message is always returned, even for unexpected error structures.

components/scrapeninja/scrapeninja.app.mjs (1)

7-161: Consider adding input validation for critical parameters

While the property definitions are comprehensive with good descriptions, consider adding runtime validation for critical parameters like URL to provide better error messages to users.

For example, you could add a method to validate the URL format before sending to the API:

validateUrl(url) {
  try {
    new URL(url);
    return true;
  } catch (err) {
    throw new Error(`Invalid URL format: ${url}`);
  }
}

And then call this from your scrape methods.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db6687e and cdc2434.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (7)
  • components/scrapeninja/.gitignore (0 hunks)
  • components/scrapeninja/actions/non-js-scraping/non-js-scraping.mjs (1 hunks)
  • components/scrapeninja/actions/scraping-with-js-rendering/scraping-with-js-rendering.mjs (1 hunks)
  • components/scrapeninja/app/scrapeninja.app.ts (0 hunks)
  • components/scrapeninja/common/utils.mjs (1 hunks)
  • components/scrapeninja/package.json (1 hunks)
  • components/scrapeninja/scrapeninja.app.mjs (1 hunks)
💤 Files with no reviewable changes (2)
  • components/scrapeninja/.gitignore
  • components/scrapeninja/app/scrapeninja.app.ts
🧰 Additional context used
🪛 Biome (1.9.4)
components/scrapeninja/common/utils.mjs

[error] 42-42: Avoid the use of spread (...) syntax on accumulators.

Spread syntax should be avoided on accumulators (like those in .reduce) because it causes a time complexity of O(n^2).
Consider methods such as .splice or .push instead.

(lint/performance/noAccumulatingSpread)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: Verify TypeScript components
  • GitHub Check: pnpm publish
  • GitHub Check: Publish TypeScript components
🔇 Additional comments (7)
components/scrapeninja/actions/non-js-scraping/non-js-scraping.mjs (1)

5-10: LGTM! The naming follows the guidelines.

The action name "Scrape without JS" correctly follows the naming convention with a verb as suggested in the prior review comments.

components/scrapeninja/actions/scraping-with-js-rendering/scraping-with-js-rendering.mjs (1)

225-225: Summary message includes useful context.

Good job including the URL in the summary message. This provides better context than the generic message in the non-JS action.

components/scrapeninja/package.json (2)

3-6: Version bump and entry point change look good.

The version bump from 0.0.2 to 0.1.0 is appropriate for the addition of new features, and the main entry point change reflects the updated file structure.


15-17: Dependency addition is necessary and correct.

Adding the @pipedream/platform dependency is necessary as the code imports ConfigurationError from this package.

components/scrapeninja/scrapeninja.app.mjs (3)

32-34: Update proxy property description as previously suggested

The description is clear, but there was a previous suggestion to update this text.

      description: "Premium or your own proxy URL (overrides `Geo` prop). [Read more about ScrapeNinja proxy setup](https://scrapeninja.net/docs/proxy-setup/).",

162-196: LGTM! Well-structured API client methods

The methods are well-organized with a clean separation of concerns:

  • Base URL and headers are properly encapsulated
  • Generic request method with appropriate defaults
  • Specific methods for different scraping endpoints

This follows good practices for API client implementation.


1-6: LGTM! Clean imports and app definition

The import statement and app definition are clean and follow Pipedream conventions.

import scrapeninja from "../../scrapeninja.app.mjs";

export default {
key: "scrapeninja-non-js-scraping",
Collaborator


@luancazarine can you rename the actual action names / slugs? Since they are new components, I think it'd be best for them to match the names before being shipped.

Collaborator


The filenames are still the same; maybe the commit wasn't pushed?
`/actions/non-js-scraping/non-js-scraping.mjs`
should match the action name, as in
`/actions/scrape-without-js/scrape-without-js.mjs`.
The same goes for the component key, for both components.

luancazarine and others added 2 commits February 28, 2025 17:46
…ping-with-js-rendering.mjs

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@luancazarine
Collaborator Author

/approve

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (4)
components/scrapeninja/actions/scrape-without-js/scrape-without-js.mjs (1)

83-99: Validate user-provided URL and properties

The url and other optional props are passed directly to the scrapeNonJs service. If these props can contain invalid or unexpected values (e.g., malformed URLs), consider adding validations or gracefully handling invalid inputs to prevent unexpected errors in the external call.

components/scrapeninja/actions/scrape-with-js-rendering/scrape-with-js-rendering.mjs (1)

186-193: Consider additional logs or metrics for debugging complex rendering flows

When scraping pages that require JS rendering, introducing optional debug logs around waitForSelector, postWaitTime, or viewport could help in diagnosing slow or failing scraping sessions. This is especially useful when debugging complex page interactions or large timeouts.

components/scrapeninja/scrapeninja.app.mjs (2)

163-165: Prefer a configurable base URL

Currently _baseUrl() returns a hardcoded "https://scrapeninja.p.rapidapi.com". If you ever need to switch to different ScrapeNinja environments (e.g., dev, staging), consider making this URL configurable via environment variables.
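One hedged way to do that, assuming an environment variable name like `SCRAPENINJA_BASE_URL` (not part of the PR):

```javascript
// Sketch: resolve the ScrapeNinja base URL from the environment, falling back
// to the production RapidAPI endpoint. SCRAPENINJA_BASE_URL is an assumed name.
const DEFAULT_BASE_URL = "https://scrapeninja.p.rapidapi.com";

const getBaseUrl = (env = process.env) =>
  env.SCRAPENINJA_BASE_URL || DEFAULT_BASE_URL;

// _baseUrl() in the app module could then delegate to getBaseUrl().
```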


166-171: Handle missing Rapid API key gracefully

this.$auth.rapid_api_key is used directly. If the key is absent, your calls will fail. Consider checking for presence or throwing an informative error if the user fails to provide a required key.
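A minimal sketch of that guard, assuming the `$auth` shape used in the PR (a `rapid_api_key` field):

```javascript
// Sketch: fail fast with an informative message when the RapidAPI key is
// missing, instead of letting the HTTP request fail opaquely downstream.
const getApiKey = ($auth) => {
  const key = $auth?.rapid_api_key;
  if (!key) {
    throw new Error(
      "Missing RapidAPI key: set `rapid_api_key` in your ScrapeNinja connection.",
    );
  }
  return key;
};
```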

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cdc2434 and 49a3b46.

📒 Files selected for processing (3)
  • components/scrapeninja/actions/scrape-with-js-rendering/scrape-with-js-rendering.mjs (1 hunks)
  • components/scrapeninja/actions/scrape-without-js/scrape-without-js.mjs (1 hunks)
  • components/scrapeninja/scrapeninja.app.mjs (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: pnpm publish
  • GitHub Check: Verify TypeScript components
  • GitHub Check: Publish TypeScript components
🔇 Additional comments (2)
components/scrapeninja/actions/scrape-without-js/scrape-without-js.mjs (1)

2-2: Confirm the behavior of parseObject for headers

The headers prop is declared as an array of strings, and here it's parsed with parseObject(this.headers). If each array element is expected to be a standard “Key: Value” string, ensure parseObject consistently produces the correct header key-value pairs or returns a meaningful error if the formatting is invalid.

Would you like a quick test or verification script to confirm expected parsing behavior?
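For illustration, a minimal parser for "Key: Value" header strings might look like the sketch below. The real `parseObject` in `common/utils.mjs` may behave differently, so treat this as an assumption to verify:

```javascript
// Sketch (not the actual parseObject implementation): turn an array of
// "Key: Value" strings into a headers object, throwing on malformed entries.
const parseHeaderStrings = (headers = []) =>
  headers.reduce((acc, entry) => {
    const idx = entry.indexOf(":");
    if (idx < 1) {
      throw new Error(`Invalid header "${entry}": expected "Key: Value"`);
    }
    acc[entry.slice(0, idx).trim()] = entry.slice(idx + 1).trim();
    return acc;
  }, {});
```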

components/scrapeninja/actions/scrape-with-js-rendering/scrape-with-js-rendering.mjs (1)

1-5: Validate imports and utilities usage

Imports for clearObj, parseError, and parseObject suggest that objects and errors are filtered or transformed. Confirm the correct usage of these utilities in the rest of the codebase to ensure consistent handling of optional properties (e.g., ensuring we don't parse null or incorrectly format error messages).

Comment on lines +102 to +104
} catch ({ response: { data } }) {
throw new ConfigurationError(data.message || data.stderr);
}
Contributor


⚠️ Potential issue

Guard against missing data in error response

Destructuring ({ response: { data } }) throws a runtime error if response or data is undefined. Add a fallback or restructure the catch block to avoid uncaught exceptions in cases where the error format differs from the expected shape.

} catch (err) {
-  throw new ConfigurationError(err.response.data.message || err.response.data.stderr);
+  const msg = err?.response?.data?.message || err?.response?.data?.stderr || "Unknown error";
+  throw new ConfigurationError(msg);
}

Comment on lines +227 to +229
} catch ({ response: { data } }) {
throw new ConfigurationError(parseError(data));
}
Contributor


⚠️ Potential issue

Prevent crashing on unexpected error shapes

Similar to the other action, the nested destructuring in the catch block can lead to runtime errors if response or data objects are missing. Use safe access or a fallback to avoid an unhandled exception when err has a different structure.

} catch (err) {
-  throw new ConfigurationError(parseError(err.response.data));
+  const safeData = err?.response?.data;
+  throw new ConfigurationError(parseError(safeData));
}

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +173 to +181
_makeRequest({
$ = this, path, ...opts
}) {
return axios($, {
url: this._baseUrl() + path,
headers: this._headers(),
...opts,
});
},
Contributor


🛠️ Refactor suggestion

Centralize or standardize request error handling

_makeRequest merges additional options into a single axios request. If your application frequently encounters network or parsing errors, consider building a unified error-handling layer here to avoid repeating try-catch logic in multiple actions.
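One way to centralize this is a small error normalizer that _makeRequest (or each action) can share. This is a sketch, not code from the PR; in the components the final throw would use `ConfigurationError` from `@pipedream/platform`:

```javascript
// Sketch: extract a human-readable message from an axios-style error without
// crashing on unexpected shapes (network errors, timeouts, missing bodies).
const normalizeRequestError = (err) => {
  const data = err?.response?.data;
  return data?.message || data?.stderr || err?.message || "Unknown error";
};

// In _makeRequest this could wrap the axios call:
//   try { return await axios($, { url: this._baseUrl() + path, ...opts }); }
//   catch (err) { throw new ConfigurationError(normalizeRequestError(err)); }
```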

@GTFalcao GTFalcao merged commit f1a3d24 into master Mar 4, 2025
11 checks passed
@GTFalcao GTFalcao deleted the issue-15137 branch March 4, 2025 00:00
Development

Successfully merging this pull request may close these issues.

[Components] scrapeninja
