Adding web crawler connector #499
Open
samkul-swe wants to merge 9 commits into MODSetter:dev from samkul-swe:feat/webcrawler (base: dev)
+1,319 −629
Commits (showing changes from 7 of 9 commits):
419f94e Merge pull request #498 from MODSetter/dev (MODSetter)
896e410 Webcrawler connector draft (samkul-swe)
8333697 Removed the CRAWLED_URL document processors (samkul-swe)
1480f85 Fixed indexing issue (samkul-swe)
6d19e0f Fixing search logic (samkul-swe)
121e2f0 Renaming resources (samkul-swe)
ad75f81 Cleaning up files (samkul-swe)
5afb421 Fix issues (samkul-swe)
ebea98c Linting fixed (samkul-swe)
surfsense_backend/alembic/versions/38_add_webcrawler_connector_enum.py (59 additions, 0 deletions)
```python
"""Add Webcrawler connector enums

Revision ID: 38
Revises: 37
Create Date: 2025-11-17 17:00:00.000000

"""

from collections.abc import Sequence

from alembic import op

revision: str = "38"
down_revision: str | None = "37"
branch_labels: str | Sequence[str] | None = None
depends_on: str | Sequence[str] | None = None


def upgrade() -> None:
    """Safely add 'WEBCRAWLER_CONNECTOR' to enum types if missing."""

    # Add to searchsourceconnectortype enum
    op.execute(
        """
        DO $$
        BEGIN
            IF NOT EXISTS (
                SELECT 1 FROM pg_type t
                JOIN pg_enum e ON t.oid = e.enumtypid
                WHERE t.typname = 'searchsourceconnectortype' AND e.enumlabel = 'WEBCRAWLER_CONNECTOR'
            ) THEN
                ALTER TYPE searchsourceconnectortype ADD VALUE 'WEBCRAWLER_CONNECTOR';
            END IF;
        END
        $$;
        """
    )

    # Add to documenttype enum
    op.execute(
        """
        DO $$
        BEGIN
            IF NOT EXISTS (
                SELECT 1 FROM pg_type t
                JOIN pg_enum e ON t.oid = e.enumtypid
                WHERE t.typname = 'documenttype' AND e.enumlabel = 'CRAWLED_URL'
            ) THEN
                ALTER TYPE documenttype ADD VALUE 'CRAWLED_URL';
            END IF;
        END
        $$;
        """
    )


def downgrade() -> None:
    """Remove 'WEBCRAWLER_CONNECTOR' from enum types."""
    pass
```
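The `DO $$` guards make `upgrade()` safe to re-run, and the no-op `downgrade()` reflects the fact that PostgreSQL has no `ALTER TYPE ... DROP VALUE`. A quick way to confirm the labels landed after `alembic upgrade head` is to query `pg_enum` directly; the following is a sketch only, and the connection URL is a placeholder:

```python
from sqlalchemy import create_engine, text

# Placeholder DSN; point this at the SurfSense database.
engine = create_engine("postgresql://user:password@localhost:5432/surfsense")

with engine.connect() as conn:
    rows = conn.execute(
        text(
            "SELECT t.typname, e.enumlabel "
            "FROM pg_type t JOIN pg_enum e ON t.oid = e.enumtypid "
            "WHERE e.enumlabel IN ('WEBCRAWLER_CONNECTOR', 'CRAWLED_URL')"
        )
    ).all()

# Expect one row per enum label once the migration has run.
print(rows)
```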
surfsense_backend/app/connectors/webcrawler_connector.py (191 additions, 0 deletions)
```python
"""
WebCrawler Connector Module

A module for crawling web pages and extracting content using Firecrawl or AsyncChromiumLoader.
Provides a unified interface for web scraping.
"""

from typing import Any

import validators
from firecrawl import AsyncFirecrawlApp
from langchain_community.document_loaders import AsyncChromiumLoader


class WebCrawlerConnector:
    """Class for crawling web pages and extracting content."""

    def __init__(self, firecrawl_api_key: str | None = None):
        """
        Initialize the WebCrawlerConnector class.

        Args:
            firecrawl_api_key: Firecrawl API key (optional, will use AsyncChromiumLoader if not provided)
        """
        self.firecrawl_api_key = firecrawl_api_key
        self.use_firecrawl = bool(firecrawl_api_key)

    def set_api_key(self, api_key: str) -> None:
        """
        Set the Firecrawl API key and enable Firecrawl usage.

        Args:
            api_key: Firecrawl API key
        """
        self.firecrawl_api_key = api_key
        self.use_firecrawl = True

    async def crawl_url(
        self, url: str, formats: list[str] | None = None
    ) -> tuple[dict[str, Any] | None, str | None]:
        """
        Crawl a single URL and extract its content.

        Args:
            url: URL to crawl
            formats: List of formats to extract (e.g., ["markdown", "html"]) - only for Firecrawl

        Returns:
            Tuple containing (crawl result dict, error message or None)
            Result dict contains:
            - content: Extracted content (markdown or HTML)
            - metadata: Page metadata (title, description, etc.)
            - source: Original URL
            - crawler_type: Type of crawler used
        """
        try:
            # Validate URL
            if not validators.url(url):
                return None, f"Invalid URL: {url}"

            if self.use_firecrawl:
                result = await self._crawl_with_firecrawl(url, formats)
            else:
                result = await self._crawl_with_chromium(url)

            return result, None

        except Exception as e:
            return None, f"Error crawling URL {url}: {e!s}"

    async def _crawl_with_firecrawl(
        self, url: str, formats: list[str] | None = None
    ) -> dict[str, Any]:
        """
        Crawl URL using Firecrawl.

        Args:
            url: URL to crawl
            formats: List of formats to extract

        Returns:
            Dict containing crawled content and metadata

        Raises:
            ValueError: If Firecrawl scraping fails
        """
        if not self.firecrawl_api_key:
            raise ValueError("Firecrawl API key not set. Call set_api_key() first.")

        firecrawl_app = AsyncFirecrawlApp(api_key=self.firecrawl_api_key)

        # Default to markdown format
        if formats is None:
            formats = ["markdown"]

        scrape_result = await firecrawl_app.scrape_url(url=url, formats=formats)

        if not scrape_result or not scrape_result.success:
            error_msg = (
                scrape_result.error
                if scrape_result and hasattr(scrape_result, "error")
                else "Unknown error"
            )
            raise ValueError(f"Firecrawl failed to scrape URL: {error_msg}")

        # Extract content based on format
        content = scrape_result.markdown or scrape_result.html or ""

        # Extract metadata
        metadata = scrape_result.metadata if scrape_result.metadata else {}

        return {
            "content": content,
            "metadata": {
                "source": url,
                "title": metadata.get("title", url),
                "description": metadata.get("description", ""),
                "language": metadata.get("language", ""),
                "sourceURL": metadata.get("sourceURL", url),
                **metadata,
            },
            "crawler_type": "firecrawl",
        }

    async def _crawl_with_chromium(self, url: str) -> dict[str, Any]:
        """
        Crawl URL using AsyncChromiumLoader.

        Args:
            url: URL to crawl

        Returns:
            Dict containing crawled content and metadata

        Raises:
            Exception: If crawling fails
        """
        crawl_loader = AsyncChromiumLoader(urls=[url], headless=True)
        documents = await crawl_loader.aload()

        if not documents:
            raise ValueError(f"Failed to load content from {url}")

        doc = documents[0]

        # Extract basic metadata from the document
        metadata = doc.metadata if doc.metadata else {}

        return {
            "content": doc.page_content,
            "metadata": {
                "source": url,
                "title": metadata.get("title", url),
                **metadata,
            },
            "crawler_type": "chromium",
        }

    def format_to_structured_document(self, crawl_result: dict[str, Any]) -> str:
        """
        Format crawl result as a structured document.

        Args:
            crawl_result: Result from crawl_url method

        Returns:
            Structured document string
        """
        metadata = crawl_result["metadata"]
        content = crawl_result["content"]

        document_parts = ["<DOCUMENT>", "<METADATA>"]

        # Add all metadata fields
        for key, value in metadata.items():
            document_parts.append(f"{key.upper()}: {value}")

        document_parts.extend(
            [
                "</METADATA>",
                "<CONTENT>",
                "FORMAT: markdown",
                "TEXT_START",
                content,
                "TEXT_END",
                "</CONTENT>",
                "</DOCUMENT>",
            ]
        )

        return "\n".join(document_parts)
```