|
30 | 30 | "| Class | Package | Serializable | JS support | Package latest |\n",
|
31 | 31 | "| :--- | :--- | :---: | :---: | :---: |\n",
|
32 | 32 | "| [SmartScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n",
|
| 33 | + "| [SmartCrawlerTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n", |
33 | 34 | "| [MarkdownifyTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n",
|
34 |
| - "| [LocalScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n", |
35 | 35 | "| [GetCreditsTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ |  |\n",
|
36 | 36 | "\n",
|
37 | 37 | "### Tool features\n",
|
38 | 38 | "\n",
|
39 | 39 | "| Tool | Purpose | Input | Output |\n",
|
40 | 40 | "| :--- | :--- | :--- | :--- |\n",
|
41 | 41 | "| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON |\n",
|
| 42 | + "| SmartCrawlerTool | Extract data from multiple pages with crawling | URL + prompt + crawl options | JSON |\n", |
42 | 43 | "| MarkdownifyTool | Convert webpages to markdown | URL | Markdown text |\n",
|
43 |
| - "| LocalScraperTool | Extract data from HTML content | HTML + prompt | JSON |\n", |
44 | 44 | "| GetCreditsTool | Check API credits | None | Credit info |\n",
|
45 | 45 | "\n",
|
46 | 46 | "\n",
|
|
122 | 122 | },
|
123 | 123 | {
|
124 | 124 | "cell_type": "code",
|
125 |
| - "execution_count": 7, |
| 125 | + "execution_count": null, |
126 | 126 | "id": "8b3ddfe9",
|
127 | 127 | "metadata": {},
|
128 | 128 | "outputs": [],
|
129 | 129 | "source": [
|
| 130 | + "from scrapegraph_py.logger import sgai_logger\n", |
| 131 | + "import json\n", |
| 132 | + "\n", |
130 | 133 | "from langchain_scrapegraph.tools import (\n",
|
131 | 134 | " GetCreditsTool,\n",
|
132 |
| - " LocalScraperTool,\n", |
133 | 135 | " MarkdownifyTool,\n",
|
| 136 | + " SmartCrawlerTool,\n", |
134 | 137 | " SmartScraperTool,\n",
|
135 | 138 | ")\n",
|
136 | 139 | "\n",
|
| 140 | + "sgai_logger.set_logging(level=\"INFO\")\n", |
| 141 | + "\n", |
137 | 142 | "smartscraper = SmartScraperTool()\n",
|
| 143 | + "smartcrawler = SmartCrawlerTool()\n", |
138 | 144 | "markdownify = MarkdownifyTool()\n",
|
139 |
| - "localscraper = LocalScraperTool()\n", |
140 | 145 | "credits = GetCreditsTool()"
|
141 | 146 | ]
|
142 | 147 | },
|
|
152 | 157 | "Let's try each tool individually:"
|
153 | 158 | ]
|
154 | 159 | },
|
| 160 | + { |
| 161 | + "cell_type": "markdown", |
| 162 | + "id": "d5a88cf2", |
| 163 | + "metadata": { |
| 164 | + "vscode": { |
| 165 | + "languageId": "raw" |
| 166 | + } |
| 167 | + }, |
| 168 | + "source": [ |
| 169 | + "### SmartCrawler Tool\n", |
| 170 | + "\n", |
| 171 | + "The SmartCrawlerTool allows you to crawl multiple pages from a website and extract structured data with advanced crawling options like depth control, page limits, and domain restrictions.\n" |
| 172 | + ] |
| 173 | + }, |
155 | 174 | {
|
156 | 175 | "cell_type": "code",
|
157 |
| - "execution_count": 6, |
| 176 | + "execution_count": null, |
158 | 177 | "id": "65310a8b",
|
159 | 178 | "metadata": {},
|
160 | 179 | "outputs": [
|
|
189 | 208 | "markdown = markdownify.invoke({\"website_url\": \"https://scrapegraphai.com\"})\n",
|
190 | 209 | "print(\"\\nMarkdownify Result (first 200 chars):\", markdown[:200])\n",
|
191 | 210 | "\n",
|
192 |
| - "local_html = \"\"\"\n", |
193 |
| - "<html>\n", |
194 |
| - " <body>\n", |
195 |
| - " <h1>Company Name</h1>\n", |
196 |
| - " <p>We are a technology company focused on AI solutions.</p>\n", |
197 |
| - " <div class=\"contact\">\n", |
198 |
| - " <p>Email: [email protected]</p>\n", |
199 |
| - " <p>Phone: (555) 123-4567</p>\n", |
200 |
| - " </div>\n", |
201 |
| - " </body>\n", |
202 |
| - "</html>\n", |
203 |
| - "\"\"\"\n", |
204 |
| - "\n", |
205 |
| - "# LocalScraper\n", |
206 |
| - "result_local = localscraper.invoke(\n", |
| 211 | + "# SmartCrawler\n", |
| 212 | + "url = \"https://scrapegraphai.com/\"\n", |
| 213 | + "prompt = (\n", |
| 214 | + " \"What does the company do? and I need text content from their privacy and terms\"\n", |
| 215 | + ")\n", |
| 216 | + "\n", |
| 217 | + "# Use the tool with crawling parameters\n", |
| 218 | + "result_crawler = smartcrawler.invoke(\n", |
207 | 219 | " {\n",
|
208 |
| - " \"user_prompt\": \"Make a summary of the webpage and extract the email and phone number\",\n", |
209 |
| - " \"website_html\": local_html,\n", |
| 220 | + " \"url\": url,\n", |
| 221 | + " \"prompt\": prompt,\n", |
| 222 | + " \"cache_website\": True,\n", |
| 223 | + " \"depth\": 2,\n", |
| 224 | + " \"max_pages\": 2,\n", |
| 225 | + " \"same_domain_only\": True,\n", |
210 | 226 | " }\n",
|
211 | 227 | ")\n",
|
212 |
| - "print(\"LocalScraper Result:\", result_local)\n", |
| 228 | + "\n", |
| 229 | + "print(\"\\nSmartCrawler Result:\")\n", |
| 230 | + "print(json.dumps(result_crawler, indent=2))\n", |
213 | 231 | "\n",
|
214 | 232 | "# Check credits\n",
|
215 | 233 | "credits_info = credits.invoke({})\n",
|
216 | 234 | "print(\"\\nCredits Info:\", credits_info)"
|
217 | 235 | ]
|
218 | 236 | },
|
| 237 | + { |
| 238 | + "cell_type": "code", |
| 239 | + "execution_count": null, |
| 240 | + "id": "f13fb466", |
| 241 | + "metadata": {}, |
| 242 | + "outputs": [], |
| 243 | + "source": [ |
| 244 | + "# SmartCrawler example\n", |
| 245 | + "from scrapegraph_py.logger import sgai_logger\n", |
| 246 | + "import json\n", |
| 247 | + "\n", |
| 248 | + "from langchain_scrapegraph.tools import SmartCrawlerTool\n", |
| 249 | + "\n", |
| 250 | + "sgai_logger.set_logging(level=\"INFO\")\n", |
| 251 | + "\n", |
| 252 | + "# Will automatically get SGAI_API_KEY from environment\n", |
| 253 | + "tool = SmartCrawlerTool()\n", |
| 254 | + "\n", |
| 255 | + "# Example based on the provided code snippet\n", |
| 256 | + "url = \"https://scrapegraphai.com/\"\n", |
| 257 | + "prompt = (\n", |
| 258 | + " \"What does the company do? and I need text content from their privacy and terms\"\n", |
| 259 | + ")\n", |
| 260 | + "\n", |
| 261 | + "# Use the tool with crawling parameters\n", |
| 262 | + "result = tool.invoke(\n", |
| 263 | + " {\n", |
| 264 | + " \"url\": url,\n", |
| 265 | + " \"prompt\": prompt,\n", |
| 266 | + " \"cache_website\": True,\n", |
| 267 | + " \"depth\": 2,\n", |
| 268 | + " \"max_pages\": 2,\n", |
| 269 | + " \"same_domain_only\": True,\n", |
| 270 | + " }\n", |
| 271 | + ")\n", |
| 272 | + "\n", |
| 273 | + "print(json.dumps(result, indent=2))" |
| 274 | + ] |
| 275 | + }, |
219 | 276 | {
|
220 | 277 | "cell_type": "markdown",
|
221 | 278 | "id": "d6e73897",
|
|
350 | 407 | "source": [
|
351 | 408 | "## API reference\n",
|
352 | 409 | "\n",
|
353 |
| - "For detailed documentation of all ScrapeGraph features and configurations head to the Langchain API reference: https://python.langchain.com/docs/integrations/tools/scrapegraph\n", |
| 410 | + "For detailed documentation of all ScrapeGraph features and configurations head to [the Langchain API reference](https://python.langchain.com/docs/integrations/tools/scrapegraph).\n", |
354 | 411 | "\n",
|
355 |
| - "Or to the official SDK repo: https://github.com/ScrapeGraphAI/langchain-scrapegraph" |
| 412 | + "Or to [the official SDK repo](https://github.com/ScrapeGraphAI/langchain-scrapegraph)." |
356 | 413 | ]
|
357 | 420 | }
|
358 | 421 | ],
|
359 | 422 | "metadata": {
|
360 | 423 | "kernelspec": {
|
361 |
| - "display_name": "Python 3", |
| 424 | + "display_name": "langchain", |
362 | 425 | "language": "python",
|
363 | 426 | "name": "python3"
|
364 | 427 | },
|
|
372 | 435 | "name": "python",
|
373 | 436 | "nbconvert_exporter": "python",
|
374 | 437 | "pygments_lexer": "ipython3",
|
375 |
| - "version": "3.11.9" |
| 438 | + "version": "3.10.16" |
376 | 439 | }
|
377 | 440 | },
|
378 | 441 | "nbformat": 4,
|
|