Commit 26c2c8f

VinciGit00 and mdrxy authored
docs: update ScrapeGraphAI tools (#32026)
It was outdated

Co-authored-by: Mason Daugherty <[email protected]>
1 parent d96b75f commit 26c2c8f

2 files changed: +93 -30 lines changed

docs/docs/integrations/providers/scrapegraph.mdx

Lines changed: 2 additions & 2 deletions
@@ -27,15 +27,15 @@ There are four tools available:
 ```python
 from langchain_scrapegraph.tools import (
     SmartScraperTool, # Extract structured data from websites
+    SmartCrawlerTool, # Extract data from multiple pages with crawling
     MarkdownifyTool, # Convert webpages to markdown
-    LocalScraperTool, # Process local HTML content
     GetCreditsTool, # Check remaining API credits
 )
 ```
 
 Each tool serves a specific purpose:
 
 - `SmartScraperTool`: Extract structured data from websites given a URL, prompt and optional output schema
+- `SmartCrawlerTool`: Extract data from multiple pages with advanced crawling options like depth control, page limits, and domain restrictions
 - `MarkdownifyTool`: Convert any webpage to clean markdown format
-- `LocalScraperTool`: Extract structured data from a local HTML file given a prompt and optional output schema
 - `GetCreditsTool`: Check your remaining ScrapeGraph AI credits
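
Reviewer note: the `SmartCrawlerTool` description mentions depth control, page limits, and domain restrictions without saying how they interact. The sketch below is only an illustration of one plausible reading, not the ScrapeGraph service's actual implementation. It assumes a breadth-first crawl where depth counts link hops from the start URL. The `LINKS` graph, the `plan_crawl` helper, and the `example.com` URLs are all hypothetical; only the option names `depth`, `max_pages`, and `same_domain_only` come from the diff.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory link graph standing in for real web pages.
LINKS = {
    "https://example.com/": [
        "https://example.com/privacy",
        "https://example.com/terms",
        "https://other.org/",
    ],
    "https://example.com/privacy": ["https://example.com/cookies"],
    "https://example.com/terms": [],
    "https://example.com/cookies": [],
    "https://other.org/": [],
}


def plan_crawl(start_url, depth, max_pages, same_domain_only, links=LINKS):
    """Return the URLs a breadth-first crawl would visit under the given limits.

    Assumption: depth 0 means only the start page; each link hop adds 1.
    """
    start_host = urlparse(start_url).netloc
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])

    while queue and len(visited) < max_pages:
        url, level = queue.popleft()
        visited.append(url)
        if level >= depth:
            continue  # do not follow links past the depth limit
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if same_domain_only and urlparse(nxt).netloc != start_host:
                continue  # skip links that leave the start domain
            seen.add(nxt)
            queue.append((nxt, level + 1))

    return visited


if __name__ == "__main__":
    # Same option values as the notebook example in this commit.
    print(plan_crawl("https://example.com/", depth=2, max_pages=2, same_domain_only=True))
```

Under these assumptions `max_pages` is the tightest limit in the notebook's example: with `max_pages=2` the crawl stops after the start page and one linked page, whatever `depth` is set to.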

docs/docs/integrations/tools/scrapegraph.ipynb

Lines changed: 91 additions & 28 deletions
@@ -30,17 +30,17 @@
     "| Class | Package | Serializable | JS support | Package latest |\n",
     "| :--- | :--- | :---: | :---: | :---: |\n",
     "| [SmartScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
+    "| [SmartCrawlerTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
     "| [MarkdownifyTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
-    "| [LocalScraperTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
     "| [GetCreditsTool](https://python.langchain.com/docs/integrations/tools/scrapegraph) | langchain-scrapegraph | ✅ | ❌ | ![PyPI - Version](https://img.shields.io/pypi/v/langchain-scrapegraph?style=flat-square&label=%20) |\n",
     "\n",
     "### Tool features\n",
     "\n",
     "| Tool | Purpose | Input | Output |\n",
     "| :--- | :--- | :--- | :--- |\n",
     "| SmartScraperTool | Extract structured data from websites | URL + prompt | JSON |\n",
+    "| SmartCrawlerTool | Extract data from multiple pages with crawling | URL + prompt + crawl options | JSON |\n",
     "| MarkdownifyTool | Convert webpages to markdown | URL | Markdown text |\n",
-    "| LocalScraperTool | Extract data from HTML content | HTML + prompt | JSON |\n",
     "| GetCreditsTool | Check API credits | None | Credit info |\n",
     "\n",
     "\n",
@@ -122,21 +122,26 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": null,
    "id": "8b3ddfe9",
    "metadata": {},
    "outputs": [],
    "source": [
+    "from scrapegraph_py.logger import sgai_logger\n",
+    "import json\n",
+    "\n",
     "from langchain_scrapegraph.tools import (\n",
     "    GetCreditsTool,\n",
-    "    LocalScraperTool,\n",
     "    MarkdownifyTool,\n",
+    "    SmartCrawlerTool,\n",
     "    SmartScraperTool,\n",
     ")\n",
     "\n",
+    "sgai_logger.set_logging(level=\"INFO\")\n",
+    "\n",
     "smartscraper = SmartScraperTool()\n",
+    "smartcrawler = SmartCrawlerTool()\n",
     "markdownify = MarkdownifyTool()\n",
-    "localscraper = LocalScraperTool()\n",
     "credits = GetCreditsTool()"
    ]
   },
@@ -152,9 +157,23 @@
     "Let's try each tool individually:"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "d5a88cf2",
+   "metadata": {
+    "vscode": {
+     "languageId": "raw"
+    }
+   },
+   "source": [
+    "### SmartCrawler Tool\n",
+    "\n",
+    "The SmartCrawlerTool allows you to crawl multiple pages from a website and extract structured data with advanced crawling options like depth control, page limits, and domain restrictions.\n"
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "id": "65310a8b",
    "metadata": {},
    "outputs": [
@@ -189,33 +208,71 @@
     "markdown = markdownify.invoke({\"website_url\": \"https://scrapegraphai.com\"})\n",
     "print(\"\\nMarkdownify Result (first 200 chars):\", markdown[:200])\n",
     "\n",
-    "local_html = \"\"\"\n",
-    "<html>\n",
-    "    <body>\n",
-    "        <h1>Company Name</h1>\n",
-    "        <p>We are a technology company focused on AI solutions.</p>\n",
-    "        <div class=\"contact\">\n",
-    "            <p>Email: [email protected]</p>\n",
-    "            <p>Phone: (555) 123-4567</p>\n",
-    "        </div>\n",
-    "    </body>\n",
-    "</html>\n",
-    "\"\"\"\n",
-    "\n",
-    "# LocalScraper\n",
-    "result_local = localscraper.invoke(\n",
+    "# SmartCrawler\n",
+    "url = \"https://scrapegraphai.com/\"\n",
+    "prompt = (\n",
+    "    \"What does the company do? and I need text content from their privacy and terms\"\n",
+    ")\n",
+    "\n",
+    "# Use the tool with crawling parameters\n",
+    "result_crawler = smartcrawler.invoke(\n",
     "    {\n",
-    "        \"user_prompt\": \"Make a summary of the webpage and extract the email and phone number\",\n",
-    "        \"website_html\": local_html,\n",
+    "        \"url\": url,\n",
+    "        \"prompt\": prompt,\n",
+    "        \"cache_website\": True,\n",
+    "        \"depth\": 2,\n",
+    "        \"max_pages\": 2,\n",
+    "        \"same_domain_only\": True,\n",
     "    }\n",
     ")\n",
-    "print(\"LocalScraper Result:\", result_local)\n",
+    "\n",
+    "print(\"\\nSmartCrawler Result:\")\n",
+    "print(json.dumps(result_crawler, indent=2))\n",
     "\n",
     "# Check credits\n",
     "credits_info = credits.invoke({})\n",
     "print(\"\\nCredits Info:\", credits_info)"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f13fb466",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# SmartCrawler example\n",
+    "from scrapegraph_py.logger import sgai_logger\n",
+    "import json\n",
+    "\n",
+    "from langchain_scrapegraph.tools import SmartCrawlerTool\n",
+    "\n",
+    "sgai_logger.set_logging(level=\"INFO\")\n",
+    "\n",
+    "# Will automatically get SGAI_API_KEY from environment\n",
+    "tool = SmartCrawlerTool()\n",
+    "\n",
+    "# Example based on the provided code snippet\n",
+    "url = \"https://scrapegraphai.com/\"\n",
+    "prompt = (\n",
+    "    \"What does the company do? and I need text content from their privacy and terms\"\n",
+    ")\n",
+    "\n",
+    "# Use the tool with crawling parameters\n",
+    "result = tool.invoke(\n",
+    "    {\n",
+    "        \"url\": url,\n",
+    "        \"prompt\": prompt,\n",
+    "        \"cache_website\": True,\n",
+    "        \"depth\": 2,\n",
+    "        \"max_pages\": 2,\n",
+    "        \"same_domain_only\": True,\n",
+    "    }\n",
+    ")\n",
+    "\n",
+    "print(json.dumps(result, indent=2))"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "d6e73897",
@@ -350,15 +407,21 @@
    "source": [
     "## API reference\n",
     "\n",
-    "For detailed documentation of all ScrapeGraph features and configurations head to the Langchain API reference: https://python.langchain.com/docs/integrations/tools/scrapegraph\n",
+    "For detailed documentation of all ScrapeGraph features and configurations head to [the Langchain API reference](https://python.langchain.com/docs/integrations/tools/scrapegraph).\n",
     "\n",
-    "Or to the official SDK repo: https://github.com/ScrapeGraphAI/langchain-scrapegraph"
+    "Or to [the official SDK repo](https://github.com/ScrapeGraphAI/langchain-scrapegraph)."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d710dad8",
+   "metadata": {},
+   "source": []
   }
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "langchain",
    "language": "python",
    "name": "python3"
   },
@@ -372,7 +435,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.9"
+   "version": "3.10.16"
   }
  },
  "nbformat": 4,
