Commit b6f7b64

Merge pull request #290 from VinciGit00/pre/beta
Pre/beta
2 parents 0ba3a59 + 1cb71ed commit b6f7b64

77 files changed (+2843, -287 lines changed)

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -32,5 +32,7 @@ examples/graph_examples/ScrapeGraphAI_generated_graph
 examples/**/result.csv
 examples/**/result.json
 main.py
+lib/
+*.html
+.idea
 
-
CHANGELOG.md

Lines changed: 5 additions & 1 deletion
@@ -1,4 +1,5 @@
-## [1.4.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.3.2...v1.4.0) (2024-05-22)
+
+## [1.4.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0-beta.1...v1.4.0-beta.2) (2024-05-19)
 
 
 ### Features
@@ -19,13 +20,16 @@
 
 * add deepseek embeddings ([659fad7](https://github.com/VinciGit00/Scrapegraph-ai/commit/659fad770a5b6ace87511513e5233a3bc1269009))
 
+
 ## [1.3.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.4...v1.3.0) (2024-05-19)
 
 
+
 ### Features
 
 * add new model ([8c7afa7](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c7afa7570f0a104578deb35658168435cfe5ae1))
 
+
 ## [1.2.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.2.3...v1.2.4) (2024-05-17)
 
 
README.md

Lines changed: 1 addition & 4 deletions
@@ -22,10 +22,6 @@ The reference page for Scrapegraph-ai is available on the official page of pypy:
 ```bash
 pip install scrapegraphai
 ```
-you will also need to install Playwright for javascript-based scraping:
-```bash
-playwright install
-```
 
 **Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱
 
@@ -49,6 +45,7 @@ There are three main scraping pipelines that can be used to extract information
 - `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
 - `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
 - `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
+- `SmartScraperMultiGraph`: multiple page scraper given a single prompt
 
 It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
 
examples/bedrock/.env.example

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+AWS_ACCESS_KEY_ID="..."
+AWS_SECRET_ACCESS_KEY="..."
+AWS_SESSION_TOKEN="..."
+AWS_DEFAULT_REGION="..."
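
The four variables above are what the Bedrock examples expect `load_dotenv()` to put into the environment. A minimal sketch of a pre-flight check, assuming that a value left as the `"..."` placeholder should count as unset; the helper name `missing_aws_vars` is hypothetical and not part of this commit. Note that `AWS_SESSION_TOKEN` is only required for temporary credentials, so a stricter or looser list may suit a given setup.

```python
import os
from typing import List, Mapping

# Names taken from examples/bedrock/.env.example in this commit.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_DEFAULT_REGION",
)


def missing_aws_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are absent, empty, or still '...'."""
    # Assumption: the literal "..." placeholder means the value was never filled in.
    return [
        name
        for name in REQUIRED_AWS_VARS
        if not env.get(name) or env.get(name) == "..."
    ]


# Usage with an explicit mapping, so the result does not depend on the real environment.
example_env = {name: "dummy-value" for name in REQUIRED_AWS_VARS}
example_env["AWS_SESSION_TOKEN"] = "..."
print(missing_aws_vars(example_env))
```

Calling `missing_aws_vars()` with no argument inspects the live process environment, which is what a script would do right after `load_dotenv()`.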

examples/bedrock/README.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+This folder contains examples of how to use ScrapeGraphAI with [Amazon Bedrock](https://aws.amazon.com/bedrock/) ⛰️. The examples show how to extract information from websites and files using a natural language prompt.
+
+![](scrapegraphai_bedrock.png)
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+"""
+Basic example of scraping pipeline using CSVScraperGraph from CSV documents
+"""
+
+import os
+import json
+
+from dotenv import load_dotenv
+
+import pandas as pd
+
+from scrapegraphai.graphs import CSVScraperGraph
+from scrapegraphai.utils import convert_to_csv, convert_to_json, prettify_exec_info
+
+load_dotenv()
+
+# ************************************************
+# Read the CSV file
+# ************************************************
+
+FILE_NAME = "inputs/username.csv"
+curr_dir = os.path.dirname(os.path.realpath(__file__))
+file_path = os.path.join(curr_dir, FILE_NAME)
+
+text = pd.read_csv(file_path)
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
+        "temperature": 0.0
+    },
+    "embeddings": {
+        "model": "bedrock/cohere.embed-multilingual-v3"
+    }
+}
+
+# ************************************************
+# Create the CSVScraperGraph instance and run it
+# ************************************************
+
+csv_scraper_graph = CSVScraperGraph(
+    prompt="List me all the last names",
+    source=str(text),  # Pass the content of the file, not the file object
+    config=graph_config
+)
+
+result = csv_scraper_graph.run()
+print(json.dumps(result, indent=4))
+
+# ************************************************
+# Get graph execution info
+# ************************************************
+
+graph_exec_info = csv_scraper_graph.get_execution_info()
+print(prettify_exec_info(graph_exec_info))
+
+# Save to json or csv
+convert_to_csv(result, "result")
+convert_to_json(result, "result")
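
The example above passes `str(text)` as the source, that is, the console rendering of a pandas `DataFrame`, which pandas abbreviates for long frames according to its display options. A minimal standard-library sketch of one alternative that serialises every row as plain text; the helper `csv_to_prompt_text` and the sample columns are hypothetical and not part of this commit.

```python
import csv
import io


def csv_to_prompt_text(csv_text: str) -> str:
    """Turn CSV content into one 'column=value' line per row, without truncation."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [
        ", ".join(f"{column}={value}" for column, value in row.items())
        for row in reader
    ]
    return "\n".join(lines)


# Hypothetical sample data standing in for inputs/username.csv.
SAMPLE_CSV = "first_name,last_name\nAda,Lovelace\nAlan,Turing\n"

prompt_text = csv_to_prompt_text(SAMPLE_CSV)
print(prompt_text)
```

Whether this form or the `DataFrame` rendering works better as LLM input is untested here; the sketch only shows how to avoid the abbreviated output.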
Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
+"""
+Example of custom graph using existing nodes
+"""
+
+import json
+
+from dotenv import load_dotenv
+
+from langchain_aws import BedrockEmbeddings
+from scrapegraphai.models import Bedrock
+from scrapegraphai.graphs import BaseGraph
+from scrapegraphai.nodes import (
+    FetchNode,
+    ParseNode,
+    RAGNode,
+    GenerateAnswerNode,
+    RobotsNode
+)
+
+load_dotenv()
+
+# ************************************************
+# Define the configuration for the graph
+# ************************************************
+
+graph_config = {
+    "llm": {
+        "model": "bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
+        "temperature": 0.0
+    },
+    "embeddings": {
+        "model": "bedrock/cohere.embed-multilingual-v3"
+    }
+}
+
+# ************************************************
+# Define the graph nodes
+# ************************************************
+
+llm_model = Bedrock({
+    'model_id': graph_config["llm"]["model"].split("/")[-1],
+    'model_kwargs': {
+        'temperature': 0.0
+    }})
+embedder = BedrockEmbeddings(model_id=graph_config["embeddings"]["model"].split("/")[-1])
+
+# Define the nodes for the graph
+robot_node = RobotsNode(
+    input="url",
+    output=["is_scrapable"],
+    node_config={
+        "llm_model": llm_model,
+        "force_scraping": True,
+        "verbose": True,
+    }
+)
+
+fetch_node = FetchNode(
+    input="url | local_dir",
+    output=["doc", "link_urls", "img_urls"],
+    node_config={
+        "verbose": True,
+        "headless": True,
+    }
+)
+
+parse_node = ParseNode(
+    input="doc",
+    output=["parsed_doc"],
+    node_config={
+        "chunk_size": 4096,
+        "verbose": True,
+    }
+)
+
+rag_node = RAGNode(
+    input="user_prompt & (parsed_doc | doc)",
+    output=["relevant_chunks"],
+    node_config={
+        "llm_model": llm_model,
+        "embedder_model": embedder,
+        "verbose": True,
+    }
+)
+
+generate_answer_node = GenerateAnswerNode(
+    input="user_prompt & (relevant_chunks | parsed_doc | doc)",
+    output=["answer"],
+    node_config={
+        "llm_model": llm_model,
+        "verbose": True,
+    }
+)
+
+# ************************************************
+# Create the graph by defining the connections
+# ************************************************
+
+graph = BaseGraph(
+    nodes=[
+        robot_node,
+        fetch_node,
+        parse_node,
+        rag_node,
+        generate_answer_node,
+    ],
+    edges=[
+        (robot_node, fetch_node),
+        (fetch_node, parse_node),
+        (parse_node, rag_node),
+        (rag_node, generate_answer_node)
+    ],
+    entry_point=robot_node
+)
+
+# ************************************************
+# Execute the graph
+# ************************************************
+
+result, execution_info = graph.execute({
+    "user_prompt": "List me all the articles",
+    "url": "https://perinim.github.io/projects"
+})
+
+# Get the answer from the result
+result = result.get("answer", "No answer found.")
+print(json.dumps(result, indent=4))
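
The custom graph above is defined by a node list, an edge list and an entry point, and each node reads keys from a shared state and writes new ones. A toy sketch of that idea using only the standard library; it is not ScrapeGraphAI's `BaseGraph` implementation, and `execute_linear_graph` plus the three stand-in nodes are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]
Node = Callable[[State], State]


def execute_linear_graph(
    nodes: Dict[str, Node],
    edges: List[Tuple[str, str]],
    entry_point: str,
    initial_state: State,
) -> Tuple[State, List[str]]:
    """Run nodes one after another, following edges from the entry point."""
    # Assumption: each node has at most one outgoing edge (a simple chain).
    successor = dict(edges)
    state = dict(initial_state)
    visited: List[str] = []
    current: Optional[str] = entry_point
    while current is not None:
        if current in visited:
            raise ValueError(f"cycle detected at node {current!r}")
        state = nodes[current](state)  # each node returns the updated state
        visited.append(current)
        current = successor.get(current)
    return state, visited


# Hypothetical stand-ins for the fetch, parse and answer steps.
def fetch(state: State) -> State:
    return {**state, "doc": f"<html>{state['url']}</html>"}


def parse(state: State) -> State:
    return {**state, "parsed_doc": str(state["doc"]).upper()}


def answer(state: State) -> State:
    return {**state, "answer": f"{state['user_prompt']} -> parsed {state['url']}"}


final_state, order = execute_linear_graph(
    nodes={"fetch": fetch, "parse": parse, "answer": answer},
    edges=[("fetch", "parse"), ("parse", "answer")],
    entry_point="fetch",
    initial_state={"user_prompt": "List me all the articles", "url": "https://example.com"},
)
print(order)
print(final_state.get("answer", "No answer found."))
```

As in the example file, the caller reads the final `answer` key from the returned state and falls back to a default when it is missing.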

examples/bedrock/inputs/books.xml

Lines changed: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
+<?xml version="1.0"?>
+<catalog>
+   <book id="bk101">
+      <author>Gambardella, Matthew</author>
+      <title>XML Developer's Guide</title>
+      <genre>Computer</genre>
+      <price>44.95</price>
+      <publish_date>2000-10-01</publish_date>
+      <description>An in-depth look at creating applications
+      with XML.</description>
+   </book>
+   <book id="bk102">
+      <author>Ralls, Kim</author>
+      <title>Midnight Rain</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2000-12-16</publish_date>
+      <description>A former architect battles corporate zombies,
+      an evil sorceress, and her own childhood to become queen
+      of the world.</description>
+   </book>
+   <book id="bk103">
+      <author>Corets, Eva</author>
+      <title>Maeve Ascendant</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2000-11-17</publish_date>
+      <description>After the collapse of a nanotechnology
+      society in England, the young survivors lay the
+      foundation for a new society.</description>
+   </book>
+   <book id="bk104">
+      <author>Corets, Eva</author>
+      <title>Oberon's Legacy</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2001-03-10</publish_date>
+      <description>In post-apocalypse England, the mysterious
+      agent known only as Oberon helps to create a new life
+      for the inhabitants of London. Sequel to Maeve
+      Ascendant.</description>
+   </book>
+   <book id="bk105">
+      <author>Corets, Eva</author>
+      <title>The Sundered Grail</title>
+      <genre>Fantasy</genre>
+      <price>5.95</price>
+      <publish_date>2001-09-10</publish_date>
+      <description>The two daughters of Maeve, half-sisters,
+      battle one another for control of England. Sequel to
+      Oberon's Legacy.</description>
+   </book>
+   <book id="bk106">
+      <author>Randall, Cynthia</author>
+      <title>Lover Birds</title>
+      <genre>Romance</genre>
+      <price>4.95</price>
+      <publish_date>2000-09-02</publish_date>
+      <description>When Carla meets Paul at an ornithology
+      conference, tempers fly as feathers get ruffled.</description>
+   </book>
+   <book id="bk107">
+      <author>Thurman, Paula</author>
+      <title>Splish Splash</title>
+      <genre>Romance</genre>
+      <price>4.95</price>
+      <publish_date>2000-11-02</publish_date>
+      <description>A deep sea diver finds true love twenty
+      thousand leagues beneath the sea.</description>
+   </book>
+   <book id="bk108">
+      <author>Knorr, Stefan</author>
+      <title>Creepy Crawlies</title>
+      <genre>Horror</genre>
+      <price>4.95</price>
+      <publish_date>2000-12-06</publish_date>
+      <description>An anthology of horror stories about roaches,
+      centipedes, scorpions and other insects.</description>
+   </book>
+   <book id="bk109">
+      <author>Kress, Peter</author>
+      <title>Paradox Lost</title>
+      <genre>Science Fiction</genre>
+      <price>6.95</price>
+      <publish_date>2000-11-02</publish_date>
+      <description>After an inadvertant trip through a Heisenberg
+      Uncertainty Device, James Salway discovers the problems
+      of being quantum.</description>
+   </book>
+   <book id="bk110">
+      <author>O'Brien, Tim</author>
+      <title>Microsoft .NET: The Programming Bible</title>
+      <genre>Computer</genre>
+      <price>36.95</price>
+      <publish_date>2000-12-09</publish_date>
+      <description>Microsoft's .NET initiative is explored in
+      detail in this deep programmer's reference.</description>
+   </book>
+   <book id="bk111">
+      <author>O'Brien, Tim</author>
+      <title>MSXML3: A Comprehensive Guide</title>
+      <genre>Computer</genre>
+      <price>36.95</price>
+      <publish_date>2000-12-01</publish_date>
+      <description>The Microsoft MSXML3 parser is covered in
+      detail, with attention to XML DOM interfaces, XSLT processing,
+      SAX and more.</description>
+   </book>
+   <book id="bk112">
+      <author>Galos, Mike</author>
+      <title>Visual Studio 7: A Comprehensive Guide</title>
+      <genre>Computer</genre>
+      <price>49.95</price>
+      <publish_date>2001-04-16</publish_date>
+      <description>Microsoft Visual Studio 7 is explored in depth,
+      looking at how Visual Basic, Visual C++, C#, and ASP+ are
+      integrated into a comprehensive development
+      environment.</description>
+   </book>
+</catalog>
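
The catalog above is the input the Bedrock XML example feeds to the LLM. For checking a model's answers against ground truth, a deterministic parse is handy. A minimal sketch with the standard library's `xml.etree.ElementTree`, run on a two-book excerpt of the file; the helper `titles_by_genre` is a hypothetical name and not part of this commit.

```python
import xml.etree.ElementTree as ET
from typing import List

# Two-book excerpt of examples/bedrock/inputs/books.xml (descriptions omitted).
CATALOG_EXCERPT = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
   </book>
</catalog>
"""


def titles_by_genre(xml_text: str, genre: str) -> List[str]:
    """Return the titles of all <book> elements whose <genre> matches exactly."""
    root = ET.fromstring(xml_text)
    return [
        book.findtext("title", default="")
        for book in root.iter("book")
        if book.findtext("genre") == genre
    ]


print(titles_by_genre(CATALOG_EXCERPT, "Fantasy"))
```

The same function can be pointed at the full file by reading it into a string first.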
