
Commit b99e362

Merge pull request #117 from VinciGit00/pre/beta
2 parents: dc97c60 + 9356124


53 files changed: +2313 additions, −2953 deletions

.gitignore

Lines changed: 0 additions & 1 deletion
```diff
@@ -29,7 +29,6 @@ venv/
 *.google-cookie
 examples/graph_examples/ScrapeGraphAI_generated_graph
 examples/**/*.csv
-examples/**/*.json
 main.py
 poetry.lock
```

CHANGELOG.md

Lines changed: 50 additions & 0 deletions
## [0.5.0-beta.6](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.5...v0.5.0-beta.6) (2024-04-30)

### Features

* added verbose flag to suppress print statements ([2dd7817](https://github.com/VinciGit00/Scrapegraph-ai/commit/2dd7817cfb37cfbeb7e65b3a24655ab238f48026))

## [0.5.0-beta.5](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.4...v0.5.0-beta.5) (2024-04-30)

### Features

* **refactor:** changed variable names ([8fba7e5](https://github.com/VinciGit00/Scrapegraph-ai/commit/8fba7e5490f916b325588443bba3fff5c0733c17))

## [0.5.0-beta.4](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.3...v0.5.0-beta.4) (2024-04-30)

### Bug Fixes

* script generator and add new benchmarks ([e3d0194](https://github.com/VinciGit00/Scrapegraph-ai/commit/e3d0194dc93b20dc254fc48bba11559bf8a3a185))

## [0.5.0-beta.3](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.2...v0.5.0-beta.3) (2024-04-30)

### Features

* add cluade integration ([e0ffc83](https://github.com/VinciGit00/Scrapegraph-ai/commit/e0ffc838b06c0f024026a275fc7f7b4243ad5cf9))

## [0.5.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.1...v0.5.0-beta.2) (2024-04-30)

### Features

* **fetch:** added playwright support ([42ab0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/42ab0aa1d275b5798ab6fc9feea575fe59b6e767))

## [0.5.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.4.1...v0.5.0-beta.1) (2024-04-30)

### Features

* add co-author ([719a353](https://github.com/VinciGit00/Scrapegraph-ai/commit/719a353410992cc96f46ec984a5d3ec372e71ad2))
* base groq + requirements + toml update with groq ([7dd5b1a](https://github.com/VinciGit00/Scrapegraph-ai/commit/7dd5b1a03327750ffa5b2fb647eda6359edd1fc2))
* **llm:** implemented groq model ([dbbf10f](https://github.com/VinciGit00/Scrapegraph-ai/commit/dbbf10fc77b34d99d64c6cd7f74524b6d8e57fa5))
* updated requirements.txt ([d368725](https://github.com/VinciGit00/Scrapegraph-ai/commit/d36872518a6d234eba5f8b7ddca7da93797874b2))

### CI

* **release:** 0.4.0-beta.3 [skip ci] ([d13321b](https://github.com/VinciGit00/Scrapegraph-ai/commit/d13321b2f86d98e2a3a0c563172ca0dd29cdf5fb))

## [0.4.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.4.0...v0.4.1) (2024-04-28)
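The 0.5.0-beta.6 entry adds a verbose flag that suppresses print statements. A minimal sketch of the general pattern behind such a flag, assuming a simple stdout-redirect approach (`run_step` and its internals are illustrative, not the library's actual implementation):

```python
import io
from contextlib import redirect_stdout

def run_step(verbose: bool = True) -> str:
    """Run a pipeline step; progress messages are printed only when verbose is True."""
    def step() -> str:
        print("Executing FetchNode...")  # illustrative progress message
        return "fetched"
    if verbose:
        return step()
    # Silence prints by redirecting stdout into a throwaway buffer
    with redirect_stdout(io.StringIO()):
        return step()

quiet = run_step(verbose=False)  # runs the step without printing anything
```

The return value is unaffected either way; only the side-effect printing is suppressed.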

README.md

Lines changed: 37 additions & 1 deletion
````diff
@@ -23,6 +23,10 @@ The reference page for Scrapegraph-ai is available on the official page of pypy:
 ```bash
 pip install scrapegraphai
 ```
+you will also need to install Playwright for javascript-based scraping:
+```bash
+playwright install
+```
 ## 🔍 Demo
 Official streamlit demo:
 
@@ -46,6 +50,7 @@ You can use the `SmartScraper` class to extract information from a website using
 The `SmartScraper` class is a direct graph implementation that uses the most common nodes present in a web scraping pipeline. For more information, please see the [documentation](https://scrapegraph-ai.readthedocs.io/en/latest/).
 ### Case 1: Extracting information using Ollama
 Remember to download the model on Ollama separately!
+
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
 
@@ -129,7 +134,38 @@ result = smart_scraper_graph.run()
 print(result)
 ```
 
-### Case 4: Extracting information using Gemini
+### Case 4: Extracting information using Groq
+```python
+from scrapegraphai.graphs import SmartScraperGraph
+from scrapegraphai.utils import prettify_exec_info
+
+groq_key = os.getenv("GROQ_APIKEY")
+
+graph_config = {
+    "llm": {
+        "model": "groq/gemma-7b-it",
+        "api_key": groq_key,
+        "temperature": 0
+    },
+    "embeddings": {
+        "model": "ollama/nomic-embed-text",
+        "temperature": 0,
+        "base_url": "http://localhost:11434",
+    },
+    "headless": False
+}
+
+smart_scraper_graph = SmartScraperGraph(
+    prompt="List me all the projects with their description and the author.",
+    source="https://perinim.github.io/projects",
+    config=graph_config
+)
+
+result = smart_scraper_graph.run()
+print(result)
+```
+
+### Case 5: Extracting information using Gemini
 ```python
 from scrapegraphai.graphs import SmartScraperGraph
 GOOGLE_APIKEY = "YOUR_API_KEY"
````
Lines changed: 15 additions & 13 deletions
```diff
@@ -1,4 +1,5 @@
 # Local models
+# Local models
 The two websites benchmark are:
 - Example 1: https://perinim.github.io/projects
 - Example 2: https://www.wired.com (at 17/4/2024)
@@ -9,14 +10,12 @@ The time is measured in seconds
 
 The model runned for this benchmark is Mistral on Ollama with nomic-embed-text
 
-In particular, is tested with ScriptCreatorGraph
-
 | Hardware               | Model                                   | Example 1 | Example 2 |
 | ---------------------- | --------------------------------------- | --------- | --------- |
 | Macbook 14' m1 pro     | Mistral on Ollama with nomic-embed-text | 30.54s    | 35.76s    |
-| Macbook m2 max         | Mistral on Ollama with nomic-embed-text | 18,46s    | 19.59     |
-| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text  | 27.82s    | 29.98s    |
-| Macbook m2 max<br>     | Llama3 on Ollama with nomic-embed-text  | 20.83s    | 12.29s    |
+| Macbook m2 max         | Mistral on Ollama with nomic-embed-text |           |           |
+| Macbook 14' m1 pro<br> | Llama3 on Ollama with nomic-embed-text  | 27.82s    | 29.986s   |
+| Macbook m2 max<br>     | Llama3 on Ollama with nomic-embed-text  |           |           |
 
 
 **Note**: the examples on Docker are not runned on other devices than the Macbook because the performance are to slow (10 times slower than Ollama).
@@ -25,17 +24,20 @@ In particular, is tested with ScriptCreatorGraph
 **URL**: https://perinim.github.io/projects
 **Task**: List me all the projects with their description.
 
-| Name                | Execution time | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| ------------------- | -------------- | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo       | 4.50s          | 1897         | 1802          | 95                | 1                   | 0.002893       |
-| gpt-4-turbo         | 7.88s          | 1920         | 1802          | 118               | 1                   | 0.02156        |
+| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo               | 24.21                    | 1892         | 1802          | 90                | 1                   | 0.002883       |
+| gpt-4-turbo-preview         | 6.614                    | 1936         | 1802          | 134               | 1                   | 0.02204        |
+| Grooq with nomic-embed-text | 6.71                     | 2201         | 2024          | 177               | 1                   | 0              |
 
 ### Example 2: Wired
 **URL**: https://www.wired.com
 **Task**: List me all the articles with their description.
 
-| Name                | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| ------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo       | Error (text too long)    | -            | -             | -                 | -                   | -              |
-| gpt-4-turbo         | Error (TPM limit reach)  | -            | -             | -                 | -                   | -              |
+| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo               |                          |              |               |                   |                     |                |
+| gpt-4-turbo-preview         |                          |              |               |                   |                     |                |
+| Grooq with nomic-embed-text |                          |              |               |                   |                     |                |
 
```
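The total_cost_USD values in these benchmark tables follow directly from the token counts. A minimal sketch of the arithmetic, assuming per-1K-token prices of $0.0015 (prompt) and $0.002 (completion) for gpt-3.5-turbo, which are the rates the gpt-3.5-turbo row appears to use:

```python
def usage_cost(prompt_tokens: int, completion_tokens: int,
               prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Cost in USD for one request, with prompt and completion tokens billed separately."""
    return (prompt_tokens / 1000 * prompt_price_per_1k
            + completion_tokens / 1000 * completion_price_per_1k)

# gpt-3.5-turbo row of Example 1: 1802 prompt tokens + 90 completion tokens
cost = usage_cost(1802, 90, 0.0015, 0.002)
print(round(cost, 6))  # matches the 0.002883 figure in the table
```

The Grooq rows report a cost of 0 because the endpoint was free at the time of the benchmark.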

Lines changed: 61 additions & 0 deletions
```python
"""
Basic example of scraping pipeline using SmartScraper from text
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import ScriptCreatorGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

# ************************************************
# Read the text file
# ************************************************
files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
         "List me all the articles with their description."]

# ************************************************
# Define the configuration for the graph
# ************************************************

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": groq_key,
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "headless": False,
    "library": "beautifoulsoup"
}


# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
    with open(files[i], 'r', encoding="utf-8") as file:
        text = file.read()

    smart_scraper_graph = ScriptCreatorGraph(
        prompt=tasks[i],
        source=text,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

    # ************************************************
    # Get graph execution info
    # ************************************************

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))
```
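The script reads `GROQ_APIKEY` with `os.getenv`, which silently returns `None` when the variable is missing and surfaces only later as a confusing API error. A small hedged sketch of failing fast instead (`require_env` is a hypothetical helper, not part of scrapegraphai):

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, raising a clear error if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file")
    return value

# Demo only: provide a placeholder value so the call below succeeds
os.environ.setdefault("GROQ_APIKEY", "dummy-key-for-demo")
groq_key = require_env("GROQ_APIKEY")
```

Calling `require_env("GROQ_APIKEY")` before building `graph_config` turns a missing key into an immediate, readable failure.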

examples/benchmarks/GenerateScraper/benchmark_llama3.py

Lines changed: 0 additions & 5 deletions
```diff
@@ -2,11 +2,8 @@
 Basic example of scraping pipeline using SmartScraper from text
 """
 
-import os
-from dotenv import load_dotenv
 from scrapegraphai.graphs import ScriptCreatorGraph
 from scrapegraphai.utils import prettify_exec_info
-load_dotenv()
 
 # ************************************************
 # Read the text file
@@ -19,8 +16,6 @@
 # Define the configuration for the graph
 # ************************************************
 
-openai_key = os.getenv("GPT4_KEY")
-
 
 graph_config = {
     "llm": {
```

examples/benchmarks/SmartScraper/Readme.md

Lines changed: 14 additions & 14 deletions
```diff
@@ -5,37 +5,37 @@ The two websites benchmark are:
 
 Both are strored locally as txt file in .txt format because in this way we do not have to think about the internet connection
 
-In particular, is tested with SmartScraper
-
-| Hardware           | Moodel                                  | Example 1 | Example 2 |
+| Hardware           | Model                                   | Example 1 | Example 2 |
 | ------------------ | --------------------------------------- | --------- | --------- |
 | Macbook 14' m1 pro | Mistral on Ollama with nomic-embed-text | 11.60s    | 26.61s    |
 | Macbook m2 max     | Mistral on Ollama with nomic-embed-text | 8.05s     | 12.17s    |
-| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text  | 29.871s   | 35.32s    |
+| Macbook 14' m1 pro | Llama3 on Ollama with nomic-embed-text  | 29.87s    | 35.32s    |
 | Macbook m2 max     | Llama3 on Ollama with nomic-embed-text  | 18.36s    | 78.32s    |
 
-
 **Note**: the examples on Docker are not runned on other devices than the Macbook because the performance are to slow (10 times slower than Ollama). Indeed the results are the following:
 
 | Hardware           | Example 1 | Example 2 |
 | ------------------ | --------- | --------- |
-| Macbook 14' m1 pro | 139.89s   | Too long  |
+| Macbook 14' m1 pro | 139.89    | Too long  |
 # Performance on APIs services
 ### Example 1: personal portfolio
 **URL**: https://perinim.github.io/projects
 **Task**: List me all the projects with their description.
 
-| Name                | Execution time | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| ------------------- | -------------- | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo       | 5.58s          | 445          | 272           | 173               | 1                   | 0.000754       |
-| gpt-4-turbo         | 9.76s          | 445          | 272           | 173               | 1                   | 0.00791        |
+| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo               | 25.22                    | 445          | 272           | 173               | 1                   | 0.000754       |
+| gpt-4-turbo-preview         | 9.53                     | 449          | 272           | 177               | 1                   | 0.00803        |
+| Grooq with nomic-embed-text | 1.99                     | 474          | 284           | 190               | 1                   | 0              |
 
 ### Example 2: Wired
 **URL**: https://www.wired.com
 **Task**: List me all the articles with their description.
 
-| Name                | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
-| ------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
-| gpt-3.5-turbo       | 6.50                     | 2442         | 2199          | 243               | 1                   | 0.003784       |
-| gpt-4-turbo         | 76.07                    | 3521         | 2199          | 1322              | 1                   | 0.06165        |
+| Name                        | Execution time (seconds) | total_tokens | prompt_tokens | completion_tokens | successful_requests | total_cost_USD |
+| --------------------------- | ------------------------ | ------------ | ------------- | ----------------- | ------------------- | -------------- |
+| gpt-3.5-turbo               | 25.89                    | 445          | 272           | 173               | 1                   | 0.000754       |
+| gpt-4-turbo-preview         | 64.70                    | 3573         | 2199          | 1374              | 1                   | 0.06321        |
+| Grooq with nomic-embed-text | 3.82                     | 2459         | 2192          | 267               | 1                   | 0              |
 
```
Lines changed: 57 additions & 0 deletions
```python
"""
Basic example of scraping pipeline using SmartScraper from text
"""
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info

load_dotenv()

files = ["inputs/example_1.txt", "inputs/example_2.txt"]
tasks = ["List me all the projects with their description.",
         "List me all the articles with their description."]


# ************************************************
# Define the configuration for the graph
# ************************************************

groq_key = os.getenv("GROQ_APIKEY")

graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": groq_key,
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "headless": False
}

# ************************************************
# Create the SmartScraperGraph instance and run it
# ************************************************

for i in range(0, 2):
    with open(files[i], 'r', encoding="utf-8") as file:
        text = file.read()

    smart_scraper_graph = SmartScraperGraph(
        prompt=tasks[i],
        source=text,
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

    # ************************************************
    # Get graph execution info
    # ************************************************

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))
```
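The execution times reported in the benchmark Readmes can be collected by wrapping each graph run in a wall-clock timer; a minimal sketch of that harness (`timed` is a hypothetical helper, not part of scrapegraphai):

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Cheap stand-in for smart_scraper_graph.run() to show the shape of the output
result, elapsed = timed(sum, range(1000))
print(f"{elapsed:.2f}s")
```

In the benchmark loop the call would be `result, elapsed = timed(smart_scraper_graph.run)`, with `elapsed` written into the Example 1 / Example 2 columns.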

examples/benchmarks/SmartScraper/benchmark_llama3.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -2,7 +2,6 @@
 Basic example of scraping pipeline using SmartScraper from text
 """
 
-import os
 from scrapegraphai.graphs import SmartScraperGraph
 from scrapegraphai.utils import prettify_exec_info
 
```