Commit 52643d5

Merge branch 'main' into temp
2 parents 971cc2d + 72ee93a commit 52643d5

15 files changed: 324 additions and 235 deletions

CHANGELOG.md

Lines changed: 125 additions & 1 deletion
@@ -1,11 +1,130 @@
+
 ## [1.10.0-beta.7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.0-beta.6...v1.10.0-beta.7) (2024-07-23)
 
+## [1.11.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.1...v1.11.2) (2024-07-23)
+
+
+### Bug Fixes
+
+* md conversion ([1d41f6e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/1d41f6eafe8ed0e191bb6a258d54c6388ff283c6))
+
+## [1.11.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.11.0...v1.11.1) (2024-07-23)
+
+
+### Bug Fixes
+
+* md conversion ([5a45e9f](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5a45e9f2d86a1c58b8ea321e3df9718bc00f9c28))
+
+## [1.11.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.4...v1.11.0) (2024-07-23)
+
+
+### Features
+
+* add new toml ([fcb3220](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fcb3220868e7ef1127a7a47f40d0379be282e6eb))
+* add nvidia connection ([fc0dadb](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fc0dadb8f812dfd636dec856921a971b58695ce3))
+
+
+### Bug Fixes
+
+* **md_conversion:** add absolute links md, added missing dependency ([12b5ead](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/12b5eada6ea783770afd630ede69b8cf867a7ded))
+
+
+### chore
+
+* **dependecies:** add script to auto-update requirements ([3289c7b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3289c7bf5ec58ac3d04e9e5e8e654af9abcee228))
+* **ci:** set up workflow for requirements auto-update ([295fc28](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/295fc28ceb02c78198f7fbe678352503b3259b6b))
+* update requirements.txt ([c7bac98](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/c7bac98d2e79e5ab98fa65d7efa858a2cdda1622))
+* upgrade dependencies and scripts ([74d142e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/74d142eaae724b087eada9c0c876b40a2ccc7cae))
+* **pyproject:** upgrade dependencies ([0425124](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/0425124c570f765b98fcf67ba6649f4f9fe76b15))
+
+
+### Docs
+
+* add hero image ([4182e23](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/4182e23e3b8d8f141b119b6014ae3ff20b3892e3))
+* updated readme ([c377ae0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/c377ae0544a78ebdc0d15f8d23b3846c26876c8c))
+
+
+### CI
+
+* **release:** 1.10.0-beta.6 [skip ci] ([254bde7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/254bde7008b41ffa434925e3ae84340c53a565bd))
+* **release:** 1.10.0-beta.7 [skip ci] ([1756e85](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/1756e8522f3874de8afbef9ac327f9b3f1a49d07))
+* **release:** 1.10.0-beta.8 [skip ci] ([255e569](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/255e569172b1029bc2a723b2ec66bcf3d3ee3791))
+
+## [1.10.0-beta.8](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.0-beta.7...v1.10.0-beta.8) (2024-07-23)
+
+## [1.10.4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.3...v1.10.4) (2024-07-22)
+
+
+
+### Bug Fixes
+
+
+* **md_conversion:** add absolute links md, added missing dependency ([12b5ead](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/12b5eada6ea783770afd630ede69b8cf867a7ded))
+
+## [1.10.0-beta.7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.0-beta.6...v1.10.0-beta.7) (2024-07-23)
+
+
+### Features
+
+* add nvidia connection ([fc0dadb](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fc0dadb8f812dfd636dec856921a971b58695ce3))
+
+
+### chore
+
+* **dependecies:** add script to auto-update requirements ([3289c7b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3289c7bf5ec58ac3d04e9e5e8e654af9abcee228))
+* **ci:** set up workflow for requirements auto-update ([295fc28](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/295fc28ceb02c78198f7fbe678352503b3259b6b))
+* update requirements.txt ([c7bac98](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/c7bac98d2e79e5ab98fa65d7efa858a2cdda1622))
+
+## [1.10.0-beta.6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.0-beta.5...v1.10.0-beta.6) (2024-07-22)
+
+* parse node ([09256f7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/09256f7b11a7a1c2aba01cf8de70401af1e86fe4))
+
+## [1.10.3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.2...v1.10.3) (2024-07-22)
+
+
+### Bug Fixes
+
+* parse_html node have a bug ([71f894e](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/71f894eee3468fac8ad2c724ad1f9fd4b5f64140))
+
+## [1.10.2](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.1...v1.10.2) (2024-07-21)
+
+
+### Bug Fixes
+
+* telemetry version ([b0418b6](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/b0418b679cf45e1e680d2daadcc47e6e4f585575))
+
+## [1.10.1](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.0...v1.10.1) (2024-07-21)
+
+
+### Bug Fixes
+
+* abstract_graph moel token bug ([ce6be37](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/ce6be37fbc1095afe4df6a2fc206923e477190e5))
+
+## [1.10.0](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.9.2...v1.10.0) (2024-07-20)
+
+
+
 
 ### Features
 
 * add nvidia connection ([fc0dadb](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fc0dadb8f812dfd636dec856921a971b58695ce3))
 
 
+* add new toml ([fcb3220](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fcb3220868e7ef1127a7a47f40d0379be282e6eb))
+
+* add gpt4o omni ([431edb7](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/431edb7bb2504f4c1335c3ae3ce2f91867fa7222))
+* add searchngx integration ([5c92186](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/5c9218608140bf694fbfd96aa90276bc438bb475))
+* refactoring_to_md function ([602dd00](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/602dd00209ee1d72a1223fc4793759450921fcf9))
+
+
+### Bug Fixes
+
+* add gpt o mini for azure ([77777c8](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/77777c898d1fad40f340b06c5b36d35b65409ea6))
+* parse_node ([07f1e23](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/07f1e23d235db1a0db2cb155f10b73b0bf882269))
+* search link node ([cf3ab55](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/cf3ab5564ae5c415c63d1771b32ea68f5169ca82))
+
+
+
 ### chore
 
 * **dependecies:** add script to auto-update requirements ([3289c7b](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/3289c7bf5ec58ac3d04e9e5e8e654af9abcee228))
@@ -28,6 +147,10 @@
 ### chore
 
 * **pyproject:** upgrade dependencies ([0425124](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/0425124c570f765b98fcf67ba6649f4f9fe76b15))
+* correct search engine name ([7ba2f6a](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7ba2f6ae0b9d2e9336e973e1f57ab8355c739e27))
+* remove unused import ([fd1b7cb](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/fd1b7cb24a7c252277607abde35826e3c58e34ef))
+* **ci:** upgrade lockfiles ([c7b05a4](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/c7b05a4993df14d6ed4848121a3cd209571232f7))
+* upgrade tiktoken ([7314bc3](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/7314bc383068db590662bf7e512f799529308991))
 
 
 
@@ -49,6 +172,7 @@
 * **release:** 1.9.0-beta.5 [skip ci] ([bb62439](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/bb624399cfc3924825892dd48697fc298ad3b002))
 * **release:** 1.9.0-beta.6 [skip ci] ([54a69de](https://github.com/ScrapeGraphAI/Scrapegraph-ai/commit/54a69de69e8077e02fd5584783ca62cc2e0ec5bb))
 
+
 ## [1.10.0-beta.5](https://github.com/ScrapeGraphAI/Scrapegraph-ai/compare/v1.10.0-beta.4...v1.10.0-beta.5) (2024-07-20)
 
 
@@ -376,7 +500,7 @@
 * **release:** 1.6.1 [skip ci] ([44fbd71](https://github.com/VinciGit00/Scrapegraph-ai/commit/44fbd71742a57a4b10f22ed33781bb67aa77e58d))
 
 ## [1.6.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.6.0...v1.6.1) (2024-06-15)
-=======
+
 
 
 ### Bug Fixes

README.md

Lines changed: 40 additions & 115 deletions
@@ -17,7 +17,7 @@ ScrapeGraphAI is a *web scraping* python library that uses LLM and direct graph
 Just say which information you want to extract and the library will do it for you!
 
 <p align="center">
-  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/scrapegraphai_logo.png" alt="Scrapegraph-ai Logo" style="width: 50%;">
+  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/sgai-hero.png" alt="ScrapeGraphAI Hero" style="width: 100%;">
 </p>
 
 ## 🚀 Quick install
@@ -26,159 +26,84 @@ The reference page for Scrapegraph-ai is available on the official page of PyPI:
 
 ```bash
 pip install scrapegraphai
+
+playwright install
 ```
 
 **Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱
 
-## 🔍 Demo
-Official streamlit demo:
-
-[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app)
-
-Try it directly on the web using Google Colab:
-
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
-
-## 📖 Documentation
-
-The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
-
-Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
-
 ## 💻 Usage
-There are multiple standard scraping pipelines that can be used to extract information from a website (or local file):
-- `SmartScraperGraph`: single-page scraper that only needs a user prompt and an input source;
-- `SearchGraph`: multi-page scraper that extracts information from the top n search results of a search engine;
-- `SpeechGraph`: single-page scraper that extracts information from a website and generates an audio file.
-- `ScriptCreatorGraph`: single-page scraper that extracts information from a website and generates a Python script.
+There are multiple standard scraping pipelines that can be used to extract information from a website (or local file).
 
-- `SmartScraperMultiGraph`: multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources;
-- `ScriptCreatorMultiGraph`: multi-page scraper that generates a Python script for extracting information from multiple pages given a single prompt and a list of sources.
-
-It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
+The most common one is the `SmartScraperGraph`, which extracts information from a single page given a user prompt and a source URL.
 
-### Case 1: SmartScraper using Local Models
-
-Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command.
 
 ```python
+import json
 from scrapegraphai.graphs import SmartScraperGraph
 
+# Define the configuration for the scraping pipeline
 graph_config = {
     "llm": {
-        "model": "ollama/mistral",
-        "temperature": 0,
-        "format": "json", # Ollama needs the format to be specified explicitly
-        "base_url": "http://localhost:11434", # set Ollama URL
-    },
-    "embeddings": {
-        "model": "ollama/nomic-embed-text",
-        "base_url": "http://localhost:11434", # set Ollama URL
+        "api_key": "YOUR_OPENAI_APIKEY",
+        "model": "gpt-4o-mini",
     },
     "verbose": True,
+    "headless": False,
 }
 
+# Create the SmartScraperGraph instance
 smart_scraper_graph = SmartScraperGraph(
-    prompt="List me all the projects with their descriptions",
-    # also accepts a string with the already downloaded HTML code
-    source="https://perinim.github.io/projects",
+    prompt="Find some information about what does the company do, the name and a contact email.",
+    source="https://scrapegraphai.com/",
     config=graph_config
 )
 
+# Run the pipeline
 result = smart_scraper_graph.run()
-print(result)
-
+print(json.dumps(result, indent=4))
 ```
 
-The output will be a list of projects with their descriptions like the following:
+The output will be a dictionary like the following:
 
 ```python
-{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
-```
-
-### Case 2: SearchGraph using Mixed Models
-
-We use **Groq** for the LLM and **Ollama** for the embeddings.
-
-```python
-from scrapegraphai.graphs import SearchGraph
-
-# Define the configuration for the graph
-graph_config = {
-    "llm": {
-        "model": "groq/gemma-7b-it",
-        "api_key": "GROQ_API_KEY",
-        "temperature": 0
-    },
-    "embeddings": {
-        "model": "ollama/nomic-embed-text",
-        "base_url": "http://localhost:11434", # set ollama URL arbitrarily
-    },
-    "max_results": 5,
+{
+    "company": "ScrapeGraphAI",
+    "name": "ScrapeGraphAI Extracting content from websites and local documents using LLM",
+    "contact_email": "[email protected]"
 }
+```
 
-# Create the SearchGraph instance
-search_graph = SearchGraph(
-    prompt="List me all the traditional recipes from Chioggia",
-    config=graph_config
-)
+There are other pipelines that can be used to extract information from multiple pages, generate Python scripts, or even generate audio files.
 
-# Run the graph
-result = search_graph.run()
-print(result)
-```
+| Pipeline Name | Description |
+|-------------------------|------------------------------------------------------------------------------------------------------------------|
+| SmartScraperGraph | Single-page scraper that only needs a user prompt and an input source. |
+| SearchGraph | Multi-page scraper that extracts information from the top n search results of a search engine. |
+| SpeechGraph | Single-page scraper that extracts information from a website and generates an audio file. |
+| ScriptCreatorGraph | Single-page scraper that extracts information from a website and generates a Python script. |
+| SmartScraperMultiGraph | Multi-page scraper that extracts information from multiple pages given a single prompt and a list of sources. |
+| ScriptCreatorMultiGraph | Multi-page scraper that generates a Python script for extracting information from multiple pages and sources. |
 
-The output will be a list of recipes like the following:
+It is possible to use different LLM through APIs, such as **OpenAI**, **Groq**, **Azure** and **Gemini**, or local models using **Ollama**.
 
-```python
-{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
-```
-### Case 3: SpeechGraph using OpenAI
+Remember to have [Ollama](https://ollama.com/) installed and download the models using the **ollama pull** command, if you want to use local models.
 
-You just need to pass the OpenAI API key and the model name.
+## 🔍 Demo
+Official streamlit demo:
 
-```python
-from scrapegraphai.graphs import SpeechGraph
+[![My Skills](https://skillicons.dev/icons?i=react)](https://scrapegraph-ai-web-dashboard.streamlit.app)
 
-graph_config = {
-    "llm": {
-        "api_key": "OPENAI_API_KEY",
-        "model": "gpt-3.5-turbo",
-    },
-    "tts_model": {
-        "api_key": "OPENAI_API_KEY",
-        "model": "tts-1",
-        "voice": "alloy"
-    },
-    "output_path": "audio_summary.mp3",
-}
+Try it directly on the web using Google Colab:
 
-# ************************************************
-# Create the SpeechGraph instance and run it
-# ************************************************
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sEZBonBMGP44CtO6GQTwAlL0BGJXjtfd?usp=sharing)
 
-speech_graph = SpeechGraph(
-    prompt="Make a detailed audio summary of the projects.",
-    source="https://perinim.github.io/projects/",
-    config=graph_config,
-)
+## 📖 Documentation
 
-result = speech_graph.run()
-print(result)
+The documentation for ScrapeGraphAI can be found [here](https://scrapegraph-ai.readthedocs.io/en/latest/).
 
-```
+Check out also the Docusaurus [here](https://scrapegraph-doc.onrender.com/).
 
-The output will be an audio file with the summary of the projects on the page.
-
-## Sponsors
-<div style="text-align: center;">
-  <a href="https://serpapi.com?utm_source=scrapegraphai">
-    <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/serp_api_logo.png" alt="SerpAPI" style="width: 10%;">
-  </a>
-  <a href="https://dashboard.statproxies.com/?refferal=scrapegraph">
-    <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/transparent_stat.png" alt="Stats" style="width: 15%;">
-  </a>
-</div>
 
 ## 🤝 Contributing
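
Reviewer note on the README change above: the new README keeps the sentence about local models through Ollama but drops the Ollama configuration example. As a minimal sketch, the removed settings can be kept around as a small helper. The helper name `build_local_config` is hypothetical and not part of this commit; the keys and values are copied from the README lines removed here, and it is an assumption that the library version in this commit still accepts them unchanged.

```python
import json


def build_local_config(model: str = "ollama/mistral",
                       base_url: str = "http://localhost:11434") -> dict:
    """Return a graph_config dict shaped like the removed README example.

    Hypothetical helper: the keys mirror the Ollama block that this commit
    deletes from README.md; they are not verified against the library here.
    """
    return {
        "llm": {
            "model": model,
            "temperature": 0,
            "format": "json",      # the removed README says Ollama needs this set explicitly
            "base_url": base_url,  # address of the local Ollama server
        },
        "verbose": True,
    }


if __name__ == "__main__":
    # Print the config so it can be compared with the README snippet.
    print(json.dumps(build_local_config(), indent=4))
```

The returned dictionary would be passed as `config=` to `SmartScraperGraph` in place of the OpenAI configuration shown in the new README, after pulling the model with `ollama pull`.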

docs/assets/sgai-hero.png

66.9 KB

examples/openai/smart_scraper_openai.py

Lines changed: 2 additions & 2 deletions
@@ -27,8 +27,8 @@
 # ************************************************
 
 smart_scraper_graph = SmartScraperGraph(
-    prompt="Extract me the python code inside the page",
-    source="https://www.exploit-db.com/exploits/51447",
+    prompt="List me what does the company do, the name and a contact email.",
+    source="https://scrapegraphai.com/",
     config=graph_config
 )
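
Reviewer note: the README snippet in this commit puts a placeholder key (`"YOUR_OPENAI_APIKEY"`) directly in `graph_config`. A minimal sketch of building the same configuration from the environment instead is shown below. The helper name `make_graph_config` and the environment variable name `OPENAI_API_KEY` are assumptions for illustration and do not appear in this diff; the model name and the `verbose`/`headless` keys are taken from the new README.

```python
import os
from typing import Mapping, Optional


def make_graph_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the OpenAI graph_config from README.md using an environment variable.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    by a plain dict, which makes the function easy to test.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")  # assumed variable name
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "llm": {
            "api_key": api_key,
            "model": "gpt-4o-mini",  # model used in the new README example
        },
        "verbose": True,
        "headless": False,
    }
```

The result can be passed as `config=graph_config` in the example script; failing early on a missing key avoids sending a request with the placeholder string.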
