
Commit a4ee757

Merge branch 'pre/beta' into pdf_scraper_refactoring

2 parents: 8d5eb0b + e1006f3
37 files changed: +1440 −342 lines

.python-version

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+3.10.14

CHANGELOG.md

Lines changed: 34 additions & 0 deletions

@@ -1,3 +1,37 @@
+## [1.5.0-beta.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0...v1.5.0-beta.1) (2024-05-24)
+
+
+### Features
+
+* **knowledgegraph:** add knowledge graph node ([0196423](https://github.com/VinciGit00/Scrapegraph-ai/commit/0196423bdeea6568086aae6db8fc0f5652fc4e87))
+* add logger integration ([e53766b](https://github.com/VinciGit00/Scrapegraph-ai/commit/e53766b16e89254f945f9b54b38445a24f8b81f2))
+* **smart-scraper-multi:** add schema to graphs and created SmartScraperMultiGraph ([fc58e2d](https://github.com/VinciGit00/Scrapegraph-ai/commit/fc58e2d3a6f05efa72b45c9e68c6bb41a1eee755))
+* **base_graph:** alligned with main ([73fa31d](https://github.com/VinciGit00/Scrapegraph-ai/commit/73fa31db0f791d1fd63b489ac88cc6e595aa07f9))
+* **verbose:** centralized graph logging on debug or warning depending on verbose ([c807695](https://github.com/VinciGit00/Scrapegraph-ai/commit/c807695720a85c74a0b4365afb397bbbcd7e2889))
+* **node:** knowledge graph node ([8c33ea3](https://github.com/VinciGit00/Scrapegraph-ai/commit/8c33ea3fbce18f74484fe7bd9469ab95c985ad0b))
+* **multiple:** quick fix working ([58cc903](https://github.com/VinciGit00/Scrapegraph-ai/commit/58cc903d556d0b8db10284493b05bed20992c339))
+* **kg:** removed import ([a338383](https://github.com/VinciGit00/Scrapegraph-ai/commit/a338383399b669ae2dd7bfcec168b791e8206816))
+* **docloaders:** undetected-playwright ([7b3ee4e](https://github.com/VinciGit00/Scrapegraph-ai/commit/7b3ee4e71e4af04edeb47999d70d398b67c93ac4))
+* **multiple_search:** working multiple example ([bed3eed](https://github.com/VinciGit00/Scrapegraph-ai/commit/bed3eed50c1678cfb07cba7b451ac28d38c87d7c))
+* **kg:** working rag kg ([c75e6a0](https://github.com/VinciGit00/Scrapegraph-ai/commit/c75e6a06b1a647f03e6ac6eeacdc578a85baa25b))
+
+
+### Bug Fixes
+
+* error in jsons ([ca436ab](https://github.com/VinciGit00/Scrapegraph-ai/commit/ca436abf3cbff21d752a71969e787e8f8c98c6a8))
+* **logger:** set up centralized root logger in base node ([4348d4f](https://github.com/VinciGit00/Scrapegraph-ai/commit/4348d4f4db6f30213acc1bbccebc2b143b4d2636))
+* **logging:** source code citation ([d139480](https://github.com/VinciGit00/Scrapegraph-ai/commit/d1394809d704bee4085d494ddebab772306b3b17))
+* template names ([b82f33a](https://github.com/VinciGit00/Scrapegraph-ai/commit/b82f33aee72515e4258e6f508fce15028eba5cbe))
+* **node-logging:** use centralized logger in each node for logging ([c251cc4](https://github.com/VinciGit00/Scrapegraph-ai/commit/c251cc45d3694f8e81503e38a6d2b362452b740e))
+* **web-loader:** use sublogger ([0790ecd](https://github.com/VinciGit00/Scrapegraph-ai/commit/0790ecd2083642af9f0a84583216ababe351cd76))
+
+
+### CI
+
+* **release:** 1.2.0-beta.1 [skip ci] ([fd3e0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/fd3e0aa5823509dfb46b4f597521c24d4eb345f1))
+* **release:** 1.3.0-beta.1 [skip ci] ([191db0b](https://github.com/VinciGit00/Scrapegraph-ai/commit/191db0bc779e4913713b47b68ec4162a347da3ea))
+* **release:** 1.4.0-beta.1 [skip ci] ([2caddf9](https://github.com/VinciGit00/Scrapegraph-ai/commit/2caddf9a99b5f3aedc1783216f21d23cd35b3a8c))
+* **release:** 1.4.0-beta.2 [skip ci] ([f1a2523](https://github.com/VinciGit00/Scrapegraph-ai/commit/f1a25233d650010e1932e0ab80938079a22a296d))
 
 ## [1.4.0-beta.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v1.4.0-beta.1...v1.4.0-beta.2) (2024-05-19)
 
examples/local_models/smart_scraper_ollama.py

Lines changed: 1 addition & 0 deletions

@@ -20,6 +20,7 @@
         # "base_url": "http://localhost:11434", # set ollama URL arbitrarily
     },
     "verbose": True,
+    "headless": False
 }

 # ************************************************
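
The new headless flag eventually reaches the Chromium-based loader; as a hedged sketch of that flow, here is a direct use of the loader (the constructor signature is an assumption based on this repository's docloaders, and the URL is illustrative — the graphs normally construct the loader internally from graph_config["headless"]):

from scrapegraphai.docloaders import ChromiumLoader

# Hypothetical direct use; headless=False opens a visible browser window.
loader = ChromiumLoader(["https://example.com"], headless=False)
docs = loader.load()
print(docs[0].page_content[:200])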

examples/openai/smart_scraper_openai.py

Lines changed: 2 additions & 2 deletions

@@ -18,10 +18,10 @@

 graph_config = {
     "llm": {
-        "api_key":openai_key,
+        "api_key": openai_key,
         "model": "gpt-3.5-turbo",
     },
-    "verbose": True,
+    "verbose": False,
     "headless": False,
 }

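For context, a sketch of how this example reads after the change; the dotenv loading, the env var name, and the prompt/source values are assumptions modeled on the repository's other examples, not part of this diff:

import os

from dotenv import load_dotenv

from scrapegraphai.graphs import SmartScraperGraph

load_dotenv()
openai_key = os.getenv("OPENAI_APIKEY")  # assumed env var name

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "verbose": False,  # with the new centralized logging, False means warnings only
    "headless": False,
}

# Hypothetical prompt/source, for illustration only
smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    source="https://perinim.github.io/projects",
    config=graph_config,
)
print(smart_scraper_graph.run())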

examples/single_node/robot_node.py

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@

 graph_config = {
     "llm": {
-        "model": "ollama/llama3",
+        "model_name": "ollama/llama3",
         "temperature": 0,
         "streaming": True
     },
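
A hedged sketch of the wiring this key feeds (the node construction below is modeled on the repository's single-node examples and is an assumption, not part of this diff):

from scrapegraphai.models import Ollama
from scrapegraphai.nodes import RobotsNode

# Hypothetical wiring: the llm dict is handed to the Ollama wrapper directly,
# so the key it carries ("model_name" after this change) must match what the
# wrapper expects.
llm_model = Ollama(graph_config["llm"])

robots_node = RobotsNode(
    input="url",
    output=["is_scrapable"],
    node_config={"llm_model": llm_model},
)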

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@
 name = "scrapegraphai"


-version = "1.4.0b2"
+version = "1.5.0b1"


 description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."

scrapegraphai/docloaders/chromium.py

Lines changed: 2 additions & 3 deletions

@@ -1,14 +1,13 @@
 import asyncio
-import logging
 from typing import Any, AsyncIterator, Iterator, List, Optional

 from langchain_community.document_loaders.base import BaseLoader
 from langchain_core.documents import Document

-from ..utils import Proxy, dynamic_import, parse_or_search_proxy
+from ..utils import Proxy, dynamic_import, get_logger, parse_or_search_proxy


-logger = logging.getLogger(__name__)
+logger = get_logger("web-loader")


 class ChromiumLoader(BaseLoader):
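
The loader now goes through the library's own logging helper rather than the stdlib root logger, so its output follows the verbosity chosen in AbstractGraph. A minimal sketch of what such a helper plausibly looks like (an assumption for illustration; the actual helper lives in scrapegraphai/utils and may differ):

import logging

_ROOT_NAME = "scrapegraphai"  # assumed root logger name

def get_logger(name: str | None = None) -> logging.Logger:
    # Return the library's root logger, or a named child of it, so that
    # set_verbosity_debug()/set_verbosity_warning() affect every module.
    root = logging.getLogger(_ROOT_NAME)
    return root.getChild(name) if name else root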

scrapegraphai/graphs/abstract_graph.py

Lines changed: 58 additions & 39 deletions

@@ -1,15 +1,28 @@
 """
 AbstractGraph Module
 """
+
 from abc import ABC, abstractmethod
 from typing import Optional
+
 from langchain_aws import BedrockEmbeddings
-from langchain_openai import AzureOpenAIEmbeddings, OpenAIEmbeddings
 from langchain_community.embeddings import HuggingFaceHubEmbeddings, OllamaEmbeddings
 from langchain_google_genai import GoogleGenerativeAIEmbeddings
-from ..helpers import models_tokens
-from ..models import AzureOpenAI, Bedrock, Gemini, Groq, HuggingFace, Ollama, OpenAI, Anthropic, DeepSeek
 from langchain_google_genai.embeddings import GoogleGenerativeAIEmbeddings
+from langchain_openai import AzureOpenAIEmbeddings, OpenAIEmbeddings
+
+from ..helpers import models_tokens
+from ..models import (
+    Anthropic,
+    AzureOpenAI,
+    Bedrock,
+    Gemini,
+    Groq,
+    HuggingFace,
+    Ollama,
+    OpenAI,
+)
+from ..utils.logging import set_verbosity_debug, set_verbosity_warning

 from ..helpers import models_tokens
 from ..models import AzureOpenAI, Bedrock, Gemini, Groq, HuggingFace, Ollama, OpenAI, Anthropic, DeepSeek

@@ -67,10 +80,15 @@ def __init__(self, prompt: str, config: dict, source: Optional[str] = None, sche
         self.execution_info = None

         # Set common configuration parameters
-        self.verbose = False if config is None else config.get(
-            "verbose", False)
-        self.headless = True if config is None else config.get(
-            "headless", True)
+
+        verbose = bool(config and config.get("verbose"))
+
+        if verbose:
+            set_verbosity_debug()
+        else:
+            set_verbosity_warning()
+
+        self.headless = True if config is None else config.get("headless", True)
         self.loader_kwargs = config.get("loader_kwargs", {})

         common_params = {

@@ -96,22 +114,22 @@ def set_common_params(self, params: dict, overwrite=False):

     def _set_model_token(self, llm):

-        if 'Azure' in str(type(llm)):
+        if "Azure" in str(type(llm)):
             try:
                 self.model_token = models_tokens["azure"][llm.model_name]
             except KeyError:
                 raise KeyError("Model not supported")

-        elif 'HuggingFaceEndpoint' in str(type(llm)):
-            if 'mistral' in llm.repo_id:
+        elif "HuggingFaceEndpoint" in str(type(llm)):
+            if "mistral" in llm.repo_id:
                 try:
-                    self.model_token = models_tokens['mistral'][llm.repo_id]
+                    self.model_token = models_tokens["mistral"][llm.repo_id]
                 except KeyError:
                     raise KeyError("Model not supported")
-        elif 'Google' in str(type(llm)):
+        elif "Google" in str(type(llm)):
             try:
-                if 'gemini' in llm.model:
-                    self.model_token = models_tokens['gemini'][llm.model]
+                if "gemini" in llm.model:
+                    self.model_token = models_tokens["gemini"][llm.model]
             except KeyError:
                 raise KeyError("Model not supported")

@@ -129,17 +147,14 @@ def _create_llm(self, llm_config: dict, chat=False) -> object:
             KeyError: If the model is not supported.
         """

-        llm_defaults = {
-            "temperature": 0,
-            "streaming": False
-        }
+        llm_defaults = {"temperature": 0, "streaming": False}
         llm_params = {**llm_defaults, **llm_config}

         # If model instance is passed directly instead of the model details
-        if 'model_instance' in llm_params:
+        if "model_instance" in llm_params:
             if chat:
-                self._set_model_token(llm_params['model_instance'])
-            return llm_params['model_instance']
+                self._set_model_token(llm_params["model_instance"])
+            return llm_params["model_instance"]

         # Instantiate the language model based on the model name
         if "gpt-" in llm_params["model"]:

@@ -208,19 +223,21 @@ def _create_llm(self, llm_config: dict, chat=False) -> object:
         elif "bedrock" in llm_params["model"]:
             llm_params["model"] = llm_params["model"].split("/")[-1]
             model_id = llm_params["model"]
-            client = llm_params.get('client', None)
+            client = llm_params.get("client", None)
             try:
                 self.model_token = models_tokens["bedrock"][llm_params["model"]]
             except KeyError:
                 print("model not found, using default token size (8192)")
                 self.model_token = 8192
-            return Bedrock({
-                "client": client,
-                "model_id": model_id,
-                "model_kwargs": {
-                    "temperature": llm_params["temperature"],
+            return Bedrock(
+                {
+                    "client": client,
+                    "model_id": model_id,
+                    "model_kwargs": {
+                        "temperature": llm_params["temperature"],
+                    },
                 }
-            })
+            )
         elif "claude-3-" in llm_params["model"]:
             try:
                 self.model_token = models_tokens["claude"]["claude3"]

@@ -236,8 +253,7 @@ def _create_llm(self, llm_config: dict, chat=False) -> object:
                 self.model_token = 8192
             return DeepSeek(llm_params)
         else:
-            raise ValueError(
-                "Model provided by the configuration not supported")
+            raise ValueError("Model provided by the configuration not supported")

     def _create_default_embedder(self, llm_config=None) -> object:
         """

@@ -250,8 +266,9 @@ def _create_default_embedder(self, llm_config=None) -> object:
             ValueError: If the model is not supported.
         """
         if isinstance(self.llm_model, Gemini):
-            return GoogleGenerativeAIEmbeddings(google_api_key=llm_config['api_key'],
-                                                model="models/embedding-001")
+            return GoogleGenerativeAIEmbeddings(
+                google_api_key=llm_config["api_key"], model="models/embedding-001"
+            )
         if isinstance(self.llm_model, OpenAI):
             return OpenAIEmbeddings(api_key=self.llm_model.openai_api_key)
         elif isinstance(self.llm_model, DeepSeek):

@@ -288,8 +305,8 @@ def _create_embedder(self, embedder_config: dict) -> object:
         Raises:
             KeyError: If the model is not supported.
         """
-        if 'model_instance' in embedder_config:
-            return embedder_config['model_instance']
+        if "model_instance" in embedder_config:
+            return embedder_config["model_instance"]
         # Instantiate the embedding model based on the model name
         if "openai" in embedder_config["model"]:
             return OpenAIEmbeddings(api_key=embedder_config["api_key"])

@@ -306,25 +323,27 @@ def _create_embedder(self, embedder_config: dict) -> object:
             try:
                 models_tokens["hugging_face"][embedder_config["model"]]
             except KeyError as exc:
-                raise KeyError("Model not supported")from exc
+                raise KeyError("Model not supported") from exc
             return HuggingFaceHubEmbeddings(model=embedder_config["model"])
         elif "gemini" in embedder_config["model"]:
             try:
                 models_tokens["gemini"][embedder_config["model"]]
             except KeyError as exc:
-                raise KeyError("Model not supported")from exc
+                raise KeyError("Model not supported") from exc
             return GoogleGenerativeAIEmbeddings(model=embedder_config["model"])
         elif "bedrock" in embedder_config["model"]:
             embedder_config["model"] = embedder_config["model"].split("/")[-1]
-            client = embedder_config.get('client', None)
+            client = embedder_config.get("client", None)
             try:
                 models_tokens["bedrock"][embedder_config["model"]]
             except KeyError as exc:
                 raise KeyError("Model not supported") from exc
-            return BedrockEmbeddings(client=client, model_id=embedder_config["model"])
+            return BedrockEmbeddings(client=client, model_id=embedder_config["model"])
+        else:
+            raise ValueError("Model provided by the configuration not supported")

     def get_state(self, key=None) -> dict:
-        """""
+        """ ""
         Get the final state of the graph.

         Args:
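
The new set_verbosity_debug/set_verbosity_warning imports replace the old per-instance self.verbose flag with a process-wide log level. A rough sketch of what those helpers might look like in ..utils.logging (an assumption; the real module may differ):

import logging

_root = logging.getLogger("scrapegraphai")  # assumed root logger name

def set_verbosity(level: int) -> None:
    # Adjust the shared root logger; every get_logger() child inherits it.
    _root.setLevel(level)

def set_verbosity_debug() -> None:
    set_verbosity(logging.DEBUG)

def set_verbosity_warning() -> None:
    set_verbosity(logging.WARNING)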

scrapegraphai/helpers/models_tokens.py

Lines changed: 1 addition & 0 deletions

@@ -5,6 +5,7 @@
 models_tokens = {
     "openai": {
         "gpt-3.5-turbo-0125": 16385,
+        "gpt-3.5": 4096,
         "gpt-3.5-turbo": 4096,
         "gpt-3.5-turbo-1106": 16385,
         "gpt-3.5-turbo-instruct": 4096,
