Commit 1774b18

refactor of embeddings
1 parent b6f7b64 commit 1774b18

5 files changed: +85 -19 lines changed

examples/example.py

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
from scrapegraphai.graphs import PDFScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "model_tokens": 4000,
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "temperature": 0,
    },
    "verbose": True,
    "headless": False,
}

# Convert to list
sources = [
"This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather—the interaction between call center architecture and outdoor weather conditions—in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity – largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.",
"The diffusion of social media coincided with a worsening of mental health conditions among adolescents and young adults in the United States, giving rise to speculation that social media might be detrimental to mental health. In this paper, we provide quasi-experimental estimates of the impact of social media on mental health by leveraging a unique natural experiment: the staggered introduction of Facebook across U.S. colleges. Our analysis couples data on student mental health around the years of Facebook's expansion with a generalized difference-in-differences empirical strategy. We find that the roll-out of Facebook at a college increased symptoms of poor mental health, especially depression. We also find that, among students predicted to be most susceptible to mental illness, the introduction of Facebook led to increased utilization of mental healthcare services. Lastly, we find that, after the introduction of Facebook, students were more likely to report experiencing impairments to academic performance resulting from poor mental health. Additional evidence on mechanisms suggests that the results are due to Facebook fostering unfavorable social comparisons.",
"Hollywood films are generally released first in the United States and then later abroad, with some variation in lags across films and countries. With the growth in movie piracy since the appearance of BitTorrent in 2003, films have become available through illegal piracy immediately after release in the US, while they are not available for legal viewing abroad until their foreign premieres in each country. We make use of this variation in international release lags to ask whether longer lags – which facilitate more local pre-release piracy – depress theatrical box office receipts, particularly after the widespread adoption of BitTorrent. We find that longer release windows are associated with decreased box office returns, even after controlling for film and country fixed effects. This relationship is much stronger in contexts where piracy is more prevalent: after BitTorrent’s adoption and in heavily-pirated genres. Our findings indicate that, as a lower bound, international box office returns in our sample were at least 7% lower than they would have been in the absence of pre-release piracy. By contrast, we do not see evidence of elevated sales displacement in US box office revenue following the adoption of BitTorrent, and we suggest that delayed legal availability of the content abroad may drive the losses to piracy."
    # Add more sources here
]

prompt = """
You are an expert in reviewing academic manuscripts. Please analyze the abstracts provided from an academic journal article to extract and clearly identify the following elements:

Independent Variable (IV): The variable that is manipulated or considered as the primary cause affecting other variables.
Dependent Variable (DV): The variable that is measured or observed, which is expected to change as a result of variations in the Independent Variable.
Exogenous Shock: Identify any external or unexpected events used in the study that serve as a natural experiment or provide a unique setting for observing the effects on the IV and DV.
Response Format: For each abstract, present your response in the following structured format:

Independent Variable (IV):
Dependent Variable (DV):
Exogenous Shock:

Example Queries and Responses:

Query: This paper provides evidence from a natural experiment on the relationship between positive affect and productivity. We link highly detailed administrative data on the behaviors and performance of all telesales workers at a large telecommunications company with survey reports of employee happiness that we collected on a weekly basis. We use variation in worker mood arising from visual exposure to weather the interaction between call center architecture and outdoor weather conditions in order to provide a quasi-experimental test of the effect of happiness on productivity. We find evidence of a positive impact on sales performance, which is driven by changes in labor productivity largely through workers converting more calls into sales, and to a lesser extent by making more calls per hour and adhering more closely to their schedule. We find no evidence in our setting of effects on measures of high-frequency labor supply such as attendance and break-taking.

Response:

Independent Variable (IV): Employee happiness.
Dependent Variable (DV): Sales performance (labor productivity).
Exogenous Shock: Variation in worker mood arising from visual exposure to weather.

Query: The diffusion of social media coincided with a worsening of mental health conditions among adolescents and young adults in the United States, giving rise to speculation that social media might be detrimental to mental health. In this paper, we provide quasi-experimental estimates of the impact of social media on mental health by leveraging a unique natural experiment: the staggered introduction of Facebook across U.S. colleges. Our analysis couples data on student mental health around the years of Facebook's expansion with a generalized difference-in-differences empirical strategy. We find that the roll-out of Facebook at a college increased symptoms of poor mental health, especially depression. We also find that, among students predicted to be most susceptible to mental illness, the introduction of Facebook led to increased utilization of mental healthcare services. Lastly, we find that, after the introduction of Facebook, students were more likely to report experiencing impairments to academic performance resulting from poor mental health. Additional evidence on mechanisms suggests that the results are due to Facebook fostering unfavorable social comparisons.

Response:

Independent Variable (IV): Exposure to social media.
Dependent Variable (DV): Mental health outcomes.
Exogenous Shock: staggered introduction of Facebook across U.S. colleges.
"""
results = []
for source in sources:
    pdf_scraper_graph = PDFScraperGraph(
        prompt=prompt,
        source=source,
        config=graph_config
    )
    result = pdf_scraper_graph.run()
    results.append(result)
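
The prompt in this example asks for three labelled fields per abstract. If the model returns them as plain text lines (an assumption: with `"format": "json"` the graph may return a dictionary instead), a small helper could turn each answer into a dictionary. `parse_fields` and `FIELD_PATTERN` are hypothetical names and are not part of scrapegraphai or of this commit.

```python
import re

# Labels requested by the prompt's response format (hypothetical helper, not in the commit).
FIELD_PATTERN = re.compile(
    r"^(Independent Variable \(IV\)|Dependent Variable \(DV\)|Exogenous Shock):[ \t]*(.*)$",
    re.MULTILINE,
)


def parse_fields(answer: str) -> dict:
    """Map each labelled line of a plain-text answer to its value."""
    return {label: value.strip() for label, value in FIELD_PATTERN.findall(answer)}


sample = """Independent Variable (IV): Exposure to social media.
Dependent Variable (DV): Mental health outcomes.
Exogenous Shock: staggered introduction of Facebook across U.S. colleges."""

parsed = parse_fields(sample)
print(parsed["Exogenous Shock"])
```

Lines that do not start with one of the three labels are ignored, so surrounding commentary in the answer would not break the parse.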

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ classifiers = [
     "Programming Language :: Python :: 3",
     "Operating System :: OS Independent",
 ]
-requires-python = ">= 3.9"
+requires-python = ">=3.9,<3.12"
 
 [build-system]
 requires = ["hatchling"]
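
The new `requires-python = ">=3.9,<3.12"` bound means installers will refuse Python 3.12 and later, where the old bound only set a minimum. A minimal sketch of the same range check at runtime, using only the standard library; `python_supported` is a hypothetical helper, and the tuple comparison covers only this specific range, not general version specifiers.

```python
import sys


def python_supported(version=None) -> bool:
    """Return True when (major, minor) satisfies >=3.9,<3.12."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 9) <= major_minor < (3, 12)


print(python_supported((3, 11, 4)))
print(python_supported((3, 12, 0)))
```

Calling `python_supported()` with no argument checks the running interpreter.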

requirements-dev.lock

Lines changed: 7 additions & 5 deletions
@@ -45,10 +45,6 @@ certifi==2024.2.2
     # via requests
 charset-normalizer==3.3.2
     # via requests
-colorama==0.4.6
-    # via ipython
-    # via pytest
-    # via tqdm
 dataclasses-json==0.6.6
     # via langchain
     # via langchain-community
@@ -104,7 +100,6 @@ graphviz==0.20.3
     # via scrapegraphai
 greenlet==3.0.3
     # via playwright
-    # via sqlalchemy
 groq==0.5.0
     # via langchain-groq
 grpcio==1.63.0
@@ -217,8 +212,11 @@ pandas==2.2.2
     # via scrapegraphai
 parso==0.8.4
     # via jedi
+pexpect==4.9.0
+    # via ipython
 playwright==1.43.0
     # via scrapegraphai
+    # via undetected-playwright
 pluggy==1.5.0
     # via pytest
 prompt-toolkit==3.0.43
@@ -233,6 +231,8 @@ protobuf==4.25.3
     # via googleapis-common-protos
     # via grpcio-status
     # via proto-plus
+ptyprocess==0.7.0
+    # via pexpect
 pure-eval==0.2.2
     # via stack-data
 pyasn1==0.6.0
@@ -342,6 +342,8 @@ typing-inspect==0.9.0
     # via dataclasses-json
 tzdata==2024.1
     # via pandas
+undetected-playwright==0.3.0
+    # via scrapegraphai
 uritemplate==4.1.1
     # via google-api-python-client
 urllib3==2.2.1

requirements.lock

Lines changed: 7 additions & 4 deletions
@@ -45,9 +45,6 @@ certifi==2024.2.2
     # via requests
 charset-normalizer==3.3.2
     # via requests
-colorama==0.4.6
-    # via ipython
-    # via tqdm
 dataclasses-json==0.6.6
     # via langchain
     # via langchain-community
@@ -102,7 +99,6 @@ graphviz==0.20.3
     # via scrapegraphai
 greenlet==3.0.3
     # via playwright
-    # via sqlalchemy
 groq==0.5.0
     # via langchain-groq
 grpcio==1.63.0
@@ -212,8 +208,11 @@ pandas==2.2.2
     # via scrapegraphai
 parso==0.8.4
     # via jedi
+pexpect==4.9.0
+    # via ipython
 playwright==1.43.0
     # via scrapegraphai
+    # via undetected-playwright
 prompt-toolkit==3.0.43
     # via ipython
 proto-plus==1.23.0
@@ -226,6 +225,8 @@ protobuf==4.25.3
     # via googleapis-common-protos
     # via grpcio-status
     # via proto-plus
+ptyprocess==0.7.0
+    # via pexpect
 pure-eval==0.2.2
     # via stack-data
 pyasn1==0.6.0
@@ -330,6 +331,8 @@ typing-inspect==0.9.0
     # via dataclasses-json
 tzdata==2024.1
     # via pandas
+undetected-playwright==0.3.0
+    # via scrapegraphai
 uritemplate==4.1.1
     # via google-api-python-client
 urllib3==2.2.1

scrapegraphai/graphs/abstract_graph.py

Lines changed: 6 additions & 9 deletions
@@ -282,41 +282,38 @@ def _create_embedder(self, embedder_config: dict) -> object:
         if 'model_instance' in embedder_config:
             return embedder_config['model_instance']
         # Instantiate the embedding model based on the model name
-        if "openai" in embedder_config["model"]:
+        if "openai" in embedder_config["model"].split("/")[0]:
             return OpenAIEmbeddings(api_key=embedder_config["api_key"])
         elif "azure" in embedder_config["model"]:
             return AzureOpenAIEmbeddings()
-        elif "ollama" in embedder_config["model"]:
+        elif "ollama" in embedder_config["model"].split("/")[0]:
+            print("ciao")
             embedder_config["model"] = embedder_config["model"].split("ollama/")[-1]
             try:
                 models_tokens["ollama"][embedder_config["model"]]
             except KeyError as exc:
                 raise KeyError("Model not supported") from exc
             return OllamaEmbeddings(**embedder_config)
-        elif "hugging_face" in embedder_config["model"]:
+        elif "hugging_face" in embedder_config["model"].split("/")[0]:
             try:
                 models_tokens["hugging_face"][embedder_config["model"]]
             except KeyError as exc:
                 raise KeyError("Model not supported")from exc
             return HuggingFaceHubEmbeddings(model=embedder_config["model"])
-        elif "gemini" in embedder_config["model"]:
+        elif "gemini" in embedder_config["model"].split("/")[0]:
             try:
                 models_tokens["gemini"][embedder_config["model"]]
             except KeyError as exc:
                 raise KeyError("Model not supported")from exc
             return GoogleGenerativeAIEmbeddings(model=embedder_config["model"])
-        elif "bedrock" in embedder_config["model"]:
+        elif "bedrock" in embedder_config["model"].split("/")[0]:
             embedder_config["model"] = embedder_config["model"].split("/")[-1]
             client = embedder_config.get('client', None)
             try:
                 models_tokens["bedrock"][embedder_config["model"]]
             except KeyError as exc:
                 raise KeyError("Model not supported") from exc
             return BedrockEmbeddings(client=client, model_id=embedder_config["model"])
-        else:
-            raise ValueError(
-                "Model provided by the configuration not supported")
-
     def get_state(self, key=None) -> dict:
         """""
         Get the final state of the graph.