
Commit fcb3abb

feat(omni-search): added omni search graph and updated docs
1 parent a296927 commit fcb3abb

File tree

9 files changed: +237 -4 lines changed


docs/assets/omniscrapergraph.png

72.2 KB

docs/assets/omnisearchgraph.png

56.7 KB

docs/source/scrapers/graph_config.rst

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ Some interesting ones are:
 - `headless`: If set to `False`, the web browser will be opened on the URL requested and close right after the HTML is fetched.
 - `max_results`: The maximum number of results to be fetched from the search engine. Useful in `SearchGraph`.
 - `output_path`: The path where the output files will be saved. Useful in `SpeechGraph`.
+- `loader_kwargs`: A dictionary with additional parameters to be passed to the `Loader` class, such as `proxy`.
+- `max_images`: The maximum number of images to be analyzed. Useful in `OmniScraperGraph` and `OmniSearchGraph`.
 
 Proxy Rotation
 ^^^^^^^^^^^^^^
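
For reference, a minimal sketch of a graph configuration that combines the options documented above. The `llm` block, the proxy server address, and the output path are placeholders, and the exact shape of the `proxy` entry inside `loader_kwargs` is an assumption, not something taken from this diff:

    # Illustrative configuration assembling the documented options; values are placeholders.
    graph_config = {
        "llm": {
            "api_key": "sk-...",          # placeholder API key
            "model": "gpt-4o",
        },
        "headless": True,                 # do not open a visible browser window
        "max_results": 3,                 # used by SearchGraph / OmniSearchGraph
        "max_images": 5,                  # used by OmniScraperGraph / OmniSearchGraph
        "loader_kwargs": {
            "proxy": {"server": "http://127.0.0.1:8080"},  # assumed proxy structure
        },
        "output_path": "audio/answer.mp3",  # used by SpeechGraph (illustrative path)
    }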

docs/source/scrapers/graphs.rst

Lines changed: 65 additions & 1 deletion
@@ -3,16 +3,80 @@ Graphs
 
 
 Graphs are scraping pipelines aimed at solving specific tasks. They are composed by nodes which can be configured individually to address different aspects of the task (fetching data, extracting information, etc.).
 
-There are currently three types of graphs available in the library:
+There are three types of graphs available in the library:
 
 - **SmartScraperGraph**: one-page scraper that requires a user-defined prompt and a URL (or local file) to extract information from using LLM.
 - **SearchGraph**: multi-page scraper that only requires a user-defined prompt to extract information from a search engine using LLM. It is built on top of SmartScraperGraph.
 - **SpeechGraph**: text-to-speech pipeline that generates an answer as well as a requested audio file. It is built on top of SmartScraperGraph and requires a user-defined prompt and a URL (or local file).
 
+With the introduction of `GPT-4o`, two new powerful graphs have been created:
+
+- **OmniScraperGraph**: similar to `SmartScraperGraph`, but with the ability to scrape images and describe them.
+- **OmniSearchGraph**: similar to `SearchGraph`, but with the ability to scrape images and describe them.
+
 .. note::
 
    They all use a graph configuration to set up LLM models and other parameters. To find out more about the configurations, check the :ref:`LLM` and :ref:`Configuration` sections.
 
+OmniScraperGraph
+^^^^^^^^^^^^^^^^
+
+.. image:: ../../assets/omniscrapergraph.png
+   :align: center
+   :width: 90%
+   :alt: OmniScraperGraph
+|
+
+First we define the graph configuration, which includes the LLM model and other parameters. Then we create an instance of the OmniScraperGraph class, passing the prompt, source, and configuration as arguments. Finally, we run the graph and print the result.
+It will fetch the data from the source and extract the information based on the prompt in JSON format.
+
+.. code-block:: python
+
+   from scrapegraphai.graphs import OmniScraperGraph
+
+   graph_config = {
+       "llm": {...},
+   }
+
+   omni_scraper_graph = OmniScraperGraph(
+       prompt="List me all the projects with their titles and image links and descriptions.",
+       source="https://perinim.github.io/projects",
+       config=graph_config
+   )
+
+   result = omni_scraper_graph.run()
+   print(result)
+
+OmniSearchGraph
+^^^^^^^^^^^^^^^
+
+.. image:: ../../assets/omnisearchgraph.png
+   :align: center
+   :width: 80%
+   :alt: OmniSearchGraph
+|
+
+Similar to OmniScraperGraph, we define the graph configuration, create an instance of the OmniSearchGraph class, and run the graph.
+It will create a search query, fetch the first n results from the search engine, run n OmniScraperGraph instances, and return the results in JSON format.
+
+.. code-block:: python
+
+   from scrapegraphai.graphs import OmniSearchGraph
+
+   graph_config = {
+       "llm": {...},
+   }
+
+   # Create the OmniSearchGraph instance
+   omni_search_graph = OmniSearchGraph(
+       prompt="List me all Chioggia's famous dishes and describe their pictures.",
+       config=graph_config
+   )
+
+   # Run the graph
+   result = omni_search_graph.run()
+   print(result)
+
 SmartScraperGraph
 ^^^^^^^^^^^^^^^^^
 

examples/openai/omni_scraper_openai.py

Lines changed: 3 additions & 2 deletions
@@ -5,7 +5,7 @@
 import os, json
 from dotenv import load_dotenv
 from scrapegraphai.graphs import OmniScraperGraph
-from scrapegraphai.utils import prettify_exec_info, convert_to_csv
+from scrapegraphai.utils import prettify_exec_info
 
 load_dotenv()
 
@@ -22,7 +22,8 @@
         "model": "gpt-4o",
     },
     "verbose": True,
-    "headless": False,
+    "headless": True,
+    "max_images": 5
 }
 
 # ************************************************

examples/openai/omni_search_openai.py

Lines changed: 45 additions & 0 deletions
"""
Example of OmniSearchGraph
"""

import os, json
from dotenv import load_dotenv
from scrapegraphai.graphs import OmniSearchGraph
from scrapegraphai.utils import prettify_exec_info
load_dotenv()

# ************************************************
# Define the configuration for the graph
# ************************************************

openai_key = os.getenv("OPENAI_APIKEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
    "max_results": 2,
    "max_images": 5,
    "verbose": True,
}

# ************************************************
# Create the OmniSearchGraph instance and run it
# ************************************************

omni_search_graph = OmniSearchGraph(
    prompt="List me all Chioggia's famous dishes and describe their pictures.",
    config=graph_config
)

result = omni_search_graph.run()
print(json.dumps(result, indent=2))

# ************************************************
# Get graph execution info
# ************************************************

graph_exec_info = omni_search_graph.get_execution_info()
print(prettify_exec_info(graph_exec_info))

scrapegraphai/graphs/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -14,3 +14,4 @@
 from .csv_scraper_graph import CSVScraperGraph
 from .pdf_scraper_graph import PDFScraperGraph
 from .omni_scraper_graph import OmniScraperGraph
+from .omni_search_graph import OmniSearchGraph

scrapegraphai/graphs/omni_scraper_graph.py

Lines changed: 2 additions & 1 deletion
@@ -29,6 +29,7 @@ class OmniScraperGraph(AbstractGraph):
             configured for generating embeddings.
         verbose (bool): A flag indicating whether to show print statements during execution.
         headless (bool): A flag indicating whether to run the graph in headless mode.
+        max_images (int): The maximum number of images to process.
 
     Args:
         prompt (str): The prompt for the graph.
@@ -48,7 +49,7 @@ class OmniScraperGraph(AbstractGraph):
     def __init__(self, prompt: str, source: str, config: dict):
 
         self.max_images = 5 if config is None else config.get("max_images", 5)
-
+
         super().__init__(prompt, config, source)
 
         self.input_key = "url" if source.startswith("http") else "local_dir"
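
The default handling added in this hunk follows a guard-then-get pattern: fall back to 5 when no config is passed at all, or when the key is missing. A small standalone sketch of that behaviour (the helper name is illustrative, not part of the library):

    def resolve_max_images(config=None):
        # Mirrors the added line: 5 without a config,
        # otherwise config.get("max_images", 5).
        return 5 if config is None else config.get("max_images", 5)

    assert resolve_max_images(None) == 5               # no config at all
    assert resolve_max_images({}) == 5                 # key not set
    assert resolve_max_images({"max_images": 10}) == 10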

scrapegraphai/graphs/omni_search_graph.py

Lines changed: 119 additions & 0 deletions
"""
OmniSearchGraph Module
"""

from copy import deepcopy

from .base_graph import BaseGraph
from ..nodes import (
    SearchInternetNode,
    GraphIteratorNode,
    MergeAnswersNode
)
from .abstract_graph import AbstractGraph
from .omni_scraper_graph import OmniScraperGraph


class OmniSearchGraph(AbstractGraph):
    """
    OmniSearchGraph is a scraping pipeline that searches the internet for answers to a given prompt.
    It only requires a user prompt to search the internet and generate an answer.

    Attributes:
        prompt (str): The user prompt to search the internet.
        llm_model (dict): The configuration for the language model.
        embedder_model (dict): The configuration for the embedder model.
        headless (bool): A flag to run the browser in headless mode.
        verbose (bool): A flag to display the execution information.
        model_token (int): The token limit for the language model.
        max_results (int): The maximum number of results to return.

    Args:
        prompt (str): The user prompt to search the internet.
        config (dict): Configuration parameters for the graph.

    Example:
        >>> omni_search_graph = OmniSearchGraph(
        ...     "What is Chioggia famous for?",
        ...     {"llm": {"model": "gpt-4o"}}
        ... )
        >>> result = omni_search_graph.run()
    """

    def __init__(self, prompt: str, config: dict):

        self.max_results = config.get("max_results", 3)
        self.copy_config = deepcopy(config)

        super().__init__(prompt, config)

    def _create_graph(self) -> BaseGraph:
        """
        Creates the graph of nodes representing the workflow for web scraping and searching.

        Returns:
            BaseGraph: A graph instance representing the web scraping and searching workflow.
        """

        # ************************************************
        # Create an OmniScraperGraph instance
        # ************************************************

        omni_scraper_instance = OmniScraperGraph(
            prompt="",
            source="",
            config=self.copy_config
        )

        # ************************************************
        # Define the graph nodes
        # ************************************************

        search_internet_node = SearchInternetNode(
            input="user_prompt",
            output=["urls"],
            node_config={
                "llm_model": self.llm_model,
                "max_results": self.max_results
            }
        )
        graph_iterator_node = GraphIteratorNode(
            input="user_prompt & urls",
            output=["results"],
            node_config={
                "graph_instance": omni_scraper_instance,
            }
        )

        merge_answers_node = MergeAnswersNode(
            input="user_prompt & results",
            output=["answer"],
            node_config={
                "llm_model": self.llm_model,
            }
        )

        return BaseGraph(
            nodes=[
                search_internet_node,
                graph_iterator_node,
                merge_answers_node
            ],
            edges=[
                (search_internet_node, graph_iterator_node),
                (graph_iterator_node, merge_answers_node)
            ],
            entry_point=search_internet_node
        )

    def run(self) -> str:
        """
        Executes the web scraping and searching process.

        Returns:
            str: The answer to the prompt.
        """
        inputs = {"user_prompt": self.prompt}
        self.final_state, self.execution_info = self.graph.execute(inputs)

        return self.final_state.get("answer", "No answer found.")
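
To make the wiring in `_create_graph` easier to follow, here is a minimal, self-contained sketch of how the declared inputs and outputs chain through a shared state: `user_prompt` feeds SearchInternetNode, the resulting `urls` drive one OmniScraperGraph run each via GraphIteratorNode, and MergeAnswersNode folds the per-page `results` into a single `answer`. This is not the library's BaseGraph implementation, and every value below is a placeholder:

    # Simplified stand-in for the execution flow; no real search or scraping happens here.
    state = {"user_prompt": "What is Chioggia famous for?"}

    # SearchInternetNode: "user_prompt" -> ["urls"]
    state["urls"] = ["https://example.com/chioggia", "https://example.org/veneto"]  # placeholder hits

    # GraphIteratorNode: "user_prompt & urls" -> ["results"], one scraper run per URL
    state["results"] = [
        {"source": url, "content": f"scraped summary of {url}"}  # placeholder per-page output
        for url in state["urls"]
    ]

    # MergeAnswersNode: "user_prompt & results" -> ["answer"]
    state["answer"] = {"merged_answer": [r["content"] for r in state["results"]]}

    print(state.get("answer", "No answer found."))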
