Commit 86b0afc

Merge pull request #41 from TencentCloudADP/Lightblues/issue36
Feat: customize search & crawl backend
2 parents ad6b9ff + 6ff6ae1 commit 86b0afc

16 files changed (+449 −86 lines)

configs/agents/tools/search.yaml

Lines changed: 11 additions & 2 deletions

@@ -2,9 +2,18 @@ name: search
 mode: builtin
 activated_tools: null
 config:
+  # search config
+  # - `JINA_API_KEY` is required for jina. Ref: https://jina.ai/
+  # - `SERPER_API_KEY` is required for google. Ref: https://serper.dev/
+  search_engine: google # google | jina | baidu | duckduckgo
+  search_params: {"gl": "cn", "hl": "zh-cn"} # search params for google & jina
+  search_banned_sites: []
+  # crawl config
+  # - `JINA_API_KEY` is required for jina
+  # - `crawl4ai` and `playwright` should be installed for crawl4ai. Ref: https://github.com/unclecode/crawl4ai
+  crawl_engine: jina # jina | crawl4ai
+  # llm config used in web_qa
   summary_token_limit: 10_000
-  SERPER_API_KEY: ${oc.env:SERPER_API_KEY}
-  JINA_API_KEY: ${oc.env:JINA_API_KEY}
   config_llm:
     model_provider:
       type: ${oc.env:UTU_LLM_TYPE}
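For reference, a minimal sketch of exercising a toolkit configured as above, following the pattern in tests/tools/test_search_toolkit.py further down in this diff; constructing SearchToolkit directly from the loaded config is an assumption, and the query string is only an example:

    import asyncio

    from utu.config import ConfigLoader
    from utu.tools import SearchToolkit

    async def main():
        # loads configs/agents/tools/search.yaml; search_engine / crawl_engine come from the block above
        config = ConfigLoader.load_toolkit_config("search")
        toolkit = SearchToolkit(config)
        # unified `search` entry point (renamed from `search_google_api` in this PR)
        print(await toolkit.search("Youtu-agent framework", num_results=10))

    asyncio.run(main())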

configs/eval/ww.yaml

Lines changed: 6 additions & 2 deletions

@@ -66,9 +66,13 @@ agent:
     mode: builtin
     activated_tools: null
     config:
+      search_engine: google # google | jina | baidu | duckduckgo
+      search_params: {"gl": "cn", "hl": "zh-cn"} # search params for google & jina
+      # https://huggingface.co/datasets/callanwu/WebWalkerQA
+      # https://huggingface.co/spaces/dobval/WebThinker
+      search_banned_sites: ["https://huggingface.co/", "https://grok.com/share/", "https://modelscope.cn/datasets/"]
+      crawl_engine: jina # jina | crawl4ai
       summary_token_limit: 10_000
-      SERPER_API_KEY: ${oc.env:SERPER_API_KEY}
-      JINA_API_KEY: ${oc.env:JINA_API_KEY}
       config_llm:
         model_provider:
           type: ${oc.env:UTU_LLM_TYPE}

docs/tools.md

Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ Here is a summary of some key toolkits available in the framework:

 | Toolkit Class | Provided Tools (Functions) | Core Functionality & Mechanism |
 | :--- | :--- | :--- |
-| **[SearchToolkit][utu.tools.search_toolkit.SearchToolkit]** | `search_google_api`, `web_qa` | Performs web searches using the Serper API and reads webpage content using the Jina API. It can use an LLM to answer questions based on page content. |
+| **[SearchToolkit][utu.tools.search_toolkit.SearchToolkit]** | `search`, `web_qa` | Performs web searches using the Serper API and reads webpage content using the Jina API. It can use an LLM to answer questions based on page content. |
 | **[DocumentToolkit][utu.tools.document_toolkit.DocumentToolkit]** | `document_qa` | Processes local or remote documents (PDF, DOCX, etc.). It uses the `chunkr.ai` service to parse the document and an LLM to answer questions or provide a summary. |
 | **[PythonExecutorToolkit][utu.tools.python_executor_toolkit.PythonExecutorToolkit]** | `execute_python_code` | Executes Python code snippets in an isolated environment using `IPython.core.interactiveshell`. It runs in a separate thread to prevent blocking and can capture outputs, errors, and even `matplotlib` plots. |
 | **[BashToolkit][utu.tools.bash_toolkit.BashToolkit]** | `run_bash` | Provides a persistent local shell session using the `pexpect` library. This allows the agent to run a series of commands that maintain state (e.g., current directory). |
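A short sketch of the two tools named in the updated row, using query/URL pairs taken from tests/tools/test_search_toolkit.py; the exact `web_qa` signature is inferred from the (url, question) tuples in that test file and may differ in the actual implementation:

    # inside an async context, with `search_toolkit` built as in the tests
    snippets = await search_toolkit.search("python-dotenv usage", num_results=5)
    answer = await search_toolkit.web_qa("https://github.com/theskumar/python-dotenv", "Summary this page")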

examples/wide_research/prompts.yaml

Lines changed: 1 addition & 1 deletion

@@ -3,7 +3,7 @@ planner: |

   You should obey the following workflow:
   1. **Clarify the user query**: investigate the user's query carefully, and figure out the subtasks.
-     - Use the "search_google_api" when you need to gather background information from the web.
+     - Use the "search" when you need to gather background information from the web.
      - The returned snippet is very simple, so use the "web_qa" to collect detailed information from a specific webpage.
   2. **Collect information parallelly**:
      - use the "search_wide" tool to collect structured information from the web.

tests/models/test_react_standalone.py

Lines changed: 4 additions & 4 deletions

@@ -34,10 +34,10 @@
             },
         },
     },
-    "search_google_api": {
+    "search": {
         "type": "function",
         "function": {
-            "name": "search_google_api",
+            "name": "search",
             "description": "Search the query via Google api, the query should be a search query like humans search in Google, concrete and not vague or super long. More the single most important items.",  # pylint: disable=line-too-long
             "parameters": {
                 "type": "object",
@@ -86,7 +86,7 @@
         {
             "id": "0",
             "type": "function",
-            "function": {"name": "search_google_api", "arguments": str({"query": "smolagents package"})},
+            "function": {"name": "search", "arguments": str({"query": "smolagents package"})},
         }
     ],
@@ -118,7 +118,7 @@
             ),
         },
     ],
-    "tools": [tools["search_google_api"], tools["web_qa"]],
+    "tools": [tools["search"], tools["web_qa"]],
     }
 ]

tests/tools/test_search_toolkit.py

Lines changed: 44 additions & 11 deletions

@@ -1,12 +1,43 @@
-import hashlib
 import json

 import pytest

 from utu.config import ConfigLoader
 from utu.tools import SearchToolkit
+from utu.tools.search.baidu_search import BaiduSearch
+from utu.tools.search.crawl4ai_crawl import Crawl4aiCrawl
+from utu.tools.search.duckduckgo_search import DuckDuckGoSearch
+from utu.tools.search.google_search import GoogleSearch
+from utu.tools.search.jina_crawl import JinaCrawl
+from utu.tools.search.jina_search import JinaSearch
+
+
+# ----------------------------------------------------------------------------
+async def test_baidu_search():
+    baidu_search = BaiduSearch()
+    result = await baidu_search.search_baidu("上海天气")
+    print(result)
+
+
+async def test_google_search():
+    google_search = GoogleSearch()
+    result = await google_search.search_google("上海天气")
+    print(result)
+

+async def test_jina_search():
+    jina_search = JinaSearch()
+    result = await jina_search.search_jina("明天上海天气")
+    print(result)

+
+async def test_duckduckgo_search():
+    duckduckgo_search = DuckDuckGoSearch()
+    result = await duckduckgo_search.search_duckduckgo("明天上海天气")
+    print(result)
+
+
+# ----------------------------------------------------------------------------
 @pytest.fixture
 def search_toolkit() -> SearchToolkit:
     config = ConfigLoader.load_toolkit_config("search")
@@ -25,28 +56,30 @@ async def test_tool_schema(search_toolkit: SearchToolkit):
 TEST_QUERY = "南京工业大学计算机与信息工程学院 更名 报道"


-async def test_search_google_api(search_toolkit: SearchToolkit):
-    result = await search_toolkit.search_google_api(TEST_QUERY, num_results=10)
+async def test_search(search_toolkit: SearchToolkit):
+    result = await search_toolkit.search(TEST_QUERY, num_results=10)
     print(result)


+# ----------------------------------------------------------------------------
 TEST_URL = "https://docs.crawl4ai.com/core/simple-crawling/"


-async def test_get_content(search_toolkit: SearchToolkit):
-    result = await search_toolkit.get_content(TEST_URL)
+async def test_jina_crawl():
+    jina_crawl = JinaCrawl()
+    result = await jina_crawl.crawl(TEST_URL)
     print(result)


-async def test_cache(search_toolkit: SearchToolkit):
-    for _ in range(2):
-        res = await search_toolkit.get_content(TEST_URL)
-        hash = hashlib.md5(res.encode()).hexdigest()
-        print(hash)
+async def test_crawl4ai_crawl():
+    crawl4ai_crawl = Crawl4aiCrawl()
+    result = await crawl4ai_crawl.crawl(TEST_URL)
+    print(result)


+# ----------------------------------------------------------------------------
 queries = (
-    ("https://docs.crawl4ai.com/core/simple-crawling/", ""),
+    ("https://github.com/TencentCloudADP/Youtu-agent", ""),
     ("https://docs.crawl4ai.com/core/simple-crawling/", "How to log?"),
     ("https://github.com/theskumar/python-dotenv", "Summary this page"),
 )

utu/tools/search/baidu_search.py

Lines changed: 84 additions & 0 deletions

@@ -0,0 +1,84 @@
+import aiohttp
+from bs4 import BeautifulSoup
+
+from ...utils import get_logger
+from ..utils import ContentFilter
+
+logger = get_logger(__name__)
+
+
+class BaiduSearch:
+    """Baidu Search."""
+
+    def __init__(self, config: dict = None) -> None:
+        self.url = "https://www.baidu.com/s"
+        self.headers = {
+            "User-Agent": (
+                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
+                "AppleWebKit/537.36 (KHTML, like Gecko) "
+                "Chrome/120.0.0.0 Safari/537.36"
+            ),
+            "Referer": "https://www.baidu.com",
+        }
+        config = config or {}
+        search_banned_sites = config.get("search_banned_sites", [])
+        self.content_filter = ContentFilter(search_banned_sites) if search_banned_sites else None
+
+    async def search(self, query: str, num_results: int = 5) -> str:
+        """standard search interface."""
+        res = await self.search_baidu(query)
+        # filter
+        if self.content_filter:
+            results = self.content_filter.filter_results(res["data"], num_results, key="url")
+        else:
+            results = res["data"][:num_results]
+        # format
+        formatted_results = []
+        for i, r in enumerate(results, 1):
+            formatted_results.append(f"{i}. {r['title']} ({r['url']})")
+            if "description" in r:
+                formatted_results[-1] += f"\ndescription: {r['description']}"
+        msg = "\n".join(formatted_results)
+        return msg
+
+    # @async_file_cache(expire_time=None)
+    async def search_baidu(self, query: str) -> dict:
+        """Search Baidu using web scraping to retrieve relevant search results.
+
+        - WARNING: Uses web scraping which may be subject to rate limiting or anti-bot measures.
+
+        Returns:
+            Example result:
+            {
+                'result_id': 1,
+                'title': '百度百科',
+                'description': '百度百科是一部内容开放、自由的网络百科全书...',
+                'url': 'https://baike.baidu.com/'
+            }
+        """
+        params = {"wd": query, "rn": "20"}
+        async with aiohttp.ClientSession() as session:
+            async with session.get(self.url, headers=self.headers, params=params) as response:
+                response.raise_for_status()  # avoid cache error!
+                results = await response.text(encoding="utf-8")
+
+        soup = BeautifulSoup(results, "html.parser")
+        results = []
+        for idx, item in enumerate(soup.select(".result"), 1):
+            title_element = item.select_one("h3 > a")
+            title = title_element.get_text(strip=True) if title_element else ""
+            link = title_element["href"] if title_element else ""
+            desc_element = item.select_one(".c-abstract, .c-span-last")
+            desc = desc_element.get_text(strip=True) if desc_element else ""
+
+            results.append(
+                {
+                    "result_id": idx,
+                    "title": title,
+                    "description": desc,
+                    "url": link,
+                }
+            )
+        if len(results) == 0:
+            logger.warning(f"No results found from Baidu search: {query}")
+        return {"data": results}
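A quick usage sketch for the new backend, mirroring test_baidu_search in the test diff above; the query is just an example, and results depend on Baidu's live HTML:

    import asyncio

    from utu.tools.search.baidu_search import BaiduSearch

    async def main():
        baidu = BaiduSearch()
        # raw scraped results: {"data": [{"result_id", "title", "description", "url"}, ...]}
        print(await baidu.search_baidu("上海天气"))
        # formatted top-5 string via the standard `search` interface (banned-site filtering applies if configured)
        print(await baidu.search("上海天气", num_results=5))

    asyncio.run(main())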

utu/tools/search/crawl4ai_crawl.py

Lines changed: 32 additions & 0 deletions

@@ -0,0 +1,32 @@
+try:
+    from crawl4ai import AsyncWebCrawler
+except ImportError as e:
+    raise ImportError(
+        "Please install crawl4ai: `uv pip install crawl4ai && python -m playwright install --with-deps chromium`"
+    ) from e  # noqa: E501
+from ...utils import async_file_cache, get_logger
+
+logger = get_logger(__name__)
+
+
+class Crawl4aiCrawl:
+    """Crawl4ai Crawl.
+
+    - repo: https://github.com/unclecode/crawl4ai
+    """
+
+    def __init__(self, config: dict = None) -> None:
+        config = config or {}
+
+    async def crawl(self, url: str) -> str:
+        """standard crawl interface."""
+        return await self.crawl_crawl4ai(url)
+
+    @async_file_cache(expire_time=None)
+    async def crawl_crawl4ai(self, url: str) -> str:
+        # Get the content of the url
+        async with AsyncWebCrawler() as crawler:
+            result = await crawler.arun(
+                url=url,
+            )
+        return result.markdown
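A usage sketch for the crawl4ai backend (crawl4ai and the Playwright Chromium browser must be installed, as the ImportError message above notes); the URL is the one used in the tests:

    import asyncio

    from utu.tools.search.crawl4ai_crawl import Crawl4aiCrawl

    async def main():
        crawler = Crawl4aiCrawl()
        # returns the page rendered as markdown; results are cached on disk via async_file_cache
        markdown = await crawler.crawl("https://docs.crawl4ai.com/core/simple-crawling/")
        print(markdown[:500])

    asyncio.run(main())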

utu/tools/search/duckduckgo_search.py

Lines changed: 51 additions & 0 deletions

@@ -0,0 +1,51 @@
+try:
+    from ddgs import DDGS
+except ImportError as e:
+    raise ImportError("Please install ddgs first: `uv pip install ddgs`") from e
+from ...utils import get_logger
+from ..utils import ContentFilter
+
+logger = get_logger(__name__)
+
+
+class DuckDuckGoSearch:
+    """DuckDuckGo Search.
+
+    - repo: https://github.com/deedy5/ddgs
+    """
+
+    def __init__(self, config: dict = None) -> None:
+        self.ddgs = DDGS()
+        config = config or {}
+        search_banned_sites = config.get("search_banned_sites", [])
+        self.content_filter = ContentFilter(search_banned_sites) if search_banned_sites else None
+
+    async def search(self, query: str, num_results: int = 5) -> str:
+        """standard search interface."""
+        res = await self.search_duckduckgo(query)
+        # filter
+        if self.content_filter:
+            results = self.content_filter.filter_results(res, num_results, key="href")
+        else:
+            results = res[:num_results]
+        # format
+        formatted_results = []
+        for i, r in enumerate(results, 1):
+            formatted_results.append(f"{i}. {r['title']} ({r['href']})")
+            if "body" in r:
+                formatted_results[-1] += f"\nbody: {r['body']}"
+        msg = "\n".join(formatted_results)
+        return msg
+
+    async def search_duckduckgo(self, query: str) -> list:
+        """Use DuckDuckGo search engine to search for information on the given query.
+
+        Returns:
+            [{
+                "title": ...
+                "href": ...
+                "body": ...
+            }]
+        """
+        results = self.ddgs.text(query, max_results=100)
+        return results
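A usage sketch (the ddgs package must be installed, per the ImportError above); the banned-site entry is purely illustrative:

    import asyncio

    from utu.tools.search.duckduckgo_search import DuckDuckGoSearch

    async def main():
        ddg = DuckDuckGoSearch({"search_banned_sites": ["https://example.com/"]})
        # formatted, filtered top-5 results; search_duckduckgo() returns the raw list of result dicts
        print(await ddg.search("明天上海天气", num_results=5))

    asyncio.run(main())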

utu/tools/search/google_search.py

Lines changed: 50 additions & 0 deletions

@@ -0,0 +1,50 @@
+import aiohttp
+
+from ...utils import EnvUtils, async_file_cache, get_logger
+from ..utils import ContentFilter
+
+logger = get_logger(__name__)
+
+
+class GoogleSearch:
+    """Google Search.
+
+    - API key: `SERPER_API_KEY`
+    """
+
+    def __init__(self, config: dict = None) -> None:
+        self.serper_url = r"https://google.serper.dev/search"
+        self.serper_header = {"X-API-KEY": EnvUtils.get_env("SERPER_API_KEY"), "Content-Type": "application/json"}
+        config = config or {}
+        self.search_params = config.get("search_params", {})
+        search_banned_sites = config.get("search_banned_sites", [])
+        self.content_filter = ContentFilter(search_banned_sites) if search_banned_sites else None
+
+    async def search(self, query: str, num_results: int = 5) -> str:
+        """standard search interface."""
+        res = await self.search_google(query)
+        # filter
+        if self.content_filter:
+            results = self.content_filter.filter_results(res["organic"], num_results)
+        else:
+            results = res["organic"][:num_results]
+        # format
+        formatted_results = []
+        for i, r in enumerate(results, 1):
+            formatted_results.append(f"{i}. {r['title']} ({r['link']})")
+            if "snippet" in r:
+                formatted_results[-1] += f"\nsnippet: {r['snippet']}"
+            if "sitelinks" in r:
+                formatted_results[-1] += f"\nsitelinks: {r['sitelinks']}"
+        msg = "\n".join(formatted_results)
+        return msg
+
+    @async_file_cache(expire_time=None)
+    async def search_google(self, query: str) -> dict:
+        """Call the serper.dev API and cache the results."""
+        params = {"q": query, **self.search_params, "num": 100}
+        async with aiohttp.ClientSession() as session:
+            async with session.post(self.serper_url, headers=self.serper_header, json=params) as response:
+                response.raise_for_status()  # avoid cache error!
+                results = await response.json()
+        return results
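A usage sketch for the serper.dev backend (SERPER_API_KEY must be set in the environment, per the class docstring); the search_params value mirrors the YAML defaults shown earlier, and the query is only an example:

    import asyncio

    from utu.tools.search.google_search import GoogleSearch

    async def main():
        google = GoogleSearch({"search_params": {"gl": "cn", "hl": "zh-cn"}})
        # formatted top-5 results; search_google() returns the raw serper.dev JSON (cached on disk)
        print(await google.search("Youtu-agent framework", num_results=5))

    asyncio.run(main())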
