Commit 88c0528
feat(benchmark): add xbench-ds prep support and add_new_tools doc (#37)

* add new tools doc, support xbench-ds benchmark preparation
* docs(prepare-benchmark): add xbench-ds

1 parent 5e1bedd commit 88c0528

File tree

8 files changed: +130 -3 lines changed
Lines changed: 71 additions & 2 deletions

The `# - Coming Soon -` stub is replaced with the full guide:

````markdown
# Adding New Tools

## What This Does

Extend the agent’s functionality by introducing a new tool. Each tool is implemented as an MCP server and registered via configuration.

## Implementation Steps

### 1. Create MCP Server

Create a new file `src/tool/mcp_servers/new_mcp_server.py` (underscores — hyphenated file names cannot be imported as Python modules) that implements the tool’s core logic.

```python
from fastmcp import FastMCP

# Initialize FastMCP server
mcp = FastMCP("new-mcp-server")

@mcp.tool()
async def tool_name(param: str) -> str:
    """
    Explanation of the tool, its parameters, and return value.
    """
    tool_result = ...  # Your logic here
    return tool_result

if __name__ == "__main__":
    mcp.run(transport="stdio")
```

> Tool schemas are generated automatically from docstrings and type hints by FastMCP.

### 2. Create Tool Config

Add a new config file at `config/tools/new-tool-name.yaml`:

```yaml
name: "new-tool-name"
tool_command: "python"
args:
  - "-m"
  - "src.tool.mcp_servers.new_mcp_server"  # Match the server file created above
```

### 3. Register Tool in Agent Config

Enable the new tool inside your agent config (e.g., `config/agent-with-new-tool.yaml`):

```yaml
main_agent:
  ...
  tool_config:
    - tool-reasoning
    - new-tool-name  # 👈 Add your new tool here
  ...
sub_agents:
  agent-worker:
    ...
    tool_config:
      - tool-searching
      - tool-image-video
      - tool-reading
      - tool-code
      - tool-audio
      - new-tool-name  # 👈 Add your new tool here
  ...
```

## Examples

- `tool-reasoning` – reasoning utilities
- `tool-image-video` – visual understanding
- `new-tool-name` – your custom tool

---

**Last Updated:** Sep 2025

**Doc Contributor:** Team @ MiroMind AI
````
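The guide notes that FastMCP derives tool schemas from docstrings and type hints. As a rough, self-contained illustration of that idea — not FastMCP's actual implementation, and with a hypothetical `build_param_schema` helper — a parameter schema can be assembled from a function signature with the standard library alone:

```python
import inspect
from typing import get_type_hints


def build_param_schema(func):
    """Build a minimal JSON-Schema-like dict from a function's
    signature and type hints (illustration only)."""
    type_map = {str: "string", int: "integer", float: "number", bool: "boolean"}
    hints = get_type_hints(func)
    # One schema entry per declared parameter, typed via its annotation
    properties = {
        name: {"type": type_map.get(hints.get(name), "string")}
        for name in inspect.signature(func).parameters
    }
    return {
        "description": inspect.getdoc(func),
        "properties": properties,
        "required": list(properties),
    }


async def tool_name(param: str) -> str:
    """Explanation of the tool."""
    return param


schema = build_param_schema(tool_name)
print(schema["properties"])  # {'param': {'type': 'string'}}
```

This is why the docstring and the `param: str` annotation in the server file matter: they are the only schema the agent ever sees.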

docs/mkdocs/docs/download_datasets.md

Lines changed: 2 additions & 0 deletions

````diff
@@ -78,6 +78,7 @@ uv run main.py prepare-benchmark get webwalkerqa
 uv run main.py prepare-benchmark get browsecomp-test
 uv run main.py prepare-benchmark get browsecomp-zh-test
 uv run main.py prepare-benchmark get hle
+uv run main.py prepare-benchmark get xbench-ds
 ```

 ### What This Script Does
@@ -92,6 +93,7 @@ uv run main.py prepare-benchmark get hle
 - `browsecomp-test` - English BrowseComp test set
 - `browsecomp-zh-test` - Chinese BrowseComp test set
 - `hle` - HLE dataset
+- `xbench-ds` - xbench-DeepSearch dataset

 ### Customizing Dataset Selection
````
scripts/run_evaluate_multiple_runs_nohintreason_hle.sh

Lines changed: 1 addition & 0 deletions

```diff
@@ -48,6 +48,7 @@ for i in $(seq 1 $NUM_RUNS); do
 benchmark.execution.max_tasks=null \
 benchmark.execution.max_concurrent=$MAX_CONCURRENT \
 benchmark.execution.pass_at_k=1 \
+output_dir="$RESULTS_DIR/$RUN_ID" \
 hydra.run.dir=${RESULTS_DIR}/$RUN_ID \
 > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1
```

scripts/run_evaluate_multiple_runs_nosandbox_gaia-validation.sh

Lines changed: 1 addition & 0 deletions

```diff
@@ -48,6 +48,7 @@ for i in $(seq 1 $NUM_RUNS); do
 benchmark.execution.max_tasks=null \
 benchmark.execution.max_concurrent=$MAX_CONCURRENT \
 benchmark.execution.pass_at_k=1 \
+output_dir="$RESULTS_DIR/$RUN_ID" \
 hydra.run.dir=${RESULTS_DIR}/$RUN_ID \
 > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1
```
5354

scripts/run_evaluate_multiple_runs_xbench-ds.sh

Lines changed: 1 addition & 0 deletions

```diff
@@ -48,6 +48,7 @@ for i in $(seq 1 $NUM_RUNS); do
 benchmark.execution.max_tasks=null \
 benchmark.execution.max_concurrent=$MAX_CONCURRENT \
 benchmark.execution.pass_at_k=1 \
+output_dir="$RESULTS_DIR/$RUN_ID" \
 hydra.run.dir=${RESULTS_DIR}/$RUN_ID \
 > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1
```
5354

scripts/run_prepare_benchmark.sh

Lines changed: 2 additions & 1 deletion

```diff
@@ -19,4 +19,5 @@ uv run main.py prepare-benchmark get frames-test
 uv run main.py prepare-benchmark get webwalkerqa
 uv run main.py prepare-benchmark get browsecomp-test
 uv run main.py prepare-benchmark get browsecomp-zh-test
-uv run main.py prepare-benchmark get hle
+uv run main.py prepare-benchmark get hle
+uv run main.py prepare-benchmark get xbench-ds
```
utils/prepare_benchmark/gen_xbench_ds.py

Lines changed: 43 additions & 0 deletions

New file:

```python
# SPDX-FileCopyrightText: 2025 MiromindAI
#
# SPDX-License-Identifier: Apache-2.0

import base64
from typing import Generator, MutableMapping

from datasets import load_dataset

from utils.prepare_benchmark.common import Task


def xor_decrypt(data, key):
    """
    XOR decrypt data with a key
    """
    key_bytes = key.encode('utf-8')
    key_length = len(key_bytes)
    return bytes([data[i] ^ key_bytes[i % key_length] for i in range(len(data))])


def gen_xbench_ds(hf_token: str) -> Generator[Task, None, None]:
    dataset = load_dataset(
        "xbench/DeepSearch",
        split="train",
    )
    for x in dataset:
        metadata: MutableMapping = x  # type: ignore
        task_id = metadata.pop("id")

        # Each row's fields are XOR-encrypted with its "canary" key, then base64-encoded
        key = metadata.pop("canary")
        prompt = xor_decrypt(base64.b64decode(metadata.pop("prompt")), key).decode('utf-8')
        answer = xor_decrypt(base64.b64decode(metadata.pop("answer")), key).decode('utf-8')
        reference_steps = xor_decrypt(base64.b64decode(metadata.pop("reference_steps")), key).decode('utf-8')
        task = Task(
            task_id=task_id,
            task_question=prompt,
            ground_truth=answer,
            file_path=None,
            metadata={"reference_steps": reference_steps},
        )
        yield task
```
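Because XOR with a repeating key is its own inverse, the same `xor_decrypt` routine also encrypts. A stand-alone round-trip check — with a made-up key and plaintext, independent of the actual dataset — shows why the decode path above recovers the original fields:

```python
import base64


def xor_decrypt(data, key):
    """XOR data with a repeating key; XOR is its own inverse."""
    key_bytes = key.encode('utf-8')
    key_length = len(key_bytes)
    return bytes([data[i] ^ key_bytes[i % key_length] for i in range(len(data))])


key = "example-canary"         # hypothetical key
plaintext = "Who wrote SICP?"  # hypothetical prompt

# "Encrypt" by XOR-ing, then base64-encode, as the dataset rows are stored
encoded = base64.b64encode(xor_decrypt(plaintext.encode('utf-8'), key))

# Decoding reverses both steps exactly
decoded = xor_decrypt(base64.b64decode(encoded), key).decode('utf-8')
print(decoded)  # Who wrote SICP?
```

The canary scheme keeps the plaintext questions and answers out of web crawls while remaining trivially reversible for evaluation.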

utils/prepare_benchmark/main.py

Lines changed: 9 additions & 0 deletions

```diff
@@ -17,6 +17,7 @@
 from utils.prepare_benchmark.gen_gaia_text_only import gen_gaia_text_only
 from utils.prepare_benchmark.gen_hle import gen_hle_test
 from utils.prepare_benchmark.gen_webwalkerqa import gen_webwalkerqa
+from utils.prepare_benchmark.gen_xbench_ds import gen_xbench_ds


 @dataclasses.dataclass
@@ -29,6 +30,7 @@ class _Env:
         "browsecomp-test",
         "browsecomp-zh-test",
         "hle",
+        "xbench-ds",
     )
     meta_filename = "standardized_data.jsonl"
     data_dir: pathlib.Path
@@ -99,6 +101,13 @@ def gen():
             for x in gen_hle_test(env.hf_token, env.data_dir):
                 yield x

+            return gen
+        case "xbench-ds":
+
+            def gen():
+                for x in gen_xbench_ds(env.hf_token):
+                    yield x
+
             return gen
         case _:
             raise ValueError("not supported")
```
