
Commit 8b67b83

feat(xbench-ds): add xbench-ds evaluation and related docs (#48)
* add new tools doc, support xbench-ds benchmark preparation
* docs(prepare-benchmark): add xbench-ds
* make doc clearer
* feat(xbench-ds): add xbench-ds evaluation and related docs
* fix bug: xbench-ds doc
* update xbench-ds docs
* update xbench-ds docs, fix small read_file bug, set llm_as_judge temp to 0
1 parent a414013 commit 8b67b83

File tree

11 files changed: +518 −225 lines changed

config/agent_prompts/main_gaia.py

Lines changed: 0 additions & 185 deletions
This file was deleted.

config/agent_xbench-ds.yaml

Lines changed: 75 additions & 0 deletions
```yaml
defaults:
  - benchmark: xbench-ds
  - override hydra/job_logging: none
  - _self_ # Allow defining variables at the top of this file


main_agent:
  prompt_class: MainAgentPrompt_GAIA
  llm:
    provider_class: "ClaudeOpenRouterClient"
    model_name: "anthropic/claude-3.7-sonnet"
    async_client: true
    temperature: 0.3
    top_p: 0.95
    min_p: 0.0
    top_k: -1
    max_tokens: 32000
    openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
    openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
    openrouter_provider: "anthropic"
    disable_cache_control: false
    keep_tool_result: -1
    oai_tool_thinking: false

  tool_config:
    - tool-reasoning

  max_turns: -1 # Maximum number of turns for main agent execution
  max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn

  input_process:
    o3_hint: true
  output_process:
    o3_final_answer: true

  openai_api_key: "${oc.env:OPENAI_API_KEY,???}" # used for o3 hints and final answer extraction
  add_message_id: true
  keep_tool_result: -1
  chinese_context: "true"


sub_agents:
  agent-worker:
    prompt_class: SubAgentWorkerPrompt
    llm:
      provider_class: "ClaudeOpenRouterClient"
      model_name: "anthropic/claude-3.7-sonnet"
      async_client: true
      temperature: 0.3
      top_p: 0.95
      min_p: 0.0
      top_k: -1
      max_tokens: 32000
      openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
      openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
      openrouter_provider: "anthropic"
      disable_cache_control: false
      keep_tool_result: -1
      oai_tool_thinking: false

    tool_config:
      - tool-searching
      - tool-image-video
      - tool-reading
      - tool-code
      - tool-audio

    max_turns: -1 # Maximum number of turns for sub-agent execution
    max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn


# Can define some top-level or default parameters here
output_dir: logs/
data_dir: "${oc.env:DATA_DIR,data}" # Points to where data is stored
```

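The `${oc.env:NAME,default}` entries in the config above are OmegaConf-style environment-variable resolvers: the variable's value when set, the literal default otherwise, with `???` conventionally marking a required key. A minimal stdlib sketch of those semantics (illustration only; OmegaConf's real resolver is more general, and the error-on-`???` behavior is an assumption here):

```python
import os
import re

# Matches "${oc.env:NAME,default}" placeholders as used in the YAML above.
_PATTERN = re.compile(r"\$\{oc\.env:([A-Z0-9_]+),([^}]*)\}")

def resolve_env(value: str) -> str:
    """Resolve env placeholders: env var if set, else the default.
    A "???" default is treated as "required", so a missing variable errors."""
    def _sub(m: re.Match) -> str:
        name, default = m.group(1), m.group(2)
        got = os.environ.get(name, default)
        if got == "???":
            raise ValueError(f"Missing required environment variable: {name}")
        return got
    return _PATTERN.sub(_sub, value)

os.environ["OPENROUTER_BASE_URL"] = "https://example.test/v1"
print(resolve_env("${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"))
# → https://example.test/v1 (falls back to the default when the variable is unset)
```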
config/benchmark/xbench-ds.yaml

Lines changed: 16 additions & 0 deletions
```yaml
# config/benchmark/xbench-ds.yaml
defaults:
  - default
  - _self_

name: "xbench-ds"

data:
  data_dir: "${data_dir}/xbench-ds"

execution:
  max_tasks: null # null means no limit
  max_concurrent: 10
  pass_at_k: 1

openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```

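`pass_at_k` names the pass@k metric. If the harness follows the standard unbiased pass@k estimator (an assumption; this commit does not show the metric's implementation), the computation looks like this sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n attempts of which c are correct,
    is correct, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect attempts to fill k samples
    return 1.0 - comb(n - c, k) / comb(n, k)

# With pass_at_k: 1 and a single attempt per task, this reduces to plain accuracy.
print(pass_at_k(1, 1, 1))  # → 1.0
```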
docs/mkdocs/docs/xbench_ds.md

Lines changed: 116 additions & 0 deletions
# xbench-DeepSearch

The **xbench** benchmark is an evaluation framework designed to measure both the intelligence frontier and the real-world utility of AI agents. It consists of complementary tracks that test core model capabilities such as reasoning, tool use, and memory, alongside workflows grounded in business and professional settings. Its **DeepSearch** sub-track measures an agent's ability to conduct open-domain information retrieval, combining fact finding, comparison, and synthesis through multi-step search and tool use.

See more details at the [xbench official website](https://xbench.org/agi/aisearch) and the [xbench-DeepSearch Eval Card](https://xbench.org/files/Eval%20Card%20xbench-DeepSearch.pdf).

---

## Setup and Evaluation Guide

### Step 1: Download the xbench-DeepSearch Dataset

**Direct Download (Recommended)**

!!! tip "Dataset Setup"
    Use the integrated prepare-benchmark command to download and process the dataset:

    ```bash
    uv run main.py prepare-benchmark get xbench-ds
    ```

    By default, this creates the standardized dataset at `data/xbench-ds/standardized_data.jsonl`.
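A quick way to sanity-check the downloaded dataset is to confirm that every non-empty line of the standardized file parses as JSON; a minimal sketch (the helper name is ours, not a repo utility):

```python
import json
from pathlib import Path

def count_jsonl_tasks(path) -> int:
    """Count non-empty lines in a .jsonl file, raising if any
    non-empty line is not valid JSON."""
    n = 0
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            json.loads(line)  # raises json.JSONDecodeError on a malformed line
            n += 1
    return n

path = Path("data/xbench-ds/standardized_data.jsonl")
if path.exists():
    print(f"{count_jsonl_tasks(path)} tasks in {path}")
```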
### Step 2: Configure API Keys

!!! warning "Required API Configuration"
    Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

    ```env title=".env Configuration"
    # Search and web scraping capabilities
    SERPER_API_KEY="your-serper-api-key"
    JINA_API_KEY="your-jina-api-key"

    # Code execution environment
    E2B_API_KEY="your-e2b-api-key"

    # Primary LLM provider (Claude-3.7-Sonnet via OpenRouter)
    OPENROUTER_API_KEY="your-openrouter-api-key"
    OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

    # Vision understanding capabilities
    ANTHROPIC_API_KEY="your-anthropic-api-key"
    GEMINI_API_KEY="your-gemini-api-key"

    # LLM as judge, reasoning, and o3 hints
    OPENAI_API_KEY="your-openai-api-key"
    OPENAI_BASE_URL="https://api.openai.com/v1"
    ```
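Before launching a run, it can help to confirm these keys are actually set; a minimal sketch (the helper is hypothetical, not a repo utility; the key list is taken from the `.env` block above):

```python
import os

# Keys required by the .env configuration above
REQUIRED_KEYS = [
    "SERPER_API_KEY", "JINA_API_KEY", "E2B_API_KEY",
    "OPENROUTER_API_KEY", "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY", "OPENAI_API_KEY",
]

def missing_keys(env) -> list:
    """Return the required keys that are unset or empty in the given mapping."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

missing = missing_keys(os.environ)
print("Missing keys:", ", ".join(missing) if missing else "none")
```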
### Step 3: Run the Evaluation

```bash
uv run main.py common-benchmark \
    --config_file_name=agent_xbench-ds \
    output_dir="logs/xbench-ds/$(date +"%Y%m%d_%H%M")"
```

### Step 4: Monitor Progress and Resume

!!! tip "Progress Tracking"
    You can monitor the evaluation progress in real time:

    ```bash title="Check Progress"
    uv run utils/progress_check/check_xbench_progress.py $PATH_TO_LOG
    ```

    Replace `$PATH_TO_LOG` with your actual output directory path.

!!! note "Resume Capability"
    If the evaluation is interrupted, you can resume from where it left off by specifying the same output directory:

    ```bash title="Resume Interrupted Evaluation"
    uv run main.py common-benchmark \
        --config_file_name=agent_xbench-ds \
        output_dir="logs/xbench-ds/20250922_1430"
    ```

---
## Post-Processing for Enhanced Performance

!!! tip "Test-Time Scaling for Improved Reliability"
    Test-time scaling can significantly improve the reliability of model responses. Instead of simple majority voting, we employ a comprehensive **parallel thinking** approach that:

    - Aggregates the final summary step from each agent run before outputting results
    - Uses another agent (o3 by default) to make the final decision based on equivalence and source-reliability criteria
    - Provides more robust and accurate final answers

Execute the following command to run multiple xbench-DeepSearch evaluations and automatically apply parallel thinking post-processing:

```bash title="Multiple runs with parallel thinking post-processing"
bash scripts/run_evaluate_mulitple_runs_xbench-ds.sh
```
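As a rough mental model of the aggregation step, the sketch below groups per-run final answers and keeps the most common one. The real post-processing uses an LLM judge (o3 by default) to decide equivalence and weigh source reliability, so the exact string matching here is a simplification and the function name is hypothetical:

```python
from collections import Counter

def aggregate_answers(answers: list) -> str:
    """Toy stand-in for parallel-thinking aggregation: normalize each
    run's final answer and return the most common one. The actual
    utility asks a judge model to decide answer equivalence instead
    of comparing normalized strings."""
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return ""
    return Counter(normalized).most_common(1)[0][0]

print(aggregate_answers(["Paris", "paris ", "Lyon"]))  # → paris
```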
### Running Parallel Thinking Analysis Alone

After completing evaluations (single or multiple runs), you can apply parallel thinking post-processing to aggregate the runs and generate the final result:

```bash title="Parallel Thinking Post-Processing"
uv run utils/util_llm_parallel_thinking.py \
    --benchmark xbench-ds \
    --results_dir "logs/xbench-ds/20250922_1430"
```

The program automatically reads the results from each run in the specified directory and performs an aggregated analysis. The final output files are written to `results_dir`:

- **`llm_parallel_thinking_Nruns.json`** - Detailed analysis results
- **`llm_parallel_thinking_accuracy_Nruns.txt`** - Final accuracy

where `N` is the total number of experimental runs (**minimum of 1**).
---

!!! info "Documentation Info"
    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI
