185 changes: 0 additions & 185 deletions config/agent_prompts/main_gaia.py

This file was deleted.

75 changes: 75 additions & 0 deletions config/agent_xbench-ds.yaml
@@ -0,0 +1,75 @@
defaults:
- benchmark: xbench-ds
- override hydra/job_logging: none
- _self_ # Allow defining variables at the top of this file


main_agent:
prompt_class: MainAgentPrompt_GAIA
llm:
provider_class: "ClaudeOpenRouterClient"
model_name: "anthropic/claude-3.7-sonnet"
async_client: true
temperature: 0.3
top_p: 0.95
min_p: 0.0
top_k: -1
max_tokens: 32000
openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
openrouter_provider: "anthropic"
disable_cache_control: false
keep_tool_result: -1
oai_tool_thinking: false

tool_config:
- tool-reasoning

max_turns: -1 # Maximum number of turns for main agent execution
max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn

input_process:
o3_hint: true
output_process:
o3_final_answer: true

openai_api_key: "${oc.env:OPENAI_API_KEY,???}" # used for o3 hints and final answer extraction
add_message_id: true
keep_tool_result: -1
chinese_context: "true"


sub_agents:
agent-worker:
prompt_class: SubAgentWorkerPrompt
llm:
provider_class: "ClaudeOpenRouterClient"
model_name: "anthropic/claude-3.7-sonnet"
async_client: true
temperature: 0.3
top_p: 0.95
min_p: 0.0
top_k: -1
max_tokens: 32000
openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
openrouter_provider: "anthropic"
disable_cache_control: false
keep_tool_result: -1
oai_tool_thinking: false

tool_config:
- tool-searching
- tool-image-video
- tool-reading
- tool-code
- tool-audio

max_turns: -1 # Maximum number of turns for sub-agent worker execution
max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn


# Top-level or default parameters can be defined here
output_dir: logs/
data_dir: "${oc.env:DATA_DIR,data}" # Points to where data is stored

16 changes: 16 additions & 0 deletions config/benchmark/xbench-ds.yaml
@@ -0,0 +1,16 @@
# config/benchmark/xbench-ds.yaml
defaults:
- default
- _self_

name: "xbench-ds"

data:
data_dir: "${data_dir}/xbench-ds"

execution:
max_tasks: null # null means no limit
max_concurrent: 10
pass_at_k: 1

openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
116 changes: 116 additions & 0 deletions docs/mkdocs/docs/xbench_ds.md
@@ -0,0 +1,116 @@
# xbench-DeepSearch

The **xbench** benchmark is an evaluation framework designed to measure both the intelligence frontier and the real-world utility of AI agents. It consists of complementary tracks that test core model capabilities such as reasoning, tool use, and memory, alongside workflows grounded in business and professional settings. Its **DeepSearch** sub-track measures an agent's ability to conduct open-domain information retrieval, combining fact-finding, comparison, and synthesis through multi-step search and tool use.

See more details at [xbench official website](https://xbench.org/agi/aisearch) and [xbench-DeepSearch Eval Card](https://xbench.org/files/Eval%20Card%20xbench-DeepSearch.pdf).


---

## Setup and Evaluation Guide

### Step 1: Download the xbench-DeepSearch Dataset

**Direct Download (Recommended)**

!!! tip "Dataset Setup"
Use the integrated prepare-benchmark command to download and process the dataset:

```bash
uv run main.py prepare-benchmark get xbench-ds
```

By default, this will create the standardized dataset at `data/xbench-ds/standardized_data.jsonl`.
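To sanity-check the download, you can peek at the first record of the standardized file. This is a minimal sketch, assuming only that each line is a JSON object; the exact field names depend on the prepare-benchmark pipeline and are not guaranteed here.

```python title="Inspect the Standardized Dataset (illustrative)"
import json

# Read the first record and list its keys; field names vary by
# pipeline version, so we only print whatever is actually there.
with open("data/xbench-ds/standardized_data.jsonl", encoding="utf-8") as f:
    first = json.loads(f.readline())
print(sorted(first.keys()))
```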

### Step 2: Configure API Keys

!!! warning "Required API Configuration"
Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

```env title=".env Configuration"
# Search and web scraping capabilities
SERPER_API_KEY="your-serper-api-key"
JINA_API_KEY="your-jina-api-key"

# Code execution environment
E2B_API_KEY="your-e2b-api-key"

# Primary LLM provider (Claude-3.7-Sonnet via OpenRouter)
OPENROUTER_API_KEY="your-openrouter-api-key"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Vision understanding capabilities
ANTHROPIC_API_KEY="your-anthropic-api-key"
GEMINI_API_KEY="your-gemini-api-key"

# LLM as judge, reasoning, and O3 hints
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"
```
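Because the configs resolve these keys via OmegaConf's `${oc.env:...}` interpolation, a missing variable only surfaces at run time. The snippet below is an optional, hypothetical pre-flight check that assumes the key names listed above; adjust the list to match your setup.

```python title="Pre-flight API Key Check (illustrative)"
import os

# Keys referenced by the xbench-ds configs and tools (assumed list).
REQUIRED = [
    "SERPER_API_KEY", "JINA_API_KEY", "E2B_API_KEY",
    "OPENROUTER_API_KEY", "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY", "OPENAI_API_KEY",
]

missing = [k for k in REQUIRED if not os.environ.get(k)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required API keys are set.")
```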

### Step 3: Run the Evaluation

```bash
uv run main.py common-benchmark \
--config_file_name=agent_xbench-ds \
output_dir="logs/xbench-ds/$(date +"%Y%m%d_%H%M")"
```
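Because the entry point is Hydra-based, other keys from `config/agent_xbench-ds.yaml` can be overridden on the command line the same way `output_dir` is. The dotted paths below mirror keys visible in the config above, but treat the exact override paths as an assumption to verify against your checkout.

```bash title="Example Hydra Overrides (illustrative)"
uv run main.py common-benchmark \
    --config_file_name=agent_xbench-ds \
    output_dir="logs/xbench-ds/$(date +"%Y%m%d_%H%M")" \
    main_agent.llm.temperature=0.1 \
    benchmark.execution.max_concurrent=5
```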

### Step 4: Monitor Progress and Resume

!!! tip "Progress Tracking"
You can monitor the evaluation progress in real-time:

```bash title="Check Progress"
uv run utils/progress_check/check_xbench_progress.py $PATH_TO_LOG
```

Replace `$PATH_TO_LOG` with your actual output directory path.
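For continuous monitoring, one option is to wrap the checker in `watch` (assuming a Unix-like environment):

```bash title="Continuous Progress Monitoring (optional)"
watch -n 60 uv run utils/progress_check/check_xbench_progress.py $PATH_TO_LOG
```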

!!! note "Resume Capability"
If the evaluation is interrupted, you can resume from where it left off by specifying the same output directory:

```bash title="Resume Interrupted Evaluation"
uv run main.py common-benchmark \
--config_file_name=agent_xbench-ds \
output_dir="logs/xbench-ds/20250922_1430"
```

---

## Post-Processing for Enhanced Performance

!!! tip "Test-Time Scaling for Improved Reliability"
Test-time scaling can significantly improve the reliability of model responses. Instead of simple majority voting, we employ a **parallel thinking** approach (sketched after this list) that:

- Aggregates final summary steps from each agent run before outputting results
- Uses another agent (o3 by default) to make final decisions based on equivalence and source reliability criteria
- Provides more robust and accurate final answers
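A minimal sketch of the aggregation idea follows. This is not the repository's `utils/util_llm_parallel_thinking.py` implementation; it assumes each run's final summary is already available as a string and uses a judge call through the standard OpenAI client.

```python title="Parallel Thinking Aggregation (illustrative sketch)"
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def parallel_thinking(question: str, run_summaries: list[str]) -> str:
    """Aggregate final summaries from N independent runs and ask a judge
    model (o3 by default, per the docs) to pick one final answer."""
    numbered = "\n\n".join(
        f"Run {i + 1}:\n{s}" for i, s in enumerate(run_summaries)
    )
    prompt = (
        f"Question: {question}\n\n"
        f"Candidate summaries from independent agent runs:\n{numbered}\n\n"
        "Group equivalent answers, weigh source reliability, and reply "
        "with the single best final answer."
    )
    response = client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```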

Execute the following command to run multiple xbench-DeepSearch evaluations with parallel thinking post-processing applied automatically:

```bash title="Multiple runs with parallel thinking post-processing"
bash scripts/run_evaluate_mulitple_runs_xbench-ds.sh
```

### Running Parallel Thinking Analysis Alone

After completing evaluations (single or multiple runs), you can apply parallel thinking post-processing to aggregate the per-run results and generate a final answer.

```bash title="Parallel Thinking Post-Processing"
uv run utils/util_llm_parallel_thinking.py \
--benchmark xbench-ds \
--results_dir "logs/xbench-ds/20250922_1430"
```

The program automatically reads results from each run in the specified directory and performs aggregated analysis. The final output files are generated in the `results_dir`:

- **`llm_parallel_thinking_Nruns.json`** - Detailed analysis results
- **`llm_parallel_thinking_accuracy_Nruns.txt`** - Final accuracy

Where `N` represents the total number of experimental runs (**minimum of 1**).

---

!!! info "Documentation Info"
**Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI