# xbench-DeepSearch

The **xbench** benchmark is an evaluation framework designed to measure both the intelligence frontier and the real-world utility of AI agents. It consists of complementary tracks that test core model capabilities such as reasoning, tool use, and memory, alongside workflows grounded in business and professional settings. Its **DeepSearch** sub-track measures an agent's ability to conduct open-domain information retrieval, combining fact finding, comparison, and synthesis through multi-step search and tool use.

See more details at the [xbench official website](https://xbench.org/agi/aisearch) and the [xbench-DeepSearch Eval Card](https://xbench.org/files/Eval%20Card%20xbench-DeepSearch.pdf).

---

## Setup and Evaluation Guide

### Step 1: Download the xbench-DeepSearch Dataset

**Direct Download (Recommended)**

!!! tip "Dataset Setup"
    Use the integrated `prepare-benchmark` command to download and process the dataset:

```bash
uv run main.py prepare-benchmark get xbench-ds
```

By default, this creates the standardized dataset at `data/xbench-ds/standardized_data.jsonl`.
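
Once the download finishes, you can sanity-check the file by loading it as JSONL. This is only an illustrative sketch; the field names shown (`task_id`, `question`) are assumptions, not the dataset's actual schema:

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Write a tiny in-memory sample; in practice, point load_jsonl at
# data/xbench-ds/standardized_data.jsonl after Step 1 completes.
with open("sample.jsonl", "w", encoding="utf-8") as f:
    f.write('{"task_id": "demo-1", "question": "..."}\n')

print(len(load_jsonl("sample.jsonl")))  # number of tasks in the file
```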

### Step 2: Configure API Keys

!!! warning "Required API Configuration"
    Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

```env title=".env Configuration"
# Search and web scraping capabilities
SERPER_API_KEY="your-serper-api-key"
JINA_API_KEY="your-jina-api-key"

# Code execution environment
E2B_API_KEY="your-e2b-api-key"

# Primary LLM provider (Claude-3.7-Sonnet via OpenRouter)
OPENROUTER_API_KEY="your-openrouter-api-key"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Vision understanding capabilities
ANTHROPIC_API_KEY="your-anthropic-api-key"
GEMINI_API_KEY="your-gemini-api-key"

# LLM as judge, reasoning, and O3 hints
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"
```
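
Before launching a run, it can help to confirm that these keys are actually visible to the process. A minimal, hypothetical check (the key list mirrors the configuration above; trim it to the providers you use):

```python
import os

# Keys the configuration above expects; adjust to the subset you need.
REQUIRED_KEYS = [
    "SERPER_API_KEY",
    "JINA_API_KEY",
    "E2B_API_KEY",
    "OPENROUTER_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "OPENAI_API_KEY",
]

def missing_keys(env=os.environ):
    """Return the required keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing keys:", ", ".join(absent))
    else:
        print("All required API keys are set.")
```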

### Step 3: Run the Evaluation

```bash
uv run main.py common-benchmark \
  --config_file_name=agent_xbench-ds \
  output_dir="logs/xbench-ds/$(date +"%Y%m%d_%H%M")"
```

### Step 4: Monitor Progress and Resume

!!! tip "Progress Tracking"
    You can monitor the evaluation progress in real time:

```bash title="Check Progress"
uv run utils/progress_check/check_xbench_progress.py $PATH_TO_LOG
```

Replace `$PATH_TO_LOG` with the path to your actual output directory.
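
If you want a rough custom check instead, a sketch like the following counts per-task result files under the output directory. The one-JSON-file-per-task layout is an assumption; `check_xbench_progress.py` remains the authoritative tool:

```python
from pathlib import Path

def count_completed(log_dir):
    """Count *.json result files under log_dir (assumed one per finished task)."""
    return sum(1 for _ in Path(log_dir).rglob("*.json"))

# Example: count_completed("logs/xbench-ds/20250922_1430")
```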

!!! note "Resume Capability"
    If the evaluation is interrupted, you can resume from where it left off by specifying the same output directory:

```bash title="Resume Interrupted Evaluation"
uv run main.py common-benchmark \
  --config_file_name=agent_xbench-ds \
  output_dir="logs/xbench-ds/20250922_1430"
```

---

## Post-Processing for Enhanced Performance

!!! tip "Test-Time Scaling for Improved Reliability"
    Test-time scaling can significantly improve the reliability of model responses. Instead of simple majority voting, we employ a comprehensive **parallel thinking** approach that:

    - Aggregates the final summary step from each agent run before outputting results
    - Uses another agent (o3 by default) to make the final decision based on equivalence and source-reliability criteria
    - Produces more robust and accurate final answers
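
For contrast, the simple majority-voting baseline that parallel thinking improves upon can be sketched in a few lines (illustrative only; the actual judge reasons about answer equivalence and source reliability rather than exact string counts):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent answer string across runs.

    This naive baseline treats answers as opaque strings, so paraphrased
    but equivalent answers split the vote; the parallel thinking judge
    avoids this by reasoning about equivalence and reliability.
    """
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["Paris", "Paris", "London"]))  # -> Paris
```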

Execute the following command to run multiple xbench-DeepSearch evaluations and automatically apply parallel thinking post-processing for enhanced performance.

```bash title="Multiple runs with parallel thinking post-processing"
bash scripts/run_evaluate_mulitple_runs_xbench-ds.sh
```

### Running Parallel Thinking Analysis Alone

After completing evaluations (single or multiple runs), you can apply parallel thinking post-processing to aggregate the runs and generate the final result.

```bash title="Parallel Thinking Post-Processing"
uv run utils/util_llm_parallel_thinking.py \
  --benchmark xbench-ds \
  --results_dir "logs/xbench-ds/20250922_1430"
```

The program automatically reads the results of each run in the specified directory and performs an aggregated analysis. The final output files are written to `results_dir`:

- **`llm_parallel_thinking_Nruns.json`** - Detailed analysis results
- **`llm_parallel_thinking_accuracy_Nruns.txt`** - Final accuracy

where `N` is the total number of experimental runs (**minimum of 1**).
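
The naming pattern can be made concrete with a small helper (inferred from the filenames listed above; treat it as illustrative):

```python
def output_filenames(n_runs):
    """Aggregated output names for n_runs runs (n_runs >= 1)."""
    if n_runs < 1:
        raise ValueError("at least one run is required")
    return (
        f"llm_parallel_thinking_{n_runs}runs.json",
        f"llm_parallel_thinking_accuracy_{n_runs}runs.txt",
    )
```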

---

!!! info "Documentation Info"
    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI