16 changes: 16 additions & 0 deletions config/benchmark/xbench-ds.yaml
@@ -0,0 +1,16 @@
# config/benchmark/xbench-ds.yaml
defaults:
- default
- _self_

name: "xbench-ds"

data:
data_dir: "${data_dir}/xbench-ds"

execution:
max_tasks: null # null means no limit
max_concurrent: 10
pass_at_k: 1

openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
114 changes: 114 additions & 0 deletions docs/mkdocs/docs/xbench_ds.md
@@ -0,0 +1,114 @@
# xbench-DeepSearch

The **xbench** benchmark is an evaluation framework designed to measure both the intelligence frontier and the real-world utility of AI agents. It consists of complementary tracks: some test core model capabilities such as reasoning, tool use, and memory, while others evaluate workflows grounded in business and professional settings. Its **DeepSearch** sub-track measures an agent's ability to conduct open-domain information retrieval, combining fact-finding, comparison, and synthesis through multi-step search and tool use.

See more details at [xbench official website](https://xbench.org/agi/aisearch) and [xbench-DeepSearch Eval Card](https://xbench.org/files/Eval%20Card%20xbench-DeepSearch.pdf).


---

## Setup and Evaluation Guide

### Step 1: Download the xbench-DeepSearch Dataset

**Direct Download (Recommended)**

!!! tip "Dataset Setup"
Use the integrated prepare-benchmark command to download and process the dataset:

```bash
uv run main.py prepare-benchmark get xbench-ds
```

By default, this creates the standardized dataset at `data/xbench-ds/standardized_data.jsonl`.
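Each line of the standardized file is one JSON task record. A quick sanity check can be done with a small parser like the sketch below; the field names (`task_id`, `question`) are illustrative, since the exact schema is not shown here:

```python
import json

def load_jsonl(text: str):
    """Parse JSONL content into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical record; the real schema may differ.
sample = '{"task_id": "xbench-ds-001", "question": "..."}\n'
print(load_jsonl(sample)[0]["task_id"])  # → xbench-ds-001
```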

### Step 2: Configure API Keys

!!! warning "Required API Configuration"
Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

```env title=".env Configuration"
# Search and web scraping capabilities
SERPER_API_KEY="your-serper-api-key"
JINA_API_KEY="your-jina-api-key"

# Code execution environment
E2B_API_KEY="your-e2b-api-key"

# Primary LLM provider (Claude-3.7-Sonnet via OpenRouter)
OPENROUTER_API_KEY="your-openrouter-api-key"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Vision understanding capabilities
ANTHROPIC_API_KEY="your-anthropic-api-key"
GEMINI_API_KEY="your-gemini-api-key"

# LLM as judge, reasoning, and O3 hints
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"
```
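Before launching a run, it can help to fail fast if any of these keys is unset. A minimal check, illustrative rather than part of the repo:

```python
import os

# Keys listed in the .env template above; adjust if your setup differs.
REQUIRED_KEYS = [
    "SERPER_API_KEY",
    "JINA_API_KEY",
    "E2B_API_KEY",
    "OPENROUTER_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "OPENAI_API_KEY",
]

def missing_keys(env=None):
    """Return the required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if missing_keys():
    print("Missing keys:", ", ".join(missing_keys()))
```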

### Step 3: Run the Evaluation

!!! note "Chinese Context Configuration"
Since xbench-DeepSearch operates in a Chinese context, enable Chinese prompts by setting the environment variable `CHINESE_CONTEXT="true"`.

```bash title="Run xbench-DeepSearch Evaluation"
export CHINESE_CONTEXT="true"
uv run main.py common-benchmark \
--config_file_name=agent_quickstart_1 \
--benchmark=xbench-ds \
output_dir="logs/xbench-ds/$(date +"%Y%m%d_%H%M")"
```
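The `output_dir` above embeds a timestamp via `$(date +"%Y%m%d_%H%M")`; the same naming can be reproduced in Python if you script your runs (a sketch, not part of the CLI):

```python
from datetime import datetime

# Mirror the shell's $(date +"%Y%m%d_%H%M") naming, e.g. logs/xbench-ds/20250922_1430
output_dir = f"logs/xbench-ds/{datetime.now():%Y%m%d_%H%M}"
print(output_dir)
```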

### Step 4: Monitor Progress and Resume

!!! tip "Progress Tracking"
You can monitor the evaluation progress in real-time:

```bash title="Check Progress"
uv run utils/progress_check/check_xbench_progress.py $PATH_TO_LOG
```

Replace `$PATH_TO_LOG` with your actual output directory path.

!!! note "Resume Capability"
If the evaluation is interrupted, you can resume from where it left off by specifying the same output directory:

```bash title="Resume Interrupted Evaluation"
uv run main.py common-benchmark \
--config_file_name=agent_quickstart_1 \
--benchmark=xbench-ds \
output_dir="logs/xbench-ds/20250922_1430"
```

---

## Post-Processing for Enhanced Performance

!!! tip "Test-Time Scaling for Improved Reliability"
Test-time scaling can significantly improve the reliability of model responses. Instead of simple majority voting, we employ a comprehensive **parallel thinking** approach that:

- Aggregates final summary steps from each agent run before outputting results
- Uses another agent (o3 by default) to make final decisions based on equivalence and source reliability criteria
- Provides more robust and accurate final answers
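For contrast, the simple majority-vote baseline that parallel thinking replaces can be sketched as follows (names here are illustrative, not the pipeline's actual API):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer across runs.

    This is the naive baseline; the parallel thinking pipeline instead
    hands all run summaries to a judge model (o3 by default), which
    decides based on answer equivalence and source reliability.
    """
    counts = Counter(a.strip() for a in answers if a and a.strip())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(majority_vote(["Paris", "Paris ", "Lyon"]))  # → Paris
```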

### Running Parallel Thinking Analysis

```bash title="Parallel Thinking Post-Processing"
uv run utils/util_llm_parallel_thinking.py \
--benchmark xbench-ds \
--results_dir "logs/xbench-ds/20250922_1430"
```

The script automatically reads the results of each run in the specified directory and performs an aggregated analysis. The final output files are written to `results_dir`:

- **`llm_parallel_thinking_Nruns.json`** - Detailed analysis results
- **`llm_parallel_thinking_accuracy_Nruns.txt`** - Final accuracy

Where `N` represents the total number of experimental runs (**minimum of 1**).
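Assuming `N` is substituted directly into the documented name pattern, the expected filenames for a given run count can be derived as follows (a sketch; the helper name is hypothetical):

```python
def parallel_thinking_outputs(n_runs: int):
    """Return the expected output filenames for an N-run aggregation.

    Assumes the run count is substituted for N in the documented pattern.
    """
    if n_runs < 1:
        raise ValueError("at least one run is required")
    return (
        f"llm_parallel_thinking_{n_runs}runs.json",
        f"llm_parallel_thinking_accuracy_{n_runs}runs.txt",
    )

print(parallel_thinking_outputs(3))
```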

---

!!! info "Documentation Info"
**Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI
1 change: 1 addition & 0 deletions docs/mkdocs/mkdocs.yml
@@ -53,6 +53,7 @@ nav:
- GAIA-Validation-Text-Only: gaia_validation_text_only.md
- GAIA-Test: gaia_test.md
- FutureX: futurex.md
- xBench-DeepSearch: xbench_ds.md
- Download Datasets: download_datasets.md
- Add New Benchmarks: contribute_benchmarks.md

244 changes: 244 additions & 0 deletions utils/progress_check/check_xbench_progress.py
@@ -0,0 +1,244 @@
#!/usr/bin/env python3
"""
xbench-DeepSearch Progress Checker

This script analyzes xbench-DeepSearch benchmark results in a log folder to count:
- Total files processed
- Files with status "completed"
- Files with predictions (final_boxed_answer)
- Files with errors

Usage:
python check_xbench_progress.py [LOG_FOLDER_PATH]

If no path is provided, uses the current directory.
"""

import json
import sys
from pathlib import Path
from typing import Dict, List, Tuple


def analyze_xbench_results(
    log_folder: str,
) -> Tuple[
    Dict[str, int],
    List[str],
    List[str],
    List[str],
    List[Tuple[str, str]],
    List[Tuple[str, str]],
    List[Tuple[str, str]],
]:
    """
    Analyze xbench-DeepSearch benchmark results from JSON log files.

    Args:
        log_folder: Path to folder containing task_*_attempt_*.json files

    Returns:
        Tuple of (count summary, completed, running, failed, prediction,
        error, and parse-error file lists).
    """
log_path = Path(log_folder)

if not log_path.exists():
raise FileNotFoundError(f"Log folder not found: {log_folder}")

# Find all task JSON files
json_files = list(log_path.glob("task_*_attempt_*.json"))

results = {
"total_files": 0,
"completed_status": 0,
"running_status": 0,
"failed_status": 0,
"with_predictions": 0,
"without_predictions": 0,
"with_errors": 0,
"parse_errors": 0,
}

completed_files = []
running_files = []
failed_files = []
prediction_files = []
error_files = []
parse_error_files = []

print(f"Scanning {len(json_files)} files in {log_folder}...")

for json_file in json_files:
results["total_files"] += 1

try:
with open(json_file, "r", encoding="utf-8") as f:
data = json.load(f)

status = data.get("status", "").lower()
final_answer = data.get("final_boxed_answer", "")
error_msg = data.get("error", "")

# Count by status
if status == "completed":
results["completed_status"] += 1
completed_files.append(json_file.name)
elif status == "running":
results["running_status"] += 1
running_files.append(json_file.name)
elif status in ["failed", "error"]:
results["failed_status"] += 1
failed_files.append(json_file.name)
            else:
                # Unknown status: count as failed, but record why
                results["failed_status"] += 1
                failed_files.append(f"{json_file.name} (unknown status: {status})")

# Count by prediction availability
if final_answer and final_answer.strip():
results["with_predictions"] += 1
prediction_files.append(
(
json_file.name,
final_answer[:100] + "..."
if len(final_answer) > 100
else final_answer,
)
)
else:
results["without_predictions"] += 1

# Count by error presence
if error_msg and error_msg.strip():
results["with_errors"] += 1
error_files.append((json_file.name, error_msg))

except (json.JSONDecodeError, KeyError, FileNotFoundError) as e:
results["parse_errors"] += 1
parse_error_files.append((json_file.name, str(e)))
print(f"Error parsing {json_file.name}: {e}")

return (
results,
completed_files,
running_files,
failed_files,
prediction_files,
error_files,
parse_error_files,
)


def display_results(
results: Dict[str, int],
completed_files: List[str],
running_files: List[str],
failed_files: List[str],
prediction_files: List[Tuple[str, str]],
error_files: List[Tuple[str, str]],
parse_error_files: List[Tuple[str, str]],
) -> None:
"""Display the analysis results in a formatted way."""

print("\n" + "=" * 60)
print("xbench-DeepSearch BENCHMARK RESULTS SUMMARY")
print("=" * 60)

    total = results["total_files"]
    completed = results["completed_status"]
    running = results["running_status"]
    failed = results["failed_status"]
    with_predictions = results["with_predictions"]
    with_errors = results["with_errors"]

    # Guard against division by zero when no task files were found
    if total == 0:
        print("No task files found.")
        return

    print(f"Total files processed: {total:3d}")
print(
f"Files with status 'completed': {completed:3d} ({completed/total*100:.1f}%)"
)
print(f"Files with status 'running': {running:3d} ({running/total*100:.1f}%)")
print(f"Files with status 'failed': {failed:3d} ({failed/total*100:.1f}%)")
print(
f"Files with predictions: {with_predictions:3d} ({with_predictions/total*100:.1f}%)"
)
print(
f"Files with errors: {with_errors:3d} ({with_errors/total*100:.1f}%)"
)
print(f"Files with parse errors: {results['parse_errors']:3d}")

if completed > 0:
prediction_rate = with_predictions / completed * 100
print(f"\nPrediction rate (predictions/completed): {prediction_rate:.1f}%")

print("\n" + "-" * 60)
print(f"SUMMARY: {completed} tasks completed, {with_predictions} with predictions")
print("-" * 60)

# Show some example files for verification
if completed_files:
print("\nFirst 5 completed files:")
for i, filename in enumerate(completed_files[:5], 1):
            print(f"  {i}. {filename}")
if len(completed_files) > 5:
print(f" ... and {len(completed_files) - 5} more")

if running_files:
print("\nFirst 5 running files:")
for i, filename in enumerate(running_files[:5], 1):
            print(f"  {i}. {filename}")
if len(running_files) > 5:
print(f" ... and {len(running_files) - 5} more")

if prediction_files:
print("\nFirst 5 files with predictions:")
for i, (filename, prediction) in enumerate(prediction_files[:5], 1):
            print(f"  {i}. {filename}")
print(f" Prediction: {prediction}")
if len(prediction_files) > 5:
print(f" ... and {len(prediction_files) - 5} more")

if error_files:
print("\nFiles with errors:")
for filename, error in error_files[:5]:
            print(f"  - {filename}: {error[:100]}...")
if len(error_files) > 5:
print(f" ... and {len(error_files) - 5} more")

if parse_error_files:
print("\nFiles with parse errors:")
for filename, error in parse_error_files:
            print(f"  - {filename}: {error}")


def main():
"""Main function to run the analysis."""

# Check if folder path was provided as command line argument
if len(sys.argv) > 1:
log_folder = sys.argv[1]
print(f"Using provided folder path: {log_folder}")
else:
log_folder = "."
print(f"No folder path provided, using current directory: {log_folder}")

try:
print(f"Analyzing xbench-DeepSearch benchmark results in: {log_folder}")
(
results,
completed_files,
running_files,
failed_files,
prediction_files,
error_files,
parse_error_files,
) = analyze_xbench_results(log_folder)
display_results(
results,
completed_files,
running_files,
failed_files,
prediction_files,
error_files,
parse_error_files,
)

except Exception as e:
print(f"Error: {e}")
print(f"\nUsage: python {sys.argv[0]} [LOG_FOLDER_PATH]")
print(f"Example: python {sys.argv[0]} logs/xbench-ds/claude03_claude_dual/run_1")
return 1

return 0


if __name__ == "__main__":
    sys.exit(main())