16 changes: 16 additions & 0 deletions config/benchmark/xbench-ds.yaml
@@ -0,0 +1,16 @@
# config/benchmark/xbench-ds.yaml
defaults:
- default
- _self_

name: "xbench-ds"

data:
data_dir: "${data_dir}/xbench-ds"

execution:
max_tasks: null # null means no limit
max_concurrent: 10
pass_at_k: 1

openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
114 changes: 114 additions & 0 deletions docs/mkdocs/docs/xbench_ds.md
@@ -0,0 +1,114 @@
# xbench-DeepSearch

The **xbench** benchmark is an evaluation framework designed to measure both the intelligence frontier and the real-world utility of AI agents. It consists of complementary tracks: some test core model capabilities such as reasoning, tool use, and memory, while others evaluate workflows grounded in business and professional settings. Its **DeepSearch** sub-track measures an agent's ability to conduct open-domain information retrieval, combining fact-finding, comparison, and synthesis through multi-step search and tool use.

See more details at [xbench official website](https://xbench.org/agi/aisearch) and [xbench-DeepSearch Eval Card](https://xbench.org/files/Eval%20Card%20xbench-DeepSearch.pdf).


---

## Setup and Evaluation Guide

### Step 1: Download the xbench-DeepSearch Dataset

**Direct Download (Recommended)**

!!! tip "Dataset Setup"
Use the integrated prepare-benchmark command to download and process the dataset:

```bash
uv run main.py prepare-benchmark get xbench-ds
```

By default, this creates the standardized dataset at `data/xbench-ds/standardized_data.jsonl`.
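Each line of the standardized file is one JSON task record. A quick sanity check can be done with a small parser like the sketch below; the field names (`task_id`, `question`) are illustrative, since the exact schema is not shown here:

```python
import json

def load_jsonl(text: str):
    """Parse JSONL content into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical record; the real schema may differ.
sample = '{"task_id": "xbench-ds-001", "question": "..."}\n'
print(load_jsonl(sample)[0]["task_id"])  # → xbench-ds-001
```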

### Step 2: Configure API Keys

!!! warning "Required API Configuration"
Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

```env title=".env Configuration"
# Search and web scraping capabilities
SERPER_API_KEY="your-serper-api-key"
JINA_API_KEY="your-jina-api-key"

# Code execution environment
E2B_API_KEY="your-e2b-api-key"

# Primary LLM provider (Claude-3.7-Sonnet via OpenRouter)
OPENROUTER_API_KEY="your-openrouter-api-key"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Vision understanding capabilities
ANTHROPIC_API_KEY="your-anthropic-api-key"
GEMINI_API_KEY="your-gemini-api-key"

# LLM as judge, reasoning, and O3 hints
OPENAI_API_KEY="your-openai-api-key"
OPENAI_BASE_URL="https://api.openai.com/v1"
```
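Before launching a run, it can help to fail fast if any of these keys is unset. A minimal check, illustrative rather than part of the repo:

```python
import os

# Keys listed in the .env template above; adjust if your setup differs.
REQUIRED_KEYS = [
    "SERPER_API_KEY",
    "JINA_API_KEY",
    "E2B_API_KEY",
    "OPENROUTER_API_KEY",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "OPENAI_API_KEY",
]

def missing_keys(env=None):
    """Return the required keys that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if missing_keys():
    print("Missing keys:", ", ".join(missing_keys()))
```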

### Step 3: Run the Evaluation

!!! note "Chinese Context Configuration"
Since xbench-DeepSearch operates in a Chinese context, enable Chinese prompts by setting the environment variable `CHINESE_CONTEXT="true"`.

```bash title="Run xbench-DeepSearch Evaluation"
export CHINESE_CONTEXT="true"
uv run main.py common-benchmark \
--config_file_name=agent_quickstart_1 \
--benchmark=xbench-ds \
output_dir="logs/xbench-ds/$(date +"%Y%m%d_%H%M")"
```
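The `output_dir` above embeds a timestamp via `$(date +"%Y%m%d_%H%M")`; the same naming can be reproduced in Python if you script your runs (a sketch, not part of the CLI):

```python
from datetime import datetime

# Mirror the shell's $(date +"%Y%m%d_%H%M") naming, e.g. logs/xbench-ds/20250922_1430
output_dir = f"logs/xbench-ds/{datetime.now():%Y%m%d_%H%M}"
print(output_dir)
```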

### Step 4: Monitor Progress and Resume

!!! tip "Progress Tracking"
You can monitor the evaluation progress in real-time:

```bash title="Check Progress"
uv run utils/progress_check/check_xbench_progress.py $PATH_TO_LOG
```

Replace `$PATH_TO_LOG` with your actual output directory path.

!!! note "Resume Capability"
If the evaluation is interrupted, you can resume from where it left off by specifying the same output directory:

```bash title="Resume Interrupted Evaluation"
uv run main.py common-benchmark \
--config_file_name=agent_quickstart_1 \
--benchmark=xbench-ds \
output_dir="logs/xbench-ds/20250922_1430"
```

---

## Post-Processing for Enhanced Performance

!!! tip "Test-Time Scaling for Improved Reliability"
Test-time scaling can significantly improve the reliability of model responses. Instead of simple majority voting, we employ a comprehensive **parallel thinking** approach that:

- Aggregates final summary steps from each agent run before outputting results
- Uses another agent (o3 by default) to make final decisions based on equivalence and source reliability criteria
- Provides more robust and accurate final answers
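For contrast, the simple majority-vote baseline that parallel thinking replaces can be sketched as follows (names here are illustrative, not the pipeline's actual API):

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most common final answer across runs.

    This is the naive baseline; the parallel thinking pipeline instead
    hands all run summaries to a judge model (o3 by default), which
    decides based on answer equivalence and source reliability.
    """
    counts = Counter(a.strip() for a in answers if a and a.strip())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(majority_vote(["Paris", "Paris ", "Lyon"]))  # → Paris
```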

### Running Parallel Thinking Analysis

```bash title="Parallel Thinking Post-Processing"
uv run utils/util_llm_parallel_thinking.py \
--benchmark xbench-ds \
--results_dir "logs/xbench-ds/20250922_1430"
```

The script automatically reads the results of each run in the specified directory and performs an aggregated analysis. The final output files are written to `results_dir`:

- **`llm_parallel_thinking_Nruns.json`** - Detailed analysis results
- **`llm_parallel_thinking_accuracy_Nruns.txt`** - Final accuracy

Where `N` represents the total number of experimental runs (**minimum of 1**).
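Assuming `N` is substituted directly into the documented name pattern, the expected filenames for a given run count can be derived as follows (a sketch; the helper name is hypothetical):

```python
def parallel_thinking_outputs(n_runs: int):
    """Return the expected output filenames for an N-run aggregation.

    Assumes the run count is substituted for N in the documented pattern.
    """
    if n_runs < 1:
        raise ValueError("at least one run is required")
    return (
        f"llm_parallel_thinking_{n_runs}runs.json",
        f"llm_parallel_thinking_accuracy_{n_runs}runs.txt",
    )

print(parallel_thinking_outputs(3))
```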

---

!!! info "Documentation Info"
**Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI
1 change: 1 addition & 0 deletions docs/mkdocs/mkdocs.yml
@@ -53,6 +53,7 @@ nav:
- GAIA-Validation-Text-Only: gaia_validation_text_only.md
- GAIA-Test: gaia_test.md
- FutureX: futurex.md
- xBench-DeepSearch: xbench_ds.md
- Download Datasets: download_datasets.md
- Add New Benchmarks: contribute_benchmarks.md

244 changes: 244 additions & 0 deletions utils/progress_check/check_xbench_progress.py
@@ -0,0 +1,244 @@
#!/usr/bin/env python3
"""
xbench-DeepSearch Progress Checker

This script analyzes xbench-DeepSearch benchmark results in a log folder to count:
- Total files processed
- Files with status "completed"
- Files with predictions (final_boxed_answer)
- Files with errors

Usage:
python check_xbench_progress.py [LOG_FOLDER_PATH]

If no path is provided, uses the current directory.
"""

import json
import sys
from pathlib import Path
from typing import Dict, List, Tuple


def analyze_xbench_results(
    log_folder: str,
) -> Tuple[
    Dict[str, int],
    List[str],
    List[str],
    List[str],
    List[Tuple[str, str]],
    List[Tuple[str, str]],
    List[Tuple[str, str]],
]:
    """
    Analyze xbench-DeepSearch benchmark results from JSON log files.

    Args:
        log_folder: Path to folder containing task_*_attempt_*.json files

    Returns:
        Tuple of (count summary, completed, running, failed, prediction,
        error, and parse-error file lists).
    """
log_path = Path(log_folder)

if not log_path.exists():
raise FileNotFoundError(f"Log folder not found: {log_folder}")

# Find all task JSON files
json_files = list(log_path.glob("task_*_attempt_*.json"))

results = {
"total_files": 0,
"completed_status": 0,
"running_status": 0,
"failed_status": 0,
"with_predictions": 0,
"without_predictions": 0,
"with_errors": 0,
"parse_errors": 0,
}

completed_files = []
running_files = []
failed_files = []
prediction_files = []
error_files = []
parse_error_files = []

print(f"Scanning {len(json_files)} files in {log_folder}...")

for json_file in json_files:
results["total_files"] += 1

try:
with open(json_file, "r", encoding="utf-8") as f:
data = json.load(f)

status = data.get("status", "").lower()
final_answer = data.get("final_boxed_answer", "")
error_msg = data.get("error", "")

# Count by status
if status == "completed":
results["completed_status"] += 1
completed_files.append(json_file.name)
elif status == "running":
results["running_status"] += 1
running_files.append(json_file.name)
elif status in ["failed", "error"]:
results["failed_status"] += 1
failed_files.append(json_file.name)
            else:
                # Unknown status: count as failed, but record why
                results["failed_status"] += 1
                failed_files.append(f"{json_file.name} (unknown status: {status})")

# Count by prediction availability
if final_answer and final_answer.strip():
results["with_predictions"] += 1
prediction_files.append(
(
json_file.name,
final_answer[:100] + "..."
if len(final_answer) > 100
else final_answer,
)
)
else:
results["without_predictions"] += 1

# Count by error presence
if error_msg and error_msg.strip():
results["with_errors"] += 1
error_files.append((json_file.name, error_msg))

except (json.JSONDecodeError, KeyError, FileNotFoundError) as e:
results["parse_errors"] += 1
parse_error_files.append((json_file.name, str(e)))
print(f"Error parsing {json_file.name}: {e}")

return (
results,
completed_files,
running_files,
failed_files,
prediction_files,
error_files,
parse_error_files,
)


def display_results(
results: Dict[str, int],
completed_files: List[str],
running_files: List[str],
failed_files: List[str],
prediction_files: List[Tuple[str, str]],
error_files: List[Tuple[str, str]],
parse_error_files: List[Tuple[str, str]],
) -> None:
"""Display the analysis results in a formatted way."""

print("\n" + "=" * 60)
print("xbench-DeepSearch BENCHMARK RESULTS SUMMARY")
print("=" * 60)

    total = results["total_files"]
    completed = results["completed_status"]
    running = results["running_status"]
    failed = results["failed_status"]
    with_predictions = results["with_predictions"]
    with_errors = results["with_errors"]

    # Guard against division by zero when no task files were found
    if total == 0:
        print("No task files found.")
        return

    print(f"Total files processed: {total:3d}")
print(
f"Files with status 'completed': {completed:3d} ({completed/total*100:.1f}%)"
)
print(f"Files with status 'running': {running:3d} ({running/total*100:.1f}%)")
print(f"Files with status 'failed': {failed:3d} ({failed/total*100:.1f}%)")
print(
f"Files with predictions: {with_predictions:3d} ({with_predictions/total*100:.1f}%)"
)
print(
f"Files with errors: {with_errors:3d} ({with_errors/total*100:.1f}%)"
)
print(f"Files with parse errors: {results['parse_errors']:3d}")

if completed > 0:
prediction_rate = with_predictions / completed * 100
print(f"\nPrediction rate (predictions/completed): {prediction_rate:.1f}%")

print("\n" + "-" * 60)
print(f"SUMMARY: {completed} tasks completed, {with_predictions} with predictions")
print("-" * 60)

# Show some example files for verification
if completed_files:
print("\nFirst 5 completed files:")
for i, filename in enumerate(completed_files[:5], 1):
            print(f"  {i}. {filename}")
if len(completed_files) > 5:
print(f" ... and {len(completed_files) - 5} more")

if running_files:
print("\nFirst 5 running files:")
for i, filename in enumerate(running_files[:5], 1):
            print(f"  {i}. {filename}")
if len(running_files) > 5:
print(f" ... and {len(running_files) - 5} more")

if prediction_files:
print("\nFirst 5 files with predictions:")
for i, (filename, prediction) in enumerate(prediction_files[:5], 1):
            print(f"  {i}. {filename}")
print(f" Prediction: {prediction}")
if len(prediction_files) > 5:
print(f" ... and {len(prediction_files) - 5} more")

if error_files:
print("\nFiles with errors:")
for filename, error in error_files[:5]:
            print(f"  - {filename}: {error[:100]}...")
if len(error_files) > 5:
print(f" ... and {len(error_files) - 5} more")

if parse_error_files:
print("\nFiles with parse errors:")
for filename, error in parse_error_files:
            print(f"  - {filename}: {error}")


def main():
"""Main function to run the analysis."""

# Check if folder path was provided as command line argument
if len(sys.argv) > 1:
log_folder = sys.argv[1]
print(f"Using provided folder path: {log_folder}")
else:
log_folder = "."
print(f"No folder path provided, using current directory: {log_folder}")

try:
print(f"Analyzing xbench-DeepSearch benchmark results in: {log_folder}")
(
results,
completed_files,
running_files,
failed_files,
prediction_files,
error_files,
parse_error_files,
) = analyze_xbench_results(log_folder)
display_results(
results,
completed_files,
running_files,
failed_files,
prediction_files,
error_files,
parse_error_files,
)

except Exception as e:
print(f"Error: {e}")
print(f"\nUsage: python {sys.argv[0]} [LOG_FOLDER_PATH]")
print(f"Example: python {sys.argv[0]} logs/xbench-ds/claude03_claude_dual/run_1")
return 1

return 0


if __name__ == "__main__":
    sys.exit(main())