1 change: 1 addition & 0 deletions .gitignore
@@ -208,6 +208,7 @@ marimo/_lsp/
__marimo__/

logs/
tmp/

data/*
!data/README.md
20 changes: 20 additions & 0 deletions config/benchmark/futurex.yaml
@@ -0,0 +1,20 @@
# config/benchmark/futurex.yaml
defaults:
  - default
  - _self_

name: "futurex"

data:
  data_dir: "${data_dir}/futurex" # Path to your dataset
  metadata_file: "standardized_data.jsonl" # Metadata filename
  whitelist: [] # Optional: List of specific task_ids to run

execution:
  max_tasks: null # null = no limit, or specify a number
  max_concurrent: 5 # Number of parallel tasks
  pass_at_k: 1 # Number of attempts per task

# Set to skip evaluation since we don't have ground truth
openai_api_key: "skip_evaluation"

2 changes: 2 additions & 0 deletions docs/mkdocs/docs/download_datasets.md
@@ -79,6 +79,7 @@ uv run main.py prepare-benchmark get browsecomp-test
uv run main.py prepare-benchmark get browsecomp-zh-test
uv run main.py prepare-benchmark get hle
uv run main.py prepare-benchmark get xbench-ds
uv run main.py prepare-benchmark get futurex
```

### What This Script Does
@@ -94,6 +95,7 @@ uv run main.py prepare-benchmark get xbench-ds
- `browsecomp-zh-test` - Chinese BrowseComp test set
- `hle` - HLE dataset
- `xbench-ds` - xbench-DeepSearch dataset
- `futurex` - Futurex-Online dataset

### Customizing Dataset Selection

267 changes: 267 additions & 0 deletions docs/mkdocs/docs/futurex.md
@@ -0,0 +1,267 @@
# Futurex-Online
Review comment (Member): State in the document that this is a quick start for running the Futurex benchmark and preparing results, not a guide to fully reproducing the submitted results.


This page describes how to run MiroFlow on the Futurex-Online benchmark, which tests an agent's ability to predict the outcome of future events.

---

## Dataset Overview

!!! info "Futurex-Online Dataset"
The Futurex-Online dataset consists of 61 prediction tasks covering various future events including:

- Political events (referendums, elections)
- Sports outcomes (football matches)
- Legal proceedings
- Economic indicators


!!! abstract "Key Dataset Characteristics"

- **Total Tasks**: 61
- **Task Type**: Future event prediction
- **Answer Format**: Boxed answers (\\boxed{Yes/No} or \\boxed{A/B/C})
- **Ground Truth**: Not available (prediction tasks)
- **Resolution Date**: Around 2025-09-21 (GMT+8)

---

## Quick Start Guide

!!! note "Quick Start Instructions"
This section provides step-by-step instructions to run the Futurex-Online benchmark and prepare submission results. Since this is a prediction dataset without ground truth, we focus on execution traces and response generation. **Note**: This is a quick start guide for running the benchmark, not for reproducing exact submitted results.

### Step 1: Prepare the Futurex-Online Dataset

!!! tip "Dataset Setup"
Use the integrated prepare-benchmark command to download and process the dataset:

```bash title="Download Futurex-Online Dataset"
uv run main.py prepare-benchmark get futurex
```

This will create the standardized dataset at `data/futurex/standardized_data.jsonl`.
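
As a quick sanity check after the download, the task count can be compared against the 61 tasks stated in the dataset overview. The following is a minimal sketch, assuming only that the file holds one JSON object per line; the helper `load_jsonl` and the constant `EXPECTED_TASKS` are hypothetical names and not part of MiroFlow.

```python
import json
from pathlib import Path
from typing import Any, Dict, List

EXPECTED_TASKS = 61  # task count stated in the dataset overview


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Parse a JSONL file into a list of dicts, skipping blank lines."""
    records: List[Dict[str, Any]] = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    dataset = Path("data/futurex/standardized_data.jsonl")
    if dataset.exists():
        tasks = load_jsonl(dataset)
        print(f"Loaded {len(tasks)} tasks (expected {EXPECTED_TASKS})")
    else:
        print(f"{dataset} not found; run the prepare-benchmark command first")
```

A count that differs from 61 may simply mean the upstream dataset has changed, so treat a mismatch as a prompt to investigate rather than as an error.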

### Step 2: Configure API Keys

!!! warning "API Key Configuration"
Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

```env title=".env Configuration"
# For searching and web scraping
SERPER_API_KEY="xxx"
JINA_API_KEY="xxx"

# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"

# We use Claude-3.7-Sonnet with OpenRouter backend to initialize the LLM
OPENROUTER_API_KEY="xxx"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Used for Claude vision understanding
ANTHROPIC_API_KEY="xxx"

# Used for Gemini vision
GEMINI_API_KEY="xxx"

# Used for the LLM judge, reasoning, o3 hints, etc.
OPENAI_API_KEY="xxx"
OPENAI_BASE_URL="https://api.openai.com/v1"
```
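
Before starting a long run, it can help to confirm that none of the keys are missing or still set to the `xxx` placeholder. The sketch below is illustrative only: it inspects the process environment, so it assumes the `.env` values have already been exported or loaded, and it treats every key listed above as required, which may be stricter than a given agent configuration needs. The function `missing_keys` is a hypothetical helper, not a MiroFlow API.

```python
import os
from typing import List, Mapping, Sequence

# Keys taken from the .env example above; treating all of them as required is an assumption.
REQUIRED_KEYS = [
    "SERPER_API_KEY",
    "JINA_API_KEY",
    "E2B_API_KEY",
    "OPENROUTER_API_KEY",
    "OPENROUTER_BASE_URL",
    "ANTHROPIC_API_KEY",
    "GEMINI_API_KEY",
    "OPENAI_API_KEY",
    "OPENAI_BASE_URL",
]


def missing_keys(env: Mapping[str, str], required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the keys that are unset, empty, or still the "xxx" placeholder."""
    return [key for key in required if env.get(key, "") in ("", "xxx")]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    if missing:
        print("Missing or placeholder keys:", ", ".join(missing))
    else:
        print("All listed keys are set.")
```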

### Step 3: Run the Evaluation

!!! example "Evaluation Execution"
Execute the following command to run evaluation on the Futurex-Online dataset. This uses the basic `agent_quickstart_1` configuration for quick start purposes.

```bash title="Run Futurex-Online Evaluation"
uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex/$(date +"%Y%m%d_%H%M")"
```

!!! tip "Progress Monitoring and Resume"
To check the progress while running:

```bash title="Check Progress"
uv run utils/progress_check/check_futurex_progress.py $PATH_TO_LOG
```

If you need to resume an interrupted evaluation, specify the same output directory to continue from where you left off.

```bash title="Resume Evaluation, e.g."
uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex/20250918_1010"
```

### Step 4: Extract Results

!!! example "Result Extraction"
After evaluation completion, extract the results using the provided utility:

```bash title="Extract Results"
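# Pass the same output_dir used in Step 3; re-evaluating $(date ...) here would point at a different directory
uv run utils/extract_futurex_results.py logs/futurex/20250918_1010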
```

This will generate:

- `futurex_results.json`: Detailed results for each task
- `futurex_summary.json`: Summary statistics
- `futurex_predictions.csv`: Predictions in CSV format

---

## Sample Task Examples

### Political Prediction
```
Task: "Will the 2025 Guinea referendum pass? (resolved around 2025-09-21 (GMT+8))"
Expected Format: \boxed{Yes} or \boxed{No}
```

### Sports Prediction
```
Task: "Brighton vs. Tottenham (resolved around 2025-09-21 (GMT+8))
A. Brighton win on 2025-09-20
B. Brighton vs. Tottenham end in a draw
C. Tottenham win on 2025-09-20"
Expected Format: \boxed{A}, \boxed{B}, or \boxed{C}
```
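
To illustrate the expected answer format, the sketch below pulls the boxed answer out of a model response with a regular expression. It is an assumption-laden example rather than the extraction logic used by `utils/extract_futurex_results.py`: the function name `extract_boxed_answer` is hypothetical, nested braces are not handled, and taking the last boxed answer when several appear is a choice made here for illustration.

```python
import re
from typing import Optional

# Matches \boxed{...} with a non-nested payload such as Yes, No, A, B, or C.
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, or None if absent."""
    matches = BOXED_PATTERN.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


if __name__ == "__main__":
    print(extract_boxed_answer(r"Recent polling favours approval, so \boxed{Yes}"))
```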

---

## Multiple Runs and Voting

!!! tip "Improving Prediction Accuracy"
For better prediction accuracy, you can run multiple evaluations and use voting mechanisms to aggregate results. This approach helps reduce randomness and improve the reliability of predictions. **Note**: This is a quick start approach; production submissions may use more sophisticated configurations.

### Step 1: Run Multiple Evaluations

Use the multiple runs script to execute several independent evaluations:

```bash title="Run Multiple Evaluations"
./scripts/run_evaluate_multiple_runs_futurex.sh
```

This script will:

- Run 3 independent evaluations by default (configurable with `NUM_RUNS`)
- Execute all tasks in parallel for efficiency
- Generate separate result files for each run in `run_1/`, `run_2/`, etc.
- Create a consolidated `futurex_submission.jsonl` file with voting results

### Step 2: Customize Multiple Runs

You can customize the evaluation parameters:

```bash title="Custom Multiple Runs"
# Run 5 evaluations with limited tasks for testing
NUM_RUNS=5 MAX_TASKS=10 ./scripts/run_evaluate_multiple_runs_futurex.sh

# Use different agent configuration
AGENT_SET=agent_gaia-validation ./scripts/run_evaluate_multiple_runs_futurex.sh

# Adjust concurrency for resource management
MAX_CONCURRENT=3 ./scripts/run_evaluate_multiple_runs_futurex.sh
```

### Step 3: Voting and Aggregation

After multiple runs, the system automatically:

1. **Extracts predictions** from all runs using `utils/extract_futurex_results.py`
2. **Applies majority voting** to aggregate predictions across runs
3. **Generates a submission file** in the format required by the FutureX platform
4. **Provides voting statistics** showing prediction distribution across runs

The voting process works as follows:

- **Majority Vote**: Most common prediction across all runs wins
- **Tie-breaking**: If tied, chooses the prediction that appeared earliest across all runs
- **Vote Counts**: Tracks how many runs predicted each option
- **Confidence Indicators**: High agreement indicates more reliable predictions
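
The voting rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `utils/extract_futurex_results.py`: the function name `majority_vote` is hypothetical, and "appeared earliest across all runs" is interpreted here as the first tied prediction in run order.

```python
from collections import Counter
from typing import Dict, List, Tuple


def majority_vote(predictions: List[str]) -> Tuple[str, Dict[str, int]]:
    """Pick the most common prediction; ties go to the one seen first in run order.

    `predictions` holds one answer per run, ordered run_1, run_2, and so on.
    Returns the winning prediction and the per-option vote counts.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top_count = max(counts.values())
    # Walk the predictions in run order so the earliest tied option wins.
    for prediction in predictions:
        if counts[prediction] == top_count:
            return prediction, dict(counts)
    raise AssertionError("unreachable: some prediction always has the top count")


if __name__ == "__main__":
    winner, vote_counts = majority_vote(["No", "Yes", "No"])
    print(winner, vote_counts)
```

With an even number of runs, ties become more likely, which is one reason an odd `NUM_RUNS` may be preferable.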

### Step 4: Analyze Voting Results

Check the generated files for voting analysis:

```bash title="Check Voting Results"
# View submission file with voting results
cat logs/futurex/agent_quickstart_1_*/futurex_submission.jsonl

# Check individual run results
ls logs/futurex/agent_quickstart_1_*/run_*/

# Check progress and voting statistics
uv run python utils/progress_check/check_futurex_progress.py logs/futurex/agent_quickstart_1_*
```

### Manual Voting Aggregation

You can also manually run the voting aggregation:

```bash title="Manual Voting Aggregation"
# Aggregate multiple runs with majority voting
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_* --aggregate

# Force single run mode (if needed)
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_*/run_1 --single

# Specify custom output file
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_* -o my_voted_predictions.jsonl
```

### Voting Output Format

The voting aggregation generates a submission file with the following format:

```json
{"id": "687104310a994c0060ef87a9", "prediction": "No", "vote_counts": {"No": 2}}
{"id": "68a9b46e961bd3003c8f006b", "prediction": "Yes", "vote_counts": {"Yes": 2}}
```

The output includes:

- **`id`**: Task identifier
- **`prediction`**: Final voted prediction (without `\boxed{}` wrapper)
- **`vote_counts`**: Dictionary showing how many runs predicted each option

For example, `"vote_counts": {"No": 2}` means that both runs in a two-run evaluation predicted "No". Unanimous agreement of this kind suggests a more stable prediction than a split vote.
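
To find tasks where the runs disagreed, the `vote_counts` field can be turned into an agreement ratio. The sketch below assumes only the three fields shown in the format above; the helpers `agreement_ratio` and `low_agreement_ids` and the `threshold` parameter are hypothetical and exist purely for illustration.

```python
import json
from typing import Dict, Iterable, List


def agreement_ratio(vote_counts: Dict[str, int]) -> float:
    """Share of runs that voted for the most common option (0.0 if there are no votes)."""
    total = sum(vote_counts.values())
    if total == 0:
        return 0.0
    return max(vote_counts.values()) / total


def low_agreement_ids(lines: Iterable[str], threshold: float = 1.0) -> List[str]:
    """Return ids of tasks whose agreement ratio falls below the threshold."""
    flagged: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if agreement_ratio(record["vote_counts"]) < threshold:
            flagged.append(record["id"])
    return flagged
```

Passing an open file handle for `futurex_submission.jsonl` as `lines` would list every task without a unanimous vote when the default threshold of 1.0 is used.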

---

## Evaluation Notes

!!! warning "No Ground Truth Available"
Since Futurex-Online is a prediction dataset, there are no ground truth answers available for evaluation. The focus is on:

- Response generation quality
- Reasoning process documentation
- Prediction confidence and methodology

!!! info "Output Analysis"
The evaluation generates detailed execution traces showing:

- Research process for each prediction
- Information gathering from web sources
- Reasoning chains leading to predictions
- Final boxed answers in required format

### Directory Structure

After running multiple evaluations, you'll find the following structure:

```
logs/futurex/agent_quickstart_1_YYYYMMDD_HHMM/
├── futurex_submission.jsonl # Final voted predictions
├── run_1/ # First run results
│ ├── benchmark_results.jsonl # Individual task results
│ ├── benchmark_results_pass_at_1_accuracy.txt
│ └── task_*_attempt_1.json # Detailed execution traces
├── run_2/ # Second run results
│ └── ... (same structure as run_1)
├── run_1_output.log # Run 1 execution log
└── run_2_output.log # Run 2 execution log
```
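
Given this layout, a short script can report how many execution traces each run has produced so far. This is a rough sketch based solely on the directory structure shown above, not a replacement for `utils/progress_check/check_futurex_progress.py`; the function name `count_traces_per_run` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def count_traces_per_run(log_dir: str) -> Dict[str, int]:
    """Map each run_* directory name to its number of task_*_attempt_1.json files."""
    root = Path(log_dir)
    return {
        run.name: len(list(run.glob("task_*_attempt_1.json")))
        for run in sorted(root.glob("run_*"))
        if run.is_dir()  # skips files such as run_1_output.log
    }
```

Calling `count_traces_per_run` on a directory such as `logs/futurex/agent_quickstart_1_YYYYMMDD_HHMM` returns one entry per run, which makes an incomplete run easy to spot.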

---

!!! info "Documentation Info"
**Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI
1 change: 1 addition & 0 deletions docs/mkdocs/mkdocs.yml
@@ -52,6 +52,7 @@ nav:
- Benchmarks:
- GAIA-Validation: gaia_validation.md
- GAIA-Test: gaia_test.md
- FutureX: futurex.md
- Add New Benchmarks: contribute_benchmarks.md

- Tools: