feat(benchmark): add support for evaluation on futurex #40

Merged
ntudy merged 4 commits into MiroMindAI:miroflow-v0.3 from JubSteven:dev
Sep 18, 2025

Conversation

@JubSteven
Contributor

Describe this PR

Overview

Integrates the Futurex-Online prediction dataset into MiroFlow's benchmark system, with majority voting (adapted from MiroThinker) to improve prediction accuracy.

Key Changes

🆕 New Files

  • utils/prepare_benchmark/gen_futurex.py - Dataset generator
  • config/benchmark/futurex.yaml - Benchmark configuration
  • scripts/run_evaluate_multiple_runs_futurex.sh - Multi-run evaluation script
  • docs/mkdocs/docs/futurex.md - Complete documentation
  • docs/mkdocs/mkdocs.yml - Add documentation link for futurex
  • utils/extract_futurex_results.py - Extracts results from logs and applies MiroThinker-style majority voting across multiple runs

🔧 Modified Files

  • utils/prepare_benchmark/main.py - Added Futurex support

Features

  • 61 prediction tasks (political events, sports, legal proceedings)
  • Majority voting across multiple runs with tie-breaking
  • Standardized format using MiroFlow's Task class
  • No ground truth (prediction-based evaluation)
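
The "majority voting with tie-breaking" step can be sketched in Python as follows. This is a hypothetical illustration, assuming each task yields one answer string per run; the actual logic shipped in utils/extract_futurex_results.py may differ in detail (e.g. in its tie-breaking rule).

```python
from collections import Counter

def majority_vote(answers, tie_break="first"):
    """Return the most common answer across runs.

    Hypothetical sketch; the real implementation in
    utils/extract_futurex_results.py may differ.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None  # no usable answers from any run
    best = counts.most_common(1)[0][1]
    tied = {a for a, c in counts.items() if c == best}
    if tie_break == "first":
        # Tie-break by earliest appearance across runs.
        for a in answers:
            if a in tied:
                return a
    return sorted(tied)[0]

print(majority_vote(["A", "B", "A", "C"]))  # -> A
```

With a tie (e.g. `["A", "B"]`), the `"first"` rule keeps the answer from the earliest run, which makes repeated extractions deterministic.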

Usage

```bash
# Download dataset
uv run main.py prepare-benchmark get futurex

# Single run
uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex"

# Multiple runs with voting
./scripts/run_evaluate_multiple_runs_futurex.sh
```

Checklist for PR

Must Do

  • Write a good PR title and description, e.g. feat(agent): add pdf tool via mcp, perf: make llm client async, or fix(utils): load custom config via importlib. The CI job check-pr-title enforces the Angular commit message format on PR titles.
  • Run make precommit locally. The CI job lint enforces ruff's default format/lint rules on all new code.
  • Run make pytest. Check the test summary (located at report.html) and coverage report (located at htmlcov/index.html) for new code.

Nice To Have

  • (Optional) Write/update tests under /tests for feat and test PRs.
  • (Optional) Write/update docs under /docs for docs and ci PRs.

# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"

# We use Claude-3.5-Sonnet with OpenRouter backend to initialize the LLM
Member

This is a typo from the old days; it should be Claude 3.7.

@@ -0,0 +1,258 @@
# Futurex-Online
Member

Mention in the docs that this is a quick start for running the futurex benchmark and preparing results, not for fully reproducing the submitted results.

After evaluation completion, extract the results using the provided utility:

```bash title="Extract Results"
uv run utils/extract_futurex_results.py --log_dir logs/futurex/$(date +"%Y%m%d_%H%M")
```
Contributor

Should this be the directory of the completed run?

After evaluation completion, extract the results using the provided utility:

```bash title="Extract Results"
uv run utils/extract_futurex_results.py --log_dir logs/futurex/$(date +"%Y%m%d_%H%M")
```
Contributor

error: unrecognized arguments: --log_dir

Contributor Author

error: unrecognized arguments: --log_dir

My bad; it should be the following format:

```bash
uv run utils/extract_futurex_results.py logs/futurex/$(date +"%Y%m%d_%H%M")
```
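
The failing and working invocations differ in how the log directory is passed: the script evidently takes it as a positional argument, not a `--log_dir` flag, which is why the flagged form fails with "unrecognized arguments". A minimal sketch of such a CLI (hypothetical; the actual parser in utils/extract_futurex_results.py may differ):

```python
import argparse

def build_parser():
    # Hypothetical parser: the log directory is a positional
    # argument, so passing --log_dir triggers an
    # "unrecognized arguments" error.
    parser = argparse.ArgumentParser(
        description="Extract Futurex results from run logs"
    )
    parser.add_argument("log_dir", help="directory containing run logs")
    return parser

args = build_parser().parse_args(["logs/futurex/20250918_1200"])
print(args.log_dir)  # -> logs/futurex/20250918_1200
```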

@JubSteven
Contributor Author

I have pushed a commit that should resolve the issues regarding arguments and documentation.

@ntudy ntudy merged commit d9a29ba into MiroMindAI:miroflow-v0.3 Sep 18, 2025
1 check passed
Zhudongsheng75 pushed a commit to open-compass/MiroFlow that referenced this pull request Dec 27, 2025
* upd: add futurex evaluation support.

* upd: support multiple eval for futurex and add relevant doc.

* upd: fix bugs with doc for futurex.

* debug: fix wrong calling path.
