feat(benchmark): add support for evaluation on futurex #40

Merged
ntudy merged 4 commits into MiroMindAI:miroflow-v0.3 from JubSteven:dev
Sep 18, 2025

Conversation

@JubSteven
Contributor

Describe this PR

Overview

Integrates the Futurex-Online prediction dataset into MiroFlow's benchmark system, with majority voting (adapted from MiroThinker) to improve prediction accuracy.

Key Changes

🆕 New Files

  • utils/prepare_benchmark/gen_futurex.py - Dataset generator
  • config/benchmark/futurex.yaml - Benchmark configuration
  • scripts/run_evaluate_multiple_runs_futurex.sh - Multi-run evaluation script
  • docs/mkdocs/docs/futurex.md - Complete documentation
  • docs/mkdocs/mkdocs.yml - Add documentation link for futurex
  • utils/extract_futurex_results.py - Extracts results from logs and applies MiroThinker-style majority voting across multiple runs

🔧 Modified Files

  • utils/prepare_benchmark/main.py - Added Futurex support

Features

  • 61 prediction tasks (political events, sports, legal proceedings)
  • Majority voting across multiple runs with tie-breaking
  • Standardized format using MiroFlow's Task class
  • No ground truth (prediction-based evaluation)
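
The "majority voting with tie-breaking" step can be sketched in Python as follows. This is a hypothetical illustration, assuming each task yields one answer string per run; the actual logic shipped in utils/extract_futurex_results.py may differ in detail (e.g. in its tie-breaking rule).

```python
from collections import Counter

def majority_vote(answers, tie_break="first"):
    """Return the most common answer across runs.

    Hypothetical sketch; the real implementation in
    utils/extract_futurex_results.py may differ.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None  # no usable answers from any run
    best = counts.most_common(1)[0][1]
    tied = {a for a, c in counts.items() if c == best}
    if tie_break == "first":
        # Tie-break by earliest appearance across runs.
        for a in answers:
            if a in tied:
                return a
    return sorted(tied)[0]

print(majority_vote(["A", "B", "A", "C"]))  # -> A
```

With a tie (e.g. `["A", "B"]`), the `"first"` rule keeps the answer from the earliest run, which makes repeated extractions deterministic.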

Usage

```bash
# Download dataset
uv run main.py prepare-benchmark get futurex

# Single run
uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex"

# Multiple runs with voting
./scripts/run_evaluate_multiple_runs_futurex.sh
```

Checklist for PR

Must Do

  • Write a good PR title and description, e.g. feat(agent): add pdf tool via mcp, perf: make llm client async, or fix(utils): load custom config via importlib. The CI job check-pr-title enforces the Angular commit message format on PR titles.
  • Run make precommit locally. The CI job lint enforces ruff's default format/lint rules on all new code.
  • Run make pytest. Check the test summary (located at report.html) and coverage report (located at htmlcov/index.html) for new code.

Nice To Have

  • (Optional) Write/update tests under /tests for feat and test PRs.
  • (Optional) Write/update docs under /docs for docs and ci PRs.

# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"

# We use Claude-3.5-Sonnet with OpenRouter backend to initialize the LLM
Member

This is a typo from the old days; it should be Claude 3.7.

@@ -0,0 +1,258 @@
# Futurex-Online
Member

Mention in the docs that this is a quick start for running the futurex benchmark and preparing results, not for fully reproducing the submitted results.

After evaluation completion, extract the results using the provided utility:

```bash title="Extract Results"
uv run utils/extract_futurex_results.py --log_dir logs/futurex/$(date +"%Y%m%d_%H%M")
```
Contributor

Should this be the directory of the completed run?

After evaluation completion, extract the results using the provided utility:

```bash title="Extract Results"
uv run utils/extract_futurex_results.py --log_dir logs/futurex/$(date +"%Y%m%d_%H%M")
```
Contributor

error: unrecognized arguments: --log_dir

Contributor Author

error: unrecognized arguments: --log_dir

My bad; it should be the following format:

```bash
uv run utils/extract_futurex_results.py logs/futurex/$(date +"%Y%m%d_%H%M")
```
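
The failing and working invocations differ in how the log directory is passed: the script evidently takes it as a positional argument, not a `--log_dir` flag, which is why the flagged form fails with "unrecognized arguments". A minimal sketch of such a CLI (hypothetical; the actual parser in utils/extract_futurex_results.py may differ):

```python
import argparse

def build_parser():
    # Hypothetical parser: the log directory is a positional
    # argument, so passing --log_dir triggers an
    # "unrecognized arguments" error.
    parser = argparse.ArgumentParser(
        description="Extract Futurex results from run logs"
    )
    parser.add_argument("log_dir", help="directory containing run logs")
    return parser

args = build_parser().parse_args(["logs/futurex/20250918_1200"])
print(args.log_dir)  # -> logs/futurex/20250918_1200
```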

@JubSteven
Contributor Author

I have pushed a commit that should resolve the issues regarding arguments and documentation.

@ntudy ntudy merged commit d9a29ba into MiroMindAI:miroflow-v0.3 Sep 18, 2025
1 check passed
Zhudongsheng75 pushed a commit to open-compass/MiroFlow that referenced this pull request Dec 27, 2025
* upd: add futurex evaluation support.

* upd: support multiple eval for futurex and add relevant doc.

* upd: fix bugs with doc for futurex.

* debug: fix wrong calling path.
