
M^3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark


A lightweight benchmarking and analysis suite around the Model Context Protocol (MCP). It orchestrates multiple MCP servers, drives different LLMs to complete tasks, produces reproducible results, and offers step-wise evaluations and visualizations. 🎯

  • Key features ✨:
    • Unified multi‑provider LLM driver (see models/api_clients.py)
    • MCP server orchestration and tool selection (mcp_host.py + mcp_servers.json)
    • End‑to‑end benchmark scripts with reproducible outputs (scripts/*.sh → results/, save/)
    • Three evaluation layers: step‑level, call‑level, and final task completion, with plots

Changelog 📝

  • 2025-11-20: Initial public release of M3‑Bench.
  • 2026-01-26: Added optional support for configuring the total number of tools, and added three new CV-related test metrics.

Environment & Installation 🛠️

  • Python 3.11 (recommended)
  • Conda for environment management
  • Optional: CUDA, local/hosted LLMs, and API keys (OpenAI, Anthropic, Google/Gemini, xAI, DeepSeek, Zhipu, etc.)
# Create environment (example)
conda create -n mcp_app python=3.11 -y
conda activate mcp_app

# Install deps (adjust per repo files)
pip install -r requirements_pip.txt
conda install --file requirements_conda.txt  # if provided

# Build/install the bundled MCP servers (Node servers via npm; nasa-mcp via pip)
(cd servers/tmdb-mcp-server && npm install && npm run build)
(cd servers/DINO-X-MCP && npm install && npm run build)
(cd servers/mcp-server-nationalparks && npm install && npm run build)
(cd servers/metmuseum-mcp && npm install && npm run build)
(cd servers/okx-mcp && npm install && npm run build)
(cd servers/hugeicons && npm install && npm run build)
(cd servers/math-mcp && npm install && npm run build)
(cd servers/healthcare-mcp-public && npm install)
(cd servers/nasa-mcp && pip install -e .)

Configuration ⚙️

  • MCP servers: edit mcp_servers.json at repo root (enable/disable servers, args, env vars).
  • Model/API keys: create a .env at repo root and fill keys such as:
    • OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, XAI_API_KEY, DEEPSEEK_API_KEY, ZHIPU_API_KEY, ...
  • Quick setup for .env (a filled‑in example follows this list):
cp .env_example .env
  • Data paths: default GT/PRED paths in scripts can be adjusted (see scripts/evaluate_*.sh). All scripts now use repo‑relative paths by default.
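
A filled‑in .env might look like the following sketch (placeholder values only; set just the keys for the providers you plan to run):

# .env — placeholders, replace with your real keys
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GOOGLE_API_KEY=your-google-key
XAI_API_KEY=your-xai-key
DEEPSEEK_API_KEY=your-deepseek-key
ZHIPU_API_KEY=your-zhipu-key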

Directory Overview 📁

  • scripts/
    • benchmark_fuzzy.sh: run the benchmark to produce predictions (results/<model>_test_mcp_fuzzy.json).
    • evaluate_step.sh: step‑level evaluation and visualization (calls evaluate_trajectories.py and tools/fig_step_eval_result.py).
    • evaluate_call.sh: call‑level classification (outputs callanalysis.json, and composes a PDF via tools/plot_call_pies.py).
    • evaluate_final_answer.sh: final task completion evaluation (outputs results/<model>/taskcompletion.json).
  • models/: unified drivers for OpenAI, Anthropic, Gemini, xAI, DeepSeek, Zhipu, etc.
  • servers/: sample MCP servers (weather, wiki, openlibrary, barcode, paper search, ...).
  • tools/: utilities for result aggregation and plotting.
  • app_mm.py: minimal FastAPI multimodal demo (image upload + MCP toolchain).
  • results/, save/: outputs for evaluations and figures.

MCP Servers

MCP tools across servers 🧰:

[figure: MCP tools per server]

Test the MCP servers with:

python tools/test_mcp_servers.py

Quick Start 🚀

  1. Run the benchmark (generate predictions) 🚀
bash scripts/benchmark_fuzzy.sh
# Output: results/<model>_test_mcp_fuzzy.json
  2. Step‑level evaluation (process quality) 📈
bash scripts/evaluate_step.sh
# Output: results/<model>/ and figures (tools/fig_step_eval_result.py writes PDF to save/)

Example step‑level metrics across models:

[figure: step‑level evaluation across models]

  3. Call‑level evaluation (MCP call classification) 📊
bash scripts/evaluate_call.sh
# Output: results/<model>/callanalysis.json and a combined pies PDF under save/
  4. Final task completion evaluation ✅
bash scripts/evaluate_final_answer.sh
# Output: results/<model>/taskcompletion.json
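
To sanity‑check any of these JSON outputs, pretty‑print them with the Python standard library (replace <model> with the model name used in your scripts):

# Pretty-print a result file; json.tool ships with Python
python -m json.tool "results/<model>/taskcompletion.json"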

ℹ️ Note: Scripts read API keys from .env; model lists and data paths can be changed inside each script.


Interactive Demo (optional) 💬

Multimodal chat with MCP tools and image uploads.

python app_mm.py --MODEL_PATH <your_model_or_api_name> \
  --max_step 4 --max_concurrent 4 --TOP_TOOLS 6 --max_new_tokens 20480

Then open the reported URL. Uploaded images are injected as data URLs for the model and MCP tools to consume.
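
For reference, an image can be turned into the kind of data URL the demo injects; a sketch using GNU coreutils base64 (the wrap flag differs on macOS, and your_image.png is a placeholder):

# Encode an image as a data URL (-w0 disables base64 line wrapping)
echo "data:image/png;base64,$(base64 -w0 your_image.png)"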


FAQ ❓

  • Auth/key errors: ensure .env contains the right keys matching the selected driver.
  • Missing outputs: check that results/ exists, that PRED_PATH/GT_PATH point to the right files, and that the model list includes your model.
  • MCP tools unavailable: ensure the server is enabled in mcp_servers.json, or run the server locally to debug (see the sketch below).
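
A Node server can be launched standalone for debugging; a sketch, assuming the npm build above emits build/index.js (check the server's package.json for the real entry point):

# Run one MCP server directly (entry-point path is an assumption)
node servers/math-mcp/build/index.js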

License 📄

Released under the MIT License. See LICENSE for details.
