A lightweight benchmarking and analysis suite around the Model Context Protocol (MCP). It orchestrates multiple MCP servers, drives different LLMs to complete tasks, produces reproducible results, and offers step-wise evaluations and visualizations. 🎯
- Key features ✨:
  - Unified multi‑provider LLM driver (see `models/api_clients.py`)
  - MCP server orchestration and tool selection (`mcp_host.py` + `mcp_servers.json`)
  - End‑to‑end benchmark scripts with reproducible outputs (`scripts/*.sh` → `results/`, `save/`)
  - Three evaluation layers: step‑level, call‑level, and final task completion, with plots
- 2025-11-20: Initial public release of M3‑Bench.
- 2026-01-26: Added optional configuration of the total number of tools, and added three new CV-related test metrics.
- Python 3.11 (recommended)
- Conda (or Miniconda) for env management
- Optional: CUDA, local/hosted LLMs, and API keys (OpenAI, Anthropic, Google/Gemini, xAI, DeepSeek, Zhipu, etc.)
```bash
# Create environment (example)
conda create -n mcp_app python=3.11 -y
conda activate mcp_app

# Install deps (adjust per repo files)
pip install -r requirements_pip.txt
conda install --file requirements_conda.txt  # if provided
```
```bash
# Build the bundled MCP servers
(cd servers/tmdb-mcp-server && npm install && npm run build)
(cd servers/DINO-X-MCP && npm install && npm run build)
(cd servers/mcp-server-nationalparks && npm install && npm run build)
(cd servers/metmuseum-mcp && npm install && npm run build)
(cd servers/okx-mcp && npm install && npm run build)
(cd servers/hugeicons && npm install && npm run build)
(cd servers/math-mcp && npm install && npm run build)
(cd servers/healthcare-mcp-public && npm install)
(cd servers/nasa-mcp && pip install -e .)
```

- MCP servers: edit `mcp_servers.json` at the repo root (enable/disable servers, args, env vars).
- Model/API keys: create a `.env` at the repo root and fill in keys such as `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `XAI_API_KEY`, `DEEPSEEK_API_KEY`, `ZHIPU_API_KEY`, ...
- Quick setup for `.env`: `cp .env_example .env`
- Data paths: default GT/PRED paths in scripts can be adjusted (see `scripts/evaluate_*.sh`). All scripts now use repo‑relative paths by default.
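For reference, a minimal `mcp_servers.json` entry might look like the sketch below. The field names follow the common MCP client configuration convention; the exact schema and build paths in this repo may differ, so check the file shipped at the repo root before editing.

```json
{
  "mcpServers": {
    "math": {
      "command": "node",
      "args": ["servers/math-mcp/build/index.js"],
      "env": {}
    }
  }
}
```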
- `scripts/benchmark_fuzzy.sh`: run the benchmark to produce predictions (`results/<model>_test_mcp_fuzzy.json`).
- `scripts/evaluate_step.sh`: step‑level evaluation and visualization (calls `evaluate_trajectories.py` and `tools/fig_step_eval_result.py`).
- `scripts/evaluate_call.sh`: call‑level classification (outputs `callanalysis.json` and composes a PDF via `tools/plot_call_pies.py`).
- `scripts/evaluate_final_answer.sh`: final task completion evaluation (outputs `results/<model>/taskcompletion.json`).
- `models/`: unified drivers for OpenAI/Anthropic/Gemini/xAI/DeepSeek/Zhipu/etc.
- `servers/`: sample MCP servers (weather, wiki, openlibrary, barcode, paper search, ...).
- `tools/`: utilities for result aggregation and plotting.
- `app_mm.py`: minimal FastAPI multimodal demo (image upload + MCP toolchain).
- `results/`, `save/`: outputs for evaluations and figures.
MCP tools across servers 🧰:
- Test the MCP servers:

```bash
python tools/test_mcp_servers.py
```

- Run the benchmark (generate predictions) 🚀

```bash
bash scripts/benchmark_fuzzy.sh
# Output: results/<model>_test_mcp_fuzzy.json
```

- Step‑level evaluation (process quality) 📈

```bash
bash scripts/evaluate_step.sh
# Output: results/<model>/ and figures (tools/fig_step_eval_result.py writes a PDF to save/)
```

Example step‑level metrics across models:
- Call‑level evaluation (MCP call classification) 📊

```bash
bash scripts/evaluate_call.sh
# Output: results/<model>/callanalysis.json and a combined pies PDF under save/
```

- Final task completion evaluation ✅

```bash
bash scripts/evaluate_final_answer.sh
# Output: results/<model>/taskcompletion.json
```

ℹ️ Note: Scripts read API keys from `.env` and allow changing model lists and data paths inside.
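The evaluation outputs are plain JSON, so they are easy to post‑process. Below is a minimal sketch that computes a completion rate from a `taskcompletion.json`-style file; the record layout (`task_id`/`completed` fields) is a hypothetical assumption, so inspect your actual `results/` output for the real schema.

```python
import json
import tempfile
from pathlib import Path

def completion_rate(path: str) -> float:
    """Fraction of tasks marked completed in a taskcompletion.json-style file.

    Assumes a hypothetical layout: a list of {"task_id": ..., "completed": bool}
    records. The real schema may differ -- check your results/ output first.
    """
    records = json.loads(Path(path).read_text())
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("completed")) / len(records)

# Demo with synthetic data (not real benchmark output).
with tempfile.TemporaryDirectory() as d:
    demo = Path(d) / "taskcompletion.json"
    demo.write_text(json.dumps([
        {"task_id": 1, "completed": True},
        {"task_id": 2, "completed": False},
        {"task_id": 3, "completed": True},
    ]))
    print(f"completion rate: {completion_rate(str(demo)):.2f}")  # 0.67
```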
Multimodal chat with MCP tools and image uploads.
```bash
python app_mm.py --MODEL_PATH <your_model_or_api_name> \
  --max_step 4 --max_concurrent 4 --TOP_TOOLS 6 --max_new_tokens 20480
```

Then open the reported URL. Uploaded images are injected as data URLs for the model and MCP tools to consume.
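A data URL of the kind the demo injects can be built from a local image as shown below. This is an illustrative helper, not `app_mm.py`'s actual code; the file name `pixel.png` is only for the demo.

```python
import base64
import mimetypes

def to_data_url(image_path: str) -> str:
    """Encode a local image file as a base64 data URL.

    Illustrative sketch of the data-URL form app_mm.py feeds to models;
    the app's internal helper may differ.
    """
    mime, _ = mimetypes.guess_type(image_path)
    mime = mime or "application/octet-stream"
    with open(image_path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{payload}"

# Demo: write a few PNG-signature bytes (content is irrelevant here,
# since the MIME type is guessed from the extension), then encode.
with open("pixel.png", "wb") as f:
    f.write(b"\x89PNG\r\n\x1a\n")
print(to_data_url("pixel.png")[:30])  # data:image/png;base64,iVBORw0K
```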
- Auth/key errors: ensure `.env` contains the right keys for the selected driver.
- Missing outputs: check that `results/` exists, that `PRED_PATH`/`GT_PATH` are correct, and that the model list includes your model.
- MCP tools unavailable: ensure the server is enabled in `mcp_servers.json`, or run the server locally to debug.
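As a quick sanity check for auth/key errors, you can verify that the env vars your chosen driver needs are actually set before launching a run. The driver→key mapping below is an illustrative assumption based on the key names listed in the configuration section; extend it to match the drivers you use.

```python
import os

# Hypothetical driver -> required env vars mapping (adjust to your setup).
REQUIRED_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GOOGLE_API_KEY"],
}

def missing_keys(driver: str, environ=os.environ) -> list[str]:
    """Return the env vars the given driver needs but that are unset or empty."""
    return [k for k in REQUIRED_KEYS.get(driver, []) if not environ.get(k)]

print(missing_keys("openai", environ={}))                            # ['OPENAI_API_KEY']
print(missing_keys("openai", environ={"OPENAI_API_KEY": "sk-..."}))  # []
```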
Released under the MIT License. See LICENSE for details.


