This project supports two tool configuration modes for benchmark testing:
-
Default Setting with Open-Source Tools — Uses open-source tools as much as possible. Config file:
evaluation_os.yaml -
Advanced Setting with Commercial Tools — Uses commercial tools with advanced features. Config file:
evaluation.yaml
| Tool Set | Default Setting with Open-Source Tools |
Advanced Setting with Commercial Tools |
|---|---|---|
| Google Search | Serper | Serper |
| Linux Sandbox | E2B | E2B |
| Audio Transcription | Whisper-Large-v3-Turbo | GPT-4o mini Transcribe |
| Visual Question Answering | Qwen2.5-VL-72B-Instruct | Claude Sonnet 3.7 |
| Reasoning | Qwen3-235B-A22B-Thinking-2507 | Claude Sonnet 3.7 |
Configure the following variables in your apps/miroflow-agent/.env file according to the mode you choose:
# API for Google Search (recommended)
SERPER_API_KEY=your_serper_key
SERPER_BASE_URL="https://google.serper.dev"
# API for Web Scraping (recommended)
JINA_API_KEY=your_jina_key
JINA_BASE_URL="https://r.jina.ai"
# API for Linux Sandbox (recommended)
E2B_API_KEY=your_e2b_key
# API for LLM-as-Judge (for benchmark testing, optional)
OPENAI_API_KEY=your_openai_key
# API for Open-Source Audio Transcription Tool (for benchmark testing, optional)
WHISPER_MODEL_NAME="openai/whisper-large-v3-turbo"
WHISPER_API_KEY=your_whisper_key
WHISPER_BASE_URL="https://your_whisper_base_url/v1"
# API for Open-Source VQA Tool (for benchmark testing, optional)
VISION_MODEL_NAME="Qwen/Qwen2.5-VL-72B-Instruct"
VISION_API_KEY=your_vision_key
VISION_BASE_URL="https://your_vision_base_url/v1/chat/completions"
# API for Open-Source Reasoning Tool (for benchmark testing, optional)
REASONING_MODEL_NAME="Qwen/Qwen3-235B-A22B-Thinking-2507"
REASONING_API_KEY=your_reasoning_key
REASONING_BASE_URL="https://your_reasoning_base_url/v1/chat/completions"
# API for Claude Sonnet 3.7 as Commercial Tools (optional)
ANTHROPIC_API_KEY=your_anthropic_key
# API for Sougou Search (optional)
TENCENTCLOUD_SECRET_ID=your_tencent_cloud_secret_id
TENCENTCLOUD_SECRET_KEY=your_tencent_cloud_secret_keyTool Name: visual_question_answering
Description: An open-source vision-language model service that answers questions about images. Supports local image files and URLs. Automatically encodes local images to Base64 for API requests. Compatible with JPEG, PNG, GIF formats.
- Open-Source Mode: Qwen2.5-VL-72B-Instruct
- Commercial Mode: Claude Sonnet 3.7
Local Deployment (Open-Source Mode):
python3 -m sglang.launch_server \
--model-path /path/to/Qwen2.5-VL-72B-Instruct \
--tp 8 --host 0.0.0.0 --port 1234 \
--trust-remote-code --enable-metrics \
--log-level debug --log-level-http debug \
--log-requests --log-requests-level 2 --show-time-costTool Name: reasoning
Description: A reasoning service for solving complex analytical problems, such as advanced mathematics, puzzles, and riddles.
- Open-Source Mode: Qwen3-235B-A22B-Thinking-2507
- Commercial Mode: Claude Sonnet 3.7
Local Deployment (Open-Source Mode):
python3 -m sglang.launch_server \
--model-path /path/to/Qwen3-235B-A22B-Thinking-2507 \
--tp 8 --host 0.0.0.0 --port 1234 \
--trust-remote-code --enable-metrics \
--log-level debug --log-level-http debug \
--log-requests --log-requests-level 2 \
--show-time-cost --context-length 131072Tool Name: audio_transcription
Description: A transcription service converts audio files to text. Supports MP3, WAV, M4A, AAC, OGG, FLAC, and WMA formats. Can process both local and remote audio. Includes format detection, temporary file handling, and robust error handling.
- Open-Source Mode: Whisper-Large-v3-Turbo
- Commercial Mode: GPT-4o mini Transcribe
Local Deployment (Open-Source Mode):
pip install vllm==0.10.0
pip install vllm[audio]
vllm serve /path/to/whisper \
--served-model-name whisper-large-v3-turbo \
--task transcription