The CLI component provides tools for running all benchmarks in both local and Docker-based environments. It handles benchmark orchestration, environment setup, and result collection.
The CLI enables you to:
- Run individual benchmarks or execute all benchmarks sequentially
- Deploy benchmarks in isolated Docker containers or local environments
- Automatically collect and aggregate results from multiple benchmark runs
Run the installation script to create a Python virtual environment and install dependencies:

```shell
cd cli
./install.sh
```

This will:
- Create a Python 3.12 virtual environment in `.venv/`
- Install required packages from `requirements.txt`
- Prepare the CLI environment for execution
The CLI requires the following packages (automatically installed via install.sh):
- `swe-rex==1.3.0` - Runtime execution framework
- `requests==2.32.4` - HTTP library
- `azure-identity==1.23.0` - Azure authentication
- `litellm==1.77.5` - Unified LLM interface
Execute all benchmarks sequentially with the local runner:
```shell
./run_all_local.sh <model_name>
```

This script will:
- Validate that a model name argument is provided.
- Iterate through each benchmark directory, running `install.sh` and `run.sh <model_name>` when they exist.
- Copy the benchmark outputs back into `cli/` for easy inspection.
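The steps above can be sketched in Python as follows. This is a simplified illustration of the shell script's logic, not the actual implementation; the function name and argument layout are assumptions:

```python
import shutil
import subprocess
from pathlib import Path

def run_all_local(benchmarks_root: Path, model_name: str, out_dir: Path) -> None:
    """Mirror run_all_local.sh: run each benchmark's scripts, then collect outputs."""
    if not model_name:
        raise SystemExit("usage: run_all_local.sh <model_name>")
    for bench in sorted(p for p in benchmarks_root.iterdir() if p.is_dir()):
        for script, args in (("install.sh", []), ("run.sh", [model_name])):
            if (bench / script).exists():  # only invoke scripts that are present
                subprocess.run(["bash", script, *args], cwd=bench, check=True)
        outputs = bench / "outputs"
        if outputs.is_dir():  # copy results back for easy inspection
            shutil.copytree(outputs, out_dir / bench.name, dirs_exist_ok=True)
```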
Make sure the virtual environment remains active when invoking the shell wrapper. For benchmarks that require containerized execution:
```shell
cd cli
./run_docker.sh
```

This script manages Docker-based benchmark execution with proper cleanup and isolation.
With the virtual environment active, use the Docker runner to execute a specific benchmark:
```shell
cd cli
# Activate the CLI virtual environment before running any commands:
source .venv/bin/activate
python docker_run.py --benchmark_name <benchmark_name> --model_name <model>
```

Arguments:
- `--benchmark_name` (required): Name of the benchmark to run (e.g., `cache_bench`, `course_exam_bench`)
- `--model_name` (optional): LLM model to use (default: `gpt-4o`)
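The argument handling described above can be sketched with `argparse`. This is illustrative only; the real `docker_run.py` internals may differ:

```python
import argparse

def parse_args(argv=None):
    """Parse docker_run.py-style arguments (a sketch of the CLI surface above)."""
    parser = argparse.ArgumentParser(description="Run a benchmark in Docker")
    parser.add_argument("--benchmark_name", required=True,
                        help="Benchmark to run, e.g. cache_bench")
    parser.add_argument("--model_name", default="gpt-4o",
                        help="LLM model to use (default: gpt-4o)")
    return parser.parse_args(argv)
```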
Example:
```shell
python docker_run.py --benchmark_name cache_bench --model_name gpt-4o
```

Each benchmark includes an `env.toml` configuration file that specifies:
- Hardware requirements: GPU usage, memory limits
- Docker settings: Container image, entrypoint script
- Environment variables: API keys, endpoints
Example `env.toml`:

```toml
[llm]
OPENAI_API_KEY = "sk-XXXX"
AZURE_API_KEY = "XXX"
AZURE_API_BASE = "XXX"
AZURE_API_VERSION = "2024-05-01-preview"
ANTHROPIC_API_KEY = "sk-ant-XXXX"

[hardware]
use_gpu = false

[env-docker]
image = "default" # or specify custom image
entrypoint = "run.sh"
```

Supported Docker images:
- `default` - Maps to `xuafeng/swe-go-python:latest`
- Custom images can be specified directly
Benchmark results are saved in timestamped directories:
```
cli/outputs/{benchmark_name}__{model_name}__{agent}_{timestamp}/
├── avg_score.json   # Aggregated metrics
├── result.jsonl     # Detailed results
└── logs/            # Execution logs
```

- `avg_score.json`: Aggregate performance metrics (accuracy, scores, etc.)
- `result.jsonl`: Line-delimited JSON with detailed evaluation results for each test case
- `logs/`: Execution logs with timestamps
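The relationship between the two result files can be illustrated with a small aggregation sketch. The `score` field is an assumed per-case schema, not necessarily what every benchmark emits:

```python
import json
import statistics
from pathlib import Path

def aggregate_results(results_dir: Path) -> dict:
    """Recompute an avg_score.json-style summary from result.jsonl."""
    scores = []
    with open(results_dir / "result.jsonl") as fh:
        for line in fh:
            record = json.loads(line)  # one evaluation result per line
            scores.append(float(record["score"]))
    return {"avg_score": statistics.mean(scores), "num_cases": len(scores)}
```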
Generate a visual leaderboard for any results directory with dashboard.py:
```shell
cd cli
python3 dashboard.py --results_dir outputs --output dashboard.html
```

The script produces the System Intelligence Leaderboard, a dark-mode HTML report. See the example and screenshot below.
If Plotly is unavailable, the core leaderboard still renders while charts are omitted. Open the generated HTML file in a browser to explore the latest runs.
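Given the directory naming scheme above, the rows the dashboard aggregates could be collected roughly like this (a sketch; the real script's internals may differ):

```python
import json
from pathlib import Path

def collect_leaderboard(results_dir: Path) -> list:
    """Gather one leaderboard row per timestamped run directory."""
    rows = []
    for run_dir in sorted(p for p in results_dir.iterdir() if p.is_dir()):
        score_file = run_dir / "avg_score.json"
        parts = run_dir.name.split("__")
        # Expect {benchmark_name}__{model_name}__{agent}_{timestamp}
        if len(parts) != 3 or not score_file.exists():
            continue
        benchmark, model, agent_ts = parts
        rows.append({"benchmark": benchmark, "model": model,
                     "run": agent_ts, **json.loads(score_file.read_text())})
    return rows
```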
The CLI uses a centralized logging system configured in logger.py:
- Console output: INFO level and above
- File logging: DEBUG level and above
- Log files: Stored in `cli/logs/vansys-cli_{date}.log`
Log format:

```
YYYY-MM-DD HH:MM:SS | LEVEL | logger-name | message
```
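The behavior described above can be reproduced with a stdlib `logging` setup. This is a sketch that mirrors the description, not necessarily logger.py verbatim:

```python
import logging
from datetime import date
from pathlib import Path

def setup_logger(name: str = "vansys-cli", log_dir: Path = Path("logs")) -> logging.Logger:
    """Console handler at INFO, file handler at DEBUG, shared pipe-delimited format."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s | %(levelname)s | %(name)s | %(message)s",
                            datefmt="%Y-%m-%d %H:%M:%S")
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)        # console: INFO and above
    console.setFormatter(fmt)
    file_handler = logging.FileHandler(log_dir / f"{name}_{date.today()}.log")
    file_handler.setLevel(logging.DEBUG)  # file: DEBUG and above
    file_handler.setFormatter(fmt)
    logger.handlers = [console, file_handler]
    return logger
```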
GPU support is currently not implemented. If a benchmark requires a GPU and `use_gpu = true` is set in `env.toml`, the CLI will exit with an error message.
If Docker containers fail to start:
- Ensure Docker is installed and running
- Verify you have permission to run Docker commands (add your user to the `docker` group)
- Check that the specified Docker image is available or can be pulled
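Those checks can be automated with a small diagnostic helper. This is a hypothetical sketch; the hint messages are illustrative:

```python
import shutil
import subprocess

def diagnose_docker() -> str:
    """Return 'ok' or a hint about why Docker commands are failing."""
    if shutil.which("docker") is None:
        return "Docker is not installed or not on PATH"
    proc = subprocess.run(["docker", "info"], capture_output=True, text=True)
    if proc.returncode != 0:
        if "permission denied" in proc.stderr.lower():
            return "Permission denied: add your user to the docker group"
        return "Docker daemon does not appear to be running"
    return "ok"
```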
If the virtual environment fails to activate:
```shell
cd cli
rm -rf .venv
./install.sh
```

- Main README - Project overview and benchmark descriptions
- SDK Documentation - Details on evaluators and LLM interfaces
- Benchmark Examples - Template for creating new benchmarks
