A comprehensive evaluation platform for comparing LLMs across 90+ benchmarks. Built with Next.js, FastAPI, and Inspect AI.
- Features
- Prerequisites
- Quick Start (Docker)
- Local Development Setup
- Configuration
- Usage
- API Reference
- Project Structure
- Troubleshooting
## Features

- 90+ Benchmarks: MMLU, HumanEval, GPQA, MATH, ARC, HellaSwag, and more
- Real-time Streaming: Live progress updates during evaluations
- Model Comparison: Compare multiple models head-to-head
- Provider Evaluation: Test different inference providers for the same model
- Reasoning Model Support: Special handling for o1, o3, o4-mini, and GPT-4.x/5.x models
- Export Results: Download results as CSV, JSON, Markdown, or TXT
- Cost Estimation: Track estimated costs per evaluation
## Prerequisites

- Node.js 18+ for the dashboard
- Python 3.11+ for the benchmark service (3.12 recommended)
- Docker and Docker Compose (recommended)
- OpenRouter API Key from openrouter.ai/keys
## Quick Start (Docker)

The fastest way to get running:

```bash
# 1. Clone the repository
git clone <your-repo-url>
cd eval-dashboard

# 2. Set up environment variables
cp .env.example .env

# 3. Add your OpenRouter API key to .env
# Edit .env and set: OPENROUTER_API_KEY=sk-or-v1-your-key-here

# 4. Start all services
docker compose up

# 5. Open the dashboard at http://localhost:3000
```

The Docker setup includes:
- Dashboard at http://localhost:3000
- Benchmark API at http://localhost:8000
- PostgreSQL database for result storage
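Once the containers are up, a quick way to confirm the backend is reachable is to hit the documented /health and /benchmarks endpoints. A minimal sketch using only the Python standard library (the response shapes beyond the `status` field are not guaranteed here):

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # benchmark service started by docker compose

def get_json(path: str):
    # Plain stdlib HTTP GET, so no extra dependencies are needed.
    with urllib.request.urlopen(f"{BASE}{path}", timeout=10) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print("health:", get_json("/health").get("status"))  # expected: "healthy"

    benchmarks = get_json("/benchmarks")
    # The exact payload shape isn't documented here, so just report its size.
    print("benchmarks listed:", len(benchmarks) if hasattr(benchmarks, "__len__") else "?")
```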
## Local Development Setup

For development with hot-reload, follow these steps:

```bash
git clone <your-repo-url>
cd eval-dashboard
```

Set up the Python backend:

```bash
cd benchmark-service

# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install inspect-evals (contains 90+ pre-built benchmarks)
pip install inspect-evals

# Copy environment template
cp .env.example .env
```

Edit benchmark-service/.env and add your API key:
```bash
# Required
OPENROUTER_API_KEY=sk-or-v1-your-key-here

# Optional: For direct provider access
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```

With the virtual environment activated:
```bash
cd benchmark-service
source .venv/bin/activate  # If not already activated

# Option A: Use the start script
./start.sh

# Option B: Start manually
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Verify it's running:

```bash
curl http://localhost:8000/health
# Should return: {"status":"healthy",...}
```

Open a new terminal:
```bash
cd dashboard

# Install Node.js dependencies
npm install

# Copy environment template
cp .env.example .env
```

Edit dashboard/.env:
```bash
# Points to your local benchmark service
NEXT_PUBLIC_EVAL_SERVICE_URL=http://localhost:8000

# Required for OpenRouter API calls
OPENROUTER_API_KEY=sk-or-v1-your-key-here
```

Start the frontend:

```bash
cd dashboard
npm run dev
```

Open http://localhost:3000 in your browser.
If you prefer Docker for the backend but want hot-reload on the frontend:
```bash
# Terminal 1: Start backend services with Docker
docker compose up benchmark-service postgres

# Terminal 2: Run frontend locally
cd dashboard
npm run dev
```

## Configuration

Create a .env file in the project root (for Docker) or in each service directory (for local development):
| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENROUTER_API_KEY` | Yes | - | Your OpenRouter API key |
| `NEXT_PUBLIC_EVAL_SERVICE_URL` | No | `http://localhost:8000` | Backend API URL |
| `POSTGRES_USER` | No | `eval` | Database username |
| `POSTGRES_PASSWORD` | No | `evalpass` | Database password |
| `POSTGRES_DB` | No | `eval_db` | Database name |
| `CORS_ORIGINS` | No | `http://localhost:3000` | Allowed CORS origins |
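For reference, a Python service could pick these up with the same defaults roughly like this (an illustrative sketch with an assumed database host and port, not the project's actual configuration code):

```python
import os

# Defaults mirror the table above; only OPENROUTER_API_KEY has no fallback.
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]
CORS_ORIGINS = os.getenv("CORS_ORIGINS", "http://localhost:3000").split(",")
POSTGRES_USER = os.getenv("POSTGRES_USER", "eval")
POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD", "evalpass")
POSTGRES_DB = os.getenv("POSTGRES_DB", "eval_db")

# Assumed connection-string format; adjust host/port to your deployment.
DATABASE_URL = f"postgresql://{POSTGRES_USER}:{POSTGRES_PASSWORD}@localhost:5432/{POSTGRES_DB}"
```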
For direct API access (bypassing OpenRouter):
```bash
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
TOGETHER_API_KEY=...
FIREWORKS_API_KEY=...
```

## Usage

To run a single evaluation:

- Navigate to Models or Matrix View in the sidebar
- Select a model (e.g., `gpt-4o`, `claude-3-5-sonnet`)
- Choose a benchmark (e.g., `MMLU`, `HumanEval`)
- Set the sample size (number of questions to run)
- Click Start Eval (the sketch below shows the equivalent API request)
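The same run can be triggered without the UI. A hedged sketch, assuming the blocking `/run` endpoint accepts the same JSON body as the `/run/stream` example in the API Reference (uses the third-party `requests` package):

```python
import requests

payload = {
    "model": "openai/gpt-4o",   # OpenRouter-style model id
    "benchmark": "mmlu",
    "limit": 10,                # sample size
    "provider": "openrouter",
}

# Blocking call: returns once the evaluation finishes, so allow a generous timeout.
resp = requests.post("http://localhost:8000/run", json=payload, timeout=3600)
resp.raise_for_status()
print(resp.json())
```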
Compare multiple models across multiple benchmarks:
- Go to Matrix View
- Select multiple models
- Select one or more benchmarks
- Click Run Matrix Evaluation
- View results in the comparison grid
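The matrix flow can also be driven programmatically by looping the payload from the sketch above over every model/benchmark pair (again assuming the blocking `/run` endpoint):

```python
import itertools
import requests

models = ["openai/gpt-4o", "anthropic/claude-3-5-sonnet"]
benchmarks = ["mmlu", "humaneval"]

results = {}
for model, benchmark in itertools.product(models, benchmarks):
    resp = requests.post(
        "http://localhost:8000/run",
        json={"model": model, "benchmark": benchmark, "limit": 10, "provider": "openrouter"},
        timeout=3600,  # evaluations can take a while
    )
    resp.raise_for_status()
    results[(model, benchmark)] = resp.json()

for (model, benchmark), result in results.items():
    print(f"{model} on {benchmark}: {result}")
```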
After an evaluation completes, use the export buttons to download:
- CSV: Spreadsheet format
- JSON: Raw data
- Markdown: Documentation format
- TXT: Plain text report
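If you need the same formats outside the UI, an exported JSON result can be re-serialized in a few lines. The row fields below are made up for illustration; substitute whatever keys your export actually contains:

```python
import csv
import json
from pathlib import Path

# Hypothetical rows; real exports define their own field names.
rows = [
    {"benchmark": "mmlu", "model": "openai/gpt-4o", "accuracy": 0.82, "samples": 10},
    {"benchmark": "mmlu", "model": "anthropic/claude-3-5-sonnet", "accuracy": 0.85, "samples": 10},
]

Path("results.json").write_text(json.dumps(rows, indent=2))

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# Minimal Markdown table.
header = "| " + " | ".join(rows[0]) + " |"
separator = "|" + "---|" * len(rows[0])
body = ["| " + " | ".join(str(v) for v in row.values()) + " |" for row in rows]
Path("results.md").write_text("\n".join([header, separator, *body]))
```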
## API Reference

Benchmark service (http://localhost:8000):

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/benchmarks` | GET | List all 90+ benchmarks |
| `/providers` | GET | List available providers |
| `/run` | POST | Run evaluation (blocking) |
| `/run/stream` | POST | Run evaluation with SSE streaming |
Dashboard API routes (served by Next.js at http://localhost:3000):

| Endpoint | Method | Description |
|---|---|---|
| `/api/models` | GET | Fetch models from OpenRouter |
| `/api/providers` | GET | Get providers for a model |
| `/api/benchmarks` | GET | List configured benchmarks |
| `/api/eval/stream` | GET | Stream evaluation results |
Example request to the benchmark service:

```bash
curl -X POST http://localhost:8000/run/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o",
    "benchmark": "mmlu",
    "limit": 10,
    "provider": "openrouter"
  }'
```
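To consume the stream from Python instead of curl, something like this works. It assumes standard text/event-stream framing (`data: ...` lines) and uses the `requests` package; the event payloads are whatever the service emits, so they are simply printed:

```python
import json
import requests

payload = {
    "model": "openai/gpt-4o",
    "benchmark": "mmlu",
    "limit": 10,
    "provider": "openrouter",
}

# stream=True keeps the connection open so SSE events arrive as they are sent.
with requests.post(
    "http://localhost:8000/run/stream", json=payload, stream=True, timeout=3600
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue  # skip keep-alives and non-data SSE fields
        data = line[len("data:"):].strip()
        try:
            print(json.loads(data))  # progress or result event
        except json.JSONDecodeError:
            print(data)              # plain-text event
```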
## Project Structure

```
eval-dashboard/
├── benchmark-service/     # Python FastAPI backend
│   ├── main.py            # FastAPI app entry point
│   ├── inspect_runner.py  # Inspect AI evaluation runner
│   ├── requirements.txt   # Python dependencies
│   └── Dockerfile
├── dashboard/             # Next.js frontend
│   ├── src/
│   │   ├── app/           # Next.js app router pages
│   │   ├── components/    # React components
│   │   ├── hooks/         # Custom React hooks
│   │   └── lib/           # Utilities and configs
│   └── package.json
├── provider-eval/         # CLI alternative (incomplete)
├── database/              # SQL migrations
├── docker-compose.yml     # Full stack deployment
└── .env.example           # Environment template
```
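benchmark-service/inspect_runner.py wraps Inspect AI. For orientation, running an inspect-evals benchmark directly from Python looks roughly like this. It is a sketch, not the project's runner; the `openrouter/` model prefix and the `mmlu_0_shot` task name are assumptions based on Inspect AI's and inspect-evals' conventions, and `OPENROUTER_API_KEY` must be set in the environment:

```python
from inspect_ai import eval

# Run a handful of MMLU samples through OpenRouter via Inspect AI.
logs = eval(
    "inspect_evals/mmlu_0_shot",       # task registered by the inspect-evals package
    model="openrouter/openai/gpt-4o",  # provider prefix + OpenRouter model id
    limit=10,                          # sample size, as in the dashboard
)

# eval() returns a list of EvalLog objects; print the aggregate scores.
for log in logs:
    if log.results:
        for score in log.results.scores:
            print(score.name, score.metrics)
```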
The provider-eval/ directory contains an incomplete TypeScript CLI tool intended as an alternative to the web dashboard. It provides a lightweight way to run evaluations from the command line without the full stack.
Status: Incomplete
The CLI currently supports basic MMLU evaluations but lacks:
- Full benchmark support (only MMLU implemented)
- Streaming output
- Result persistence
- Multi-provider comparison
If you're interested in contributing, explore the source in provider-eval/src/.
## Troubleshooting

### Backend won't start or the port is in use

```bash
# Check if port 8000 is in use
lsof -i :8000

# Try a different port
uvicorn main:app --port 8001
```

### API key not found

Make sure OPENROUTER_API_KEY is set in your .env file and the file is in the correct directory:
- For Docker: `.env` in the project root
- For local development: `.env` in both `benchmark-service/` and `dashboard/`
### Evaluations time out

Reasoning models (o1, o3, o4-mini, GPT-4.x/5.x) take longer. The timeout automatically extends for these models (see the sketch below for the general idea), but you can also:

- Reduce the sample size
- Use a faster model for testing
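The automatic extension mentioned above can be pictured as nothing more than a per-model timeout lookup. Purely illustrative (the names and numbers here are assumptions, not the project's actual values):

```python
# Illustrative only: choose a longer timeout for reasoning-heavy model families.
REASONING_PREFIXES = ("o1", "o3", "o4-mini", "gpt-5")

def eval_timeout_seconds(model: str, base: int = 600, extended: int = 3600) -> int:
    """Return a per-request timeout, extended for reasoning models."""
    name = model.split("/")[-1].lower()  # e.g. "openai/o3-mini" -> "o3-mini"
    return extended if name.startswith(REASONING_PREFIXES) else base

print(eval_timeout_seconds("openai/o3-mini"))              # 3600
print(eval_timeout_seconds("anthropic/claude-3-5-haiku"))  # 600
```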
### Docker issues

```bash
# Check logs
docker compose logs

# Rebuild containers
docker compose build --no-cache

# Clean up and restart
docker compose down -v
docker compose up
```

### Dashboard can't connect to the backend

- Verify the backend is running: `curl http://localhost:8000/health`
- Check `NEXT_PUBLIC_EVAL_SERVICE_URL` in `dashboard/.env`
- If using Docker, the dashboard connects to `http://benchmark-service:8000` internally
## Project Status

This platform is under active development.
Tested:
- Basic evaluation flow with small sample sizes (10-50 questions)
- Streaming results display
- Model and benchmark selection UI
- Export functionality (CSV, JSON, Markdown, TXT)
- Provider switching
- Reasoning models (o1, o3, GPT-5.x) with answer parsing via OpenRouter
Not yet fully tested:
- Full evaluations (100+ questions) across all 90+ benchmarks
- All model/provider combinations via OpenRouter
- Direct provider integrations (Anthropic, Google, Together, Fireworks)
- Matrix evaluation with multiple models simultaneously
- Database persistence of results across sessions
- Production deployment at scale
Known Limitations:
- Latency measurements may show 0ms for providers that don't report timing
- Some benchmarks require specific model capabilities (code execution, vision)
## Benchmark Requirements

Most benchmarks (75+) work out of the box. Some specialized benchmarks require additional setup:
| Benchmark | Requirements | Install Command |
|---|---|---|
| SWE-bench | Docker, Python 3.11+, 100GB+ disk | pip install inspect-evals[swe_bench] |
| MLE-bench | Python 3.11+ | pip install inspect-evals[mle_bench] |
| GAIA | Playwright + browsers | pip install inspect-evals[gaia] && playwright install |
| Cybench, CTF | Docker, 32-65GB+ disk | Docker must be running |
| OSWorld | Docker | Docker must be running |
| SciKnowEval | Python 3.10-3.12 | pip install gensim |
The API displays helpful error messages with install instructions if you run a benchmark with missing dependencies.
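If you want to fail fast before launching one of the Docker-backed benchmarks (SWE-bench, Cybench, OSWorld), a small pre-flight check along these lines can save a round trip; it relies only on the standard `docker info` command:

```python
import shutil
import subprocess

def docker_available() -> bool:
    """Return True if the Docker CLI is installed and the daemon responds."""
    if shutil.which("docker") is None:
        return False
    try:
        subprocess.run(
            ["docker", "info"],
            check=True, capture_output=True, timeout=15,
        )
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False

if __name__ == "__main__":
    print("Docker ready:", docker_available())
```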
## Reasoning Model Support

- OpenAI via OpenRouter (o1, o3, GPT-5.x): Automatically sets `reasoning_effort=none` for parseable responses
- OpenAI Direct API: Full reasoning support with configurable `reasoning_effort`
- Anthropic (Claude 3.7+, Claude 4): Extended thinking with accessible summaries
- DeepSeek (R1 series): Reasoning content in `reasoning_content` field
- Google (Gemini 2.5+, Gemini 3): Thinking models with configurable `thinkingLevel`
- Microsoft (Phi-4-reasoning): Reasoning models supported
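When these models are called through an OpenAI-compatible endpoint such as OpenRouter, the visible answer and any exposed reasoning can arrive in different fields. A hedged sketch of separating the two from a chat-completion response dict (the `reasoning_content` key follows the DeepSeek convention noted above; other providers may expose `reasoning` or nothing at all):

```python
def split_answer_and_reasoning(response: dict) -> tuple[str, str | None]:
    """Extract the final answer and, if present, the model's reasoning text."""
    message = response["choices"][0]["message"]
    answer = message.get("content") or ""
    # DeepSeek R1-style field first, then a generic fallback.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    return answer, reasoning

# Example with a DeepSeek-style payload (shape assumed for illustration):
response = {
    "choices": [{
        "message": {
            "content": "B",
            "reasoning_content": "Eliminated A and C because ...",
        }
    }]
}
print(split_answer_and_reasoning(response))  # ('B', 'Eliminated A and C because ...')
```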
## Contributing

Contributions and testing feedback are welcome.
## Tech Stack

- Frontend: Next.js 15, React 19, TailwindCSS, Recharts
- Backend: FastAPI, Inspect AI, Python 3.12+
- Database: PostgreSQL 16
- Deployment: Docker Compose
## License

MIT