A local-first benchmarking arena for evaluating and comparing Large Language Models (LLMs) with both manual scoring and optional automated evaluation.
Highlights
- SQLite-backed, single-binary Go server with SSR templates + WebSockets (`:8080`)
- Prompt suites, profiles, models, results grid, and analytics
- Optional Python FastAPI "judge service" for automated evaluation (`:8001`)
- Encrypted API key storage (AES-256-GCM) via `ENCRYPTION_KEY`
UI Stack
- Tailwind CSS v4.1.18 + DaisyUI v5.0.0 (0% custom CSS)
- Built-in DaisyUI components and themes (coffee)
- Industry-standard utility-first styling approach
- No custom CSS codebase to maintain
- 1. Quick Start
- 2. UI Design
- 3. Features
- 4. Architecture
- 5. Tech Stack
- 6. Installation
- 7. Usage Tutorial
- 8. Development
- 9. Testing
- 10. Troubleshooting
- 11. API Reference
- 12. Project Structure
- 13. Environment Variables
- 14. Documentation Guidelines
- 15. License
- 16. Contact
```bash
git clone https://github.com/lavantien/llm-tournament.git
cd llm-tournament
make run
```
Open http://localhost:8080 (data is stored in `data/tournament.db` by default).
No make? Run directly:
```bash
CGO_ENABLED=1 go run .
```
PowerShell:
```powershell
$env:CGO_ENABLED=1; go run .
```
Updated: Tailwind v4 + DaisyUI v5 (Zero Custom CSS)
The UI has been migrated to use 100% pure Tailwind v4 + DaisyUI v5 components. See DESIGN_CONCEPT.md for complete design specifications and DESIGN_ROLLOUT.md for detailed migration plan.
Key Design Decisions:
- Zero custom CSS - all styling uses Tailwind utilities or DaisyUI semantic components
- Built-in DaisyUI `cyberpunk` theme provides dark backgrounds with neon accents
- Tailwind v4 built-in animations (`animate-spin`, `animate-ping`, `animate-pulse`) replace custom keyframes
- Dynamic score theming uses Tailwind arbitrary values (`bg-[#color]`) instead of CSS variables
- Glass panels use DaisyUI `.card` components without custom glow effects
- Industry-standard approach using well-maintained tools (Tailwind + DaisyUI)
Trade-offs:
- No glass glow overlay effects (cleaner appearance)
- No grid overlay texture (cleaner background)
- Simpler animations (less dramatic, more performant)
- Standard DaisyUI components instead of custom semantic classes
- Multi-judge consensus scoring using Claude Opus 4.5, GPT-5.2, and Gemini 3 Pro with extended thinking
- Dual evaluation modes: objective (semantic matching) and creative (quality assessment)
- Async job queue with 3 concurrent workers and job persistence
- Real-time progress tracking and cost management (provider pricing varies)
- AES-256-GCM encrypted API key storage
- Complete audit trail with judge reasoning and confidence scores
- Real-time scoring on 0-100 scale (increments: 0, 20, 40, 60, 80, 100)
- Automatic model ranking with live leaderboard updates
- WebSocket-based instant updates across all clients
- State backup and rollback support
- Drag-and-drop prompt reordering and bulk operations
- Independent prompt suites with isolated profiles, prompts, and results
- JSON import/export for suites and evaluation results
- Duplicate cleanup and SQLite migration support
- One-click suite switching
- 12-tier classification system: Transcendental (>=3780) to Primordial (<300)
- Interactive visualizations using Chart.js
- Score distributions and tier-based model grouping
- Performance comparisons across models and prompt types
- Markdown editor with live preview
- Advanced search and filtering
- Copy-to-clipboard functionality
- Connection status monitoring with automatic reconnection
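The 12-tier classification above maps a model's total score to a named tier. As a hedged sketch: only the two documented boundaries (Transcendental at >=3780, Primordial below 300) appear in this README, so the ten intermediate tiers are represented here by a placeholder rather than invented thresholds.

```go
package main

import "fmt"

// tierFor maps a model's total score to a tier name. Only the two boundaries
// documented in the README are encoded; the real implementation has ten more
// tiers with their own thresholds (see the Stats page).
func tierFor(total int) string {
	switch {
	case total >= 3780:
		return "Transcendental"
	case total < 300:
		return "Primordial"
	default:
		return "intermediate tier" // placeholder for the 10 undocumented tiers
	}
}

func main() {
	fmt.Println(tierFor(4000)) // Transcendental
	fmt.Println(tierFor(100))  // Primordial
}
```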
```
Go Server (:8080)            Python Service (:8001)
├── HTTP Handlers            ├── AI Judge Service
├── WebSocket Hub ──HTTP──→  ├── 3 LLM Judges
├── Job Queue                └── Consensus Scoring
└── SQLite DB
```
High-Level System Context
```mermaid
graph LR
    subgraph Client Side
        Browser["User Browser\n(HTML/JS/Templates)"]
    end
    subgraph Server Side
        subgraph Go Monolith
            HTTP_WS["Go HTTP & WebSocket Server"]
        end
        SQLite[("SQLite Database\n(Single Source of Truth)")]
        PythonService["Python FastAPI\n(Judge Service)"]
    end
    Browser -- "HTTP Requests / WebSocket" --> HTTP_WS
    HTTP_WS -- "Reads/Writes Data" --> SQLite
    HTTP_WS -- "HTTP Requests (Scoring)" --> PythonService
    PythonService -- "Returns Scores" --> HTTP_WS
```
Layered Architecture Flow
```mermaid
graph TD
    subgraph "Go Monolith Layers"
        Surface["1. Surface Layer\n(Templates: *.html, *.js)"]
        Handlers["2. HTTP Handlers\n(handlers/*.go)"]
        Middleware["3. Middleware Layer\n(DB, State, Auth, Encryption)\n(middleware/*.go)"]
        Evaluator["4. Evaluator Layer\n(Async Jobs, Python Client)\n(evaluator/*.go)"]
    end
    DB[("SQLite Database")]
    ExternalJudge["Python FastAPI Judge Service\n(python_service/)"]
    %% Main flow
    Surface --> Handlers
    Handlers --> Middleware
    Middleware --> Evaluator
    %% Data access
    Middleware <-->|"Read/Write Schema"| DB
    Evaluator -.->|"Updates Job Status/Results"| Middleware
    %% External call
    Evaluator -- "HTTP Calls for Scoring" --> ExternalJudge
```
Sequence: Manual Evaluation Flow
```mermaid
sequenceDiagram
    participant User
    participant Browser
    participant GoServer
    participant SQLite
    participant WebSocket
    User->>Browser: Clicks score button
    Browser->>GoServer: POST /results/update
    GoServer->>SQLite: UPDATE scores SET score = ?
    SQLite-->>GoServer: Success
    GoServer->>WebSocket: Broadcast score_update
    WebSocket-->>Browser: Real-time update
    Browser->>User: Live leaderboard refresh
```
Sequence: Automated Evaluation Flow
```mermaid
sequenceDiagram
    participant User
    participant GoServer
    participant JobQueue
    participant PythonService
    participant AIJudges
    participant SQLite
    User->>GoServer: POST /evaluate/all
    GoServer->>JobQueue: Create jobs for all model×prompt pairs
    JobQueue->>JobQueue: Dispatch to workers (3 concurrent)
    loop For each job
        JobQueue->>PythonService: POST /evaluate/objective
        PythonService->>AIJudges: Call Claude, GPT, Gemini
        AIJudges-->>PythonService: Individual scores
        PythonService->>PythonService: Consensus algorithm
        PythonService-->>JobQueue: Final score + reasoning
    end
    JobQueue->>SQLite: Store results
    JobQueue->>User: WebSocket progress updates
```
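The "consensus algorithm" step combines three judge verdicts into one score. The real logic lives in the Python service; as a hedged Go sketch, one robust choice is the median of the three scores (resistant to a single outlier judge). The `judgeScore` type and the use of a plain median rather than confidence weighting are assumptions, not the project's actual algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// judgeScore is one judge's verdict: a 0-100 score plus a self-reported
// confidence. The shape is illustrative, not the service's real schema.
type judgeScore struct {
	Judge      string
	Score      float64
	Confidence float64
}

// consensus returns the median of the judges' scores, which ignores a single
// outlier. The real service may weight by confidence instead.
func consensus(scores []judgeScore) float64 {
	vals := make([]float64, len(scores))
	for i, s := range scores {
		vals[i] = s.Score
	}
	sort.Float64s(vals)
	return vals[len(vals)/2] // median for an odd number of judges
}

func main() {
	scores := []judgeScore{
		{"claude", 80, 0.9},
		{"gpt", 100, 0.7},
		{"gemini", 80, 0.8},
	}
	fmt.Println(consensus(scores)) // median of 80, 100, 80
}
```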
Sequence: Prompt Management Flow
```mermaid
sequenceDiagram
    participant User
    participant Browser
    participant GoServer
    participant SQLite
    User->>Browser: Drag prompt to reorder
    Browser->>GoServer: WS message: reorder_prompts
    GoServer->>SQLite: UPDATE prompts SET order = ?
    SQLite-->>GoServer: Success
    GoServer->>Browser: WS broadcast: update_prompts_order
    Browser->>User: Reorder animation completes
```
Request Flow: User -> Handlers -> Middleware -> SQLite -> WebSocket Broadcast
Evaluation Flow: Job Queue -> Python Service -> AI Judges -> Consensus -> Score Update
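The evaluation flow's job queue dispatches model×prompt jobs to a fixed pool of 3 workers. A minimal worker-pool sketch in Go follows; the `evalJob` type and `scoreFn` callback (standing in for the HTTP call to the Python judge service) are illustrative assumptions, not the evaluator package's real types.

```go
package main

import (
	"fmt"
	"sync"
)

// evalJob pairs a model with a prompt. scoreFn stands in for the HTTP call
// to the Python judge service; the 3-worker pool mirrors the evaluator's
// documented concurrency.
type evalJob struct{ Model, Prompt string }

func runQueue(jobs []evalJob, scoreFn func(evalJob) int) map[evalJob]int {
	const workers = 3
	ch := make(chan evalJob)
	results := make(map[evalJob]int)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range ch {
				s := scoreFn(j) // POST /evaluate/objective in the real pipeline
				mu.Lock()
				results[j] = s
				mu.Unlock()
			}
		}()
	}
	for _, j := range jobs {
		ch <- j
	}
	close(ch)
	wg.Wait()
	return results
}

func main() {
	jobs := []evalJob{{"gpt-4o", "p1"}, {"gpt-4o", "p2"}, {"claude", "p1"}}
	res := runQueue(jobs, func(evalJob) int { return 80 })
	fmt.Println(len(res)) // one result per job
}
```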
- This is a Go monolith (HTTP + WebSocket) with SQLite as the single source of truth, plus an optional Python FastAPI "judge service" for automated scoring.
- The repo is organized by layer: surface (templates) -> HTTP handlers -> middleware (DB/state/render/ws/encryption) -> evaluator (async jobs + Python client) -> `python_service/` (judge logic).
- The fastest "index" is the URL-to-handler map in `main.go:60`; the DB schema is centralized in `middleware/database.go:58`.
- UI migration: all styling now uses Tailwind v4 + DaisyUI v5 components with zero custom CSS. See DESIGN_ROLLOUT.md for complete migration details.
- HTTP routes / feature entrypoints: `main.go:60` (every user-visible feature starts as a path here).
- HTML/JS for a page: `templates/*.html` and `templates/*.js` (e.g. `templates/results.html`, `templates/prompt_list.html`).
- DB tables & relationships: `middleware/database.go:58` (schema includes `suites`, `profiles`, `prompts`, `models`, `scores`, `settings`, `evaluation_jobs`, `evaluation_history`, etc.).
- Per-feature server logic: `handlers/*.go` (files are feature-named: prompts/models/profiles/results/stats/settings/suites/evaluation).
- WebSocket messages: `middleware/socket.go:33` (server-side `/ws`, broadcasting and client tracking).
- Automated evaluation pipeline: `handlers/evaluation.go:25` -> `evaluator/job_queue.go:11` (workers/jobs) -> `evaluator/litellm_client.go:12` (HTTP to Python) -> `python_service/main.py:87` (FastAPI endpoints).
- UI components: Tailwind v4 + DaisyUI v5. See DESIGN_CONCEPT.md for the complete component mapping.
- Tests as documentation: `handlers/*_test.go`, `middleware/*_test.go`, `evaluator/*_test.go`, `integration/prompts_integration_test.go`.
- Prompt suites: `main.go:76` -> `handlers/suites.go` (+ UI in `templates/*prompt_suite*.html`)
- Prompts CRUD/order: `main.go:62` / `main.go:66` / `main.go:73` -> `handlers/prompt.go:1` (+ reorder over WS in `middleware/socket.go:71`)
- Models CRUD: `main.go:63` -> `handlers/models.go`
- Manual scoring/results UI: `main.go:80` / `main.go:81` -> `handlers/results.go` (+ `templates/results.html`)
- Stats/analytics: `main.go:93` -> `handlers/stats.go` (+ `templates/stats.html`)
- Settings + encrypted keys: `main.go:95` -> `handlers/settings.go` (+ crypto in `middleware/encryption.go:13`)
- Automated evaluation: `main.go:98` -> `handlers/evaluation.go:25` (jobs stored in `evaluation_jobs`, see `middleware/database.go:115`)
- Find a feature by URL: `rg -n '"/evaluate/all"|"/results"|"/settings"' main.go`
- Find which handler renders a template: `rg -n "results\.html|prompt_list\.html" handlers`
- Find everything touching a table: `rg -n "evaluation_jobs|evaluation_history|model_responses" -S .`
- Find a WebSocket message type: `rg -n "update_prompts_order|results" middleware/templates -S`
Backend: Go 1.24+, Gorilla WebSocket, Blackfriday, Bluemonday, SQLite, AES-256-GCM
AI Service: Python 3.8+, FastAPI, LiteLLM, Anthropic/OpenAI/Google SDKs
Frontend: HTML5, Tailwind CSS v4.1.18, DaisyUI v5.0.0, JavaScript ES6+, Chart.js 4.x, Marked.js
Security: XSS sanitization, CORS protection, input validation, encrypted API keys
- Go 1.24+
- Python 3.8+ (for automated evaluation)
- A C toolchain for CGO/SQLite (e.g., gcc/clang; on Windows install MinGW-w64/MSYS2)
- Git
- Make (optional, for convenience targets)
- Node.js and npm (for UI screenshots only)
```bash
# Run from source
make run

# Or without make
CGO_ENABLED=1 go run .
```
Build a binary:
```bash
make build
```
Run it:
- Linux/macOS: `./release/llm-tournament`
- Windows (PowerShell): `.\release\llm-tournament.exe`
One-time migration (only if upgrading old result formats):
```bash
CGO_ENABLED=1 go run . --migrate-results
```
Install Python dependencies:
```bash
cd python_service
pip install -r requirements.txt
```
Generate and export `ENCRYPTION_KEY` (64 hex chars / 32 bytes):
```bash
export ENCRYPTION_KEY=$(openssl rand -hex 32)
```
PowerShell:
```powershell
$env:ENCRYPTION_KEY = (python -c "import secrets; print(secrets.token_hex(32))")
```
Start the Python service (terminal 1):
```bash
python main.py  # Port 8001
```
Start the Go server (terminal 2):
```bash
cd ..
CGO_ENABLED=1 go run .  # Port 8080
```
Configure API keys at http://localhost:8080/settings
Complete setup guide: AUTOMATED_EVALUATION_SETUP.md
The UI now uses Tailwind CSS v4 + DaisyUI v5 with zero custom CSS. See DESIGN_CONCEPT.md and DESIGN_ROLLOUT.md for complete migration details.
Install dependencies:
```bash
npm install
```
This installs:
- Tailwind CSS v4.1.18
- DaisyUI v5.0.0
- PostCSS and build tools

Build CSS:
```bash
npm run build:css
```
This generates `templates/output.css` from `templates/input.css` using PostCSS.
This tutorial will guide you through the essential workflows of LLM Tournament Arena.
After starting the server, open http://localhost:8080. You'll see:
- Top navigation bar - Contains links to Results, Stats, Prompts, Profiles, Evaluate, and Settings
- Suite selector - On the right side of the top bar, with New/Edit/Delete buttons
- Prompts page - Your starting point for managing test prompts (a default suite is created automatically)
Models are added directly from the Results page:
- Click Results in the top navigation bar
- At the top of the page, you'll see a form with "Enter new model name"
- Type the model name (e.g., "claude-3-5-sonnet-20241022", "gpt-4o", etc.)
- Click Add
The model will appear in the results grid. Repeat for each model you want to evaluate.
A Profile is a group of models that you want to evaluate together.
- Click Profiles in the top bar
- Click Add Profile button
- Enter a name for your profile
- Select the models you want to include
- Click Save
- Navigate to Prompts in the top bar
- Click Add Prompt button
- Enter prompt details:
- Title: Short descriptive name
- Category: e.g., "coding", "creative-writing", "reasoning"
- Content: Your test prompt (Markdown supported)
- Expected Answer: Reference answer for manual comparison
- Click Save
The Evaluate page lets you score models one prompt at a time.
How to access: You typically navigate here by clicking a score cell in the Results page (see section 7.6), which automatically takes you to the evaluate page for that model and prompt.
Once on the Evaluate page, you'll see:
- The current model name at the top
- The prompt number (e.g., "Prompt 3 of 10")
- The prompt text and expected solution
- The model's response (which you can save)
- Score buttons (0, 20, 40, 60, 80, 100)
To score:
- Click a score button to select it
- Click ✅ to submit and move to the next prompt
- Use ⬅️➡️ buttons to navigate between prompts without scoring
- Click ❌ to return to the Results page
The Results page shows your scoring grid and lets you edit individual scores.
- Results grid overview:
  - Rows show each model
  - Columns show each prompt
  - Cells show scores with color coding (green = high, red = low)
  - Total scores and progress bars on the right
- Edit a score:
  - Click any score cell to go to the Evaluate page for that model×prompt combination
  - Update your score and click ✅
  - Click ❌ to return to Results
- Stats (top bar link):
  - View score distributions
  - See model tier rankings
  - Compare performance across categories
Automated evaluation uses AI judges (Claude, GPT, Gemini) to score responses via a Python service.
- Go to Settings
- Add your AI provider API keys (Claude, GPT, Gemini)
- Set the Cost Alert Threshold to limit spending
- Enable Auto-evaluate new models if desired
- Set the Python Service URL (default: `http://localhost:8001`)
- Start the Python judge service (see the Installation section)
Automated evaluation is triggered via API endpoints:
```
# Evaluate all models × all prompts
POST /evaluate/all

# Evaluate one model × all prompts
POST /evaluate/model?id={model_id}

# Evaluate all models × one prompt
POST /evaluate/prompt?id={prompt_id}
```
Use a tool like curl or integrate these endpoints into your workflow. Results automatically populate the Results grid as evaluation progresses.
Auto-evaluate setting: In the Settings page, you can enable "Auto-evaluate new models" to automatically trigger evaluation when a new model is added (requires Python service running).
Suite Management:
- The Suite selector is in the top-right of the top navigation bar
- Use the dropdown to switch between suites
- Click New to create a new suite
- Click Edit to modify the current suite name
- Click Delete to remove the current suite
Import/Export:
- Results page: Contains import/export buttons for evaluation results
- Prompts page: Contains import/export buttons for prompt data
- Export formats use JSON for backup and portability
| Action | Shortcut |
|---|---|
| Navigate between cells (Results grid) | Arrow keys (when cell is focused) |
| Submit score (Evaluate page) | Enter (when score selected) |
Note: Other navigation elements use UI buttons (⬅️➡️ for prompts, ↑↓ for scroll to top/bottom).
- Batch Operations: Use checkboxes to select multiple prompts for bulk actions
- Drag to Reorder: Reorder prompts by dragging them in the list
- Real-time Updates: Open multiple browser tabs - they sync automatically
- State Backup: Save your evaluation state before long sessions
- Suite Isolation: Use separate suites for different evaluation projects
This project is part of a larger development environment. For a complete setup including:
- Shell configuration (zsh/fish/bash with aliases, functions)
- Go development tools (gopls, golangci-lint, delve debugger)
- Python environment (pyenv, poetry, pipx)
- Node.js tools (nvm, npm global packages)
- AI/LLM CLI tools (claude-cli, openai-cli, aider)
- Git workflows (hooks, templates, aliases)
- Editor configs (Neovim/Vim/VSCode settings)
See: lavantien/dotfiles
- Go 1.24+
- Python 3.8+
- Node.js 20+
- CGO-enabled toolchain (gcc/clang/MinGW)
```bash
# Clone the repository
git clone https://github.com/lavantien/llm-tournament.git
cd llm-tournament

# Install UI dependencies
npm install

# Build CSS (watch mode for development)
npm run build:css:watch

# Run the Go server
CGO_ENABLED=1 go run .
```
The server will start on http://localhost:8080.
```bash
# Full test suite with TDD guard
make test

# Run a specific test package
CGO_ENABLED=1 go test ./handlers -v -race -cover

# Generate coverage report
CGO_ENABLED=1 go test ./... -coverprofile=coverage.out
go tool cover -html=coverage.out
```
Build CSS:
```bash
# One-time build
npm run build:css

# Watch mode (rebuilds on changes)
npm run build:css:watch
```
UI screenshots:
```bash
npm install
npm run screenshots:install
npm run screenshots
```
Screenshots are saved to `assets/ui-*.png`.
- UI design and migration: DESIGN_CONCEPT.md, DESIGN_ROLLOUT.md
- Automated evaluation setup: AUTOMATED_EVALUATION_SETUP.md
- Changelog / release notes: CHANGELOG.md, RELEASE_NOTES_v3.4.md
```bash
# Run all tests with TDD-guard, race detection, and coverage
# (requires `tdd-guard-go` on your PATH)
make test

# Run tests with verbose output (bypasses TDD-guard)
make test-verbose

# Manual test run
CGO_ENABLED=1 go test ./... -v -race -cover

# Test Python service health
curl http://localhost:8001/health
```
Important: Tailwind/DaisyUI classes are just strings in your templates; no runtime JS is needed for unit tests. You can verify that the classes are present in the rendered HTML without applying any actual CSS.

Go SSR testing: full SSR flow verification using the `cmp` package (in the Go standard library since 1.21). Test handlers execute templates with data and verify DaisyUI classes are present in the output.
httptest Integration: HTTP handler testing with rendered HTML output. Verify template rendering with real data structures.
Visual Regression: Use existing screenshot system to compare before/after UI states.
Package-level statement coverage from `CGO_ENABLED=1 go test ./... -coverprofile coverage.out`:
| Package | Coverage |
|---|---|
| llm-tournament | 100.0% |
| llm-tournament/evaluator | 100.0% |
| llm-tournament/handlers | 99.1% |
| llm-tournament/integration | - |
| llm-tournament/middleware | 100.0% |
| llm-tournament/templates | 100.0% |
| llm-tournament/testutil | 99.6% |
| llm-tournament/tools/screenshots/cmd/demo-server | 100.0% |
| Total | 99.5% |
- `CGO_ENABLED=1` set but build fails: install a working C compiler toolchain (CGO is required for SQLite).
- `ENCRYPTION_KEY` environment variable not set: set `ENCRYPTION_KEY` before using encrypted API keys or automated evaluation.
- Automated evaluation stuck/unavailable: confirm the Python service is running and healthy (`GET /health` on `:8001`).
- Port already in use: stop the conflicting process or run on different ports (Python: `PORT`; the Go server currently listens on `:8080` in `main.go`).
- DB issues: the default DB is `data/tournament.db`; point to another file with `--db <path>`.
- DaisyUI classes not rendering: verify `tailwind.config.js` includes the DaisyUI plugin and that `npm run build:css` has been run.
- POST /evaluate/all - Evaluate all models × all prompts
- POST /evaluate/model?id={id} - Evaluate one model × all prompts
- POST /evaluate/prompt?id={id} - Evaluate all models × one prompt
- GET /evaluation/progress?id={job_id} - Get job status
- POST /evaluation/cancel?id={job_id} - Cancel running job
- GET /settings - Settings page
- POST /settings/update - Update settings
- POST /settings/test_key - Test API key validity
- GET /prompts - Prompts list (default route)
- GET /results - Results and scoring
- GET /profiles - Profile management
- WS /ws - WebSocket connection
```
llm-tournament/
├── main.go             # Entry point, routing, server setup
├── handlers/           # HTTP handlers (models, prompts, results, stats, evaluation, settings)
├── middleware/         # Business logic (database, WebSocket, encryption, state)
├── evaluator/          # Async job queue, LLM client, consensus algorithm
├── python_service/     # FastAPI AI judge service (3 LLM judges)
├── templates/          # HTML, CSS, JavaScript
├── assets/             # UI screenshots and static images
├── data/               # SQLite database
├── tailwind.config.js  # Tailwind v4 + DaisyUI v5 configuration
└── postcss.config.js   # PostCSS configuration
```
UI-specific:
- `templates/input.css` - Tailwind + DaisyUI imports only (zero custom CSS)
- `templates/output.css` - generated CSS file (PostCSS output)
- `templates/*.html` - all HTML templates using DaisyUI components
- `CGO_ENABLED=1` (required for SQLite)
- `ENCRYPTION_KEY` (64-char hex / 32 bytes; required for encrypted API key storage and automated evaluation)
Python judge service (optional):
- `HOST` (default `0.0.0.0`)
- `PORT` (default `8001`)

Generate an encryption key:
```bash
openssl rand -hex 32
# or
python -c "import secrets; print(secrets.token_hex(32))"
```
See AUTOMATED_EVALUATION_SETUP.md for detailed configuration.
When editing documentation files, be aware that several files are automatically validated by tests and CI scripts. See DOCUMENTATION_ENFORCEMENT.md for:
- List of enforced documentation files (README.md, DESIGN_CONCEPT.md, design_preview.html)
- Required sections and formats for each file
- How to update documentation without breaking automation
- Troubleshooting common mistakes
Quick reference:
- `README.md` - coverage table enforced by `scripts/update_coverage_table.py`
- `DESIGN_CONCEPT.md` - section headers enforced by `design_preview_test.go`
- `design_preview.html` - required elements enforced by `design_preview_test.go`
Pre-commit hook (recommended):
```bash
# Install automatic documentation verification before commits
cp scripts/pre-commit .git/hooks/pre-commit && chmod +x .git/hooks/pre-commit
```
This will automatically run `make verify-docs` when you commit documentation changes.
To verify documentation changes:
```bash
# Run a specific enforcement test
CGO_ENABLED=1 go test -run TestDesignConceptAndPreview_ExistAndStructured -v

# Update coverage table after editing README
make update-coverage-table

# Run full test suite
make test
```
MIT License - See LICENSE for details






