A local-first benchmarking arena for evaluating and comparing Large Language Models (LLMs) with both manual scoring and optional automated evaluation.
Highlights
- SQLite-backed, single-binary Go server with SSR templates + WebSockets (`:8080`)
- Prompt suites, profiles, models, results grid, and analytics
- Optional Python FastAPI "judge service" for automated evaluation (`:8001`)
- Encrypted API key storage (AES-256-GCM) via `ENCRYPTION_KEY`
UI Stack
- Tailwind CSS v4.1.18 + DaisyUI v5.0.0 (0% custom CSS)
- Built-in DaisyUI components and themes (coffee)
- Industry-standard utility-first styling approach
- No custom CSS codebase to maintain
- 1. Quick Start
- 2. UI Design
- 3. Features
- 4. Architecture
- 5. Tech Stack
- 6. Installation
- 7. Usage Tutorial
- 8. Development
- 9. Testing
- 10. Troubleshooting
- 11. API Reference
- 12. Project Structure
- 13. Environment Variables
- 14. Documentation Guidelines
- 15. License
- 16. Contact
```bash
git clone https://github.com/lavantien/llm-tournament.git
cd llm-tournament
make run
```
Open http://localhost:8080 (data is stored in `data/tournament.db` by default).
No make? Run directly:
```bash
CGO_ENABLED=1 go run .
```
PowerShell:
```powershell
$env:CGO_ENABLED=1; go run .
```
Updated: Tailwind v4 + DaisyUI v5 (Zero Custom CSS)
The UI has been migrated to use 100% pure Tailwind v4 + DaisyUI v5 components. See DESIGN_CONCEPT.md for complete design specifications and DESIGN_ROLLOUT.md for detailed migration plan.
Key Design Decisions:
- Zero custom CSS - all styling uses Tailwind utilities or DaisyUI semantic components
- Built-in DaisyUI `cyberpunk` theme provides dark backgrounds with neon accents
- Tailwind v4 built-in animations (`animate-spin`, `animate-ping`, `animate-pulse`) replace custom keyframes
- Dynamic score theming uses Tailwind arbitrary values (`bg-[#color]`) instead of CSS variables
- Glass panels use DaisyUI `.card` components without custom glow effects
- Industry-standard approach using well-maintained tools (Tailwind + DaisyUI)
Trade-offs:
- No glass glow overlay effects (cleaner appearance)
- No grid overlay texture (cleaner background)
- Simpler animations (less dramatic, more performant)
- Standard DaisyUI components instead of custom semantic classes
- Multi-judge consensus scoring using Claude Opus 4.5, GPT-5.2, and Gemini 3 Pro with extended thinking
- Dual evaluation modes: objective (semantic matching) and creative (quality assessment)
- Async job queue with 3 concurrent workers and job persistence
- Real-time progress tracking and cost management (provider pricing varies)
- AES-256-GCM encrypted API key storage
- Complete audit trail with judge reasoning and confidence scores
- Real-time scoring on 0-100 scale (increments: 0, 20, 40, 60, 80, 100)
- Automatic model ranking with live leaderboard updates
- WebSocket-based instant updates across all clients
- State backup and rollback support
- Drag-and-drop prompt reordering and bulk operations
- Independent prompt suites with isolated profiles, prompts, and results
- JSON import/export for suites and evaluation results
- Duplicate cleanup and SQLite migration support
- One-click suite switching
- 12-tier classification system: Transcendental (>=3780) to Primordial (<300)
- Interactive visualizations using Chart.js
- Score distributions and tier-based model grouping
- Performance comparisons across models and prompt types
- Markdown editor with live preview
- Advanced search and filtering
- Copy-to-clipboard functionality
- Connection status monitoring with automatic reconnection
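The 12-tier classification above maps a model's total score to a named tier. As a hedged sketch: only the two documented boundaries (Transcendental at >=3780, Primordial below 300) appear in this README, so the ten intermediate tiers are represented here by a placeholder rather than invented thresholds.

```go
package main

import "fmt"

// tierFor maps a model's total score to a tier name. Only the two boundaries
// documented in the README are encoded; the real implementation has ten more
// tiers with their own thresholds (see the Stats page).
func tierFor(total int) string {
	switch {
	case total >= 3780:
		return "Transcendental"
	case total < 300:
		return "Primordial"
	default:
		return "intermediate tier" // placeholder for the 10 undocumented tiers
	}
}

func main() {
	fmt.Println(tierFor(4000)) // Transcendental
	fmt.Println(tierFor(100))  // Primordial
}
```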
```
Go Server (:8080)            Python Service (:8001)
├── HTTP Handlers            ├── AI Judge Service
├── WebSocket Hub ──HTTP──→  ├── 3 LLM Judges
├── Job Queue                └── Consensus Scoring
└── SQLite DB
```
High-Level System Context
```mermaid
graph LR
    subgraph Client Side
        Browser["User Browser\n(HTML/JS/Templates)"]
    end
    subgraph Server Side
        subgraph Go Monolith
            HTTP_WS["Go HTTP & WebSocket Server"]
        end
        SQLite[("SQLite Database\n(Single Source of Truth)")]
        PythonService["Python FastAPI\n(Judge Service)"]
    end
    Browser -- "HTTP Requests / WebSocket" --> HTTP_WS
    HTTP_WS -- "Reads/Writes Data" --> SQLite
    HTTP_WS -- "HTTP Requests (Scoring)" --> PythonService
    PythonService -- "Returns Scores" --> HTTP_WS
```
Layered Architecture Flow
```mermaid
graph TD
    subgraph "Go Monolith Layers"
        Surface["1. Surface Layer\n(Templates: *.html, *.js)"]
        Handlers["2. HTTP Handlers\n(handlers/*.go)"]
        Middleware["3. Middleware Layer\n(DB, State, Auth, Encryption)\n(middleware/*.go)"]
        Evaluator["4. Evaluator Layer\n(Async Jobs, Python Client)\n(evaluator/*.go)"]
    end
    DB[("SQLite Database")]
    ExternalJudge["Python FastAPI Judge Service\n(python_service/)"]
    %% Main flow
    Surface --> Handlers
    Handlers --> Middleware
    Middleware --> Evaluator
    %% Data access
    Middleware <-->|"Read/Write Schema"| DB
    Evaluator -.->|"Updates Job Status/Results"| Middleware
    %% External call
    Evaluator -- "HTTP Calls for Scoring" --> ExternalJudge
```
Sequence: Manual Evaluation Flow
```mermaid
sequenceDiagram
    participant User
    participant Browser
    participant GoServer
    participant SQLite
    participant WebSocket
    User->>Browser: Clicks score button
    Browser->>GoServer: POST /results/update
    GoServer->>SQLite: UPDATE scores SET score = ?
    SQLite-->>GoServer: Success
    GoServer->>WebSocket: Broadcast score_update
    WebSocket-->>Browser: Real-time update
    Browser->>User: Live leaderboard refresh
```
Sequence: Automated Evaluation Flow
```mermaid
sequenceDiagram
    participant User
    participant GoServer
    participant JobQueue
    participant PythonService
    participant AIJudges
    participant SQLite
    User->>GoServer: POST /evaluate/all
    GoServer->>JobQueue: Create jobs for all model×prompt pairs
    JobQueue->>JobQueue: Dispatch to workers (3 concurrent)
    loop For each job
        JobQueue->>PythonService: POST /evaluate/objective
        PythonService->>AIJudges: Call Claude, GPT, Gemini
        AIJudges-->>PythonService: Individual scores
        PythonService->>PythonService: Consensus algorithm
        PythonService-->>JobQueue: Final score + reasoning
    end
    JobQueue->>SQLite: Store results
    JobQueue->>User: WebSocket progress updates
```
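The "consensus algorithm" step combines three judge verdicts into one score. The real logic lives in the Python service; as a hedged Go sketch, one robust choice is the median of the three scores (resistant to a single outlier judge). The `judgeScore` type and the use of a plain median rather than confidence weighting are assumptions, not the project's actual algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// judgeScore is one judge's verdict: a 0-100 score plus a self-reported
// confidence. The shape is illustrative, not the service's real schema.
type judgeScore struct {
	Judge      string
	Score      float64
	Confidence float64
}

// consensus returns the median of the judges' scores, which ignores a single
// outlier. The real service may weight by confidence instead.
func consensus(scores []judgeScore) float64 {
	vals := make([]float64, len(scores))
	for i, s := range scores {
		vals[i] = s.Score
	}
	sort.Float64s(vals)
	return vals[len(vals)/2] // median for an odd number of judges
}

func main() {
	scores := []judgeScore{
		{"claude", 80, 0.9},
		{"gpt", 100, 0.7},
		{"gemini", 80, 0.8},
	}
	fmt.Println(consensus(scores)) // median of 80, 100, 80
}
```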
Sequence: Prompt Management Flow
```mermaid
sequenceDiagram
    participant User
    participant Browser
    participant GoServer
    participant SQLite
    User->>Browser: Drag prompt to reorder
    Browser->>GoServer: WS message: reorder_prompts
    GoServer->>SQLite: UPDATE prompts SET order = ?
    SQLite-->>GoServer: Success
    GoServer->>Browser: WS broadcast: update_prompts_order
    Browser->>User: Reorder animation completes
```
Request Flow: User -> Handlers -> Middleware -> SQLite -> WebSocket Broadcast
Evaluation Flow: Job Queue -> Python Service -> AI Judges -> Consensus -> Score Update
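The evaluation flow's job queue dispatches model×prompt jobs to a fixed pool of 3 workers. A minimal worker-pool sketch in Go follows; the `evalJob` type and `scoreFn` callback (standing in for the HTTP call to the Python judge service) are illustrative assumptions, not the evaluator package's real types.

```go
package main

import (
	"fmt"
	"sync"
)

// evalJob pairs a model with a prompt. scoreFn stands in for the HTTP call
// to the Python judge service; the 3-worker pool mirrors the evaluator's
// documented concurrency.
type evalJob struct{ Model, Prompt string }

func runQueue(jobs []evalJob, scoreFn func(evalJob) int) map[evalJob]int {
	const workers = 3
	ch := make(chan evalJob)
	results := make(map[evalJob]int)
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range ch {
				s := scoreFn(j) // POST /evaluate/objective in the real pipeline
				mu.Lock()
				results[j] = s
				mu.Unlock()
			}
		}()
	}
	for _, j := range jobs {
		ch <- j
	}
	close(ch)
	wg.Wait()
	return results
}

func main() {
	jobs := []evalJob{{"gpt-4o", "p1"}, {"gpt-4o", "p2"}, {"claude", "p1"}}
	res := runQueue(jobs, func(evalJob) int { return 80 })
	fmt.Println(len(res)) // one result per job
}
```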
- This is a Go monolith (HTTP + WebSocket) with SQLite as the single source of truth, plus an optional Python FastAPI "judge service" for automated scoring.
- The repo is organized by layer: surface (templates) -> HTTP handlers -> middleware (DB/state/render/ws/encryption) -> evaluator (async jobs + Python client) -> `python_service/` (judge logic).
- The fastest "index" is the URL-to-handler map in `main.go:60`; the DB schema is centralized in `middleware/database.go:58`.
- UI migration: all styling now uses Tailwind v4 + DaisyUI v5 components with zero custom CSS. See DESIGN_ROLLOUT.md for complete migration details.
- HTTP routes / feature entrypoints: `main.go:60` (every user-visible feature starts as a path here).
- HTML/JS for a page: `templates/*.html` and `templates/*.js` (e.g. `templates/results.html`, `templates/prompt_list.html`).
- DB tables & relationships: `middleware/database.go:58` (schema includes `suites`, `profiles`, `prompts`, `models`, `scores`, `settings`, `evaluation_jobs`, `evaluation_history`, etc.).
- Per-feature server logic: `handlers/*.go` (files are feature-named: prompts/models/profiles/results/stats/settings/suites/evaluation).
- WebSocket messages: `middleware/socket.go:33` (server-side `/ws`, broadcasting and client tracking).
- Automated evaluation pipeline: `handlers/evaluation.go:25` -> `evaluator/job_queue.go:11` (workers/jobs) -> `evaluator/litellm_client.go:12` (HTTP to Python) -> `python_service/main.py:87` (FastAPI endpoints).
- UI components: Tailwind v4 + DaisyUI v5. See DESIGN_CONCEPT.md for the complete component mapping.
- Tests as documentation: `handlers/*_test.go`, `middleware/*_test.go`, `evaluator/*_test.go`, `integration/prompts_integration_test.go`.
- Prompt suites: `main.go:76` -> `handlers/suites.go` (+ UI in `templates/*prompt_suite*.html`)
- Prompts CRUD/order: `main.go:62` / `main.go:66` / `main.go:73` -> `handlers/prompt.go:1` (+ reorder over WS in `middleware/socket.go:71`)
- Models CRUD: `main.go:63` -> `handlers/models.go`
- Manual scoring/results UI: `main.go:80` / `main.go:81` -> `handlers/results.go` (+ `templates/results.html`)
- Stats/analytics: `main.go:93` -> `handlers/stats.go` (+ `templates/stats.html`)
- Settings + encrypted keys: `main.go:95` -> `handlers/settings.go` (+ crypto in `middleware/encryption.go:13`)
- Automated evaluation: `main.go:98` -> `handlers/evaluation.go:25` (jobs stored in `evaluation_jobs`, see `middleware/database.go:115`)
- Find a feature by URL: `rg -n '"/evaluate/all"|"/results"|"/settings"' main.go`
- Find which handler renders a template: `rg -n "results\.html|prompt_list\.html" handlers`
- Find everything touching a table: `rg -n "evaluation_jobs|evaluation_history|model_responses" -S .`
- Find a WebSocket message type: `rg -n "update_prompts_order|results" middleware/templates -S`
Backend: Go 1.24+, Gorilla WebSocket, Blackfriday, Bluemonday, SQLite, AES-256-GCM
AI Service: Python 3.8+, FastAPI, LiteLLM, Anthropic/OpenAI/Google SDKs
Frontend: HTML5, Tailwind CSS v4.1.18, DaisyUI v5.0.0, JavaScript ES6+, Chart.js 4.x, Marked.js
Security: XSS sanitization, CORS protection, input validation, encrypted API keys
- Go 1.24+
- Python 3.8+ (for automated evaluation)
- A C toolchain for CGO/SQLite (e.g., gcc/clang; on Windows install MinGW-w64/MSYS2)
- Git
- Make (optional, for convenience targets)
- Node.js and npm (for UI screenshots only)
```bash
# Run from source
make run

# Or without make
CGO_ENABLED=1 go run .
```
Build a binary:
```bash
make build
```
Run it:
- Linux/macOS: `./release/llm-tournament`
- Windows (PowerShell): `.\release\llm-tournament.exe`
One-time migration (only if upgrading old result formats):
```bash
CGO_ENABLED=1 go run . --migrate-results
```
Install Python dependencies:
```bash
cd python_service
pip install -r requirements.txt
```
Generate and export `ENCRYPTION_KEY` (64 hex chars / 32 bytes):
```bash
export ENCRYPTION_KEY=$(openssl rand -hex 32)
```
PowerShell:
```powershell
$env:ENCRYPTION_KEY = (python -c "import secrets; print(secrets.token_hex(32))")
```
Start the Python service (terminal 1):
```bash
python main.py  # Port 8001
```
Start the Go server (terminal 2):
```bash
cd ..
CGO_ENABLED=1 go run .  # Port 8080
```
Configure API keys at http://localhost:8080/settings
Complete setup guide: AUTOMATED_EVALUATION_SETUP.md
The UI now uses Tailwind CSS v4 + DaisyUI v5 with zero custom CSS. See DESIGN_CONCEPT.md and DESIGN_ROLLOUT.md for complete migration details.
Install dependencies:
```bash
npm install
```
This installs:
- Tailwind CSS v4.1.18
- DaisyUI v5.0.0
- PostCSS and build tools

Build CSS:
```bash
npm run build:css
```
This generates `templates/output.css` from `templates/input.css` using PostCSS.
This tutorial will guide you through the essential workflows of LLM Tournament Arena.
After starting the server, open http://localhost:8080. You'll see:
- Top navigation bar - Contains links to Results, Stats, Prompts, Profiles, Evaluate, and Settings
- Suite selector - On the right side of the top bar, with New/Edit/Delete buttons
- Prompts page - Your starting point for managing test prompts (a default suite is created automatically)
Models are added directly from the Results page:
- Click Results in the top navigation bar
- At the top of the page, you'll see a form with "Enter new model name"
- Type the model name (e.g., "claude-3-5-sonnet-20241022", "gpt-4o", etc.)
- Click Add
The model will appear in the results grid. Repeat for each model you want to evaluate.
A Profile is a group of models that you want to evaluate together.
- Click Profiles in the top bar
- Click Add Profile button
- Enter a name for your profile
- Select the models you want to include
- Click Save
- Navigate to Prompts in the top bar
- Click Add Prompt button
- Enter prompt details:
- Title: Short descriptive name
- Category: e.g., "coding", "creative-writing", "reasoning"
- Content: Your test prompt (Markdown supported)
- Expected Answer: Reference answer for manual comparison
- Click Save
The Evaluate page lets you score models one prompt at a time.
How to access: You typically navigate here by clicking a score cell in the Results page (see section 7.6), which automatically takes you to the evaluate page for that model and prompt.
Once on the Evaluate page, you'll see:
- The current model name at the top
- The prompt number (e.g., "Prompt 3 of 10")
- The prompt text and expected solution
- The model's response (which you can save)
- Score buttons (0, 20, 40, 60, 80, 100)
To score:
- Click a score button to select it
- Click ✅ to submit and move to the next prompt
- Use ⬅️➡️ buttons to navigate between prompts without scoring
- Click ❌ to return to the Results page
The Results page shows your scoring grid and lets you edit individual scores.
- Results grid overview:
  - Rows show each model
  - Columns show each prompt
  - Cells show scores with color coding (green = high, red = low)
  - Total scores and progress bars on the right
- Edit a score:
  - Click any score cell to go to the Evaluate page for that model×prompt combination
  - Update your score and click ✅
  - Click ❌ to return to Results
- Stats (top bar link):
  - View score distributions
  - See model tier rankings
  - Compare performance across categories
Automated evaluation uses AI judges (Claude, GPT, Gemini) to score responses via a Python service.
- Go to Settings
- Add your AI provider API keys (Claude, GPT, Gemini)
- Set the Cost Alert Threshold to limit spending
- Enable Auto-evaluate new models if desired
- Set the Python Service URL (default: `http://localhost:8001`)
- Start the Python judge service (see the Installation section)
Automated evaluation is triggered via API endpoints:
```
# Evaluate all models × all prompts
POST /evaluate/all

# Evaluate one model × all prompts
POST /evaluate/model?id={model_id}

# Evaluate all models × one prompt
POST /evaluate/prompt?id={prompt_id}
```
Use a tool like curl or integrate these endpoints into your workflow. Results automatically populate the Results grid as evaluation progresses.
Auto-evaluate setting: In the Settings page, you can enable "Auto-evaluate new models" to automatically trigger evaluation when a new model is added (requires Python service running).
Suite Management:
- The Suite selector is in the top-right of the top navigation bar
- Use the dropdown to switch between suites
- Click New to create a new suite
- Click Edit to modify the current suite name
- Click Delete to remove the current suite
Import/Export:
- Results page: Contains import/export buttons for evaluation results
- Prompts page: Contains import/export buttons for prompt data
- Export formats use JSON for backup and portability
| Action | Shortcut |
|---|---|
| Navigate between cells (Results grid) | Arrow keys (when cell is focused) |
| Submit score (Evaluate page) | Enter (when score selected) |
Note: Other navigation elements use UI buttons (⬅️➡️ for prompts, ↑↓ for scroll to top/bottom).
- Batch Operations: Use checkboxes to select multiple prompts for bulk actions
- Drag to Reorder: Reorder prompts by dragging them in the list
- Real-time Updates: Open multiple browser tabs - they sync automatically
- State Backup: Save your evaluation state before long sessions
- Suite Isolation: Use separate suites for different evaluation projects
This project is part of a larger development environment. For a complete setup including:
- Shell configuration (zsh/fish/bash with aliases, functions)
- Go development tools (gopls, golangci-lint, delve debugger)
- Python environment (pyenv, poetry, pipx)
- Node.js tools (nvm, npm global packages)
- AI/LLM CLI tools (claude-cli, openai-cli, aider)
- Git workflows (hooks, templates, aliases)
- Editor configs (Neovim/Vim/VSCode settings)
See: lavantien/dotfiles
- Go 1.24+
- Python 3.8+
- Node.js 20+
- CGO-enabled toolchain (gcc/clang/MinGW)
```bash
# Clone the repository
git clone https://github.com/lavantien/llm-tournament.git
cd llm-tournament

# Install UI dependencies
npm install

# Build CSS (watch mode for development)
npm run build:css:watch

# Run the Go server
CGO_ENABLED=1 go run .
```
The server will start on http://localhost:8080.
```bash
# Full test suite with TDD guard
make test

# Run a specific test package
CGO_ENABLED=1 go test ./handlers -v -race -cover

# Generate coverage report
CGO_ENABLED=1 go test ./... -coverprofile=coverage.out
go tool cover -html=coverage.out
```
Build CSS:
```bash
# One-time build
npm run build:css

# Watch mode (rebuilds on changes)
npm run build:css:watch
```
UI screenshots:
```bash
npm install
npm run screenshots:install
npm run screenshots
```
Screenshots are saved to `assets/ui-*.png`.
- UI design and migration: DESIGN_CONCEPT.md, DESIGN_ROLLOUT.md
- Automated evaluation setup: AUTOMATED_EVALUATION_SETUP.md
- Changelog / release notes: CHANGELOG.md, RELEASE_NOTES_v3.4.md
```bash
# Run all tests with TDD-guard, race detection, and coverage
# (requires `tdd-guard-go` on your PATH)
make test

# Run tests with verbose output (bypasses TDD-guard)
make test-verbose

# Manual test run
CGO_ENABLED=1 go test ./... -v -race -cover

# Test Python service health
curl http://localhost:8001/health
```
Important: Tailwind/DaisyUI classes are just strings in your templates; no runtime JS is needed for unit tests. You can verify that the classes are present in the rendered HTML without applying any actual CSS.

Go SSR testing: full SSR flow verification using the `cmp` package (in the Go standard library since 1.21). Test handlers execute templates with data and verify DaisyUI classes are present in the output.
httptest Integration: HTTP handler testing with rendered HTML output. Verify template rendering with real data structures.
Visual Regression: Use existing screenshot system to compare before/after UI states.
Package-level statement coverage from `CGO_ENABLED=1 go test ./... -coverprofile coverage.out`:
| Package | Coverage |
|---|---|
| llm-tournament | 100.0% |
| llm-tournament/evaluator | 100.0% |
| llm-tournament/handlers | 99.1% |
| llm-tournament/integration | - |
| llm-tournament/middleware | 100.0% |
| llm-tournament/templates | 100.0% |
| llm-tournament/testutil | 99.6% |
| llm-tournament/tools/screenshots/cmd/demo-server | 100.0% |
| Total | 99.5% |
- `CGO_ENABLED=1` set but build fails: install a working C compiler toolchain (CGO is required for SQLite).
- `ENCRYPTION_KEY` environment variable not set: set `ENCRYPTION_KEY` before using encrypted API keys or automated evaluation.
- Automated evaluation stuck/unavailable: confirm the Python service is running and healthy (`GET /health` on `:8001`).
- Port already in use: stop the conflicting process or run on different ports (Python: `PORT`; the Go server currently listens on `:8080` in `main.go`).
- DB issues: the default DB is `data/tournament.db`; point to another file with `--db <path>`.
- DaisyUI classes not rendering: verify `tailwind.config.js` includes the DaisyUI plugin and that `npm run build:css` has been run.
- POST /evaluate/all - Evaluate all models × all prompts
- POST /evaluate/model?id={id} - Evaluate one model × all prompts
- POST /evaluate/prompt?id={id} - Evaluate all models × one prompt
- GET /evaluation/progress?id={job_id} - Get job status
- POST /evaluation/cancel?id={job_id} - Cancel running job
- GET /settings - Settings page
- POST /settings/update - Update settings
- POST /settings/test_key - Test API key validity
- GET /prompts - Prompts list (default route)
- GET /results - Results and scoring
- GET /profiles - Profile management
- WS /ws - WebSocket connection
```
llm-tournament/
├── main.go             # Entry point, routing, server setup
├── handlers/           # HTTP handlers (models, prompts, results, stats, evaluation, settings)
├── middleware/         # Business logic (database, WebSocket, encryption, state)
├── evaluator/          # Async job queue, LLM client, consensus algorithm
├── python_service/     # FastAPI AI judge service (3 LLM judges)
├── templates/          # HTML, CSS, JavaScript
├── assets/             # UI screenshots and static images
├── data/               # SQLite database
├── tailwind.config.js  # Tailwind v4 + DaisyUI v5 configuration
└── postcss.config.js   # PostCSS configuration
```
UI-specific:
- `templates/input.css` - Tailwind + DaisyUI imports only (zero custom CSS)
- `templates/output.css` - generated CSS file (PostCSS output)
- `templates/*.html` - all HTML templates using DaisyUI components
- `CGO_ENABLED=1` (required for SQLite)
- `ENCRYPTION_KEY` (64-char hex / 32 bytes; required for encrypted API key storage and automated evaluation)
Python judge service (optional):
- `HOST` (default `0.0.0.0`)
- `PORT` (default `8001`)

Generate an encryption key:
```bash
openssl rand -hex 32
# or
python -c "import secrets; print(secrets.token_hex(32))"
```
See AUTOMATED_EVALUATION_SETUP.md for detailed configuration.
When editing documentation files, be aware that several files are automatically validated by tests and CI scripts. See DOCUMENTATION_ENFORCEMENT.md for:
- List of enforced documentation files (README.md, DESIGN_CONCEPT.md, design_preview.html)
- Required sections and formats for each file
- How to update documentation without breaking automation
- Troubleshooting common mistakes
Quick reference:
- `README.md` - coverage table enforced by `scripts/update_coverage_table.py`
- `DESIGN_CONCEPT.md` - section headers enforced by `design_preview_test.go`
- `design_preview.html` - required elements enforced by `design_preview_test.go`
Pre-commit hook (recommended):
```bash
# Install automatic documentation verification before commits
cp scripts/pre-commit .git/hooks/pre-commit && chmod +x .git/hooks/pre-commit
```
This will automatically run `make verify-docs` when you commit documentation changes.
To verify documentation changes:
```bash
# Run a specific enforcement test
CGO_ENABLED=1 go test -run TestDesignConceptAndPreview_ExistAndStructured -v

# Update coverage table after editing README
make update-coverage-table

# Run full test suite
make test
```
MIT License - See LICENSE for details






