SuperfiedStudd/ai-evals-orchestration


AI Evals Orchestration Platform

Run identical prompts across multiple LLMs, compare quality, cost, and latency, then make an explicit human decision before shipping.

What It Does

  1. Transcribe — upload audio (OpenAI Whisper) or paste text directly.
  2. Run — send the same prompt to up to 3 models (OpenAI, Anthropic, Gemini) in parallel.
  3. Evaluate — score each output with deterministic heuristics (edit quality, structural clarity, publish readiness).
  4. Decide — review results side-by-side and submit a Ship / Iterate / Rollback decision. Nothing auto-ships.
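
The parallel "Run" step can be sketched with asyncio. This is an illustrative stand-in, not the actual orchestrator: `call_provider` simulates a model call with a short sleep, while the real provider services live in src/providers/.

```python
import asyncio

async def call_provider(provider: str, prompt: str) -> dict:
    # Hypothetical stand-in for a real model call; the sleep simulates latency.
    await asyncio.sleep(0.01)
    return {"provider": provider, "output": f"[{provider}] response"}

async def run_experiment(prompt: str, providers: list[str]) -> list:
    # Fan the same prompt out to every provider concurrently.
    tasks = [call_provider(p, prompt) for p in providers]
    # return_exceptions=True keeps one provider failure from sinking the rest,
    # matching the failure-tolerant orchestration described below.
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(
    run_experiment("Edit this transcript.", ["openai", "anthropic", "gemini"])
)
```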


Local Setup

Prerequisites

  • Python 3.10+
  • Node.js 16+
  • A Supabase project (URL + Service Role Key)

1. Clone and Configure Environment

cp .env.example .env
# Fill in OPENAI_API_KEY, SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY
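
A minimal sketch of how the backend might validate these variables at startup. The variable names come from .env.example above, but `load_config` itself is a hypothetical helper, not part of the codebase.

```python
import os

# Variables listed in .env.example; all three are required.
REQUIRED = ["OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY"]

def load_config() -> dict:
    # Fail fast with a clear message rather than erroring deep in a request.
    missing = [name for name in REQUIRED if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name: os.environ[name] for name in REQUIRED}
```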

2. Database

Run schema.sql in the Supabase SQL Editor to create the experiments, model_runs, and eval_metrics tables.

3. Backend

pip install -r requirements.txt
python -m uvicorn src.api:app --reload

The API runs at http://localhost:8000.

4. Frontend

cd ui
npm install
npm run dev

The UI runs at http://localhost:5173.


Environment Variables

| Variable                  | Required | Purpose                           |
| ------------------------- | -------- | --------------------------------- |
| OPENAI_API_KEY            | Yes      | Server-side Whisper transcription |
| SUPABASE_URL              | Yes      | Supabase project URL              |
| SUPABASE_SERVICE_ROLE_KEY | Yes      | Supabase backend access           |

Note

Generation API keys: API keys for model generation (OpenAI, Anthropic, Gemini) are passed securely per-session via the UI. They are never stored in the database or in backend environment variables.


Running Tests

pytest tests/ -v

Tests use mocked AI and database clients — no API keys or Supabase connection required.
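
The mocked-dependency pattern the tests rely on can be sketched like this. Note that `evaluate_experiment` and the `generate` client interface are illustrative, not the actual orchestrator API from src/orchestrator.py.

```python
from unittest.mock import Mock

def evaluate_experiment(provider_client, prompt: str) -> dict:
    # Orchestrator-style function that depends on an injected client,
    # which is what makes it testable without real credentials.
    response = provider_client.generate(prompt)
    return {"output": response, "length": len(response)}

def test_evaluate_experiment_with_mock():
    # No API key or Supabase connection: the client is replaced with a Mock.
    fake_client = Mock()
    fake_client.generate.return_value = "edited transcript"
    result = evaluate_experiment(fake_client, "Edit this.")
    assert result["output"] == "edited transcript"
    fake_client.generate.assert_called_once_with("Edit this.")

test_evaluate_experiment_with_mock()
```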


Project Structure

src/
  api.py            # FastAPI routes
  orchestrator.py   # Experiment lifecycle engine
  services.py       # AI provider service + Supabase client
  models.py         # Pydantic models and enums
  main.py           # CLI demo runner
  providers/
    base.py         # Unified provider interface
    registry.py     # Provider routing + model defaults
    openai_provider.py
    anthropic_provider.py
    gemini_provider.py
tests/
  test_orchestrator.py  # Unit tests (mocked dependencies)
schema.sql              # Supabase table definitions
.env.example            # Environment variable template

Architecture

  • Unified Provider Layer — model execution is isolated per provider (OpenAI, Anthropic, Gemini) behind a standard interface and resolved through a central registry, which keeps routing logic flat and makes new providers easy to add
  • Parallel model invocation — same prompt, same transcript, different providers
  • Failure-tolerant orchestration — individual model failures are logged and persisted without crashing the experiment
  • Cost + latency tracked per call
  • Deterministic heuristic scoring — rule-based (length, structure, formatting); not model-graded
  • Human-in-the-loop — mandatory Ship/Iterate/Rollback decision before completion
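
A minimal sketch of the unified provider layer, assuming a simple `generate` interface. The names here are illustrative; the real definitions live in src/providers/base.py and registry.py.

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    # Hypothetical unified interface every adapter implements.
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class OpenAIProvider(Provider):
    def generate(self, prompt: str) -> str:
        return f"openai:{prompt}"  # real adapter would call the OpenAI API

class AnthropicProvider(Provider):
    def generate(self, prompt: str) -> str:
        return f"anthropic:{prompt}"  # real adapter would call Anthropic

# Central registry: adding a provider means one new adapter plus one entry here.
REGISTRY = {"openai": OpenAIProvider, "anthropic": AnthropicProvider}

def resolve(name: str) -> Provider:
    # Registry lookup replaces per-provider if/else routing in the orchestrator.
    return REGISTRY[name]()
```

The design choice this illustrates: callers depend only on the `Provider` interface, so the orchestrator never needs to know which vendor SDK is behind a given name.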

Limitations

  • Heuristic evals: Scores (edit_quality, structural_clarity, publish_ready) are rule-based heuristics, not LLM-graded.
  • Local only: Orchestration runs via FastAPI BackgroundTasks, not a distributed queue.
  • Three providers: OpenAI, Anthropic, and Gemini. Additional providers require a new adapter in src/providers/.
  • Security: User API keys are passed from the UI per-session and not persisted, but production deployments should use a secret vault.
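
The deterministic, rule-based scoring noted above could look like this in miniature. The formulas are invented for illustration; only the metric names (edit_quality, structural_clarity, publish_ready) come from this README.

```python
def heuristic_scores(text: str) -> dict:
    # Rule-based metrics: pure functions of the text, so identical input
    # always yields identical scores (no LLM grading involved).
    lines = [line for line in text.splitlines() if line.strip()]
    words = text.split()
    edit_quality = min(1.0, len(words) / 200)        # length-based proxy
    structural_clarity = min(1.0, len(lines) / 10)   # line/paragraph structure
    publish_ready = round((edit_quality + structural_clarity) / 2, 2)
    return {
        "edit_quality": edit_quality,
        "structural_clarity": structural_clarity,
        "publish_ready": publish_ready,
    }
```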

About

End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.
