Run identical prompts across multiple LLMs, compare quality, cost, and latency, then make an explicit human decision before shipping.
- Transcribe — upload audio (OpenAI Whisper) or paste text directly.
- Run — send the same prompt to up to 3 models (OpenAI, Anthropic, Gemini) in parallel.
- Evaluate — score each output with deterministic heuristics (edit quality, structural clarity, publish readiness).
- Decide — review results side-by-side and submit a Ship / Iterate / Rollback decision. Nothing auto-ships.
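The loop above (run in parallel, score deterministically, wait for a human decision) can be sketched end to end. Everything here is illustrative: the function names, result shape, and scoring rule are stand-ins, not the project's actual API.

```python
import asyncio
import time

# Hypothetical stand-in for a real provider call (OpenAI / Anthropic / Gemini).
async def call_model(provider: str, prompt: str) -> dict:
    start = time.perf_counter()
    output = f"[{provider}] draft for: {prompt[:30]}"  # fake completion
    return {
        "provider": provider,
        "output": output,
        "latency_s": time.perf_counter() - start,  # latency tracked per call
    }

def score(output: str) -> float:
    # Deterministic heuristic: reward non-trivial length (illustrative rule only).
    return min(len(output) / 100, 1.0)

async def run_experiment(prompt: str, providers: list[str]) -> list[dict]:
    # Same prompt to every provider, in parallel; one failure doesn't crash the batch.
    results = await asyncio.gather(
        *(call_model(p, prompt) for p in providers), return_exceptions=True
    )
    runs = [r for r in results if not isinstance(r, Exception)]
    for run in runs:
        run["edit_quality"] = score(run["output"])
    return runs  # nothing ships until a human submits Ship / Iterate / Rollback

runs = asyncio.run(
    run_experiment("Summarize the meeting", ["openai", "anthropic", "gemini"])
)
```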
- Python 3.10+
- Node.js 16+
- A Supabase project (URL + Service Role Key)
```
cp .env.example .env
# Fill in OPENAI_API_KEY, SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY
```

Run `schema.sql` in the Supabase SQL Editor to create the `experiments`, `model_runs`, and `eval_metrics` tables.
```
pip install -r requirements.txt
python -m uvicorn src.api:app --reload
```

The API runs at http://localhost:8000.
```
cd ui
npm install
npm run dev
```

The UI runs at http://localhost:5173.
| Variable | Required | Purpose |
|---|---|---|
| `OPENAI_API_KEY` | Yes | Server-side Whisper transcription |
| `SUPABASE_URL` | Yes | Supabase project URL |
| `SUPABASE_SERVICE_ROLE_KEY` | Yes | Supabase backend access |
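A fail-fast startup check for these variables might look like the sketch below. The helper name and error message are illustrative, not part of the repository.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY")

def check_env() -> dict:
    """Raise at startup if any required variable is unset or empty."""
    missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {v: os.environ[v] for v in REQUIRED_VARS}
```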
> **Note: generation API keys.** API keys for model generation (OpenAI, Anthropic, Gemini) are passed securely per-session via the UI. They are never stored in the database or in backend environment variables.
```
pytest tests/ -v
```

Tests use mocked AI and database clients; no API keys or Supabase connection are required.
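The mocked-client pattern can be sketched with `unittest.mock.AsyncMock`. The `FakeExperiment` class and `generate` method below are hypothetical illustrations of the approach, not the actual test fixtures.

```python
import asyncio
from unittest.mock import AsyncMock

class FakeExperiment:
    """Hypothetical orchestrator that delegates generation to a provider."""

    def __init__(self, provider):
        self.provider = provider

    async def run(self, prompt: str) -> str:
        return await self.provider.generate(prompt)

# AsyncMock replaces the real provider, so no network call or API key is needed.
mock_provider = AsyncMock()
mock_provider.generate.return_value = "stubbed output"
result = asyncio.run(FakeExperiment(mock_provider).run("hello"))
```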
```
src/
  api.py                  # FastAPI routes
  orchestrator.py         # Experiment lifecycle engine
  services.py             # AI provider service + Supabase client
  models.py               # Pydantic models and enums
  main.py                 # CLI demo runner
  providers/
    base.py               # Unified provider interface
    registry.py           # Provider routing + model defaults
    openai_provider.py
    anthropic_provider.py
    gemini_provider.py
tests/
  test_orchestrator.py    # Unit tests (mocked dependencies)
schema.sql                # Supabase table definitions
.env.example              # Environment variable template
```
- Unified provider layer — model execution is isolated by provider (OpenAI, Anthropic, Gemini) behind a standard interface and resolved via a central registry, preventing spaghetti routing logic and making new providers trivial to add.
- Parallel model invocation — same prompt, same transcript, different providers
- Failure-tolerant orchestration — individual model failures are logged and persisted without crashing the experiment
- Cost + latency tracked per call
- Deterministic heuristic scoring — rule-based (length, structure, formatting); not model-graded
- Human-in-the-loop — mandatory Ship/Iterate/Rollback decision before completion
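The interface-plus-registry pattern described above can be sketched as follows; the class and function names are illustrative rather than the names used in `src/providers/`.

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """Unified interface every provider adapter implements (illustrative)."""

    @abstractmethod
    def generate(self, prompt: str, model: str) -> str: ...

_REGISTRY: dict[str, type[Provider]] = {}

def register(name: str):
    """Class decorator: adding a provider is one adapter plus one registration."""
    def wrap(cls: type[Provider]) -> type[Provider]:
        _REGISTRY[name] = cls
        return cls
    return wrap

def resolve(name: str) -> Provider:
    """Central routing: look up the adapter class and instantiate it."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"Unknown provider: {name}") from None

@register("echo")
class EchoProvider(Provider):
    def generate(self, prompt: str, model: str) -> str:
        return f"{model}: {prompt}"
```

Callers only ever go through `resolve()`, so routing stays in one place no matter how many adapters are registered.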
- Heuristic evals: scores (`edit_quality`, `structural_clarity`, `publish_ready`) are rule-based heuristics, not LLM-graded.
- Local only: orchestration runs via FastAPI `BackgroundTasks`, not a distributed queue.
- Three providers: OpenAI, Anthropic, and Gemini. Additional providers require a new adapter in `src/providers/`.
- Security: user API keys are passed from the UI per-session and never persisted, but production deployments should use a secret vault.
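A deterministic, rule-based scorer of the kind described might look like the sketch below. The specific thresholds and rules are invented for illustration; only the score names come from the project.

```python
def heuristic_scores(text: str) -> dict[str, float]:
    """Rule-based scoring: the same input always yields the same scores."""
    lines = [l for l in text.splitlines() if l.strip()]
    words = text.split()
    # edit_quality: penalize very short outputs (illustrative length rule)
    edit_quality = min(len(words) / 50, 1.0)
    # structural_clarity: reward headings and bullet points
    structured = sum(1 for l in lines if l.lstrip().startswith(("#", "-", "*")))
    structural_clarity = min(structured / max(len(lines), 1) * 2, 1.0)
    # publish_ready: simple composite of the other two
    publish_ready = (edit_quality + structural_clarity) / 2
    return {
        "edit_quality": round(edit_quality, 2),
        "structural_clarity": round(structural_clarity, 2),
        "publish_ready": round(publish_ready, 2),
    }
```

Because no model is in the loop, scores are reproducible across runs, which keeps the Ship/Iterate/Rollback comparison stable.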


