SuperfiedStudd/ai-evals-orchestration


AI Evals Orchestration Platform

Run identical prompts across multiple LLMs, compare quality, cost, and latency, then make an explicit human decision before shipping.

What It Does

  1. Transcribe — upload audio (OpenAI Whisper) or paste text directly.
  2. Run — send the same prompt to up to 3 models (OpenAI, Anthropic, Gemini) in parallel.
  3. Evaluate — score each output with deterministic heuristics (edit quality, structural clarity, publish readiness).
  4. Decide — review results side-by-side and submit a Ship / Iterate / Rollback decision. Nothing auto-ships.
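
The parallel "Run" step can be sketched with asyncio. This is an illustrative stand-in, not the actual orchestrator: `call_provider` simulates a model call with a short sleep, while the real provider services live in src/providers/.

```python
import asyncio

async def call_provider(provider: str, prompt: str) -> dict:
    # Hypothetical stand-in for a real model call; the sleep simulates latency.
    await asyncio.sleep(0.01)
    return {"provider": provider, "output": f"[{provider}] response"}

async def run_experiment(prompt: str, providers: list[str]) -> list:
    # Fan the same prompt out to every provider concurrently.
    tasks = [call_provider(p, prompt) for p in providers]
    # return_exceptions=True keeps one provider failure from sinking the rest,
    # matching the failure-tolerant orchestration described below.
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(
    run_experiment("Edit this transcript.", ["openai", "anthropic", "gemini"])
)
```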


Local Setup

Prerequisites

  • Python 3.10+
  • Node.js 16+
  • A Supabase project (URL + Service Role Key)

1. Clone and Configure Environment

cp .env.example .env
# Fill in OPENAI_API_KEY, SUPABASE_URL, SUPABASE_SERVICE_ROLE_KEY
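
A minimal sketch of how the backend might validate these variables at startup. The variable names come from .env.example above, but `load_config` itself is a hypothetical helper, not part of the codebase.

```python
import os

# Variables listed in .env.example; all three are required.
REQUIRED = ["OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_SERVICE_ROLE_KEY"]

def load_config() -> dict:
    # Fail fast with a clear message rather than erroring deep in a request.
    missing = [name for name in REQUIRED if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {name: os.environ[name] for name in REQUIRED}
```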

2. Database

Run schema.sql in the Supabase SQL Editor to create the experiments, model_runs, and eval_metrics tables.

3. Backend

pip install -r requirements.txt
python -m uvicorn src.api:app --reload

The API runs at http://localhost:8000.

4. Frontend

cd ui
npm install
npm run dev

The UI runs at http://localhost:5173.


Environment Variables

| Variable                  | Required | Purpose                           |
| ------------------------- | -------- | --------------------------------- |
| OPENAI_API_KEY            | Yes      | Server-side Whisper transcription |
| SUPABASE_URL              | Yes      | Supabase project URL              |
| SUPABASE_SERVICE_ROLE_KEY | Yes      | Supabase backend access           |

Note

Generation API keys: API keys for model generation (OpenAI, Anthropic, Gemini) are passed securely per-session via the UI. They are never stored in the database or in backend environment variables.


Running Tests

pytest tests/ -v

Tests use mocked AI and database clients — no API keys or Supabase connection required.
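
The mocked-dependency pattern the tests rely on can be sketched like this. Note that `evaluate_experiment` and the `generate` client interface are illustrative, not the actual orchestrator API from src/orchestrator.py.

```python
from unittest.mock import Mock

def evaluate_experiment(provider_client, prompt: str) -> dict:
    # Orchestrator-style function that depends on an injected client,
    # which is what makes it testable without real credentials.
    response = provider_client.generate(prompt)
    return {"output": response, "length": len(response)}

def test_evaluate_experiment_with_mock():
    # No API key or Supabase connection: the client is replaced with a Mock.
    fake_client = Mock()
    fake_client.generate.return_value = "edited transcript"
    result = evaluate_experiment(fake_client, "Edit this.")
    assert result["output"] == "edited transcript"
    fake_client.generate.assert_called_once_with("Edit this.")

test_evaluate_experiment_with_mock()
```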


Project Structure

src/
  api.py            # FastAPI routes
  orchestrator.py   # Experiment lifecycle engine
  services.py       # AI provider service + Supabase client
  models.py         # Pydantic models and enums
  main.py           # CLI demo runner
  providers/
    base.py         # Unified provider interface
    registry.py     # Provider routing + model defaults
    openai_provider.py
    anthropic_provider.py
    gemini_provider.py
tests/
  test_orchestrator.py  # Unit tests (mocked dependencies)
schema.sql              # Supabase table definitions
.env.example            # Environment variable template

Architecture

  • Unified Provider Layer — model execution is isolated per provider (OpenAI, Anthropic, Gemini) behind a standard interface and resolved through a central registry, which keeps routing logic flat and makes new providers easy to add
  • Parallel model invocation — same prompt, same transcript, different providers
  • Failure-tolerant orchestration — individual model failures are logged and persisted without crashing the experiment
  • Cost + latency tracked per call
  • Deterministic heuristic scoring — rule-based (length, structure, formatting); not model-graded
  • Human-in-the-loop — mandatory Ship/Iterate/Rollback decision before completion
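
A minimal sketch of the unified provider layer, assuming a simple `generate` interface. The names here are illustrative; the real definitions live in src/providers/base.py and registry.py.

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    # Hypothetical unified interface every adapter implements.
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class OpenAIProvider(Provider):
    def generate(self, prompt: str) -> str:
        return f"openai:{prompt}"  # real adapter would call the OpenAI API

class AnthropicProvider(Provider):
    def generate(self, prompt: str) -> str:
        return f"anthropic:{prompt}"  # real adapter would call Anthropic

# Central registry: adding a provider means one new adapter plus one entry here.
REGISTRY = {"openai": OpenAIProvider, "anthropic": AnthropicProvider}

def resolve(name: str) -> Provider:
    # Registry lookup replaces per-provider if/else routing in the orchestrator.
    return REGISTRY[name]()
```

The design choice this illustrates: callers depend only on the `Provider` interface, so the orchestrator never needs to know which vendor SDK is behind a given name.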

Limitations

  • Heuristic evals: Scores (edit_quality, structural_clarity, publish_ready) are rule-based heuristics, not LLM-graded.
  • Local only: Orchestration runs via FastAPI BackgroundTasks, not a distributed queue.
  • Three providers: OpenAI, Anthropic, and Gemini. Additional providers require a new adapter in src/providers/.
  • Security: User API keys are passed from the UI per-session and not persisted, but production deployments should use a secret vault.
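
The deterministic, rule-based scoring noted above could look like this in miniature. The formulas are invented for illustration; only the metric names (edit_quality, structural_clarity, publish_ready) come from this README.

```python
def heuristic_scores(text: str) -> dict:
    # Rule-based metrics: pure functions of the text, so identical input
    # always yields identical scores (no LLM grading involved).
    lines = [line for line in text.splitlines() if line.strip()]
    words = text.split()
    edit_quality = min(1.0, len(words) / 200)        # length-based proxy
    structural_clarity = min(1.0, len(lines) / 10)   # line/paragraph structure
    publish_ready = round((edit_quality + structural_clarity) / 2, 2)
    return {
        "edit_quality": edit_quality,
        "structural_clarity": structural_clarity,
        "publish_ready": publish_ready,
    }
```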

About

End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.
