Inspect AI Prompt Optimizer

Warning

This is an experimental personal project, not affiliated with Inspect AI. Shared for inspiration only.


Automatically optimize system prompts for Inspect AI evaluations through iterative refinement.

This tool uses AI to iteratively improve prompts by analyzing failures, extracting insights, and generating better prompts.

Purpose

This project serves two main purposes:

  1. Elicit model capabilities via automatic prompt optimization for existing evals/benchmarks, to better estimate current state-of-the-art capabilities
  2. Solve any task that can be defined with Inspect AI

How it works

  1. Evaluate → Run Inspect AI evaluation
  2. Analyze → Extract structured failure patterns
  3. Optimize → Generate improved prompt
  4. Iterate → Repeat until target reached
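
The loop above can be sketched in plain Python. This is an illustrative sketch, not the library's internals: the stage functions are passed in as stand-ins for the real evaluate/analyze/optimize steps.

```python
def optimize(task, iterations, target_score, evaluate, analyze, improve):
    """Illustrative evaluate -> analyze -> optimize loop with early stopping."""
    prompt, score = "", 0.0
    for _ in range(iterations):
        score, failures = evaluate(task, prompt)  # 1. run the eval
        if score >= target_score:                 # 4. stop once the target is hit
            break
        insights = analyze(failures)              # 2. extract failure patterns
        prompt = improve(prompt, insights)        # 3. generate a better prompt
    return prompt, score
```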

Quick Start

One-Liner

```python
from inspect_optimize import solve

solution = solve("math_word_problems")
print(solution)
```

With Options

```python
from inspect_optimize import solve

result = solve(
    "hellaswag",       # Select the eval task
    iterations=3,      # Run 3 prompt optimization iterations
    target_score=0.9,  # Stop early when the target score is reached
    limit=10,          # Limit evaluation to 10 samples
)
print(result)
```

Streaming API (for UIs)

```python
from inspect_optimize import solve_stream
from inspect_optimize.types import EvalResult, Insights, PromptUpdate

for item in solve_stream("arc_easy", iterations=3, limit=2):
    if isinstance(item, EvalResult):
        print(f"📊 Score: {item.score:.1%}")
    elif isinstance(item, Insights):
        print(f"💡 Found {len(item.failure_modes)} failure modes")
    elif isinstance(item, PromptUpdate):
        print("✨ New prompt generated")
```

Installation & Setup

📦 Installation Options

Install from GitHub

```shell
pip install git+https://github.com/pwenker/inspect-optimize
```

Using uv with pyproject.toml

```toml
[project]
dependencies = [
    "inspect-optimize @ git+https://github.com/pwenker/inspect-optimize.git"
]
```

Environment Setup

Create a .env file in your project root:

```shell
echo "ANTHROPIC_API_KEY=your-api-key-here" > .env
```

Interfaces

⌨️ CLI

```shell
inspect-optimize solve hellaswag              # Run optimization
inspect-optimize solve hellaswag --iterations 10 --limit 50
inspect-optimize list-evals                   # List available evals
inspect-optimize show-eval humaneval          # Show eval details
```

See CLI documentation for all commands and options.

🎨 Gradio Web UI

```shell
inspect-optimize ui hellaswag                 # Official Inspect AI eval
inspect-optimize ui ./my_custom_eval.py       # Custom evaluation file
inspect-optimize ui hellaswag --port 8080     # Custom port
inspect-optimize ui hellaswag --share         # Create public URL
```

Interactive web interface with real-time progress updates.

Options:

  • --task TASK (required): Evaluation name or file path
  • --port PORT: Server port (default: 7860)
  • --share: Enable Gradio sharing for public URL

🤖 MCP Server (for AI Agents like Claude Code)

This project includes an .mcp.json file that configures the MCP server for Claude Code. After cloning the repo, the MCP tools are available automatically.

Available Tools:

  • list_evals - List all available evaluation tasks
  • get_eval_info - Get details about a specific evaluation
  • run_evaluation - Run evaluation and get failure analysis

Manual Usage:

```shell
fastmcp dev src/inspect_optimize/mcp_server.py  # Development with inspector UI
fastmcp run src/inspect_optimize/mcp_server.py  # Production (stdio)
```

Key Features

Intelligent Failure Analysis

  • Automatically identifies failure patterns
  • Groups similar failures together
  • Extracts root causes

Streaming Progress

  • Real-time updates for UIs
  • See evaluation, analysis, and optimization as they happen
  • Build progress bars, dashboards, chat interfaces

Human-in-the-Loop (HITL)

  • Pause after AI generates each prompt
  • Review AI's analysis and proposal
  • Provide expert feedback
  • The optimizer synthesizes AI insights with human expertise

```python
result = solve("hellaswag", iterations=5, hitl=True)
# After each AI proposal, you'll be prompted:
# > Feedback: [your expert input or press Enter to accept]
```

Flexible Feedback Levels

Control how much information is shown to the analyzer:

  • BLIND - Question + Model's Answer + Verdict (the analyzer doesn't see the correct answer)
  • FULL - Question + Answer + Target + Verdict + Scorer Explanations (complete context; the default)

```python
result = solve("hellaswag", feedback_detail=FeedbackDetail.FULL)
```
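
To picture the difference between the two levels, here is a hypothetical sketch (not the library's actual implementation) of how a feedback level might gate which sample fields reach the analyzer:

```python
from enum import Enum

class FeedbackDetail(Enum):
    BLIND = "blind"
    FULL = "full"

def feedback_fields(sample: dict, level: FeedbackDetail) -> dict:
    """Select which fields of an eval sample the analyzer is allowed to see."""
    shown = ["question", "answer", "verdict"]
    if level is FeedbackDetail.FULL:
        shown += ["target", "explanation"]  # correct answer + scorer notes
    return {k: sample[k] for k in shown if k in sample}
```

BLIND keeps the analyzer honest about diagnosing the model's reasoning rather than pattern-matching against the gold answer.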

Examples

Basic Usage

```python
from inspect_optimize import solve

# Simple
result = solve("hellaswag", iterations=5)
print(result)

# With options
result = solve(
    task="hellaswag",
    iterations=10,
    target_score=0.9,
    verbose=True,
    model="anthropic/claude-opus-4.5",  # Inspect AI parameter
    limit=50,                           # Inspect AI parameter
)
```
Streaming for UIs

```python
from inspect_optimize import solve_stream
from inspect_optimize.types import EvalResult, Insights, PromptUpdate

def solve_with_ui(task: str):
    for item in solve_stream(task, iterations=5):
        if isinstance(item, EvalResult):
            # Update progress bar
            update_progress(f"Evaluated: {item.score:.1%}")
        elif isinstance(item, Insights):
            # Show analysis
            show_analysis(item.summary)
        elif isinstance(item, PromptUpdate):
            # Show new prompt
            show_prompt(item.prompt)

solve_with_ui("hellaswag")
```
Human-in-the-Loop

```python
from inspect_optimize import solve

# CLI mode (prompts in terminal)
result = solve("hellaswag", iterations=5, hitl=True)
# You'll be prompted after each AI proposal
```
Manual Loop (Advanced)

```python
from inspect_optimize import evaluate, analyze, optimize_prompt
from inspect_optimize.types import EvalArgs, FeedbackDetail

prompt = ""

for i in range(5):
    # 1. Evaluate
    eval_args = EvalArgs(
        task="hellaswag",
        iteration=i,
        eval_kwargs={"model": "anthropic/claude-opus-4", "limit": 100}
    )
    state = evaluate(prompt, eval_args)
    print(f"Iteration {i}: Score = {state.score:.1%}")

    # 2. Analyze
    insights = analyze(
        state,
        analyzer_model="anthropic/claude-haiku-4-5",
        feedback_detail=FeedbackDetail.FULL
    )
    print(f"Failures: {len(insights.failure_modes)}")

    # 3. Optimize
    update = optimize_prompt(
        insights,
        prompt,
        optimizer_model="anthropic/claude-sonnet-4-5"
    )
    print(f"Changes: {update.key_changes}")

    # 4. Use new prompt
    prompt = update.prompt

    # Custom stopping logic
    if state.score >= 0.95:
        break
```

Configuration

Feedback Detail Levels

Control what information is shown to the analyzer:

```python
from inspect_optimize.types import FeedbackDetail

# Minimal (faster, cheaper - the analyzer doesn't see correct answers)
result = solve("hellaswag", feedback_detail=FeedbackDetail.BLIND)

# Maximum (slower, more expensive, best insights - complete context)
result = solve("hellaswag", feedback_detail=FeedbackDetail.FULL)
```

Model Selection

```python
result = solve(
    "hellaswag",
    agent_model="anthropic/claude-sonnet-4-5",    # For optimization
    analyzer_model="anthropic/claude-haiku-4-5",  # For analysis (cheaper)
    model="anthropic/claude-opus-4",              # For evaluation (best)
)
```
Inspect AI Parameters

All Inspect AI eval_async parameters are supported:

```python
result = solve(
    "hellaswag",
    model="anthropic/claude-opus-4",
    limit=100,
    temperature=0.7,
    sandbox="docker",
    log_dir="logs/",
    epochs=2,
)
```

Architecture

See CLAUDE.md for detailed architecture documentation.

High-Level Overview:

```
solve() → solve_stream() → [evaluate → analyze → optimize] × iterations
```

Each stage is a pure function:

  • evaluate(prompt, args) → EvalResult (run Inspect AI eval)
  • analyze(state, ...) → Insights (extract patterns)
  • optimize_prompt(insights, prompt) → PromptUpdate (improve prompt)

Limitations

  • Single task evaluation only (multiple tasks not yet supported)
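
Until multi-task support lands, one can loop at the call site. A minimal sketch: `solve_many` is a hypothetical helper, not part of the package.

```python
def solve_many(tasks, solve_fn, **kwargs):
    """Optimize each task independently and collect the results by task name."""
    return {task: solve_fn(task, **kwargs) for task in tasks}

# Usage (assuming `from inspect_optimize import solve`):
# results = solve_many(["hellaswag", "arc_easy"], solve, iterations=3, limit=10)
```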

Development

```shell
# Setup
git clone https://github.com/pwenker/inspect-optimize.git
cd inspect-optimize
uv sync

# Run tests
uv run pytest

# Run specific test
uv run pytest tests/test_high_level_api.py -v

# Type checking
uv run mypy src/

# Lint
uv run ruff check src/
```
