Inspect AI Prompt Optimizer

Warning

This is an experimental personal project, not affiliated with Inspect AI. Shared for inspiration only.


Automatically optimize system prompts for Inspect AI evaluations through iterative refinement.

This tool uses AI to iteratively improve prompts by analyzing failures, extracting insights, and generating better prompts.

Purpose

This project serves two main purposes:

  1. Elicit model capabilities via automatic prompt optimization for existing evals/benchmarks, to better estimate current state-of-the-art capabilities
  2. Solve any task that can be defined with Inspect AI

How it works

  1. Evaluate → Run Inspect AI evaluation
  2. Analyze → Extract structured failure patterns
  3. Optimize → Generate improved prompt
  4. Iterate → Repeat until target reached
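
The loop above can be sketched in plain Python. This is an illustrative sketch, not the library's internals: the stage functions are passed in as stand-ins for the real evaluate/analyze/optimize steps.

```python
def optimize(task, iterations, target_score, evaluate, analyze, improve):
    """Illustrative evaluate -> analyze -> optimize loop with early stopping."""
    prompt, score = "", 0.0
    for _ in range(iterations):
        score, failures = evaluate(task, prompt)  # 1. run the eval
        if score >= target_score:                 # 4. stop once the target is hit
            break
        insights = analyze(failures)              # 2. extract failure patterns
        prompt = improve(prompt, insights)        # 3. generate a better prompt
    return prompt, score
```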

Quick Start

One-Liner

```python
from inspect_optimize import solve

solution = solve("math_word_problems")
print(solution)
```

With Options

```python
from inspect_optimize import solve

result = solve(
    "hellaswag",       # Select the eval task
    iterations=3,      # Run 3 prompt optimization iterations
    target_score=0.9,  # Stop early when the target score is reached
    limit=10,          # Limit evaluation to 10 samples
)
print(result)
```

Streaming API (for UIs)

```python
from inspect_optimize import solve_stream
from inspect_optimize.types import EvalResult, Insights, PromptUpdate

for item in solve_stream("arc_easy", iterations=3, limit=2):
    if isinstance(item, EvalResult):
        print(f"📊 Score: {item.score:.1%}")
    elif isinstance(item, Insights):
        print(f"💡 Found {len(item.failure_modes)} failure modes")
    elif isinstance(item, PromptUpdate):
        print("✨ New prompt generated")
```

Installation & Setup

📦 Installation Options

Install from GitHub

```shell
pip install git+https://github.com/pwenker/inspect-optimize
```

Using uv with pyproject.toml

```toml
[project]
dependencies = [
    "inspect-optimize @ git+https://github.com/pwenker/inspect-optimize.git"
]
```

Environment Setup

Create a .env file in your project root:

```shell
echo "ANTHROPIC_API_KEY=your-api-key-here" > .env
```

Interfaces

⌨️ CLI

```shell
inspect-optimize solve hellaswag              # Run optimization
inspect-optimize solve hellaswag --iterations 10 --limit 50
inspect-optimize list-evals                   # List available evals
inspect-optimize show-eval humaneval          # Show eval details
```

See CLI documentation for all commands and options.

🎨 Gradio Web UI

```shell
inspect-optimize ui hellaswag                 # Official Inspect AI eval
inspect-optimize ui ./my_custom_eval.py       # Custom evaluation file
inspect-optimize ui hellaswag --port 8080     # Custom port
inspect-optimize ui hellaswag --share         # Create public URL
```

Interactive web interface with real-time progress updates.

Options:

  • --task TASK (required): Evaluation name or file path
  • --port PORT: Server port (default: 7860)
  • --share: Enable Gradio sharing for public URL

🤖 MCP Server (for AI Agents like Claude Code)

This project includes an .mcp.json file that configures the MCP server for Claude Code. After cloning the repo, the MCP tools are available automatically.

Available Tools:

  • list_evals - List all available evaluation tasks
  • get_eval_info - Get details about a specific evaluation
  • run_evaluation - Run evaluation and get failure analysis

Manual Usage:

```shell
fastmcp dev src/inspect_optimize/mcp_server.py  # Development with inspector UI
fastmcp run src/inspect_optimize/mcp_server.py  # Production (stdio)
```

Key Features

Intelligent Failure Analysis

  • Automatically identifies failure patterns
  • Groups similar failures together
  • Extracts root causes

Streaming Progress

  • Real-time updates for UIs
  • See evaluation, analysis, and optimization as they happen
  • Build progress bars, dashboards, chat interfaces

Human-in-the-Loop (HITL)

  • Pause after AI generates each prompt
  • Review AI's analysis and proposal
  • Provide expert feedback
  • The optimizer synthesizes AI insights with human expertise

```python
result = solve("hellaswag", iterations=5, hitl=True)
# After each AI proposal, you'll be prompted:
# > Feedback: [your expert input or press Enter to accept]
```

Flexible Feedback Levels

Control how much information is shown to the analyzer:

  • BLIND - Question + Model's Answer + Verdict (the analyzer doesn't see the correct answer)
  • FULL - Question + Answer + Target + Verdict + Scorer Explanations (complete context; the default)

```python
result = solve("hellaswag", feedback_detail=FeedbackDetail.FULL)
```
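
To picture the difference between the two levels, here is a hypothetical sketch (not the library's actual implementation) of how a feedback level might gate which sample fields reach the analyzer:

```python
from enum import Enum

class FeedbackDetail(Enum):
    BLIND = "blind"
    FULL = "full"

def feedback_fields(sample: dict, level: FeedbackDetail) -> dict:
    """Select which fields of an eval sample the analyzer is allowed to see."""
    shown = ["question", "answer", "verdict"]
    if level is FeedbackDetail.FULL:
        shown += ["target", "explanation"]  # correct answer + scorer notes
    return {k: sample[k] for k in shown if k in sample}
```

BLIND keeps the analyzer honest about diagnosing the model's reasoning rather than pattern-matching against the gold answer.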

Examples

Basic Usage

```python
from inspect_optimize import solve

# Simple
result = solve("hellaswag", iterations=5)
print(result)

# With options
result = solve(
    task="hellaswag",
    iterations=10,
    target_score=0.9,
    verbose=True,
    model="anthropic/claude-opus-4.5",  # Inspect AI parameter
    limit=50,                           # Inspect AI parameter
)
```
Streaming for UIs

```python
from inspect_optimize import solve_stream
from inspect_optimize.types import EvalResult, Insights, PromptUpdate

def solve_with_ui(task: str):
    for item in solve_stream(task, iterations=5):
        if isinstance(item, EvalResult):
            # Update progress bar
            update_progress(f"Evaluated: {item.score:.1%}")
        elif isinstance(item, Insights):
            # Show analysis
            show_analysis(item.summary)
        elif isinstance(item, PromptUpdate):
            # Show new prompt
            show_prompt(item.prompt)

solve_with_ui("hellaswag")
```
Human-in-the-Loop

```python
from inspect_optimize import solve

# CLI mode (prompts in terminal)
result = solve("hellaswag", iterations=5, hitl=True)
# You'll be prompted after each AI proposal
```
Manual Loop (Advanced)

```python
from inspect_optimize import evaluate, analyze, optimize_prompt
from inspect_optimize.types import EvalArgs, FeedbackDetail

prompt = ""

for i in range(5):
    # 1. Evaluate
    eval_args = EvalArgs(
        task="hellaswag",
        iteration=i,
        eval_kwargs={"model": "anthropic/claude-opus-4", "limit": 100}
    )
    state = evaluate(prompt, eval_args)
    print(f"Iteration {i}: Score = {state.score:.1%}")

    # 2. Analyze
    insights = analyze(
        state,
        analyzer_model="anthropic/claude-haiku-4-5",
        feedback_detail=FeedbackDetail.FULL
    )
    print(f"Failures: {len(insights.failure_modes)}")

    # 3. Optimize
    update = optimize_prompt(
        insights,
        prompt,
        optimizer_model="anthropic/claude-sonnet-4-5"
    )
    print(f"Changes: {update.key_changes}")

    # 4. Use new prompt
    prompt = update.prompt

    # Custom stopping logic
    if state.score >= 0.95:
        break
```

Configuration

Feedback Detail Levels

Control what information is shown to the analyzer:

```python
from inspect_optimize.types import FeedbackDetail

# Minimal (faster, cheaper - the analyzer doesn't see correct answers)
result = solve("hellaswag", feedback_detail=FeedbackDetail.BLIND)

# Maximum (slower, more expensive, best insights - complete context)
result = solve("hellaswag", feedback_detail=FeedbackDetail.FULL)
```

Model Selection

```python
result = solve(
    "hellaswag",
    agent_model="anthropic/claude-sonnet-4-5",    # For optimization
    analyzer_model="anthropic/claude-haiku-4-5",  # For analysis (cheaper)
    model="anthropic/claude-opus-4",              # For evaluation (best)
)
```
Inspect AI Parameters

All Inspect AI eval_async parameters are supported:

```python
result = solve(
    "hellaswag",
    model="anthropic/claude-opus-4",
    limit=100,
    temperature=0.7,
    sandbox="docker",
    log_dir="logs/",
    epochs=2,
)
```

Architecture

See CLAUDE.md for detailed architecture documentation.

High-Level Overview:

```
solve() → solve_stream() → [evaluate → analyze → optimize] × iterations
```

Each stage is a pure function:

  • evaluate(prompt, args) → EvalResult (run Inspect AI eval)
  • analyze(state, ...) → Insights (extract patterns)
  • optimize_prompt(insights, prompt) → PromptUpdate (improve prompt)

Limitations

  • Single task evaluation only (multiple tasks not yet supported)
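
Until multi-task support lands, one can loop at the call site. A minimal sketch: `solve_many` is a hypothetical helper, not part of the package.

```python
def solve_many(tasks, solve_fn, **kwargs):
    """Optimize each task independently and collect the results by task name."""
    return {task: solve_fn(task, **kwargs) for task in tasks}

# Usage (assuming `from inspect_optimize import solve`):
# results = solve_many(["hellaswag", "arc_easy"], solve, iterations=3, limit=10)
```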

Development

```shell
# Setup
git clone https://github.com/pwenker/inspect-optimize.git
cd inspect-optimize
uv sync

# Run tests
uv run pytest

# Run specific test
uv run pytest tests/test_high_level_api.py -v

# Type checking
uv run mypy src/

# Lint
uv run ruff check src/
```
