> **Warning**
>
> This is an experimental personal project, not affiliated with Inspect AI. Shared for inspiration only.
Automatically optimize system prompts for Inspect AI evaluations through iterative refinement.
This tool uses AI to iteratively refine prompts: it analyzes failures, extracts insights, and generates an improved prompt each round.
There are two main purposes for this project:
- Elicit model capabilities via automatic prompt optimization for existing evals/benchmarks, to better estimate current state-of-the-art capabilities
- Solve any task that can be defined with Inspect AI
- Evaluate → Run Inspect AI evaluation
- Analyze → Extract structured failure patterns
- Optimize → Generate improved prompt
- Iterate → Repeat until target reached
- Works out-of-the-box for any eval defined in https://github.com/UKGovernmentBEIS/inspect_evals
- Provides several interfaces: Gradio web app, CLI, and MCP server
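The evaluate → analyze → optimize loop above can be sketched in plain Python. This is a toy illustration with hypothetical stand-in functions (`run_eval`, `find_failures`, `improve`), not the package's actual API:

```python
# Hypothetical stand-ins for the real evaluate/analyze/optimize stages.
def run_eval(prompt: str) -> float:
    # Toy scoring: each added line nudges the score up, capped at 1.0.
    return min(1.0, 0.5 + 0.1 * prompt.count("\n"))

def find_failures(score: float) -> list[str]:
    # Toy analysis: report one failure mode until the eval is solved.
    return [] if score >= 1.0 else ["incomplete reasoning"]

def improve(prompt: str, failures: list[str]) -> str:
    # Toy optimization: append an instruction addressing each failure.
    return prompt + "\nAddress: " + ", ".join(failures)

def optimize(prompt: str, iterations: int = 5, target_score: float = 0.9) -> tuple[str, float]:
    score = run_eval(prompt)
    for _ in range(iterations):
        if score >= target_score:  # stop early once the target is reached
            break
        prompt = improve(prompt, find_failures(score))
        score = run_eval(prompt)
    return prompt, score

prompt, score = optimize("Solve the task step by step.")
print(f"{score:.1%}")  # → 90.0%
```

The real pipeline replaces each stand-in with an LLM-backed stage, but the control flow (iterate until `target_score` or the iteration budget is hit) is the same.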
```python
from inspect_optimize import solve

solution = solve("math_word_problems")
print(solution)
```

```python
from inspect_optimize import solve

result = solve(
    "hellaswag",       # Select eval task
    iterations=3,      # Run 3 prompt-optimization iterations
    target_score=0.9,  # Stop early when the target score is reached
    limit=10,          # Limit evaluation to 10 samples
)
print(result)
```

```python
from inspect_optimize import solve_stream
from inspect_optimize.types import EvalResult, Insights, PromptUpdate

for item in solve_stream("arc_easy", iterations=3, limit=2):
    if isinstance(item, EvalResult):
        print(f"📊 Score: {item.score:.1%}")
    elif isinstance(item, Insights):
        print(f"💡 Found {len(item.failure_modes)} failure modes")
    elif isinstance(item, PromptUpdate):
        print("✨ New prompt generated")
```

## 📦 Installation Options
```bash
pip install git+https://github.com/pwenker/inspect-optimize
```

Or add it as a dependency in your `pyproject.toml`:

```toml
[project]
dependencies = [
    "inspect-optimize @ git+https://github.com/pwenker/inspect-optimize.git"
]
```

Create a `.env` file in your project root:
```bash
echo "ANTHROPIC_API_KEY=your-api-key-here" > .env
```

## ⌨️ CLI
```bash
inspect-optimize solve hellaswag                 # Run optimization
inspect-optimize solve hellaswag --iterations 10 --limit 50
inspect-optimize list-evals                      # List available evals
inspect-optimize show-eval humaneval             # Show eval details
```

See the CLI documentation for all commands and options.
## 🎨 Gradio Web UI

```bash
inspect-optimize ui hellaswag                # Official Inspect AI eval
inspect-optimize ui ./my_custom_eval.py      # Custom evaluation file
inspect-optimize ui hellaswag --port 8080    # Custom port
inspect-optimize ui hellaswag --share        # Create public URL
```

Interactive web interface with real-time progress updates.
Options:
- `--task TASK` (required): Evaluation name or file path
- `--port PORT`: Server port (default: 7860)
- `--share`: Enable Gradio sharing for a public URL
## 🤖 MCP Server (for AI Agents like Claude Code)
This project includes an .mcp.json file that configures the MCP server for Claude Code. After cloning the repo, the MCP tools are available automatically.
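For orientation, such an `.mcp.json` entry typically looks something like the sketch below. The exact `command` and `args` are assumptions inferred from the `fastmcp run` stdio command shown under Manual Usage; check the repo's actual `.mcp.json` for the authoritative config.

```json
{
  "mcpServers": {
    "inspect-optimize": {
      "command": "fastmcp",
      "args": ["run", "src/inspect_optimize/mcp_server.py"]
    }
  }
}
```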
Available Tools:
- `list_evals` - List all available evaluation tasks
- `get_eval_info` - Get details about a specific evaluation
- `run_evaluation` - Run an evaluation and get failure analysis
Manual Usage:
```bash
fastmcp dev src/inspect_optimize/mcp_server.py   # Development with inspector UI
fastmcp run src/inspect_optimize/mcp_server.py   # Production (stdio)
```

## Intelligent Failure Analysis
- Automatically identifies failure patterns
- Groups similar failures together
- Extracts root causes
## Streaming Progress
- Real-time updates for UIs
- See evaluation, analysis, and optimization as they happen
- Build progress bars, dashboards, chat interfaces
## Human-in-the-Loop (HITL)
- Pause after AI generates each prompt
- Review AI's analysis and proposal
- Provide expert feedback
- The AI synthesizes its own insights with your expertise
```python
result = solve("hellaswag", iterations=5, hitl=True)
# After each AI proposal, you'll be prompted:
# > Feedback: [your expert input or press Enter to accept]
```

## Flexible Feedback Levels

Control how much information is shown to the analyzer:
- `BLIND` - Question + Model's Answer + Verdict (analyzer doesn't see the correct answer)
- `FULL` - Question + Answer + Target + Verdict + Scorer Explanations (complete context; default)
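To make the distinction concrete, here is a toy sketch of how the two levels might filter a sample record before it reaches the analyzer. `format_feedback` and the record fields are illustrative, not the package's internals:

```python
# Illustrative only: mirrors the BLIND/FULL distinction described above.
BLIND, FULL = "blind", "full"

def format_feedback(record: dict, detail: str) -> dict:
    # Fields every level sees: question, model answer, and verdict.
    shown = {k: record[k] for k in ("question", "answer", "verdict")}
    if detail == FULL:
        # FULL additionally exposes the target and the scorer's explanation.
        shown["target"] = record["target"]
        shown["explanation"] = record["explanation"]
    return shown

record = {
    "question": "2 + 2 = ?",
    "answer": "5",
    "verdict": "incorrect",
    "target": "4",
    "explanation": "Model answered 5; expected 4.",
}
print(sorted(format_feedback(record, BLIND)))  # → ['answer', 'question', 'verdict']
```

Withholding the target (`BLIND`) forces the analyzer to reason from the model's behavior alone, which is cheaper but can miss failure modes that only the correct answer reveals.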
```python
from inspect_optimize import solve
from inspect_optimize.types import FeedbackDetail

result = solve("hellaswag", feedback_detail=FeedbackDetail.FULL)
```

## Basic Usage
```python
from inspect_optimize import solve

# Simple
result = solve("hellaswag", iterations=5)
print(result)

# With options
result = solve(
    task="hellaswag",
    iterations=10,
    target_score=0.9,
    verbose=True,
    model="anthropic/claude-opus-4.5",  # Inspect AI parameter
    limit=50,                           # Inspect AI parameter
)
```

## Streaming for UIs
```python
from inspect_optimize import solve_stream
from inspect_optimize.types import EvalResult, Insights, PromptUpdate

def solve_with_ui(task: str):
    for item in solve_stream(task, iterations=5):
        if isinstance(item, EvalResult):
            # Update progress bar
            update_progress(f"Evaluated: {item.score:.1%}")
        elif isinstance(item, Insights):
            # Show analysis
            show_analysis(item.summary)
        elif isinstance(item, PromptUpdate):
            # Show new prompt
            show_prompt(item.prompt)

solve_with_ui("hellaswag")
```

## Human-in-the-Loop
```python
from inspect_optimize import solve

# CLI mode (prompts in terminal)
result = solve("hellaswag", iterations=5, hitl=True)
# You'll be prompted after each AI proposal
```

## Manual Loop (Advanced)
```python
from inspect_optimize import evaluate, analyze, optimize_prompt
from inspect_optimize.types import EvalArgs, FeedbackDetail

prompt = ""
for i in range(5):
    # 1. Evaluate
    eval_args = EvalArgs(
        task="hellaswag",
        iteration=i,
        eval_kwargs={"model": "anthropic/claude-opus-4", "limit": 100},
    )
    state = evaluate(prompt, eval_args)
    print(f"Iteration {i}: Score = {state.score:.1%}")

    # 2. Analyze
    insights = analyze(
        state,
        analyzer_model="anthropic/claude-haiku-4-5",
        feedback_detail=FeedbackDetail.FULL,
    )
    print(f"Failures: {len(insights.failure_modes)}")

    # 3. Optimize
    update = optimize_prompt(
        insights,
        prompt,
        optimizer_model="anthropic/claude-sonnet-4-5",
    )
    print(f"Changes: {update.key_changes}")

    # 4. Use new prompt
    prompt = update.prompt

    # Custom stopping logic
    if state.score >= 0.95:
        break
```

## Feedback Detail Levels
Control what information is shown to the analyzer:

```python
from inspect_optimize import solve
from inspect_optimize.types import FeedbackDetail

# Minimal (faster, cheaper - analyzer doesn't see correct answers)
result = solve("hellaswag", feedback_detail=FeedbackDetail.BLIND)

# Maximum (slower, more expensive, best insights - complete context)
result = solve("hellaswag", feedback_detail=FeedbackDetail.FULL)
```

## Model Selection
```python
result = solve(
    "hellaswag",
    agent_model="anthropic/claude-sonnet-4-5",    # For optimization
    analyzer_model="anthropic/claude-haiku-4-5",  # For analysis (cheaper)
    model="anthropic/claude-opus-4",              # For evaluation (best)
)
```

## Inspect AI Parameters
All Inspect AI `eval_async` parameters are supported:

```python
result = solve(
    "hellaswag",
    model="anthropic/claude-opus-4",
    limit=100,
    temperature=0.7,
    sandbox="docker",
    log_dir="logs/",
    epochs=2,
)
```

See CLAUDE.md for detailed architecture documentation.
High-Level Overview:

```
solve() → solve_stream() → [evaluate → analyze → optimize] × iterations
```

Each stage is a pure function:
- `evaluate(prompt, args)` → `EvalResult` (run Inspect AI eval)
- `analyze(state, ...)` → `Insights` (extract patterns)
- `optimize_prompt(insights, prompt)` → `PromptUpdate` (improve prompt)
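As an illustration of this pure-function design, the pipeline can be mimicked with plain dataclasses. The field names below mirror the examples in this README (`score`, `failure_modes`, `prompt`, `key_changes`); the stage bodies are dummies, not the real implementations:

```python
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    score: float

@dataclass
class Insights:
    failure_modes: list[str] = field(default_factory=list)

@dataclass
class PromptUpdate:
    prompt: str
    key_changes: list[str] = field(default_factory=list)

# Dummy stages: each is a pure function of its inputs, like the real pipeline.
def evaluate(prompt: str) -> EvalResult:
    return EvalResult(score=min(1.0, 0.5 + 0.1 * len(prompt.splitlines())))

def analyze(state: EvalResult) -> Insights:
    modes = ["ambiguous instructions"] if state.score < 0.9 else []
    return Insights(failure_modes=modes)

def optimize_prompt(insights: Insights, prompt: str) -> PromptUpdate:
    changes = [f"address {m}" for m in insights.failure_modes]
    return PromptUpdate(prompt=prompt + "\n" + "\n".join(changes), key_changes=changes)

# Chaining the stages reproduces the solve() loop.
prompt = "Answer carefully."
for _ in range(3):
    state = evaluate(prompt)
    prompt = optimize_prompt(analyze(state), prompt).prompt
```

Because each stage only depends on its arguments and returns a value, stages can be tested, cached, or swapped independently, which is what makes the manual loop above possible.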
- Single task evaluation only (multiple tasks not yet supported)
```bash
# Setup
git clone https://github.com/pwenker/inspect-optimize.git
cd inspect-optimize
uv sync

# Run tests
uv run pytest

# Run specific test
uv run pytest tests/test_high_level_api.py -v

# Type checking
uv run mypy src/

# Lint
uv run ruff check src/
```