Cloud Benchmark

An SWE-bench-style evaluation framework for AI agents that manage AWS infrastructure.

Overview

Cloud Benchmark measures how well AI agents diagnose and fix real AWS infrastructure problems. It deploys Terraform-managed infrastructure, injects faults (misconfigured security groups, overly permissive IAM policies, missing dead-letter queues, recursive Lambda invocations), then hands the broken environment to an agent with a problem statement and scoped credentials. The framework automatically verifies whether the agent fixed the issue without causing regressions, and scores each attempt on resolution, safety, cost, and speed.

If you build or evaluate AI agents for cloud operations, Cloud Benchmark gives you reproducible, automated benchmarks against real AWS services -- or free LocalStack-simulated equivalents.

Quick Start

Get a benchmark running in under five minutes:

# Clone and install
git clone https://github.com/bluearch/cloud-benchmark.git
cd cloud-benchmark
uv venv --python 3.13 .venv && source .venv/bin/activate
uv pip install -e ".[dev,agents]"

# Start LocalStack
docker run --name localstack -d -p 4566:4566 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e SERVICES=ec2,s3,iam,sqs,lambda,sts,logs,cloudwatch \
  localstack/localstack

# Check your environment
cb doctor

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run your first benchmark
cb run --sandbox localstack --instance-ids iam-overperm-001 --agent agents/claude-haiku.yaml

# View results
cb results list

Prerequisites

  • Python 3.11+ (3.13 recommended)
  • Docker (for LocalStack)
  • uv package manager (curl -LsSf https://astral.sh/uv/install.sh | sh)
  • Terraform or tflocal (tflocal recommended for LocalStack; installed via pip install terraform-local)
  • API key for at least one model provider (Anthropic, OpenAI, or a local Ollama server)

Agent Backends

Cloud Benchmark ships with pre-configured agent profiles. Each config file points to a script that speaks the subprocess protocol (JSON on stdin/stdout).

Backend                        Config File                     Requires
Claude Haiku 4.5               agents/claude-haiku.yaml        ANTHROPIC_API_KEY
Claude Sonnet 4.5              agents/claude-sonnet.yaml       ANTHROPIC_API_KEY
Claude Opus                    agents/claude-opus.yaml         ANTHROPIC_API_KEY
Claude Sonnet 4.5 (Agent SDK)  agents/claude-sdk-sonnet.yaml   ANTHROPIC_API_KEY
Claude Opus (Agent SDK)        agents/claude-sdk-opus.yaml     ANTHROPIC_API_KEY
GPT-4o                         agents/openai-gpt4o.yaml        OPENAI_API_KEY
Llama 3 (Ollama)               agents/ollama-llama3.yaml       Local Ollama server running on port 11434

Adding your own agent

Create a YAML config file that specifies the subprocess adapter and a command. The command must be a script or binary that reads a JSON task from stdin and writes a JSON result to stdout. Example:

adapter: subprocess
agent_id: my-custom-agent
description: "My custom agent"
command:
  - python3
  - my_agent.py
requires_env:
  - MY_API_KEY
timeout_buffer_seconds: 60

The input JSON contains the problem statement, Terraform outputs (resource IDs, ARNs), allowed services, and AWS credentials. The output JSON contains the actions the agent took. See agents/claude_agent.py for a reference implementation.
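As a sketch of the subprocess protocol, a minimal agent might look like the following. The field names (problem_statement, terraform_outputs, actions) are assumptions based on the description above; consult agents/claude_agent.py for the actual schema:

```python
#!/usr/bin/env python3
"""Minimal subprocess-protocol agent: reads a JSON task from stdin,
writes a JSON result to stdout. Field names are illustrative only."""
import json
import sys


def handle(task: dict) -> dict:
    # Inspect the task handed over by the harness. The exact keys are
    # assumptions based on the README, not a documented contract.
    problem = task.get("problem_statement", "")
    outputs = task.get("terraform_outputs", {})

    # A real agent would call AWS APIs here using the scoped credentials
    # in the task payload. This stub just records a no-op action.
    actions = [{"tool": "noop", "note": f"received: {problem[:40]}"}]
    return {"actions": actions, "summary": "no changes made"}


if __name__ == "__main__":
    task = json.load(sys.stdin)
    json.dump(handle(task), sys.stdout)
```

Because the harness communicates only over stdin/stdout, the agent can be written in any language, as long as it emits a single JSON document and exits.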

Available Instances

Each instance defines a self-contained infrastructure scenario with Terraform configs, a fault injection, verification checks, and regression checks.
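By analogy with the agent config shown later, an instance definition could be sketched as a YAML file like the one below. Every field name here is hypothetical, shown only to illustrate the four parts (Terraform configs, fault injection, verification checks, regression checks); use cb instance inspect ID to see the real schema:

```yaml
# Hypothetical instance layout -- field names are illustrative only.
instance_id: sqs-dlq-001
category: provisioning
difficulty: medium
terraform:
  baseline: terraform/baseline      # deployed during PROVISION
  fault: terraform/fault            # applied during INJECT
verification_checks:                # must fail pre-handoff, pass post-handoff
  - dlq_exists
  - redrive_policy_configured
regression_checks:                  # must pass both before and after
  - main_queue_still_receives_messages
```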

Instance ID          Category            Difficulty  Description
sg-connectivity-001  diagnosis           easy        Security group rule removed; EC2 app cannot reach database on port 5432
s3-policy-001        security-hardening  easy        S3 bucket policy grants public read access; public access block disabled
iam-overperm-001     security-hardening  easy        IAM role has AdministratorAccess attached; must restore least-privilege access
sqs-dlq-001          provisioning        medium      SQS queue lacks a dead-letter queue; must create DLQ and configure redrive policy
lambda-cascade-001   incident-response   hard        Lambda function in recursive invocation loop via SQS; must stop the cascade

All instances are LocalStack-compatible and can run without an AWS account.

CLI Reference

The cb command is the entry point for all operations.

Environment check

cb doctor                     Check Python, Docker, LocalStack, SDKs, and API keys

Instance management

cb instance list              List all benchmark instances (filterable by --category, --difficulty, --localstack)
cb instance validate PATH     Validate an instance definition (schema, Terraform dirs, checks)
cb instance inspect ID        Show full details for an instance

Running evaluations

cb run                        Execute benchmark evaluations
  --sandbox localstack|aws    Target environment (default: localstack)
  --instance-ids ID[,ID,...]  Comma-separated instance IDs to evaluate
  --agent PATH                Path to agent config YAML
  --max-workers N             Parallel instance limit (default: 1)
  --run-id ID                 Custom run identifier
  --output-dir DIR            Output directory (default: results/)
  --dry-run                   Validate config without deploying
  -v, --verbose               Verbose logging

Results and reporting

cb results list               List all recorded evaluation runs
cb results show RUN_ID        Show detailed results for a run
cb results compare A B        Side-by-side comparison of two runs
cb results rank ID [ID ...]   Bradley-Terry tournament ranking across runs (requires 2+)
cb results visualize ID [...]  Generate self-contained HTML dashboard (--open to launch browser)
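The Bradley-Terry model behind cb results rank can be sketched in a few lines: run A beats run B on a given instance with probability p_A / (p_A + p_B), and the strengths p are fit with a standard minorization-maximization loop. This is a generic illustration of the method, not the framework's actual implementation:

```python
def bradley_terry(wins: list[list[float]], iters: int = 200) -> list[float]:
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] = number of head-to-head comparisons item i won against j.
    Returns strengths p such that P(i beats j) ~= p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for item i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom else p[i])
        total = sum(new_p)
        p = [x * n / total for x in new_p]  # normalize to mean strength 1
    return p
```

With wins = [[0, 3], [1, 0]] (run 0 beat run 1 three times out of four), the fitted ratio p[0] / p[1] converges to 3, matching the observed 3:1 odds.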

How It Works

Each benchmark evaluation follows a seven-stage pipeline:

PROVISION --> INJECT --> PRE_VERIFY --> HANDOFF --> POST_VERIFY --> SCORE --> TEARDOWN
  1. PROVISION -- Deploy baseline infrastructure with terraform apply (or tflocal apply for LocalStack).
  2. INJECT -- Apply fault injection Terraform to break the infrastructure in a specific way.
  3. PRE_VERIFY -- Confirm that verification checks fail and regression checks pass. This validates the fault was injected correctly before involving the agent.
  4. HANDOFF -- Invoke the agent subprocess with the problem statement, Terraform outputs, scoped AWS credentials, and constraints (time limit, allowed services, denied actions).
  5. POST_VERIFY -- Run the same verification checks (should now pass) and regression checks (should still pass) to determine whether the agent fixed the problem without side effects.
  6. SCORE -- Compute scoring dimensions: resolution (did checks pass?), regression safety (did anything break?), cost (API and infrastructure spend), and time.
  7. TEARDOWN -- Destroy all infrastructure with terraform destroy.

The pipeline uses a double-entry control log for crash recovery. If a run is interrupted, infrastructure teardown is guaranteed on the next invocation.
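The teardown guarantee can be illustrated with a simplified harness loop: each stage writes a journal entry on entry and exit, and teardown executes in a finally block even when a stage raises. This is a conceptual sketch only; the handler and journal names are invented for illustration and do not reflect the real harness internals:

```python
from typing import Callable

# Ordered pipeline stages from the README; TEARDOWN runs unconditionally.
STAGES = ["PROVISION", "INJECT", "PRE_VERIFY", "HANDOFF", "POST_VERIFY", "SCORE"]


def run_pipeline(handlers: dict[str, Callable[[], None]], journal: list[str]) -> bool:
    """Run stages in order, journaling entry and exit; always tear down.

    Returns True if every stage completed, False if one raised.
    """
    ok = True
    try:
        for stage in STAGES:
            journal.append(f"BEGIN {stage}")   # first entry of the pair
            handlers[stage]()                  # may raise at any point
            journal.append(f"END {stage}")     # second entry confirms completion
    except Exception:
        ok = False                             # interrupted run: teardown still owed
    finally:
        journal.append("BEGIN TEARDOWN")
        handlers["TEARDOWN"]()                 # terraform destroy in the real harness
        journal.append("END TEARDOWN")
    return ok
```

A crash mid-stage leaves a BEGIN entry without its matching END, which is how a double-entry log lets the next invocation detect unfinished work and re-run teardown.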

Development

Run tests

pytest tests/

All tests are mocked and do not require LocalStack or AWS credentials.

Lint and type check

ruff check src/
mypy src/

Project structure

src/cloud_benchmark/
  cli/           CLI commands (Click)
  schema/        Pydantic v2 models for instances, configs, results
  harness/       Pipeline, Terraform executor, runner
  agents/        Agent adapter interface and subprocess adapter
  scoring/       Scoring dimensions, aggregation, Bradley-Terry ranking
  storage/       Local JSONL result storage
  visualization/ HTML report generation
instances/       Benchmark instance definitions (YAML + Terraform)
agents/          Pre-configured agent YAML configs and agent scripts
tests/           Unit tests

Troubleshooting

LocalStack not running

docker run --name localstack -d -p 4566:4566 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e SERVICES=ec2,s3,iam,sqs,lambda,sts,logs,cloudwatch \
  localstack/localstack

Run cb doctor to confirm it is reachable.

Docker socket error for Lambda instances

Lambda instances require Docker-in-Docker access. Mount the Docker socket when starting LocalStack: -v /var/run/docker.sock:/var/run/docker.sock

API key not set

export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...

Run cb doctor to verify your keys are detected.

Python version too old

Cloud Benchmark requires Python 3.11+. Use uv to create a virtual environment with the correct version:

uv venv --python 3.13 .venv && source .venv/bin/activate

tflocal not found

pip install terraform-local

tflocal is a wrapper around Terraform that transparently redirects AWS provider calls to LocalStack.

License

Apache 2.0. See LICENSE for the full text.
