An SWE-bench-style evaluation framework for AI agents that manage AWS infrastructure.
Cloud Benchmark measures how well AI agents diagnose and fix real AWS infrastructure problems. It deploys Terraform-managed infrastructure, injects faults (misconfigured security groups, overly permissive IAM policies, missing dead-letter queues, recursive Lambda invocations), then hands the broken environment to an agent with a problem statement and scoped credentials. The framework automatically verifies whether the agent fixed the issue without causing regressions, and scores each attempt on resolution, safety, cost, and speed.
If you build or evaluate AI agents for cloud operations, Cloud Benchmark gives you reproducible, automated benchmarks against real AWS services -- or free LocalStack-simulated equivalents.
Get a benchmark running in under five minutes:
```shell
# Clone and install
git clone https://github.com/bluearch/cloud-benchmark.git
cd cloud-benchmark
uv venv --python 3.13 .venv && source .venv/bin/activate
uv pip install -e ".[dev,agents]"

# Start LocalStack
docker run --name localstack -d -p 4566:4566 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e SERVICES=ec2,s3,iam,sqs,lambda,sts,logs,cloudwatch \
  localstack/localstack

# Check your environment
cb doctor

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run your first benchmark
cb run --sandbox localstack --instance-ids iam-overperm-001 --agent agents/claude-haiku.yaml

# View results
cb results list
```

Prerequisites:

- Python 3.11+ (3.13 recommended)
- Docker (for LocalStack)
- uv package manager (`curl -LsSf https://astral.sh/uv/install.sh | sh`)
- Terraform or tflocal (tflocal recommended for LocalStack; installed via `pip install terraform-local`)
- API key for at least one model provider (Anthropic, OpenAI, or a local Ollama server)
Cloud Benchmark ships with pre-configured agent profiles. Each config file points to a script that speaks the subprocess protocol (JSON on stdin/stdout).
| Backend | Config File | Requires |
|---|---|---|
| Claude Haiku 4.5 | `agents/claude-haiku.yaml` | `ANTHROPIC_API_KEY` |
| Claude Sonnet 4.5 | `agents/claude-sonnet.yaml` | `ANTHROPIC_API_KEY` |
| Claude Opus | `agents/claude-opus.yaml` | `ANTHROPIC_API_KEY` |
| Claude Sonnet 4.5 (Agent SDK) | `agents/claude-sdk-sonnet.yaml` | `ANTHROPIC_API_KEY` |
| Claude Opus (Agent SDK) | `agents/claude-sdk-opus.yaml` | `ANTHROPIC_API_KEY` |
| GPT-4o | `agents/openai-gpt4o.yaml` | `OPENAI_API_KEY` |
| Llama 3 (Ollama) | `agents/ollama-llama3.yaml` | Local Ollama server running on port 11434 |
Create a YAML config file that specifies the subprocess adapter and a command. The command must be a script or binary that reads a JSON task from stdin and writes a JSON result to stdout. Example:
```yaml
adapter: subprocess
agent_id: my-custom-agent
description: "My custom agent"
command:
  - python3
  - my_agent.py
requires_env:
  - MY_API_KEY
timeout_buffer_seconds: 60
```

The input JSON contains the problem statement, Terraform outputs (resource IDs, ARNs), allowed services, and AWS credentials. The output JSON contains the actions the agent took. See `agents/claude_agent.py` for a reference implementation.
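The stdin/stdout protocol can be sketched as a minimal Python agent. The field names used here (`problem_statement`, `terraform_outputs`, `actions`, `summary`) are illustrative assumptions, not the framework's exact schema; consult `agents/claude_agent.py` for the real contract.

```python
import json
import sys


def handle_task(task: dict) -> dict:
    """Turn one JSON task into one JSON result.

    Field names are illustrative assumptions, not the real schema.
    A real agent would use the scoped AWS credentials in the task to
    inspect and repair the environment; this stub only records input.
    """
    problem = task.get("problem_statement", "")
    outputs = task.get("terraform_outputs", {})
    return {
        "actions": [f"inspected {len(outputs)} terraform outputs"],
        "summary": f"received problem of {len(problem)} characters",
    }


if __name__ == "__main__":
    raw = sys.stdin.read()  # one JSON task arrives on stdin
    if raw:  # allow importing or running without a piped task
        json.dump(handle_task(json.loads(raw)), sys.stdout)
```

Point the `command` list of a subprocess agent config at a script like this and the harness handles the rest of the lifecycle.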
Each instance defines a self-contained infrastructure scenario with Terraform configs, a fault injection, verification checks, and regression checks.
| Instance ID | Category | Difficulty | Description |
|---|---|---|---|
| `sg-connectivity-001` | diagnosis | easy | Security group rule removed; EC2 app cannot reach database on port 5432 |
| `s3-policy-001` | security-hardening | easy | S3 bucket policy grants public read access; public access block disabled |
| `iam-overperm-001` | security-hardening | easy | IAM role has AdministratorAccess attached; must restore least-privilege |
| `sqs-dlq-001` | provisioning | medium | SQS queue lacks a dead-letter queue; must create DLQ and configure redrive policy |
| `lambda-cascade-001` | incident-response | hard | Lambda function in recursive invocation loop via SQS; must stop the cascade |
All instances are LocalStack-compatible and can run without an AWS account.
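As a rough illustration of the pieces an instance ties together, here is a hypothetical YAML sketch. Every field name below is invented for illustration and is not the actual schema; use `cb instance inspect ID` to see real definitions.

```yaml
# Hypothetical sketch only -- field names are illustrative, not the real schema.
instance_id: sqs-dlq-001
category: provisioning
difficulty: medium
terraform:
  baseline: terraform/baseline      # deployed during PROVISION
  fault: terraform/fault            # applied during INJECT
verification_checks:                # fail before the agent runs, pass after
  - dlq_exists
  - redrive_policy_configured
regression_checks:                  # pass both before and after the agent runs
  - main_queue_still_receives_messages
```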
The `cb` command is the entry point for all operations.
- `cb doctor` -- Check Python, Docker, LocalStack, SDKs, and API keys
- `cb instance list` -- List all benchmark instances (filterable by `--category`, `--difficulty`, `--localstack`)
- `cb instance validate PATH` -- Validate an instance definition (schema, Terraform dirs, checks)
- `cb instance inspect ID` -- Show full details for an instance
- `cb run` -- Execute benchmark evaluations
  - `--sandbox localstack|aws` -- Target environment (default: `localstack`)
  - `--instance-ids ID[,ID,...]` -- Comma-separated instance IDs to evaluate
  - `--agent PATH` -- Path to agent config YAML
  - `--max-workers N` -- Parallel instance limit (default: 1)
  - `--run-id ID` -- Custom run identifier
  - `--output-dir DIR` -- Output directory (default: `results/`)
  - `--dry-run` -- Validate config without deploying
  - `-v, --verbose` -- Verbose logging
- `cb results list` -- List all recorded evaluation runs
- `cb results show RUN_ID` -- Show detailed results for a run
- `cb results compare A B` -- Side-by-side comparison of two runs
- `cb results rank ID [ID ...]` -- Bradley-Terry tournament ranking across runs (requires 2+)
- `cb results visualize ID [...]` -- Generate self-contained HTML dashboard (`--open` to launch browser)
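The Bradley-Terry model behind `cb results rank` can be sketched independently of the framework: given pairwise win counts between agents, the classic MM (Zermelo) iteration estimates a latent strength per agent. This is a standalone illustration of the algorithm, not Cloud Benchmark's actual scoring code.

```python
def bradley_terry(wins: dict, iters: int = 200) -> dict:
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is how many times a beat b. Returns {player: strength},
    normalized to sum to 1, via the standard MM/Zermelo fixed-point update:
        p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
    """
    players = sorted({p for pair in wins for p in pair})
    p = {i: 1.0 for i in players}
    for _ in range(iters):
        new = {}
        for i in players:
            w_i = sum(w for (a, _), w in wins.items() if a == i)  # total wins of i
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)  # games i vs j
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new[i] = w_i / denom if denom else p[i]
        total = sum(new.values())
        p = {i: v / total for i, v in new.items()}
    return p
```

For example, `bradley_terry({("sonnet", "haiku"): 3, ("haiku", "sonnet"): 1})` assigns the higher strength to `sonnet`, since it won three of the four head-to-head instances.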
Each benchmark evaluation follows a seven-stage pipeline:
PROVISION --> INJECT --> PRE_VERIFY --> HANDOFF --> POST_VERIFY --> SCORE --> TEARDOWN
- PROVISION -- Deploy baseline infrastructure with `terraform apply` (or `tflocal apply` for LocalStack).
- INJECT -- Apply fault injection Terraform to break the infrastructure in a specific way.
- PRE_VERIFY -- Confirm that verification checks fail and regression checks pass. This validates the fault was injected correctly before involving the agent.
- HANDOFF -- Invoke the agent subprocess with the problem statement, Terraform outputs, scoped AWS credentials, and constraints (time limit, allowed services, denied actions).
- POST_VERIFY -- Run the same verification checks (should now pass) and regression checks (should still pass) to determine whether the agent fixed the problem without side effects.
- SCORE -- Compute scoring dimensions: resolution (did checks pass?), regression safety (did anything break?), cost (API and infrastructure spend), and time.
- TEARDOWN -- Destroy all infrastructure with `terraform destroy`.
The pipeline uses a double-entry control log for crash recovery. If a run is interrupted, infrastructure teardown is guaranteed on the next invocation.
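The double-entry idea can be sketched as follows; the class and field names are invented for illustration and do not match the framework's internals. Each stage writes an "intent" entry before acting and a "done" entry after, so a crash leaves an unbalanced entry that the next invocation can detect and clean up.

```python
import json
from pathlib import Path


class ControlLog:
    """Append-only JSONL log with paired intent/done entries (illustrative sketch)."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, stage: str, status: str) -> None:
        # Append one entry per state transition; appends survive a crash.
        with self.path.open("a") as f:
            f.write(json.dumps({"stage": stage, "status": status}) + "\n")

    def unfinished_stages(self) -> list:
        """Stages with an 'intent' entry but no matching 'done' entry."""
        intents, dones = [], set()
        if self.path.exists():
            for line in self.path.read_text().splitlines():
                entry = json.loads(line)
                if entry["status"] == "intent":
                    intents.append(entry["stage"])
                else:
                    dones.add(entry["stage"])
        return [s for s in intents if s not in dones]
```

On startup, an unfinished PROVISION entry implies leaked infrastructure from a prior run, so the harness can run the teardown before starting new work.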
```shell
pytest tests/
```

All tests are mocked and do not require LocalStack or AWS credentials.

```shell
ruff check src/
mypy src/
```

```
src/cloud_benchmark/
    cli/            CLI commands (Click)
    schema/         Pydantic v2 models for instances, configs, results
    harness/        Pipeline, Terraform executor, runner
    agents/         Agent adapter interface and subprocess adapter
    scoring/        Scoring dimensions, aggregation, Bradley-Terry ranking
    storage/        Local JSONL result storage
    visualization/  HTML report generation
instances/          Benchmark instance definitions (YAML + Terraform)
agents/             Pre-configured agent YAML configs and agent scripts
tests/              Unit tests
```
LocalStack not running

```shell
docker run --name localstack -d -p 4566:4566 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e SERVICES=ec2,s3,iam,sqs,lambda,sts,logs,cloudwatch \
  localstack/localstack
```

Run `cb doctor` to confirm it is reachable.

Docker socket error for Lambda instances

Lambda instances require Docker-in-Docker access. Mount the Docker socket when starting LocalStack:

```shell
-v /var/run/docker.sock:/var/run/docker.sock
```

API key not set

```shell
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
```

Run `cb doctor` to verify your keys are detected.

Python version too old

Cloud Benchmark requires Python 3.11+. Use uv to create a virtual environment with the correct version:

```shell
uv venv --python 3.13 .venv && source .venv/bin/activate
```

tflocal not found

```shell
pip install terraform-local
```

tflocal is a wrapper around Terraform that transparently redirects AWS provider calls to LocalStack.
Apache 2.0. See LICENSE for the full text.