An SWE-bench-style evaluation framework for AI agents that manage AWS infrastructure.
Cloud Benchmark measures how well AI agents diagnose and fix real AWS infrastructure problems. It deploys Terraform-managed infrastructure, injects faults (misconfigured security groups, overly permissive IAM policies, missing dead-letter queues, recursive Lambda invocations), then hands the broken environment to an agent with a problem statement and scoped credentials. The framework automatically verifies whether the agent fixed the issue without causing regressions, and scores each attempt on resolution, safety, cost, and speed.
If you build or evaluate AI agents for cloud operations, Cloud Benchmark gives you reproducible, automated benchmarks against real AWS services -- or free LocalStack-simulated equivalents.
Get a benchmark running in under five minutes:
```shell
# Clone and install
git clone https://github.com/bluearch/cloud-benchmark.git
cd cloud-benchmark
uv venv --python 3.13 .venv && source .venv/bin/activate
uv pip install -e ".[dev,agents]"

# Start LocalStack
docker run --name localstack -d -p 4566:4566 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e SERVICES=ec2,s3,iam,sqs,lambda,sts,logs,cloudwatch \
  localstack/localstack

# Check your environment
cb doctor

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Run your first benchmark
cb run --sandbox localstack --instance-ids iam-overperm-001 --agent agents/claude-haiku.yaml

# View results
cb results list
```

Prerequisites:

- Python 3.11+ (3.13 recommended)
- Docker (for LocalStack)
- uv package manager (`curl -LsSf https://astral.sh/uv/install.sh | sh`)
- Terraform or tflocal (tflocal recommended for LocalStack; installed via `pip install terraform-local`)
- API key for at least one model provider (Anthropic, OpenAI, or a local Ollama server)
Cloud Benchmark ships with pre-configured agent profiles. Each config file points to a script that speaks the subprocess protocol (JSON on stdin/stdout).
| Backend | Config File | Requires |
|---|---|---|
| Claude Haiku 4.5 | `agents/claude-haiku.yaml` | `ANTHROPIC_API_KEY` |
| Claude Sonnet 4.5 | `agents/claude-sonnet.yaml` | `ANTHROPIC_API_KEY` |
| Claude Opus | `agents/claude-opus.yaml` | `ANTHROPIC_API_KEY` |
| Claude Sonnet 4.5 (Agent SDK) | `agents/claude-sdk-sonnet.yaml` | `ANTHROPIC_API_KEY` |
| Claude Opus (Agent SDK) | `agents/claude-sdk-opus.yaml` | `ANTHROPIC_API_KEY` |
| GPT-4o | `agents/openai-gpt4o.yaml` | `OPENAI_API_KEY` |
| Llama 3 (Ollama) | `agents/ollama-llama3.yaml` | Local Ollama server running on port 11434 |
Create a YAML config file that specifies the subprocess adapter and a command. The command must be a script or binary that reads a JSON task from stdin and writes a JSON result to stdout. Example:
```yaml
adapter: subprocess
agent_id: my-custom-agent
description: "My custom agent"
command:
  - python3
  - my_agent.py
requires_env:
  - MY_API_KEY
timeout_buffer_seconds: 60
```

The input JSON contains the problem statement, Terraform outputs (resource IDs, ARNs), allowed services, and AWS credentials. The output JSON contains the actions the agent took. See `agents/claude_agent.py` for a reference implementation.
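The stdin/stdout protocol can be sketched as a minimal Python agent. The field names used here (`problem_statement`, `terraform_outputs`, `actions`, `summary`) are illustrative assumptions, not the framework's exact schema; consult `agents/claude_agent.py` for the real contract.

```python
import json
import sys


def handle_task(task: dict) -> dict:
    """Turn one JSON task into one JSON result.

    Field names are illustrative assumptions, not the real schema.
    A real agent would use the scoped AWS credentials in the task to
    inspect and repair the environment; this stub only records input.
    """
    problem = task.get("problem_statement", "")
    outputs = task.get("terraform_outputs", {})
    return {
        "actions": [f"inspected {len(outputs)} terraform outputs"],
        "summary": f"received problem of {len(problem)} characters",
    }


if __name__ == "__main__":
    raw = sys.stdin.read()  # one JSON task arrives on stdin
    if raw:  # allow importing or running without a piped task
        json.dump(handle_task(json.loads(raw)), sys.stdout)
```

Point the `command` list of a subprocess agent config at a script like this and the harness handles the rest of the lifecycle.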
Each instance defines a self-contained infrastructure scenario with Terraform configs, a fault injection, verification checks, and regression checks.
| Instance ID | Category | Difficulty | Description |
|---|---|---|---|
| `sg-connectivity-001` | diagnosis | easy | Security group rule removed; EC2 app cannot reach database on port 5432 |
| `s3-policy-001` | security-hardening | easy | S3 bucket policy grants public read access; public access block disabled |
| `iam-overperm-001` | security-hardening | easy | IAM role has AdministratorAccess attached; must restore least-privilege |
| `sqs-dlq-001` | provisioning | medium | SQS queue lacks a dead-letter queue; must create DLQ and configure redrive policy |
| `lambda-cascade-001` | incident-response | hard | Lambda function in recursive invocation loop via SQS; must stop the cascade |
All instances are LocalStack-compatible and can run without an AWS account.
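As a rough illustration of the pieces an instance ties together, here is a hypothetical YAML sketch. Every field name below is invented for illustration and is not the actual schema; use `cb instance inspect ID` to see real definitions.

```yaml
# Hypothetical sketch only -- field names are illustrative, not the real schema.
instance_id: sqs-dlq-001
category: provisioning
difficulty: medium
terraform:
  baseline: terraform/baseline      # deployed during PROVISION
  fault: terraform/fault            # applied during INJECT
verification_checks:                # fail before the agent runs, pass after
  - dlq_exists
  - redrive_policy_configured
regression_checks:                  # pass both before and after the agent runs
  - main_queue_still_receives_messages
```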
The `cb` command is the entry point for all operations.
- `cb doctor` -- Check Python, Docker, LocalStack, SDKs, and API keys
- `cb instance list` -- List all benchmark instances (filterable by `--category`, `--difficulty`, `--localstack`)
- `cb instance validate PATH` -- Validate an instance definition (schema, Terraform dirs, checks)
- `cb instance inspect ID` -- Show full details for an instance
- `cb run` -- Execute benchmark evaluations
  - `--sandbox localstack|aws` -- Target environment (default: `localstack`)
  - `--instance-ids ID[,ID,...]` -- Comma-separated instance IDs to evaluate
  - `--agent PATH` -- Path to agent config YAML
  - `--max-workers N` -- Parallel instance limit (default: 1)
  - `--run-id ID` -- Custom run identifier
  - `--output-dir DIR` -- Output directory (default: `results/`)
  - `--dry-run` -- Validate config without deploying
  - `-v, --verbose` -- Verbose logging
- `cb results list` -- List all recorded evaluation runs
- `cb results show RUN_ID` -- Show detailed results for a run
- `cb results compare A B` -- Side-by-side comparison of two runs
- `cb results rank ID [ID ...]` -- Bradley-Terry tournament ranking across runs (requires 2+)
- `cb results visualize ID [...]` -- Generate self-contained HTML dashboard (`--open` to launch browser)
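The Bradley-Terry model behind `cb results rank` can be sketched independently of the framework: given pairwise win counts between agents, the classic MM (Zermelo) iteration estimates a latent strength per agent. This is a standalone illustration of the algorithm, not Cloud Benchmark's actual scoring code.

```python
def bradley_terry(wins: dict, iters: int = 200) -> dict:
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins[(a, b)] is how many times a beat b. Returns {player: strength},
    normalized to sum to 1, via the standard MM/Zermelo fixed-point update:
        p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
    """
    players = sorted({p for pair in wins for p in pair})
    p = {i: 1.0 for i in players}
    for _ in range(iters):
        new = {}
        for i in players:
            w_i = sum(w for (a, _), w in wins.items() if a == i)  # total wins of i
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)  # games i vs j
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new[i] = w_i / denom if denom else p[i]
        total = sum(new.values())
        p = {i: v / total for i, v in new.items()}
    return p
```

For example, `bradley_terry({("sonnet", "haiku"): 3, ("haiku", "sonnet"): 1})` assigns the higher strength to `sonnet`, since it won three of the four head-to-head instances.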
Each benchmark evaluation follows a seven-stage pipeline:
PROVISION --> INJECT --> PRE_VERIFY --> HANDOFF --> POST_VERIFY --> SCORE --> TEARDOWN
- PROVISION -- Deploy baseline infrastructure with `terraform apply` (or `tflocal apply` for LocalStack).
- INJECT -- Apply fault injection Terraform to break the infrastructure in a specific way.
- PRE_VERIFY -- Confirm that verification checks fail and regression checks pass. This validates the fault was injected correctly before involving the agent.
- HANDOFF -- Invoke the agent subprocess with the problem statement, Terraform outputs, scoped AWS credentials, and constraints (time limit, allowed services, denied actions).
- POST_VERIFY -- Run the same verification checks (should now pass) and regression checks (should still pass) to determine whether the agent fixed the problem without side effects.
- SCORE -- Compute scoring dimensions: resolution (did checks pass?), regression safety (did anything break?), cost (API and infrastructure spend), and time.
- TEARDOWN -- Destroy all infrastructure with `terraform destroy`.
The pipeline uses a double-entry control log for crash recovery. If a run is interrupted, infrastructure teardown is guaranteed on the next invocation.
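The double-entry idea can be sketched as follows; the class and field names are invented for illustration and do not match the framework's internals. Each stage writes an "intent" entry before acting and a "done" entry after, so a crash leaves an unbalanced entry that the next invocation can detect and clean up.

```python
import json
from pathlib import Path


class ControlLog:
    """Append-only JSONL log with paired intent/done entries (illustrative sketch)."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, stage: str, status: str) -> None:
        # Append one entry per state transition; appends survive a crash.
        with self.path.open("a") as f:
            f.write(json.dumps({"stage": stage, "status": status}) + "\n")

    def unfinished_stages(self) -> list:
        """Stages with an 'intent' entry but no matching 'done' entry."""
        intents, dones = [], set()
        if self.path.exists():
            for line in self.path.read_text().splitlines():
                entry = json.loads(line)
                if entry["status"] == "intent":
                    intents.append(entry["stage"])
                else:
                    dones.add(entry["stage"])
        return [s for s in intents if s not in dones]
```

On startup, an unfinished PROVISION entry implies leaked infrastructure from a prior run, so the harness can run the teardown before starting new work.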
```shell
pytest tests/
```

All tests are mocked and do not require LocalStack or AWS credentials.

```shell
ruff check src/
mypy src/
```

```
src/cloud_benchmark/
    cli/            CLI commands (Click)
    schema/         Pydantic v2 models for instances, configs, results
    harness/        Pipeline, Terraform executor, runner
    agents/         Agent adapter interface and subprocess adapter
    scoring/        Scoring dimensions, aggregation, Bradley-Terry ranking
    storage/        Local JSONL result storage
    visualization/  HTML report generation
instances/          Benchmark instance definitions (YAML + Terraform)
agents/             Pre-configured agent YAML configs and agent scripts
tests/              Unit tests
```
LocalStack not running

```shell
docker run --name localstack -d -p 4566:4566 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -e SERVICES=ec2,s3,iam,sqs,lambda,sts,logs,cloudwatch \
  localstack/localstack
```

Run `cb doctor` to confirm it is reachable.

Docker socket error for Lambda instances

Lambda instances require Docker-in-Docker access. Mount the Docker socket when starting LocalStack:

```shell
-v /var/run/docker.sock:/var/run/docker.sock
```

API key not set

```shell
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
```

Run `cb doctor` to verify your keys are detected.

Python version too old

Cloud Benchmark requires Python 3.11+. Use uv to create a virtual environment with the correct version:

```shell
uv venv --python 3.13 .venv && source .venv/bin/activate
```

tflocal not found

```shell
pip install terraform-local
```

tflocal is a wrapper around Terraform that transparently redirects AWS provider calls to LocalStack.
Apache 2.0. See LICENSE for the full text.