A benchmark that measures how well LLMs generate Terraform. An LLM receives a problem statement and produces `.tf` files; the pipeline then runs init → validate → plan → apply and scores each stage (weighted: init 10%, validate 20%, plan 40%, apply 20%, validation script 10%).
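The fail-fast weighting can be sketched as follows (the constant and function names here are illustrative, not the project's actual API):

```python
# Stage weights from the scoring scheme above, in pipeline order.
STAGE_WEIGHTS = [
    ("init", 0.10),
    ("validate", 0.20),
    ("plan", 0.40),
    ("apply", 0.20),
    ("validation_script", 0.10),
]

def score_run(passed: dict) -> float:
    """Sum the weights of passed stages; a failure skips everything after it."""
    total = 0.0
    for stage, weight in STAGE_WEIGHTS:
        if not passed.get(stage, False):
            break
        total += weight
    return round(total, 2)
```

So an instance that inits and validates but fails at plan scores 0.30, and only a run that clears every stage including the validation script reaches 1.0.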
```bash
# Run benchmark (Docker + LocalStack by default)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929

# Run a single instance
uv run python -m terraform_llm.cli benchmark dataset/ -o output --instance-id terraform-aws-s3-001

# Run without Docker (needs terraform on host)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --no-docker

# Run plan-only (default runs full apply)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --no-run-apply

# Skip LLM generation, reuse existing .tf files
uv run python -m terraform_llm.cli benchmark dataset/ -o output --skip-generation

# Generate terraform from a prompt
uv run python -m terraform_llm.cli generate "Create an S3 bucket" -o output
```

- `agent/models.py` — Calls the LLM via litellm to generate `.tf` files from a problem statement
- `agent/environment.py` — `TerraformEnvironment` runs terraform commands. Accepts an optional `docker_env` (`LocalstackDockerEnvironment`) to route commands through Docker instead of a local subprocess
- `agent/evaluator.py` — `evaluate_instance()` runs the full pipeline (setup script → init → validate → plan → apply → validation → destroy), scores each stage, and skips remaining stages on failure
- `agent/agent.py` — `run_instance()` ties generation and evaluation together; `run_benchmark()` loops over a dataset
- `agent/docker_environment.py` — Manages the Docker network and the LocalStack container, and runs terraform/validation/setup/cleanup scripts in containers
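The fail-fast behavior of `evaluate_instance()` can be sketched as a simple loop (a simplified illustration: the real evaluator runs terraform commands, not arbitrary callables):

```python
from dataclasses import dataclass

@dataclass
class StageResult:
    name: str
    passed: bool
    detail: str = ""

def run_stages(stages):
    # stages: ordered list of (name, callable); stop at the first failure,
    # mirroring how the evaluator skips remaining stages.
    results = []
    for name, fn in stages:
        try:
            fn()
            results.append(StageResult(name, True))
        except Exception as exc:
            results.append(StageResult(name, False, str(exc)))
            break
    return results
```

A plan failure thus yields results for init, validate, and plan only; apply and the validation script are never attempted.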
Each dataset lives in `dataset/<name>/` with a JSONL file. Each line is one instance:
```json
{
  "instance_id": "terraform-aws-s3-001",
  "problem_statement": "Create an S3 bucket with versioning",
  "difficulty": "easy",
  "tags": ["aws", "s3"],
  "provider": "aws",
  "region": "us-east-1",
  "expected_resources": {"aws_s3_bucket": 1},
  "validation_script": "dataset/simple_s3/validation.py",
  "metadata": {"estimated_cost": "$0.02/month", "deployment_time_seconds": 30},
  "gold_solution": {"main.tf": "..."},
  "hints": ["Use aws_s3_bucket_versioning as a separate resource"],
  "setup_script": null
}
```

Key fields: `expected_resources` drives plan scoring, `validation_script` runs after apply, and `setup_script` creates pre-existing infrastructure before terraform runs.
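Plan scoring compares resource counts found in the plan against `expected_resources`; a minimal sketch of that comparison (the function name and its inputs are assumptions, not the project's actual API):

```python
from collections import Counter

def plan_satisfies(expected: dict, planned_types: list) -> bool:
    """Check the plan contains at least the expected count of each resource type."""
    counts = Counter(planned_types)
    return all(counts[rtype] >= n for rtype, n in expected.items())
```

For the instance above, a plan containing one `aws_s3_bucket` (plus an `aws_s3_bucket_versioning`) satisfies `{"aws_s3_bucket": 1}`, while an empty plan does not.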
- `terraform_llm/agent/` — Core evaluation pipeline
- `terraform_llm/datasets/` — Schema, loader, `Dataset` class
- `terraform_llm/cli/` — CLI commands (benchmark, generate, list, traces)
- `dataset/` — Benchmark instances (JSONL + validation scripts + gold solutions)