A benchmark and training pipeline for building small, specialized LLMs that generate production-quality Terraform, aiming to match the performance of models 10-100x their size.
Large frontier models (GPT-4, Claude, etc.) can generate decent Terraform, but they're expensive, slow, and overkill for infrastructure-as-code. The hypothesis: a small, fine-tuned model trained on high-quality Terraform data can match or exceed them on this specific task.
This project provides:
- A benchmark to measure how well any LLM generates Terraform, with graded scoring across the full `init -> validate -> plan -> apply` pipeline
- A dataset framework for curating Terraform generation tasks at varying difficulty levels
- A runner that evaluates any model (via litellm) against the benchmark in a single command
The end goal is to train and release a small model (7B-13B parameters) that scores competitively against frontier models on Terraform generation.
Problem statement ──> LLM ──> .tf files ──> Terraform pipeline ──> Graded score
                                                    │
                                                    ├── terraform init      (0.1 weight)
                                                    ├── terraform validate  (0.2 weight)
                                                    ├── terraform plan      (0.4 weight)
                                                    ├── terraform apply     (0.2 weight)
                                                    └── validation script   (0.1 weight)
Each stage produces a score between 0.0 and 1.0. The plan stage compares actual planned resources against expected resources with partial credit — if the benchmark expects 2 S3 buckets and the model produces 1, that's a 0.5 on that resource type, not a zero.
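The partial-credit rule can be sketched as follows (a simplified illustration of the idea; the actual comparison lives in `agent/evaluator.py` and may differ in detail):

```python
def resource_score(expected: dict[str, int], planned: dict[str, int]) -> float:
    """Partial credit: for each expected resource type, score the fraction
    of expected instances actually planned (capped at 1.0), then average."""
    if not expected:
        return 1.0
    per_type = [
        min(planned.get(rtype, 0), want) / want
        for rtype, want in expected.items()
    ]
    return sum(per_type) / len(per_type)

# Expecting 2 S3 buckets but planning only 1 yields 0.5 for that type:
print(resource_score({"aws_s3_bucket": 2}, {"aws_s3_bucket": 1}))  # 0.5
```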
terraform_llm/
  datasets/                 # Benchmark data layer
    schema.py               # BenchmarkInstance, difficulty levels, validation
    dataset.py              # HuggingFace-style Dataset class (filter, map, split)
    loader.py               # JSONL loading, streaming, save/export
  agent/                    # Evaluation runner
    models.py               # LLM abstraction via litellm (any provider)
    environment.py          # Terraform execution (local subprocess or Docker)
    docker_environment.py   # Docker + LocalStack environment management
    evaluator.py            # Graded scoring pipeline + resource comparison
    agent.py                # run_instance, run_benchmark
    results.py              # StageResult, InstanceResult, BenchmarkReport
  cli/                      # Command-line interface
    benchmark.py            # benchmark command
    generate.py             # generate command
pip install -e .
# or
uv sync

All commands are available via `uv run python -m terraform_llm.cli` (or the installed entry point if configured).
Run the benchmark against a dataset. By default, Terraform runs inside Docker with LocalStack for isolated AWS mocking.
# Basic run with Docker + LocalStack (default)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929
# Run without Docker (requires Terraform CLI on host)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929 --no-docker
# Run a specific instance
uv run python -m terraform_llm.cli benchmark dataset/ -o output --instance-id terraform-aws-s3-001
# Filter by difficulty, provider, or tags
uv run python -m terraform_llm.cli benchmark dataset/ -o output --difficulty easy --provider aws --tag s3 --limit 5
# Plan-only mode (skips apply and validation, faster but less thorough)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --no-run-apply
# Skip code generation — reuse .tf files already in the output directory
uv run python -m terraform_llm.cli benchmark dataset/ -o output --skip-generation
# Custom Docker images
uv run python -m terraform_llm.cli benchmark dataset/ -o output \
--terraform-image hashicorp/terraform:1.6 \
  --localstack-image localstack/localstack:3.0

Options:
| Flag | Default | Description |
|---|---|---|
| `DATASET` (argument) | required | Path to JSONL dataset file or directory |
| `-o, --output-dir` | `output` | Output directory |
| `--model` | `anthropic/claude-3-5-sonnet-20241022` | Model identifier (any litellm-supported model) |
| `--docker / --no-docker` | `--docker` | Use Docker + LocalStack for isolated execution |
| `--terraform-image` | `hashicorp/terraform:latest` | Docker image for Terraform |
| `--localstack-image` | `localstack/localstack:latest` | Docker image for LocalStack |
| `--run-apply / --no-run-apply` | `--run-apply` | Run `terraform apply` (creates infrastructure) |
| `--skip-generation` | off | Reuse existing .tf files from output directory |
| `-i, --instance-id` | none | Run a specific instance by ID |
| `--difficulty` | none | Filter by difficulty (easy, medium, hard) |
| `-p, --provider` | none | Filter by cloud provider (aws, azure, gcp) |
| `--tag` | none | Filter by tags (repeatable) |
| `--limit` | none | Max number of instances to run |
| `--temperature` | `0.0` | Model temperature |
| `-v, --verbose` | off | Verbose output |
Generate Terraform files from a natural language description, optionally validating them.
# Generate and validate
uv run python -m terraform_llm.cli generate "Create an S3 bucket with versioning" -o output
# Generate without validation
uv run python -m terraform_llm.cli generate "Create a VPC with public subnets" -o output --no-validate
# Use a different model
uv run python -m terraform_llm.cli generate "Create a Lambda function" -o output --model openai/gpt-4o

List the instances in a dataset:

uv run python -m terraform_llm.cli list dataset/

Inspect saved traces from a previous run:

uv run python -m terraform_llm.cli traces output/

The benchmark can also be driven from Python:

from terraform_llm.datasets import load_dataset
from terraform_llm.agent import ModelConfig, EvalConfig, run_benchmark
dataset = load_dataset("data/benchmark.jsonl")
# Evaluate any model — litellm supports OpenAI, Anthropic, Ollama, HuggingFace, etc.
model = ModelConfig(model="anthropic/claude-sonnet-4-5-20250929")
config = EvalConfig(run_apply=True) # full apply mode (deploys to LocalStack)
report = run_benchmark(dataset, model, config)
print(f"Mean score: {report.mean_score:.2f}")
print(f"Stage pass rates: {report.stage_pass_rates()}")

The Docker execution environment can be configured explicitly:

config = EvalConfig(
    run_apply=True,
    use_docker=True,
    terraform_image="hashicorp/terraform:latest",
    localstack_image="localstack/localstack:latest",
)
report = run_benchmark(dataset, model, config)

To compare several models against the same dataset:

models = [
    ModelConfig(model="anthropic/claude-sonnet-4-5-20250929"),
    ModelConfig(model="openai/gpt-4o"),
    ModelConfig(model="ollama/llama3:8b"),       # local small model
    ModelConfig(model="ollama/codellama:13b"),   # local code model
]
for model in models:
    report = run_benchmark(dataset, model, EvalConfig())
    print(f"{model.model}: {report.mean_score:.2f}")

Each instance in the JSONL dataset looks like:
{
"instance_id": "terraform-aws-s3-001",
"problem_statement": "Create an S3 bucket with versioning enabled and server-side encryption using AES256",
"difficulty": "easy",
"tags": ["aws", "s3", "storage"],
"provider": "aws",
"region": "us-east-1",
"expected_resources": {"aws_s3_bucket": 1, "aws_s3_bucket_versioning": 1, "aws_s3_bucket_server_side_encryption_configuration": 1},
"validation_script": "scripts/validate_s3.sh",
"metadata": {"estimated_cost": "$0.02/month", "deployment_time_seconds": 30},
"gold_solution": {"main.tf": "resource \"aws_s3_bucket\" ..."},
"hints": ["Use aws_s3_bucket_versioning as a separate resource"]
}

Execution modes:

| Mode | Flag | What happens |
|---|---|---|
| Full apply (default) | `run_apply=True` | Deploys real infrastructure, runs validation, destroys after. |
| Plan only | `run_apply=False` | Runs init/validate/plan. No real infrastructure. Free. |
| Docker (default) | `--docker` | Runs Terraform in Docker with LocalStack. Isolated, no real AWS costs. |
| Local | `--no-docker` | Runs Terraform directly on the host. Requires Terraform CLI installed. |
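A minimal reader for the instance schema above can be sketched with the standard library (illustrative only; the real `datasets/loader.py` also handles streaming and export, and the required-key set here is an assumption):

```python
import json

# Assumed minimal required keys; the authoritative schema lives in datasets/schema.py.
REQUIRED_KEYS = {"instance_id", "problem_statement", "difficulty",
                 "provider", "expected_resources"}

def load_instances(path: str) -> list[dict]:
    """Read a JSONL benchmark file: one instance per line, skipping blanks."""
    instances = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            if not line.strip():
                continue
            instance = json.loads(line)
            missing = REQUIRED_KEYS - instance.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            instances.append(instance)
    return instances
```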
The benchmark uses graded scoring, not binary pass/fail:
- init (10%): Did providers resolve correctly?
- validate (20%): Is the HCL syntactically and semantically valid?
- plan (40%): Do the planned resources match what's expected? Partial credit for partial matches.
- apply (20%): Did the infrastructure deploy successfully?
- validation script (10%): Does the deployed infrastructure actually work as intended?
A model that gets to plan with correct resources but fails apply scores much higher than one that can't even pass validate.
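Putting the weights together, the overall score is a weighted sum of per-stage scores (a sketch of the idea; the stage names used as keys here are illustrative):

```python
# Stage weights from the grading pipeline (keys are illustrative names).
STAGE_WEIGHTS = {
    "init": 0.10,
    "validate": 0.20,
    "plan": 0.40,
    "apply": 0.20,
    "validation_script": 0.10,
}

def overall_score(stage_scores: dict[str, float]) -> float:
    """Weighted sum of per-stage scores; a stage that never ran scores 0.0."""
    return sum(weight * stage_scores.get(stage, 0.0)
               for stage, weight in STAGE_WEIGHTS.items())

# A perfect plan that fails at apply still keeps most of the credit:
print(round(overall_score({"init": 1.0, "validate": 1.0, "plan": 1.0}), 2))  # 0.7
```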
Roadmap:

- Curate initial benchmark dataset (50-100 instances across easy/medium/hard)
- Baseline frontier models (Claude, GPT-4, Gemini)
- Baseline open models (Llama, CodeLlama, DeepSeek, Qwen)
- Fine-tune a small model on gold solutions
- Evaluate fine-tuned model against baselines
- Multi-turn agent mode (iterative fix based on terraform errors)
Requirements:

- Python >= 3.12
- Docker (for default isolated execution with LocalStack)
- Terraform CLI (only needed with `--no-docker` mode)
- API keys for whichever model provider you want to evaluate (set via environment variables)
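Before a run, it can help to check these prerequisites programmatically; a small self-contained sketch (a convenience helper, not part of the package):

```python
import shutil
import sys

def check_prereqs(need_docker: bool = True) -> list[str]:
    """Return a list of missing prerequisites (empty means ready to run)."""
    problems = []
    if sys.version_info < (3, 12):
        problems.append("Python >= 3.12 required")
    if need_docker and shutil.which("docker") is None:
        problems.append("docker CLI not found on PATH")
    if not need_docker and shutil.which("terraform") is None:
        problems.append("terraform CLI not found (needed with --no-docker)")
    return problems

if __name__ == "__main__":
    for problem in check_prereqs():
        print("missing:", problem)
```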