
terraform-llm

A benchmark and training pipeline for building small, specialized LLMs that generate production-quality Terraform — matching the performance of models 10-100x their size.

Why

Large frontier models (GPT-4, Claude, etc.) can generate decent Terraform, but they're expensive, slow, and overkill for infrastructure-as-code. The hypothesis: a small, fine-tuned model trained on high-quality Terraform data can match or exceed them on this specific task.

This project provides:

  1. A benchmark to measure how well any LLM generates Terraform, with graded scoring across the full init -> validate -> plan -> apply pipeline
  2. A dataset framework for curating Terraform generation tasks at varying difficulty levels
  3. A runner that evaluates any model (via litellm) against the benchmark in a single command

The end goal is to train and release a small model (7B-13B parameters) that scores competitively against frontier models on Terraform generation.

How it works

Problem statement ──> LLM ──> .tf files ──> Terraform pipeline ──> Graded score
                                              │
                                              ├── terraform init      (0.1 weight)
                                              ├── terraform validate  (0.2 weight)
                                              ├── terraform plan      (0.4 weight)
                                              ├── terraform apply     (0.2 weight)
                                              └── validation script   (0.1 weight)

Each stage produces a score between 0.0 and 1.0. The plan stage compares actual planned resources against expected resources with partial credit — if the benchmark expects 2 S3 buckets and the model produces 1, that's a 0.5 on that resource type, not a zero.
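The actual comparison logic lives in agent/evaluator.py; a minimal sketch of the partial-credit idea (my illustration, not the project's exact implementation) might look like:

```python
def plan_score(expected: dict[str, int], planned: dict[str, int]) -> float:
    """Partial-credit comparison of planned vs. expected resource counts.

    Each expected resource type contributes min(planned, expected) / expected,
    and the stage score averages across types.
    """
    if not expected:
        return 1.0
    per_type = [
        min(planned.get(rtype, 0), count) / count
        for rtype, count in expected.items()
    ]
    return sum(per_type) / len(per_type)

# Benchmark expects 2 S3 buckets; the model planned only 1 -> 0.5, not 0.
print(plan_score({"aws_s3_bucket": 2}, {"aws_s3_bucket": 1}))  # 0.5
```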

Project structure

terraform_llm/
  datasets/                  # Benchmark data layer
    schema.py                # BenchmarkInstance, difficulty levels, validation
    dataset.py               # HuggingFace-style Dataset class (filter, map, split)
    loader.py                # JSONL loading, streaming, save/export
  agent/                     # Evaluation runner
    models.py                # LLM abstraction via litellm (any provider)
    environment.py           # Terraform execution (local subprocess or Docker)
    docker_environment.py    # Docker + LocalStack environment management
    evaluator.py             # Graded scoring pipeline + resource comparison
    agent.py                 # run_instance, run_benchmark
    results.py               # StageResult, InstanceResult, BenchmarkReport
  cli/                       # Command-line interface
    benchmark.py             # benchmark command
    generate.py              # generate command

Quick start

pip install -e .
# or
uv sync

CLI usage

All commands are available via uv run python -m terraform_llm.cli (or the installed entry point if configured).

benchmark — Run the evaluation pipeline

Run the benchmark against a dataset. By default, Terraform runs inside Docker with LocalStack for isolated AWS mocking.

# Basic run with Docker + LocalStack (default)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929

# Run without Docker (requires Terraform CLI on host)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929 --no-docker

# Run a specific instance
uv run python -m terraform_llm.cli benchmark dataset/ -o output --instance-id terraform-aws-s3-001

# Filter by difficulty, provider, or tags
uv run python -m terraform_llm.cli benchmark dataset/ -o output --difficulty easy --provider aws --tag s3 --limit 5

# Plan-only mode (skips apply and validation, faster but less thorough)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --no-run-apply

# Skip code generation — reuse .tf files already in the output directory
uv run python -m terraform_llm.cli benchmark dataset/ -o output --skip-generation

# Custom Docker images
uv run python -m terraform_llm.cli benchmark dataset/ -o output \
  --terraform-image hashicorp/terraform:1.6 \
  --localstack-image localstack/localstack:3.0

Options:

| Flag | Default | Description |
|------|---------|-------------|
| DATASET (argument) | required | Path to JSONL dataset file or directory |
| -o, --output-dir | output | Output directory |
| --model | anthropic/claude-3-5-sonnet-20241022 | Model identifier (any litellm-supported model) |
| --docker / --no-docker | --docker | Use Docker + LocalStack for isolated execution |
| --terraform-image | hashicorp/terraform:latest | Docker image for Terraform |
| --localstack-image | localstack/localstack:latest | Docker image for LocalStack |
| --run-apply / --no-run-apply | --run-apply | Run terraform apply (creates infrastructure) |
| --skip-generation | off | Reuse existing .tf files from the output directory |
| -i, --instance-id | none | Run a specific instance by ID |
| --difficulty | none | Filter by difficulty (easy, medium, hard) |
| -p, --provider | none | Filter by cloud provider (aws, azure, gcp) |
| --tag | none | Filter by tags (repeatable) |
| --limit | none | Max number of instances to run |
| --temperature | 0.0 | Model temperature |
| -v, --verbose | off | Verbose output |

generate — Generate Terraform code from a prompt

Generate Terraform files from a natural language description, optionally validating them.

# Generate and validate
uv run python -m terraform_llm.cli generate "Create an S3 bucket with versioning" -o output

# Generate without validation
uv run python -m terraform_llm.cli generate "Create a VPC with public subnets" -o output --no-validate

# Use a different model
uv run python -m terraform_llm.cli generate "Create a Lambda function" -o output --model openai/gpt-4o

list — List instances in a dataset

uv run python -m terraform_llm.cli list dataset/

traces — View execution traces

uv run python -m terraform_llm.cli traces output/

Python API

Run the benchmark

from terraform_llm.datasets import load_dataset
from terraform_llm.agent import ModelConfig, EvalConfig, run_benchmark

dataset = load_dataset("data/benchmark.jsonl")

# Evaluate any model — litellm supports OpenAI, Anthropic, Ollama, HuggingFace, etc.
model = ModelConfig(model="anthropic/claude-sonnet-4-5-20250929")
config = EvalConfig(run_apply=True)  # full apply mode (deploys to LocalStack)

report = run_benchmark(dataset, model, config)
print(f"Mean score: {report.mean_score:.2f}")
print(f"Stage pass rates: {report.stage_pass_rates()}")

Run with Docker + LocalStack

config = EvalConfig(
    run_apply=True,
    use_docker=True,
    terraform_image="hashicorp/terraform:latest",
    localstack_image="localstack/localstack:latest",
)

report = run_benchmark(dataset, model, config)

Compare models

models = [
    ModelConfig(model="anthropic/claude-sonnet-4-5-20250929"),
    ModelConfig(model="openai/gpt-4o"),
    ModelConfig(model="ollama/llama3:8b"),       # local small model
    ModelConfig(model="ollama/codellama:13b"),    # local code model
]

for model in models:
    report = run_benchmark(dataset, model, EvalConfig())
    print(f"{model.model}: {report.mean_score:.2f}")

Benchmark instance format

Each instance in the JSONL dataset looks like:

{
  "instance_id": "terraform-aws-s3-001",
  "problem_statement": "Create an S3 bucket with versioning enabled and server-side encryption using AES256",
  "difficulty": "easy",
  "tags": ["aws", "s3", "storage"],
  "provider": "aws",
  "region": "us-east-1",
  "expected_resources": {"aws_s3_bucket": 1, "aws_s3_bucket_versioning": 1, "aws_s3_bucket_server_side_encryption_configuration": 1},
  "validation_script": "scripts/validate_s3.sh",
  "metadata": {"estimated_cost": "$0.02/month", "deployment_time_seconds": 30},
  "gold_solution": {"main.tf": "resource \"aws_s3_bucket\" ..."},
  "hints": ["Use aws_s3_bucket_versioning as a separate resource"]
}
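The authoritative schema lives in terraform_llm/datasets/schema.py; as a self-contained sketch of how one of these JSONL lines maps onto a typed record (field names taken from the example above, the dataclass itself is my illustration), consider:

```python
import json
from dataclasses import dataclass, field

@dataclass
class BenchmarkInstance:
    # Required fields mirror the JSONL example; optional fields default
    # sensibly so sparse instances still parse.
    instance_id: str
    problem_statement: str
    difficulty: str
    expected_resources: dict
    provider: str = "aws"
    tags: list = field(default_factory=list)
    hints: list = field(default_factory=list)

def parse_instance(line: str) -> BenchmarkInstance:
    raw = json.loads(line)
    return BenchmarkInstance(
        instance_id=raw["instance_id"],
        problem_statement=raw["problem_statement"],
        difficulty=raw["difficulty"],
        expected_resources=raw["expected_resources"],
        provider=raw.get("provider", "aws"),
        tags=raw.get("tags", []),
        hints=raw.get("hints", []),
    )
```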

Evaluation modes

| Mode | Flag | What happens |
|------|------|--------------|
| Full apply (default) | run_apply=True | Deploys real infrastructure, runs validation, destroys after. |
| Plan only | run_apply=False | Runs init/validate/plan. No real infrastructure. Free. |
| Docker (default) | --docker | Runs Terraform in Docker with LocalStack. Isolated, no real AWS costs. |
| Local | --no-docker | Runs Terraform directly on the host. Requires Terraform CLI installed. |

Scoring

The benchmark uses graded scoring, not binary pass/fail:

  • init (10%): Did providers resolve correctly?
  • validate (20%): Is the HCL syntactically and semantically valid?
  • plan (40%): Do the planned resources match what's expected? Partial credit for partial matches.
  • apply (20%): Did the infrastructure deploy successfully?
  • validation script (10%): Does the deployed infrastructure actually work as intended?

A model that gets to plan with correct resources but fails apply scores much higher than one that can't even pass validate.
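The weights above imply a simple weighted sum. As an illustrative sketch (the real pipeline lives in agent/evaluator.py), this reproduces the claim that a correct-plan-but-failed-apply run beats a run that dies at validate:

```python
# Stage weights from the pipeline diagram above.
STAGE_WEIGHTS = {
    "init": 0.1,
    "validate": 0.2,
    "plan": 0.4,
    "apply": 0.2,
    "validation_script": 0.1,
}

def total_score(stage_scores: dict[str, float]) -> float:
    """Weighted sum of per-stage scores; unreached stages count as 0.0."""
    return sum(w * stage_scores.get(stage, 0.0)
               for stage, w in STAGE_WEIGHTS.items())

# Correct plan, failed apply: 0.1 + 0.2 + 0.4       = 0.7
# Fails validate outright:    0.1 + 0.0             = 0.1
print(total_score({"init": 1.0, "validate": 1.0, "plan": 1.0}))  # 0.7
print(total_score({"init": 1.0}))                                # 0.1
```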

Roadmap

  • Curate initial benchmark dataset (50-100 instances across easy/medium/hard)
  • Baseline frontier models (Claude, GPT-4, Gemini)
  • Baseline open models (Llama, CodeLlama, DeepSeek, Qwen)
  • Fine-tune a small model on gold solutions
  • Evaluate fine-tuned model against baselines
  • Multi-turn agent mode (iterative fix based on terraform errors)

Requirements

  • Python >= 3.12
  • Docker (for default isolated execution with LocalStack)
  • Terraform CLI (only needed with --no-docker mode)
  • API keys for whichever model provider you want to evaluate (set via environment variables)