A benchmark and training pipeline for building small, specialized LLMs that generate production-quality Terraform, aiming to match the performance of models 10-100x their size.
Large frontier models (GPT-4, Claude, etc.) can generate decent Terraform, but they're expensive, slow, and overkill for infrastructure-as-code. The hypothesis: a small, fine-tuned model trained on high-quality Terraform data can match or exceed them on this specific task.
This project provides:
- A benchmark to measure how well any LLM generates Terraform, with graded scoring across the full `init -> validate -> plan -> apply` pipeline
- A dataset framework for curating Terraform generation tasks at varying difficulty levels
- A runner that evaluates any model (via litellm) against the benchmark in a single command
The end goal is to train and release a small model (7B-13B parameters) that scores competitively against frontier models on Terraform generation.
Problem statement ──> LLM ──> .tf files ──> Terraform pipeline ──> Graded score
                                                    │
                                                    ├── terraform init      (0.1 weight)
                                                    ├── terraform validate  (0.2 weight)
                                                    ├── terraform plan      (0.4 weight)
                                                    ├── terraform apply     (0.2 weight)
                                                    └── validation script   (0.1 weight)
Each stage produces a score between 0.0 and 1.0. The plan stage compares actual planned resources against expected resources with partial credit — if the benchmark expects 2 S3 buckets and the model produces 1, that's a 0.5 on that resource type, not a zero.
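The partial-credit rule can be sketched as follows (a simplified illustration of the idea; the actual comparison lives in `agent/evaluator.py` and may differ in detail):

```python
def resource_score(expected: dict[str, int], planned: dict[str, int]) -> float:
    """Partial credit: for each expected resource type, score the fraction
    of expected instances actually planned (capped at 1.0), then average."""
    if not expected:
        return 1.0
    per_type = [
        min(planned.get(rtype, 0), want) / want
        for rtype, want in expected.items()
    ]
    return sum(per_type) / len(per_type)

# Expecting 2 S3 buckets but planning only 1 yields 0.5 for that type:
print(resource_score({"aws_s3_bucket": 2}, {"aws_s3_bucket": 1}))  # 0.5
```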
terraform_llm/
  datasets/                 # Benchmark data layer
    schema.py               # BenchmarkInstance, difficulty levels, validation
    dataset.py              # HuggingFace-style Dataset class (filter, map, split)
    loader.py               # JSONL loading, streaming, save/export
  agent/                    # Evaluation runner
    models.py               # LLM abstraction via litellm (any provider)
    environment.py          # Terraform execution (local subprocess or Docker)
    docker_environment.py   # Docker + LocalStack environment management
    evaluator.py            # Graded scoring pipeline + resource comparison
    agent.py                # run_instance, run_benchmark
    results.py              # StageResult, InstanceResult, BenchmarkReport
  cli/                      # Command-line interface
    benchmark.py            # benchmark command
    generate.py             # generate command
pip install -e .
# or
uv sync

All commands are available via `uv run python -m terraform_llm.cli` (or the installed entry point if configured).
Run the benchmark against a dataset. By default, Terraform runs inside Docker with LocalStack for isolated AWS mocking.
# Basic run with Docker + LocalStack (default)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929
# Run without Docker (requires Terraform CLI on host)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929 --no-docker
# Run a specific instance
uv run python -m terraform_llm.cli benchmark dataset/ -o output --instance-id terraform-aws-s3-001
# Filter by difficulty, provider, or tags
uv run python -m terraform_llm.cli benchmark dataset/ -o output --difficulty easy --provider aws --tag s3 --limit 5
# Plan-only mode (skips apply and validation, faster but less thorough)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --no-run-apply
# Skip code generation — reuse .tf files already in the output directory
uv run python -m terraform_llm.cli benchmark dataset/ -o output --skip-generation
# Custom Docker images
uv run python -m terraform_llm.cli benchmark dataset/ -o output \
--terraform-image hashicorp/terraform:1.6 \
  --localstack-image localstack/localstack:3.0

Options:
| Flag | Default | Description |
|---|---|---|
| `DATASET` (argument) | required | Path to JSONL dataset file or directory |
| `-o, --output-dir` | `output` | Output directory |
| `--model` | `anthropic/claude-3-5-sonnet-20241022` | Model identifier (any litellm-supported model) |
| `--docker / --no-docker` | `--docker` | Use Docker + LocalStack for isolated execution |
| `--terraform-image` | `hashicorp/terraform:latest` | Docker image for Terraform |
| `--localstack-image` | `localstack/localstack:latest` | Docker image for LocalStack |
| `--run-apply / --no-run-apply` | `--run-apply` | Run `terraform apply` (creates infrastructure) |
| `--skip-generation` | off | Reuse existing .tf files from output directory |
| `-i, --instance-id` | none | Run a specific instance by ID |
| `--difficulty` | none | Filter by difficulty (easy, medium, hard) |
| `-p, --provider` | none | Filter by cloud provider (aws, azure, gcp) |
| `--tag` | none | Filter by tags (repeatable) |
| `--limit` | none | Max number of instances to run |
| `--temperature` | `0.0` | Model temperature |
| `-v, --verbose` | off | Verbose output |
Generate Terraform files from a natural language description, optionally validating them.
# Generate and validate
uv run python -m terraform_llm.cli generate "Create an S3 bucket with versioning" -o output
# Generate without validation
uv run python -m terraform_llm.cli generate "Create a VPC with public subnets" -o output --no-validate
# Use a different model
uv run python -m terraform_llm.cli generate "Create a Lambda function" -o output --model openai/gpt-4o

List the instances in a dataset:

uv run python -m terraform_llm.cli list dataset/

Inspect saved traces from a previous run:

uv run python -m terraform_llm.cli traces output/

The benchmark can also be driven from Python:

from terraform_llm.datasets import load_dataset
from terraform_llm.agent import ModelConfig, EvalConfig, run_benchmark
dataset = load_dataset("data/benchmark.jsonl")
# Evaluate any model — litellm supports OpenAI, Anthropic, Ollama, HuggingFace, etc.
model = ModelConfig(model="anthropic/claude-sonnet-4-5-20250929")
config = EvalConfig(run_apply=True) # full apply mode (deploys to LocalStack)
report = run_benchmark(dataset, model, config)
print(f"Mean score: {report.mean_score:.2f}")
print(f"Stage pass rates: {report.stage_pass_rates()}")

The Docker execution environment can be configured explicitly:

config = EvalConfig(
    run_apply=True,
    use_docker=True,
    terraform_image="hashicorp/terraform:latest",
    localstack_image="localstack/localstack:latest",
)
report = run_benchmark(dataset, model, config)

To compare several models against the same dataset:

models = [
    ModelConfig(model="anthropic/claude-sonnet-4-5-20250929"),
    ModelConfig(model="openai/gpt-4o"),
    ModelConfig(model="ollama/llama3:8b"),       # local small model
    ModelConfig(model="ollama/codellama:13b"),   # local code model
]
for model in models:
    report = run_benchmark(dataset, model, EvalConfig())
    print(f"{model.model}: {report.mean_score:.2f}")

Each instance in the JSONL dataset looks like:
{
"instance_id": "terraform-aws-s3-001",
"problem_statement": "Create an S3 bucket with versioning enabled and server-side encryption using AES256",
"difficulty": "easy",
"tags": ["aws", "s3", "storage"],
"provider": "aws",
"region": "us-east-1",
"expected_resources": {"aws_s3_bucket": 1, "aws_s3_bucket_versioning": 1, "aws_s3_bucket_server_side_encryption_configuration": 1},
"validation_script": "scripts/validate_s3.sh",
"metadata": {"estimated_cost": "$0.02/month", "deployment_time_seconds": 30},
"gold_solution": {"main.tf": "resource \"aws_s3_bucket\" ..."},
"hints": ["Use aws_s3_bucket_versioning as a separate resource"]
}

Execution modes:

| Mode | Flag | What happens |
|---|---|---|
| Full apply (default) | `run_apply=True` | Deploys real infrastructure, runs validation, destroys after. |
| Plan only | `run_apply=False` | Runs init/validate/plan. No real infrastructure. Free. |
| Docker (default) | `--docker` | Runs Terraform in Docker with LocalStack. Isolated, no real AWS costs. |
| Local | `--no-docker` | Runs Terraform directly on the host. Requires Terraform CLI installed. |
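A minimal reader for the instance schema above can be sketched with the standard library (illustrative only; the real `datasets/loader.py` also handles streaming and export, and the required-key set here is an assumption):

```python
import json

# Assumed minimal required keys; the authoritative schema lives in datasets/schema.py.
REQUIRED_KEYS = {"instance_id", "problem_statement", "difficulty",
                 "provider", "expected_resources"}

def load_instances(path: str) -> list[dict]:
    """Read a JSONL benchmark file: one instance per line, skipping blanks."""
    instances = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, 1):
            if not line.strip():
                continue
            instance = json.loads(line)
            missing = REQUIRED_KEYS - instance.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            instances.append(instance)
    return instances
```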
The benchmark uses graded scoring, not binary pass/fail:
- init (10%): Did providers resolve correctly?
- validate (20%): Is the HCL syntactically and semantically valid?
- plan (40%): Do the planned resources match what's expected? Partial credit for partial matches.
- apply (20%): Did the infrastructure deploy successfully?
- validation script (10%): Does the deployed infrastructure actually work as intended?
A model that gets to plan with correct resources but fails apply scores much higher than one that can't even pass validate.
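Putting the weights together, the overall score is a weighted sum of per-stage scores (a sketch of the idea; the stage names used as keys here are illustrative):

```python
# Stage weights from the grading pipeline (keys are illustrative names).
STAGE_WEIGHTS = {
    "init": 0.10,
    "validate": 0.20,
    "plan": 0.40,
    "apply": 0.20,
    "validation_script": 0.10,
}

def overall_score(stage_scores: dict[str, float]) -> float:
    """Weighted sum of per-stage scores; a stage that never ran scores 0.0."""
    return sum(weight * stage_scores.get(stage, 0.0)
               for stage, weight in STAGE_WEIGHTS.items())

# A perfect plan that fails at apply still keeps most of the credit:
print(round(overall_score({"init": 1.0, "validate": 1.0, "plan": 1.0}), 2))  # 0.7
```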
Roadmap:

- Curate initial benchmark dataset (50-100 instances across easy/medium/hard)
- Baseline frontier models (Claude, GPT-4, Gemini)
- Baseline open models (Llama, CodeLlama, DeepSeek, Qwen)
- Fine-tune a small model on gold solutions
- Evaluate fine-tuned model against baselines
- Multi-turn agent mode (iterative fix based on terraform errors)
Requirements:

- Python >= 3.12
- Docker (for default isolated execution with LocalStack)
- Terraform CLI (only needed with `--no-docker` mode)
- API keys for whichever model provider you want to evaluate (set via environment variables)
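Before a run, it can help to check these prerequisites programmatically; a small self-contained sketch (a convenience helper, not part of the package):

```python
import shutil
import sys

def check_prereqs(need_docker: bool = True) -> list[str]:
    """Return a list of missing prerequisites (empty means ready to run)."""
    problems = []
    if sys.version_info < (3, 12):
        problems.append("Python >= 3.12 required")
    if need_docker and shutil.which("docker") is None:
        problems.append("docker CLI not found on PATH")
    if not need_docker and shutil.which("terraform") is None:
        problems.append("terraform CLI not found (needed with --no-docker)")
    return problems

if __name__ == "__main__":
    for problem in check_prereqs():
        print("missing:", problem)
```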