
terraform-llm

A benchmark and training pipeline for building small, specialized LLMs that generate production-quality Terraform — matching the performance of models 10-100x their size.

Why

Large frontier models (GPT-4, Claude, etc.) can generate decent Terraform, but they're expensive, slow, and overkill for infrastructure-as-code. The hypothesis: a small, fine-tuned model trained on high-quality Terraform data can match or exceed them on this specific task.

This project provides:

  1. A benchmark to measure how well any LLM generates Terraform, with graded scoring across the full init -> validate -> plan -> apply pipeline
  2. A dataset framework for curating Terraform generation tasks at varying difficulty levels
  3. A runner that evaluates any model (via litellm) against the benchmark in a single command

The end goal is to train and release a small model (7B-13B parameters) that scores competitively against frontier models on Terraform generation.

How it works

Problem statement ──> LLM ──> .tf files ──> Terraform pipeline ──> Graded score
                                              │
                                              ├── terraform init      (0.1 weight)
                                              ├── terraform validate  (0.2 weight)
                                              ├── terraform plan      (0.4 weight)
                                              ├── terraform apply     (0.2 weight)
                                              └── validation script   (0.1 weight)

Each stage produces a score between 0.0 and 1.0. The plan stage compares actual planned resources against expected resources with partial credit — if the benchmark expects 2 S3 buckets and the model produces 1, that's a 0.5 on that resource type, not a zero.
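The actual comparison logic lives in agent/evaluator.py; a minimal sketch of the partial-credit idea (my illustration, not the project's exact implementation) might look like:

```python
def plan_score(expected: dict[str, int], planned: dict[str, int]) -> float:
    """Partial-credit comparison of planned vs. expected resource counts.

    Each expected resource type contributes min(planned, expected) / expected,
    and the stage score averages across types.
    """
    if not expected:
        return 1.0
    per_type = [
        min(planned.get(rtype, 0), count) / count
        for rtype, count in expected.items()
    ]
    return sum(per_type) / len(per_type)

# Benchmark expects 2 S3 buckets; the model planned only 1 -> 0.5, not 0.
print(plan_score({"aws_s3_bucket": 2}, {"aws_s3_bucket": 1}))  # 0.5
```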

Project structure

terraform_llm/
  datasets/                  # Benchmark data layer
    schema.py                # BenchmarkInstance, difficulty levels, validation
    dataset.py               # HuggingFace-style Dataset class (filter, map, split)
    loader.py                # JSONL loading, streaming, save/export
  agent/                     # Evaluation runner
    models.py                # LLM abstraction via litellm (any provider)
    environment.py           # Terraform execution (local subprocess or Docker)
    docker_environment.py    # Docker + LocalStack environment management
    evaluator.py             # Graded scoring pipeline + resource comparison
    agent.py                 # run_instance, run_benchmark
    results.py               # StageResult, InstanceResult, BenchmarkReport
  cli/                       # Command-line interface
    benchmark.py             # benchmark command
    generate.py              # generate command

Quick start

pip install -e .
# or
uv sync

CLI usage

All commands are available via uv run python -m terraform_llm.cli (or the installed entry point if configured).

benchmark — Run the evaluation pipeline

Run the benchmark against a dataset. By default, Terraform runs inside Docker with LocalStack for isolated AWS mocking.

# Basic run with Docker + LocalStack (default)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929

# Run without Docker (requires Terraform CLI on host)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --model anthropic/claude-sonnet-4-5-20250929 --no-docker

# Run a specific instance
uv run python -m terraform_llm.cli benchmark dataset/ -o output --instance-id terraform-aws-s3-001

# Filter by difficulty, provider, or tags
uv run python -m terraform_llm.cli benchmark dataset/ -o output --difficulty easy --provider aws --tag s3 --limit 5

# Plan-only mode (skips apply and validation, faster but less thorough)
uv run python -m terraform_llm.cli benchmark dataset/ -o output --no-run-apply

# Skip code generation — reuse .tf files already in the output directory
uv run python -m terraform_llm.cli benchmark dataset/ -o output --skip-generation

# Custom Docker images
uv run python -m terraform_llm.cli benchmark dataset/ -o output \
  --terraform-image hashicorp/terraform:1.6 \
  --localstack-image localstack/localstack:3.0

Options:

| Flag | Default | Description |
|------|---------|-------------|
| DATASET (argument) | required | Path to JSONL dataset file or directory |
| -o, --output-dir | output | Output directory |
| --model | anthropic/claude-3-5-sonnet-20241022 | Model identifier (any litellm-supported model) |
| --docker / --no-docker | --docker | Use Docker + LocalStack for isolated execution |
| --terraform-image | hashicorp/terraform:latest | Docker image for Terraform |
| --localstack-image | localstack/localstack:latest | Docker image for LocalStack |
| --run-apply / --no-run-apply | --run-apply | Run terraform apply (creates infrastructure) |
| --skip-generation | off | Reuse existing .tf files from the output directory |
| -i, --instance-id | none | Run a specific instance by ID |
| --difficulty | none | Filter by difficulty (easy, medium, hard) |
| -p, --provider | none | Filter by cloud provider (aws, azure, gcp) |
| --tag | none | Filter by tags (repeatable) |
| --limit | none | Max number of instances to run |
| --temperature | 0.0 | Model temperature |
| -v, --verbose | off | Verbose output |

generate — Generate Terraform code from a prompt

Generate Terraform files from a natural language description, optionally validating them.

# Generate and validate
uv run python -m terraform_llm.cli generate "Create an S3 bucket with versioning" -o output

# Generate without validation
uv run python -m terraform_llm.cli generate "Create a VPC with public subnets" -o output --no-validate

# Use a different model
uv run python -m terraform_llm.cli generate "Create a Lambda function" -o output --model openai/gpt-4o

list — List instances in a dataset

uv run python -m terraform_llm.cli list dataset/

traces — View execution traces

uv run python -m terraform_llm.cli traces output/

Python API

Run the benchmark

from terraform_llm.datasets import load_dataset
from terraform_llm.agent import ModelConfig, EvalConfig, run_benchmark

dataset = load_dataset("data/benchmark.jsonl")

# Evaluate any model — litellm supports OpenAI, Anthropic, Ollama, HuggingFace, etc.
model = ModelConfig(model="anthropic/claude-sonnet-4-5-20250929")
config = EvalConfig(run_apply=True)  # full apply mode (deploys to LocalStack)

report = run_benchmark(dataset, model, config)
print(f"Mean score: {report.mean_score:.2f}")
print(f"Stage pass rates: {report.stage_pass_rates()}")

Run with Docker + LocalStack

config = EvalConfig(
    run_apply=True,
    use_docker=True,
    terraform_image="hashicorp/terraform:latest",
    localstack_image="localstack/localstack:latest",
)

report = run_benchmark(dataset, model, config)

Compare models

models = [
    ModelConfig(model="anthropic/claude-sonnet-4-5-20250929"),
    ModelConfig(model="openai/gpt-4o"),
    ModelConfig(model="ollama/llama3:8b"),       # local small model
    ModelConfig(model="ollama/codellama:13b"),    # local code model
]

for model in models:
    report = run_benchmark(dataset, model, EvalConfig())
    print(f"{model.model}: {report.mean_score:.2f}")

Benchmark instance format

Each instance in the JSONL dataset looks like:

{
  "instance_id": "terraform-aws-s3-001",
  "problem_statement": "Create an S3 bucket with versioning enabled and server-side encryption using AES256",
  "difficulty": "easy",
  "tags": ["aws", "s3", "storage"],
  "provider": "aws",
  "region": "us-east-1",
  "expected_resources": {"aws_s3_bucket": 1, "aws_s3_bucket_versioning": 1, "aws_s3_bucket_server_side_encryption_configuration": 1},
  "validation_script": "scripts/validate_s3.sh",
  "metadata": {"estimated_cost": "$0.02/month", "deployment_time_seconds": 30},
  "gold_solution": {"main.tf": "resource \"aws_s3_bucket\" ..."},
  "hints": ["Use aws_s3_bucket_versioning as a separate resource"]
}
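The authoritative schema lives in terraform_llm/datasets/schema.py; as a self-contained sketch of how one of these JSONL lines maps onto a typed record (field names taken from the example above, the dataclass itself is my illustration), consider:

```python
import json
from dataclasses import dataclass, field

@dataclass
class BenchmarkInstance:
    # Required fields mirror the JSONL example; optional fields default
    # sensibly so sparse instances still parse.
    instance_id: str
    problem_statement: str
    difficulty: str
    expected_resources: dict
    provider: str = "aws"
    tags: list = field(default_factory=list)
    hints: list = field(default_factory=list)

def parse_instance(line: str) -> BenchmarkInstance:
    raw = json.loads(line)
    return BenchmarkInstance(
        instance_id=raw["instance_id"],
        problem_statement=raw["problem_statement"],
        difficulty=raw["difficulty"],
        expected_resources=raw["expected_resources"],
        provider=raw.get("provider", "aws"),
        tags=raw.get("tags", []),
        hints=raw.get("hints", []),
    )
```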

Evaluation modes

| Mode | Flag | What happens |
|------|------|--------------|
| Full apply (default) | run_apply=True | Deploys real infrastructure, runs validation, destroys after. |
| Plan only | run_apply=False | Runs init/validate/plan. No real infrastructure. Free. |
| Docker (default) | --docker | Runs Terraform in Docker with LocalStack. Isolated, no real AWS costs. |
| Local | --no-docker | Runs Terraform directly on the host. Requires Terraform CLI installed. |

Scoring

The benchmark uses graded scoring, not binary pass/fail:

  • init (10%): Did providers resolve correctly?
  • validate (20%): Is the HCL syntactically and semantically valid?
  • plan (40%): Do the planned resources match what's expected? Partial credit for partial matches.
  • apply (20%): Did the infrastructure deploy successfully?
  • validation script (10%): Does the deployed infrastructure actually work as intended?

A model that gets to plan with correct resources but fails apply scores much higher than one that can't even pass validate.
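The weights above imply a simple weighted sum. As an illustrative sketch (the real pipeline lives in agent/evaluator.py), this reproduces the claim that a correct-plan-but-failed-apply run beats a run that dies at validate:

```python
# Stage weights from the pipeline diagram above.
STAGE_WEIGHTS = {
    "init": 0.1,
    "validate": 0.2,
    "plan": 0.4,
    "apply": 0.2,
    "validation_script": 0.1,
}

def total_score(stage_scores: dict[str, float]) -> float:
    """Weighted sum of per-stage scores; unreached stages count as 0.0."""
    return sum(w * stage_scores.get(stage, 0.0)
               for stage, w in STAGE_WEIGHTS.items())

# Correct plan, failed apply: 0.1 + 0.2 + 0.4       = 0.7
# Fails validate outright:    0.1 + 0.0             = 0.1
print(total_score({"init": 1.0, "validate": 1.0, "plan": 1.0}))  # 0.7
print(total_score({"init": 1.0}))                                # 0.1
```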

Roadmap

  • Curate initial benchmark dataset (50-100 instances across easy/medium/hard)
  • Baseline frontier models (Claude, GPT-4, Gemini)
  • Baseline open models (Llama, CodeLlama, DeepSeek, Qwen)
  • Fine-tune a small model on gold solutions
  • Evaluate fine-tuned model against baselines
  • Multi-turn agent mode (iterative fix based on terraform errors)

Requirements

  • Python >= 3.12
  • Docker (for default isolated execution with LocalStack)
  • Terraform CLI (only needed with --no-docker mode)
  • API keys for whichever model provider you want to evaluate (set via environment variables)