pymonger/srl-idps-govcloud-terraform

Complete EKS Infrastructure with KEDA, Karpenter, and Airflow

This Terraform configuration provisions an EKS cluster in AWS GovCloud and deploys KEDA, Karpenter, and Airflow onto it, together with the supporting resources they depend on (IAM, EFS, SQS, ECR) and support for a custom Airflow Docker image.

🏗️ Architecture

This Terraform configuration is designed for AWS GovCloud environments, where internet access is restricted. All container images are pulled from ECR repositories within your AWS account, so the deployment has no external registry dependencies.

The infrastructure is organized into modular components:

Core Infrastructure Modules

| Module | Purpose | Key Features |
|--------|---------|--------------|
| VPC (./modules/vpc) | Network foundation | Public/private subnets, NAT Gateway, Karpenter discovery tags |
| IAM (./modules/iam) | Identity & access | EKS roles, Karpenter policies, service account permissions |
| EKS (./modules/eks) | Kubernetes cluster | EKS cluster, node groups, OIDC provider via JPL IAM as Code |
| EFS (./modules/efs) | Persistent storage | File system, access points for Airflow DAGs/logs |
| SQS (./modules/sqs) | Message queuing | Karpenter interruption handling with dead letter queue |
| ECR (./modules/ecr) | Container registry | Image repositories with lifecycle policies |
| Kubernetes (./modules/kubernetes) | Application deployment | KEDA, Karpenter, Airflow via Helm charts |

📋 Prerequisites

Required Tools

  • Terraform >= 1.0
  • AWS CLI configured for GovCloud
  • Docker for building custom Airflow image
  • kubectl for cluster interaction
  • helm for chart management
  • pre-commit (optional, for development workflow)

Required AWS Permissions

  • EKS cluster management
  • VPC and networking resources
  • IAM roles and policies
  • EFS file systems
  • SQS queues
  • ECR repositories

Required VPC Configuration ⚠️

CRITICAL: This Terraform configuration requires an existing VPC with specific configuration:

VPC Requirements:

  • VPC Tag: Must have tag JplVpcType = "TGW-Internal"
  • DNS Settings: Both enableDnsSupport and enableDnsHostnames must be enabled
  • Subnet Tags: Private subnets must have tag karpenter.sh/discovery = "<cluster-name>" (replace with your actual cluster name)

Verification Commands:

# Verify VPC exists with correct tag
aws ec2 describe-vpcs --filters "Name=tag:JplVpcType,Values=TGW-Internal" --region us-gov-west-1

# Get VPC ID from the above command output, then verify VPC DNS settings
VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:JplVpcType,Values=TGW-Internal" --query 'Vpcs[0].VpcId' --output text --region us-gov-west-1)
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID --attribute enableDnsSupport --region us-gov-west-1
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID --attribute enableDnsHostnames --region us-gov-west-1

# Verify subnets have Karpenter discovery tags (replace CLUSTER_NAME with your actual cluster name)
CLUSTER_NAME="your-cluster-name"
aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:karpenter.sh/discovery,Values=$CLUSTER_NAME" --region us-gov-west-1
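
The DNS checks above print JSON that has to be read by eye. As a rough sketch, they could be wrapped in a function that fails when an attribute is not enabled. The helper names `is_enabled` and `check_vpc_dns` are hypothetical and not part of this repository. The sketch assumes the AWS CLI prints `True` or `False` for the boolean when `--output text` is used.

```shell
# Sketch only: is_enabled and check_vpc_dns are illustrative helper names,
# not scripts shipped with this repository.

# Succeed only when the value is the AWS CLI text rendering of true.
is_enabled() {
  [ "$1" = "True" ]
}

# check_vpc_dns VPC_ID ATTRIBUTE [REGION]
# Queries one VPC attribute and returns 0 if enabled, 1 if not,
# 2 if the attribute name is not one of the two DNS attributes.
check_vpc_dns() {
  vpc_id="$1"
  attribute="$2"
  region="${3:-us-gov-west-1}"
  # The response key is the attribute name with a capitalized first letter.
  case "$attribute" in
    enableDnsSupport)   key="EnableDnsSupport" ;;
    enableDnsHostnames) key="EnableDnsHostnames" ;;
    *) echo "unknown attribute: $attribute" >&2; return 2 ;;
  esac
  value="$(aws ec2 describe-vpc-attribute --vpc-id "$vpc_id" \
    --attribute "$attribute" --region "$region" \
    --query "${key}.Value" --output text)"
  if is_enabled "$value"; then
    echo "OK: $attribute is enabled on $vpc_id"
  else
    echo "FAIL: $attribute is '$value' on $vpc_id" >&2
    return 1
  fi
}
```

With `VPC_ID` set as shown earlier, a call such as `check_vpc_dns "$VPC_ID" enableDnsSupport && check_vpc_dns "$VPC_ID" enableDnsHostnames` would stop at the first attribute that is disabled.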

Administrative Requirements ⚠️

CRITICAL: Before running this Terraform configuration, your AWS GovCloud administrators must deploy the JPL IAM as Code CloudFormation stack:

Required CloudFormation Stack:

  • Stack Name Pattern: StackSet-jpl-roles-as-code-* (dynamically generated)
  • Purpose: Provides Custom::JplEksFederation resource for OIDC provider creation
  • Deployment: Must be deployed by administrators with elevated permissions
  • Status: Human-in-the-loop process that cannot be automated

Verification Commands:

# Check if the CloudFormation stack exists
aws cloudformation list-stacks --region us-gov-west-1 \
  --query 'StackSummaries[?contains(StackName, `StackSet-jpl-roles-as-code`) && StackStatus==`CREATE_COMPLETE`]'

# Check if custom resource exports are available
aws cloudformation list-exports --region us-gov-west-1 \
  --query 'Exports[?Name==`Custom::JplEksFed::ServiceToken`]'

🔒 GovCloud Compliance

This configuration addresses AWS GovCloud's internet access restrictions by sourcing all container images from ECR repositories within your AWS account.

Image Mirroring Process

Before deployment, mirror all required images to ECR:

# Mirror all required images to ECR
./scripts/mirror-images.sh

Required Images:

  • KEDA: kedacore/keda, kedacore/keda-metrics-apiserver, kedacore/keda-admission-webhooks
  • Karpenter: karpenter/controller, karpenter/webhook
  • Supporting: statsd-exporter, redis, git-sync/git-sync, postgresql
  • Base Images: unity/alpine, unity/busybox, unity/nginx (for sidecars)
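
For readers unfamiliar with image mirroring, the general pull, retag, and push pattern looks roughly like the sketch below. This is not the content of `./scripts/mirror-images.sh`. The function names are hypothetical, and the source registry, tag, and account ID are placeholders you would supply yourself. It also assumes the machine running it can reach the source registry.

```shell
# Sketch of the generic mirroring pattern; not the repository's script.
# ecr_registry and mirror_image are illustrative names.

# ecr_registry ACCOUNT_ID REGION
# Prints the ECR registry hostname for the account and region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# mirror_image SOURCE_IMAGE ECR_REPO TAG REGISTRY
# Pulls SOURCE_IMAGE:TAG, retags it for ECR, and pushes it.
# Stops at the first step that fails.
mirror_image() {
  src="$1"
  repo="$2"
  tag="$3"
  registry="$4"
  docker pull "${src}:${tag}" &&
    docker tag "${src}:${tag}" "${registry}/${repo}:${tag}" &&
    docker push "${registry}/${repo}:${tag}"
}

# Example usage (placeholders, not values from this repository):
#   registry="$(ecr_registry 123456789012 us-gov-west-1)"
#   aws ecr get-login-password --region us-gov-west-1 |
#     docker login --username AWS --password-stdin "$registry"
#   mirror_image SOURCE_REGISTRY/kedacore/keda kedacore/keda SOURCE_TAG "$registry"
```

The target ECR repository must already exist before the push. In this configuration the repositories are created by the ECR module.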

🛠️ Development Setup

Pre-commit Hooks (Recommended)

# Setup pre-commit hooks and tools
./scripts/setup-pre-commit.sh

Available Checks:

  • Terraform: Formatting, validation, security scanning
  • Security: Credential detection, private key detection
  • Documentation: Markdown linting, YAML validation
  • Code Quality: JSON validation, trailing whitespace

🚀 Quick Start

1. Verify Administrative Setup

# Ensure CloudFormation stack is deployed
aws cloudformation list-stacks --region us-gov-west-1 \
  --query 'StackSummaries[?contains(StackName, `StackSet-jpl-roles-as-code`) && StackStatus==`CREATE_COMPLETE`]'

# Verify VPC exists and get its ID
aws ec2 describe-vpcs --filters "Name=tag:JplVpcType,Values=TGW-Internal" --region us-gov-west-1

# Get VPC ID and verify DNS settings are enabled (required for EKS)
VPC_ID=$(aws ec2 describe-vpcs --filters "Name=tag:JplVpcType,Values=TGW-Internal" --query 'Vpcs[0].VpcId' --output text --region us-gov-west-1)
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID --attribute enableDnsSupport --region us-gov-west-1
aws ec2 describe-vpc-attribute --vpc-id $VPC_ID --attribute enableDnsHostnames --region us-gov-west-1

# Verify subnets have required Karpenter discovery tags (replace CLUSTER_NAME with your actual cluster name)
CLUSTER_NAME="your-cluster-name"
aws ec2 describe-subnets --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:karpenter.sh/discovery,Values=$CLUSTER_NAME" --region us-gov-west-1

2. Mirror Images (GovCloud Requirement)

./scripts/mirror-images.sh

3. Build Custom Airflow Image

./scripts/build-airflow-image.sh

4. Deploy Infrastructure

terraform init
terraform plan
terraform apply

5. Verify Deployment

# Check cluster status
kubectl get nodes

# Check application deployments
kubectl get pods -n keda
kubectl get pods -n karpenter
kubectl get pods -n sps

# Check Karpenter resources
kubectl get nodepools
kubectl get ec2nodeclasses

6. Access Airflow

kubectl port-forward -n sps svc/airflow-webserver 8080:8080

Then open http://localhost:8080 in your browser.

⚙️ Configuration

Customize the deployment by modifying terraform.tfvars:

EKS Configuration

cluster_name = "your-cluster-name"
kubernetes_version = "1.32"
vpc_cidr = "10.0.0.0/16"
node_group_instance_types = ["t3.medium"]
node_group_desired_size = 4

Application Configuration

  • KEDA: Autoscaling for Airflow workers (1-10 replicas)
  • Karpenter: Node provisioning with c6i.large instances
    • Uses subnets and security groups tagged with karpenter.sh/discovery
    • Supports both Spot and On-Demand instances
    • Automatic node lifecycle management
  • Airflow: Custom image with Unity SPS plugins
  • EFS: Persistent storage for DAGs, logs, and shared data
  • SQS: Interruption handling for Karpenter

📊 Outputs

The configuration provides comprehensive outputs for integration with other systems:

EKS Outputs

  • cluster_id, cluster_endpoint, cluster_certificate_authority_data
  • vpc_id, private_subnet_ids, public_subnet_ids

Storage and Queue Outputs

  • efs_file_system_id, efs_security_group_id
  • karpenter_queue_url, karpenter_queue_arn

ECR Outputs

  • airflow_repository_url, karpenter_controller_repository_url
  • keda_operator_repository_url

IAM Outputs

  • karpenter_controller_role_arn
  • karpenter_node_instance_profile_name

🔐 Security Features

  • Private subnets for worker nodes
  • Security groups with minimal required access
  • IAM roles with least privilege policies
  • OIDC provider for secure service account integration
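  • Public access to the EKS cluster can be restricted to specified CIDR blocks via public_access_cidrs (the default, 0.0.0.0/0, allows any source address)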
  • Encrypted EFS file system
  • SQS queue policies for secure message handling
  • ECR image scanning enabled

💰 Cost Considerations

  • NAT Gateway: ~$0.045/hour
  • EKS cluster: ~$0.10/hour
  • Worker nodes: Based on instance type and usage
  • EFS: Storage and throughput costs
  • SQS: Per-message charges
  • ECR: Storage costs
  • Cost Optimization: Karpenter can provision Spot instances
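
To turn the two fixed hourly figures above into a monthly number, a small helper like the one below can be used. It is a sketch that takes the listed rates at face value and assumes about 730 hours per month, which is an assumption rather than a figure from this README. Actual GovCloud pricing may differ, and usage-based items (worker nodes, EFS, SQS, ECR) are not included.

```shell
# Sketch: monthly_cost is an illustrative helper, not part of this repository.

# monthly_cost HOURLY_RATE [HOURS]
# Multiplies an hourly rate by the number of hours (default 730, an assumed
# average month) and prints the result with two decimal places.
monthly_cost() {
  awk -v rate="$1" -v hours="${2:-730}" 'BEGIN { printf "%.2f\n", rate * hours }'
}

# Example usage with the hourly figures listed above:
#   monthly_cost 0.045   # NAT Gateway
#   monthly_cost 0.10    # EKS cluster
#   monthly_cost 0.145   # both combined
```

Under these assumptions the NAT Gateway and EKS cluster together come to roughly $106 per month before any usage-based charges.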

🔧 Maintenance

  • Kubernetes Updates: Plan carefully for version upgrades
  • Node Groups: Zero-downtime updates supported
  • Security Groups: Can be modified without cluster downtime
  • IAM Policies: Can be updated without affecting workloads
  • ECR Lifecycle: Automatic cleanup of old images
  • Karpenter: Automatic node lifecycle management
  • KEDA: Automatic scaling based on workload demand

🐛 Troubleshooting

OIDC Provider Issues

Common Error Messages:

  • No exports found for Custom::JplEksFed::ServiceToken: CloudFormation stack not deployed
  • error creating CloudFormation stack: Insufficient permissions
  • No IAM OpenID Connect Provider found: OIDC provider doesn't exist
  • Access Denied: Insufficient permissions

Resolution Steps:

  1. Verify CloudFormation stack deployment
  2. Check custom resource exports availability
  3. Ensure AWS credentials have necessary permissions
  4. Re-run terraform plan and terraform apply

Getting Stack Name:

aws cloudformation list-stacks --region us-gov-west-1 \
  --query 'StackSummaries[?contains(StackName, `StackSet-jpl-roles-as-code`)].{StackName:StackName,Status:StackStatus}' \
  --output table

Karpenter Issues

Check Discovery Tags:

# Verify subnets have karpenter.sh/discovery tags
aws ec2 describe-subnets --region us-gov-west-1 \
  --filters "Name=tag:karpenter.sh/discovery,Values=your-cluster-name"

# Verify security groups have karpenter.sh/discovery tags
aws ec2 describe-security-groups --region us-gov-west-1 \
  --filters "Name=tag:karpenter.sh/discovery,Values=your-cluster-name"
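
The two commands above return full JSON descriptions. When only "is anything tagged?" matters, counting the matches can be quicker. The following is a sketch with hypothetical helper names. It relies on the JMESPath `length()` function in `--query` and assumes the result fits in a single page of API output.

```shell
# Sketch: these helper names are illustrative, not part of this repository.

# is_positive_int N
# Succeeds when N consists only of digits and is greater than zero.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$1" -gt 0 ]
}

# count_discovery_subnets CLUSTER_NAME [REGION]
# Prints how many subnets carry the karpenter.sh/discovery tag for the cluster.
count_discovery_subnets() {
  aws ec2 describe-subnets --region "${2:-us-gov-west-1}" \
    --filters "Name=tag:karpenter.sh/discovery,Values=$1" \
    --query 'length(Subnets)' --output text
}

# count_discovery_security_groups CLUSTER_NAME [REGION]
# Same check for security groups.
count_discovery_security_groups() {
  aws ec2 describe-security-groups --region "${2:-us-gov-west-1}" \
    --filters "Name=tag:karpenter.sh/discovery,Values=$1" \
    --query 'length(SecurityGroups)' --output text
}

# Example usage:
#   is_positive_int "$(count_discovery_subnets your-cluster-name)" ||
#     echo "no subnets tagged for Karpenter discovery" >&2
```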

AWS Auth Configuration:

# Check if Karpenter node role is in aws-auth
kubectl get configmap aws-auth -n kube-system -o yaml

# The aws-auth ConfigMap is automatically managed by Terraform
# If Karpenter nodes can't authenticate, verify the role ARN is correct
# and the role has the necessary permissions

Common Karpenter Node Issues:

  • Authentication failures: Check aws-auth ConfigMap configuration
  • API server connectivity: Verify security groups and subnet routing
  • Node not joining: Check IAM role permissions and instance profile
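
For the authentication case, checking the aws-auth ConfigMap by eye can be replaced with a small test for a specific role ARN. This is a sketch with hypothetical helper names. The role ARN is something you supply yourself, since the node role ARN is not among the outputs listed in this README.

```shell
# Sketch: maproles_has_role and check_aws_auth_role are illustrative names,
# not part of this repository.

# maproles_has_role ROLE_ARN
# Reads mapRoles YAML on stdin and succeeds when ROLE_ARN appears as a whole
# whitespace-separated token (quotes around the ARN are ignored).
maproles_has_role() {
  tr -d "\"'" | awk -v arn="$1" '
    { for (i = 1; i <= NF; i++) if ($i == arn) found = 1 }
    END { exit !found }'
}

# check_aws_auth_role ROLE_ARN
# Fetches mapRoles from the aws-auth ConfigMap and tests for the role.
check_aws_auth_role() {
  kubectl get configmap aws-auth -n kube-system \
    -o jsonpath='{.data.mapRoles}' | maproles_has_role "$1"
}

# Example usage (placeholder ARN in the GovCloud partition):
#   check_aws_auth_role "arn:aws-us-gov:iam::123456789012:role/YOUR_NODE_ROLE" ||
#     echo "role not mapped in aws-auth" >&2
```

Matching whole tokens rather than substrings avoids a false positive when one role name is a prefix of another.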

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run pre-commit hooks: pre-commit run --all-files
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.0 |
| aws | ~> 5.0 |
| helm | ~> 2.12 |
| kubernetes | ~> 2.25 |
| null | ~> 3.0 |
| tls | ~> 4.0 |

Providers

| Name | Version |
|------|---------|
| aws | 5.100.0 |
| null | 3.2.4 |

Modules

| Name | Source | Version |
|------|--------|---------|
| ecr | ./modules/ecr | n/a |
| efs | ./modules/efs | n/a |
| eks | ./modules/eks | n/a |
| iam | ./modules/iam | n/a |
| kubernetes | ./modules/kubernetes | n/a |
| sqs | ./modules/sqs | n/a |

Resources

| Name | Type |
|------|------|
| null_resource.validate_subnets | resource |
| aws_caller_identity.current | data source |
| aws_region.current | data source |
| aws_subnets.private | data source |
| aws_vpc.existing | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| cluster_name | Name of the EKS cluster | string | "gman-test" | no |
| deploy_dags | Whether to deploy the default DAGs (rdrgen, edrgen, vic2png) to Airflow | bool | true | no |
| kubernetes_version | Kubernetes version for the EKS cluster | string | "1.32" | no |
| node_group_desired_size | Desired number of nodes in the node group | number | 4 | no |
| node_group_instance_types | Instance types for the node group | list(string) | ["t3.medium"] | no |
| node_group_max_size | Maximum number of nodes in the node group | number | 4 | no |
| node_group_min_size | Minimum number of nodes in the node group | number | 0 | no |
| public_access_cidrs | CIDR blocks for public access to EKS cluster | list(string) | ["0.0.0.0/0"] | no |
| service_ipv4_cidr | CIDR block for Kubernetes services | string | "10.100.0.0/16" | no |
| tags | Tags to apply to all resources | map(string) | {"Environment": "gman-test", "ManagedBy": "terraform", "Owner": "account-managed"} | no |

Outputs

| Name | Description |
|------|-------------|
| airflow_release_name | Airflow Helm release name |
| airflow_repository_url | Airflow ECR repository URL |
| alpine_repository_url | Alpine base image ECR repository URL |
| aws_ebs_csi_driver_repository_url | AWS EBS CSI Driver ECR repository URL |
| busybox_repository_url | Busybox base image ECR repository URL |
| cluster_arn | EKS cluster ARN |
| cluster_certificate_authority_data | EKS cluster certificate authority data |
| cluster_endpoint | EKS cluster endpoint |
| cluster_id | EKS cluster ID |
| cluster_name | EKS cluster name |
| cluster_oidc_issuer_url | EKS cluster OIDC issuer URL |
| cluster_security_group_id | EKS cluster security group ID |
| edrgen_repository_url | Unity EDRGEN ECR repository URL |
| efs_file_system_id | EFS file system ID |
| efs_security_group_id | EFS security group ID |
| eks_pause_repository_url | EKS pause image ECR repository URL |
| external_attacher_repository_url | External Attacher ECR repository URL |
| external_provisioner_repository_url | External Provisioner ECR repository URL |
| external_resizer_repository_url | External Resizer ECR repository URL |
| karpenter_controller_repository_url | Karpenter controller ECR repository URL |
| karpenter_controller_role_arn | Karpenter controller IAM role ARN |
| karpenter_node_instance_profile_name | Karpenter node instance profile name |
| karpenter_queue_arn | Karpenter interruption queue ARN |
| karpenter_queue_url | Karpenter interruption queue URL |
| karpenter_release_name | Karpenter Helm release name |
| keda_operator_repository_url | KEDA operator ECR repository URL |
| keda_release_name | KEDA Helm release name |
| livenessprobe_repository_url | Liveness Probe ECR repository URL |
| nginx_repository_url | Nginx base image ECR repository URL |
| node_driver_registrar_repository_url | Node Driver Registrar ECR repository URL |
| node_group_arn | EKS node group ARN |
| node_group_id | EKS node group ID |
| node_security_group_id | EKS node security group ID |
| oidc_provider_arn | EKS OIDC provider ARN |
| oidc_provider_stack_arn | CloudFormation stack ARN for OIDC provider |
| private_subnet_ids | Private subnet IDs |
| public_subnet_ids | Public subnet IDs |
| rdrgen_repository_url | Unity RDRGEN ECR repository URL |
| vic2png_repository_url | Unity VIC2PNG ECR repository URL |
| vpc_id | VPC ID |
