castai/terraform-castai-gke-cluster
Terraform module for connecting a GKE cluster to CAST AI

Website: https://www.cast.ai

Using the module

A module to connect a GKE cluster to CAST AI.

Requires the castai/castai and hashicorp/google providers to be configured.

For Phase 2 onboarding, credentials from the terraform-gke-iam module are required.
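The two required providers could be declared along these lines. This is a minimal sketch: the version constraints mirror the Requirements table below, and the variable names (`castai_api_token`, `project_id`) are illustrative.

```hcl
terraform {
  required_providers {
    castai = {
      source  = "castai/castai"
      version = ">= 8.3"
    }
    google = {
      source  = "hashicorp/google"
      version = ">= 2.49"
    }
  }
}

provider "castai" {
  # API token created in the console.cast.ai API Access keys section
  api_token = var.castai_api_token
}

provider "google" {
  project = var.project_id
}
```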

module "castai_gke_cluster" {
  source = "castai/gke-cluster/castai"

  project_id           = var.project_id
  gke_cluster_name     = var.cluster_name
  gke_cluster_location = module.gke.location # cluster region or zone

  gke_credentials            = module.castai_gke_iam.private_key
  delete_nodes_on_disconnect = var.delete_nodes_on_disconnect

  default_node_configuration = module.castai_gke_cluster.castai_node_configurations["default"]

  node_configurations = {
    default = {
      disk_cpu_ratio = 25
      subnets        = [module.vpc.subnets_ids[0]]
      tags = {
        "node-config" : "default"
      }

      max_pods_per_node = 110
      network_tags      = ["dev"]
      disk_type         = "pd-balanced"

    }
  }
  node_templates = {
    spot_tmpl = {
      configuration_id = module.castai_gke_cluster.castai_node_configurations["default"]

      should_taint = true

      custom_labels = {
        custom-label-key-1 = "custom-label-value-1"
        custom-label-key-2 = "custom-label-value-2"
      }

      custom_taints = [
        {
          key   = "custom-taint-key-1"
          value = "custom-taint-value-1"
        },
        {
          key   = "custom-taint-key-2"
          value = "custom-taint-value-2"
        }
      ]

      constraints = {
        fallback_restore_rate_seconds = 1800
        spot                          = true
        use_spot_fallbacks            = true
        min_cpu                       = 4
        max_cpu                       = 100
        instance_families = {
          exclude = ["e2"]
        }
        compute_optimized_state = "disabled"
        storage_optimized_state = "disabled"
        is_gpu_only             = false
        architectures           = ["amd64"]
      }

      gpu = {
        default_shared_clients_per_gpu = 9
        sharing_strategy               = "time-slicing"
        user_managed_gpu_drivers       = false

        sharing_configuration = [
          {
            gpu_name = "nvidia-a100-80gb"
            shared_clients_per_gpu = 11
          },
          {
            gpu_name = "nvidia-l4"
            shared_clients_per_gpu = 5
          },
          {
            gpu_name = "nvidia-tesla-t4"
            shared_clients_per_gpu = 3
          }
        ]
      }

      custom_instances_enabled                      = true
      custom_instances_with_extended_memory_enabled = true
    }
  }

  autoscaler_settings = {
    enabled                                 = true
    node_templates_partial_matching_enabled = false

    unschedulable_pods = {
      enabled = true
    }

    node_downscaler = {
      enabled = true

      empty_nodes = {
        enabled = true
      }

      evictor = {
        aggressive_mode           = false
        cycle_interval            = "5m10s"
        dry_run                   = false
        enabled                   = true
        node_grace_period_minutes = 10
        scoped_mode               = false
      }
    }

    cluster_limits = {
      enabled = true

      cpu = {
        max_cores = 20
        min_cores = 1
      }
    }
  }

  workload_scaling_policies = {
    default = {
      apply_type        = "IMMEDIATE"
      management_option = "MANAGED"

      cpu = {
        function                 = "QUANTILE"
        args                     = ["0.9"]
        overhead                 = 0.15
        look_back_period_seconds = 172800
        min                      = 0.1
        max                      = 2.0
      }

      memory = {
        function                 = "MAX"
        overhead                 = 0.35
        look_back_period_seconds = 172800

        limit = {
          type = "NO_LIMIT"
        }
      }

      assignment_rules = {
        rules = [
          {
            namespace = {
              names = ["default", "kube-system"]
            }
          },
          {
            workload = {
              gvk = ["Deployment", "StatefulSet"]
              labels_expressions = [
                {
                  key      = "region"
                  operator = "NotIn"
                  values   = ["eu-west-1", "eu-west-2"]
                },
                {
                  key      = "helm.sh/chart"
                  operator = "Exists"
                }
              ]
            }
          }
        ]
      }

      startup = {
        period_seconds = 300
      }

      predictive_scaling = {
        cpu = {
          enabled = true
        }
      }
    }
  }
}

Migrating from 3.x.x to 4.x.x

Version 4.x.x changes:

  • Removed the custom_label attribute from the castai_node_template resource. Use custom_labels instead.

Old configuration:

module "castai-gke-cluster" {
  node_templates = {
    spot_tmpl = {
      custom_label = {
        key = "custom-label-key-1"
        value = "custom-label-value-1"
      }
    }
  }
}

New configuration:

module "castai-gke-cluster" {
  node_templates = {
    spot_tmpl = {
      custom_labels = {
        custom-label-key-1 = "custom-label-value-1"
      }
    }
  }
}

Migrating from 4.x.x to 5.x.x

Version 5.x.x changes:

  • Removed the compute_optimized and storage_optimized attributes from the constraints object of the castai_node_template resource. Use compute_optimized_state and storage_optimized_state instead.

Old configuration:

module "castai-gke-cluster" {
  node_templates = {
    spot_tmpl = {
      constraints = {
        compute_optimized = false
        storage_optimized = true
      }
    }
  }
}

New configuration:

module "castai-gke-cluster" {
  node_templates = {
    spot_tmpl = {
      constraints = {
        compute_optimized_state = "disabled"
        storage_optimized_state = "enabled"
      }
    }
  }
}

Migrating from 6.1.x to 6.3.x

Version 6.3.x changes:

  • Deprecated the autoscaler_policies_json attribute. Use autoscaler_settings instead.

Old configuration:

module "castai-gke-cluster" {
  autoscaler_policies_json = <<-EOT
    {
        "enabled": true,
        "unschedulablePods": {
            "enabled": true
        },
        "nodeDownscaler": {
            "enabled": true,
            "emptyNodes": {
                "enabled": true
            },
            "evictor": {
                "aggressiveMode": false,
                "cycleInterval": "5m10s",
                "dryRun": false,
                "enabled": true,
                "nodeGracePeriodMinutes": 10,
                "scopedMode": false
            }
        },
        "nodeTemplatesPartialMatchingEnabled": false,
        "clusterLimits": {
            "cpu": {
                "maxCores": 20,
                "minCores": 1
            },
            "enabled": true
        }
    }
  EOT
}

New configuration:

module "castai-gke-cluster" {
  autoscaler_settings = {
    enabled                                 = true
    node_templates_partial_matching_enabled = false

    unschedulable_pods = {
      enabled = true
    }

    node_downscaler = {
      enabled = true

      empty_nodes = {
        enabled = true
      }

      evictor = {
        aggressive_mode           = false
        cycle_interval            = "5m10s"
        dry_run                   = false
        enabled                   = true
        node_grace_period_minutes = 10
        scoped_mode               = false
      }
    }

    cluster_limits = {
      enabled = true

      cpu = {
        max_cores = 20
        min_cores = 1
      }
    }
  }
}

Migrating from 9.x.x to 10.x.x

Version 10.x.x removes deprecated fields. These settings should now be configured via node_templates constraints.

Removed Fields

The autoscaler_policies_json variable has been removed. Use autoscaler_settings instead.

The following fields have been removed from autoscaler_settings:

| Removed Field | Migration Path |
|---------------|----------------|
| unschedulable_pods.custom_instances_enabled | Use node_templates.&lt;name&gt;.custom_instances_enabled |
| unschedulable_pods.headroom | Deploy low-priority placeholder workloads (docs) |
| unschedulable_pods.headroom_spot | Deploy low-priority placeholder workloads (docs) |
| unschedulable_pods.node_constraints | Use node_templates.&lt;name&gt;.constraints (min_cpu, max_cpu, min_memory, max_memory) |
| spot_instances.enabled | Use node_templates.&lt;name&gt;.constraints.spot |
| spot_instances.spot_backups | Use node_templates.&lt;name&gt;.constraints.use_spot_fallbacks and fallback_restore_rate_seconds |
| spot_instances.spot_diversity_enabled | Use node_templates.&lt;name&gt;.constraints.enable_spot_diversity |
| spot_instances.spot_diversity_price_increase_limit | Use node_templates.&lt;name&gt;.constraints.spot_diversity_price_increase_limit_percent |
| spot_instances.spot_interruption_predictions | Use node_templates.&lt;name&gt;.constraints.spot_interruption_predictions_enabled and spot_interruption_predictions_type |

Migration Example

Old configuration:

module "castai-gke-cluster" {
  source = "castai/gke-cluster/castai"

  autoscaler_settings = {
    enabled = true

    unschedulable_pods = {
      enabled                  = true
      custom_instances_enabled = true

      headroom = {
        enabled           = true
        cpu_percentage    = 10
        memory_percentage = 10
      }

      node_constraints = {
        min_cpu_cores = 4
        max_cpu_cores = 32
      }
    }

    spot_instances = {
      enabled = true
      spot_backups = {
        enabled = true
      }
    }
  }
}

New configuration:

module "castai-gke-cluster" {
  source = "castai/gke-cluster/castai"

  autoscaler_settings = {
    enabled = true

    unschedulable_pods = {
      enabled = true
    }
  }

  node_templates = {
    default_by_castai = {
      configuration_id = module.castai-gke-cluster.castai_node_configurations["default"]
      is_default       = true

      custom_instances_enabled = true

      constraints = {
        min_cpu            = 4
        max_cpu            = 32
        spot               = true
        use_spot_fallbacks = true
      }
    }
  }
}

Headroom Migration

Headroom functionality has been replaced with the recommended approach of deploying low-priority placeholder workloads. This provides more flexibility and follows Kubernetes native patterns.

See the CAST AI documentation on maintaining cluster headroom for detailed instructions.

Example placeholder deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: headroom-placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: headroom-placeholder
  template:
    metadata:
      labels:
        app: headroom-placeholder
    spec:
      priorityClassName: low-priority  # Reference a PriorityClass with a low priority value
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "2"      # Adjust based on desired headroom
            memory: "4Gi"
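The deployment above assumes a low-priority PriorityClass already exists. A minimal sketch of one (the name `low-priority` and the value shown are examples, not prescribed by CAST AI):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: -10            # Lower than workload priorities, so placeholders are preempted first
globalDefault: false  # Only pods that explicitly reference this class use it
description: "Placeholder pods reserving cluster headroom; evicted when real workloads need capacity."
```

When real workloads become unschedulable, the scheduler preempts these placeholder pods first, freeing the reserved capacity while the autoscaler provisions replacement nodes.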

Examples

Usage examples are located in the Terraform provider repository.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.13 |
| castai | >= 8.3 |
| google | >= 2.49 |
| helm | >= 3.0.0 |
| null | >= 3.0 |

Providers

| Name | Version |
|------|---------|
| castai | 7.61.0 |
| google | 6.46.0 |
| helm | 3.0.2 |
| null | >= 3.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| castai_omni_cluster | castai/omni-cluster/castai | ~> 2.0 |

Resources

| Name | Type |
|------|------|
| castai_autoscaler.castai_autoscaler_policies | resource |
| castai_gke_cluster.castai_cluster | resource |
| castai_node_configuration.this | resource |
| castai_node_configuration_default.this | resource |
| castai_node_template.this | resource |
| castai_workload_scaling_policy.this | resource |
| helm_release.castai_agent | resource |
| helm_release.castai_ai_optimizer_proxy | resource |
| helm_release.castai_ai_optimizer_proxy_self_managed | resource |
| helm_release.castai_cluster_controller | resource |
| helm_release.castai_cluster_controller_self_managed | resource |
| helm_release.castai_evictor | resource |
| helm_release.castai_evictor_ext | resource |
| helm_release.castai_evictor_self_managed | resource |
| helm_release.castai_kvisor | resource |
| helm_release.castai_kvisor_self_managed | resource |
| helm_release.castai_pod_mutator | resource |
| helm_release.castai_pod_mutator_self_managed | resource |
| helm_release.castai_pod_pinner | resource |
| helm_release.castai_pod_pinner_self_managed | resource |
| helm_release.castai_spot_handler | resource |
| helm_release.castai_workload_autoscaler | resource |
| helm_release.castai_workload_autoscaler_self_managed | resource |
| helm_release.castai_workload_autoscaler_exporter | resource |
| helm_release.castai_workload_autoscaler_exporter_self_managed | resource |
| null_resource.wait_for_cluster | resource |
| google_compute_subnetwork.gke_subnet | data source |
| google_container_cluster.gke | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| agent_values | List of YAML formatted string values for agent helm chart | list(string) | [] | no |
| agent_version | Version of castai-agent helm chart. Default latest | string | null | no |
| ai_optimizer_values | List of YAML formatted string with ai-optimizer values | list(string) | [] | no |
| ai_optimizer_version | Version of castai-ai-optimizer helm chart. Default latest | string | null | no |
| api_url | URL of alternative CAST AI API to be used during development or testing | string | "https://api.cast.ai" | no |
| autoscaler_settings | Optional Autoscaler policy definitions to override current autoscaler settings | any | null | no |
| castai_api_token | Optional CAST AI API token created in console.cast.ai API Access keys section. Used only when wait_for_cluster_ready is set to true | string | "" | no |
| castai_components_labels | Optional additional Kubernetes labels for CAST AI pods | map(any) | {} | no |
| castai_components_sets | Optional additional 'set' configurations for every CAST AI Helm release | map(string) | {} | no |
| castware_api_url | URL of CAST AI API to be used from within the cluster by CAST AI applications (Castware). If left empty, api_url will be used within the cluster | string | "" | no |
| cluster_controller_values | List of YAML formatted string values for cluster-controller helm chart | list(string) | [] | no |
| cluster_controller_version | Version of castai-cluster-controller helm chart. Default latest | string | null | no |
| default_node_configuration | ID of the default node configuration | string | "" | no |
| default_node_configuration_name | Name of the default node configuration | string | "" | no |
| delete_nodes_on_disconnect | Optionally delete CAST AI created nodes when the cluster is destroyed | bool | false | no |
| evictor_ext_values | List of YAML formatted string with evictor-ext values | list(string) | [] | no |
| evictor_ext_version | Version of castai-evictor-ext chart. Default latest | string | null | no |
| evictor_values | List of YAML formatted string values for evictor helm chart | list(string) | [] | no |
| evictor_version | Version of castai-evictor chart. Default latest | string | null | no |
| gke_cluster_location | Location of the cluster to be connected to CAST AI. Can be region, or zone for zonal clusters | string | n/a | yes |
| gke_cluster_name | Name of the cluster to be connected to CAST AI | string | n/a | yes |
| gke_credentials | Optional GCP Service account credentials.json | string | n/a | yes |
| grpc_url | gRPC endpoint used by pod-pinner | string | "grpc.cast.ai:443" | no |
| install_ai_optimizer | Optional flag for installation of AI Optimizer (https://docs.cast.ai/docs/getting-started-ai) | bool | false | no |
| install_omni | Optional flag for installation of Omni product | bool | false | no |
| install_pod_mutator | Optional flag for installation of pod mutator | bool | false | no |
| install_security_agent | Optional flag for installation of security agent (Kvisor - https://docs.cast.ai/docs/kvisor) | bool | false | no |
| install_workload_autoscaler | Optional flag for installation of workload autoscaler (https://docs.cast.ai/docs/workload-autoscaling-configuration) | bool | false | no |
| install_workload_autoscaler_exporter | Optional flag for installation of workload autoscaler exporter (custom metrics exporter) | bool | false | no |
| kvisor_controller_extra_args | ⚠️ DEPRECATED: use kvisor_values instead (see example: https://github.com/castai/terraform-provider-castai/tree/master/examples/gke/gke_cluster_with_security/castai.tf). Extra arguments for the kvisor controller. Optionally enable kvisor to lint Kubernetes YAML manifests, scan workload images and check if workloads pass CIS Kubernetes Benchmarks as well as NSA, WASP and PCI recommendations | map(string) | {"image-scan-enabled": "true", "kube-bench-enabled": "true", "kube-linter-enabled": "true"} | no |
| kvisor_grpc_addr | CAST AI Kvisor optimized GRPC API address | string | "kvisor.prod-master.cast.ai:443" | no |
| kvisor_values | List of YAML formatted string values for kvisor helm chart, see example: https://github.com/castai/terraform-provider-castai/tree/master/examples/gke/gke_cluster_with_security/castai.tf | list(string) | [] | no |
| kvisor_version | Version of kvisor chart. If not provided, latest version will be used | string | null | no |
| kvisor_wait | Wait for kvisor chart to finish release | bool | true | no |
| node_configurations | Map of GKE node configurations to create | any | {} | no |
| node_templates | Map of node templates to create | any | {} | no |
| organization_id | DEPRECATED (required only for pod mutator v0.0.25 and older): CAST AI Organization ID | string | "" | no |
| pod_mutator_values | List of YAML formatted string values for pod-mutator helm chart | list(string) | [] | no |
| pod_mutator_version | Version of castai-pod-mutator helm chart. Default latest | string | null | no |
| pod_pinner_values | List of YAML formatted string values for agent helm chart | list(string) | [] | no |
| pod_pinner_version | Version of pod-pinner helm chart. Default latest | string | null | no |
| project_id | The project id from GCP | string | n/a | yes |
| self_managed | Whether CAST AI components' upgrades are managed by a customer; by default upgrades are managed by the CAST AI central system. WARNING: changing this after the module was created is not supported | bool | false | no |
| spot_handler_values | List of YAML formatted string values for spot-handler helm chart | list(string) | [] | no |
| spot_handler_version | Version of castai-spot-handler helm chart. Default latest | string | null | no |
| wait_for_cluster_ready | Wait for cluster to be ready before finishing the module execution; this option requires castai_api_token to be set | bool | false | no |
| workload_autoscaler_values | List of YAML formatted string with cluster-workload-autoscaler values | list(string) | [] | no |
| workload_autoscaler_version | Version of castai-workload-autoscaler helm chart. Default latest | string | null | no |
| workload_autoscaler_exporter_values | List of YAML formatted string with workload-autoscaler-exporter values | list(string) | [] | no |
| workload_autoscaler_exporter_version | Version of castai-workload-autoscaler-exporter helm chart. Default latest | string | null | no |
| workload_scaling_policies | Map of workload scaling policies to create | any | {} | no |

Outputs

| Name | Description |
|------|-------------|
| castai_node_configurations | Map of node configuration IDs by name |
| castai_node_templates | Map of node templates by name |
| cluster_id | CAST AI cluster ID, which can be used for accessing cluster data via the API |
| organization_id | CAST AI organization ID of the cluster |
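For instance, the cluster ID can be surfaced from a root module so other tooling can query the CAST AI API with it (the output name here is illustrative; the module's output is cluster_id as listed above):

```hcl
output "castai_cluster_id" {
  description = "ID of the cluster registered in CAST AI"
  value       = module.castai_gke_cluster.cluster_id
}
```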
