Commit d4da95d

Authored by Narunas-K, narunas-castai, dvoros-castai
feat: Add GKE example for GitOps flow with umbrella helm chart (#666)
Co-authored-by: Narunas Kapocius <narunas@cast.ai> Co-authored-by: Daniel Voros <daniel.voros@cast.ai>
1 parent bda395c commit d4da95d

7 files changed, with 202 additions and 0 deletions
Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# GKE + CAST AI GitOps example: umbrella Helm chart

## Overview

This example demonstrates a **GitOps onboarding flow** using the CAST AI umbrella Helm chart (`castai-helm/castai`).
The umbrella chart replaces individual per-component charts and lets you switch between operating modes with a single `helm upgrade` command.

### When is Terraform needed?

| Mode | Terraform required? | What Terraform does |
|---|---|---|
| **Read-only** | No | |
| **Workload Autoscaler** | No | |
| **Node Autoscaler / Full** | **Yes** | Creates a GCP service account with the IAM permissions needed for node provisioning |

> For read-only and workload autoscaler modes you only need a CAST AI API key and Helm. Start there and add Terraform later only if you want node autoscaling.

---

## Umbrella chart modes

The umbrella chart uses **tags** to control which sub-charts are installed.

| Tag | Installed components | Use case |
|---|---|---|
| `tags.readonly=true` | agent, spot-handler, kvisor, gpu-metrics-exporter | Observe the cluster; no changes are made to workloads or nodes |
| `tags.workload-autoscaler=true` | all read-only components, plus cluster-controller, evictor, pod-mutator, workload-autoscaler, workload-autoscaler-exporter | Right-size workload CPU/memory requests automatically |
| `tags.full=true` | all components, including pod-pinner and live | Full node autoscaler + workload autoscaler |

> Only one tag should be `true` at a time. When upgrading modes, use `--reset-then-reuse-values` and flip the tags (see the examples below).
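
The "only one tag is `true`" rule can be enforced by generating the flags rather than typing them. The following is a minimal sketch, not part of the chart: `castai_tag_flags` is a hypothetical helper name, and it assumes the three tag names listed in the table above.

```sh
# Hypothetical helper (not shipped with the chart): print --set flags so that
# exactly one umbrella-chart tag is true and the other two are false.
castai_tag_flags() {
  mode="$1"
  case "$mode" in
    readonly|workload-autoscaler|full) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  flags=""
  for tag in readonly workload-autoscaler full; do
    if [ "$tag" = "$mode" ]; then value=true; else value=false; fi
    flags="$flags --set tags.$tag=$value"
  done
  # Drop the leading space before printing.
  echo "${flags# }"
}

castai_tag_flags workload-autoscaler
# prints: --set tags.readonly=false --set tags.workload-autoscaler=true --set tags.full=false
```

The output can be spliced into an upgrade with an unquoted substitution, for example `helm upgrade castai castai-helm/castai -n castai-agent --reset-then-reuse-values $(castai_tag_flags full)`.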
31+
32+
---
33+
34+
## Prerequisites
35+
36+
- CAST AI account
37+
- CAST AI **organization member API key** from [console.cast.ai → Service Accounts](https://console.cast.ai/organization/management/access-control/service-accounts)
38+
- `castai-helm` Helm repo:
39+
```sh
40+
helm repo add castai-helm https://castai.github.io/helm-charts
41+
helm repo update
42+
```
43+
44+
---
45+
46+
## Step 1 — Install in read-only mode (Helm only)
47+
48+
No Terraform needed. The API key here is the CAST AI **member** key (not a full-access key).
49+
50+
```sh
51+
helm upgrade -i castai castai-helm/castai -n castai-agent --create-namespace \
52+
--set global.castai.apiKey="<your-castai-api-key>" \
53+
--set global.castai.provider="gke" \
54+
--set tags.readonly=true
55+
```
56+
57+
After the pods become ready your cluster appears as **Read only** in the CAST AI console.
58+
CAST AI can now observe the cluster — no changes are made to your workloads or nodes.
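
One way to confirm that the pods are ready before checking the console is to block on `kubectl wait`. This is a sketch only: the function name `wait_for_castai_pods` and the 300-second default timeout are assumptions, and the namespace default is taken from the install command above. If the namespace also holds completed Job pods, a narrower label selector may be needed, since those pods never report `Ready`.

```sh
# Wait until every pod in the release namespace reports Ready, then list them.
# Arguments (both optional): namespace, timeout. The defaults are assumptions.
wait_for_castai_pods() {
  namespace="${1:-castai-agent}"
  timeout="${2:-300s}"
  kubectl wait --for=condition=Ready pods --all -n "$namespace" --timeout="$timeout" &&
    kubectl get pods -n "$namespace"
}

# Usage:
#   wait_for_castai_pods                     # castai-agent namespace, 300s timeout
#   wait_for_castai_pods castai-agent 600s   # longer timeout
```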

---

## Step 2 (optional): Upgrade to Workload Autoscaler (Helm only)

When you are ready to let CAST AI right-size CPU/memory requests for your workloads, upgrade the release.
**No Terraform changes required.**

`--reset-then-reuse-values` resets to the chart's built-in defaults, reapplies the values from the previous release, and then merges in the overrides you specify.

```sh
helm upgrade castai castai-helm/castai -n castai-agent \
  --reset-then-reuse-values \
  --set tags.readonly=false \
  --set tags.workload-autoscaler=true
```
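
Older Helm clients reject `--reset-then-reuse-values`; to the best of our knowledge the flag first shipped in Helm 3.14. The sketch below is a guard you could run before the upgrade. The function name is hypothetical, and the 3.14 threshold is an assumption worth checking against the Helm release notes.

```sh
# Succeed if the given Helm version string (e.g. "v3.14.2") is 3.14 or newer.
# The 3.14 minimum is an assumption based on the Helm release notes.
helm_supports_reset_then_reuse() {
  version="${1#v}"          # strip the leading "v"
  major="${version%%.*}"    # text before the first dot
  rest="${version#*.}"
  minor="${rest%%.*}"       # text between the first and second dot
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]; }
}

# Usage:
#   helm_supports_reset_then_reuse "$(helm version --template '{{.Version}}')" \
#     || echo "Helm is too old for --reset-then-reuse-values" >&2
```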

---

## Step 3 (optional): Upgrade to Full mode / Node Autoscaler (Terraform + Helm)

Full mode enables node provisioning, bin-packing, spot instance handling, eviction, and pod pinning.
This requires a **GCP service account** with the correct IAM permissions, which Terraform creates.

### 3a. Run Terraform

Fill in your values:

```sh
cp tf.vars.example terraform.tfvars
# edit terraform.tfvars
```

Apply:

```sh
terraform init
terraform apply
```

This registers the cluster with CAST AI and creates the GCP service account.

Capture the outputs; you'll need them to configure the Helm release:

```sh
terraform output -raw cluster_id
terraform output -raw cluster_token
```

> `cluster_token` expires after a few hours if no CAST AI component connects. Run the Helm upgrade promptly after this step.
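
Because the token is short-lived, it can help to read both outputs into shell variables and fail early if either is empty. This is a minimal sketch: the function name and the `CASTAI_CLUSTER_ID` / `CASTAI_CLUSTER_TOKEN` variable names are hypothetical, while the output names come from the Terraform configuration in this example.

```sh
# Read both Terraform outputs into shell variables and fail if either is empty.
# Variable names are illustrative; -raw prints the bare string without quotes.
read_castai_outputs() {
  CASTAI_CLUSTER_ID="$(terraform output -raw cluster_id)" || return 1
  CASTAI_CLUSTER_TOKEN="$(terraform output -raw cluster_token)" || return 1
  [ -n "$CASTAI_CLUSTER_ID" ] && [ -n "$CASTAI_CLUSTER_TOKEN" ]
}

# Usage (run from the directory that holds the Terraform state):
#   read_castai_outputs || { echo "missing Terraform outputs" >&2; exit 1; }
```

Keeping the token in a variable instead of echoing it avoids leaking it into terminal scrollback.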

### 3b. Upgrade the Helm release

If you were already running read-only or workload-autoscaler mode, upgrade using `--reset-then-reuse-values`.
If this is a fresh install, use `helm upgrade -i` and pass `cluster_token` and `cluster_id` from step 3a.

```sh
helm upgrade castai castai-helm/castai -n castai-agent \
  --reset-then-reuse-values \
  --set tags.readonly=false \
  --set tags.workload-autoscaler=false \
  --set tags.full=true
```
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
resource "castai_gke_cluster" "this" {
  project_id                 = var.project_id
  location                   = var.cluster_region
  name                       = var.cluster_name
  delete_nodes_on_disconnect = var.delete_nodes_on_disconnect

  credentials_json = module.castai-gke-iam.private_key
}

module "castai-gke-iam" {
  source  = "castai/gke-iam/castai"
  version = "~> 0.5"

  project_id       = var.project_id
  gke_cluster_name = var.cluster_name
}
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
output "cluster_id" {
2+
value = castai_gke_cluster.this.id
3+
description = "CAST AI cluster ID."
4+
}
5+
6+
output "cluster_token" {
7+
value = castai_gke_cluster.this.cluster_token
8+
description = "CAST AI cluster token used by Castware to authenticate to Mothership."
9+
sensitive = true
10+
}
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
provider "castai" {
2+
api_url = var.castai_api_url
3+
api_token = var.castai_api_token
4+
}
Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
castai_api_token = "PLACEHOLDER"
2+
project_id = "PLACEHOLDER"
3+
cluster_region = "PLACEHOLDER" # e.g. "us-central1" or "us-central1-a"
4+
cluster_name = "PLACEHOLDER"
5+
subnets = ["PLACEHOLDER", "PLACEHOLDER"] # e.g. ["default"]
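
As an alternative to writing the API token into `terraform.tfvars`, Terraform also reads input variables from `TF_VAR_<name>` environment variables. The sketch below assumes the token is already available in a variable called `CASTAI_API_TOKEN`; that name and the function name are hypothetical and chosen only for illustration.

```sh
# Export the CAST AI token as a Terraform input variable instead of storing it
# in terraform.tfvars. CASTAI_API_TOKEN is a hypothetical source variable.
export_castai_token() {
  if [ -z "${CASTAI_API_TOKEN:-}" ]; then
    echo "CASTAI_API_TOKEN is not set" >&2
    return 1
  fi
  # Terraform maps TF_VAR_castai_api_token to var.castai_api_token.
  export TF_VAR_castai_api_token="$CASTAI_API_TOKEN"
}

# Usage:
#   export_castai_token && terraform apply
```

If you use this approach, remove the `castai_api_token` line from `terraform.tfvars`, because values in that file take precedence over environment variables.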
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
variable "project_id" {
2+
type = string
3+
description = "GCP project ID in which GKE cluster is located."
4+
}
5+
6+
variable "cluster_name" {
7+
type = string
8+
description = "GKE cluster name in GCP project."
9+
}
10+
11+
variable "cluster_region" {
12+
type = string
13+
description = "Region of the cluster to be connected to CAST AI."
14+
}
15+
16+
variable "subnets" {
17+
type = list(string)
18+
description = "Subnet IDs used by CAST AI to provision nodes."
19+
}
20+
21+
variable "delete_nodes_on_disconnect" {
22+
type = bool
23+
description = "Optionally delete Cast AI created nodes when the cluster is destroyed."
24+
default = false
25+
}
26+
27+
variable "castai_api_token" {
28+
type = string
29+
description = "CAST AI API token created in console.cast.ai API Access keys section."
30+
}
31+
32+
variable "castai_api_url" {
33+
type = string
34+
description = "CAST AI api url."
35+
default = "https://api.cast.ai"
36+
}
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
terraform {
2+
required_version = ">= 1.3.2"
3+
4+
required_providers {
5+
castai = {
6+
source = "castai/castai"
7+
version = ">= 3.11.0"
8+
}
9+
}
10+
}
