# modules/monitoring

Terraform module that wires up a complete monitoring stack on top of the
`rancher-monitoring` (kube-prometheus-stack) add-on that ships with every
Harvester cluster. A single `module` block deploys PrometheusRules,
Alertmanager configuration, a Google Chat notification relay, and Grafana
dashboards — all configurable via input variables.

## Prerequisites

- `rancher-monitoring` add-on installed on the Harvester cluster
- Google Chat Space with an incoming webhook URL
- `kubectl` available in the Terraform execution environment (used by
  `null_resource` provisioners to patch the Alertmanager Secret)

## Usage

```hcl
module "monitoring" {
  source = "github.com/wso2-enterprise/open-cloud-datacenter//modules/monitoring?ref=v0.4.0"

  environment             = "lk"
  kubeconfig_path         = "/path/to/harvester.kubeconfig"
  kubeconfig_context      = "local"
  google_chat_webhook_url = var.google_chat_webhook_url

  # Optional — show a "View Alert" deep-link button in each notification card.
  # Find this URL: Harvester UI → Add-ons → rancher-monitoring → alert-manager
  # alertmanager_url = "https://<harvester-ip>/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-alertmanager:9093/proxy"
}
```

---

## Architecture

```
Prometheus (rancher-monitoring)
  │ evaluates PrometheusRule CRDs labelled release=rancher-monitoring
  ▼
Alertmanager (rancher-monitoring)
  │ matches severity label → route → receiver "google-chat"
  │ webhook_configs: http://calert.cattle-monitoring-system:6000/create
  ▼
calert (Deployment — ghcr.io/mr-karan/calert)
  │ accepts Alertmanager webhook POST, renders Google Chat Cards v2
  ▼
Google Chat Space (incoming webhook)
```

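The route and receiver in the middle of this chain come down to a small
`alertmanager.yaml` document rendered from Terraform. A sketch of its assumed
shape (receiver name and webhook URL match the diagram above; the grouping
and repeat intervals here are illustrative, not the module's actual values):

```hcl
locals {
  # Sketch: Alertmanager config rendered into the Secret. Only severity-labelled
  # alerts reach the "google-chat" receiver, which forwards to calert.
  alertmanager_yaml = <<-YAML
    route:
      receiver: google-chat
      group_by: ["alertname", "severity"]
      group_wait: 30s
      repeat_interval: 4h
    receivers:
      - name: google-chat
        webhook_configs:
          - url: http://calert.cattle-monitoring-system:6000/create
  YAML
}
```
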
### Resources created

| Resource | Kubernetes kind | Name pattern | Namespace |
|---|---|---|---|
| Alertmanager config | Secret | `alertmanager-rancher-monitoring-alertmanager` | `cattle-monitoring-system` |
| calert config + template | Secret | `calert-config` | `cattle-monitoring-system` |
| calert | Deployment + Service | `calert` | `cattle-monitoring-system` |
| Storage alerts | PrometheusRule | `{env}-harvester-storage-alerts` | `cattle-monitoring-system` |
| VM alerts | PrometheusRule | `{env}-harvester-vm-alerts` | `cattle-monitoring-system` |
| Node alerts | PrometheusRule | `{env}-harvester-node-alerts` | `cattle-monitoring-system` |
| Storage dashboard | ConfigMap | `{env}-harvester-storage-dashboard` | `cattle-dashboards` |
| VM dashboard | ConfigMap | `{env}-harvester-vm-dashboard` | `cattle-dashboards` |
| Node dashboard | ConfigMap | `{env}-harvester-node-dashboard` | `cattle-dashboards` |

---

## Design decisions

### Why direct Secret injection instead of AlertmanagerConfig CRD

`AlertmanagerConfig` v1alpha1 silently drops any field it does not recognise —
including `googleChatConfigs`. The module therefore patches the
`alertmanager-rancher-monitoring-alertmanager` Secret directly using a
`null_resource` + `kubectl apply`. Prometheus Operator watches the Secret and
hot-reloads Alertmanager within ~30 s of any change.

The `kubernetes_manifest` resource is not used for this Secret because the
rancher-monitoring Helm chart pre-creates it; a `kubernetes_manifest` would fail
with "already exists" on the first `terraform apply`.

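The patch pattern can be sketched as follows. The local value name and the
exact `kubectl` invocation are assumptions (the authoritative version lives in
[main.tf](main.tf)); the `create --dry-run=client -o yaml | kubectl apply`
pipeline is what makes the operation idempotent against the pre-created Secret:

```hcl
# Sketch: re-render and apply the Alertmanager Secret whenever the config
# changes. local.alertmanager_yaml is an assumed name for the rendered YAML.
resource "null_resource" "alertmanager_config" {
  triggers = {
    # Re-run the provisioner on any config change.
    config_sha = sha256(local.alertmanager_yaml)
  }

  provisioner "local-exec" {
    command = <<-EOT
      kubectl --kubeconfig '${var.kubeconfig_path}' --context '${var.kubeconfig_context}' \
        -n cattle-monitoring-system create secret generic \
        alertmanager-rancher-monitoring-alertmanager \
        --from-literal='alertmanager.yaml=${local.alertmanager_yaml}' \
        --dry-run=client -o yaml \
        | kubectl --kubeconfig '${var.kubeconfig_path}' --context '${var.kubeconfig_context}' apply -f -
    EOT
  }
}
```

Note the shell quoting in the sketch assumes the rendered YAML contains no
single quotes; a production implementation would write the config to a file
first.
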
### calert as a Google Chat relay

Google Chat does not have a native Alertmanager receiver. calert
(`ghcr.io/mr-karan/calert`) is a purpose-built relay that accepts the standard
Alertmanager webhook payload and reformats it into Google Chat Cards v2 JSON.

### Hot-reload

The calert Deployment carries a `checksum/config` annotation computed from the
rendered config and message template. Any `terraform apply` that changes the
template or config content automatically triggers a rolling restart — no manual
pod deletion required.

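This is the standard checksum-annotation trick: hashing the config into the
pod template means any content change produces a new template, which Kubernetes
rolls out. A minimal sketch (resource fields trimmed to the essentials; the
local names and image tag are assumptions):

```hcl
resource "kubernetes_deployment" "calert" {
  metadata {
    name      = "calert"
    namespace = var.monitoring_namespace
  }

  spec {
    selector {
      match_labels = { app = "calert" }
    }

    template {
      metadata {
        labels = { app = "calert" }
        annotations = {
          # A new hash changes the pod template, so Kubernetes rolls the pods.
          "checksum/config" = sha256(join("", [local.calert_config, local.calert_template]))
        }
      }

      spec {
        container {
          name  = "calert"
          image = "ghcr.io/mr-karan/calert" # tag omitted; illustrative
        }
      }
    }
  }
}
```
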
### PrometheusRule label selector

Prometheus Operator discovers PrometheusRule CRDs via its `ruleSelector`. The
rancher-monitoring Helm chart configures this selector to match
`release=rancher-monitoring`. Every PrometheusRule created by this module uses
`local.rule_labels`, which merges `release=rancher-monitoring` with the common
`managed_by` and `environment` labels. **Omitting `local.rule_labels` from a
PrometheusRule will cause Prometheus to ignore the rule entirely.**

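`local.rule_labels` is assumed to look roughly like this (the `release` key is
the one the `ruleSelector` matches on; the exact merge in this module may
differ):

```hcl
locals {
  rule_labels = merge(
    # Required: without this label, Prometheus Operator never loads the rule.
    { release = "rancher-monitoring" },
    # Common bookkeeping labels shared by all module resources.
    {
      managed_by  = "terraform"
      environment = var.environment
    }
  )
}
```
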
---

## Alert inventory

### Storage (`prometheus_rule_storage`)

| Alert name | Severity | Condition |
|---|---|---|
| `LonghornVolumeFaulted` | critical | Volume state = Faulted |
| `LonghornVolumeDegradedWarning` | warning | Volume degraded for 15 m |
| `LonghornVolumeDegradedCritical` | critical | Volume degraded for 60 m |
| `LonghornVolumeReplicaCountLow` | warning | Healthy replica count < expected |
| `LonghornReplicaRebuildBacklog` | warning | Concurrent rebuilds per node > threshold |
| `LonghornEvictionWithDegradedVolumes` | critical | Disk eviction active + volumes degraded |
| `LonghornDiskUsageHigh` | warning / critical | Disk usage % above configurable threshold |

### VM / KubeVirt (`prometheus_rule_vm`)

| Alert name | Severity | Condition |
|---|---|---|
| `VirtLauncherPodStuck` | critical | virt-launcher pod Pending > `virt_launcher_stuck_for` |
| `VirtLauncherContainerCreating` | critical | virt-launcher stuck in ContainerCreating |
| `VirtLauncherCrashLoop` | critical | ≥ 3 restarts in 15 m |
| `HpVolumePodNotRunning` | critical | hotplug volume pod not Running > `hp_volume_stuck_for` |
| `HpVolumeMapDeviceFailed` | critical | exit status 32 (NFS/Block mode conflict) |
| `StaleVolumeAttachmentBlocking` | warning | CSI blocked by stale VolumeAttachment |

### Node (`prometheus_rule_node`)

| Alert name | Severity | Condition |
|---|---|---|
| `NodeCpuHigh` | warning / critical | CPU utilisation > configurable threshold |
| `NodeMemoryHigh` | warning | Memory utilisation > configurable threshold |

---

## How to add a new alert

### Option A — extend an existing rule group

This is the fastest path when the new alert belongs to an existing category
(storage, VM, or node).

1. Open [main.tf](main.tf) and locate the matching `kubernetes_manifest` block:
   - `prometheus_rule_storage` — Longhorn / disk
   - `prometheus_rule_vm` — KubeVirt / virt-launcher
   - `prometheus_rule_node` — node CPU / memory

2. Add a new map to the `rules` list inside the relevant group:

   ```hcl
   {
     alert = "MyNewAlert"    # PascalCase, [A-Za-z0-9_] only
     expr  = "my_metric > 0" # valid PromQL — test in Grafana Explore first
     for   = "5m"            # omit for instant (stateless) alerts
     labels = {
       severity = "warning"  # "warning" or "critical" — Alertmanager routes on this
     }
     annotations = {
       summary     = "One-line description shown in the card"
       description = "Detail with template vars: instance={{ $labels.instance }}, value={{ $value }}"
       runbook_url = "${var.runbook_base_url}/MyNewAlert"
     }
   }
   ```

3. Apply:

   ```bash
   terraform plan   # verify the rule diff looks correct
   terraform apply
   ```

### Option B — add a new PrometheusRule resource

Use this when the alert belongs to a distinct new category that warrants its
own Kubernetes object and Grafana dashboard.

```hcl
resource "kubernetes_manifest" "prometheus_rule_myapp" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "${var.environment}-harvester-myapp-alerts"
      namespace = var.monitoring_namespace
      labels    = local.rule_labels # required — carries release=rancher-monitoring
    }
    spec = {
      groups = [
        {
          name = "harvester.myapp"
          rules = [
            {
              alert  = "MyAppDown"
              expr   = "up{job=\"myapp\"} == 0"
              for    = "2m"
              labels = { severity = "critical" }
              annotations = {
                summary     = "MyApp is unreachable"
                description = "Instance {{ $labels.instance }} has been down for 2 m."
                runbook_url = "${var.runbook_base_url}/MyAppDown"
              }
            }
          ]
        }
      ]
    }
  }
}
```

Add a corresponding output in [outputs.tf](outputs.tf) and expose it through
the environment layer's `outputs.tf`.

### Verifying a rule was picked up

```bash
# List all PrometheusRule objects
kubectl get prometheusrules -n cattle-monitoring-system

# Confirm the rule name appears in Prometheus's loaded rule set
kubectl exec -n cattle-monitoring-system \
  $(kubectl get pod -n cattle-monitoring-system -l app=rancher-monitoring-prometheus -o name | head -1) \
  -- wget -qO- localhost:9090/api/v1/rules \
  | jq '.data.groups[].rules[].name' | grep MyNewAlert
```

### Test-firing an alert

```bash
# Port-forward Alertmanager
kubectl port-forward -n cattle-monitoring-system \
  svc/rancher-monitoring-alertmanager 9093:9093

# POST a synthetic alert
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": { "alertname": "MyNewAlert", "severity": "warning" },
    "annotations": { "summary": "Test fire", "description": "Synthetic test" }
  }]'
```

A Google Chat card should appear within a few seconds.

---

## Notification card anatomy

Each alert produces one card. The card structure is defined in a Go template
stored in the `calert-config` Secret and rendered by calert at runtime.

```
┌─────────────────────────────────────────────────┐
│ (WARNING) LonghornDiskUsageHigh | Firing        │ ← header
├─────────────────────────────────────────────────┤
│ Summary: Disk usage on node-1 is 87%            │ ┐
│ Description: Longhorn disk sdb on node-1 …      │ │ one decoratedText
│ Runbook: https://wiki.internal/runbooks/…       │ │ widget per annotation
├─────────────────────────────────────────────────┤ ┘
│ ▶ Alert Details (collapsible)                   │ ← all labels
├─────────────────────────────────────────────────┤
│ [View Alert]  [View in Prometheus]              │ ← buttons (optional)
└─────────────────────────────────────────────────┘
```

The **"View Alert" button** is only rendered when `alertmanager_url` is set. It
links to:
```
<alertmanager_url>/#/alerts?filter={alertname="<name>"}
```

The **"View in Prometheus" button** is only rendered when the alert carries a
`GeneratorURL` (set automatically by Prometheus when a rule fires for real;
absent in synthetic test-fires).

### Template evaluation: Terraform vs Go

The card template is a Go template evaluated by calert at runtime — but it
lives inside a Terraform heredoc and is written to a Kubernetes Secret at
`terraform apply` time. This means two template engines interact:

| Syntax | Evaluated by | When |
|---|---|---|
| `${var.alertmanager_url}` | Terraform | at `terraform apply` |
| `%{~ if var.alertmanager_url != "" ~}` | Terraform | at `terraform apply` |
| `{{.Labels.alertname}}` | calert (Go template) | at alert runtime |
| `{{.Annotations.SortedPairs}}` | calert (Go template) | at alert runtime |

Terraform bakes the base URL as a literal string into the template file.
calert then substitutes the per-alert `alertname` at runtime. Both coexist
safely in the same heredoc because Terraform ignores `{{ }}` delimiters and
calert ignores `${ }` delimiters.

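A condensed sketch of how the two engines share one heredoc (the real template
in the `calert-config` Secret is larger; the card text here is illustrative):

```hcl
locals {
  # Terraform resolves ${ } and %{ } at apply time; everything in {{ }} is
  # written out verbatim and later evaluated by calert for each alert.
  card_template = <<-EOT
    ({{ .Labels.severity }}) {{ .Labels.alertname }} | {{ .Status }}
    {{ .Annotations.summary }}
    %{~ if var.alertmanager_url != "" ~}
    View Alert: ${var.alertmanager_url}/#/alerts?filter={alertname="{{ .Labels.alertname }}"}
    %{~ endif ~}
  EOT
}
```

After `terraform apply`, the base URL and the `%{ if }` branch have been
resolved into literals; the `{{ }}` expressions survive untouched for calert.
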
---

## Variable reference

### Required

| Name | Type | Description |
|---|---|---|
| `environment` | string | Short environment identifier used in resource names (`lk`, `prod`, …) |
| `kubeconfig_path` | string | Path to the Harvester kubeconfig file |
| `kubeconfig_context` | string | kubectl context within the kubeconfig |
| `google_chat_webhook_url` | string (sensitive) | Google Chat incoming webhook URL |

### Optional

| Name | Type | Default | Description |
|---|---|---|---|
| `alertmanager_url` | string | `""` | Alertmanager UI base URL — enables the "View Alert" button. Leave empty to omit. |
| `monitoring_namespace` | string | `cattle-monitoring-system` | Namespace where rancher-monitoring runs |
| `dashboards_namespace` | string | `cattle-dashboards` | Namespace where Grafana picks up dashboard ConfigMaps |
| `runbook_base_url` | string | `https://wiki.internal/runbooks/harvester` | Base URL prepended to each alert's `runbook_url` annotation |
| `disk_usage_warning_pct` | number | `80` | Longhorn disk usage % — warning threshold |
| `disk_usage_critical_pct` | number | `90` | Longhorn disk usage % — critical threshold |
| `replica_rebuild_warning_count` | number | `5` | Concurrent Longhorn rebuilds per node before warning |
| `node_cpu_warning_pct` | number | `85` | Node CPU utilisation % — warning threshold |
| `node_cpu_critical_pct` | number | `95` | Node CPU utilisation % — critical threshold |
| `node_memory_warning_pct` | number | `85` | Node memory utilisation % — warning threshold |
| `virt_launcher_stuck_for` | string | `"5m"` | Duration a virt-launcher pod must be Pending/ContainerCreating before alerting |
| `hp_volume_stuck_for` | string | `"3m"` | Duration a hotplug volume pod must be non-Running before alerting |

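All thresholds are plain module arguments, so per-environment tuning happens in
the calling block. For example (values illustrative):

```hcl
module "monitoring" {
  source = "github.com/wso2-enterprise/open-cloud-datacenter//modules/monitoring?ref=v0.4.0"

  environment             = "prod"
  kubeconfig_path         = "/path/to/harvester.kubeconfig"
  kubeconfig_context      = "local"
  google_chat_webhook_url = var.google_chat_webhook_url

  # Tighter storage thresholds and a more patient VM check for production.
  disk_usage_warning_pct  = 70
  disk_usage_critical_pct = 85
  virt_launcher_stuck_for = "10m"
}
```
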
---

## Outputs

| Name | Description |
|---|---|
| `prometheus_rule_storage_name` | Name of the storage PrometheusRule |
| `prometheus_rule_vm_name` | Name of the VM PrometheusRule |
| `prometheus_rule_node_name` | Name of the node PrometheusRule |
| `alertmanager_config_name` | Name of the Alertmanager config Secret |
| `grafana_dashboard_storage_name` | Name of the storage Grafana dashboard ConfigMap |
| `grafana_dashboard_vm_name` | Name of the VM Grafana dashboard ConfigMap |
| `grafana_dashboard_node_name` | Name of the node Grafana dashboard ConfigMap |
| `monitoring_namespace` | Namespace all monitoring resources were deployed into |