Commit 314a8c0 (1 parent: 013edf3)
Add README for monitoring module

File changed: modules/monitoring/README.md (+339, -0)

# modules/monitoring

Terraform module that wires up a complete monitoring stack on top of the
`rancher-monitoring` (kube-prometheus-stack) add-on that ships with every
Harvester cluster. A single `module` block deploys PrometheusRules,
Alertmanager configuration, a Google Chat notification relay, and Grafana
dashboards — all configurable via input variables.

## Prerequisites

- `rancher-monitoring` add-on installed on the Harvester cluster
- Google Chat Space with an incoming webhook URL
- `kubectl` available in the Terraform execution environment (used by
  `null_resource` provisioners to patch the Alertmanager Secret)

## Usage

```hcl
module "monitoring" {
  source = "github.com/wso2-enterprise/open-cloud-datacenter//modules/monitoring?ref=v0.4.0"

  environment             = "lk"
  kubeconfig_path         = "/path/to/harvester.kubeconfig"
  kubeconfig_context      = "local"
  google_chat_webhook_url = var.google_chat_webhook_url

  # Optional — show a "View Alert" deep-link button in each notification card.
  # Find this URL: Harvester UI → Add-ons → rancher-monitoring → alert-manager
  # alertmanager_url = "https://<harvester-ip>/api/v1/namespaces/cattle-monitoring-system/services/http:rancher-monitoring-alertmanager:9093/proxy"
}
```

---

## Architecture

```
Prometheus (rancher-monitoring)
  │ evaluates PrometheusRule CRDs labelled release=rancher-monitoring
  ▼
Alertmanager (rancher-monitoring)
  │ matches severity label → route → receiver "google-chat"
  │ webhook_configs: http://calert.cattle-monitoring-system:6000/create
  ▼
calert (Deployment — ghcr.io/mr-karan/calert)
  │ accepts Alertmanager webhook POST, renders Google Chat Cards v2
  ▼
Google Chat Space (incoming webhook)
```

### Resources created

| Resource | Kubernetes kind | Name pattern | Namespace |
|---|---|---|---|
| Alertmanager config | Secret | `alertmanager-rancher-monitoring-alertmanager` | `cattle-monitoring-system` |
| calert config + template | Secret | `calert-config` | `cattle-monitoring-system` |
| calert | Deployment + Service | `calert` | `cattle-monitoring-system` |
| Storage alerts | PrometheusRule | `{env}-harvester-storage-alerts` | `cattle-monitoring-system` |
| VM alerts | PrometheusRule | `{env}-harvester-vm-alerts` | `cattle-monitoring-system` |
| Node alerts | PrometheusRule | `{env}-harvester-node-alerts` | `cattle-monitoring-system` |
| Storage dashboard | ConfigMap | `{env}-harvester-storage-dashboard` | `cattle-dashboards` |
| VM dashboard | ConfigMap | `{env}-harvester-vm-dashboard` | `cattle-dashboards` |
| Node dashboard | ConfigMap | `{env}-harvester-node-dashboard` | `cattle-dashboards` |

---

## Design decisions

### Why direct Secret injection instead of AlertmanagerConfig CRD

`AlertmanagerConfig` v1alpha1 silently drops any field it does not recognise —
including `googleChatConfigs`. The module therefore patches the
`alertmanager-rancher-monitoring-alertmanager` Secret directly using a
`null_resource` + `kubectl apply`. Prometheus Operator watches the Secret and
hot-reloads Alertmanager within ~30 s of any change.

The `kubernetes_manifest` resource is not used for this Secret because the
rancher-monitoring Helm chart pre-creates it; a `kubernetes_manifest` would
fail with "already exists" on the first `terraform apply`.

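The patch mechanism can be sketched roughly as follows. This is an illustrative
sketch only: the resource name, the `local.alertmanager_yaml` local, and the
exact `kubectl` invocation are assumptions, not the module's actual code.

```hcl
# Sketch — names and rendering logic are illustrative assumptions.
resource "null_resource" "alertmanager_config" {
  # Re-run the patch whenever the rendered config changes.
  triggers = {
    config_sha = sha256(local.alertmanager_yaml) # hypothetical local
  }

  provisioner "local-exec" {
    command = <<-EOT
      kubectl --kubeconfig='${var.kubeconfig_path}' \
              --context='${var.kubeconfig_context}' \
              -n cattle-monitoring-system \
              create secret generic alertmanager-rancher-monitoring-alertmanager \
              --from-literal=alertmanager.yaml='${local.alertmanager_yaml}' \
              --dry-run=client -o yaml \
      | kubectl --kubeconfig='${var.kubeconfig_path}' apply -f -
    EOT
  }
}
```

The `create … --dry-run=client -o yaml | kubectl apply` idiom makes the patch
idempotent against the Helm-pre-created Secret.
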
### calert as a Google Chat relay

Google Chat does not have a native Alertmanager receiver. calert
(`ghcr.io/mr-karan/calert`) is a purpose-built relay that accepts the standard
Alertmanager webhook payload and reformats it into Google Chat Cards v2 JSON.

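The route wiring shown in the architecture diagram corresponds to an
Alertmanager configuration along these lines. This is a hedged sketch of the
YAML the module writes into the Secret; the grouping keys, matcher syntax, and
any calert query parameters are assumptions.

```hcl
# Sketch of the injected Alertmanager config — values are illustrative.
locals {
  alertmanager_yaml = <<-YAML
    route:
      receiver: google-chat
      group_by: [alertname, severity]
      routes:
        - receiver: google-chat
          matchers:
            - severity =~ "warning|critical"
    receivers:
      - name: google-chat
        webhook_configs:
          - url: http://calert.cattle-monitoring-system:6000/create
  YAML
}
```
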
### Hot-reload

The calert Deployment carries a `checksum/config` annotation computed from the
rendered config and message template. Any `terraform apply` that changes the
template or config content automatically triggers a rolling restart — no manual
pod deletion required.

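The annotation trick can be sketched as follows (the locals and resource shape
are illustrative; only the `checksum/config` annotation itself comes from the
module description above):

```hcl
# Sketch — local names and omitted fields are illustrative assumptions.
resource "kubernetes_deployment" "calert" {
  metadata {
    name      = "calert"
    namespace = "cattle-monitoring-system"
  }
  spec {
    template {
      metadata {
        annotations = {
          # A config/template change alters this hash, which alters the pod
          # template, which makes Kubernetes perform a rolling restart.
          "checksum/config" = sha256(join("", [
            local.calert_config,   # hypothetical local
            local.calert_template, # hypothetical local
          ]))
        }
      }
      # container spec omitted
    }
  }
}
```
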
### PrometheusRule label selector

Prometheus Operator discovers PrometheusRule CRDs via its `ruleSelector`. The
rancher-monitoring Helm chart configures this selector to match
`release=rancher-monitoring`. Every PrometheusRule created by this module uses
`local.rule_labels`, which merges `release=rancher-monitoring` with the common
`managed_by` and `environment` labels. **Omitting `local.rule_labels` from a
PrometheusRule will cause Prometheus to ignore the rule entirely.**

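A minimal sketch of what `local.rule_labels` likely contains — the `release`
key is mandated by the selector described above; the exact values of the
`managed_by` and `environment` keys are assumptions:

```hcl
locals {
  rule_labels = merge(
    {
      # Required: the operator's ruleSelector matches on this label.
      release = "rancher-monitoring"
    },
    {
      managed_by  = "terraform"       # assumed value
      environment = var.environment
    }
  )
}
```
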
---

## Alert inventory

### Storage (`prometheus_rule_storage`)

| Alert name | Severity | Condition |
|---|---|---|
| `LonghornVolumeFaulted` | critical | Volume state = Faulted |
| `LonghornVolumeDegradedWarning` | warning | Volume degraded for 15 m |
| `LonghornVolumeDegradedCritical` | critical | Volume degraded for 60 m |
| `LonghornVolumeReplicaCountLow` | warning | Healthy replica count < expected |
| `LonghornReplicaRebuildBacklog` | warning | Concurrent rebuilds per node > threshold |
| `LonghornEvictionWithDegradedVolumes` | critical | Disk eviction active + volumes degraded |
| `LonghornDiskUsageHigh` | warning / critical | Disk usage % above configurable threshold |

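To illustrate how the configurable thresholds from the variable reference feed
into rule expressions, a `LonghornDiskUsageHigh`-style rule entry might look
like this. The PromQL metric names and the `for` duration are assumptions, not
the module's actual query:

```hcl
{
  alert = "LonghornDiskUsageHigh"
  # Hypothetical PromQL — metric names here are assumptions.
  expr  = "100 * longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes > ${var.disk_usage_warning_pct}"
  for   = "10m"
  labels = {
    severity = "warning"
  }
  annotations = {
    summary     = "Disk usage on {{ $labels.node }} exceeds ${var.disk_usage_warning_pct}%"
    runbook_url = "${var.runbook_base_url}/LonghornDiskUsageHigh"
  }
}
```
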
### VM / KubeVirt (`prometheus_rule_vm`)

| Alert name | Severity | Condition |
|---|---|---|
| `VirtLauncherPodStuck` | critical | virt-launcher pod Pending > `virt_launcher_stuck_for` |
| `VirtLauncherContainerCreating` | critical | virt-launcher stuck in ContainerCreating |
| `VirtLauncherCrashLoop` | critical | ≥ 3 restarts in 15 m |
| `HpVolumePodNotRunning` | critical | hotplug volume pod not Running > `hp_volume_stuck_for` |
| `HpVolumeMapDeviceFailed` | critical | exit status 32 (NFS/Block mode conflict) |
| `StaleVolumeAttachmentBlocking` | warning | CSI blocked by stale VolumeAttachment |

### Node (`prometheus_rule_node`)

| Alert name | Severity | Condition |
|---|---|---|
| `NodeCpuHigh` | warning / critical | CPU utilisation > configurable threshold |
| `NodeMemoryHigh` | warning | Memory utilisation > configurable threshold |

---

## How to add a new alert

### Option A — extend an existing rule group

This is the fastest path when the new alert belongs to an existing category
(storage, VM, or node).

1. Open [main.tf](main.tf) and locate the matching `kubernetes_manifest` block:
   - `prometheus_rule_storage` — Longhorn / disk
   - `prometheus_rule_vm` — KubeVirt / virt-launcher
   - `prometheus_rule_node` — node CPU / memory

2. Add a new map to the `rules` list inside the relevant group:

   ```hcl
   {
     alert = "MyNewAlert"    # PascalCase, [A-Za-z0-9_] only
     expr  = "my_metric > 0" # valid PromQL — test in Grafana Explore first
     for   = "5m"            # omit for instant (stateless) alerts
     labels = {
       severity = "warning" # "warning" or "critical" — Alertmanager routes on this
     }
     annotations = {
       summary     = "One-line description shown in the card"
       description = "Detail with template vars: instance={{ $labels.instance }}, value={{ $value }}"
       runbook_url = "${var.runbook_base_url}/MyNewAlert"
     }
   }
   ```

3. Apply:

   ```bash
   terraform plan   # verify the rule diff looks correct
   terraform apply
   ```

### Option B — add a new PrometheusRule resource

Use this when the alert belongs to a distinct new category that warrants its
own Kubernetes object and Grafana dashboard.

```hcl
resource "kubernetes_manifest" "prometheus_rule_myapp" {
  manifest = {
    apiVersion = "monitoring.coreos.com/v1"
    kind       = "PrometheusRule"
    metadata = {
      name      = "${var.environment}-harvester-myapp-alerts"
      namespace = var.monitoring_namespace
      labels    = local.rule_labels # required — carries release=rancher-monitoring
    }
    spec = {
      groups = [
        {
          name = "harvester.myapp"
          rules = [
            {
              alert  = "MyAppDown"
              expr   = "up{job=\"myapp\"} == 0"
              for    = "2m"
              labels = { severity = "critical" }
              annotations = {
                summary     = "MyApp is unreachable"
                description = "Instance {{ $labels.instance }} has been down for 2 m."
                runbook_url = "${var.runbook_base_url}/MyAppDown"
              }
            }
          ]
        }
      ]
    }
  }
}
```

Add a corresponding output in [outputs.tf](outputs.tf) and expose it through
the environment layer's `outputs.tf`.

### Verifying a rule was picked up

```bash
# List all PrometheusRule objects
kubectl get prometheusrules -n cattle-monitoring-system

# Confirm the rule name appears in Prometheus's loaded rule set
kubectl exec -n cattle-monitoring-system \
  $(kubectl get pod -n cattle-monitoring-system -l app=rancher-monitoring-prometheus -o name | head -1) \
  -- wget -qO- localhost:9090/api/v1/rules \
  | jq '.data.groups[].rules[].name' | grep MyNewAlert
```

### Test-firing an alert

```bash
# Port-forward Alertmanager
kubectl port-forward -n cattle-monitoring-system \
  svc/rancher-monitoring-alertmanager 9093:9093

# POST a synthetic alert
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": { "alertname": "MyNewAlert", "severity": "warning" },
    "annotations": { "summary": "Test fire", "description": "Synthetic test" }
  }]'
```

A Google Chat card should appear within a few seconds.

---

## Notification card anatomy

Each alert produces one card. The card structure is defined in a Go template
stored in the `calert-config` Secret and rendered by calert at runtime.

```
┌────────────────────────────────────────────────────┐
│ (WARNING) LonghornDiskUsageHigh | Firing           │ ← header
├────────────────────────────────────────────────────┤
│ Summary: Disk usage on node-1 is 87%               │ ┐
│ Description: Longhorn disk sdb on node-1 …         │ │ one decoratedText
│ Runbook: https://wiki.internal/runbooks/…          │ │ widget per annotation
├────────────────────────────────────────────────────┤ ┘
│ ▶ Alert Details (collapsible)                      │ ← all labels
├────────────────────────────────────────────────────┤
│ [View Alert] [View in Prometheus]                  │ ← buttons (optional)
└────────────────────────────────────────────────────┘
```

**"View Alert" button** is only rendered when `alertmanager_url` is set. It
links to:

```
<alertmanager_url>/#/alerts?filter={alertname="<name>"}
```

**"View in Prometheus" button** is only rendered when the alert carries a
`GeneratorURL` (set automatically by Prometheus when a rule fires for real;
absent in synthetic test-fires).

### Template evaluation: Terraform vs Go

The card template is a Go template evaluated by calert at runtime — but it
lives inside a Terraform heredoc and is written to a Kubernetes Secret at
`terraform apply` time. This means two template engines interact:

| Syntax | Evaluated by | When |
|---|---|---|
| `${var.alertmanager_url}` | Terraform | at `terraform apply` |
| `%{~ if var.alertmanager_url != "" ~}` | Terraform | at `terraform apply` |
| `{{.Labels.alertname}}` | calert (Go template) | at alert runtime |
| `{{.Annotations.SortedPairs}}` | calert (Go template) | at alert runtime |

Terraform bakes the base URL as a literal string into the template file.
calert then substitutes the per-alert `alertname` at runtime. Both coexist
safely in the same heredoc because Terraform ignores `{{ }}` delimiters and
calert ignores `${ }` delimiters.

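A minimal sketch of the two engines coexisting in one heredoc (an illustrative
fragment, not the module's actual template; the field names follow the
Alertmanager webhook payload):

```hcl
# Illustrative fragment — Terraform resolves ${…} and %{…} at apply time,
# calert resolves {{ … }} when an alert fires.
locals {
  card_template = <<-TMPL
    *{{ .Labels.alertname }}* is {{ .Status }}
    %{~ if var.alertmanager_url != "" ~}
    View: ${var.alertmanager_url}/#/alerts?filter={alertname="{{ .Labels.alertname }}"}
    %{~ endif ~}
  TMPL
}
```
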
---

## Variable reference

### Required

| Name | Type | Description |
|---|---|---|
| `environment` | string | Short environment identifier used in resource names (`lk`, `prod`, …) |
| `kubeconfig_path` | string | Path to the Harvester kubeconfig file |
| `kubeconfig_context` | string | kubectl context within the kubeconfig |
| `google_chat_webhook_url` | string (sensitive) | Google Chat incoming webhook URL |

### Optional

| Name | Type | Default | Description |
|---|---|---|---|
| `alertmanager_url` | string | `""` | Alertmanager UI base URL — enables "View Alert" button. Leave empty to omit. |
| `monitoring_namespace` | string | `cattle-monitoring-system` | Namespace where rancher-monitoring runs |
| `dashboards_namespace` | string | `cattle-dashboards` | Namespace where Grafana picks up dashboard ConfigMaps |
| `runbook_base_url` | string | `https://wiki.internal/runbooks/harvester` | Base URL prepended to each alert's `runbook_url` annotation |
| `disk_usage_warning_pct` | number | `80` | Longhorn disk usage % — warning threshold |
| `disk_usage_critical_pct` | number | `90` | Longhorn disk usage % — critical threshold |
| `replica_rebuild_warning_count` | number | `5` | Concurrent Longhorn rebuilds per node before warning |
| `node_cpu_warning_pct` | number | `85` | Node CPU utilisation % — warning threshold |
| `node_cpu_critical_pct` | number | `95` | Node CPU utilisation % — critical threshold |
| `node_memory_warning_pct` | number | `85` | Node memory utilisation % — warning threshold |
| `virt_launcher_stuck_for` | string | `"5m"` | Duration virt-launcher must be Pending/ContainerCreating before alerting |
| `hp_volume_stuck_for` | string | `"3m"` | Duration hp-volume pod must be non-Running before alerting |

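For example, tightening the disk and VM thresholds for a production
environment might look like this (variable names match the table above; the
override values themselves are illustrative):

```hcl
module "monitoring" {
  source = "github.com/wso2-enterprise/open-cloud-datacenter//modules/monitoring?ref=v0.4.0"

  environment             = "prod"
  kubeconfig_path         = "/path/to/harvester.kubeconfig"
  kubeconfig_context      = "local"
  google_chat_webhook_url = var.google_chat_webhook_url

  # Illustrative overrides — defaults are 80 / 90 / "5m".
  disk_usage_warning_pct  = 70
  disk_usage_critical_pct = 85
  virt_launcher_stuck_for = "10m"
}
```
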
### Outputs

| Name | Description |
|---|---|
| `prometheus_rule_storage_name` | Name of the storage PrometheusRule |
| `prometheus_rule_vm_name` | Name of the VM PrometheusRule |
| `prometheus_rule_node_name` | Name of the node PrometheusRule |
| `alertmanager_config_name` | Name of the Alertmanager config Secret |
| `grafana_dashboard_storage_name` | Name of the storage Grafana dashboard ConfigMap |
| `grafana_dashboard_vm_name` | Name of the VM Grafana dashboard ConfigMap |
| `grafana_dashboard_node_name` | Name of the node Grafana dashboard ConfigMap |
| `monitoring_namespace` | Namespace all monitoring resources were deployed into |