Add Harvester monitoring stack module #23
Walkthrough

Adds a new Terraform monitoring module that configures Alertmanager (via Secret), deploys a Google Chat forwarder (calert), creates PrometheusRule CRDs and Grafana dashboard ConfigMaps, provides Go templates for Google Chat cards, example usage, and module inputs/outputs.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant TF as Terraform
    participant K8s as Kubernetes
    participant PM as Prometheus<br/>Operator
    participant AM as Alertmanager
    participant Calert as Calert<br/>Forwarder
    participant GC as Google Chat
    User->>TF: terraform apply monitoring module
    TF->>K8s: create PrometheusRule CRDs (storage/VM/node)
    TF->>K8s: apply Alertmanager Secret (rendered YAML via null_resource)
    TF->>K8s: create calert Config Secret (templates/config)
    TF->>K8s: deploy calert Deployment and Service
    TF->>K8s: create Grafana dashboard ConfigMaps
    PM->>AM: alerts fired per PrometheusRule
    AM->>AM: route/inhibit alerts
    AM->>Calert: POST webhook (firing alerts)
    Calert->>Calert: render Google Chat card (Go template)
    Calert->>GC: POST card message
    GC->>User: deliver notification
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🧹 Nitpick comments (7)
modules/monitoring/examples/basic/main.tf (1)
34-34: Consider parameterizing the environment value.

The `environment = "lk"` is hardcoded. For a reusable example, consider using a variable:

♻️ Suggested refactor

```diff
 # Identifiers
-environment     = "lk"
+environment     = var.environment
 kubeconfig_path = var.kubeconfig_path
```

And add a variable declaration:

```hcl
variable "environment" {
  type        = string
  default     = "lk"
  description = "Environment identifier used for resource naming."
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/monitoring/examples/basic/main.tf` at line 34, replace the hardcoded environment = "lk" in the example main.tf with a reference to a variable (e.g., use var.environment) and add a variable declaration named environment with type string, a default of "lk" and a description so the example becomes reusable and configurable; update any resource/name interpolation that relied on the literal "lk" to use var.environment instead.

modules/monitoring/variables.tf (2)
88-98: Consider adding validation for Prometheus duration format.

The `virt_launcher_stuck_for` and `hp_volume_stuck_for` variables accept duration strings but have no validation. Invalid formats (e.g., "5 minutes" instead of "5m") would cause Prometheus to reject the rule.

🛡️ Suggested validation

```hcl
variable "virt_launcher_stuck_for" {
  type        = string
  default     = "5m"
  description = "Duration a virt-launcher pod must be Pending/ContainerCreating before alerting."

  validation {
    condition     = can(regex("^[0-9]+(ms|s|m|h|d|w|y)$", var.virt_launcher_stuck_for))
    error_message = "virt_launcher_stuck_for must be a valid Prometheus duration (e.g., 5m, 1h, 30s)."
  }
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/monitoring/variables.tf` around lines 88 - 98, Add a validation block to both variable declarations (virt_launcher_stuck_for and hp_volume_stuck_for) to ensure values match Prometheus duration syntax; use a regex-based condition (e.g., can(regex("^[0-9]+(ms|s|m|h|d|w|y)$", var.<variable_name>))) and provide a clear error_message like "<variable_name> must be a valid Prometheus duration (e.g., 5m, 1h, 30s)". Apply this same validation to both variables so invalid strings (e.g., "5 minutes") are rejected at plan time.
52-80: Consider adding validation to ensure warning thresholds are lower than critical thresholds.

The threshold variables have sensible defaults, but there's no validation to prevent misconfiguration where `warning_pct >= critical_pct`. This could cause confusing alert behavior.

🛡️ Suggested validation

```hcl
variable "disk_usage_warning_pct" {
  type        = number
  default     = 80
  description = "Longhorn disk usage percentage that triggers a warning alert."

  validation {
    condition     = var.disk_usage_warning_pct >= 0 && var.disk_usage_warning_pct <= 100
    error_message = "disk_usage_warning_pct must be between 0 and 100."
  }
}

# Add a precondition in main.tf:
# lifecycle {
#   precondition {
#     condition     = var.disk_usage_warning_pct < var.disk_usage_critical_pct
#     error_message = "disk_usage_warning_pct must be less than disk_usage_critical_pct."
#   }
# }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/monitoring/variables.tf` around lines 52 - 80, the threshold variables (disk_usage_warning_pct, disk_usage_critical_pct, node_cpu_warning_pct, node_cpu_critical_pct and similar) lack validation to prevent invalid values and misconfiguration where a warning >= critical; add validation blocks on each percentage variable to enforce 0..100 and sensible ranges (e.g., validation condition ensures the variable is between 0 and 100 with a clear error_message), and add a module-level precondition or lifecycle precondition (e.g., in main.tf) that asserts disk_usage_warning_pct < disk_usage_critical_pct and node_cpu_warning_pct < node_cpu_critical_pct (with error messages) so Terraform fails fast if a warning threshold is not strictly less than its corresponding critical threshold.

modules/monitoring/main.tf (4)
612-621: Hardcoded namespace `cpd-dp` may limit reusability.

The `HpVolumePodNotRunning` alert hardcodes `namespace="cpd-dp"`. Consider making this configurable via a variable if different environments use different namespaces for VM workloads.

♻️ Suggested refactor

Add to `variables.tf`:

```hcl
variable "vm_workload_namespace" {
  type        = string
  default     = "cpd-dp"
  description = "Namespace where VM-related workloads (hp-volume pods) run."
}
```

Then update the expression:

```diff
-expr = "kube_pod_status_phase{pod=~\"hp-volume-.*\", namespace=\"cpd-dp\", phase!=\"Running\"} == 1"
+expr = "kube_pod_status_phase{pod=~\"hp-volume-.*\", namespace=\"${var.vm_workload_namespace}\", phase!=\"Running\"} == 1"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/monitoring/main.tf` around lines 612 - 621, The alert HpVolumePodNotRunning hardcodes namespace="cpd-dp"; introduce a new variable (e.g., vm_workload_namespace with default "cpd-dp") and replace the hardcoded namespace in the PromQL expression (kube_pod_status_phase{pod=~"hp-volume-.*", namespace=...}) and any annotation text that includes the namespace (the kubectl get events -n ... command in description) to reference the new variable (var.vm_workload_namespace) so the alert is configurable across environments.
736-745: Consider adding a critical threshold for node memory.

The module defines both warning and critical alerts for CPU (`NodeHighCPU`, `NodeCPUCritical`) and disk usage (`LonghornDiskUsageHigh`, `LonghornDiskUsageCritical`), but only a warning alert for memory (`NodeHighMemory`). Consider adding `NodeMemoryCritical` for consistency in escalation paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/monitoring/main.tf` around lines 736 - 745, The Prometheus rule only defines a warning alert "NodeHighMemory" (using var.node_memory_warning_pct) but lacks a matching critical alert; add a new alert block "NodeMemoryCritical" mirroring "NodeHighMemory" but using a higher threshold variable (e.g., var.node_memory_critical_pct), set labels = { severity = "critical" }, choose an appropriate for duration (e.g., "5m" or matching existing critical alerts), and add annotations including summary, description and runbook_url pointing to "${var.runbook_base_url}/NodeMemoryCritical"; also ensure the new variable var.node_memory_critical_pct is declared in variables.tf if it doesn't exist.
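A minimal sketch of the suggested alert block, following the shape described in the prompt; the PromQL expression (based on node-exporter's `node_memory_MemAvailable_bytes`) and the surrounding rule-group structure are assumptions, not the module's actual code:

```hcl
# Hypothetical sketch of a NodeMemoryCritical alert entry, mirroring the
# module's existing warning alert but with a higher, critical threshold.
{
  alert = "NodeMemoryCritical"
  expr  = "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > ${var.node_memory_critical_pct}"
  for   = "5m"

  labels = {
    severity = "critical"
  }

  annotations = {
    summary     = "Node memory usage critically high"
    description = "Memory usage on {{ $labels.instance }} is above ${var.node_memory_critical_pct}%."
    runbook_url = "${var.runbook_base_url}/NodeMemoryCritical"
  }
}
```

As the prompt notes, `var.node_memory_critical_pct` would also need to be declared in `variables.tf`.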
289-315: Consider adding input validation and destroy-time cleanup.

Two observations:

1. Shell injection risk: The `kubeconfig_context` is interpolated directly into the shell command (lines 305, 311). If the context name contains shell metacharacters, this could cause unexpected behavior. Consider quoting or validating the input.
2. No destroy provisioner: When running `terraform destroy`, the Secret will remain in the cluster. If this is intentional (to preserve Alertmanager config), document it. Otherwise, consider adding a destroy provisioner.

🛡️ Suggested validation in variables.tf

```hcl
variable "kubeconfig_context" {
  type        = string
  description = "kubectl context name to use from the kubeconfig."

  validation {
    condition     = can(regex("^[a-zA-Z0-9._-]+$", var.kubeconfig_context))
    error_message = "kubeconfig_context must contain only alphanumeric characters, dots, underscores, and hyphens."
  }
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/monitoring/main.tf` around lines 289 - 315, The provisioner in null_resource.alertmanager_base_config injects var.kubeconfig_context directly into a shell command (risking shell injection) and there is no destroy-time cleanup; to fix, add validation for kubeconfig_context in variables.tf (use a regex like ^[A-Za-z0-9._-]+$) to reject unsafe names, then change the local-exec provisioner to avoid direct interpolation by passing the context via an environment variable (e.g., KUBE_CONTEXT) and reference it safely inside the heredoc as a quoted shell variable (use "$KUBE_CONTEXT") within the command in provisioner "local-exec" of null_resource.alertmanager_base_config; finally, either add a destroy provisioner on null_resource.alertmanager_base_config that runs a safe kubectl delete secret ... using the same KUBE_CONTEXT handling or document intentional retention of the Secret.
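A sketch of the env-var approach the prompt describes; the resource name matches the comment, while the manifest path and exact kubectl commands are illustrative assumptions:

```hcl
# Hypothetical sketch: the context reaches the shell via the environment,
# never via string interpolation into the command itself.
resource "null_resource" "alertmanager_base_config" {
  triggers = {
    kubeconfig_context = var.kubeconfig_context
  }

  provisioner "local-exec" {
    environment = {
      KUBE_CONTEXT = var.kubeconfig_context
    }
    command = "kubectl --context \"$KUBE_CONTEXT\" apply -f ${path.module}/rendered/alertmanager-secret.yaml"
  }

  # Optional destroy-time cleanup; destroy provisioners may only reference
  # self, so the context is read back from triggers.
  provisioner "local-exec" {
    when = destroy
    environment = {
      KUBE_CONTEXT = self.triggers.kubeconfig_context
    }
    command = "kubectl --context \"$KUBE_CONTEXT\" delete secret alertmanager-rancher-monitoring-alertmanager -n cattle-monitoring-system --ignore-not-found"
  }
}
```

If the Secret is intentionally retained on destroy, the second provisioner can simply be omitted and the retention documented instead.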
370-381: Add resource limits and health probes to the calert container.

The calert container lacks resource requests/limits and health probes:
- Resource limits: Without limits, the pod may be evicted under memory pressure (BestEffort QoS class).
- Health probes: Without liveness/readiness probes, Kubernetes cannot detect if calert becomes unresponsive.
♻️ Suggested enhancements
```diff
 containers = [{
   name  = "calert"
   image = "ghcr.io/mr-karan/calert:v2.3.0"
   args  = ["--config=/etc/calert/config.toml"]
   ports = [{ containerPort = 6000 }]
+  resources = {
+    requests = { cpu = "10m", memory = "32Mi" }
+    limits   = { cpu = "100m", memory = "64Mi" }
+  }
+  livenessProbe = {
+    httpGet             = { path = "/ping", port = 6000 }
+    initialDelaySeconds = 5
+    periodSeconds       = 30
+  }
+  readinessProbe = {
+    httpGet             = { path = "/ping", port = 6000 }
+    initialDelaySeconds = 5
+    periodSeconds       = 10
+  }
   volumeMounts = [{
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modules/monitoring/main.tf` around lines 370 - 381, The calert container spec currently lacks resource requests/limits and health probes; update the container entry for name "calert" (in the spec.containers block) to add a resources block with at least cpu and memory requests and limits to avoid BestEffort QoS, and add both livenessProbe and readinessProbe (HTTP or TCP probe matching calert’s health endpoint/port 6000 or an exec if available) with appropriate initialDelaySeconds, periodSeconds, timeoutSeconds and failureThreshold values; ensure probes target containerPort 6000 and keep volumeMounts unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modules/monitoring/examples/basic/main.tf`:
- Around line 30-31: The module source in module "monitoring" is pointing to the
wrong GitHub org and a non-existent tag; update the source string in module
"monitoring" to either use the correct organization name
("wso2/open-cloud-datacenter") with a valid tag once released, or change it to a
local relative path (e.g., ../..) for local/example development until the module
is published; ensure the source value is the only change and retain the ref/tag
semantics when switching back to the git-based source.
In `@modules/monitoring/main.tf`:
- Around line 488-497: The alert named LonghornEvictionWithDegradedVolumes
currently fires on the metric longhorn_disk_eviction_requested == 1 but does not
check for degraded volumes; either tighten the expr in the alert to also require
degraded-volume metrics (e.g., combine longhorn_disk_eviction_requested with the
appropriate degraded-volume metric/label such as longhorn_volume_degraded or a
sum/count of degraded volumes for the same node/disk) so the rule only fires
when eviction AND degraded volumes coincide, or rename the alert to
LonghornDiskEvictionActive and update summary/description/runbook to reflect
that it triggers on any active disk eviction; update the alert block references
to LonghornEvictionWithDegradedVolumes or the new name accordingly.
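A sketch of the tightened expression, assuming Longhorn exposes `longhorn_volume_robustness` (where a value of 2 indicates a degraded volume); the exact metric name and join labels are assumptions to verify against the deployed Longhorn version:

```hcl
# Hypothetical sketch: fire only when a disk eviction is requested AND at
# least one Longhorn volume is currently degraded.
expr = "longhorn_disk_eviction_requested == 1 and on() count(longhorn_volume_robustness == 2) > 0"
```

The alternative resolution named in the comment, renaming the alert to `LonghornDiskEvictionActive`, avoids any new metric dependency.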
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 97e6414e-2193-4de7-bfbb-4fa4d16fa40d
📒 Files selected for processing (6)
- modules/monitoring/examples/basic/main.tf
- modules/monitoring/main.tf
- modules/monitoring/outputs.tf
- modules/monitoring/templates/google-chat.tmpl
- modules/monitoring/variables.tf
- modules/monitoring/versions.tf
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modules/monitoring/README.md`:
- Around line 37-49: Several fenced code blocks in README.md are missing
language identifiers (markdownlint MD040); update each unlabeled triple-backtick
block used for ASCII diagrams/logs to include a language tag (e.g., "text") so
markdownlint stops warning. Locate the unlabeled fenced blocks that contain the
Prometheus→Alertmanager→calert diagram, the ASCII alert card sample, and the
alertmanager URL sample (the triple-backtick blocks surrounding those
ASCII/artifacts) and change them from ``` to ```text for each block.
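For instance, an unlabeled fence around the ASCII diagram would gain a `text` tag like this (the diagram content shown here is a placeholder, not the README's actual text):

````markdown
```text
Prometheus --> Alertmanager --> calert --> Google Chat
```
````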
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 676f8c62-5f73-4453-bcd7-5adfd91a1fc5
📒 Files selected for processing (1)
modules/monitoring/README.md
Purpose
Adds `modules/monitoring`, a reusable Terraform module that provisions a complete monitoring stack on top of the `rancher-monitoring` (kube-prometheus-stack) add-on present on every Harvester cluster.

Closes / relates to: wso2-enterprise/wso2-datacenter-project#29
Goals
Approach
Alertmanager config — direct Secret patch instead of AlertmanagerConfig CRD
`AlertmanagerConfig` v1alpha1 silently drops `googleChatConfigs` fields. The module patches the pre-existing `alertmanager-rancher-monitoring-alertmanager` Secret directly via a `null_resource` + `kubectl apply`. Prometheus Operator hot-reloads Alertmanager within ~30 s.
Alertmanager's
webhook_configsPOST to an in-clustercalertDeployment (ghcr.io/mr-karan/calert), which renders Google Chat Cards v2 from a Go template baked into the Secret atterraform applytime. Achecksum/configannotation on the Deployment triggers rolling restarts whenever the template changes.Resources created per environment
| Resource | Namespace |
| --- | --- |
| `{env}-harvester-storage-alerts` | `cattle-monitoring-system` |
| `{env}-harvester-vm-alerts` | `cattle-monitoring-system` |
| `{env}-harvester-node-alerts` | `cattle-monitoring-system` |
| `alertmanager-rancher-monitoring-alertmanager` | `cattle-monitoring-system` |
| `calert-config` | `cattle-monitoring-system` |
| `calert` | `cattle-monitoring-system` |
| `{env}-harvester-{storage,vm,node}-dashboard` | `cattle-dashboards` |

Release note
New module `modules/monitoring`. No changes to existing modules. Consumers can invoke it as:
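A minimal invocation along these lines; the input names come from the variables discussed in the review, while the source path and the context value are illustrative assumptions (the example under `modules/monitoring/examples/basic` uses a local relative source):

```hcl
module "monitoring" {
  # Illustrative source; use a released tag or a local relative path.
  source = "../.."

  environment        = "lk"
  kubeconfig_path    = "~/.kube/config"
  kubeconfig_context = "harvester-lk" # hypothetical context name
}
```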
Documentation
- `modules/monitoring/README.md`: architecture overview, design decisions, alert inventory, step-by-step guide to add new alerts, template evaluation model, full variable and output reference
- `modules/monitoring/examples/basic/main.tf`: minimal working example using a local relative source path

Summary by CodeRabbit
Release Notes
New Features
Documentation