@@ -312,9 +312,86 @@ Integer number followed by one of "s, m, h, d, w, M, y"
312312 - role: axonops.axonops.configurations
313313```
314314
315+ # ## Metric Alerts
316+ Metric alerts can be configured by providing a YAML file called `metric_alert_rules.yml` in the directory
317+ `config/[YOUR_ORG_NAME]`
318+ to make them available for all clusters in the organization, or in `config/[YOUR_ORG_NAME]/[YOUR_CLUSTER_NAME]` to make
319+ them available for a specific cluster.
320+
321+ The file is optional; if it is not provided, no metric alerts are configured.
322+ The format of the file is as follows:
323+
324+ ```yaml
325+ metric_alert_rules:
326+   - name: name of the check
327+     dashboard: dashboard name
328+     chart: chart name
329+     operator: '>='
330+     critical_value: 2
331+     warning_value: 1
332+     duration: 15m
333+     description: my example metric alert
334+     enabled: true
335+ ```
336+
337+ The variable `metric_alert_rules` is a list of metric alert definitions. The variable is optional.
338+
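The two file locations described at the start of this section can be enumerated with a short helper. This is only an illustrative sketch: the function name `candidate_rule_files` and the default `config` root are assumptions made for the example, and it does not model how the tooling combines org-level and cluster-level rules.

```python
from pathlib import PurePosixPath

RULES_FILENAME = "metric_alert_rules.yml"


def candidate_rule_files(org, cluster, config_root="config"):
    """Return the org-level and cluster-level rule file paths, in that order.

    Hypothetical helper: it only builds the paths named in the text above
    and does not check whether the files exist.
    """
    root = PurePosixPath(config_root)
    return [
        root / org / RULES_FILENAME,            # applies to every cluster in the org
        root / org / cluster / RULES_FILENAME,  # applies to one specific cluster
    ]
```

For example, `candidate_rule_files("myorg", "mycluster")` yields `config/myorg/metric_alert_rules.yml` followed by `config/myorg/mycluster/metric_alert_rules.yml`.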
339+ # ### List of parameters for metric_alert_rules
340+
341+ | Parameter | Description | Type | Default |
342+ |--------------|------------------------------------------------------------------------------------------------|---------|---------|
343+ | `name` | Name of the alert | String | |
344+ | `description`| Description of the alert | String | |
345+ | `chart` | Name of the chart to monitor | String | |
346+ | `dashboard` | Name of the dashboard containing the chart | String | |
347+ | `operator` | Comparison operator for the alert condition. Accepted values: '==', '>=', '>', '<=', '<', '!='. | String | |
348+ | `critical_value` | Value to trigger a critical alert | Float | |
349+ | `warning_value` | Value to trigger a warning alert | Float | |
350+ | `duration` | Duration for which the condition must be met before triggering the alert | String | |
351+ | `enabled` | Whether the alert is enabled or not | Boolean | True |
352+
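Before committing a rules file, it can help to sanity-check each rule against the table above. The following is a minimal sketch, not part of the collection: `validate_rule` is a hypothetical helper, the set of required fields is an assumption (the table does not say which fields are mandatory), and the duration pattern assumes `duration` uses the "integer followed by one of s, m, h, d, w, M, y" format described elsewhere in this document.

```python
import re

# Operators listed in the parameter table.
ALLOWED_OPERATORS = {"==", ">=", ">", "<=", "<", "!="}
# Assumption: duration is an integer followed by one of s, m, h, d, w, M, y.
DURATION_RE = re.compile(r"\d+[smhdwMy]")
# Assumption: this sketch treats these fields as required.
REQUIRED_FIELDS = ("name", "dashboard", "chart", "operator",
                   "critical_value", "warning_value", "duration")


def validate_rule(rule):
    """Return a list of problems found in one metric alert rule (a dict)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in rule]
    if "operator" in rule and rule["operator"] not in ALLOWED_OPERATORS:
        problems.append(f"unsupported operator: {rule['operator']!r}")
    if "duration" in rule and not DURATION_RE.fullmatch(str(rule["duration"])):
        problems.append(f"invalid duration: {rule['duration']!r}")
    for key in ("critical_value", "warning_value"):
        if key in rule:
            value = rule[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{key} must be a number")
    if not isinstance(rule.get("enabled", True), bool):
        problems.append("enabled must be a boolean")
    return problems
```

An empty list means the rule passed these checks; it does not guarantee that the named dashboard and chart exist.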
353+ # ### Check for DOWN nodes
354+
355+ This example triggers a critical alert when the number of DOWN endpoints seen from a node's point of view
356+ is greater than or equal to 2, and a warning alert when it is greater than or equal to 1, once the condition has held for 15 minutes.
357+ ```yaml
358+ metric_alert_rules:
359+   - name: DOWN count per node
360+     dashboard: Overview
361+     chart: Number of Endpoints Down Per Node Point Of View
362+     operator: '>='
363+     critical_value: 2
364+     warning_value: 1
365+     duration: 15m
366+     description: Detected DOWN nodes
368+ ```
369+
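To make the interplay of `operator`, `critical_value`, and `warning_value` concrete, the sketch below classifies a single metric sample against a rule. It is an illustration only: `classify` is a hypothetical function, it assumes the critical threshold is checked before the warning threshold, and it ignores `duration` (the real alert fires only after the condition has held for that long).

```python
import operator

# Map the operator strings accepted in the rules file to Python comparisons.
OPERATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "!=": operator.ne,
}


def classify(value, rule):
    """Return 'critical', 'warning', or 'ok' for one metric sample.

    Assumption: critical is evaluated first, so a sample that satisfies
    both thresholds is reported as critical.
    """
    compare = OPERATORS[rule["operator"]]
    if compare(value, rule["critical_value"]):
        return "critical"
    if compare(value, rule["warning_value"]):
        return "warning"
    return "ok"
```

With the DOWN nodes rule above (`operator: '>='`, `critical_value: 2`, `warning_value: 1`), a sample of 0 is classified as `ok`, 1 as `warning`, and 2 or more as `critical`.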
370+ # ### Check for High Disk Utilization
371+ This example triggers a critical alert when the disk usage percentage for any mount point
372+ is greater than or equal to 90%, and a warning alert when it is greater than or equal to 75%, once the condition has held for 12 hours.
373+
374+ ```yaml
375+ metric_alert_rules:
376+   - name: Disk % Usage $mountpoint
377+     dashboard: System
378+     chart: Disk % Usage $mountpoint
379+     operator: '>='
380+     critical_value: 90
381+     warning_value: 75
382+     duration: 12h
383+     description: Detected High disk utilization
384+ ```
385+
386+ **Note:** More examples of metric alerts can be found in the org-level
387+ [metric_alert_rules.yml](../../examples/configurations/config/REPLACE_WITH_ORG_NAME/metric_alert_rules.yml) or the cluster-level
388+ [metric_alert_rules.yml](../../examples/configurations/config/REPLACE_WITH_ORG_NAME/REPLACE_WITH_CLUSTER_NAME/metric_alert_rules.yml)
389+ example files.
390+
391+
315392# ## Service Checks
316393
317- Service checks can be configured by providing YAML a file called `service_checks.yml` in the directory
394+ Service checks can be configured by providing a YAML file called `service_checks.yml` in the directory
318395`config/[YOUR_ORG_NAME]`
319396to make them available for all clusters in the organization, or in `config/[YOUR_ORG_NAME]/[YOUR_CLUSTER_NAME]` to make
320397them available for a specific cluster.