Commit 794d0ec

Merge pull request #49 from axonops/add_metric_doc
Add metric doc
2 parents 10b5b56 + 96f306e commit 794d0ec

3 files changed: 82 additions, 3 deletions

docs/roles/configurations.md

Lines changed: 78 additions & 1 deletion
@@ -312,9 +312,86 @@ Integer number followed by one of "s, m, h, d, w, M, y"
 - role: axonops.axonops.configurations
 ```

+### Metric Alerts
+Metric alerts can be configured by providing a YAML file called `metric_alert_rules.yml` in the directory
+`config/[YOUR_ORG_NAME]`
+to make them available for all clusters in the organization, or in `config/[YOUR_ORG_NAME]/[YOUR_CLUSTER_NAME]` to make
+them available for a specific cluster.
+
+The file is optional. If it is not provided, no metric alerts are configured.
+The format of the file is as follows:
+
+```yaml
+metric_alert_rules:
+  - name: name of the check
+    dashboard: dashboard name
+    chart: chart name
+    operator: '>='
+    critical_value: 2
+    warning_value: 1
+    duration: 15m
+    description: my example metric alert
+    enabled: true
+```
+
+The variable `metric_alert_rules` is a list of metric alert definitions. The variable is optional.
+
+#### List of parameters for metric_alert_rules
+
+| Parameter | Description | Type | Default |
+|--------------|------------------------------------------------------------------------------------------------|---------|---------|
+| `name` | Name of the alert | String | |
+| `description`| Description of the alert | String | |
+| `chart` | Name of the chart to monitor | String | |
+| `dashboard` | Name of the dashboard containing the chart | String | |
+| `operator` | Comparison operator for the alert condition. Accepted values: '==', '>=', '>', '<=', '<', '!='. | String | |
+| `critical_value` | Value to trigger a critical alert | Float | |
+| `warning_value` | Value to trigger a warning alert | Float | |
+| `duration` | Duration for which the condition must be met before triggering the alert | String | |
+| `enabled` | Whether the alert is enabled or not | Boolean | True |
+
+#### Check for DOWN nodes
+
+This example triggers a critical alert when the number of DOWN nodes per cluster has been greater than or equal to 2
+for 15 minutes, and a warning alert when it has been greater than or equal to 1 for the same duration.
+```yaml
+metric_alert_rules:
+  - name: DOWN count per node
+    dashboard: Overview
+    chart: Number of Endpoints Down Per Node Point Of View
+    operator: '>='
+    critical_value: 2
+    warning_value: 1
+    duration: 15m
+    description: Detected DOWN nodes
+
+```
+
+#### Check for High Disk Utilization
+This example triggers a critical alert when the disk usage percentage for any mount point has been greater than or
+equal to 90% for 12 hours, and a warning alert when it has been greater than or equal to 75% for the same duration.
+
+```yaml
+metric_alert_rules:
+  - name: Disk % Usage $mountpoint
+    dashboard: System
+    chart: Disk % Usage $mountpoint
+    operator: '>='
+    critical_value: 90
+    warning_value: 75
+    duration: 12h
+    description: Detected High disk utilization
+```
+
+**Note:** More examples of metric alerts can be found in the org-level
+[metric_alert_rules.yml](../../examples/configurations/config/REPLACE_WITH_ORG_NAME/metric_alert_rules.yml) or the cluster-level
+[metric_alert_rules.yml](../../examples/configurations/config/REPLACE_WITH_ORG_NAME/REPLACE_WITH_CLUSTER_NAME/metric_alert_rules.yml)
+example files.
+
+
 ### Service Checks

-Service checks can be configured by providing YAML a file called `service_checks.yml` in the directory
+Service checks can be configured by providing a YAML file called `service_checks.yml` in the directory
 `config/[YOUR_ORG_NAME]`
 to make them available for all clusters in the organization, or in `config/[YOUR_ORG_NAME]/[YOUR_CLUSTER_NAME]` to make
 them available for a specific cluster.
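
The parameter table in the Metric Alerts section lends itself to a quick pre-flight check before the role is run. The sketch below is not part of the collection: `validate_rule`, `ALLOWED_OPERATORS`, `DURATION_RE` and `DOWN_RULE` are hypothetical names, `name` is assumed to be required (only `dashboard` and `chart` are marked required in `alert_rule.py`), and the duration pattern assumes that the "integer followed by one of s, m, h, d, w, M, y" format applies to `duration`, which this commit does not confirm.

```python
import re

# Operators listed in the parameter table of the Metric Alerts section.
ALLOWED_OPERATORS = {"==", ">=", ">", "<=", "<", "!="}

# Assumption: `duration` is an integer followed by one of s, m, h, d, w, M, y.
DURATION_RE = re.compile(r"[0-9]+[smhdwMy]")


def _is_number(value):
    """True for int or float, excluding bool (which is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_rule(rule):
    """Hypothetical helper: return a list of problems found in one rule dict."""
    problems = []
    # `dashboard` and `chart` are required by the module; `name` is assumed to be.
    for key in ("name", "dashboard", "chart"):
        value = rule.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{key}: expected a non-empty string")
    op = rule.get("operator")
    if op is not None and op not in ALLOWED_OPERATORS:
        problems.append(f"operator: {op!r} is not an accepted value")
    for key in ("warning_value", "critical_value"):
        if key in rule and not _is_number(rule[key]):
            problems.append(f"{key}: expected a number")
    duration = rule.get("duration")
    if duration is not None and not DURATION_RE.fullmatch(str(duration)):
        problems.append(f"duration: {duration!r} does not look like 15m or 12h")
    # `enabled` defaults to True when omitted, as documented in the table.
    if not isinstance(rule.get("enabled", True), bool):
        problems.append("enabled: expected a boolean")
    return problems


# The "Check for DOWN nodes" example from the documentation, as a Python dict.
DOWN_RULE = {
    "name": "DOWN count per node",
    "dashboard": "Overview",
    "chart": "Number of Endpoints Down Per Node Point Of View",
    "operator": ">=",
    "critical_value": 2,
    "warning_value": 1,
    "duration": "15m",
    "description": "Detected DOWN nodes",
}

print(validate_rule(DOWN_RULE))  # prints []
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a rules file in a single pass.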

examples/configurations/config/REPLACE_WITH_ORG_NAME/REPLACE_WITH_CLUSTER_NAME/service_checks.yml

Lines changed: 2 additions & 0 deletions
@@ -127,6 +127,8 @@ axonops_shell_check:
 exit $EXIT_CRITICAL
 fi

+echo "No schema disagreement detected"
+
 exit $EXIT_OK

 # This check verifies that all files and directories under the Cassandra data directory

plugins/modules/alert_rule.py

Lines changed: 2 additions & 2 deletions
@@ -74,8 +74,8 @@

 import re
 import uuid
-from ansible.module_utils.basic import AnsibleModule

+from ansible.module_utils.basic import AnsibleModule
 from ansible_collections.axonops.axonops.plugins.module_utils.axonops import AxonOps
 from ansible_collections.axonops.axonops.plugins.module_utils.axonops_utils import dicts_are_different, \
     find_by_field, get_integration_id_by_name, get_value_by_name, make_module_args, normalize_numbers
@@ -88,7 +88,7 @@ def run_module():
 'dashboard': {'type': 'str', 'required': True},
 'chart': {'type': 'str', 'required': True},
 'metric': {'type': 'str', 'default': ''},
-'operator': {'type': 'str', 'choices': ['=', '>=', '>', '<=', '<', '!=']},
+'operator': {'type': 'str', 'choices': ['==', '>=', '>', '<=', '<', '!=']},
 'warning_value': {'type': 'float'},
 'critical_value': {'type': 'float'},
 'duration': {'type': 'str'},
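
To illustrate what the operator list means in practice, here is a minimal sketch of how a rule's thresholds could be compared against a single sample value. It only mirrors the semantics described in the documentation, checking the critical threshold before the warning one. How AxonOps evaluates rules on the server side is not shown in this commit, `classify` and `COMPARATORS` are hypothetical names, and the `duration` window is ignored.

```python
import operator

# The operator strings accepted by the argument spec after this commit,
# mapped to the matching Python comparison functions.
COMPARATORS = {
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "!=": operator.ne,
}


def classify(value, op, warning_value, critical_value):
    """Hypothetical helper: return 'critical', 'warning' or 'ok' for one sample."""
    if op not in COMPARATORS:
        # '=' was accepted before this commit and is rejected now.
        raise ValueError(f"unsupported operator: {op!r}")
    compare = COMPARATORS[op]
    # Check the more severe threshold first so it wins when both match.
    if compare(value, critical_value):
        return "critical"
    if compare(value, warning_value):
        return "warning"
    return "ok"


# Thresholds from the "Check for DOWN nodes" example: warning at 1, critical at 2.
for down_nodes in (0, 1, 2):
    print(down_nodes, classify(down_nodes, ">=", 1, 2))
```

With these thresholds, zero DOWN nodes is classified as `ok`, one as `warning`, and two as `critical`.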
