Commit 7bab7f6

Merge pull request #48 from axonops/fix_shell_checks
fix shell checks, add documentation and example
2 parents e185a50 + 7025fbb commit 7bab7f6

File tree

19 files changed

+673
-467
lines changed

docs/roles/configurations.md

Lines changed: 124 additions & 29 deletions
@@ -2,7 +2,9 @@
22

33
## Overview
44

5-
The `configurations` role configures alerts, integrations, and monitoring settings for your AxonOps deployment. This role manages metric alerts, backup configurations, service checks, integration with notification services (Slack, PagerDuty), log alerts, and custom dashboards.
5+
The `configurations` role configures alerts, integrations, and monitoring settings for your AxonOps deployment. This
6+
role manages metric alerts, backup configurations, service checks, integration with notification services (Slack,
7+
PagerDuty), log alerts, and custom dashboards.
68

79
## Requirements
810

@@ -14,20 +16,20 @@ The `configurations` role configures alerts, integrations, and monitoring settin
1416

1517
### Required Variables
1618

17-
| Variable | Description | Example |
18-
|----------|-------------|---------|
19-
| `org` | Organization name in AxonOps | `mycompany` |
19+
| Variable | Description | Example |
20+
|-----------|--------------------------------------|----------------------|
21+
| `org` | Organization name in AxonOps | `mycompany` |
2022
| `cluster` | Cluster name to configure alerts for | `production-cluster` |
2123

2224
**Note**: These variables can also be set via environment variables `AXONOPS_ORG` and `AXONOPS_CLUSTER`.
2325

2426
### Optional Feature Flags
2527

26-
| Variable | Description | Default |
27-
|----------|-------------|---------|
28-
| `adaptive_repair` | Configuration for adaptive repair settings | undefined |
29-
| `agent_disconnection_tolerance` | Agent disconnection tolerance settings | undefined |
30-
| `human_readableid` | Human-readable ID configuration | undefined |
28+
| Variable | Description | Default |
29+
|---------------------------------|--------------------------------------------|-----------|
30+
| `adaptive_repair` | Configuration for adaptive repair settings | undefined |
31+
| `agent_disconnection_tolerance` | Agent disconnection tolerance settings | undefined |
32+
| `human_readableid` | Human-readable ID configuration | undefined |
3133

3234
## Dependencies
3335

@@ -124,6 +126,7 @@ This role requires a running AxonOps Server with API access.
124126
roles:
125127
- role: axonops.axonops.configurations
126128
```
129+
127130
## Details playbook
128131
129132
### Adaptive Repair Configuration
@@ -132,7 +135,9 @@ The Adaptive Repair feature can be configured by setting the `adaptive_repair` v
132135
no need for files in the `config` directory.
133136

134137
This allows you to enable or disable adaptive repair settings for your cluster.
138+
135139
#### List of Parameters
140+
136141
| Parameter | Description | Type | Default |
137142
|-----------------------|----------------------------------------------------------------------------------|---------|---------|
138143
| `enabled` | Enable or disable adaptive repair | boolean | `true` |
@@ -175,7 +180,9 @@ This allows you to enable or disable adaptive repair settings for your cluster.
175180
```
176181

177182
#### Set GC Grace Threshold
178-
Set the GC grace period. AxonOps will ignore tables that have a `gc_grace_seconds` value lower than the specified threshold.
183+
184+
Set the GC grace period. AxonOps will ignore tables that have a `gc_grace_seconds` value lower than the specified
185+
threshold.
179186
The default is `86400` seconds (1 day).
180187

181188
```yaml
@@ -195,6 +202,7 @@ The default is `86400` seconds (1 day).
195202
#### Set Table Parallelism
196203

197204
It is suggested to keep this value at least as high as the number of tables in the cluster.
205+
198206
```yaml
199207
- name: Set Table Parallelism for Adaptive Repair
200208
hosts: localhost
@@ -218,8 +226,8 @@ It is suggested to keep this value at least as the number of table in the cluste
218226
org: mycompany
219227
cluster: production-cluster
220228
adaptive_repair:
221-
enabled: true
222-
segmentretries: 10
229+
enabled: true
230+
segmentretries: 10
223231
224232
roles:
225233
- role: axonops.axonops.configurations
@@ -228,6 +236,7 @@ It is suggested to keep this value at least as the number of table in the cluste
228236
#### Set Segment Target Size
229237

230238
Accepts a number from 16 to 10240.
239+
231240
```yaml
232241
- name: Set Segment Target Size for Adaptive Repair
233242
hosts: localhost
@@ -243,6 +252,7 @@ Number from 16 to 10240
243252
```
244253

245254
#### Exclude Tables from Adaptive Repair
255+
246256
List of tables to exclude from adaptive repair. The accepted format is a list of strings in the form "keyspace.table".
247257
To exclude an entire keyspace, use "keyspace.*".
248258
The default is an empty list.
@@ -256,15 +266,16 @@ The default is an empty list.
256266
adaptive_repair:
257267
enabled: true
258268
excludedtables:
259-
- "system.peers"
260-
- "system.local"
269+
- "system.peers"
270+
- "system.local"
261271
262272
263273
roles:
264274
- role: axonops.axonops.configurations
265275
```
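
The `keyspace.*` form described above can be sketched as follows. This is a minimal illustration, not a tested playbook: `my_keyspace` and `other_keyspace.events` are hypothetical names, and placing the variables under `vars:` is an assumption about the playbook layout.

```yaml
- name: Exclude a whole keyspace from Adaptive Repair
  hosts: localhost
  vars:                              # assumption: variables are set at play level
    org: mycompany
    cluster: production-cluster
    adaptive_repair:
      enabled: true
      excludedtables:
        - "my_keyspace.*"            # hypothetical keyspace, every table excluded
        - "other_keyspace.events"    # hypothetical single table

  roles:
    - role: axonops.axonops.configurations
```

Quoting each entry keeps YAML from interpreting the `*` as an alias indicator.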
266276

267277
#### Set Maximum Segments per Table
278+
268279
Set the maximum number of segments per table to repair in a single repair cycle.
269280
Having too many segments in a table causes too many repair commands to be sent.
270281

@@ -283,6 +294,7 @@ Having too many segments in a table causes too many repair commands to be sent.
283294
```
284295

285296
#### Set Segment Timeout
297+
286298
Set the timeout for each segment repair operation.
287299
The value is an integer followed by one of the unit suffixes "s, m, h, d, w, M, y".
288300

@@ -299,25 +311,106 @@ Integer number followed by one of "s, m, h, d, w, M, y"
299311
roles:
300312
- role: axonops.axonops.configurations
301313
```
314+
315+
### Service Checks
316+
317+
Service checks can be configured by providing a YAML file called `service_checks.yml` in the directory
318+
`config/[YOUR_ORG_NAME]`
319+
to make them available for all clusters in the organization, or in `config/[YOUR_ORG_NAME]/[YOUR_CLUSTER_NAME]` to make
320+
them available for a specific cluster.
321+
322+
The file is optional. If it is not provided, no service checks are configured.
323+
324+
The format of the file is as follows:
325+
326+
```yaml
327+
axonops_shell_check: [ ]
328+
329+
axonops_tcp_check: [ ]
330+
```
331+
332+
Both `axonops_shell_check` and `axonops_tcp_check` are optional.
333+
334+
#### List of Parameters for axonops_shell_check
335+
336+
| Parameter | Description | Type | Default |
337+
|------------|---------------------------------------|---------|-------------|
338+
| `name` | Name of the shell check | String | |
339+
| `present` | Whether the check is present or not | Boolean | True |
340+
| `interval` | How often the check needs to run | String | |
341+
| `timeout` | Timeout for the check | String | |
342+
| `shell` | Shell used by the script | String | '/bin/bash' |
343+
| `script` | Script of the check | String | |
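
As a sketch of how the `present` flag might be used: the table above does not spell out its effect, so this assumes that `present: false` removes a previously configured check on the next run. The check name is hypothetical, and the remaining fields are kept only in case the role still requires them.

```yaml
axonops_shell_check:
  - name: "Legacy disk check"   # hypothetical name of an existing check
    present: false              # assumption: the check is removed when false
    interval: "5m"
    timeout: "10s"
    script: |
      exit 0
```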
344+
345+
List of outcome codes for shell checks:
346+
347+
- `0`: OK
348+
- `1`: WARNING
349+
- `2`: CRITICAL
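
The outcome codes above can be illustrated with a check that uses all three. This is a minimal sketch, not one of the shipped examples: the check name and the 80% and 90% thresholds are assumptions chosen for illustration.

```yaml
axonops_shell_check:
  - name: "Root filesystem usage"   # hypothetical check name
    present: true
    interval: "5m"
    timeout: "30s"
    shell: "/bin/bash"
    script: |
      set -euo pipefail
      # Percentage of the root filesystem in use, with the trailing % removed.
      used=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')
      if [ "$used" -ge 90 ]; then     # assumed critical threshold
        echo "CRITICAL: / is ${used}% full"
        exit 2
      elif [ "$used" -ge 80 ]; then   # assumed warning threshold
        echo "WARNING: / is ${used}% full"
        exit 1
      fi
      echo "OK: / is ${used}% full"
      exit 0
```

Ending with an explicit `exit 0` makes the OK outcome visible rather than relying on the status of the last command.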
350+
351+
#### Dummy example of axonops_shell_check
352+
353+
This is an example of a dummy shell check that always returns CRITICAL:
354+
355+
```yaml
356+
axonops_shell_check:
357+
- name: "Dummy check"
358+
present: true
359+
interval: "5m"
360+
timeout: "10s"
361+
script: |
362+
#!/bin/bash
363+
echo "This is a dummy check"
364+
exit 2
365+
```
366+
367+
#### Example of a shell check to monitor if a Debian/Ubuntu host needs a reboot
368+
369+
This check looks for the presence of the file `/var/run/reboot-required`, which is created by the system when a reboot
370+
is needed after package installations or updates.
371+
372+
```yaml
373+
axonops_shell_check:
374+
- name: Debian / Ubuntu - Check host needs reboot
375+
interval: 12h
376+
present: true
377+
timeout: 1m
378+
script: |-
379+
set -euo pipefail
380+
381+
if [ -f /var/run/reboot-required ]
382+
then
383+
echo "$(hostname) Reboot required"
384+
exit 1
385+
else
386+
echo "Nothing to do"
387+
fi
388+
```
389+
390+
**Note:** More examples of service checks can be found in the org level
391+
[service_checks.yml](../../examples/configurations/config/REPLACE_WITH_ORG_NAME/service_checks.yml) or the cluster level
392+
[service_checks.yml](../../examples/configurations/config/REPLACE_WITH_ORG_NAME/REPLACE_WITH_CLUSTER_NAME/service_checks.yml)
393+
example files.
394+
302395
## Available Tags
303396

304397
The role supports granular control through the following tags:
305398

306-
| Tag | Description |
307-
|-----|-------------|
308-
| `metrics` | Configure metric alerts |
309-
| `backups` | Configure backup settings |
310-
| `service_checks` | Configure service check alerts |
311-
| `slack` | Configure Slack integration |
312-
| `pagerduty_integration` | Configure PagerDuty integration |
313-
| `adaptive_repair` | Configure adaptive repair settings |
399+
| Tag | Description |
400+
|---------------------------------|-----------------------------------------|
401+
| `metrics` | Configure metric alerts |
402+
| `backups` | Configure backup settings |
403+
| `service_checks` | Configure service check alerts |
404+
| `slack` | Configure Slack integration |
405+
| `pagerduty_integration` | Configure PagerDuty integration |
406+
| `adaptive_repair` | Configure adaptive repair settings |
314407
| `agent_disconnection_tolerance` | Configure agent disconnection tolerance |
315-
| `commitlogs_archive` | Configure commit log archiving |
316-
| `human_readableid` | Configure human-readable IDs |
317-
| `log_alerts` | Configure log-based alerts |
318-
| `logcollector` | Configure log collector |
319-
| `dashboards` | Import custom dashboards |
320-
| `routes` | Configure alert routing rules |
408+
| `commitlogs_archive` | Configure commit log archiving |
409+
| `human_readableid` | Configure human-readable IDs |
410+
| `log_alerts` | Configure log-based alerts |
411+
| `logcollector` | Configure log collector |
412+
| `dashboards` | Import custom dashboards |
413+
| `routes` | Configure alert routing rules |
321414

322415
## Tasks Overview
323416

@@ -359,11 +452,13 @@ The role performs the following tasks based on the enabled tags:
359452
- **API Access**: Ensure you have proper API credentials configured for the AxonOps Server
360453
- **Organization and Cluster**: The `org` and `cluster` variables must match existing entries in your AxonOps deployment
361454
- **Idempotency**: The role is designed to be idempotent and can be run multiple times safely
362-
- **Configuration Files**: Alert definitions can be customized by providing your own configuration files in the appropriate directories
455+
- **Configuration Files**: Alert definitions can be customized by providing your own configuration files in the
456+
appropriate directories
363457

364458
## Additional Resources
365459

366460
For more information about AxonOps alerts and configuration, see:
461+
367462
- [ALERTS.md](../../ALERTS.md) in the repository root
368463
- [AxonOps Documentation](https://docs.axonops.com/)
369464

examples/alerts/config/REPLACE_WITH_ORG_NAME/REPLACE_WITH_CLUSTER_NAME/service_checks.yml

Lines changed: 0 additions & 62 deletions
This file was deleted.
