Commit a1cc078

get slack integration working, with node down alert
1 parent 6340947 commit a1cc078

11 files changed: +126 -21 lines changed


ansible/roles/alertmanager/README.md

Lines changed: 68 additions & 0 deletions
@@ -4,3 +4,71 @@
 notes:
 - HA is not supported
 - state ("notification state and configured silences") is not preserved across rebuild
+- not used for caas
+- no dashboard
+
+
+
+## Role variables
+
+The following variables are equivalent to similarly-named arguments to the
+`alertmanager` binary. See `man alertmanager` for more info:
+
+- TODO:
+
+The following variables are templated into the alertmanager configuration file:
+
+- TODO:
+
+Other variables:
+- TODO:
+
+
+## TODO
+
+memory usage looks a bit close:
+
+```
+[root@RL9-control rocky]# free -h
+               total        used        free      shared  buff/cache   available
+Mem:           3.6Gi       2.4Gi       168Mi        11Mi       1.5Gi       1.2Gi
+Swap:             0B          0B          0B
+```
+
+
+
+## Slack Integration
+
+1. Create an app with a bot token:
+
+    - Go to https://api.slack.com/apps
+    - select "Create an App"
+    - select "From scratch"
+    - Set app name and workspace fields, select "Create"
+    - Fill out "Short description" and "Background color" fields, select "Save changes"
+    - Select "OAuth & Permissions" on left menu
+    - Under "Scopes : Bot Token Scopes", select "Add an OAuth Scope", add
+      `chat:write` and select "Save changes"
+    - Select "Install App" on left menu, select "Install to your-workspace", select Allow
+    - Copy the Bot User OAuth token shown
+
+2. Add the bot token into the config and enable Slack integration
+
+    - Open `environments/site/inventory/group_vars/all/vault_alertmanager.yml`
+    - Uncomment `vault_alertmanager_slack_integration_app_creds` and add the token
+    - Vault-encrypt that file:
+
+            ansible-vault encrypt environments/$ENV/inventory/group_vars/all/vault_alertmanager.yml
+
+    - Open `environments/site/inventory/group_vars/all/alertmanager.yml`
+    - Uncomment the config and set your alert channel name
+
+3. Invite the bot to your alerts channel
+    - In the appropriate Slack channel type:
+
+            /invite @YOUR_BOT_NAME
+
+
+## Adding Rules
+
+TODO: describe how prom config works
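
As a rough sketch of step 2 of the Slack integration instructions in the README above, the two site inventory files might look like this once edited. The variable names are taken from this commit; the token value is a placeholder (Slack bot tokens begin with `xoxb-`), and the first file is meant to be vault-encrypted before it is committed.

```yaml
# environments/site/inventory/group_vars/all/vault_alertmanager.yml
# Placeholder value only: paste the Bot User OAuth token here, then vault-encrypt the file.
vault_alertmanager_slack_integration_app_creds: xoxb-REPLACE-WITH-YOUR-BOT-TOKEN

# environments/site/inventory/group_vars/all/alertmanager.yml
alertmanager_slack_integration:
  channel: '#alerts'   # the channel the bot has been invited to
  app_creds: "{{ vault_alertmanager_slack_integration_app_creds }}"
```

Keeping the token in a separate vault file means the non-secret channel setting stays readable in plain text while only the credential is encrypted.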

ansible/roles/alertmanager/defaults/main.yml

Lines changed: 13 additions & 7 deletions
@@ -7,12 +7,18 @@ alertmanager_enabled: true

 alertmanager_system_user: alertmanager
 alertmanager_system_group: alertmanager
-alertmanager_config_path: /etc/alertmanager/alertmanager.yml
-alertmanager_storage_dir: /var/lib/alertmanager
-alertmanager_web_listen_addresses:
-  - ':9100'
-alertmanager_web_external_url: http://localhost:9093/
-alertmanager_config_flags: {}
+alertmanager_config_file: /etc/alertmanager/alertmanager.yml # --config.file: Alertmanager configuration file name
+alertmanager_storage_path: /var/lib/alertmanager # --storage.path: Base path for data storage
+
+alertmanager_port: '9093'
+alertmanager_web_listen_addresses: # elements of --web.listen-address
+  - ":{{ alertmanager_port }}"
+alertmanager_web_external_url: "http://localhost:{{ alertmanager_port}}/" # --web.external-url: The URL under which Alertmanager is externally reachable (for example, if Alertmanager is served via a reverse proxy). Used for generating relative and absolute links back to Alertmanager itself. If the URL has a path portion, it will be used to prefix all HTTP endpoints served by Alertmanager. If omitted, relevant URL components will be derived automatically
+# TODO: work out how we proxy this through ondemand
+
+alertmanager_data_retention: '120h' # --data.retention # How long to keep data for
+alertmanager_data_maintenance_interval: '15m' # --data.maintenance-interval: Interval between garbage collection and snapshotting to disk of the silences and the notification logs
+alertmanager_config_flags: {} # other command-line parameters as shown by `man alertmanager`
 # TODO: data retention?
 alertmanager_config_template: alertmanager.yml.j2

@@ -28,7 +34,7 @@ alertmanager_config_template: alertmanager.yml.j2
 alertmanager_default_receivers:
   - name: 'null'

-alertmanager_slack_receiver: {} # really defined in common as it needs prometheus_address
+alertmanager_slack_receiver: {} # defined in common env as it needs prometheus_address

 alertmanager_extra_receivers: "{{ [alertmanager_slack_receiver] if alertmanager_slack_integration is defined else [] }}"
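
For illustration only, a site environment could override some of the role defaults above in its own group_vars. The variable names are those defined in this commit; the values and the external URL are hypothetical.

```yaml
# e.g. environments/site/inventory/group_vars/all/alertmanager.yml
alertmanager_data_retention: '240h'             # rendered as --data.retention
alertmanager_data_maintenance_interval: '30m'   # rendered as --data.maintenance-interval
alertmanager_web_external_url: "https://alerts.example.org/"  # hypothetical externally reachable URL
```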

ansible/roles/alertmanager/tasks/configure.yml

Lines changed: 3 additions & 3 deletions
@@ -6,8 +6,8 @@
     group: "{{ alertmanager_system_group }}"
     mode: u=rwX,go=rX
   loop:
-    - "{{ alertmanager_config_path | dirname }}"
-    - "{{ alertmanager_storage_dir }}"
+    - "{{ alertmanager_config_file | dirname }}"
+    - "{{ alertmanager_storage_path }}"

 # TODO: selinux?

@@ -25,7 +25,7 @@
 - name: Template alertmanager config
   ansible.builtin.template:
     src: "{{ alertmanager_config_template }}"
-    dest: "{{ alertmanager_config_path }}"
+    dest: "{{ alertmanager_config_file }}"
     owner: "{{ alertmanager_system_user }}"
     group: "{{ alertmanager_system_group }}"
     mode: u=rw,go=r # TODO: check there are no sensitive things in here!

ansible/roles/alertmanager/templates/alertmanager.service.j2

Lines changed: 5 additions & 3 deletions
@@ -16,8 +16,10 @@ Group={{ alertmanager_system_group }}
 ExecReload=/bin/kill -HUP $MAINPID
 ExecStart={{ alertmanager_binary_dir }}/alertmanager \
   --cluster.listen-address='' \
-  --config.file={{ alertmanager_config_path }} \
-  --storage.path={{ alertmanager_storage_dir }} \
+  --config.file={{ alertmanager_config_file }} \
+  --storage.path={{ alertmanager_storage_path }} \
+  --data.retention={{ alertmanager_data_retention }} \
+  --data.maintenance-interval={{ alertmanager_data_maintenance_interval }} \
 {% for address in alertmanager_web_listen_addresses %}
   --web.listen-address={{ address }} \
 {% endfor %}
@@ -36,7 +38,7 @@ NoNewPrivileges=true
 MemoryDenyWriteExecute=true
 PrivateTmp=true
 ProtectHome=true
-ReadWriteDirectories={{ alertmanager_storage_dir }}
+ReadWriteDirectories={{ alertmanager_storage_path }}
 RemoveIPC=true
 RestrictSUIDSGID=true

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
 {{ ansible_managed | comment }}

-{{ alertmanager_config_default }}
-{{ alertmanager_config_extra }}
+{{ alertmanager_config_default | to_nice_yaml }}
+{{ alertmanager_config_extra | to_nice_yaml if alertmanager_config_extra | length > 0 else '' }}

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+
+groups:
+- name: Slurm
+  rules:
+  - alert: SlurmNodeDown
+    annotations:
+      description: '{{ $value }} Slurm nodes are in down status'
+      summary: 'At least one Slurm node is down.'
+    expr: "slurm_nodes_down > 0\n"
+    labels:
+      severity: critical
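
A further rule could follow the same shape as `SlurmNodeDown` above. The sketch below is hypothetical: the file name, group name, alert name, and the 10-minute threshold are all assumptions, and it reuses only the `slurm_nodes_down` metric from this commit. It also assumes `environments/site/inventory` is among the active inventory sources, in which case the `prometheus_alert_rules_files_inventory_glob` default set in `prometheus.yml` in this commit (`../files/prometheus/rules/*.rules`) would resolve to `environments/site/files/prometheus/rules/`.

```yaml
# environments/site/files/prometheus/rules/site.rules  (hypothetical file name)
groups:
- name: Site                            # hypothetical group name
  rules:
  - alert: SlurmNodesDownSustained      # hypothetical alert name
    expr: slurm_nodes_down > 0          # same metric as the SlurmNodeDown rule
    for: 10m                            # fire only if the condition holds for 10 minutes
    labels:
      severity: warning
    annotations:
      summary: 'Slurm nodes have been down for more than 10 minutes.'
      description: '{{ $value }} Slurm nodes are in down status'
```

The `for` clause is standard Prometheus rule syntax; it suppresses the alert during short-lived drops, such as a node rebooting.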

environments/common/inventory/group_vars/all/alertmanager.yml

Lines changed: 4 additions & 1 deletion
@@ -1,4 +1,7 @@
-alertmanager_slack_receiver:
+
+alertmanager_port: '9093' # defined here as required for prometheus
+
+alertmanager_slack_receiver: # defined here as needs prometheus address
   name: slack-receiver
   slack_configs:
     - channel: "{{ alertmanager_slack_integration.channel | default('none') }}"

environments/common/inventory/group_vars/all/defaults.yml

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ prometheus_address: "{{ hostvars[groups['prometheus'].0].api_address }}"
 openondemand_address: "{{ hostvars[groups['openondemand'].0].api_address if groups['openondemand'] | count > 0 else '' }}"
 grafana_address: "{{ hostvars[groups['grafana'].0].api_address }}"
 k3s_server_name: "{{ hostvars[groups['k3s_server'] | first].ansible_host }}"
-
+alertmanager_address: "{{ hostvars[groups['alertmanager'].0].api_address }}"
 ############################# bootstrap: local user configuration #########################

 # Note RockyLinux 8.5 defines system user/groups in range 201-999

environments/common/inventory/group_vars/all/prometheus.yml

Lines changed: 10 additions & 4 deletions
@@ -9,10 +9,16 @@ prometheus_storage_retention: "31d"
 prometheus_storage_retention_size: "100GB"
 prometheus_db_dir: "{{ appliances_state_dir | default('/var/lib') }}/prometheus"

-prometheus_alertmanager_config: []
-
-prometheus_alert_rules_files:
-  - "{{ appliances_repository_root }}/environments/common/files/prometheus/rules/*.rules"
+prometheus_alertmanager_config_default:
+  - static_configs:
+      - targets:
+          - "{{ alertmanager_address }}:{{ alertmanager_port }}"
+prometheus_alertmanager_config: "{{ prometheus_alertmanager_config_default if groups['alertmanager'] else {} }}"
+
+# by default, use rule files from the following path relative to current and all parent environment inventory directories:
+prometheus_alert_rules_files_inventory_glob: ../files/prometheus/rules/*.rules
+prometheus_alert_rules_files: "{{ ansible_inventory_sources | product([prometheus_alert_rules_files_inventory_glob]) | map('join', '/') | map('realpath') }}"
+# TODO: find a way to include/exclude files?

 prometheus_alert_rules: []
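
As a sketch of what the `prometheus_alertmanager_config_default` value above is intended to produce: assuming the Prometheus role templates `prometheus_alertmanager_config` into the `alerting.alertmanagers` section of the Prometheus configuration (the template is not part of this commit), the result would look roughly like this. The address `10.0.0.10` is a made-up stand-in for the `api_address` of the first host in the `alertmanager` group.

```yaml
# Fragment of the rendered Prometheus configuration (illustrative values)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "10.0.0.10:9093"   # alertmanager_address:alertmanager_port
```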

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+# Uncomment below and add Slack bot app creds in the adjacent file
+# vault_alertmanager.yml for Slack integration:
+#
+# alertmanager_slack_integration:
+#   channel: '#alerts'
+#   app_creds: "{{ vault_alertmanager_slack_integration_app_creds }}"
