
Commit e5dff96

committed
updated docs
1 parent e6a4e4b commit e5dff96

1 file changed: +25 −33 lines

docs/monitoring-and-logging.README.md

Lines changed: 25 additions & 33 deletions
@@ -2,6 +2,9 @@
 ## Components overview

+### [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
+An umbrella Helm chart which the appliance uses to deploy and manage containerised versions of Grafana and Prometheus.
+
 ### [filebeat](https://www.elastic.co/beats/filebeat)

 Parses log files and ships them to elasticsearch. Note we use the version shipped by Open Distro.
@@ -85,18 +88,15 @@ This section details the configuration of grafana.
 ### Defaults

-Internally, we use the [cloudalchemy.grafana](https://github.com/cloudalchemy/ansible-grafana) role. You can customise any of the variables that the role supports. For a full list, please see the
-[upstream documentation](https://github.com/cloudalchemy/ansible-grafana). The appliance defaults can be found here:
-
-> [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml)
+Internally, we configure Grafana using the [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) Helm chart, which passes values to a [Grafana subchart](https://github.com/grafana/helm-charts/tree/main/charts/grafana). Common configuration options for the chart are exposed in [ansible/roles/kube-prometheus-stack/defaults/main/main.yml](../ansible/roles/kube-prometheus-stack/defaults/main/main.yml), with sensible defaults for Grafana set in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). For more control over configuration, the exact values that will be merged with the Helm chart defaults can be found in [ansible/roles/kube-prometheus-stack/defaults/main/helm.yml](../ansible/roles/kube-prometheus-stack/defaults/main/helm.yml).
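As a minimal sketch of how this layering is used (the variable name `grafana_port` is taken from the Access section below; the environment path is illustrative), an environment would normally override an exposed option in its own group vars rather than editing the common defaults:

```yaml
# environments/<your-env>/inventory/group_vars/all/grafana.yml -- illustrative path
# Override an option exposed by the kube-prometheus-stack role; the role merges
# these values into the Helm chart defaults described above.
grafana_port: 30001  # port Grafana is served on (appliance default, per the Access section)
```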
 ### Placement

-The `grafana` group controls the placement of the grafana service. Load balancing is currently unsupported so it is important that you only assign one host to this group.
+The `prometheus` group controls the placement of the Kubernetes monitoring stack. Load balancing is currently unsupported, so it is important that you only assign one host to this group.

 ### Access

-If Open Ondemand is enabled then by default this is used to proxy Grafana, otherwise Grafana is accessed through the first . See `grafana_url` in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). The port used (variable `grafana_port`) defaults to `3000`.
+If Open OnDemand is enabled then by default this is used to proxy Grafana; otherwise Grafana is accessed through the first host in the `prometheus` group (note that currently there is no support for load balancing, so only one host should be in this group; the control node is used by default). See `grafana_url` in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). The port used (variable `grafana_port`) defaults to `30001`.

 The default credentials for the admin user are:
@@ -105,11 +105,13 @@ The default credentials for the admin user are:
 Where `vault_grafana_admin_password` is a variable containing the actual password. This is generated by the `generate-passwords.yml` adhoc playbook (see [README.md](../README.md#creating-a-slurm-appliance)).

+Note that if Open OnDemand is enabled, Grafana is only accessible through OOD's proxy. Requests to `grafana_url_direct` will be redirected through the proxy, which will ask you to authenticate against Open OnDemand (NOT Grafana credentials). See the [Open OnDemand docs](openondemand.README.md).
+
 ### grafana dashboards

-The appliance ships with a default set of dashboards. The set of dashboards can be configured via the `grafana_dashboards` variable. The dashboards are either internal to the [grafana-dashboards role](../ansible/roles/grafana-dashboards/files/) or downloaded from grafana.com.
+In addition to the default set of dashboards deployed by kube-prometheus-stack, the appliance ships with its own default set of dashboards (listed below). The set of appliance-specific dashboards can be configured via the `grafana_dashboards` variable. The dashboards are either internal to the [grafana-dashboards role](../ansible/roles/grafana-dashboards/files/) or downloaded from grafana.com.
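As an illustration only of the sort of thing `grafana_dashboards` might contain (the key names below are an assumption, not the appliance's actual schema; check the defaults in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml) before copying):

```yaml
# Hypothetical shape of grafana_dashboards entries -- key names are assumptions,
# verify against the appliance defaults in group_vars/all/grafana.yml.
grafana_dashboards:
  - dashboard_id: 1860      # a dashboard published on grafana.com (illustrative ID)
    revision_id: 1
    datasource: prometheus
  - dashboard_file: openhpc-slurm.json  # a dashboard shipped in the grafana-dashboards role (illustrative filename)
```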
-#### node exporter
+#### node exporter slurm

 This shows detailed metrics about an individual host. The metric source is `node exporter` (See [prometheus section](#prometheus-1) for more details). A slurm job annotation can optionally be enabled which will highlight the period of time where a given slurm job was running. The slurm job that is highlighted is controlled by the `Slurm Job ID` variable. An example is shown below:
@@ -210,42 +212,32 @@ This section details the configuration of prometheus.
 ### Defaults

-Internally, we use the [cloudalchemy.prometheus](https://github.com/cloudalchemy/ansible-prometheus) role. You can customise any of the variables that the role supports. For a full list, please see the
-[upstream documentation](https://github.com/cloudalchemy/ansible-prometheus). The appliance defaults can be found here:
-
-> [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)
+Like Grafana, Prometheus is internally configured using kube-prometheus-stack, with role variables in [ansible/roles/kube_prometheus_stack/defaults/main](../ansible/roles/kube_prometheus_stack/defaults/main) (see the [Grafana defaults section](#grafana-1) for more detail). Sensible defaults are defined in [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml).

 ### Placement

-The `prometheus` group determines the placement of the prometheus service. Load balancing is currently unsupported so it is important that you only assign one host to this group.
+The `prometheus` group controls the placement of the Kubernetes monitoring stack. Load balancing is currently unsupported, so it is important that you only assign one host to this group.

 ### Access

-Prometheus is exposed on port `9090` on all hosts in the prometheus group. Currently, the configuration assumes a single host. Following the reference layout in `environments/common/layouts/everything`, this will be set to the slurm `control` node, prometheus would then be accessible from:
+Prometheus is exposed on port `30000` on all hosts in the prometheus group. Currently, the configuration assumes a single host. Following the reference layout in `environments/common/layouts/everything`, this will be set to the slurm `control` node; Prometheus would then be accessible from:

-> http://<control_node_ip>:9090
+> http://<control_node_ip>:30000

-The port can customised by overriding the `prometheus_web_external_url` variable.
+The port can be customised by overriding the `prometheus_port` variable.
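A minimal sketch of overriding this in an environment's own group vars (the path is illustrative; the variable name comes from the line above):

```yaml
# environments/<your-env>/inventory/group_vars/all/prometheus.yml -- illustrative path
prometheus_port: 30000  # port Prometheus is exposed on (appliance default shown above)
```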
 Note that this service is not password protected, allowing anyone with access to the URL to make queries.

-### Recording rules
+### Alerting and recording rules

-The upstream documentation can be found [here](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/).
+See the upstream documentation for [alerting](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) and [recording](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) rules.

-This appliance provides a default set of recording rules which can be found here:
-
-> [environments/common/files/prometheus/rules/precompute.rules](../environments/common/files/prometheus/rules/precompute.rules)
-
-The intended purpose is to pre-compute some expensive queries that are used
-in the reference set of grafana dashboards.
-
-To add new, or to remove rules you will be to adjust the `prometheus_alert_rules_files` variable. The default value can be found in:
+In addition to the default recording and alerting rules set by kube-prometheus-stack, the appliance provides a default set of rules which can be found in the `prometheus_extra_rules` list in:

 > [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)

-You can extend this variable in your environment specific configuration to reference extra files or to remove the defaults. The reference set of dashboards expect these variables to be defined, so if you remove them, you
-will also have to update your dashboards.
+The provided default recording rules are intended to pre-compute some expensive queries that are used
+in the reference set of grafana dashboards. The default alerting rules define alerts for issues with Slurm nodes.
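As a sketch of extending these (assuming each `prometheus_extra_rules` entry is a plain Prometheus rule definition; the real list structure and any rule grouping should be checked against the defaults in the file above):

```yaml
# Assumed structure: one standard Prometheus rule per list entry.
prometheus_extra_rules:
  # Recording rule pre-computing a query a dashboard might reuse (illustrative)
  - record: node:cpu_utilisation:avg5m
    expr: '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'
  # Alerting rule on Slurm node state (illustrative; assumes a Slurm exporter
  # metric such as slurm_nodes_down is available)
  - alert: SlurmNodeDown
    expr: 'slurm_nodes_down > 0'
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: One or more Slurm nodes are reported down
```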
 ### node_exporter

@@ -262,18 +254,18 @@ This appliance customises the default set of collectors to a minimal set, these
 - meminfo
 - infiniband
 - cpufreq
+- diskstats
+- filesystem

-The list can be customised by overriding the `collect[]` parameter of the `node` job in the `prometheus_scrape_configs` dictionary. The defaults can be found in:
-
-> [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml).
+The list can be customised by adding or removing `--collector` flags in Node Exporter's command line arguments. The defaults can be found in:

-Variables in this file should *not* be customised directly, but should be overridden in your `environment`. See [README.md](../README.md#environments) which details the process of overriding default variables in more detail.
+> [environments/common/inventory/group_vars/all/node_exporter.yml](../environments/common/inventory/group_vars/all/node_exporter.yml).
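As an illustrative sketch only: `node_exporter_flags` below is a hypothetical variable name standing in for whatever variable in the file above actually carries Node Exporter's command line arguments; the flags themselves are standard Node Exporter options.

```yaml
# Hypothetical variable name -- see group_vars/all/node_exporter.yml for the
# variable that really holds Node Exporter's command line arguments.
node_exporter_flags:
  - "--collector.ethtool"        # enable an extra collector
  - "--no-collector.infiniband"  # disable one of the defaults listed above
```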
 ### custom ansible filters

 #### prometheus_node_exporter_targets

-Groups prometheus targets into per environment groups. The ansible variable, `env` is used to determine the grouping. The metrics for each target in the group are given the prometheus label, `env: $env`, where `$env` is the value of the `env` variable for that host.
+Groups prometheus targets into per-environment groups. The ansible variable `cluster_env` is used to determine the grouping. The metrics for each target in the group are given the prometheus label `cluster_env: $cluster_env`, where `$cluster_env` is the value of the `cluster_env` variable for that host.

 ## slurm-stats