Parses log files and ships them to elasticsearch. Note we use the version shipped by Open Distro.
## Grafana

This section details the configuration of grafana.
### Defaults
Internally, we configure Grafana using the [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) Helm chart, which passes values to a [Grafana subchart](https://github.com/grafana/helm-charts/tree/main/charts/grafana). Common configuration options for the chart are exposed in [ansible/roles/kube-prometheus-stack/defaults/main/main.yml](../ansible/roles/kube-prometheus-stack/defaults/main/main.yml), with sensible defaults for Grafana set in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). For more control over configuration, the exact values that will be merged with the Helm chart defaults can be found in [ansible/roles/kube-prometheus-stack/defaults/main/helm.yml](../ansible/roles/kube-prometheus-stack/defaults/main/helm.yml).
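As an illustrative sketch, an environment could pass additional values through to the chart. The Ansible variable name below is an assumption for illustration (check the role defaults for the real key); the nested structure follows the upstream kube-prometheus-stack chart:

```yaml
# Hypothetical environment-specific override merged into the Helm chart
# values. `k8s_kube_prometheus_stack_extra_values` is an illustrative
# variable name, not necessarily the one the role exposes.
k8s_kube_prometheus_stack_extra_values:
  grafana:
    # a real kube-prometheus-stack chart option, shown as an example
    defaultDashboardsTimezone: utc
```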
### Placement
The `prometheus` group controls the placement of the Kubernetes monitoring stack. Load balancing is currently unsupported so it is important that you only assign one host to this group.
### Access
If Open OnDemand is enabled then by default it is used to proxy Grafana; otherwise Grafana is accessed through the first host in the `prometheus` group (note that currently there is no support for load balancing, so only one host should be in this group; the control node is used by default). See `grafana_url` in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). The port used (variable `grafana_port`) defaults to `30001`.
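For example, the port could be overridden in an environment's group_vars. Both variable names appear in the appliance defaults referenced above; the value here is purely illustrative:

```yaml
# environments/<env>/inventory/group_vars/all/grafana.yml (example path)
# Override the NodePort Grafana is served on; 30002 is an arbitrary example.
grafana_port: 30002
```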
The default credentials for the admin user are:
Where `vault_grafana_admin_password` is a variable containing the actual password. This is generated by the `generate-passwords.yml` adhoc playbook (see [README.md](../README.md#creating-a-slurm-appliance)).
Note that if Open OnDemand is enabled, Grafana is only accessible through OOD's proxy. Requests to `grafana_url_direct` will be redirected through the proxy, which will ask you to authenticate against Open OnDemand (NOT with Grafana credentials). See the [Open OnDemand docs](openondemand.README.md).
### Grafana dashboards
In addition to the default set of dashboards that are deployed by kube-prometheus-stack, the appliance ships with a default set of dashboards (listed below). The set of appliance-specific dashboards can be configured via the `grafana_dashboards` variable. The dashboards are either internal to the [grafana-dashboards role](../ansible/roles/grafana-dashboards/files/) or downloaded from grafana.com.
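As a sketch of what extending this might look like, a dashboard from grafana.com could be appended to the list. The key names below follow a common Ansible convention for grafana.com dashboards but are assumptions here; compare them with the appliance defaults before use:

```yaml
# Hypothetical extension in an environment's group_vars; key names are
# illustrative, not confirmed appliance defaults.
grafana_dashboards:
  - dashboard_id: 1860      # "Node Exporter Full" on grafana.com (example)
    revision_id: 37
    datasource: prometheus
```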
#### node exporter slurm
This shows detailed metrics about an individual host. The metric source is `node exporter` (See [prometheus section](#prometheus-1) for more details). A slurm job annotation can optionally be enabled which will highlight the period of time where a given slurm job was running. The slurm job that is highlighted is controlled by the `Slurm Job ID` variable. An example is shown below:
## Prometheus

This section details the configuration of prometheus.
### Defaults
Like Grafana, Prometheus is internally configured using kube-prometheus-stack, with role variables in [ansible/roles/kube_prometheus_stack/defaults/main](../ansible/roles/kube_prometheus_stack/defaults/main) (see the [Grafana defaults section](#grafana-1) for more detail). Sensible defaults are defined in [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml).
### Placement
The `prometheus` group controls the placement of the Kubernetes monitoring stack. Load balancing is currently unsupported so it is important that you only assign one host to this group.
### Access
Prometheus is exposed on port `30000` on all hosts in the `prometheus` group. Currently, the configuration assumes a single host. Following the reference layout in `environments/common/layouts/everything`, this will be set to the slurm `control` node, and Prometheus will then be accessible from:
> http://<control_node_ip>:30000
The port can be customised by overriding the `prometheus_port` variable.
Note that this service is not password protected, allowing anyone with access to the URL to make queries.
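Since the endpoint is unauthenticated, it can be queried directly over HTTP using Prometheus' instant-query API. As a small sketch (host name and port are deployment-specific assumptions), the query URL can be built like this:

```python
# Sketch: building an instant-query URL for the appliance's Prometheus
# HTTP API. Host and port are illustrative; adjust to your deployment.
from urllib.parse import urlencode


def instant_query_url(host: str, promql: str, port: int = 30000) -> str:
    """Return the Prometheus /api/v1/query URL for a PromQL expression."""
    return f"http://{host}:{port}/api/v1/query?{urlencode({'query': promql})}"


url = instant_query_url("control-node", "up")
print(url)  # http://control-node:30000/api/v1/query?query=up
```

Fetching that URL (e.g. with `curl`) returns a JSON document with the current value of every `up` series, which is a quick way to check that all exporters are being scraped.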
### Alerting and recording rules
See the upstream documentation for [alerting](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) and [recording](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) rules.
In addition to the default recording and alerting rules set by kube-prometheus-stack, the appliance provides a default set of rules which can be found in the `prometheus_extra_rules` list in:
You can extend this variable in your environment-specific configuration to add extra rules or to remove the defaults. The reference set of dashboards expects the metrics produced by the default rules, so if you remove them you will also have to update your dashboards.
The provided default recording rules are intended to pre-compute some expensive queries that are used in the reference set of grafana dashboards. The default alerting rules define alerts for issues with Slurm nodes.
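For illustration, entries in such a list would follow the standard Prometheus rule format (`record:`/`expr:` for recording rules, `alert:`/`expr:`/`for:` for alerting rules). The rule names and expressions below are examples only, not the appliance defaults, and the exact shape of `prometheus_extra_rules` should be checked against the file referenced above:

```yaml
# Hypothetical rules in the standard Prometheus format; names and
# expressions are illustrative, not appliance defaults.
prometheus_extra_rules:
  # recording rule: pre-compute per-instance CPU utilisation
  - record: node:cpu_utilisation:avg5m
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
  # alerting rule: fire when a node exporter target stops responding
  - alert: NodeExporterDown
    expr: up{job="node"} == 0
    for: 5m
    labels:
      severity: warning
```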
### node_exporter
This appliance customises the default set of collectors to a minimal set, these are:

- meminfo
- infiniband
- cpufreq
- diskstats
- filesystem
The list can be customised by adding or removing `--collector.<name>` (and `--no-collector.<name>`) flags in Node Exporter's command line arguments. The defaults can be found in:
Variables in this file should *not* be customised directly, but should be overridden in your `environment`. See [README.md](../README.md#environments) which details the process of overriding default variables in more detail.
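As a sketch, collector flags could be adjusted in an environment's group_vars through values passed to the chart's node-exporter subchart. The Ansible variable name below is an assumption for illustration, but `extraArgs` is a real key of the prometheus-node-exporter subchart and the flag syntax is standard Node Exporter:

```yaml
# Hypothetical environment override; `k8s_kube_prometheus_stack_extra_values`
# is an illustrative variable name, not a confirmed role default.
k8s_kube_prometheus_stack_extra_values:
  prometheus-node-exporter:
    extraArgs:
      - --collector.diskstats    # enable an extra collector
      - --no-collector.cpufreq   # disable one of the defaults
```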
Groups prometheus targets into per-environment groups. The ansible variable `cluster_env` is used to determine the grouping. The metrics for each target in the group are given the prometheus label `cluster_env: $cluster_env`, where `$cluster_env` is the value of the `cluster_env` variable for that host.