
Commit e5dff96

committed
updated docs
1 parent e6a4e4b commit e5dff96

1 file changed: +25 −33 lines

docs/monitoring-and-logging.README.md

Lines changed: 25 additions & 33 deletions
@@ -2,6 +2,9 @@
 ## Components overview

+### [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
+An umbrella Helm chart which the appliance uses to deploy and manage containerised versions of Grafana and Prometheus.
+
 ### [filebeat](https://www.elastic.co/beats/filebeat)

 Parses log files and ships them to elasticsearch. Note we use the version shipped by Open Distro.
@@ -85,18 +88,15 @@ This section details the configuration of grafana.
 ### Defaults

-Internally, we use the [cloudalchemy.grafana](https://github.com/cloudalchemy/ansible-grafana) role. You can customise any of the variables that the role supports. For a full list, please see the
-[upstream documentation](https://github.com/cloudalchemy/ansible-grafana). The appliance defaults can be found here:
-
-> [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml)
+Internally, we configure Grafana using the [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) Helm chart, which passes values to a [Grafana subchart](https://github.com/grafana/helm-charts/tree/main/charts/grafana). Common configuration options for the chart are exposed in [ansible/roles/kube-prometheus-stack/defaults/main/main.yml](../ansible/roles/kube-prometheus-stack/defaults/main/main.yml), with sensible defaults for Grafana set in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). For more control over configuration, the exact values that will be merged with the Helm chart defaults can be found in [ansible/roles/kube-prometheus-stack/defaults/main/helm.yml](../ansible/roles/kube-prometheus-stack/defaults/main/helm.yml).
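As a minimal sketch of how this layering is used (the variable name `grafana_port` is taken from the Access section below; the environment path is illustrative), an environment would normally override an exposed option in its own group vars rather than editing the common defaults:

```yaml
# environments/<your-env>/inventory/group_vars/all/grafana.yml -- illustrative path
# Override an option exposed by the kube-prometheus-stack role; the role merges
# these values into the Helm chart defaults described above.
grafana_port: 30001  # port Grafana is served on (appliance default, per the Access section)
```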
 ### Placement

-The `grafana` group controls the placement of the grafana service. Load balancing is currently unsupported so it is important that you only assign one host to this group.
+The `prometheus` group controls the placement of the Kubernetes monitoring stack. Load balancing is currently unsupported, so it is important that you only assign one host to this group.

 ### Access

-If Open Ondemand is enabled then by default this is used to proxy Grafana, otherwise Grafana is accessed through the first . See `grafana_url` in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). The port used (variable `grafana_port`) defaults to `3000`.
+If Open OnDemand is enabled then by default this is used to proxy Grafana; otherwise Grafana is accessed through the first host in the `prometheus` group (note that currently there is no support for load balancing, so only one host should be in this group; the control node is used by default). See `grafana_url` in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml). The port used (variable `grafana_port`) defaults to `30001`.

 The default credentials for the admin user are:
@@ -105,11 +105,13 @@ The default credentials for the admin user are:
 Where `vault_grafana_admin_password` is a variable containing the actual password. This is generated by the `generate-passwords.yml` adhoc playbook (see [README.md](../README.md#creating-a-slurm-appliance)).

+Note that if Open OnDemand is enabled, Grafana is only accessible through OOD's proxy. Requests to `grafana_url_direct` will be redirected through the proxy, which will ask you to authenticate against Open OnDemand (NOT Grafana credentials). See the [Open OnDemand docs](openondemand.README.md).
+
 ### grafana dashboards

-The appliance ships with a default set of dashboards. The set of dashboards can be configured via the `grafana_dashboards` variable. The dashboards are either internal to the [grafana-dashboards role](../ansible/roles/grafana-dashboards/files/) or downloaded from grafana.com.
+In addition to the default set of dashboards deployed by kube-prometheus-stack, the appliance ships with its own default set of dashboards (listed below). The set of appliance-specific dashboards can be configured via the `grafana_dashboards` variable. The dashboards are either internal to the [grafana-dashboards role](../ansible/roles/grafana-dashboards/files/) or downloaded from grafana.com.
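As an illustration only of the sort of thing `grafana_dashboards` might contain (the key names below are an assumption, not the appliance's actual schema; check the defaults in [environments/common/inventory/group_vars/all/grafana.yml](../environments/common/inventory/group_vars/all/grafana.yml) before copying):

```yaml
# Hypothetical shape of grafana_dashboards entries -- key names are assumptions,
# verify against the appliance defaults in group_vars/all/grafana.yml.
grafana_dashboards:
  - dashboard_id: 1860      # a dashboard published on grafana.com (illustrative ID)
    revision_id: 1
    datasource: prometheus
  - dashboard_file: openhpc-slurm.json  # a dashboard shipped in the grafana-dashboards role (illustrative filename)
```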
-#### node exporter
+#### node exporter slurm

 This shows detailed metrics about an individual host. The metric source is `node exporter` (See [prometheus section](#prometheus-1) for more details). A slurm job annotation can optionally be enabled which will highlight the period of time where a given slurm job was running. The slurm job that is highlighted is controlled by the `Slurm Job ID` variable. An example is shown below:
@@ -210,42 +212,32 @@ This section details the configuration of prometheus.
 ### Defaults

-Internally, we use the [cloudalchemy.prometheus](https://github.com/cloudalchemy/ansible-prometheus) role. You can customise any of the variables that the role supports. For a full list, please see the
-[upstream documentation](https://github.com/cloudalchemy/ansible-prometheus). The appliance defaults can be found here:
-
-> [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)
+Like Grafana, Prometheus is internally configured using kube-prometheus-stack, with role variables in [ansible/roles/kube_prometheus_stack/defaults/main](../ansible/roles/kube_prometheus_stack/defaults/main) (see the [Grafana defaults section](#grafana-1) for more detail). Sensible defaults are defined in [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml).

 ### Placement

-The `prometheus` group determines the placement of the prometheus service. Load balancing is currently unsupported so it is important that you only assign one host to this group.
+The `prometheus` group controls the placement of the Kubernetes monitoring stack. Load balancing is currently unsupported, so it is important that you only assign one host to this group.

 ### Access

-Prometheus is exposed on port `9090` on all hosts in the prometheus group. Currently, the configuration assumes a single host. Following the reference layout in `environments/common/layouts/everything`, this will be set to the slurm `control` node, prometheus would then be accessible from:
+Prometheus is exposed on port `30000` on all hosts in the prometheus group. Currently, the configuration assumes a single host. Following the reference layout in `environments/common/layouts/everything`, this will be set to the slurm `control` node; Prometheus would then be accessible from:

-> http://<control_node_ip>:9090
+> http://<control_node_ip>:30000

-The port can customised by overriding the `prometheus_web_external_url` variable.
+The port can be customised by overriding the `prometheus_port` variable.
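A minimal sketch of overriding this in an environment's own group vars (the path is illustrative; the variable name comes from the line above):

```yaml
# environments/<your-env>/inventory/group_vars/all/prometheus.yml -- illustrative path
prometheus_port: 30000  # port Prometheus is exposed on (appliance default shown above)
```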
 Note that this service is not password protected, allowing anyone with access to the URL to make queries.

-### Recording rules
+### Alerting and recording rules

-The upstream documentation can be found [here](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/).
+See the upstream documentation for [alerting](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) and [recording](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) rules.

-This appliance provides a default set of recording rules which can be found here:
-
-> [environments/common/files/prometheus/rules/precompute.rules](../environments/common/files/prometheus/rules/precompute.rules)
-
-The intended purpose is to pre-compute some expensive queries that are used
-in the reference set of grafana dashboards.
-
-To add new, or to remove rules you will be to adjust the `prometheus_alert_rules_files` variable. The default value can be found in:
+In addition to the default recording and alerting rules set by kube-prometheus-stack, the appliance provides a default set of rules which can be found in the `prometheus_extra_rules` list in:

 > [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml)

-You can extend this variable in your environment specific configuration to reference extra files or to remove the defaults. The reference set of dashboards expect these variables to be defined, so if you remove them, you
-will also have to update your dashboards.
+The provided default recording rules are intended to pre-compute some expensive queries that are used
+in the reference set of grafana dashboards. The default alerting rules define alerts for issues with Slurm nodes.
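As a sketch of extending these (assuming each `prometheus_extra_rules` entry is a plain Prometheus rule definition; the real list structure and any rule grouping should be checked against the defaults in the file above):

```yaml
# Assumed structure: one standard Prometheus rule per list entry.
prometheus_extra_rules:
  # Recording rule pre-computing a query a dashboard might reuse (illustrative)
  - record: node:cpu_utilisation:avg5m
    expr: '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))'
  # Alerting rule on Slurm node state (illustrative; assumes a Slurm exporter
  # metric such as slurm_nodes_down is available)
  - alert: SlurmNodeDown
    expr: 'slurm_nodes_down > 0'
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: One or more Slurm nodes are reported down
```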
 ### node_exporter

@@ -262,18 +254,18 @@ This appliance customises the default set of collectors to a minimal set, these
 - meminfo
 - infiniband
 - cpufreq
+- diskstats
+- filesystem

-The list can be customised by overriding the `collect[]` parameter of the `node` job in the `prometheus_scrape_configs` dictionary. The defaults can be found in:
-
-> [environments/common/inventory/group_vars/all/prometheus.yml](../environments/common/inventory/group_vars/all/prometheus.yml).
+The list can be customised by adding or removing `--collector` flags in Node Exporter's command line arguments. The defaults can be found in:

-Variables in this file should *not* be customised directly, but should be overridden in your `environment`. See [README.md](../README.md#environments) which details the process of overriding default variables in more detail.
+> [environments/common/inventory/group_vars/all/node_exporter.yml](../environments/common/inventory/group_vars/all/node_exporter.yml).
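As an illustrative sketch only: `node_exporter_flags` below is a hypothetical variable name standing in for whatever variable in the file above actually carries Node Exporter's command line arguments; the flags themselves are standard Node Exporter options.

```yaml
# Hypothetical variable name -- see group_vars/all/node_exporter.yml for the
# variable that really holds Node Exporter's command line arguments.
node_exporter_flags:
  - "--collector.ethtool"        # enable an extra collector
  - "--no-collector.infiniband"  # disable one of the defaults listed above
```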
 ### custom ansible filters

 #### prometheus_node_exporter_targets

-Groups prometheus targets into per environment groups. The ansible variable, `env` is used to determine the grouping. The metrics for each target in the group are given the prometheus label, `env: $env`, where `$env` is the value of the `env` variable for that host.
+Groups prometheus targets into per-environment groups. The ansible variable `cluster_env` is used to determine the grouping. The metrics for each target in the group are given the prometheus label `cluster_env: $cluster_env`, where `$cluster_env` is the value of the `cluster_env` variable for that host.

 ## slurm-stats