# Dashboard User Guide

This guide walks you through monitoring your stack with the included Grafana dashboards. It explains how to use each dashboard and what to look out for.

## Availability - How well are things running?

Open the Cogstack Monitoring Dashboard at [localhost/grafana](http://localhost/grafana/d/NEzutrbMk/cogstack-monitoring-dashboard).

Use the percentage uptime charts at the top to see availability over a given time period, for example "Over the last 8 hours, this service had 99.5% availability."

Use the time filter in the top right corner of the page to change the window, for example set it to 30 days to see availability for the whole month.

Look for trends like the following (the sketch after this list shows one way to query these windows directly):
- Has there been a full outage of a service for 5 minutes, where the 5m availability drops to 0?
- Is there some disruption over the period, where the 5m availability stays high but the 6h availability is falling?
- Have we met the service level objective, with the time window set to 30 days?
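
If you want the same numbers outside Grafana, you can query Prometheus directly. The sketch below is illustrative only: it assumes the stack's Prometheus is reachable at `localhost:9090` and that the probers publish a blackbox-exporter-style `probe_success` metric, and the target URL is just an example.

```python
# Sketch: fetch availability over several windows straight from Prometheus.
# Assumptions: Prometheus at localhost:9090 and a blackbox-style probe_success
# metric; the example target label is hypothetical.
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"

def availability(window: str, target: str) -> float:
    """Fractional availability of `target` over `window` (e.g. '5m', '6h', '30d')."""
    query = f'avg_over_time(probe_success{{instance="{target}"}}[{window}])'
    resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    for window in ("5m", "6h", "30d"):
        pct = availability(window, "http://elasticsearch-1:9200") * 100
        print(f"{window:>4}: {pct:.2f}% available")
```

An empty result usually just means the `instance` label doesn't match what your probers export, so check the labels in Prometheus first.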

Use the filters at the top, or click entries in the table, to narrow the view to specific targets, services, or hosts.

See [Setup Probing](../setup/probing.md) for the full prober setup.

## Inventory - What is running?

Use the Docker Metrics dashboard to check which containers are running, where, and whether they're healthy. This is useful for verifying deployments or diagnosing issues.

The dashboard includes the hostnames, IP addresses and any other details configured.

Check for things like the following (a quick local cross-check is sketched after this list):
- Containers not running where you expect them to be, by looking at the hostname for each container
- Containers restarting unexpectedly, by looking at the "Running" column in the table
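
For a quick sanity check you can also ask the Docker daemon on a host directly, independently of the dashboard. This is a minimal sketch using the Docker SDK for Python (`pip install docker`); it only sees the host it runs on, so run it on each VM you care about.

```python
# Minimal sketch: list every container on this host with its state and
# restart count, as a local cross-check of the Docker Metrics dashboard.
import docker

client = docker.from_env()

for container in client.containers.list(all=True):
    # RestartCount comes from `docker inspect`; treat a missing key as 0.
    restarts = container.attrs.get("RestartCount", 0)
    print(f"{container.name:<35} status={container.status:<10} restarts={restarts}")
```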

See [telemetry](../setup/telemetry.md) to set this up.

## Telemetry - How can I see details of resources?

Some additional dashboards are set up to provide more detailed metrics.

### VM Metrics

Open the VM Metrics dashboard at [localhost/grafana](http://localhost/grafana/d/rYdddlPWk/vm-metrics-in-cogstack).

Select a VM from the host dropdown.

Look for things like:

- CPU Usage — is a process using too much CPU?
- Memory Usage — are you running out of RAM?
- Disk IO / Space — alerts you to low disk conditions
- Trends over time, by setting the time filter to 30 days. Is your disk usage increasing over time? (A query for this trend is sketched below.)
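
To make the trend check concrete, the sketch below pulls 30 days of root-filesystem usage from Prometheus. It assumes Prometheus at `localhost:9090` and the standard node_exporter metric names (`node_filesystem_avail_bytes` / `node_filesystem_size_bytes`); your exporter version or mountpoints may differ.

```python
# Sketch: 30-day disk-usage trend per host, roughly what the VM Metrics
# disk panels show. Metric names and the Prometheus address are assumptions.
import time
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query_range"
QUERY = (
    '100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}'
    ' / node_filesystem_size_bytes{mountpoint="/"})'
)

end = time.time()
start = end - 30 * 24 * 3600  # 30 days back
resp = requests.get(
    PROMETHEUS,
    params={"query": QUERY, "start": start, "end": end, "step": "6h"},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    host = series["metric"].get("instance", "unknown")
    first, last = series["values"][0][1], series["values"][-1][1]
    print(f"{host}: {float(first):.1f}% -> {float(last):.1f}% used over 30 days")
```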

### Elasticsearch Metrics

Open the Elasticsearch Metrics dashboard at [localhost/grafana](http://localhost/grafana/d/n_nxrE_mk/elasticsearch-metrics-in-cogstack).

This dashboard helps you understand how your Elasticsearch or OpenSearch cluster is behaving (a direct health check is sketched after the list below).

Look at:
- Cluster health status — shows yellow/red states immediately
- Index size per shard — to detect unbalanced index growth
- Query latency and throughput — useful during heavy search loads
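
If the dashboard shows yellow or red and you want the raw numbers, you can query the cluster health endpoint directly. The sketch below assumes the cluster answers on `localhost:9200` without TLS or authentication; adjust the URL and add credentials if yours is secured (OpenSearch exposes the same endpoint).

```python
# Sketch: read cluster health from the REST API that the dashboard metrics
# are ultimately derived from. The address and lack of auth are assumptions.
import requests

resp = requests.get("http://localhost:9200/_cluster/health", timeout=10)
resp.raise_for_status()
health = resp.json()

print(f"status            : {health['status']}")           # green / yellow / red
print(f"active shards     : {health['active_shards']}")
print(f"unassigned shards : {health['unassigned_shards']}")
```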

See [telemetry](../setup/telemetry.md) to set this up.

## Alerting - When should I look at this?

Alerting is set up using Grafana Alerts, but it is paused by default.

When alerts are set up, the Grafana graphs will show when the alerts were fired.

Two sets of rules are defined in this project:

- Basic uptime alerts: if availability over a 5m or 6h window drops below a set percentage, an alert is sent
- Alerting on SLOs using burn rates, with multi-window, multi-burn-rate alerts following the practices in [Google SRE - Prometheus Alerting: Turn SLOs into Alerts](https://sre.google/workbook/alerting-on-slos/) (a hand-evaluated example follows this list)
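
To make the burn-rate idea concrete, here is a minimal sketch that evaluates the "fast burn" window pair from the SRE workbook by hand. The 99.5% SLO, the `probe_success` metric and the Prometheus address are all illustrative assumptions; the bundled alert rules define their own targets and windows.

```python
# Sketch: hand-evaluate one multiwindow, multi-burn-rate condition.
# Assumptions: a 99.5% availability SLO, a probe_success metric, and
# Prometheus at localhost:9090 (none of these are taken from the project).
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"
SLO = 0.995
ERROR_BUDGET = 1 - SLO  # 0.5% of probes may fail

def error_rate(window: str) -> float:
    """Fraction of failed probes across all targets over the window."""
    query = f"1 - avg(avg_over_time(probe_success[{window}]))"
    data = requests.get(PROMETHEUS, params={"query": query}, timeout=10).json()
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0  # no data -> no errors, for this sketch

# "Fast burn": page when both the 1h and 5m windows burn error budget at
# more than 14.4x the sustainable rate (the factor used in the SRE workbook).
burn_1h = error_rate("1h") / ERROR_BUDGET
burn_5m = error_rate("5m") / ERROR_BUDGET
if burn_1h > 14.4 and burn_5m > 14.4:
    print(f"PAGE: fast burn (1h={burn_1h:.1f}x, 5m={burn_5m:.1f}x)")
else:
    print(f"OK (1h={burn_1h:.1f}x, 5m={burn_5m:.1f}x)")
```

Pairing a long and a short window in this way is what keeps a brief blip from paging anyone while a sustained burn still does.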

See [Alerting](../setup/alerting.md) to set this up.