# Dashboard User Guide

This guide walks you through monitoring your stack with the included Grafana dashboards. It explains how to use each dashboard and what to look out for.

## Availability - How well are things running?

Open the Cogstack Monitoring Dashboard at [localhost/grafana](http://localhost/grafana/d/NEzutrbMk/cogstack-monitoring-dashboard).

Use the percentage uptime charts at the top to see availability over a given time period, for example "Over the last 8 hours, this service had 99.5% availability."

Use the time filter in the top right corner of the page to change the window, for example set it to 30 days to see availability for the whole month.

Look for trends like the following (the sketch after this list shows one way to query these windows directly):
- Has there been a full outage of a service for 5 minutes, where the 5m availability drops to 0?
- Is there some disruption over the period, where the 5m availability stays high but the 6h availability is falling?
- Have we met the service level objective, with the time window set to 30 days?
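
If you want the same numbers outside Grafana, you can query Prometheus directly. The sketch below is illustrative only: it assumes the stack's Prometheus is reachable at `localhost:9090` and that the probers publish a blackbox-exporter-style `probe_success` metric, and the target URL is just an example.

```python
# Sketch: fetch availability over several windows straight from Prometheus.
# Assumptions: Prometheus at localhost:9090 and a blackbox-style probe_success
# metric; the example target label is hypothetical.
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"

def availability(window: str, target: str) -> float:
    """Fractional availability of `target` over `window` (e.g. '5m', '6h', '30d')."""
    query = f'avg_over_time(probe_success{{instance="{target}"}}[{window}])'
    resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

if __name__ == "__main__":
    for window in ("5m", "6h", "30d"):
        pct = availability(window, "http://elasticsearch-1:9200") * 100
        print(f"{window:>4}: {pct:.2f}% available")
```

An empty result usually just means the `instance` label doesn't match what your probers export, so check the labels in Prometheus first.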

Use the filters at the top, or click entries in the table, to narrow the view to specific targets, services, or hosts.

See [Setup Probing](../setup/probing.md) for the full prober setup.

## Inventory - What is running?

Use the Docker Metrics dashboard to check which containers are running, where, and whether they're healthy. This is useful for verifying deployments or diagnosing issues.

The dashboard includes the hostnames, IP addresses and any other details configured.

Check for things like the following (a quick local cross-check is sketched after this list):
- Containers not running where you expect them to be, by looking at the hostname for each container
- Containers restarting unexpectedly, by looking at the "Running" column in the table
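
For a quick sanity check you can also ask the Docker daemon on a host directly, independently of the dashboard. This is a minimal sketch using the Docker SDK for Python (`pip install docker`); it only sees the host it runs on, so run it on each VM you care about.

```python
# Minimal sketch: list every container on this host with its state and
# restart count, as a local cross-check of the Docker Metrics dashboard.
import docker

client = docker.from_env()

for container in client.containers.list(all=True):
    # RestartCount comes from `docker inspect`; treat a missing key as 0.
    restarts = container.attrs.get("RestartCount", 0)
    print(f"{container.name:<35} status={container.status:<10} restarts={restarts}")
```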

See [telemetry](../setup/telemetry.md) to set this up.

## Telemetry - How can I see details of resources?

Some additional dashboards are set up to provide more detailed metrics.

### VM Metrics

Open the VM Metrics dashboard at [localhost/grafana](http://localhost/grafana/d/rYdddlPWk/vm-metrics-in-cogstack).

Select a VM from the host dropdown.

Look for things like:

- CPU Usage — is a process using too much CPU?
- Memory Usage — are you running out of RAM?
- Disk IO / Space — alerts you to low disk conditions
- Trends over time, by setting the time filter to 30 days. Is your disk usage increasing over time? (A query for this trend is sketched below.)
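
To make the trend check concrete, the sketch below pulls 30 days of root-filesystem usage from Prometheus. It assumes Prometheus at `localhost:9090` and the standard node_exporter metric names (`node_filesystem_avail_bytes` / `node_filesystem_size_bytes`); your exporter version or mountpoints may differ.

```python
# Sketch: 30-day disk-usage trend per host, roughly what the VM Metrics
# disk panels show. Metric names and the Prometheus address are assumptions.
import time
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query_range"
QUERY = (
    '100 * (1 - node_filesystem_avail_bytes{mountpoint="/"}'
    ' / node_filesystem_size_bytes{mountpoint="/"})'
)

end = time.time()
start = end - 30 * 24 * 3600  # 30 days back
resp = requests.get(
    PROMETHEUS,
    params={"query": QUERY, "start": start, "end": end, "step": "6h"},
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    host = series["metric"].get("instance", "unknown")
    first, last = series["values"][0][1], series["values"][-1][1]
    print(f"{host}: {float(first):.1f}% -> {float(last):.1f}% used over 30 days")
```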

### Elasticsearch Metrics

Open the Elasticsearch Metrics dashboard at [localhost/grafana](http://localhost/grafana/d/n_nxrE_mk/elasticsearch-metrics-in-cogstack).

This dashboard helps you understand how your Elasticsearch or OpenSearch cluster is behaving (a direct health check is sketched after the list below).

Look at:
- Cluster health status — shows yellow/red states immediately
- Index size per shard — to detect unbalanced index growth
- Query latency and throughput — useful during heavy search loads
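
If the dashboard shows yellow or red and you want the raw numbers, you can query the cluster health endpoint directly. The sketch below assumes the cluster answers on `localhost:9200` without TLS or authentication; adjust the URL and add credentials if yours is secured (OpenSearch exposes the same endpoint).

```python
# Sketch: read cluster health from the REST API that the dashboard metrics
# are ultimately derived from. The address and lack of auth are assumptions.
import requests

resp = requests.get("http://localhost:9200/_cluster/health", timeout=10)
resp.raise_for_status()
health = resp.json()

print(f"status            : {health['status']}")           # green / yellow / red
print(f"active shards     : {health['active_shards']}")
print(f"unassigned shards : {health['unassigned_shards']}")
```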

See [telemetry](../setup/telemetry.md) to set this up.

## Alerting - When should I look at this?

Alerting is set up using Grafana Alerts, but it is paused by default.

When alerts are set up, the Grafana graphs will show when the alerts were fired.

Two sets of rules are defined in this project:

- Basic uptime alerts: if availability over a 5m or 6h window drops below a set percentage, an alert is sent
- Alerting on SLOs using burn rates, with multi-window, multi-burn-rate alerts following the practices in [Google SRE - Prometheus Alerting: Turn SLOs into Alerts](https://sre.google/workbook/alerting-on-slos/) (a hand-evaluated example follows this list)
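
To make the burn-rate idea concrete, here is a minimal sketch that evaluates the "fast burn" window pair from the SRE workbook by hand. The 99.5% SLO, the `probe_success` metric and the Prometheus address are all illustrative assumptions; the bundled alert rules define their own targets and windows.

```python
# Sketch: hand-evaluate one multiwindow, multi-burn-rate condition.
# Assumptions: a 99.5% availability SLO, a probe_success metric, and
# Prometheus at localhost:9090 (none of these are taken from the project).
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"
SLO = 0.995
ERROR_BUDGET = 1 - SLO  # 0.5% of probes may fail

def error_rate(window: str) -> float:
    """Fraction of failed probes across all targets over the window."""
    query = f"1 - avg(avg_over_time(probe_success[{window}]))"
    data = requests.get(PROMETHEUS, params={"query": query}, timeout=10).json()
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0  # no data -> no errors, for this sketch

# "Fast burn": page when both the 1h and 5m windows burn error budget at
# more than 14.4x the sustainable rate (the factor used in the SRE workbook).
burn_1h = error_rate("1h") / ERROR_BUDGET
burn_5m = error_rate("5m") / ERROR_BUDGET
if burn_1h > 14.4 and burn_5m > 14.4:
    print(f"PAGE: fast burn (1h={burn_1h:.1f}x, 5m={burn_5m:.1f}x)")
else:
    print(f"OK (1h={burn_1h:.1f}x, 5m={burn_5m:.1f}x)")
```

Pairing a long and a short window in this way is what keeps a brief blip from paging anyone while a sustained burn still does.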

See [Alerting](../setup/alerting.md) to set this up.