
Commit 7c74ea9

Create Userguide for dashboards. Add tood for remaining files
1 parent 5ac71a8 commit 7c74ea9

18 files changed: +132 −30 lines changed

docs/index.md

Lines changed: 2 additions & 4 deletions
@@ -1,11 +1,9 @@
 
 # Cogstack Platform Toolkit
 
-This project provides utilities for running Cogstack in production
+This project provides utilities for running Cogstack in production.
 
-## Features
-
-- [Observability](observability/_index.md)
+- [CogStack Observability](observability/_index.md)
 
 ```{toctree}
 :hidden:

docs/observability/customization/custom-dashboards.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 # Custom Dashboards
-
+//TODO
 Grafana is set up with preconfigured dashboards, data sources, and alerting. These work when Prometheus is run in this stack, and depend on all the metrics following the defined rules.
 
 It is advised that any edits or new configs are committed back into your git repository, and that you stick with Grafana provisioning instead of making manual edits.
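
As a rough sketch of keeping dashboards under provisioning rather than UI edits (the provider name, folder, and paths below are illustrative assumptions, not values taken from this stack), a dashboard provider file can be mounted into the Grafana container and the dashboard JSON committed alongside it:

```yaml
# Hypothetical provisioning file, e.g. provisioning/dashboards/cogstack.yaml,
# mounted into the Grafana container under /etc/grafana/provisioning/dashboards/
apiVersion: 1

providers:
  - name: cogstack-dashboards        # assumed provider name
    orgId: 1
    folder: CogStack                 # folder the dashboards appear under in Grafana
    type: file
    disableDeletion: true            # protect provisioned dashboards from deletion in the UI
    allowUiUpdates: false            # force changes to go through provisioning and git
    options:
      path: /var/lib/grafana/dashboards   # assumed mount point for the dashboard JSON files
```

Exporting a dashboard's JSON from Grafana and committing it into that directory keeps the running instance reproducible from git.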

docs/observability/customization/custom-prometheus-configs.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 # Custom Prometheus Configuration
-
+//TODO
 
 You can add completely custom Prometheus scrape configs and recording rules by mounting them in with Docker.
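
As a minimal sketch of that mount (the service name, file paths, and job name are assumptions for illustration, not this stack's actual compose file):

```yaml
# docker-compose override mounting a custom scrape config and rule file into Prometheus
services:
  prometheus:
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/custom-rules.yml:/etc/prometheus/custom-rules.yml:ro

# The mounted prometheus.yml can then declare the extra jobs and rules, e.g.:
#
#   rule_files:
#     - /etc/prometheus/custom-rules.yml
#
#   scrape_configs:
#     - job_name: my-custom-exporter        # hypothetical job
#       static_configs:
#         - targets: ['my-exporter:9100']
```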

docs/observability/get-started/quickstart.md

Lines changed: 1 addition & 2 deletions
@@ -51,12 +51,11 @@ Now refresh the grafana dashboard, and you can see the availability of google.co
 This is the end of this quickstart tutorial, which enables probing the availability of endpoints.
 
 For the next steps we can:
-
+- Look deeper into the observability dashboards, in the [Dashboards Userguide](./userguide-tutorial.md)
 - Productionise our deployment to enable further features
 - Configure *Telemetry*, like VM memory usage and Elasticsearch index size, by running Exporters
 - Enable *Alerting* based on our availability and a defined Service Level Objective (SLO)
 - Set up further *Probing* of our running services to get availability metrics
-- Look further into the available dashboards
 - Fully customize the stack with our own dashboards, recording rules and metrics
 
docs/observability/get-started/userguide-tutorial.md

Lines changed: 71 additions & 6 deletions
@@ -1,11 +1,76 @@
 # Dashboard User Guide
+This guide walks you through how to monitor your stack using the included Grafana dashboards. It shows how to use each dashboard, and gives some ideas of what to look out for.
 
-This is my dashboard user guide
+## Availability - How well are things running?
+![Availability Dashboard](../../_static/screenshots-dashboards-availability.png)
 
+Open the Cogstack Monitoring Dashboard on [localhost/grafana](http://localhost/grafana/d/NEzutrbMk/cogstack-monitoring-dashboard).
 
-## Grafana Dashboards
+Use the percentage uptime charts at the top to see the availability over a given time period. For example, “Over the last 8 hours, we have 99.5% availability on my service”.
 
-- Availability
-- Elasticsearch
-- VM Metrics (Memory use, CPU etc)
-- Docker Metrics (Running containers)
+Use the time filter in the top right corner of the page to change the window; for example, change it to 30 days to see availability for the whole month.
+
+Look for trends like:
+- Has there been a full outage of a service for 5 minutes, where 5m availability goes to 0?
+- Is there some disruption over the time period, where my 5m availability stays high, but my 6h availability is going down?
+- Have we met the service level objective, if we set the time window to 30 days?
+
+Use the filters at the top, or click in the table, to filter the view down to specific targets, services or hosts.
+
+See [Setup Probing](../setup/probing.md) for the full setup of probers.
+
+## Inventory - What is running?
+![Docker Metrics Dashboard](../../_static/screenshots-dashboards-docker-metrics.png)
+
+Use the Docker Metrics dashboard to check which containers are running, where, and whether they're healthy. This is useful for verifying deployments or diagnosing issues.
+
+The dashboard above includes the hostnames, IP addresses and any other details configured.
+
+Check for things like:
+- Containers not running where you expected them to be, by looking at the hostname for each container
+- Containers restarting unexpectedly, by looking at the "Running" column in the table
+
+See [Telemetry](../setup/telemetry.md) to set this up.
+
+## Telemetry - How can I see details of resources?
+Some additional dashboards are set up to provide more metrics.
+
+### VM Metrics
+![VM Metrics Dashboard](../../_static/screenshots-dashboards-vm-metrics.png)
+
+Open the VM Metrics dashboard on [localhost/grafana](http://localhost/grafana/d/rYdddlPWk/vm-metrics-in-cogstack).
+
+Select a VM from the host dropdown.
+
+Look for things like:
+
+- CPU Usage — is a process using too much CPU?
+- Memory Usage — are you running out of RAM?
+- Disk IO / Space — alerts you to low disk conditions
+- Trends over time, by setting the time filter to 30 days. Is your disk usage increasing over time?
+
+### Elasticsearch Metrics
+![Elasticsearch Metrics Dashboard](../../_static/screenshots-dashboards-es-metrics.png)
+Open the Elasticsearch Metrics dashboard on [localhost/grafana](http://localhost/grafana/d/n_nxrE_mk/elasticsearch-metrics-in-cogstack).
+
+This dashboard helps you understand how your Elasticsearch or OpenSearch cluster is behaving.
+
+Look at:
+- Cluster health status — shows yellow/red states immediately
+- Index size per shard — to detect unbalanced index growth
+- Query latency and throughput — useful during heavy search loads
+
+See [Telemetry](../setup/telemetry.md) to set this up.
+
+## Alerting - When should I look at this?
+Alerting is set up using Grafana Alerts, but is paused by default.
+
+When alerts are set up, the Grafana graphs will show when the alerts were fired.
+![Alerts Firing on dashboard](../../_static/screenshots-dashboards-alerts.png)
+
+Two sets of rules are defined in this project:
+
+- Basic alerts using uptime: if the 5m or 6h uptime drops below a certain percentage, send an alert
+- Alerting on SLOs using burn rates, for multi-window, multi-burn-rate alerts following the best practices defined in [Google SRE - Prometheus Alerting: Turn SLOs into Alerts](https://sre.google/workbook/alerting-on-slos/)
+
+See [Alerting](../setup/alerting.md) to set this up.
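
To make the burn-rate approach concrete, a multi-window alert of the kind described above might look roughly like the following Prometheus rule. This is a sketch, not this project's shipped rules: the `probe_success` selector, the 99.5% SLO target, and the 14.4 burn-rate factor are illustrative assumptions.

```yaml
groups:
  - name: slo-burn-rate-example            # hypothetical rule group
    rules:
      - alert: HighErrorBudgetBurn
        # Fire only when BOTH the long (1h) and short (5m) windows are burning
        # error budget faster than 14.4x the sustainable rate for an assumed
        # 99.5% availability SLO (error budget = 0.5%)
        expr: |
          (1 - avg_over_time(probe_success[1h])) > (14.4 * 0.005)
          and
          (1 - avg_over_time(probe_success[5m])) > (14.4 * 0.005)
        labels:
          severity: page
```

A 14.4x burn rate exhausts a 30-day error budget in roughly two days; the workbook pairs it with slower-burning tiers (for example 6x over 6h) for lower-severity alerts.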
