Commit eedea5b

Add Full install, exporters, and alerting docs
1 parent 7c74ea9 commit eedea5b

19 files changed

+486
-95
lines changed

docs/observability/customization/custom-dashboards.md

Lines changed: 3 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,8 +1,9 @@
11
# Custom Dashboards
2-
//TODO
2+
You can set up custom dashboards as JSON files, and include them alongside the defaults in this project.
3+
34
Grafana is set up with preconfigured dashboards, a datasource, and alerting. These work when Prometheus runs in this stack, and depend on all the metrics following the defined rules.
45

5-
it is advised that any edits or new configs get committed back into your git repository, and stick with grafana provisioning instead of allowing manual edits
6+
We advise committing any edits or new configs back into your git repository, and sticking with Grafana provisioning instead of allowing manual edits.
67

78

89
## How to add a new dashboard with provisioning

docs/observability/customization/custom-prometheus-configs.md

Lines changed: 2 additions & 4 deletions
@@ -1,11 +1,9 @@
11
# Custom Prometheus Configuration
2-
//TODO
3-
42
You can add completely custom Prometheus scrape configs and recording rules by mounting them in Docker.
53

4+
- `site/prometheus/scrape-configs/*.yml`. This is for advanced configuration.
65

7-
8-
- `site/prometheus/scrape-configs/*.yml`. This is for advanced configuration. Any yml file put in this directory will be used as standard promethues scrape configs. This will give full flexibility over what metrics are collected and all features in prometheus. Add any further configs that you want prometheus to use.
6+
Any yml file put in this directory will be used as a standard Prometheus scrape config. This gives full flexibility over which metrics are collected and access to all Prometheus features. Add any further configs that you want Prometheus to use.
97

108
```yaml
119
# Custom scrape config definition

docs/observability/reference/_index.md

Lines changed: 2 additions & 1 deletion
@@ -3,6 +3,7 @@
33
```{toctree}
44
:maxdepth: 2
55
6-
quickstart-manual.md
6+
project-details.md
7+
concept-materials.md
78
89
```
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
1+
# Concepts
2+
```{toctree}
3+
:maxdepth: 2
4+
understanding-metrics.md
5+
6+
```
7+
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
# Further Project Details
2+
3+
4+
```{toctree}
5+
:maxdepth: 2
6+
quickstart-manual.md
7+
8+
```
9+

docs/observability/setup/_index.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
33
```{toctree}
44
:maxdepth: 2
55
6-
full-installation.md
6+
production-setup.md
77
probing.md
88
telemetry.md
99
alerting.md
Lines changed: 84 additions & 22 deletions
@@ -1,42 +1,104 @@
11
# Alerting
2-
//TODO
3-
By default, alerts are paused. The project is configured to easily send alerts to any Slack Webhook out of the box, but can be customized further.
4-
5-
There are two sets of rules :
62

7-
- Basic alerts using uptime. If over 5m or 6h, if it drops below a certain percentage uptime, send an alert
8-
- Alerting on SLOs by using burn rates, for multi-window multi-rate alerts Google SRE - Prometheus Alerting: Turn SLOs into Alerts.
3+
This guide explains how to enable and customize alerting in the CogStack observability stack using Grafana and Prometheus.
94

5+
By default, alerts are **paused**. The system is preconfigured to send alerts to a **Slack Webhook**, but this can be customized.
106

7+
There are two categories of alerting:
8+
9+
* **Basic availability alerts**: Triggered when uptime falls below a threshold over short windows (5m or 6h).
10+
* **Burn rate alerts**: Multi-window, multi-burn-rate alerts that track compliance with SLOs, following the approach in the [Google SRE workbook](https://sre.google/workbook/alerting-on-slos/).
11+
12+
---
1113

1214
## How to Enable Alerting
1315

14-
### Define a SLO
15-
To enable the burn rate alerting feature, create prometheus recording rule file with the following contents.
16+
### 1. Define Your SLO
17+
18+
To configure burn rate alerting, create a Prometheus recording rule to define your target SLO:
1619

17-
```yaml
20+
```
1821
groups:
1922
  - name: slo-target-rules
2023
    rules:
21-
      - record: slo_target_over_30_days # (Dont change)
22-
        expr: 0.95 # Mandatory - Specify the SLO you want to target, for example 0.95 for 95% uptime over 30 days
24+
      - record: slo_target_over_30_days
25+
        expr: 0.95
2326
        labels:
24-
          job: "probe-cogstack-availability" #Mandatory - name the job, which must match the job in the probe targets defined
27+
          job: "probe-services"
28+
```
29+
30+
* `expr`: Target SLO (e.g., `0.95` for 95% over 30 days)
31+
* `job`: Must match the probe job name defined in your configuration. This allows you to have different SLOs for different endpoints.
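
Because each rule carries its own `job` label, a single file can hold a different target per probe job. A minimal sketch, assuming a second probe job called `probe-critical-services` exists (that job name and its 0.99 target are hypothetical, not part of this project):

```yaml
groups:
  - name: slo-target-rules
    rules:
      # 95% target for the general probe job used in the example above
      - record: slo_target_over_30_days
        expr: 0.95
        labels:
          job: "probe-services"
      # Stricter target for a hypothetical second probe job
      - record: slo_target_over_30_days
        expr: 0.99
        labels:
          job: "probe-critical-services"
```

Each `job` value here has to match the `job` label used in the corresponding prober target file.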
32+
33+
Place this file at:
34+
35+
```
36+
prometheus/recording-rules/slo.yml
2537
```
2638

27-
In docker, mount the file in `site/prometheus/recording-rules/slo.yml`.
39+
This should be mounted in the Docker container at `/etc/prometheus/cogstack/site/prometheus/recording-rules/slo.yml`, which should already be set up if you followed the setup instructions.
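
If the mount is not already in place, it could look roughly like the following Compose fragment. This is a sketch under assumptions: the service name `prometheus` and the read-only flag are illustrative, while the host and container paths are the ones given above.

```yaml
services:
  prometheus:  # assumed service name; use the one in your compose file
    volumes:
      # host path : container path (read-only)
      - ./prometheus/recording-rules/slo.yml:/etc/prometheus/cogstack/site/prometheus/recording-rules/slo.yml:ro
```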
40+
41+
---
42+
43+
### 2. Configure Alerting Environment
44+
45+
Set these environment variables to control alerting behavior:
46+
47+
```
48+
ALERTING_PAUSE_AVAILABILITY_5M=true
49+
ALERTING_PAUSE_AVAILABILITY_6H=true
50+
ALERTING_PAUSE_BURN_RATE=true
51+
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/your-webhook
52+
```
53+
54+
* Set any of the `ALERTING_PAUSE_*` variables to `false` to enable that alert type.
55+
* Set `SLACK_WEBHOOK_URL` to a Slack webhook URL; all alerts are sent to it.
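
As an illustration, the variables could be passed through a Compose `environment` block. This sketch assumes they are read by the Grafana container and that the service is named `grafana`; both are assumptions, so check your own compose file. It enables the two availability alerts and leaves burn rate alerts paused:

```yaml
services:
  grafana:  # assumed service name
    environment:
      ALERTING_PAUSE_AVAILABILITY_5M: "false"  # enable 5m availability alerts
      ALERTING_PAUSE_AVAILABILITY_6H: "false"  # enable 6h availability alerts
      ALERTING_PAUSE_BURN_RATE: "true"         # keep burn rate alerts paused
      SLACK_WEBHOOK_URL: "https://hooks.slack.com/services/your-webhook"
```

Quoting the values keeps YAML from turning `false` into a boolean before it reaches the container.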
56+
57+
---
58+
59+
## Advanced Customization
60+
### Customize Alert Contact points
2861

29-
### Turn on alerting
30-
- Enable/Disable alerts using environment variables
31-
- By default alerts will send to slack. Provide the env variable `SLACK_WEBHOOK_URL` to go there
62+
You can customize where alerts are sent by defining a new contact point in Grafana:
63+
64+
```
65+
notifiers:
66+
  - name: "custom-contact"
67+
    type: "slack"
68+
    settings:
69+
      url: "https://hooks.slack.com/services/..."
70+
```
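
For comparison, Grafana's alerting provisioning documentation (linked under Further Reading) describes contact point files with a top-level `apiVersion` and a `contactPoints` list. A sketch of the same Slack contact in that shape, assuming the stack uses Grafana-managed alerting; the `uid` value is hypothetical and the webhook URL is left elided as above:

```yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: custom-contact
    receivers:
      - uid: custom-contact-slack  # hypothetical unique id
        type: slack
        settings:
          url: "https://hooks.slack.com/services/..."
```

Which of the two shapes your Grafana version accepts is worth checking against the provisioning docs before relying on either.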
71+
72+
Mount this file into:
73+
74+
```
75+
/etc/grafana/provisioning/alerting/custom-contact.yml
76+
```
77+
78+
Then update the environment variable:
79+
80+
```
81+
ALERTING_DEFAULT_CONTACT=custom-contact
82+
```
83+
84+
**Note:** Mount only this exact file rather than overriding the whole provisioning folder in the image, because that folder already contains the defaults.
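
A single-file mount might look like the sketch below. The `grafana` service name and the host path `./grafana/custom-contact.yml` are assumptions for illustration; the container path and the environment variable are the ones described above.

```yaml
services:
  grafana:  # assumed service name
    volumes:
      # Mount one file only, so the default provisioning files in the image stay in place
      - ./grafana/custom-contact.yml:/etc/grafana/provisioning/alerting/custom-contact.yml:ro
    environment:
      ALERTING_DEFAULT_CONTACT: custom-contact
```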
85+
86+
---
87+
88+
### Add Custom Alerts
89+
To define additional alert rules, create files in:
90+
91+
```
92+
/etc/grafana/provisioning/alerting/
93+
```
3294

95+
Grafana will automatically load these at startup.
3396

34-
## Configuration
97+
---
3598

36-
Alerting is setup using Grafana Alerts.
37-
- To change where the alerts are sent: create and mount custom a custom contact point in `/etc/grafana/provisioning/alerting/custom-contact.yml`. Then change the environment variable `ALERTING_DEFAULT_CONTACT` to use that name
38-
- Add custom alerts by mounting alert files in `/etc/grafana/provisioning/alerting/`.
99+
## Further Reading
39100

40-
For more info see [Grafana Provisioning](https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/)
101+
* [Grafana Alerting Provisioning](https://grafana.com/docs/grafana/latest/alerting/set-up/provision-alerting-resources/)
102+
* [Google SRE – Burn Rate Alerting](https://sre.google/workbook/alerting-on-slos/#4-alert-on-burn-rate)
41103

42-
See [Google SRE Guide](https://sre.google/workbook/alerting-on-slos/#4-alert-on-burn-rate) which explains burn rate alerting. The alerting setup here follows the recommendations in the SRE handbook for Multiwindow, Multiburn rate alerting.

docs/observability/setup/full-installation.md

Lines changed: 0 additions & 43 deletions
This file was deleted.
Lines changed: 122 additions & 21 deletions
@@ -1,29 +1,130 @@
1-
# Availability Probing
2-
//TODO
1+
# Availability
32

4-
HTTP Probers are setup to scrape the real endpoints exposed by our services, and we can calculate a percentage uptime and latency based on those.
3+
This guide explains how to configure HTTP probers using Blackbox Exporter to monitor the availability of your services. These probers generate uptime and latency metrics, which can then be visualized in Grafana.
54

6-
See the [Reference](../reference/understanding-metrics.md) for more details.
5+
See the [Reference](../reference/understanding-metrics.md) for an explanation of the metrics this generates.
76

7+
---
88

9-
## Adding Probers
10-
- `site/prometheus/scrape-configs/probers/*.yml`.
11-
Add yaml files to this folder as probe targets. Any yml files put into this directory, for example "probe.example.yml", will be used as targets to probe for availability using blackbox exporter. Add any URLs that you want to measure the availability of.
9+
## How to Add New Probers
1210

13-
```yaml
14-
# Prober yml
15-
- targets:
16-
- https://google.com/something
17-
labels:
18-
name: google-homepage # Mandatory - the name of the service being probed
19-
job: override_job # (Optional. Default is "probe-cogstack-availability") Customise a job to enable grouping in the dashboard
20-
ip_address: "123.0.0.1" # (Optional) The IP address
21-
host: a_hostname # (Optional) A readable hostname
22-
custom_label: a_custom_label # (Optional) Any other label
23-
11+
To add a new prober target:
12+
13+
1. Navigate to the folder:
14+
15+
```
16+
prometheus/scrape-configs/probers/
17+
```
18+
19+
2. Create a new YAML file (e.g., `probe.my-services.yml`) with the following structure:
20+
21+
```
22+
# probe.my-services.yml
23+
- targets:
24+
    - https://myservice.example.com/health
25+
  labels:
26+
    name: my-service # Mandatory - the name of the service being probed
27+
    job: my-services # Mandatory - used to group probes in dashboards
28+
    ip_address: "10.0.0.12" # Optional - IP of the host being probed
29+
    host: service-hostname # Optional - Human-readable hostname
30+
    region: eu-west # Optional - Any additional metadata label
31+
```
32+
33+
3. Ensure the folder is mounted in Docker under `/etc/prometheus/cogstack/site/prometheus/scrape-configs/probers`, which it should be by default if you've followed the setup guides. Any valid `.yml` files in this folder will be automatically picked up and used as Blackbox targets.
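
If you need to add the mount from step 3 yourself, a Compose fragment along these lines would do it. The `prometheus` service name is an assumption; the host folder and container path are taken from the steps above.

```yaml
services:
  prometheus:  # assumed service name
    volumes:
      # Every .yml file in the host folder becomes a Blackbox target file
      - ./prometheus/scrape-configs/probers:/etc/prometheus/cogstack/site/prometheus/scrape-configs/probers:ro
```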
34+
35+
---
36+
37+
## Advanced Setup
38+
39+
### How to add Auth to the prober or further configurations
40+
41+
To define how a probe behaves (e.g., add basic auth, headers, timeout, method), we will configure a module in the Blackbox Exporter config.
42+
43+
#### Create a Blackbox Exporter Config file
44+
You will need to create a new file, and then mount it over the existing provided config.
45+
46+
47+
1. Create a new file:
48+
49+
```
50+
prometheus/blackbox-exporter/custom-blackbox-config.yml
51+
```
52+
53+
2. Add the existing defaults
54+
55+
```
56+
modules:
57+
  http_get_200:
58+
    prober: http
59+
    timeout: 5s
60+
    http:
61+
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
62+
      valid_status_codes: [200] # Defaults to 2xx
63+
      method: GET
64+
      preferred_ip_protocol: "ip4" # defaults to "ip6"
65+
      tls_config:
66+
        insecure_skip_verify: true
67+
```
68+
69+
3. Add your own module to the modules in that file
2470
```
71+
  http_2xx_custom:
72+
    prober: http
73+
    timeout: 5s
74+
    http:
75+
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
76+
      valid_status_codes: [200] # Defaults to 2xx
77+
      method: GET
78+
      preferred_ip_protocol: "ip4" # defaults to "ip6"
79+
      tls_config:
80+
        insecure_skip_verify: true
81+
      basic_auth:
82+
        username: my-user
83+
        password: example-pass
84+
```
85+
86+
This example adds a module named `http_2xx_custom` that sends basic auth credentials with each probe request.
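
The same pattern covers the other options mentioned earlier (headers, timeout, method). A sketch of a module that sends a `POST` with a JSON body and a custom header; the module name `http_post_json` and all values are hypothetical, while `method`, `headers`, `body`, and `valid_status_codes` are standard fields of the Blackbox Exporter `http` prober:

```yaml
modules:
  http_post_json:  # hypothetical module name
    prober: http
    timeout: 10s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'
      valid_status_codes: [200, 201]
      preferred_ip_protocol: "ip4"
```

As with `http_2xx_custom`, this entry sits alongside the default modules in the same file.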
87+
88+
---
89+
90+
#### Reference the new module in your prober config
91+
92+
In your probe YAML file, reference the module in the `module` field of the `labels` section:
93+
94+
```
95+
- targets:
96+
    - https://myservice.example.com/health
97+
  labels:
98+
    name: my-service
99+
    module: http_2xx_custom # Optional - overrides the default Blackbox module
100+
```
101+
102+
#### Mount the config file
103+
Finally, mount the new config file and reference it in Docker Compose:
104+
105+
```
106+
blackbox-exporter:
107+
  image: cogstacksystems/cogstack-observability-blackbox-exporter:latest
108+
  restart: unless-stopped
109+
  networks:
110+
    - observability
111+
  volumes:
112+
    - ./prometheus/blackbox-exporter:/config
113+
  command:
114+
    - "--config.file=/config/custom-blackbox-config.yml"
115+
```
116+
117+
---
118+
119+
## Notes
120+
121+
* Changes will take effect on the next Prometheus reload or container restart.
122+
* Jobs with the same `job` label are grouped in dashboards to simplify analysis.
123+
* The `job` label of a prober must match the `job` of a defined SLO recording rule for alerting to work.
124+
* Probers can target both external URLs and local Docker containers directly. For example, we probe Grafana on `cogstack-observability-grafana-1:3000/`. To probe a local Docker container, the Blackbox Exporter must share a Docker network with that container.
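
As a sketch of the local-container case, a target file for the Grafana example could look like this. The `http://` scheme and the `job` value are assumptions; the container address is the one mentioned in the note above.

```yaml
# probe.local-containers.yml (hypothetical file name)
- targets:
    - http://cogstack-observability-grafana-1:3000/  # scheme assumed
  labels:
    name: grafana
    job: probe-observability-stack  # hypothetical job name
```

The container name only resolves if the Blackbox Exporter is attached to the same Docker network as the target.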
125+
25126

26-
## Configuring Probers
27-
- How to setup custom exporter module
28-
- How to use the module in my yml
127+
## External links
128+
For full Blackbox Exporter documentation, see:
29129

130+
- [Prometheus Blackbox Exporter](https://github.com/prometheus/blackbox_exporter)
