Merged
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# Custom ignores
observability/examples/simple/observability-simple

_build

# Python ignores
# Byte-compiled / optimized / DLL files
17 changes: 17 additions & 0 deletions .readthedocs.yml
@@ -0,0 +1,17 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

version: 2

build:
  os: ubuntu-20.04
  tools:
    python: "3.9"

sphinx:
  configuration: docs/conf.py

python:
  install:
    - requirements: docs/requirements.txt
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
Binary file added docs/_static/screenshots-dashboards-alerts.png
37 changes: 37 additions & 0 deletions docs/conf.py
@@ -0,0 +1,37 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
import os
import sys
sys.path.insert(0, os.path.abspath("../observability/docs"))

print("Hello")
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

project = 'CogStack Platform Toolkit'
copyright = '2025, CogStack Org'
author = 'CogStack Org'
release = 'latest'
html_title = "CogStack Platform Toolkit"

# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

extensions = [
    'sphinx_rtd_theme',
    'sphinx.ext.autodoc',
    'myst_parser',
    'sphinx.ext.inheritance_diagram',
]
templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']



# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = "furo"
html_static_path = ['_static']
15 changes: 15 additions & 0 deletions docs/index.md
@@ -0,0 +1,15 @@

# CogStack Platform Toolkit

This project provides utilities for running CogStack in production.

- [CogStack Observability](observability/_index.md)

```{toctree}
:hidden:

observability/_index

```


35 changes: 35 additions & 0 deletions docs/make.bat
@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

if "%1" == "" goto help

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.https://www.sphinx-doc.org/
	exit /b 1
)

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
11 changes: 8 additions & 3 deletions observability/docs/overview.md → docs/observability/_index.md
@@ -1,4 +1,4 @@
# Cogstack Observability Stack
# Cogstack Observability

This project provides observability for a CogStack deployment.

@@ -9,10 +9,15 @@ It provides the following features:
- Blackbox Probing of services to find service level indicators of uptime and latency
- A working inventory of what is running where

## Contents

See the [Quickstart](./get-started/quickstart.md) for the easiest way to run this stack.

```{toctree}
:maxdepth: 2

get-started/_index
setup/_index
customization/_index
reference/_index


```
19 changes: 19 additions & 0 deletions docs/observability/customization/_index.md
@@ -0,0 +1,19 @@
# Customization

```{include} custom-dashboards.md
:heading-offset: 1
```

```{include} custom-prometheus-configs.md
:heading-offset: 1
```


```{toctree}
:titlesonly:
:hidden:

custom-prometheus-configs.md
custom-dashboards.md

```
15 changes: 15 additions & 0 deletions docs/observability/customization/custom-dashboards.md
@@ -0,0 +1,15 @@
# Custom Dashboards
You can set up custom dashboards as JSON files and include them alongside the defaults in this project.

Grafana is set up with preconfigured dashboards, a datasource, and alerting. These work when Prometheus runs in this stack, and they depend on all the metrics following the defined rules.

Commit any edits or new configs back into your Git repository, and stick with Grafana provisioning instead of allowing manual edits.


## How to add a new dashboard with provisioning

- Mount new dashboard files in the `/etc/grafana/provisioning/dashboards/site` directory
- To remove or change an existing dashboard, mount over the existing files there

For more info see [Grafana Dashboard Provisioning](https://grafana.com/docs/grafana/latest/administration/provisioning/#dashboards).
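
Before mounting a new dashboard file, it can help to check that it at least parses. The following is a minimal sketch, not part of this project: the function name `dashboard_problems` is hypothetical, and the choice to require `title` and `uid` (both standard fields in Grafana's dashboard JSON) is an assumption about what you want to enforce.

```python
import json


def dashboard_problems(raw_json: str) -> list[str]:
    """Return a list of problems found in a dashboard export (empty if none)."""
    try:
        dashboard = json.loads(raw_json)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(dashboard, dict):
        return ["top-level JSON value must be an object"]
    problems = []
    # Assumption: we want both fields present; adjust to your own policy.
    for field in ("title", "uid"):
        if not dashboard.get(field):
            problems.append(f"missing or empty '{field}'")
    return problems


# Hypothetical dashboard content, for illustration only.
example = '{"title": "My site dashboard", "uid": "my-site", "panels": []}'
print(dashboard_problems(example))
```

To check a real file, pass its contents in, for example `dashboard_problems(pathlib.Path("my-dashboard.json").read_text())`, where the file name is a placeholder.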

17 changes: 17 additions & 0 deletions docs/observability/customization/custom-prometheus-configs.md
@@ -0,0 +1,17 @@
# Custom Prometheus Configuration
You can add completely custom Prometheus scrape configs and recording rules by mounting them in Docker.

- `site/prometheus/scrape-configs/*.yml`. This is for advanced configuration.

Any yml file put in this directory will be used as a standard Prometheus scrape config. This gives full flexibility over which metrics are collected and access to all Prometheus features. Add any further configs that you want Prometheus to use.

```yaml
# Custom scrape config definition
scrape_configs:
  - job_name: custom-scrape-config # Scrape configuration to get metrics from elasticsearch, eg index size.
    static_configs:
      - targets:
          - my-custom-target:9114
        labels:
          custom_label: custom # (Optional)
```
7 changes: 7 additions & 0 deletions docs/observability/get-started/_index.md
@@ -0,0 +1,7 @@
# Getting Started

```{toctree}
:maxdepth: 2
quickstart
userguide-tutorial
```
@@ -1,32 +1,34 @@
## QuickStart
# QuickStart

This tutorial guides you through running the simplest setup of the observability stack using example configuration files and Docker Compose.

After completing these steps, you will have a full observability stack running locally, showing the availability of the web pages you want to target.

### Requirements
## Requirements

- Docker installed ([install Docker](https://docs.docker.com/get-docker/))
- Docker Compose installed ([install Docker Compose](https://docs.docker.com/compose/install/))
- A terminal with network access

## Steps

### Step 1: Run the Quickstart script

Run this quickstart script to set up the project:
```bash
curl https://raw.githubusercontent.com/CogStack/cogstack-platform-toolkit/main/observability/examples/simple/quickstart.sh | bash
curl https://raw.githubusercontent.com/CogStack/cogstack-platform-toolkit/refs/heads/main/observability/examples/simple/quickstart.sh | bash
```
Now go to http://localhost/grafana to see the dashboards.

That's everything. The stack is running and you can see the availability.

If you can't use the script, see the [Manual Quickstart](../advanced-usage/quickstart-manual.md) to set up your own files.


### Optional Step: Probe your own web page
Now you can set up monitoring of your own page.

In your current folder, edit the file `prometheus/scrape-configs/probers/probe-simple.yml` that you downloaded from git.

Add the following yml to the bottom of the file:
1. In your current folder, add the following YAML to the bottom of the file `prometheus/scrape-configs/probers/probe-simple.yml`:

```yaml
- targets:
@@ -36,23 +38,24 @@ Add the following yml to the bottom of the file:
job: probe-my-own-site
```

Be careful with the indentation in YAML: this target must be at the same depth as the existing contents.

The change should get applied automatically, but if you don't want to wait then run
2. Restart the containers with:
```
docker compose restart
```

Now refresh the Grafana dashboard to see the availability of google.com; it's probably 100%!


## Next steps
This is the end of the quickstart tutorial, which enables probing the availability of endpoints.

For the next steps we can:
- Look deeper into the observability dashboards, on [Dashboards Userguide](./userguide-tutorial.md)
- Productionise our deployment to enable further features
- Enable *Telemetry* like VM memory usage, and Elasticsearch index size, by running Exporters
- Configure *Telemetry* like VM memory usage, and Elasticsearch index size, by running Exporters
- Enable *Alerting* based on our availability and a defined Service Level Objective (SLO)
- Look further into the available dashboards
- Set up further *Probing* of our running services to get availability metrics
- Fully customize the stack with our own dashboards, recording rules and metrics


76 changes: 76 additions & 0 deletions docs/observability/get-started/userguide-tutorial.md
@@ -0,0 +1,76 @@
# Dashboard User Guide
This guide walks you through monitoring your stack using the included Grafana dashboards. It shows how to use each dashboard and suggests what to look out for.

## Availability - How well are things running?
![Availability Dashboard](../../_static/screenshots-dashboards-availability.png)

Open the CogStack Monitoring Dashboard on [localhost/grafana](http://localhost/grafana/d/NEzutrbMk/cogstack-monitoring-dashboard).

Use the percentage uptime charts at the top to see the availability over a given time period. For example, “Over the last 8 hours, we have 99.5% availability on my service”.

Use the time filter in the top right corner of the page to change the window; for example, change it to 30 days to see availability for the whole month.

Look for trends like:
- Has there been a full outage of a service for 5 minutes, where the 5m availability goes to 0?
- Is there some disruption over the time period, where the 5m availability stays high but the 6h availability is going down?
- Have we met the service level objective, if we set the time threshold to 30 days?
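
To make these percentages concrete, the sketch below converts between availability and minutes of downtime. It is plain arithmetic rather than anything taken from the dashboards; the function names are hypothetical, and the 99.9% objective is an example value, not this project's configured SLO.

```python
def downtime_minutes(availability_pct: float, window_minutes: float) -> float:
    """Minutes of unavailability implied by an availability percentage over a window."""
    return (1 - availability_pct / 100) * window_minutes


def availability_from_downtime(down_minutes: float, window_minutes: float) -> float:
    """Availability percentage for a given number of down minutes in a window."""
    return 100 * (1 - down_minutes / window_minutes)


# 99.5% availability over the last 8 hours is about 2.4 minutes of downtime.
print(round(downtime_minutes(99.5, 8 * 60), 2))

# A full 5-minute outage takes the 5m availability to 0%,
# but only lowers the 6h availability to about 98.6%.
print(round(availability_from_downtime(5, 5), 1))
print(round(availability_from_downtime(5, 6 * 60), 1))

# Example objective of 99.9% over 30 days: about 43.2 minutes of downtime allowed.
print(round(downtime_minutes(99.9, 30 * 24 * 60), 1))
```

This is why a short outage is obvious in the 5m view while the longer windows move only slightly.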

Use the filters at the top, or click in the table, to narrow the view down to specific targets, services or hosts.

See [Setup Probing](../setup/probing.md) to do the full setup of probers.

## Inventory - What is running?
![Docker Metrics Dashboard](../../_static/screenshots-dashboards-docker-metrics.png)

Use the Docker Metrics dashboard to check which containers are running, where, and whether they're healthy. This is useful for verifying deployments or diagnosing issues.

The dashboard above includes the hostnames, IP addresses and any other details configured.

Check for things like:
- Containers not running where you thought they should be by looking at the hostname for each container
- Containers restarting unexpectedly, by looking at the "Running" column in the table

See [Telemetry](../setup/telemetry.md) to set this up.

## Telemetry - How can I see details of resources?
Some additional dashboards are set up to provide more metrics.

### VM Metrics
![VM Metrics dashboard](../../_static/screenshots-dashboards-vm-metrics.png)

Open the VM Metrics dashboard on [localhost/grafana](http://localhost/grafana/d/rYdddlPWk/vm-metrics-in-cogstack).

Select a VM from the host dropdown.

Look for things like:

- CPU usage: is a process using too much CPU?
- Memory usage: are you running out of RAM?
- Disk IO / space: is disk space running low?
- Trends over time, by setting the time filter to 30 days. Is your disk usage increasing over time?

### Elasticsearch Metrics
![Elasticsearch Metrics Dashboard](../../_static/screenshots-dashboards-es-metrics.png)

Open the Elasticsearch Metrics dashboard on [localhost/grafana](http://localhost/grafana/d/n_nxrE_mk/elasticsearch-metrics-in-cogstack).

This dashboard helps you understand how your Elasticsearch or OpenSearch cluster is behaving.

Look at:
- Cluster health status: shows yellow/red states immediately
- Index size per shard: to detect unbalanced index growth
- Query latency and throughput: useful during heavy search loads

See [Telemetry](../setup/telemetry.md) to set this up.

## Alerting - When should I look at this?
Alerting is set up using Grafana Alerts, but is paused by default.

When alerts are set up, the Grafana graphs show when the alerts fired.
![Alerts Firing on dashboard](../../_static/screenshots-dashboards-alerts.png)

Two sets of rules are defined in this project:

- Basic alerts using uptime: if uptime over 5m or 6h drops below a certain percentage, send an alert
- Alerting on SLOs using burn rates, with multi-window, multi-burn-rate alerts following the best practices defined in [Google SRE - Prometheus Alerting: Turn SLOs into Alerts](https://sre.google/workbook/alerting-on-slos/)

See [Alerting](../setup/alerting.md) to set this up.
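
The burn-rate idea can be sketched in a few lines. This is an illustration of the technique described in the Google SRE workbook, not this project's actual alert rules: the function names are hypothetical, the 99.9% SLO is an example value, and the 14.4 threshold with 1h/5m windows is the workbook's example pairing rather than anything confirmed for this stack.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent.

    error_ratio: fraction of failed probes in the window (0.0 to 1.0).
    slo: target availability as a fraction, e.g. 0.999.
    """
    error_budget = 1 - slo
    return error_ratio / error_budget


def should_alert(long_window_errors: float, short_window_errors: float,
                 slo: float, threshold: float) -> bool:
    """Fire only if BOTH windows burn budget faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears quickly after recovery.
    """
    return (burn_rate(long_window_errors, slo) > threshold
            and burn_rate(short_window_errors, slo) > threshold)


SLO = 0.999       # example objective, not this project's configured value
FAST_BURN = 14.4  # workbook example: 2% of a 30-day budget spent in 1 hour

# 2% errors over the last hour and 3% over the last 5 minutes: alert.
print(should_alert(0.02, 0.03, SLO, FAST_BURN))
# The outage is over: the short window has recovered, so no alert.
print(should_alert(0.02, 0.0, SLO, FAST_BURN))
```

Requiring both windows is the design choice that separates this from the basic uptime alerts: a single threshold on one window either fires late or keeps firing long after the problem is fixed.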
9 changes: 9 additions & 0 deletions docs/observability/reference/_index.md
@@ -0,0 +1,9 @@
# Reference

```{toctree}
:maxdepth: 2

project-details.md
concept-materials.md

```
7 changes: 7 additions & 0 deletions docs/observability/reference/concept-materials.md
@@ -0,0 +1,7 @@
# Concepts
```{toctree}
:maxdepth: 2
understanding-metrics.md

```
