CogStack · alhendrickson · Jun 3, 2025 · Jun 2, 2025 · Jun 2, 2025 · Jun 2, 2025
diff --git a/.gitignore b/.gitignore
@@ -1,6 +1,6 @@
 # Custom ignores
 observability/examples/simple/observability-simple
-
+_build
 
 # Python ignores
 # Byte-compiled / optimized / DLL files

diff --git a/.readthedocs.yml b/.readthedocs.yml
@@ -0,0 +1,17 @@
+# .readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+version: 2
+
+build:
+  os: ubuntu-20.04
+  tools:
+    python: "3.9"
+
+sphinx:
+  configuration: docs/conf.py
+
+python:
+  install:
+    - requirements: docs/requirements.txt
diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/_static/screenshots-dashboards-alerts.png b/docs/_static/screenshots-dashboards-alerts.png
diff --git a/docs/_static/screenshots-dashboards-availability.png b/docs/_static/screenshots-dashboards-availability.png
diff --git a/docs/_static/screenshots-dashboards-docker-metrics.png b/docs/_static/screenshots-dashboards-docker-metrics.png
diff --git a/docs/_static/screenshots-dashboards-es-metrics.png b/docs/_static/screenshots-dashboards-es-metrics.png
diff --git a/docs/_static/screenshots-dashboards-vm-metrics.png b/docs/_static/screenshots-dashboards-vm-metrics.png
diff --git a/docs/conf.py b/docs/conf.py
@@ -0,0 +1,37 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+import os
+import sys
+sys.path.insert(0, os.path.abspath("../observability/docs"))
+
+print("Hello")
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+project = 'CogStack Platform Toolkit'
+copyright = '2025, CogStack Org'
+author = 'CogStack Org'
+release = 'latest'
+html_title = "CogStack Platform Toolkit"
+
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = [
+    'sphinx_rtd_theme',
+    'sphinx.ext.autodoc',
+    'myst_parser',
+    'sphinx.ext.inheritance_diagram',
+]
+templates_path = ['_templates']
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = "furo"
+html_static_path = ['_static']
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,15 @@
+
+# CogStack Platform Toolkit
+
+This project provides utilities for running CogStack in production.
+
+- [CogStack Observability](observability/_index.md) 
+
+```{toctree}
+:hidden:
+
+observability/_index
+
+```
+
+
diff --git a/docs/make.bat b/docs/make.bat
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd
diff --git a/observability/docs/overview.md → docs/observability/_index.md b/observability/docs/overview.md → docs/observability/_index.md
@@ -1,4 +1,4 @@
-# Cogstack Observability Stack
+# Cogstack Observability
 
 This project provides observability of a cogstack deployment.
 
@@ -9,10 +9,15 @@ It provides the following features:
 - Blackbox Probing of services to find service level indicators of uptime and latency
 - A working inventory of what is running where
 
-## Contents
 
 See the [Quickstart](./get-started/quickstart.md) to see how to easily run this stack.
 
+```{toctree}
+:maxdepth: 2
 
+get-started/_index
+setup/_index
+customization/_index
+reference/_index
 
-
+```
diff --git a/docs/observability/customization/_index.md b/docs/observability/customization/_index.md
@@ -0,0 +1,19 @@
+# Customization
+
+```{include} custom-dashboards.md
+:heading-offset: 1
+```
+
+```{include} custom-prometheus-configs.md
+:heading-offset: 1
+```
+
+
+```{toctree}
+:titlesonly:
+:hidden:
+
+custom-prometheus-configs.md
+custom-dashboards.md
+
+```
diff --git a/docs/observability/customization/custom-dashboards.md b/docs/observability/customization/custom-dashboards.md
@@ -0,0 +1,15 @@
+# Custom Dashboards
+You can set up custom dashboards as JSON files and include them alongside the defaults in this project.
+
+Grafana is set up with preconfigured dashboards, datasource, and alerting. These work when Prometheus runs in this stack, and they depend on all the metrics following the defined rules.
+
+We advise committing any edits or new configs back into your git repository, and sticking with Grafana provisioning instead of allowing manual edits.
+
+
+## How to add a new dashboard with provisioning 
+
+- Mount new dashboard files in the `/etc/grafana/provisioning/dashboards/site` directory
+- To remove or change the existing dashboards, mount over the existing files in that directory
+
+For more info see [Grafana Dashboard Provisioning](https://grafana.com/docs/grafana/latest/administration/provisioning/#dashboards)
+
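+As a minimal sketch of the mount described above, assuming Grafana runs as a Docker Compose service: the service name `grafana` and the host directory `./my-dashboards` are placeholders, not names taken from this project. Only the container path comes from this page.
+
+```yaml
+# Sketch only: adjust the service name and host path to match your deployment.
+services:
+  grafana:  # assumed service name
+    volumes:
+      # Hypothetical host directory holding your dashboard JSON files,
+      # mounted read-only at the provisioning path given above.
+      - ./my-dashboards:/etc/grafana/provisioning/dashboards/site:ro
+```
+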
diff --git a/docs/observability/customization/custom-prometheus-configs.md b/docs/observability/customization/custom-prometheus-configs.md
@@ -0,0 +1,17 @@
+# Custom Prometheus Configuration
+You can add completely custom Prometheus scrape configs and recording rules by mounting them in Docker.
+
+- `site/prometheus/scrape-configs/*.yml`. This is for advanced configuration. 
+
+Any yml file placed in this directory is used as a standard Prometheus scrape config. This gives full flexibility over which metrics are collected, and access to all Prometheus features. Add any further configs that you want Prometheus to use.
+
+```yaml
+# Custom scrape config definition
+scrape_configs:
+  - job_name: custom-scrape-config # Scrape configuration to get metrics from elasticsearch, eg index size.
+    static_configs:
+      - targets:
+          - my-custom-target:9114
+        labels:
+          custom_label: custom # (Optional)
+```
diff --git a/docs/observability/get-started/_index.md b/docs/observability/get-started/_index.md
@@ -0,0 +1,7 @@
+# Getting Started
+
+```{toctree}
+:maxdepth: 2
+quickstart
+userguide-tutorial
+```
diff --git a/observability/docs/get-started/quickstart.md → docs/observability/get-started/quickstart.md b/observability/docs/get-started/quickstart.md → docs/observability/get-started/quickstart.md
@@ -1,32 +1,34 @@
-## QuickStart
+# QuickStart
 
 This tutorial guides you through running the simplest setup of the observability stack using example configuration files and Docker Compose.
 
 After completing these steps, you will have a full observability stack running locally, showing the availability of web pages you want to target
 
-### Requirements
+## Requirements
 
 - Docker installed ([install Docker](https://docs.docker.com/get-docker/))
 - Docker Compose installed ([install Docker Compose](https://docs.docker.com/compose/install/))
 - A terminal with network access
 
+## Steps
+
 ### Step 1: Run the Quickstart script
 
 Run this quickstart script to setup the project
 ```bash
-curl https://raw.githubusercontent.com/CogStack/cogstack-platform-toolkit/main/observability/examples/simple/quickstart.sh | bash
+curl https://raw.githubusercontent.com/CogStack/cogstack-platform-toolkit/refs/heads/main/observability/examples/simple/quickstart.sh | bash
 ```
 Now go to "http://localhost/grafana" to see the dashboards
 
 Thats everything. The stack is running and you can see the availability.
 
+If you can't use the script, see the [Manual Quickstart](../advanced-usage/quickstart-manual.md) to set up your own files.
+
 
 ### Optional Step: Probe your own web page
 Now you can look at getting monitoring of your own page
 
-In your current folder, edit the file `prometheus/scrape-configs/probers/probe-simple.yml` that you downloaded from git.
-
-Add the following yml to the bottom of the file:
+1. In your current folder, open the file `prometheus/scrape-configs/probers/probe-simple.yml` and add the following yml to the bottom of the file:
 
 ```yaml
 - targets:
@@ -36,23 +38,24 @@ Add the following yml to the bottom of the file:
     job: probe-my-own-site
 ```
 
+Be careful with the indentation in yml: this target must be at the same depth as the existing contents.
 
-The change should get applied automatically, but if you dont want to wait then run
+2. Restart the containers with:
 ```
 docker compose restart
 ```
 
 Now refresh the grafana dashboard, and you can see the availability of google.com, it's probably 100%!
 
-
 ## Next steps
 This is the end of this quickstart tutorial, that enables probing availability of endpoints.
 
 For the next steps we can:
+- Look deeper into the observability dashboards in the [Dashboard User Guide](./userguide-tutorial.md)
 - Productionise our deployment to enable further features
-- Enable *Telemetry* like VM memory usage, and Elasticsearch index size, by running Exporters
+- Configure *Telemetry* like VM memory usage, and Elasticsearch index size, by running Exporters
 - Enable *Alerting* based on our availability and a defined Service Level Objective (SLO)
-- Look further into the available dashboards
+- Set up further *Probing* of our running services to get availability metrics
 - Fully customize the stack with our own dashboards, recording rules and metrics
 
 

diff --git a/docs/observability/get-started/userguide-tutorial.md b/docs/observability/get-started/userguide-tutorial.md
@@ -0,0 +1,76 @@
+# Dashboard User Guide
+This guide walks you through monitoring your stack with the included Grafana dashboards. It shows how to use each dashboard and suggests what to look out for.
+
+## Availability - How well are things running?
+![Availability Dashboard](../../_static/screenshots-dashboards-availability.png)
+
+Open the Cogstack Monitoring Dashboard on [localhost/grafana](http://localhost/grafana/d/NEzutrbMk/cogstack-monitoring-dashboard) 
+
+Use the percentage uptime charts at the top to see the availability over a given time period. For example, “Over the last 8 hours, we have 99.5% availability on my service”. 
+
+Use the time filter in the top right corner of the page to change the window. For example, change it to 30 days to see availability for the whole month.
+
+Look for trends like:
+- Has there been a full outage of a service for 5 minutes, where the 5m availability goes to 0?
+- Is there some disruption over the time period, where the 5m availability stays high but the 6h availability is going down?
+- Have we met the service level objective when the time window is set to 30 days?
+
+Use the filters at the top, or click in the table to better filter the view down to specific targets, services or hosts. 
+
+See [Setup Probing](../setup/probing.md) to do the full setup of probers.
+
+## Inventory - What is running? 
+![Docker Metrics Dashboard](../../_static/screenshots-dashboards-docker-metrics.png)
+
+Use the Docker Metrics dashboard to check which containers are running, where, and whether they're healthy. This is useful for verifying deployments or diagnosing issues.
+
+The dashboard above includes the hostnames, IP addresses and any other details configured. 
+
+Check for things like:
+- Containers not running where you thought they should be by looking at the hostname for each container
+- Containers restarting unexpectedly, by looking at the "Running" column in the table
+
+See [telemetry](../setup/telemetry.md) to set this up
+
+## Telemetry - How can I see details of resources?
+Some additional dashboards are set up to provide more metrics.
+
+### VM Metrics
+![ VM Metrics dashboard ](../../_static/screenshots-dashboards-vm-metrics.png)
+
+Open the VM Metrics dashboard on [localhost/grafana](http://localhost/grafana/d/rYdddlPWk/vm-metrics-in-cogstack)
+
+Select a VM from the host dropdown.
+
+Look for things like:
+
+- CPU Usage — is a process using too much CPU?
+- Memory Usage — if you're running out of RAM 
+- Disk IO / Space — alerts you to low disk conditions
+- Trends over time, by setting the time filter to 30 days. Is your disk usage increasing over time?
+
+### Elasticsearch Metrics
+![ElasticSearch Metrics Dashboard](../../_static/screenshots-dashboards-es-metrics.png)
+Open the Elasticsearch Metrics dashboard on [localhost/grafana](http://localhost/grafana/d/n_nxrE_mk/elasticsearch-metrics-in-cogstack)
+
+This dashboard helps you understand how your Elasticsearch or OpenSearch cluster is behaving.
+
+Look at:
+- Cluster health status — shows yellow/red states immediately
+- Index size per shard — to detect unbalanced index growth
+- Query latency and throughput — useful during heavy search loads
+
+See [telemetry](../setup/telemetry.md) to set this up
+
+## Alerting - When should I look at this?
+Alerting is set up using Grafana Alerts, but is paused by default.
+
+When alerts are set up, the Grafana graphs show when the alerts fired.
+![Alerts Firing on dashboard](../../_static/screenshots-dashboards-alerts.png)
+
+Two sets of rules are defined in this project:
+
+- Basic alerts using uptime. If uptime over a 5m or 6h window drops below a certain percentage, an alert is sent
+- Alerting on SLOs by using burn rates, for multi-window multi-rate alerts following best practices defined in [Google SRE - Prometheus Alerting: Turn SLOs into Alerts](https://sre.google/workbook/alerting-on-slos/) 
+
+See [Alerting](../setup/alerting.md) to set this up
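+
+As an illustration of the multi-window burn-rate idea, the sketch below writes one such condition as a Prometheus alerting rule. It is not the Grafana-provisioned rule shipped with this project: the group and alert names, the 99.5% SLO, and the direct use of the blackbox exporter's `probe_success` metric are assumptions. The 14.4 burn-rate factor with 1h and 5m windows is the paging example from the Google SRE workbook for a 30-day SLO.
+
+```yaml
+groups:
+  - name: example-slo-burn-rate  # hypothetical group name
+    rules:
+      - alert: ExampleHighBurnRate  # hypothetical alert name
+        # Error ratio = 1 - availability. Fire only when both the long (1h)
+        # and short (5m) windows burn the error budget of an assumed 99.5% SLO
+        # at more than 14.4 times the sustainable rate.
+        expr: |
+          (1 - avg_over_time(probe_success[1h])) > (14.4 * 0.005)
+          and
+          (1 - avg_over_time(probe_success[5m])) > (14.4 * 0.005)
+        labels:
+          severity: page
+```
+
+The short window makes the alert stop firing soon after the service recovers, while the long window keeps brief blips from triggering it.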
diff --git a/docs/observability/reference/_index.md b/docs/observability/reference/_index.md
@@ -0,0 +1,9 @@
+# Reference
+
+```{toctree}
+:maxdepth: 2
+
+project-details.md
+concept-materials.md
+
+```
diff --git a/docs/observability/reference/concept-materials.md b/docs/observability/reference/concept-materials.md
@@ -0,0 +1,7 @@
+# Concepts
+```{toctree}
+:maxdepth: 2
+understanding-metrics.md
+
+```
+