| 1 | +# Prometheus Monitoring Mixins |
| 2 | + |
| 3 | +## Using jsonnet to package together dashboards, alerts and exporters. |
| 4 | + |
| 5 | +Status: Draft |
| 6 | +Tom Wilkie, Grafana Labs |
| 7 | +Frederic Branczyk, Red Hat |
| 8 | + |
| 9 | +In this design doc we present a technique for packaging and deploying "Monitoring Mixins" - |
| 10 | +extensible and customisable combinations of dashboards, alert definitions and exporters. |
| 11 | + |
| 12 | +## Problem |
| 13 | + |
| 14 | +[Prometheus](#Notes) offers powerful open source monitoring and alerting - but that comes with higher |
| 15 | +degrees of freedom, making pre-configured monitoring configurations hard to build. |
| 16 | +Simultaneously, it has become accepted wisdom that the developers of a given software |
| 17 | +package are best placed to operate said software, or at least construct the basic monitoring |
| 18 | +configuration. |
| 19 | + |
| 20 | +This work aims to build on Julius Volz' document ["Prometheus Alerting and Dashboard Example Bundles"](#Notes) |
| 21 | +and subsequent PR ["Add initial node-exporter example bundle"](#Notes). In particular, we |
| 22 | +support the hypothesis that for Prometheus to gain increased traction we will need to appeal to |
| 23 | +non-monitoring-experts, and allow for a relatively seamless pre-configured monitoring |
| 24 | +experience. Where we disagree is around standardization: we do not want to prescribe a given |
| 25 | +label schema, example deployment or topology. That being said, a lot of the challenges |
| 26 | +surfaced in that doc are shared here. |
| 27 | + |
| 28 | +## Aims |
| 29 | + |
| 30 | +This solution aims to define a minimal standard for how to package together Prometheus alerts, |
| 31 | +Prometheus recording rules and [Grafana](#Notes) dashboards in a way that is: |
| 32 | + |
| 33 | +**Easy to install and use, platform agnostic.** The users of these packages are unlikely to be |
| 34 | +monitoring experts. These packages must be easily installable with a few commands. And they |
| 35 | +must be general enough to work in all the environments where Prometheus can work: we're not |
| 36 | +just trying to build for Kubernetes here. That being said, the experience will be first class on |
| 37 | +Kubernetes. |
| 38 | + |
| 39 | +**Hosted alongside the programs which expose Prometheus metrics.** More often than not, |
| 40 | +the best people to build the alerting rules and dashboards for a given application are the authors |
| 41 | +of that application. And if that is not the case, then at least users of a given application will look |
| 42 | +to its source for monitoring best practices. We aim to provide a packaging method which allows |
| 43 | +the repo hosting the application source to also host the application's monitoring package, so |
| 44 | +that it is versioned alongside the application. For example, we envisage the monitoring |
| 45 | +mixin for Etcd to live in the etcd repo and the monitoring package for Hashicorp's Consul to live |
| 46 | +in the [consul_exporter](#Notes) repo. |
| 47 | + |
| 48 | +**We want the ability to iterate and collaborate on packages.** A challenge with the existing |
| 49 | +published dashboards and alerts is that they are static: the only way to use them is to copy them |
| 50 | +into your codebase and edit them to fit your deployment. This makes it hard for |
| 51 | +users to contribute changes back to the original author; it also makes it impossible to download |
| 52 | +improved versions and stay up to date. We want these packages to be |
| 53 | +constantly evolving; we want to encourage drive-by commits. |
| 54 | + |
| 55 | +**Packages should be reusable, configurable and extensible.** Users should be able to |
| 56 | +configure the packages to fit their deployments and label schema without modifying the |
| 57 | +packages. Users should be able to extend the packages with extra dashboard panels and extra |
| 58 | +alerts, without having to copy, paste and modify them. The packages must be configurable so |
| 59 | +that they support the many different label schemes used today by different organisations. |
| 60 | + |
| 61 | +## Proposal |
| 62 | + |
| 63 | +**Monitoring Mixins.** A monitoring mixin is a package of configuration containing Prometheus |
| 64 | +alerts, Prometheus recording rules and Grafana dashboards. Mixins will be maintained in |
| 65 | +version controlled repos (e.g. git) as a set of files. Versioning of mixins will be provided by the |
| 66 | +version control system; mixins themselves should not contain multiple versions. |
| 67 | + |
| 68 | +Mixins are intended just for the combination of Prometheus and Grafana, and not other |
| 69 | +monitoring or visualisation systems. Mixins are intended to be opinionated about the choice of |
| 70 | +monitoring technology. |
| 71 | + |
| 72 | +Mixins should not however be opinionated about how this configuration should be deployed; |
| 73 | +they should not contain manifests for deploying Prometheus and Grafana on Kubernetes, for |
| 74 | +instance. Multiple, separate projects can and should exist to help deploy mixins; we will provide |
| 75 | +an example of how to do this on Kubernetes, and a tool for integrating with traditional config |
| 76 | +management systems. |
| 77 | + |
| 78 | +**Jsonnet.** We propose the use of [jsonnet](#Notes), a configuration language from Google, as the basis of |
| 79 | +our monitoring mixins. Jsonnet has some popularity in this space, as it is used in the [ksonnet](#Notes) |
| 80 | +project for achieving similar goals for Kubernetes. |
| 81 | + |
| 82 | +Jsonnet offers the ability to parameterise configuration, allowing for basic customisation. |
| 83 | +Furthermore, in Jsonnet one can reference another part of the data structure, reducing |
| 84 | +repetition. For example, with jsonnet one can specify a default job name, and then have all the |
| 85 | +alerts use that: |
| 86 | + |
| 87 | +``` |
| 88 | +{ |
| 89 | + _config+:: { |
| 90 | + kubeStateMetricsSelector: 'job="default/kube-state-metrics"', |
| 91 | + |
| 92 | + allowedNotReadyPods: 0, |
| 93 | + }, |
| 94 | + |
| 95 | + groups+: [ |
| 96 | + { |
| 97 | + name: "kubernetes", |
| 98 | + rules: [ |
| 99 | + { |
| 100 | + alert: "KubePodNotReady", |
| 101 | + expr: ||| |
| 102 | + sum by (namespace, pod) ( |
| 103 | + kube_pod_status_phase{%(kubeStateMetricsSelector)s, phase!~"Running|Succeeded"} |
| 104 | + ) > %(allowedNotReadyPods)s |
| 105 | + ||| % $._config, |
| 106 | + "for": "1h", |
| 107 | + labels: { |
| 108 | + severity: "critical", |
| 109 | + }, |
| 110 | + annotations: { |
| 111 | + message: "{{ $labels.namespace }}/{{ $labels.pod }} is not ready.", |
| 112 | + }, |
| 113 | + }, |
| 114 | + ], |
| 115 | + }, |
| 116 | + ], |
| 117 | +} |
| 118 | +``` |
| 119 | + |
| 120 | +**Configuration.** We'd like to suggest some standardisation of how configuration is supplied to |
| 121 | +mixins. A top level `_config` dictionary should be provided, containing various parameters for |
| 122 | +substitution into alerts and dashboards. In the above example, this is used to specify the |
| 123 | +selector for the kube-state-metrics pod, and the threshold for the alert. |
| 124 | + |
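As a minimal sketch of how a consumer might use this convention, the snippet below overrides two `_config` values without touching the alert definition. The `upstream` object is a hypothetical stand-in for a vendored mixin; the selector and threshold values are illustrative only.

```jsonnet
// Hypothetical stand-in for an upstream mixin; in practice this would be
// imported from a vendored file rather than defined inline.
local upstream = {
  _config+:: {
    kubeStateMetricsSelector: 'job="default/kube-state-metrics"',
    allowedNotReadyPods: 0,
  },

  groups+: [{
    name: 'kubernetes',
    rules: [{
      alert: 'KubePodNotReady',
      // $._config is late-bound, so it sees the consumer's overrides below.
      expr: 'sum by (namespace, pod) (kube_pod_status_phase{%(kubeStateMetricsSelector)s, phase!~"Running|Succeeded"}) > %(allowedNotReadyPods)s' % $._config,
    }],
  }],
};

// The consumer only supplies the values that differ in their deployment.
upstream {
  _config+:: {
    kubeStateMetricsSelector: 'job="monitoring/kube-state-metrics"',
    allowedNotReadyPods: 2,
  },
}
```

Because `_config` is a hidden field (`::`), it does not appear in the rendered output; only the `groups` list, with the overridden values substituted into the expression, is manifested.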
| 125 | +**Extension.** One of jsonnet's basic operations is to "merge" data structures - this also allows you |
| 126 | +to extend existing configurations. For example, given an existing dashboard: |
| 127 | + |
| 128 | +``` |
| 129 | +local g = import "klumps/lib/grafana.libsonnet"; |
| 130 | + |
| 131 | +{ |
| 132 | + dashboards+:: { |
| 133 | + "foo.json": g.dashboard("Foo") |
| 134 | + .addRow( |
| 135 | + g.row("Foo") |
| 136 | + .addPanel( |
| 137 | + g.panel("Bar") + |
| 138 | + g.queryPanel('irate(foo_bar_total[1m])', 'Foo Bar') |
| 139 | + ) |
| 140 | + ) |
| 141 | + }, |
| 142 | +} |
| 143 | +``` |
| 144 | + |
| 145 | +It is relatively easy to import it and add extra rows: |
| 146 | + |
| 147 | +``` |
| 148 | +local g = import "klumps/lib/grafana.libsonnet"; |
| 149 | + |
| 150 | +(import "foo.libsonnet") { |
| 151 | + dashboards+:: { |
| 152 | + "foo.json": |
| 153 | + super["foo.json"].addRow( |
| 154 | + g.row("A new row") |
| 155 | + .addPanel( |
| 156 | + g.panel("A new panel") + |
| 157 | + g.queryPanel('irate(new_total[1m])', 'New') |
| 158 | + ) |
| 159 | + ) |
| 160 | + }, |
| 161 | +} |
| 162 | +``` |
| 163 | + |
| 164 | +These abilities offered by jsonnet are key to being able to separate out "upstream" alerts and |
| 165 | +dashboards from customizations, and keep upstream in sync with the source of the mixin. |
| 166 | + |
| 167 | +**Higher Order Abstractions.** jsonnet is a functional programming language, and as such |
| 168 | +allows you to build higher order abstractions over your configuration. For example, you can |
| 169 | +build functions to generate recording rules for a set of percentiles and label aggregations, |
| 170 | +given a histogram: |
| 171 | + |
| 172 | +``` |
| 173 | +local histogramRules(metric, labels) = |
| 174 | + local vars = { |
| 175 | + metric: metric, |
| 176 | + labels_underscore: std.join("_", labels), |
| 177 | + labels_comma: std.join(", ", labels), |
| 178 | + }; |
| 179 | + [ |
| 180 | + { |
| 181 | + record: "%(labels_underscore)s:%(metric)s:99quantile" % vars, |
| 182 | + expr: "histogram_quantile(0.99, sum(rate(%(metric)s_bucket[5m])) by (le, |
| 183 | +%(labels_comma)s))" % vars, |
| 184 | + }, |
| 185 | + { |
| 186 | + record: "%(labels_underscore)s:%(metric)s:50quantile" % vars, |
| 187 | + expr: "histogram_quantile(0.50, sum(rate(%(metric)s_bucket[5m])) by (le, |
| 188 | +%(labels_comma)s))" % vars, |
| 189 | + }, |
| 190 | + { |
| 191 | + record: "%(labels_underscore)s:%(metric)s:avg" % vars, |
| 192 | + expr: "sum(rate(%(metric)s_sum[5m])) by (%(labels_comma)s) / |
| 193 | +sum(rate(%(metric)s_count[5m])) by (%(labels_comma)s)" % vars, |
| 194 | + }, |
| 195 | + ]; |
| 196 | + |
| 197 | +{ |
| 198 | + groups+: [{ |
| 199 | + name: "frontend_rules", |
| 200 | + rules: |
| 201 | + histogramRules("frontend_request_duration_seconds", ["job"]) + |
| 202 | + histogramRules("frontend_request_duration_seconds", ["job", "route"]), |
| 203 | + }], |
| 204 | +} |
| 205 | +``` |
| 206 | + |
| 207 | +Other potential examples include functions to generate alerts at different thresholds, emitting |
| 208 | +multiple alerts such as warning and critical. |
| 209 | + |
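A threshold helper of that kind could be sketched as follows. The function name `thresholdAlerts`, the alert name, the `for` duration and the threshold values are all hypothetical; the recorded series name matches what `histogramRules` above would produce for the `job` label.

```jsonnet
// Hypothetical helper: emits one alert per severity, each comparing the same
// expression against that severity's threshold.
local thresholdAlerts(name, expr, thresholds) = [
  {
    alert: name,
    expr: '%s > %s' % [expr, thresholds[severity]],
    'for': '15m',
    labels: { severity: severity },
  }
  // std.objectFields returns the severities in sorted order.
  for severity in std.objectFields(thresholds)
];

{
  groups+: [{
    name: 'frontend_alerts',
    rules: thresholdAlerts(
      'FrontendHighLatency',
      'job:frontend_request_duration_seconds:99quantile',
      { warning: 0.5, critical: 1 }
    ),
  }],
}
```

Both alerts share a name and differ only in the `severity` label and threshold, so routing in Alertmanager can be keyed on the label alone.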
| 210 | +**[Grafonnet](#Notes).** An emerging pattern in the jsonnet ecosystem is the existence of libraries of helper |
| 211 | +functions to generate objects for a given system. For example, ksonnet is a library to generate |
| 212 | +objects for the Kubernetes object model. Grafonnet is a library for generating Grafana |
| 213 | +Dashboards using jsonnet. We envisage a series of libraries, such as Grafonnet, to help people |
| 214 | +build mixins. As such, any system for installing mixins needs to deal with transitive |
| 215 | +dependencies. |
| 216 | + |
| 217 | +**Package Management.** The current proofs of concept for mixins (see below) use the new |
| 218 | +package manager [jsonnet-bundler](#Notes) enabling the following workflow: |
| 219 | + |
| 220 | +``` |
| 221 | +$ jb install kausal github.com/kausalco/public/consul-mixin |
| 222 | +``` |
| 223 | + |
| 224 | +This downloads a copy of the mixin into `vendor/consul-mixin` and allows users to include |
| 225 | +the mixin in their ksonnet config like so: |
| 226 | + |
| 227 | +``` |
| 228 | +local prometheus = import "prometheus-ksonnet/prometheus-ksonnet.libsonnet"; |
| 229 | +local consul_mixin = import "consul-mixin/mixin.libsonnet"; |
| 230 | + |
| 231 | +prometheus + consul_mixin { |
| 232 | + _config+:: { |
| 233 | + namespace: "default", |
| 234 | + }, |
| 235 | +} |
| 236 | +``` |
| 237 | + |
| 238 | +This example also uses the prometheus-ksonnet package from [Kausal](#Notes), which understands the |
| 239 | +structure of the mixins and manifests alerting rules, recording rules and dashboards as config |
| 240 | +maps in Kubernetes, mounted into the Kubernetes pods in the correct place. |
| 241 | + |
| 242 | +However, we think this is a wider problem than just monitoring mixins, and are exploring designs |
| 243 | +for a generic jsonnet package manager in a [separate design doc](#Notes). |
| 244 | + |
| 245 | +**Proposed Schema.** To allow multiple tools to utilise mixins, we must agree on some common |
| 246 | +naming. The proposal is that a mixin is a single dictionary containing three keys: |
| 247 | + |
| 248 | +- `grafanaDashboards`: a dictionary mapping dashboard file names (e.g. `foo.json`) to dashboard JSON. |
| 249 | +- `prometheusAlerts`: a list of Prometheus alert groups. |
| 250 | +- `prometheusRules`: a list of Prometheus rule groups. |
| 251 | + |
| 252 | +Each of these values will be expressed as jsonnet objects - not strings. It is the responsibility of |
| 253 | +the tool consuming the mixin to render these out as JSON or YAML. Jsonnet scripts to do this |
| 254 | +for you will be provided. |
| 255 | + |
| 256 | +``` |
| 257 | +{ |
| 258 | + grafanaDashboards+:: { |
| 259 | + "dashboard-name.json": {...}, |
| 260 | + }, |
| 261 | + prometheusAlerts+:: [...], |
| 262 | + prometheusRules+:: [...], |
| 263 | +} |
| 264 | +``` |
| 265 | + |
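To illustrate the rendering step that is left to the consuming tool, here is one possible sketch using only jsonnet's standard library. The inline `mixin` object, the alert and the output file names `alerts.yaml` and `rules.yaml` are hypothetical; this is not the script the proposal promises to provide.

```jsonnet
// Hypothetical mixin following the proposed three-key schema.
local mixin = {
  grafanaDashboards+:: {
    'foo.json': { title: 'Foo', rows: [] },
  },
  prometheusAlerts+:: [{
    name: 'example',
    rules: [{ alert: 'ExampleDown', expr: 'up{job="example"} == 0', 'for': '5m' }],
  }],
  prometheusRules+:: [],
};

// Each field below is a file name mapped to its rendered contents.
{
  // Prometheus rule files expect the groups under a top-level `groups` key.
  'alerts.yaml': std.manifestYamlDoc({ groups: mixin.prometheusAlerts }),
  'rules.yaml': std.manifestYamlDoc({ groups: mixin.prometheusRules }),
} + {
  // One JSON file per dashboard, keyed by the dashboard's file name.
  [name]: std.manifestJsonEx(mixin.grafanaDashboards[name], '  ')
  for name in std.objectFields(mixin.grafanaDashboards)
}
```

Evaluated in the jsonnet CLI's multi-file string mode (`jsonnet -S -m <dir> <file>`), this should write one file per field into the output directory.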
| 266 | +**Consuming a mixin.** |
| 267 | + |
| 268 | +- TODO examples of how we expect people to install, customise and extend mixins. |
| 269 | +- TODO Ability to manifest out jsonnet configuration in a variety of formats - YAML, JSON, INI etc |
| 270 | +- TODO show how it works with ksonnet but also with something like puppet.. |
| 271 | + |
| 272 | +## Examples & Proofs of Concept |
| 273 | +We will probably put the specification and list of known mixins in a repo somewhere, as a |
| 274 | +readme. For now, these are the known mixins and related projects: |
| 275 | + |
| 276 | +| Application | Mixin | Author | |
| 277 | +|------------------|--------------------|--------------------------------| |
| 278 | +| CoreOS Etcd | etcd-mixin | Grapeshot / Tom Wilkie | |
| 279 | +| Cassandra | TBD | Grafana Labs | |
| 280 | +| Hashicorp Consul | consul-mixin | Kausal | |
| 281 | +| Hashicorp Vault | vault_exporter | Grapeshot / Tom Wilkie | |
| 282 | +| Kubernetes | kubernetes-mixin | Tom Wilkie & Frederic Branczyk | |
| 283 | +| Kubernetes | kubernetes-grafana | Frederic Branczyk | |
| 284 | +| Kubernetes | kube-prometheus | Frederic Branczyk | |
| 285 | +| Prometheus | prometheus-ksonnet | Kausal | |
| 286 | + |
| 287 | +**Open Questions** |
| 288 | + |
| 289 | +- Some systems require exporters; can / should these be packaged as part of the mixin? |
| 290 | + Hard to do generally, easy to do for Kubernetes with ksonnet. |
| 291 | +- On the exporter topic, some systems need statsd_exporter mappings to be consistent with alerts and dashboards. Even if |
| 292 | + we can include statsd_exporter in the mixin, can we include the mappings? |
| 293 | +- A lot of questions from Julius' design are still open: how to deal with different aggregation windows, what labels to |
| 294 | + use on alerts etc. |
| 295 | + |
| 296 | +## Notes |
| 297 | + |
| 298 | +This was recreated from |
| 299 | +a [web.archive.org](https://web.archive.org/web/20211021151124/https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/edit) |
| 300 | +capture of the original document. |
| 301 | + |
| 302 | +The links in the archive do not work and have not been recreated. |
| 303 | + |
| 304 | +The license of this file is unknown, but judging by the intent it was meant to be shared freely. |