Commit 1d0c7cc

Merge pull request #883 from deepy/extract-original-design-doc
Extract design document from web.archive.org
2 parents d9a7f3a + 469c00d commit 1d0c7cc

2 files changed

+305
-1
lines changed

DESIGN.md

Lines changed: 304 additions & 0 deletions
@@ -0,0 +1,304 @@
# Prometheus Monitoring Mixins

## Using jsonnet to package together dashboards, alerts and exporters.

Status: Draft
Tom Wilkie, Grafana Labs
Frederic Branczyk, Red Hat

In this design doc we present a technique for packaging and deploying "Monitoring Mixins" -
extensible and customisable combinations of dashboards, alert definitions and exporters.

## Problem

[Prometheus](#Notes) offers powerful open source monitoring and alerting - but that comes with higher
degrees of freedom, making pre-configured monitoring configurations hard to build.
Simultaneously, it has become accepted wisdom that the developers of a given software
package are best placed to operate said software, or at least construct the basic monitoring
configuration.

This work aims to build on Julius Volz' document ["Prometheus Alerting and Dashboard Example Bundles"](#Notes)
and subsequent PR ["Add initial node-exporter example bundle"](#Notes). In particular, we
support the hypothesis that for Prometheus to gain increased traction we will need to appeal to
non-monitoring-experts, and allow for a relatively seamless pre-configured monitoring
experience. Where we disagree is around standardization: we do not want to prescribe a given
label schema, example deployment or topology. That being said, a lot of the challenges
surfaced in that doc are shared here.

## Aims

This solution aims to define a minimal standard for how to package together Prometheus alerts,
Prometheus recording rules and [Grafana](#Notes) dashboards in a way that is:

**Easy to install and use, platform agnostic.** The users of these packages are unlikely to be
monitoring experts. These packages must be easily installable with a few commands. And they
must be general enough to work in all the environments where Prometheus can work: we're not
just trying to build for Kubernetes here. That being said, the experience will be first class on
Kubernetes.

**Hosted alongside the programs which expose Prometheus metrics.** More often than not,
the best people to build the alerting rules and dashboards for a given application are the authors
of that application. And if that is not the case, then at least users of a given application will look
to its source for monitoring best practices. We aim to provide a packaging method which allows
the repo hosting the application source to also host the application's monitoring package, so
that the two are versioned alongside each other. For example, we envisage the monitoring
mixin for Etcd to live in the etcd repo and the monitoring package for Hashicorp's Consul to live
in the [consul_exporter](#Notes) repo.

**We want the ability to iterate and collaborate on packages.** A challenge with the existing
published dashboards and alerts is that they are static: the only way to use them is to copy them
into your codebase and edit them to fit your deployment. This makes it hard for
users to contribute changes back to the original author, and impossible to download
improved versions and stay up to date. We want these packages to be
constantly evolving; we want to encourage drive-by commits.

**Packages should be reusable, configurable and extensible.** Users should be able to
configure the packages to fit their deployments and label schema without modifying the
packages. Users should be able to extend the packages with extra dashboard panels and extra
alerts, without having to copy, paste and modify them. The packages must be configurable so
that they support the many different label schemes used today by different organisations.

## Proposal

**Monitoring Mixins.** A monitoring mixin is a package of configuration containing Prometheus
alerts, Prometheus recording rules and Grafana dashboards. Mixins will be maintained in
version controlled repos (e.g. git) as a set of files. Versioning of mixins will be provided by the
version control system; mixins themselves should not contain multiple versions.

Mixins are intended just for the combination of Prometheus and Grafana, and not other
monitoring or visualisation systems. Mixins are intended to be opinionated about the choice of
monitoring technology.

Mixins should not, however, be opinionated about how this configuration should be deployed;
they should not contain manifests for deploying Prometheus and Grafana on Kubernetes, for
instance. Multiple, separate projects can and should exist to help deploy mixins; we will provide
an example of how to do this on Kubernetes, and a tool for integrating with traditional config
management systems.

**Jsonnet.** We propose the use of [jsonnet](#Notes), a configuration language from Google, as the basis of
our monitoring mixins. Jsonnet has some popularity in this space, as it is used in the [ksonnet](#Notes)
project for achieving similar goals for Kubernetes.

Jsonnet offers the ability to parameterise configuration, allowing for basic customisation.
Furthermore, in Jsonnet one can reference another part of the data structure, reducing
repetition. For example, with jsonnet one can specify a default job name, and then have all the
alerts use that:

```
{
  _config+:: {
    kubeStateMetricsSelector: 'job="default/kube-state-metrics"',

    allowedNotReadyPods: 0,
  },

  groups+: [
    {
      name: "kubernetes",
      rules: [
        {
          alert: "KubePodNotReady",
          expr: |||
            sum by (namespace, pod) (
              kube_pod_status_phase{%(kubeStateMetricsSelector)s, phase!~"Running|Succeeded"}
            ) > %(allowedNotReadyPods)s
          ||| % $._config,
          "for": "1h",
          labels: {
            severity: "critical",
          },
          annotations: {
            message: "{{ $labels.namespace }}/{{ $labels.pod }} is not ready.",
          },
        },
      ],
    },
  ],
}
```

**Configuration.** We'd like to suggest some standardisation of how configuration is supplied to
mixins. A top level `_config` dictionary should be provided, containing various parameters for
substitution into alerts and dashboards. In the above example, this is used to specify the
selector for the kube-state-metrics pod, and the threshold for the alert.
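
As a rough illustration of how a consumer might supply configuration, the sketch below inlines a cut-down copy of the alert above as a hypothetical `upstream` object (in practice it would be imported from the mixin's own file) and overrides two `_config` values without touching the rule. The override values and the `upstream` name are made up for the example.

```jsonnet
// Hypothetical stand-in for an imported mixin; a real consumer would
// use `import` instead of defining this inline.
local upstream = {
  _config+:: {
    kubeStateMetricsSelector: 'job="default/kube-state-metrics"',
    allowedNotReadyPods: 0,
  },
  groups+: [
    {
      name: 'kubernetes',
      rules: [
        {
          alert: 'KubePodNotReady',
          // $ is late-bound, so it sees the consumer's overrides below.
          expr: 'sum by (namespace, pod) (kube_pod_status_phase{%(kubeStateMetricsSelector)s, phase!~"Running|Succeeded"}) > %(allowedNotReadyPods)s' % $._config,
          'for': '1h',
        },
      ],
    },
  ],
};

// Consumer side: only the configuration changes, the rule is untouched.
upstream {
  _config+:: {
    kubeStateMetricsSelector: 'job="monitoring/kube-state-metrics"',
    allowedNotReadyPods: 2,
  },
}
```

Because `_config` is a hidden field (`::`), it does not appear in the rendered output; only the `groups` list does, with the overridden selector and threshold substituted into the expression.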

**Extension.** One of jsonnet's basic operations is to "merge" data structures - this also allows you
to extend existing configurations. For example, given an existing dashboard:

```
local g = import "klumps/lib/grafana.libsonnet";

{
  dashboards+:: {
    "foo.json": g.dashboard("Foo")
      .addRow(
        g.row("Foo")
        .addPanel(
          g.panel("Bar") +
          g.queryPanel('irate(foo_bar_total[1m])', 'Foo Bar')
        )
      )
  },
}
```

It is relatively easy to import it and add extra rows:

```
// Assumes the dashboard above is saved as foo.libsonnet.
local g = import "klumps/lib/grafana.libsonnet";
local foo = import "foo.libsonnet";

foo {
  dashboards+:: {
    // Take the upstream "foo.json" dashboard and append a row to it.
    "foo.json": super["foo.json"].addRow(
      g.row("A new row")
      .addPanel(
        g.panel("A new panel") +
        g.queryPanel('irate(new_total[1m])', 'New')
      )
    ),
  },
}
```

These abilities offered by jsonnet are key to being able to separate out "upstream" alerts and
dashboards from customizations, and keep upstream in sync with the source of the mixin.
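
The merge semantics can be seen without any helper library. The following is a minimal sketch using plain objects; the field names and contents are placeholders, not part of any real mixin.

```jsonnet
// Placeholder "upstream" content, standing in for an imported mixin.
local upstream = {
  groups: [{ name: 'upstream', rules: [] }],
  dashboards: {
    'foo.json': { title: 'Foo', rows: [{ title: 'Foo' }] },
  },
};

// Local customisations, kept separate from the upstream definition.
local custom = {
  // `+:` appends to the inherited list instead of replacing it.
  groups+: [{ name: 'local-additions', rules: [] }],
  dashboards+: {
    // Nested `+:` merges into the existing dashboard object.
    'foo.json'+: { rows+: [{ title: 'A new row' }] },
  },
};

upstream + custom
```

The result keeps the upstream group and dashboard title, and gains the extra group and row. Replacing `+:` with `:` in `custom` would instead overwrite the upstream values.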

**Higher Order Abstractions.** jsonnet is a functional programming language, and as such
allows you to build higher order abstractions over your configuration. For example, you can
build functions to generate recording rules for a set of percentiles and labels aggregations,
given a histogram:

```
local histogramRules(metric, labels) =
  local vars = {
    metric: metric,
    labels_underscore: std.join("_", labels),
    labels_comma: std.join(", ", labels),
  };
  [
    {
      record: "%(labels_underscore)s:%(metric)s:99quantile" % vars,
      expr: "histogram_quantile(0.99, sum(rate(%(metric)s_bucket[5m])) by (le, %(labels_comma)s))" % vars,
    },
    {
      record: "%(labels_underscore)s:%(metric)s:50quantile" % vars,
      expr: "histogram_quantile(0.50, sum(rate(%(metric)s_bucket[5m])) by (le, %(labels_comma)s))" % vars,
    },
    {
      record: "%(labels_underscore)s:%(metric)s:avg" % vars,
      expr: "sum(rate(%(metric)s_sum[5m])) by (%(labels_comma)s) / sum(rate(%(metric)s_count[5m])) by (%(labels_comma)s)" % vars,
    },
  ];

{
  groups+: [{
    name: "frontend_rules",
    rules:
      histogramRules("frontend_request_duration_seconds", ["job"]) +
      histogramRules("frontend_request_duration_seconds", ["job", "route"]),
  }],
}
```

Other potential examples include functions to generate alerts at different thresholds,
emitting multiple alerts (for example, warning and critical).
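
One way such a threshold helper could look is sketched below. The function name `thresholdAlerts`, the alert name, the `for` duration and the threshold values are all assumptions made for illustration; the recorded series name follows the `histogramRules` naming used above.

```jsonnet
// Hypothetical helper: emits one alert per severity, each comparing the
// same expression against that severity's threshold.
local thresholdAlerts(name, expr, thresholds) = [
  {
    alert: name,
    expr: '%s > %s' % [expr, thresholds[severity]],
    'for': '15m',
    labels: { severity: severity },
  }
  // std.objectFields returns the field names in sorted order.
  for severity in std.objectFields(thresholds)
];

{
  groups+: [{
    name: 'frontend_alerts',
    rules: thresholdAlerts(
      'FrontendHighLatency',
      'job:frontend_request_duration_seconds:99quantile',
      { warning: 0.5, critical: 1 }
    ),
  }],
}
```

Here a single call produces two alerts with the same name and different `severity` labels, so adding a further severity only requires another entry in the thresholds object.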

**[Grafonnet](#Notes).** An emerging pattern in the jsonnet ecosystem is the existence of libraries of helper
functions to generate objects for a given system. For example, ksonnet is a library to generate
objects for the Kubernetes object model. Grafonnet is a library for generating Grafana
Dashboards using jsonnet. We envisage a series of libraries, such as Grafonnet, to help people
build mixins. As such, any system for installing mixins needs to deal with transitive
dependencies.

**Package Management.** The current proofs of concept for mixins (see below) use the new
package manager [jsonnet-bundler](#Notes), enabling the following workflow:

```
$ jb install kausal github.com/kausalco/public/consul-mixin
```

This downloads a copy of the mixin into `vendor/consul-mixin` and allows users to include
the mixin in their ksonnet config like so:

```
local prometheus = import "prometheus-ksonnet/prometheus-ksonnet.libsonnet";
local consul_mixin = import "consul-mixin/mixin.libsonnet";

prometheus + consul_mixin {
  _config+:: {
    namespace: "default",
  },
}
```

This example also uses the prometheus-ksonnet package from [Kausal](#Notes), which understands the
structure of the mixins and manifests alerting rules, recording rules and dashboards as config
maps in Kubernetes, mounted into the Kubernetes pods in the correct place.

However, we think this is a wider problem than just monitoring mixins, and are exploring designs
for a generic jsonnet package manager in a [separate design doc](#Notes).

**Proposed Schema.** To allow multiple tools to utilise mixins, we must agree on some common
naming. The proposal is that a mixin is a single dictionary containing three keys:

- `grafanaDashboards` A dictionary of dashboard file name (foo.json) to dashboard json.
- `prometheusAlerts` A list of Prometheus alert groups.
- `prometheusRules` A list of Prometheus rule groups.

Each of these values will be expressed as jsonnet objects - not strings. It is the responsibility of
the tool consuming the mixin to render these out as JSON or YAML. Jsonnet scripts to do this
for you will be provided.

```
{
  grafanaDashboards+:: {
    "dashboard-name.json": {...},
  },
  prometheusAlerts+:: [...],
  prometheusRules+:: [...],
}
```
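
A rendering script for this schema might look roughly like the sketch below. It is not one of the scripts promised above; the inline `mixin` object, its contents and the output file names are placeholders, and wrapping the lists under a `groups` key assumes the Prometheus rule file layout.

```jsonnet
// Placeholder mixin following the proposed schema; a real script would
// use `import` to load an actual mixin instead.
local mixin = {
  grafanaDashboards+:: {
    'dashboard-name.json': { title: 'Example' },
  },
  prometheusAlerts+:: [
    { name: 'example', rules: [{ alert: 'Example', expr: 'up == 0' }] },
  ],
  prometheusRules+:: [],
};

// One output string per file: rules as YAML, dashboards as JSON.
{
  'alerts.yaml': std.manifestYamlDoc({ groups: mixin.prometheusAlerts }),
  'rules.yaml': std.manifestYamlDoc({ groups: mixin.prometheusRules }),
} + {
  [name]: std.manifestJsonEx(mixin.grafanaDashboards[name], '  ')
  for name in std.objectFields(mixin.grafanaDashboards)
}
```

Assuming the script is saved as `render.jsonnet` and an `out/` directory exists, something like `jsonnet -S -m out/ render.jsonnet` should write one file per key, since `-m` treats each top-level field as a file and `-S` emits the string values verbatim.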

**Consuming a mixin.**

- TODO examples of how we expect people to install, customise and extend mixins.
- TODO Ability to manifest out jsonnet configuration in a variety of formats - YAML, JSON, INI etc
- TODO show how it works with ksonnet but also with something like puppet..

## Examples & Proofs of Concept

We will probably put the specification and list of known mixins in a repo somewhere, as a
readme. For now, these are the known mixins and related projects:

| Application      | Mixin              | Author                         |
|------------------|--------------------|--------------------------------|
| CoreOS Etcd      | etcd-mixin         | Grapeshot / Tom Wilkie         |
| Cassandra        | TBD                | Grafana Labs                   |
| Hashicorp Consul | consul-mixin       | Kausal                         |
| Hashicorp Vault  | vault_exporter     | Grapeshot / Tom Wilkie         |
| Kubernetes       | kubernetes-mixin   | Tom Wilkie & Frederic Branczyk |
| Kubernetes       | kubernetes-grafana | Frederic Branczyk              |
| Kubernetes       | kube-prometheus    | Frederic Branczyk              |
| Prometheus       | prometheus-ksonnet | Kausal                         |

**Open Questions**

- Some systems require exporters; can / should these be packaged as part of the mixin?
  Hard to do generally, easy to do for Kubernetes with ksonnet.
- On the exporter topic, some systems need statsd_exporter mappings to be consistent with alerts and dashboards. Even if
  we can include statsd_exporter in the mixin, can we include the mappings?
- A lot of questions from Julius' design are still open: how to deal with different aggregation windows, what labels to
  use on alerts etc.

## Notes

This was recreated from
a [web.archive.org](https://web.archive.org/web/20211021151124/https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/edit)
capture of the original document.

The links in the archive do not work and have not been recreated.

The license of this file is unknown, but judging by the intent it was meant to be shared freely.

README.md

Lines changed: 1 addition & 1 deletion
@@ -248,7 +248,7 @@ While the community has not yet fully agreed on alert severities and their to be

* For more motivation, see
"[The RED Method: How to instrument your services](https://kccncna17.sched.com/event/CU8K/the-red-method-how-to-instrument-your-services-b-tom-wilkie-kausal?iframe=no&w=100%&sidebar=yes&bg=no)" talk from CloudNativeCon Austin.
-* For more information about monitoring mixins, see this [design doc](https://docs.google.com/document/d/1A9xvzwqnFVSOZ5fD3blKODXfsat5fg6ZhnKu9LK3lB4/edit#).
+* For more information about monitoring mixins, see this [design doc](DESIGN.md).

## Note
