README.md

* `cluster_variables`: (dict). A collection of variables that can be used in monitors. Reference them by prefixing with `ClusterVariables`, e.g. `{{ ClusterVariables.var1 }}` (see the sketch after this list).
* `rulesets`: (List). A collection of rulesets. A ruleset consists of a Kubernetes resource type, the annotations the resource must have to be considered valid, and a collection of monitors to manage for the resource.
  * `type`: (String). The type of resource to match when matching with annotations. Currently supports `deployment`, `namespace`, `binding`, and `static` as values.
  * `match_annotations`: (List). A collection of name/value pairs of annotations that must be present on the resource for it to be managed.
  * `bound_objects`: (List). A collection of object types that are bound to this object. For instance, if you have a ruleset for a namespace, you can bind other objects to it, such as deployments and services. When a bound object in the namespace gets updated, the namespace's rulesets are applied to it.
  * `monitors`: (Map). A collection of monitors to manage for any resource that matches the rules defined.
  * `include_tags`: When true, notifications from this monitor automatically insert triggering tags into the title.
  * `require_full_window`: Boolean indicating whether a monitor needs a full window of data to be evaluated.
  * `locked`: Boolean indicating whether changes are only allowed from the creator or admins.
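
As a minimal sketch of how these pieces fit together (the variable, annotation, monitor name, and query below are illustrative, not taken from astro's docs; see the conf-example.yml content below for complete monitors):

```
cluster_variables:
  warning_notifications: "@slack-warnings"
rulesets:
- type: deployment
  match_annotations:
  - name: astro/owner
    value: astro
  monitors:
    dep-replica-alert:
      name: "Deployment Replica Alert"
      type: metric alert
      query: "max(last_10m):max:kubernetes_state.deployment.replicas_available{} by {kube_deployment} < 1"
      # ClusterVariables comes from cluster_variables above; the Datadog
      # template variable is escaped as described in the templating note below.
      message: 'No replicas available for {{ "{{kube_deployment.name}}" }} {{ ClusterVariables.warning_notifications }}'
      options:
        require_full_window: true
        locked: false
```
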
#### Static monitors

A static monitor is one that does not depend on the presence of a resource in the Kubernetes cluster. An example of a static monitor would be `Host CPU Usage`. There are a variety of example static monitors in the [static_conf.yml example](./static_conf.yml).
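
A rough sketch of a static ruleset (assuming it reuses the ruleset and monitor fields documented above; the `host-cpu-usage` key and query are illustrative):

```
rulesets:
- type: static
  monitors:
    host-cpu-usage:
      name: "Host CPU Usage"
      type: metric alert
      query: "avg(last_10m):avg:system.cpu.user{*} by {host} > 90"
      # No extra escaping brackets here -- see the templating note below.
      message: "CPU usage is high on {{host.name}}"
      options:
        notify_no_data: false
        thresholds:
          critical: 90
```
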
#### A Note on Templating

Since Datadog uses a templating language very similar to Go templating, a template variable intended for Datadog must be "escaped" by inserting it as a template literal:

```
{{ "{{/is_alert}}" }}
```

The above note does not apply to static monitors; if the extra brackets are present, creation of the static monitor will fail.
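
For instance (a minimal sketch, assuming the config is rendered through Go's text/template, which is what the escaping trick above relies on):

```
# Written in the monitor config:
message: '{{ "{{#is_alert}}" }} CPU is high on {{ "{{host.name}}" }} {{ "{{/is_alert}}" }}'
# What Datadog receives after the template is rendered:
#   {{#is_alert}} CPU is high on {{host.name}} {{/is_alert}}
```
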
## Overriding Configuration
It is possible to override monitor elements using Kubernetes resource annotations.

As of now, the only fields that can be overridden are:

* query
* type

Templating in the override is currently not available.

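As an illustration (the annotation key format and the `dep-replica-alert` monitor name here are hypothetical; check astro's documentation for the exact override key it expects):

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    # Hypothetical key shape: astro/override.<monitor-name>.<field>
    astro/override.dep-replica-alert.query: "max(last_10m):max:kubernetes_state.deployment.replicas_available{kube_deployment:my-app} < 2"
```
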
## Contributing
PRs welcome! Check out the [Contributing Guidelines](CONTRIBUTING.md),

conf-example.yml

```
rulesets:
- # ...
  match_annotations:
  - name: astro/admin-bound
    value: "true"
  monitors:
    ns-high-load-avg:
      name: "High System Load Average"
      type: metric alert
      query: "avg(last_30m):avg:system.load.norm.5{k8s.io/role/master:1} by {host} > 2"
      message: |-
        Load average is high on {{ "{{host.name}} {{host.ip}}" }}.
        This is a normalized load based on the number of CPUs (i.e. ActualLoadAverage / NumberOfCPUs)
        Is this node over-provisioned? Pods may need to have CPU limits closer to their requests
        Is this node doing a lot of I/O? Load average could be high based on high disk or networking I/O. This may be acceptable if application performance is still ok. To reduce I/O-based system load, you may need to artificially limit the number of high-I/O pods running on a single node.
      tags: []
      options:
        notify_audit: false
        notify_no_data: false
        new_host_delay: 300
        thresholds:
          critical: 2
        locked: false
    ns-high-mem-use:
      name: "Memory Utilization"
      type: query alert
      query: "avg(last_15m):avg:system.mem.pct_usable{k8s.io/role/master:1} by {host} < 0.1"
      message: |-
        {{ "{{#is_alert}}" }}
        Running out of free memory on {{ "{{host.name}}" }}
        {{ "{{/is_alert}}" }}
        {{ "{{#is_alert_to_warning}}" }}
        Memory usage has decreased. There is about 30% free
    # ...
      message: |-
        There has been at least 1 pod Pending for 30 minutes.
      options:
        notify_no_data: false
        new_host_delay: 300
        thresholds:
          critical: 1
        locked: false
    ns-host-disk-use:
      name: "Host Disk Usage"
      type: metric alert
      query: "avg(last_30m):(avg:system.disk.total{*} by {host} - avg:system.disk.free{*} by {host}) / avg:system.disk.total{*} by {host} * 100 > 90"
      message: |-
        {{ "{{#is_alert}}" }}
        Disk Usage has been above threshold over 30 minutes on {{ "{{host.name}}" }}
        {{ "{{/is_alert}}" }}
        {{ "{{#is_warning}}" }}
        Disk Usage has been above threshold over 30 minutes on {{ "{{host.name}}" }}
        {{ "{{/is_warning}}" }}
        {{ "{{^is_alert}}" }}
        Disk Usage has recovered on {{ "{{host.name}}" }}
        {{ "{{/is_alert}}" }}
        {{ "{{^is_warning}}" }}
        Disk Usage has recovered on {{ "{{host.name}}" }}
        {{ "{{/is_warning}}" }}
      tags: []
      options:
        notify_audit: false
        notify_no_data: false
        new_host_delay: 300
        require_full_window: true
        thresholds:
          critical: 90
          warning: 80
          warning_recovery: 75
          critical_recovery: 85
        locked: false
    ns-hpa-errors:
      name: "HPA Errors"
      type: event alert
      query: "events('sources:kubernetes priority:all \"unable to fetch metrics from resource metrics API:\"').by('hpa').rollup('count').last('1h') > 200"
      message: |-
        {{ "{{#is_alert}}" }}
        A high number of HPA failures (> {{ "{{threshold}}" }}) are occurring. Can HPAs get metrics?
        {{ "{{/is_alert}}" }}
        {{ "{{#is_alert_recovery}}" }}
        HPA Metric Retrieval Failure has recovered.
        {{ "{{/is_alert_recovery}}" }}
      tags: []
      options:
        notify_audit: false
        notify_no_data: false
        require_full_window: true
        locked: false
    ns-io-wait-times:
      name: "I/O Wait Times"
      type: metric alert
      query: "avg(last_10m):avg:system.cpu.iowait{*} by {host} > 50"
      message: |-
        {{ "{{#is_alert}}" }}
        The I/O wait time for {{ "{{host.ip}}" }} is very high
        - Is the EBS volume out of burst capacity for iops?
        - Is something writing lots of errors to the journal?
        - Is there a pod doing something unexpected (crash looping, etc)?
        {{ "{{/is_alert}}" }}
        {{ "{{^is_alert}}" }}
        The EBS volume burst capacity is returning to normal.
        {{ "{{/is_alert}}" }}
      tags: []
      options:
        notify_audit: false
        new_host_delay: 300
        notify_no_data: false
        require_full_window: true
        locked: false
        thresholds:
          critical: 50
          warning: 30
    ns-nginx-config-reload-fail:
      name: "Nginx Config Reload Failure"
      type: metric alert
      query: "max(last_5m):max:ingress.nginx_ingress_controller_config_last_reload_successful{*} by {kube_deployment} <= 0"
      message: |-
        {{ "{{#is_alert}}" }}
        The last nginx config reload for {{ "{{kube_deployment.name}}" }} failed! Are there any bad ingress configs? Does the nginx config have a syntax error?
```