
Commit 86f7eb8

Author: Luke Reed

static monitor config (#113)

Adds a static monitor configuration feature.

* Updated README.md
* Created static_conf.yml

Parent: 4214ed5

File tree

12 files changed (+298, -246 lines)


README.md

Lines changed: 8 additions & 1 deletion
@@ -77,6 +77,7 @@ rulesets:
 
 * `cluster_variables`: (Dict). A collection of variables that can be used in monitors; reference them by prepending `ClusterVariables`, e.g. `{{ ClusterVariables.var1 }}`.
 * `rulesets`: (List). A collection of rulesets. A ruleset consists of a Kubernetes resource type, annotations the resource must have to be considered valid, and a collection of monitors to manage for the resource.
+* `type`: (String). The type of resource to match when matching with annotations. Currently supports `deployment`, `namespace`, `binding`, and `static` as values.
 * `match_annotations`: (List). A collection of name/value pairs of annotations that must be present on the resource to manage it.
 * `bound_objects`: (List). A collection of object types that are bound to this object. For instance, if you have a ruleset for a namespace, you can bind other objects like deployments and services. Then, when a bound object in the namespace is updated, this ruleset is applied to it.
 * `monitors`: (Map). A collection of monitors to manage for any resource that matches the rules defined; a sketch combining these fields follows this hunk.
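
Taken together, a ruleset using these fields might look like the sketch below. This is an editorial illustration, not part of the commit: the annotation, query, and threshold values are placeholders loosely based on conf-example.yml.

```yaml
rulesets:
  # Manage one monitor for every namespace carrying this annotation.
  - type: namespace
    match_annotations:
      - name: astro/admin-bound
        value: "true"
    # Re-apply this ruleset when bound deployments in the namespace change.
    bound_objects:
      - deployment
    monitors:
      ns-pending-pods:
        name: "Pending Pods - {{ .ObjectMeta.Name }}"
        type: query alert
        # Simplified placeholder query; see conf-example.yml for the real one.
        query: "min(last_30m):sum:kubernetes_state.pod.status_phase{phase:pending,namespace:{{ .ObjectMeta.Name }}}.fill(zero) >= 1"
        message: "At least one pod has been Pending for 30 minutes in {{ .ObjectMeta.Name }}."
        tags: []
        options:
          notify_no_data: false
          thresholds:
            critical: 1.0
          locked: false
```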
@@ -111,13 +112,19 @@ rulesets:
 * `include_tags`: When true, notifications from this monitor automatically insert triggering tags into the title.
 * `require_full_window`: Boolean indicating whether a monitor needs a full window of data to be evaluated.
 * `locked`: Boolean indicating whether changes are only allowed from the creator or admins.
+#### Static monitors
+A static monitor is one that does not depend on the presence of a resource in the Kubernetes cluster. An example of a static monitor would be `Host CPU Usage`. There are a variety of example static monitors in the [static_conf.yml example](./static_conf.yml).
 
 #### A Note on Templating
 Since Datadog's templating language is very similar to Go templating, a template variable passed through to Datadog must be "escaped" by inserting it as a template literal:
 
 ```
 {{ "{{/is_alert}}" }}
 ```
+
+The above note does not apply to static monitors; if extra brackets are present, creation of the static monitors will fail.
+
 ## Overriding Configuration
 
 It is possible to override monitor elements using Kubernetes resource annotations.
@@ -135,7 +142,7 @@ As of now, the only fields that can be overridden are:
 * query
 * type
 
-Additionally, templating in the override is currently not available.
+Templating in the override is currently not available.
 
 ## Contributing
 PRs welcome! Check out the [Contributing Guidelines](CONTRIBUTING.md),
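
Since this commit introduces the `static` ruleset type, here is a minimal sketch of what a static ruleset might look like; the monitor values are illustrative placeholders, and the committed static_conf.yml holds the real examples. Note that, per the README note above, the message uses plain Datadog template variables without the extra escaping brackets:

```yaml
rulesets:
  # A static ruleset does not match any Kubernetes resource;
  # its monitors are created unconditionally.
  - type: static
    monitors:
      host-cpu-usage:
        name: "Host CPU Usage"
        type: metric alert
        # Placeholder query and thresholds.
        query: "avg(last_10m):avg:system.cpu.user{*} by {host} > 90"
        # No {{ "..." }} escaping here: extra brackets make static monitor creation fail.
        message: "CPU usage is high on {{host.name}}."
        tags: []
        options:
          notify_no_data: false
          thresholds:
            critical: 90
          locked: false
```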

cmd/root.go

Lines changed: 2 additions & 2 deletions
@@ -67,7 +67,7 @@ func init() {
 	rootCmd.PersistentFlags().StringVarP(&metricsPort, "metrics-port", "p", ":8080", "The address to serve prometheus metrics.")
 	rootCmd.PersistentFlags().StringVar(&namespace, "namespace", "kube-system", "The namespace where astro is running")
 }
-func leaderElection(cmd *cobra.Command, args []string) {
+func leaderElection(*cobra.Command, []string) {
 	log.SetOutput(os.Stdout)
 	log.SetLevel(logLevels[strings.ToLower(logLevel)])
 
@@ -127,7 +127,7 @@ func leaderElection(cmd *cobra.Command, args []string) {
 }
 
 func run(ctx context.Context, cancel context.CancelFunc) {
-	// create a channel to respond to SIGTERMs
+	// create a channel to respond to SIGTERM and SIGINT
 	signals := make(chan os.Signal, 1)
 	defer close(signals)

conf-example.yml

Lines changed: 4 additions & 165 deletions
@@ -64,51 +64,10 @@ rulesets:
       - name: astro/admin-bound
         value: "true"
     monitors:
-      ns-high-load-avg:
-        name: "High System Load Average"
-        type: metric alert
-        query: "avg(last_30m):avg:system.load.norm.5{k8s.io/role/master:1} by {host} > 2"
-        message: |-
-          Load average is high on {{ "{{host.name}} {{host.ip}}" }}.
-          This is a normalized load based on the number of CPUs (i.e. ActualLoadAverage / NumberOfCPUs)
-          Is this node over-provisioned? Pods may need to have CPU limits closer to their requests
-          Is this node doing a lot of I/O? Load average could be high based on high disk or networking I/O. This may be acceptable if application performance is still ok. To reduce I/O-based system load, you may need to artificially limit the number of high-I/O pods running on a single node.
-        tags: []
-        options:
-          notify_audit: false
-          notify_no_data: false
-          new_host_delay: 300
-          thresholds:
-            critical: 2
-          locked: false
-      ns-high-mem-use:
-        name: "Memory Utilization"
-        type: query alert
-        query: "avg(last_15m):avg:system.mem.pct_usable{k8s.io/role/master:1} by {host} < 0.1"
-        message: |-
-          {{ "{{#is_alert}}" }}
-          Running out of free memory on {{ "{{host.name}}" }}
-          {{ "{{/is_alert}}" }}
-          {{ "{{#is_alert_to_warning}}" }}
-          Memory usage has decreased. There is about 30% free
-          {{ "{{/is_alert_to_warning}}" }}
-          {{ "{{#is_alert_recovery}}" }}
-          Memory is below threshold again
-          {{ "{{/is_alert_recovery}}" }}
-        tags: []
-        options:
-          notify_audit: false
-          notify_no_data: false
-          new_host_delay: 300
-          require_full_window: true
-          thresholds:
-            critical: 0.1
-            warning: 0.15
-          locked: false
       ns-pending-pods:
-        name: "Pending Pods"
-        type: metric alert
-        query: "min(last_30m):sum:kubernetes_state.pod.status_phase{phase:running} - sum:kubernetes_state.pod.status_phase{phase:running} + sum:kubernetes_state.pod.status_phase{phase:pending}.fill(zero) >= 1"
+        name: "Pending Pods - {{ .ObjectMeta.Name }}"
+        type: query alert
+        query: "min(last_30m):sum:kubernetes_state.pod.status_phase{phase:running,namespace:{{ .ObjectMeta.Name }}} - sum:kubernetes_state.pod.status_phase{phase:running,namespace:{{ .ObjectMeta.Name }}} + sum:kubernetes_state.pod.status_phase{phase:pending,namespace:{{ .ObjectMeta.Name }}}.fill(zero) >= 1"
         message: |-
           {{ "{{#is_alert}}" }}
           There has been at least 1 pod Pending for 30 minutes.
@@ -126,128 +85,8 @@ rulesets:
           notify_no_data: false
           new_host_delay: 300
           thresholds:
-            critical: 1
-          locked: false
-      ns-host-disk-use:
-        name: "Host Disk Usage"
-        type: metric alert
-        query: "avg(last_30m):(avg:system.disk.total{*} by {host} - avg:system.disk.free{*} by {host}) / avg:system.disk.total{*} by {host} * 100 > 90"
-        message: |-
-          {{ "{{#is_alert}}" }}
-          Disk Usage has been above threshold over 30 minutes on {{ "{{host.name}}" }}
-          {{ "{{/is_alert}}" }}
-          {{ "{{#is_warning}}" }}
-          Disk Usage has been above threshold over 30 minutes on {{ "{{host.name}}" }}
-          {{ "{{/is_warning}}" }}
-          {{ "{{^is_alert}}" }}
-          Disk Usage has recovered on {{ "{{host.name}}" }}
-          {{ "{{/is_alert}}" }}
-          {{ "{{^is_warning}}" }}
-          Disk Usage has recovered on {{ "{{host.name}}" }}
-          {{ "{{/is_warning}}" }}
-        tags: []
-        options:
-          notify_audit: false
-          notify_no_data: false
-          new_host_delay: 300
-          require_full_window: true
-          thresholds:
-            critical: 90
-            warning: 80
-            warning_recovery: 75
-            critical_recovery: 85
-          locked: false
-      ns-hpa-errors:
-        name: "HPA Errors"
-        type: event alert
-        query: "events('sources:kubernetes priority:all \"unable to fetch metrics from resource metrics API:\"').by('hpa').rollup('count').last('1h') > 200"
-        message: |-
-          {{ "{{#is_alert}}" }}
-          A high number of hpa failures (> {{ "{{threshold}}" }} ) are occurring. Can HPAs get metrics?
-          {{ "{{/is_alert}}" }}
-          {{ "{{#is_alert_recovery}}" }}
-          HPA Metric Retrieval Failure has recovered.
-          {{ "{{/is_alert_recovery}}" }}
-        tags: []
-        options:
-          notify_audit: false
-          notify_no_data: false
-          require_full_window: true
-          locked: false
-      ns-io-wait-times:
-        name: "I/O Wait Times"
-        type: metric alert
-        query: "avg(last_10m):avg:system.cpu.iowait{*} by {host} > 50"
-        message: |-
-          {{ "{{#is_alert}}" }}
-          The I/O wait time for {host.ip} is very high
-          - Is the EBS volume out of burst capacity for iops?
-          - Is something writing lots of errors to the journal?
-          - Is there a pod doing something unexpected (crash looping, etc)?
-          {{ "{{/is_alert}}" }}
-          {{ "{{^is_alert}}" }}
-          The EBS volume burst capacity is returning to normal.
-          {{ "{{/is_alert}}" }}
-        tags: []
-        options:
-          notify_audit: false
-          new_host_delay: 300
-          notify_no_data: false
-          require_full_window: true
-          locked: false
-          thresholds:
-            critical: 50
-            warning: 30
-      ns-nginx-config-reload-fail:
-        name: "Nginx Config Reload Failure"
-        type: metric alert
-        query: "max(last_5m):max:ingress.nginx_ingress_controller_config_last_reload_successful{*} by {kube_deployment} <= 0"
-        message: |-
-          {{ "{{#is_alert}}" }}
-          The last nginx config reload for {{ "{{kube_deployment.name}}" }} failed! Are there any bad ingress configs? Does the nginx config have a syntax error?
-          {{ "{{/is_alert}}" }}
-          {{ "{{#is_recovery}}" }}
-          Nginx config reloaded successfully!
-          {{ "{{/is_recovery}}" }}
-        tags: []
-        options:
-          notify_audit: false
-          new_host_delay: 300
-          notify_no_data: false
-          require_full_window: true
-          locked: false
-          thresholds:
-            critical: 0
-            critical_recovery: 1
-      ns-node-not-ready:
-        name: "Node is not Ready"
-        type: service check
-        query: |
-          "kubernetes_state.node.ready".by("host").last(20).count_by_status()
-        message: |-
-          {{ "{{#is_alert}}" }}
-          A Node is not ready!
-          Cluster: {{ "{{kubernetescluster.name}}" }}
-          Host: {{ "{{host.name}}" }}
-          IP: {{ "{{host.ip}}" }}
-          {{ "{{check_message}}" }}
-          {{ "{{/is_alert}}" }}
-          {{ "{{#is_recovery}}" }}
-          Node is now ready.
-          Cluster: {{ "{{kubernetescluster.name}}" }}
-          Host: {{ "{{host.name}}" }}
-          IP: {{ "{{host.ip}}" }}
-          {{ "{{/is_recovery}}" }}
-        tags: []
-        options:
-          notify_audit: false
-          no_data_timeframe: 2
-          new_host_delay: 900
-          notify_no_data: false
+            critical: 1.0
           locked: false
-          thresholds:
-            critical: 20
-            ok: 2
   - type: namespace
     match_annotations:
       - name: astro/admin
