Commit 378a0d3

Merge pull request #9 from kabisa/out-of-the-box-slo-threshold
Out of the box slo threshold
2 parents a5f6e8f + ebf2b8d commit 378a0d3

7 files changed: 70 additions, 49 deletions

README.md

Lines changed: 28 additions & 22 deletions
@@ -5,11 +5,16 @@
 
 # Terraform module for Datadog Apm
 
-This module adds error and latency monitoring for APM data.
-It also includes SLO's for errors and latency but this requires some manual actions first.
-Datadog has a feature to generated metrics based on APM data.
-Unfortunately this is not a feature you can configure with Terraform.
-You'll have to create these metrics by hand unfortunately :(
+This module provides SLO's and other alerts based on APM data.
+Note that it's this module's opinion that you should prefer to alert on SLO burn rates in stead of latency thresholds.
+
+There is also some backwards compatibility if you want to use generated metrics for your SLO's
+
+## OLD SOLUTION FOR SLO's
+
+Before datadog supported latency SLO's we used generated metrics to base our SLO's on.
+Creating the generated metrics is not something you can do with Terraform.
+You'll have to create these metrics by hand if you need/want this.
 
 In Datadog go to APM -> Setup and Configuration -> Generate Metrics -> New Metric
 
@@ -88,7 +93,7 @@ avg(last_10m):100 * (sum:trace.${var.trace_span_name}.errors{tag:xxx}.as_rate()
 
 | variable | default | required | description |
 |------------------------------------|----------|----------|----------------------------------|
-| error_percentage_enabled | True | No | |
+| error_percentage_enabled | False | No | We prefer to alert on SLO's |
 | error_percentage_warning | 0.01 | No | |
 | error_percentage_critical | 0.05 | No | |
 | error_percentage_evaluation_period | last_10m | No | |
@@ -133,7 +138,7 @@ percentile(last_15m):p95:trace.${var.trace_span_name}{${local.latency_filter}} >
 
 | variable | default | required | description |
 |-------------------------------------------|----------|----------|----------------------------------|
-| latency_p95_enabled | True | No | |
+| latency_p95_enabled | False | No | We prefer to alert on SLO's |
 | latency_p95_warning | 0.9 | No | P95 Latency in seconds. |
 | latency_p95_critical | 1.3 | No | P95 Latency warning in seconds. |
 | latency_p95_evaluation_period | last_15m | No | |
@@ -155,15 +160,14 @@ burn_rate(\"${local.latency_slo_id}\").over(\"${var.latency_slo_burn_rate_evalua
 
 | variable | default | required | description |
 |-----------------------------------------------------|------------------------------------------|----------|------------------------------------------------------------------------------------------------------|
-| latency_slo_enabled | False | No | Note that this monitor requires custom metrics to be present. Those can unfortunately not be created with Terraform yet |
+| latency_slo_enabled | True | No | Note that this monitor requires custom metrics to be present. Those can unfortunately not be created with Terraform yet |
 | latency_slo_note | "" | No | |
 | latency_slo_docs | "" | No | |
 | latency_slo_filter_override | "" | No | |
 | latency_slo_warning | None | No | |
 | latency_slo_critical | 99.9 | No | |
+| latency_slo_latency_threshold | 1 | No | SLO latency threshold in seconds for APM traces |
 | latency_slo_alerting_enabled | True | No | |
-| latency_slo_status_ok_filter | ,status:ok | No | Filter string to select the non-errors for the latency SLO, Dont forget to include the comma or (AND or OR) keywords |
-| latency_slo_ms_bucket | 250 | No | We defined several latency buckets with custom metrics based on the APM traces that come in. Our buckets are 100, 250, 500, 1000, 2500, 5000, 10000 |
 | latency_slo_timeframe | 30d | No | |
 | latency_slo_burn_rate_priority | 3 | No | Number from 1 (high) to 5 (low). |
 | latency_slo_burn_rate_warning | None | No | |
@@ -176,6 +180,8 @@ burn_rate(\"${local.latency_slo_id}\").over(\"${var.latency_slo_burn_rate_evalua
 | latency_slo_burn_rate_notification_channel_override | "" | No | |
 | latency_slo_burn_rate_enabled | True | No | |
 | latency_slo_burn_rate_alerting_enabled | True | No | |
+| latency_slo_custom_numerator | "" | No | |
+| latency_slo_custom_denominator | "" | No | |
 
 
 ## Apdex
@@ -242,18 +248,18 @@ Query:
 avg(last_10m):avg:trace.${var.trace_span_name}{tag:xxx} > 0.5
 ```
 
-| variable | default | required | description |
-|---------------------------------------|----------|----------|----------------------------------|
-| latency_enabled | True | No | |
-| latency_warning | 0.3 | No | |
-| latency_critical | 0.5 | No | |
-| latency_evaluation_period | last_10m | No | |
-| latency_note | "" | No | |
-| latency_docs | "" | No | |
-| latency_filter_override | "" | No | |
-| latency_alerting_enabled | True | No | |
-| latency_priority | 3 | No | Number from 1 (high) to 5 (low). |
-| latency_notification_channel_override | "" | No | |
+| variable | default | required | description |
+|---------------------------------------|----------|----------|---------------------------------------------|
+| latency_enabled | False | No | |
+| latency_warning | 0.3 | No | |
+| latency_critical | 0.5 | No | Latency threshold in seconds for APM traces |
+| latency_evaluation_period | last_10m | No | |
+| latency_note | "" | No | |
+| latency_docs | "" | No | |
+| latency_filter_override | "" | No | |
+| latency_alerting_enabled | True | No | |
+| latency_priority | 3 | No | Number from 1 (high) to 5 (low). |
+| latency_notification_channel_override | "" | No | |
 
 
 ## Module Variables

error-percentage-variables.tf

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,7 @@
 variable "error_percentage_enabled" {
-  type    = bool
-  default = true
+  description = "We prefer to alert on SLO's"
+  type        = bool
+  default     = false
 }
 
 variable "error_percentage_warning" {

latency-p95-variables.tf

Lines changed: 3 additions & 2 deletions
@@ -1,6 +1,7 @@
 variable "latency_p95_enabled" {
-  type    = bool
-  default = true
+  description = "We prefer to alert on SLO's"
+  type        = bool
+  default     = false
 }
 
 variable "latency_p95_warning" {

latency-slo-variables.tf

Lines changed: 17 additions & 13 deletions
@@ -1,6 +1,6 @@
 variable "latency_slo_enabled" {
   type        = bool
-  default     = false
+  default     = true
   description = "Note that this monitor requires custom metrics to be present. Those can unfortunately not be created with Terraform yet"
 }
 
@@ -29,23 +29,17 @@ variable "latency_slo_critical" {
   default = 99.9
 }
 
+variable "latency_slo_latency_threshold" {
+  description = "SLO latency threshold in seconds for APM traces"
+  type        = number
+  default     = 1
+}
+
 variable "latency_slo_alerting_enabled" {
   type    = bool
   default = true
 }
 
-variable "latency_slo_status_ok_filter" {
-  type        = string
-  description = "Filter string to select the non-errors for the latency SLO, Dont forget to include the comma or (AND or OR) keywords"
-  default     = ",status:ok"
-}
-
-variable "latency_slo_ms_bucket" {
-  type        = number
-  default     = 250
-  description = "We defined several latency buckets with custom metrics based on the APM traces that come in. Our buckets are 100, 250, 500, 1000, 2500, 5000, 10000"
-}
-
 variable "latency_slo_timeframe" {
   validation {
     condition = contains(["7d", "30d", "90d"], var.latency_slo_timeframe)
@@ -111,3 +105,13 @@ variable "latency_slo_burn_rate_alerting_enabled" {
   type    = bool
   default = true
 }
+
+variable "latency_slo_custom_numerator" {
+  type    = string
+  default = ""
+}
+
+variable "latency_slo_custom_denominator" {
+  type    = string
+  default = ""
+}
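
For callers, the inputs touched in this file might be used roughly as shown here. This is a sketch and not part of the commit: the module source path, the `trace_span_name` value, and the `service:web` tag in the custom queries are placeholders, and any other inputs the module requires are left out. The second block assumes the hand-made `custom_trace.*` generated metrics from the old setup still exist in Datadog.

```hcl
# Sketch only: source path and values are placeholders; other required inputs are omitted.
module "apm_default_slo" {
  source = "./modules/datadog-apm" # hypothetical local path

  trace_span_name = "http.request" # hypothetical span name

  # Count traces faster than 0.5 seconds as "good" instead of the default of 1 second.
  latency_slo_latency_threshold = 0.5
}

# Backwards-compatible variant: keep basing the SLO on generated metrics,
# mirroring the query shape this commit removes from latency-slo.tf.
module "apm_generated_metrics_slo" {
  source = "./modules/datadog-apm" # hypothetical local path

  trace_span_name = "http.request" # hypothetical span name

  latency_slo_custom_numerator   = "sum:custom_trace.lt.250ms.count{service:web,status:ok}.as_count()"
  latency_slo_custom_denominator = "sum:custom_trace.hits{service:web,status:ok}.as_count()"
}
```

Because `latency_slo_status_ok_filter` and `latency_slo_ms_bucket` are removed, a caller that set either of them would presumably have to move that information into the custom numerator and denominator strings.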

latency-slo.tf

Lines changed: 5 additions & 2 deletions
@@ -9,6 +9,9 @@ locals {
   ), "")
   latency_slo_burn_rate_enabled = var.latency_slo_enabled && var.latency_slo_burn_rate_enabled
   latency_slo_id = local.latency_slo_burn_rate_enabled ? datadog_service_level_objective.latency_slo[0].id : ""
+
+  latency_slo_numerator = coalesce(var.latency_slo_custom_numerator, "count(v: v<${var.latency_slo_latency_threshold}):trace.${var.trace_span_name}{${local.latency_slo_filter}}")
+  latency_slo_denominator = coalesce(var.latency_slo_custom_denominator, "count:trace.${var.trace_span_name}{${local.latency_slo_filter}}")
 }
 
 
@@ -25,8 +28,8 @@ resource "datadog_service_level_objective" "latency_slo" {
   }
 
   query {
-    numerator = "sum:custom_trace.lt.${var.latency_slo_ms_bucket}ms.count{${local.latency_slo_filter}${var.latency_slo_status_ok_filter}}.as_count()"
-    denominator = "sum:custom_trace.hits{${local.latency_slo_filter}${var.latency_slo_status_ok_filter}}.as_count()"
+    numerator = local.latency_slo_numerator
+    denominator = local.latency_slo_denominator
   }
 
   tags = local.normalized_tags
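
The fallback in the new locals relies on Terraform's `coalesce`, which returns the first argument that is neither null nor an empty string. The standalone sketch below isolates that behaviour; it is an illustration, not code from the module, and the span name and filter string are made-up stand-ins for what the module computes itself.

```hcl
# Standalone sketch: place in an empty directory and inspect the output with `terraform plan`.
variable "latency_slo_custom_numerator" {
  type    = string
  default = "" # empty string means "use the built-in query"
}

variable "latency_slo_latency_threshold" {
  type    = number
  default = 1
}

variable "trace_span_name" {
  type    = string
  default = "http.request" # hypothetical span name
}

locals {
  # Hypothetical filter; the real module derives local.latency_slo_filter itself.
  latency_slo_filter = "env:prd,service:web"

  # coalesce skips the empty default and falls through to the built-in query.
  latency_slo_numerator = coalesce(
    var.latency_slo_custom_numerator,
    "count(v: v<${var.latency_slo_latency_threshold}):trace.${var.trace_span_name}{${local.latency_slo_filter}}"
  )
}

output "latency_slo_numerator" {
  value = local.latency_slo_numerator
}
```

With the defaults above, the output should be the built-in query with the threshold and span name filled in. Passing `-var 'latency_slo_custom_numerator=...'` with any non-empty string should return that string unchanged.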

latency-variables.tf

Lines changed: 4 additions & 3 deletions
@@ -1,6 +1,6 @@
 variable "latency_enabled" {
   type    = bool
-  default = true
+  default = false
 }
 
 variable "latency_warning" {
@@ -9,8 +9,9 @@ variable "latency_warning" {
 }
 
 variable "latency_critical" {
-  type    = number
-  default = 0.5
+  description = "Latency threshold in seconds for APM traces"
+  type        = number
+  default     = 0.5
 }
 
 variable "latency_evaluation_period" {
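
This commit flips the defaults of `latency_enabled`, `latency_p95_enabled`, and `error_percentage_enabled` from `true` to `false`. Callers who relied on the old defaults would presumably need to opt back in explicitly. A minimal sketch, assuming a placeholder module source and span name and omitting any other required inputs:

```hcl
# Sketch only: source path and span name are placeholders.
module "apm_with_threshold_monitors" {
  source = "./modules/datadog-apm" # hypothetical local path

  trace_span_name = "http.request" # hypothetical span name

  # Re-enable the threshold monitors that are now off by default.
  latency_enabled          = true
  latency_p95_enabled      = true
  error_percentage_enabled = true
}
```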

module_description.md

Lines changed: 10 additions & 5 deletions
@@ -1,8 +1,13 @@
-This module adds error and latency monitoring for APM data.
-It also includes SLO's for errors and latency but this requires some manual actions first.
-Datadog has a feature to generated metrics based on APM data.
-Unfortunately this is not a feature you can configure with Terraform.
-You'll have to create these metrics by hand unfortunately :(
+This module provides SLO's and other alerts based on APM data.
+Note that it's this module's opinion that you should prefer to alert on SLO burn rates in stead of latency thresholds.
+
+There is also some backwards compatibility if you want to use generated metrics for your SLO's
+
+## OLD SOLUTION FOR SLO's
+
+Before datadog supported latency SLO's we used generated metrics to base our SLO's on.
+Creating the generated metrics is not something you can do with Terraform.
+You'll have to create these metrics by hand if you need/want this.
 
 In Datadog go to APM -> Setup and Configuration -> Generate Metrics -> New Metric
 