Description
Issue
The following alert in apache-airflow-mixin/alerts/alerts.libsonnet:
alert: 'ApacheAirflowDAGFailures',
expr: |||
increase(airflow_dagrun_duration_failed_count[5m]) > %(alertsCriticalFailedDAGs)s
||| % $._config,
'for': '1m',
labels: {
severity: 'critical',
},
annotations: {
summary: 'There have been DAG failures detected.',
description: |||
The number of DAG failures seen is {{ printf "%%.0f" $value }} over the last 1m for {{ $labels.instance }} - {{ $labels.dag_id }} which is above the threshold of %(alertsCriticalFailedDAGs)s.
||| % $._config,
},
cannot populate the variable labels.dag_id in the description, because the metric airflow_dagrun_duration_failed_count is generic. It is not reported per DAG, so that label does not exist.
Potential solution
Replace the metric airflow_dagrun_duration_failed_count with airflow_dagrun_failed_count. The latter is available per DAG, making it suitable for extracting the dag_id from the labels and including it in the alert description. So we have:
increase(airflow_dagrun_failed_count[5m]) > 0
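However, this metric is only created for a DAG once it fails for the first time. The increase function needs at least two samples inside the range to compute a delta, and a series that first appears with the value 1 has no earlier sample to compare against, so the very first failure of a DAG would not be detected.
To handle the first-failure case, we can extend the query with an additional condition. It matches DAGs whose per-DAG counter is exactly 1 while the generic failure counter has increased on the same instance and job within the last 5 minutes: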
(airflow_dagrun_failed_count == 1 and
on (instance, job) (increase(airflow_dagrun_duration_failed_count[5m]) > 0))
And the final expression would be:
increase(airflow_dagrun_failed_count[5m]) > 0
or (airflow_dagrun_failed_count == 1 and
on (instance, job) (increase(airflow_dagrun_duration_failed_count[5m]) > 0))
Minor issue
The description is also wrong:
The number of DAG failures seen is ... over the last 1m
It should say "over the last 5m", to match the [5m] range used in the expression.
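Putting the proposed expression and the corrected description together, the updated rule might look like the sketch below. Several things in it are assumptions rather than part of the mixin: the wrapping object and the `rule` field name are hypothetical and exist only so the fragment evaluates on its own, the `alertsCriticalFailedDAGs: 0` value is a placeholder, and the per-DAG term keeps the existing `%(alertsCriticalFailedDAGs)s` threshold where the proposal above writes a literal `0`.

```jsonnet
{
  // Placeholder value (assumption) so this fragment evaluates standalone;
  // the mixin supplies its own alertsCriticalFailedDAGs.
  _config:: { alertsCriticalFailedDAGs: 0 },

  // 'rule' is a hypothetical field name used only for this sketch.
  rule: {
    alert: 'ApacheAirflowDAGFailures',
    expr: |||
      increase(airflow_dagrun_failed_count[5m]) > %(alertsCriticalFailedDAGs)s
      or
      (
        airflow_dagrun_failed_count == 1
        and on (instance, job)
        (increase(airflow_dagrun_duration_failed_count[5m]) > 0)
      )
    ||| % $._config,
    'for': '1m',
    labels: {
      severity: 'critical',
    },
    annotations: {
      summary: 'There have been DAG failures detected.',
      // Window wording changed from 1m to 5m to match the [5m] range above.
      description: |||
        The number of DAG failures seen is {{ printf "%%.0f" $value }} over the last 5m for {{ $labels.instance }} - {{ $labels.dag_id }} which is above the threshold of %(alertsCriticalFailedDAGs)s.
      ||| % $._config,
    },
  },
}
```

Because `and` keeps the sample from its left-hand side, an alert raised through the first-failure branch would report a `$value` of 1 (the counter itself) rather than an increase over the window.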