Fix Airflow DAG Failure Alert: Use DAG-Specific Metric to Enable dag_id in Alert Description #1485

@sim500

Issue

The following alert in apache-airflow-mixin/alerts/alerts.libsonnet:

alert: 'ApacheAirflowDAGFailures',
expr: |||
  increase(airflow_dagrun_duration_failed_count[5m]) > %(alertsCriticalFailedDAGs)s
||| % $._config,
'for': '1m',
labels: {
  severity: 'critical',
},
annotations: {
  summary: 'There have been DAG failures detected.',
  description: |||
    The number of DAG failures seen is {{ printf "%%.0f" $value }} over the last 1m for {{ $labels.instance }} - {{ $labels.dag_id }} which is above the threshold of %(alertsCriticalFailedDAGs)s.
  ||| % $._config,
},

cannot populate {{ $labels.dag_id }} in the description, because the metric airflow_dagrun_duration_failed_count is generic. It is not broken down per DAG, so that label does not exist on it.

Potential solution

Replace the metric airflow_dagrun_duration_failed_count with airflow_dagrun_failed_count. The latter is reported per DAG, so dag_id is available in its labels and can be included in the alert description. The expression becomes:

increase(airflow_dagrun_failed_count[5m]) > 0 

However, this metric is only created for a DAG once it fails for the first time. The series therefore first appears with a value of 1 and has no earlier sample to compare against, so the increase function would not detect the very first failure of a DAG.

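To handle the first-failure case, we can extend the query with an additional condition that matches a per-DAG counter equal to 1 while the generic failure counter has increased on the same instance and job:

(airflow_dagrun_failed_count == 1
  and on (instance, job)
  (increase(airflow_dagrun_duration_failed_count[5m]) > 0))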

And the final expression would be:

increase(airflow_dagrun_failed_count[5m]) > 0
or
(airflow_dagrun_failed_count == 1
  and on (instance, job)
  (increase(airflow_dagrun_duration_failed_count[5m]) > 0))

Minor issue

The description also states the wrong time window:

The number of DAG failures seen is ... over the last 1m

It should say "over the last 5m", matching the [5m] range used in the expression.
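
For reference, here is a rough sketch of how the rule in alerts.libsonnet might look with both changes applied. It is a fragment meant to replace the existing rule object, not a standalone file, and it has not been tested against a live Prometheus. It keeps the literal 0 threshold from the expression proposed above; substituting %(alertsCriticalFailedDAGs)s back in, as the original rule did, is an assumption that would need checking against the first-failure branch.

```jsonnet
{
  alert: 'ApacheAirflowDAGFailures',
  // Per-DAG metric, so dag_id is present in $labels.
  // The second branch covers the first failure of a DAG, where the
  // per-DAG series has only just appeared and increase() sees no change.
  expr: |||
    increase(airflow_dagrun_failed_count[5m]) > 0
    or
    (
      airflow_dagrun_failed_count == 1
      and on (instance, job)
      (increase(airflow_dagrun_duration_failed_count[5m]) > 0)
    )
  |||,
  'for': '1m',
  labels: {
    severity: 'critical',
  },
  annotations: {
    summary: 'There have been DAG failures detected.',
    // Window corrected from 1m to 5m to match the range in expr.
    description: |||
      The number of DAG failures seen is {{ printf "%%.0f" $value }} over the last 5m for {{ $labels.instance }} - {{ $labels.dag_id }} which is above the threshold of %(alertsCriticalFailedDAGs)s.
    ||| % $._config,
  },
}
```

Because `and` keeps the left-hand side, the first-failure branch reports the raw counter value (1) as $value, which still reads correctly in the description.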
