Contributor

@zxiiro zxiiro commented Jul 31, 2025

We originally set this up to try the anomalies feature, but it alerts almost every day, which makes this monitor not all that useful. Let's just remove it to reduce noise. Our monitor for the deadletter queue should be sufficient for what we wanted to get alerted on.

Signed-off-by: Thanh Ha <[email protected]>
@zxiiro zxiiro requested a review from Copilot July 31, 2025 14:00
@zxiiro zxiiro requested a review from a team as a code owner July 31, 2025 14:00

@Copilot Copilot AI left a comment

Pull Request Overview

This PR removes an anomaly detection monitor for AWS SQS queues that was alerting almost every day, which made it too noisy to be useful. The change reduces monitoring noise while keeping coverage through the existing deadletter queue monitor.

Key Changes

  • Removed the all_queues_anomaly Datadog monitor resource from the Terraform configuration
  • Eliminated the near-daily alerts from this monitor while keeping the existing deadletter queue monitor in place
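
For context on the remaining coverage, a threshold monitor on a deadletter queue might look roughly like the sketch below. The existing deadletter monitor's definition is not shown in this PR, so the resource name, monitor name, `queuename` value, evaluation window, and threshold here are all hypothetical; only the metric name, the `project` tag, and the Slack handle are taken from the plan output in this conversation.

```hcl
# Hypothetical sketch: not the repository's actual deadletter monitor.
resource "datadog_monitor" "example_deadletter_queue" {
  name = "Deadletter queue {{queuename.name}} has visible messages"
  type = "metric alert"

  message = <<-EOT
    Messages are visible in deadletter queue `{{queuename.name}}`.
    @slack-PyTorch-pytorch-infra-alerts
  EOT

  # The metric and project tag come from the plan output; the queuename
  # value is a placeholder for whatever the real deadletter queue is called.
  query = "max(last_15m):max:aws.sqs.approximate_number_of_messages_visible{project:pytorch/pytorch,queuename:example-dead-letter-queue} by {queuename} > 0"

  # The critical threshold must match the value compared against in the query.
  monitor_thresholds {
    critical = "0"
  }
}
```

A fixed threshold like this fires only when messages actually land in the deadletter queue, rather than whenever queue depth deviates from its recent pattern.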

github-actions bot commented Jul 31, 2025

OpenTofu plan for prod

Plan: 0 to add, 0 to change, 1 to destroy.
OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
-   destroy

OpenTofu will perform the following actions:

  # datadog_monitor.all_queues_anomaly will be destroyed
  # (because datadog_monitor.all_queues_anomaly is not in configuration)
-   resource "datadog_monitor" "all_queues_anomaly" {
-       evaluation_delay     = 900 -> null
-       id                   = "9264779" -> null
-       include_tags         = true -> null
-       locked               = false -> null
-       message              = <<-EOT
            The number of visible messages in `{{queuename.name}}` is outside of the typical range.
            @slack-PyTorch-pytorch-infra-alerts
            @slack-Linux_Foundation-pytorch-alerts
            @webhook-lf-incident-io
        EOT -> null
-       name                 = "Queue **{{queuename.name}}** has a high number of visible messages" -> null
-       new_group_delay      = 0 -> null
-       new_host_delay       = 300 -> null
-       no_data_timeframe    = 0 -> null
-       notify_audit         = false -> null
-       notify_by            = [] -> null
-       notify_no_data       = false -> null
-       priority             = "5" -> null
-       query                = <<-EOT
            avg(last_1w):
            anomalies(
              avg:aws.sqs.approximate_number_of_messages_visible{project:pytorch/pytorch} by {queuename,region},
              'basic', 2, direction='both', interval=3600, alert_window='last_1d', count_default_zero='true'
            ) >= 1
        EOT -> null
-       renotify_interval    = 0 -> null
-       renotify_occurrences = 0 -> null
-       require_full_window  = false -> null
-       tags                 = [] -> null
-       timeout_h            = 0 -> null
-       type                 = "query alert" -> null

-       monitor_threshold_windows {
-           recovery_window = "last_15m" -> null
-           trigger_window  = "last_1d" -> null
        }

-       monitor_thresholds {
-           critical          = "1" -> null
-           critical_recovery = "0" -> null
-           warning           = "0.9" -> null
        }
    }

Plan: 0 to add, 0 to change, 1 to destroy.

✅ Plan applied in Tofu Apply #16
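
For readers who want to see what was deleted, the resource can be approximately reconstructed from the plan output above. This is a sketch rebuilt from the planned values, not the original source file: the attribute ordering is a guess, and attributes that appear to be provider defaults or computed values (`id`, `locked`, `new_group_delay`, `new_host_delay`, `no_data_timeframe`, `notify_audit`, `notify_by`, `notify_no_data`, the `renotify_*` settings, `tags`, `timeout_h`) are left out on the assumption that the original configuration did not set them explicitly.

```hcl
# Reconstructed from the plan output; the deleted source may have differed.
resource "datadog_monitor" "all_queues_anomaly" {
  name     = "Queue **{{queuename.name}}** has a high number of visible messages"
  type     = "query alert"
  priority = "5"

  message = <<-EOT
    The number of visible messages in `{{queuename.name}}` is outside of the typical range.
    @slack-PyTorch-pytorch-infra-alerts
    @slack-Linux_Foundation-pytorch-alerts
    @webhook-lf-incident-io
  EOT

  # Anomaly query grouped by queue name and region, as shown in the plan.
  query = <<-EOT
    avg(last_1w):
    anomalies(
      avg:aws.sqs.approximate_number_of_messages_visible{project:pytorch/pytorch} by {queuename,region},
      'basic', 2, direction='both', interval=3600, alert_window='last_1d', count_default_zero='true'
    ) >= 1
  EOT

  evaluation_delay    = 900
  include_tags        = true
  require_full_window = false

  monitor_threshold_windows {
    recovery_window = "last_15m"
    trigger_window  = "last_1d"
  }

  monitor_thresholds {
    critical          = "1"
    critical_recovery = "0"
    warning           = "0.9"
  }
}
```

Deleting a block like this from the configuration is what produces the single destroy action in the plan, which reports the resource as "not in configuration".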

@zxiiro zxiiro merged commit bb1df57 into main Jul 31, 2025
2 checks passed
@zxiiro zxiiro deleted the zxiiro/queue-alerts branch July 31, 2025 14:22