Conversation

@zxiiro (Contributor) commented Jul 30, 2025

This adds a monitor for the ALI ValidationException Detected CloudWatch alarm.

@Copilot Copilot AI left a comment

Pull Request Overview

Adds a new Datadog monitor to detect validation exceptions in the ALI (Auto-scaling Lambda Infrastructure) system by monitoring CloudWatch alarm events from Amazon SNS.

  • Adds a new Datadog event-based monitor that triggers when ValidationException events are detected
  • Includes detailed messaging with troubleshooting guidance and notification routing
  • Configured to alert immediately when any ValidationException occurs in a 5-minute window
Comments suppressed due to low confidence (2)

monitors.tf:62

  • [nitpick] The resource name 'ALI_ValidationException_Detected' mixes upper case and underscores. Consider snake_case, such as 'ali_validation_exception_detected', to match Terraform naming conventions.
resource "datadog_monitor" "ALI_ValidationException_Detected" {

monitors.tf:68

  • [nitpick] The monitor name should be more descriptive and include the system context. Consider using a name like 'ALI Auto-scaling ValidationException Detected' to provide clearer context about what ALI refers to.
  name = "ALI ValidationException Detected"

github-actions bot commented Jul 30, 2025

OpenTofu plan for prod

Plan: 1 to add, 0 to change, 0 to destroy.
OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create

OpenTofu will perform the following actions:

  # datadog_monitor.ALI_ValidationException_Detected will be created
+   resource "datadog_monitor" "ALI_ValidationException_Detected" {
+       evaluation_delay    = (known after apply)
+       id                  = (known after apply)
+       include_tags        = false
+       message             = <<-EOT
            # ValidationException
            
            We've detected that a ValidationException has happened in the ALI. This could
            mean the ALI is having issues scaling up runners. Perhaps test-infra release
            was recently updated which may affect this.
            
            ## Action
            
            Review scale-up lambda logs in CloudWatch to triage issue and take any
            necessary action. Revert test-infra release to last known working version if
            necessary.
            
            @slack-PyTorch-pytorch-infra-alerts
            @slack-Linux_Foundation-pytorch-alerts
            @webhook-lf-incident-io
        EOT
+       name                = "ALI ValidationException Detected"
+       new_host_delay      = 300
+       notify_no_data      = false
+       query               = "events(\"source:amazon_sns @title:\\\"ALI ValidationException Detected\\\"\").rollup(\"count\").last(\"5m\") > 0"
+       require_full_window = false
+       tags                = (known after apply)
+       type                = "event-v2 alert"

+       monitor_thresholds {
+           critical = "0"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

✅ Plan applied in Tofu Apply #12
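
For reference, a resource block along the following lines would yield the plan above. This is a sketch reconstructed from the plan output, not the actual contents of monitors.tf, which this thread does not show. It assumes that `include_tags` and `require_full_window` are set explicitly in the source (a guess), and it leaves out `tags`, which the plan reports only as known after apply.

```hcl
# Sketch reconstructed from the plan output; the real monitors.tf may differ.
resource "datadog_monitor" "ALI_ValidationException_Detected" {
  name = "ALI ValidationException Detected"
  type = "event-v2 alert"

  # Alert when at least one matching SNS event is counted in the last 5 minutes.
  query = "events(\"source:amazon_sns @title:\\\"ALI ValidationException Detected\\\"\").rollup(\"count\").last(\"5m\") > 0"

  # The <<- heredoc form strips the common leading indentation from the body.
  message = <<-EOT
    # ValidationException

    We've detected that a ValidationException has happened in the ALI. This could
    mean the ALI is having issues scaling up runners. Perhaps test-infra release
    was recently updated which may affect this.

    ## Action

    Review scale-up lambda logs in CloudWatch to triage issue and take any
    necessary action. Revert test-infra release to last known working version if
    necessary.

    @slack-PyTorch-pytorch-infra-alerts
    @slack-Linux_Foundation-pytorch-alerts
    @webhook-lf-incident-io
  EOT

  # Assumed to be set explicitly; the plan shows both as false.
  include_tags        = false
  require_full_window = false

  monitor_thresholds {
    critical = "0"
  }
}
```

The query compares the event count with the critical threshold of 0, so a single matching event within the 5-minute window is enough to trigger the monitor.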

datadog-pytorch-fdn bot commented Jul 30, 2025

No data reported at this time.
This comment will be updated automatically if new data arrives.
Commit SHA: a766ed1

This adds a monitor for the ALI ValidationException Detected
CloudWatch alarm.

Signed-off-by: Thanh Ha <[email protected]>
@zxiiro force-pushed the zxiiro/validation-exception branch from d44ccc4 to a766ed1 on July 30, 2025 19:17
@zxiiro merged commit 67181a0 into main on Jul 30, 2025
2 checks passed
@zxiiro deleted the zxiiro/validation-exception branch on July 30, 2025 19:56

2 participants