Contributor

@zxiiro zxiiro commented Jul 31, 2025

Uses the PyTorch HUD API to check for long runner queues (average wait over 1 hour) and reports to Slack when any queue has jobs waiting unusually long for a runner.

github-actions bot commented Jul 31, 2025

OpenTofu plan for prod

Plan: 1 to add, 0 to change, 0 to destroy.
OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
+   create

OpenTofu will perform the following actions:

  # datadog_synthetics_test.pytorch-gha-runners-queue-check will be created
+   resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check" {
+       id         = (known after apply)
+       locations  = [
+           "aws:us-west-2",
        ]
+       message    = <<-EOT
            Detected GitHub Runner Queue has jobs waiting unusually long for runners.
            
              Check https://hud.pytorch.org/metrics to determine which ones.
            
              @slack-pytorch-infra-alerts
        EOT
+       monitor_id = (known after apply)
+       name       = "GHA Runner Queue Check"
+       status     = "live"
+       tags       = [
+           "env:project",
+           "project:pytorch",
+           "service:gha-runners",
        ]
+       type       = "api"

+       assertion {
+           operator = "is"
+           target   = "200"
+           type     = "statusCode"
        }
+       assertion {
+           operator = "validatesJSONPath"
+           type     = "body"

+           targetjsonpath {
+               elementsoperator = "everyElementMatches"
+               jsonpath         = "$[?(@.avg_queue_s > 3600)].avg_queue_s"
+               operator         = "is"
            }
        }

+       options_list {
+           http_version        = "any"
+           min_location_failed = 1
+           tick_every          = 900
        }

+       request_definition {
+           method = "GET"
+           url    = "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

✅ Plan applied in Tofu Apply #17
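
For readers unfamiliar with JSONPath filters, here is a minimal Python sketch of what the expression `$[?(@.avg_queue_s > 3600)].avg_queue_s` in the plan above selects. It assumes the endpoint returns a JSON array of objects that each carry an `avg_queue_s` field. The `machine_type` field and the sample values are hypothetical, and the sketch does not model how Datadog evaluates the assertion itself.

```python
THRESHOLD_S = 3600  # one hour, matching the `> 3600` filter in the plan


def long_queues(rows, threshold_s=THRESHOLD_S):
    """Return the avg_queue_s values that exceed threshold_s.

    Mirrors the JSONPath filter: keep elements whose avg_queue_s is
    strictly greater than the threshold, then project that field.
    """
    return [
        row["avg_queue_s"]
        for row in rows
        if row.get("avg_queue_s", 0) > threshold_s
    ]


# Hypothetical payload; the real response shape is an assumption.
sample = [
    {"machine_type": "linux.2xlarge", "avg_queue_s": 120},
    {"machine_type": "linux.gpu", "avg_queue_s": 5400},
]

print(long_queues(sample))  # [5400]
```

A queue sitting at exactly 3600 seconds is not selected, because the comparison is strict.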

Uses the PyTorch HUD API to check for long runner queues (>1h)
and report to Slack if any runner queues have jobs waiting unusually
long for a runner.

Signed-off-by: Thanh Ha <[email protected]>
@zxiiro zxiiro force-pushed the zxiiro/queue-alerts branch from b154123 to 4927f5f Compare July 31, 2025 18:26
@zxiiro zxiiro marked this pull request as ready for review July 31, 2025 18:29
@zxiiro zxiiro requested a review from a team as a code owner July 31, 2025 18:29
@zxiiro zxiiro requested a review from Copilot July 31, 2025 18:29
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds monitoring for GitHub Actions runner queue health by creating a Datadog synthetic test that checks for jobs waiting longer than 1 hour for available runners.

  • Introduces a new synthetic API test to monitor GitHub runner queue times
  • Configures the test to query PyTorch HUD API and alert via Slack when queues exceed 1 hour
  • Sets up automated monitoring with 15-minute intervals to catch runner capacity issues
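
The request URL in the plan ends in `parameters=%7B%7D`, which is the URL-encoded form of an empty JSON object (`{}`). As a small sketch of how that query string can be reproduced, assuming the `parameters` argument is a JSON-serialized object (the helper name `build_url` is hypothetical):

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the plan's request_definition.
BASE = "https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label"


def build_url(params=None):
    """Build the query URL with `parameters` as URL-encoded JSON."""
    query = urlencode({"parameters": json.dumps(params or {})})
    return f"{BASE}?{query}"


print(build_url())
# https://hud.pytorch.org/api/clickhouse/queued_jobs_by_label?parameters=%7B%7D
```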

@zxiiro zxiiro merged commit 63ee761 into main Jul 31, 2025
2 checks passed
@zxiiro zxiiro deleted the zxiiro/queue-alerts branch July 31, 2025 18:43