Retry runner checks 3 times before failing #38

zxiiro · 2025-09-04T12:25:40Z

PyTorch HUD has recently been responding slowly on occasion. Let's add a retry so that we avoid alerting because of HUD issues. This change retries each check up to 3 times, waiting 1 minute between attempts, before declaring an alert.

Signed-off-by: Thanh Ha <[email protected]>
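For readers unfamiliar with the Datadog provider, the retry described above maps onto a `retry` block inside `options_list` of a `datadog_synthetics_test` resource. The following is a minimal sketch, not the repository's actual configuration: the resource name, URL, location, assertion, and `tick_every` value are all hypothetical placeholders, and only the `retry` block mirrors the values in this PR's plan.

```hcl
# Illustrative sketch only. Everything except the retry block is an
# assumption made so the example is complete; it is not taken from this repo.
resource "datadog_synthetics_test" "example_queue_check" {
  name      = "Example Queue Check" # hypothetical name
  type      = "api"
  subtype   = "http"
  status    = "live"
  locations = ["aws:us-east-1"] # hypothetical location

  request_definition {
    method = "GET"
    url    = "https://example.com/queue-status" # placeholder URL
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  options_list {
    tick_every = 900 # assumed check frequency in seconds; not stated in this PR

    # Values from the plan in this PR.
    retry {
      count    = 3     # retry a failed run up to 3 times
      interval = 60000 # wait between retries, in milliseconds (1 minute)
    }
  }
}
```

With these values a persistently failing check should raise an alert roughly 3 minutes later than it would without retries, assuming each retry waits the full interval.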

github-actions · 2025-09-04T12:26:30Z

OpenTofu plan for prod

Plan: 0 to add, 7 to change, 0 to destroy.

OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
!~  update in-place

OpenTofu will perform the following actions:

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-amd will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-amd" {
        id               = "yt8-7zy-xpj"
        name             = "GHA Runner Queue Check - AMD Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      options_list {
#            (16 unchanged attributes hidden)

+           retry {
+               count    = 3
+               interval = 60000
            }
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-ibm will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-ibm" {
        id               = "sc6-zip-2n9"
        name             = "GHA Runner Queue Check - IBM Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      options_list {
#            (16 unchanged attributes hidden)

+           retry {
+               count    = 3
+               interval = 60000
            }
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-intel will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-intel" {
        id               = "67g-icy-6mh"
        name             = "GHA Runner Queue Check - Intel Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      options_list {
#            (16 unchanged attributes hidden)

+           retry {
+               count    = 3
+               interval = 60000
            }
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-lf will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-lf" {
        id               = "p69-6vj-54b"
        name             = "GHA Runner Queue Check - Linux Foundation Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      options_list {
#            (16 unchanged attributes hidden)

+           retry {
+               count    = 3
+               interval = 60000
            }
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-meta will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta" {
        id               = "nnz-icu-8qk"
        name             = "GHA Runner Queue Check - Meta Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      options_list {
#            (16 unchanged attributes hidden)

+           retry {
+               count    = 3
+               interval = 60000
            }
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-meta-h100 will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta-h100" {
        id               = "hpi-psi-z8i"
        name             = "GHA Runner Queue Check - Meta Runners - AWS H100"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      options_list {
#            (16 unchanged attributes hidden)

+           retry {
+               count    = 3
+               interval = 60000
            }
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-nvidia will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-nvidia" {
        id               = "sxd-d72-36u"
        name             = "GHA Runner Queue Check - Nvidia Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      options_list {
#            (16 unchanged attributes hidden)

+           retry {
+               count    = 3
+               interval = 60000
            }
        }

#        (2 unchanged blocks hidden)
    }

Plan: 0 to add, 7 to change, 0 to destroy.

✅ Plan applied in Tofu Apply #37

jordanconway · 2025-09-04T12:34:30Z

I was also thinking: in an effort to reduce flapping, can we check how long a queue has been in a long waiting state?

zxiiro · 2025-09-04T12:41:20Z

> I was also thinking, in an effort to reduce flapping can we check how long a queue has been in a long waiting state?

That's what the main check is already doing, isn't it? It checks for queues waiting longer than 3 hours.

zxiiro requested a review from a team as a code owner September 4, 2025 12:25

zxiiro temporarily deployed to prod September 4, 2025 12:25 — with GitHub Actions Inactive

jordanconway approved these changes Sep 4, 2025

zxiiro merged commit 7392723 into main Sep 4, 2025
3 checks passed

zxiiro deleted the zxiiro/runner-alerts branch September 4, 2025 12:41
