Revert "Do not fail check if PyTorch HUD API is down (#37)" #39

zxiiro · 2025-09-04T14:40:51Z

This change causes more flapping than it was intended to prevent. I think change 7392723 does a better job of handling the flapping, so let's revert this one.

This reverts commit a928f56.
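For reviewers, the behavioural difference between the reverted guard and the restored assertion can be sketched roughly as follows. This is a minimal stand-alone illustration, not the deployed code: the function `runCheck`, its parameter names, and the `tolerateApiFailure` flag are hypothetical, and the Datadog-provided `dd` object is replaced by plain arguments so the snippet runs in Node.

```javascript
// Hypothetical stand-alone model of the queue check; in Datadog Synthetics the
// response comes from `dd.response` and the result is asserted with `dd.expect`.
function runCheck(response, options) {
  const { statusCode, body } = response;
  const { filter, thresholdSeconds, tolerateApiFailure } = options;

  if (statusCode !== 200) {
    // With the guard from #37 a non-200 response passed the check;
    // after this revert a non-200 response fails it again.
    return tolerateApiFailure;
  }

  const highQueueItems = JSON.parse(body)
    .filter(item => item.machine_type.includes(filter) && item.avg_queue_s > thresholdSeconds);

  // The check passes only when no matching machine type exceeds the threshold.
  return highQueueItems.length === 0;
}

// Illustrative inputs only; the machine type name below is made up for the example.
const down = { statusCode: 503, body: '' };
const busy = {
  statusCode: 200,
  body: JSON.stringify([{ machine_type: 'linux.rocm.example', avg_queue_s: 20000 }]),
};
const amd = { filter: '.rocm.', thresholdSeconds: 14400 };

console.log(runCheck(down, { ...amd, tolerateApiFailure: true }));
console.log(runCheck(down, { ...amd, tolerateApiFailure: false }));
console.log(runCheck(busy, { ...amd, tolerateApiFailure: false }));
```

With the flag set to `false` the function models the post-revert state, where an HUD API outage surfaces as a failed check instead of being silently passed.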


github-actions · 2025-09-04T14:41:42Z

OpenTofu plan for prod

Plan: 0 to add, 7 to change, 0 to destroy.

OpenTofu used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
!~  update in-place

OpenTofu will perform the following actions:

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-amd will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-amd" {
        id               = "yt8-7zy-xpj"
        name             = "GHA Runner Queue Check - AMD Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
              - if (dd.response.statusCode !== 200) {
              -   // We do not want to fail due to hud.pytorch.org API failure.
              -   console.log('Status code is not 200, stopping execution');
              -   dd.expect(true).to.equal(true);
              - }
              - else {
              -   const MACHINE_TYPE_FILTER = '.rocm.';
              -   const jsonData = dd.response.body;
              -   const parsedData = JSON.parse(jsonData);
              + dd.expect(dd.response.statusCode).to.equal(200);
                
              -   const highQueueItems = parsedData
              -     .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 14400)
              -     .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
              + const MACHINE_TYPE_FILTER = '.rocm.';
              + const jsonData = dd.response.body;
              + const parsedData = JSON.parse(jsonData);
                
              -   if (highQueueItems.length > 0) {
              -     const machineDetails = highQueueItems
              -       .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              -       .join(', ');
              -     const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              -     console.error(message);
              -   }
              + const highQueueItems = parsedData
              +   .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 14400)
              +   .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                
              -   dd.expect(highQueueItems.length > 0).to.be.false;
              + if (highQueueItems.length > 0) {
              +   const machineDetails = highQueueItems
              +     .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              +     .join(', ');
              +   const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              +   console.error(message);
                }
              + 
              + dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-ibm will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-ibm" {
        id               = "sc6-zip-2n9"
        name             = "GHA Runner Queue Check - IBM Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
              - if (dd.response.statusCode !== 200) {
              -   // We do not want to fail due to hud.pytorch.org API failure.
              -   console.log('Status code is not 200, stopping execution');
              -   dd.expect(true).to.equal(true);
              - }
              - else {
              -   const MACHINE_TYPE_FILTER = '.s390x';
              -   const jsonData = dd.response.body;
              -   const parsedData = JSON.parse(jsonData);
              + dd.expect(dd.response.statusCode).to.equal(200);
                
              -   const highQueueItems = parsedData
              -     .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
              -     .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
              + const MACHINE_TYPE_FILTER = '.s390x';
              + const jsonData = dd.response.body;
              + const parsedData = JSON.parse(jsonData);
                
              -   if (highQueueItems.length > 0) {
              -     const machineDetails = highQueueItems
              -       .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              -       .join(', ');
              -     const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              -     console.error(message);
              -   }
              + const highQueueItems = parsedData
              +   .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
              +   .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                
              -   dd.expect(highQueueItems.length > 0).to.be.false;
              + if (highQueueItems.length > 0) {
              +   const machineDetails = highQueueItems
              +     .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              +     .join(', ');
              +   const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              +   console.error(message);
                }
              + 
              + dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-intel will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-intel" {
        id               = "67g-icy-6mh"
        name             = "GHA Runner Queue Check - Intel Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
              - if (dd.response.statusCode !== 200) {
              -   // We do not want to fail due to hud.pytorch.org API failure.
              -   console.log('Status code is not 200, stopping execution');
              -   dd.expect(true).to.equal(true);
              - }
              - else {
              -   const MACHINE_TYPE_FILTER = '.idc.';
              -   const jsonData = dd.response.body;
              -   const parsedData = JSON.parse(jsonData);
              + dd.expect(dd.response.statusCode).to.equal(200);
                
              -   const highQueueItems = parsedData
              -     .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
              -     .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
              + const MACHINE_TYPE_FILTER = '.idc.';
              + const jsonData = dd.response.body;
              + const parsedData = JSON.parse(jsonData);
                
              -   if (highQueueItems.length > 0) {
              -     const machineDetails = highQueueItems
              -       .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              -       .join(', ');
              -     const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              -     console.error(message);
              -   }
              + const highQueueItems = parsedData
              +   .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
              +   .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                
              -   dd.expect(highQueueItems.length > 0).to.be.false;
              + if (highQueueItems.length > 0) {
              +   const machineDetails = highQueueItems
              +     .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              +     .join(', ');
              +   const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              +   console.error(message);
                }
              + 
              + dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-lf will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-lf" {
        id               = "p69-6vj-54b"
        name             = "GHA Runner Queue Check - Linux Foundation Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
              - if (dd.response.statusCode !== 200) {
              -   // We do not want to fail due to hud.pytorch.org API failure.
              -   console.log('Status code is not 200, stopping execution');
              -   dd.expect(true).to.equal(true);
              - }
              - else {
              -   const MACHINE_TYPE_FILTER = 'lf.';
              -   const jsonData = dd.response.body;
              -   const parsedData = JSON.parse(jsonData);
              + dd.expect(dd.response.statusCode).to.equal(200);
                
              -   const highQueueItems = parsedData
              -     .filter(item => item.machine_type.startsWith(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
              -     .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
              + const MACHINE_TYPE_FILTER = 'lf.';
              + const jsonData = dd.response.body;
              + const parsedData = JSON.parse(jsonData);
                
              -   if (highQueueItems.length > 0) {
              -     const machineDetails = highQueueItems
              -       .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              -       .join(', ');
              -     const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              -     console.error(message);
              -   }
              + const highQueueItems = parsedData
              +   .filter(item => item.machine_type.startsWith(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
              +   .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                
              -   dd.expect(highQueueItems.length > 0).to.be.false;
              + if (highQueueItems.length > 0) {
              +   const machineDetails = highQueueItems
              +     .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              +     .join(', ');
              +   const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              +   console.error(message);
                }
              + 
              + dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-meta will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta" {
        id               = "nnz-icu-8qk"
        name             = "GHA Runner Queue Check - Meta Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
              - if (dd.response.statusCode !== 200) {
              -   // We do not want to fail due to hud.pytorch.org API failure.
              -   console.log('Status code is not 200, stopping execution');
              -   dd.expect(true).to.equal(true);
              + dd.expect(dd.response.statusCode).to.equal(200);
              + const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.idc.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];
              + const jsonData = dd.response.body;
              + const parsedData = JSON.parse(jsonData);
              + const highQueueItems = parsedData
              +   .filter(item => {
              +     const machineType = item.machine_type;
              +     return !EXCLUDED_MACHINE_PATTERNS.some(pattern =>
              +       pattern.startsWith('^') ?
              +         new RegExp(pattern).test(machineType) :
              +         machineType.includes(pattern)
              +     ) && item.avg_queue_s > 10800;
              +   })
              +   .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
              + if (highQueueItems.length > 0) {
              +   const machineDetails = highQueueItems
              +     .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              +     .join(', ');
              +   const message = `High queue detected for machine types: ${machineDetails}`;
              +   console.error(message);
                }
              - else {
              -   const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.idc.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];
              -   const jsonData = dd.response.body;
              -   const parsedData = JSON.parse(jsonData);
              -   const highQueueItems = parsedData
              -     .filter(item => {
              -       const machineType = item.machine_type;
              -       return !EXCLUDED_MACHINE_PATTERNS.some(pattern =>
              -         pattern.startsWith('^') ?
              -           new RegExp(pattern).test(machineType) :
              -           machineType.includes(pattern)
              -       ) && item.avg_queue_s > 10800;
              -     })
              -     .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
              -   if (highQueueItems.length > 0) {
              -     const machineDetails = highQueueItems
              -       .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              -       .join(', ');
              -     const message = `High queue detected for machine types: ${machineDetails}`;
              -     console.error(message);
              -   }
              -   dd.expect(highQueueItems.length > 0).to.be.false;
              - }
              + dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-meta-h100 will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-meta-h100" {
        id               = "hpi-psi-z8i"
        name             = "GHA Runner Queue Check - Meta Runners - AWS H100"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
              - if (dd.response.statusCode !== 200) {
              -   // We do not want to fail due to hud.pytorch.org API failure.
              -   console.log('Status code is not 200, stopping execution');
              -   dd.expect(true).to.equal(true);
              - }
              - else {
              -   const MACHINE_TYPE_FILTER = 'linux.aws.h100';
              -   const jsonData = dd.response.body;
              -   const parsedData = JSON.parse(jsonData);
              + dd.expect(dd.response.statusCode).to.equal(200);
                
              -   const highQueueItems = parsedData
              -     .filter(item => item.machine_type === MACHINE_TYPE_FILTER && item.avg_queue_s > 21600)
              -     .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
              + const MACHINE_TYPE_FILTER = 'linux.aws.h100';
              + const jsonData = dd.response.body;
              + const parsedData = JSON.parse(jsonData);
                
              -   if (highQueueItems.length > 0) {
              -     const machineDetails = highQueueItems
              -       .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              -       .join(', ');
              -     const message = `High queue detected for machine type ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              -     console.error(message);
              -   }
              + const highQueueItems = parsedData
              +   .filter(item => item.machine_type === MACHINE_TYPE_FILTER && item.avg_queue_s > 21600)
              +   .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                
              -   dd.expect(highQueueItems.length > 0).to.be.false;
              + if (highQueueItems.length > 0) {
              +   const machineDetails = highQueueItems
              +     .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              +     .join(', ');
              +   const message = `High queue detected for machine type ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              +   console.error(message);
                }
              + 
              + dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

  # datadog_synthetics_test.pytorch-gha-runners-queue-check-nvidia will be updated in-place
!~  resource "datadog_synthetics_test" "pytorch-gha-runners-queue-check-nvidia" {
        id               = "sxd-d72-36u"
        name             = "GHA Runner Queue Check - Nvidia Runners"
        tags             = [
            "env:project",
            "project:pytorch",
            "service:gha-runners",
        ]
#        (10 unchanged attributes hidden)

!~      assertion {
!~          code = <<-EOT
              - if (dd.response.statusCode !== 200) {
              -   // We do not want to fail due to hud.pytorch.org API failure.
              -   console.log('Status code is not 200, stopping execution');
              -   dd.expect(true).to.equal(true);
              - }
              - else {
              -   const MACHINE_TYPE_FILTER = '.dgx.';
              -   const jsonData = dd.response.body;
              -   const parsedData = JSON.parse(jsonData);
              + dd.expect(dd.response.statusCode).to.equal(200);
                
              -   const highQueueItems = parsedData
              -     .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
              -     .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
              + const MACHINE_TYPE_FILTER = '.dgx.';
              + const jsonData = dd.response.body;
              + const parsedData = JSON.parse(jsonData);
                
              -   if (highQueueItems.length > 0) {
              -     const machineDetails = highQueueItems
              -       .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              -       .join(', ');
              -     const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              -     console.error(message);
              -   }
              + const highQueueItems = parsedData
              +   .filter(item => item.machine_type.includes(MACHINE_TYPE_FILTER) && item.avg_queue_s > 10800)
              +   .map(item => ({ machine_type: item.machine_type, avg_queue_s: item.avg_queue_s }));
                
              -   dd.expect(highQueueItems.length > 0).to.be.false;
              + if (highQueueItems.length > 0) {
              +   const machineDetails = highQueueItems
              +     .map(item => `${item.machine_type} (${item.avg_queue_s}s)`)
              +     .join(', ');
              +   const message = `High queue detected for machine types containing ${MACHINE_TYPE_FILTER}: ${machineDetails}`;
              +   console.error(message);
                }
              + 
              + dd.expect(highQueueItems.length > 0).to.be.false;
            EOT
#            (1 unchanged attribute hidden)
        }

#        (2 unchanged blocks hidden)
    }

Plan: 0 to add, 7 to change, 0 to destroy.

✅ Plan applied in Tofu Apply #38
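The Meta Runners check in the plan above is the only one that filters by exclusion instead of by a single `MACHINE_TYPE_FILTER`. A minimal sketch of how that pattern list appears to behave is below; the helper name `isExcluded` is hypothetical, and the machine type names passed to it are invented for illustration. Entries that start with `^` are compiled as regular expressions, and all other entries are matched as plain substrings.

```javascript
// Pattern list copied from the plan; `isExcluded` is a hypothetical helper
// that isolates the matching rule used inside the assertion's filter.
const EXCLUDED_MACHINE_PATTERNS = ['.dgx.', '.idc.', '.rocm.', '.s390x', '^lf\\.', '^linux.aws.h100'];

function isExcluded(machineType, patterns = EXCLUDED_MACHINE_PATTERNS) {
  return patterns.some(pattern =>
    pattern.startsWith('^')
      ? new RegExp(pattern).test(machineType) // anchored regular expression
      : machineType.includes(pattern)         // literal substring match
  );
}

// Illustrative machine type names, not taken from the HUD API.
console.log(isExcluded('lf.linux.example'));    // starts with "lf."
console.log(isExcluded('linux.rocm.example'));  // contains ".rocm."
console.log(isExcluded('linux.example'));       // matches no pattern
console.log(isExcluded('amz.lf.example'));      // "lf." not at the start
```

One detail worth noting: the dots in `'^linux.aws.h100'` are not escaped, so as a regular expression they match any character. That is harmless if machine types always use literal dots, but it is looser than the exact-equality test used by the dedicated AWS H100 check.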

Revert "Do not fail check if PyTorch HUD API is down (#37)"

e9e2982


zxiiro requested a review from a team as a code owner September 4, 2025 14:40

zxiiro temporarily deployed to prod September 4, 2025 14:40 — with GitHub Actions Inactive

jordanconway approved these changes Sep 4, 2025


zxiiro merged commit af93e82 into main Sep 4, 2025
3 checks passed

zxiiro deleted the zxiiro/runner-alerts branch September 4, 2025 15:04
