Automatic monitoring/notification mechanism for infra failures on PRs #7522

### Objective

Implement a monitoring/notification mechanism to track infra issues (runners dying, network timeouts, storage issues, etc.) seen on PR jobs.

### Background

The [HUD](https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50) is a great resource to spot infra issues for jobs running on `main` or other branches. However, there isn't a similarly convenient view available for PR-based jobs.

### Solution

Offline comments from @clee2000:
> For general PR failures, I'm not sure if we have a ClickHouse query specifically for it, but I imagine something similar could be done where we have a CH query, then we run it periodically, and when it has too many failures an issue gets created -> Slack message gets posted?

Failure signatures would need to be aggregated so that one-off failures do not create noise; only a signature that recurs across several PRs should trigger a notification.
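
As a rough illustration of the aggregation step, the sketch below groups failed PR jobs by a normalized failure signature and keeps only signatures seen on several distinct PRs. Everything in it is an assumption for discussion: the row fields (`pr_number`, `failure_line`), the helper names `signature` and `noisy_signatures`, and the `min_prs` threshold are hypothetical, and the rows would in practice come from the periodic ClickHouse query rather than being passed in by hand.

```python
import re
from collections import defaultdict


def signature(failure_line: str) -> str:
    """Collapse volatile details so similar failures share one signature.

    Hypothetical normalization: lowercases the line, then masks hex
    literals and digit runs (timeouts, ports, byte counts).
    """
    sig = failure_line.lower()
    sig = re.sub(r"0x[0-9a-f]+", "<hex>", sig)
    sig = re.sub(r"\d+", "<n>", sig)
    return re.sub(r"\s+", " ", sig).strip()


def noisy_signatures(rows, min_prs=3):
    """Return {signature: distinct PR count} for signatures at or above min_prs.

    `rows` is assumed to be an iterable of dicts with `pr_number` and
    `failure_line` keys. Counting distinct PRs (not jobs) keeps retries
    on a single PR from inflating the count.
    """
    prs_by_sig = defaultdict(set)
    for row in rows:
        prs_by_sig[signature(row["failure_line"])].add(row["pr_number"])
    return {
        sig: len(prs)
        for sig, prs in prs_by_sig.items()
        if len(prs) >= min_prs
    }
```

A scheduled job could call `noisy_signatures` on each query result and create an issue (and the follow-up Slack message) only for the signatures it returns. The threshold and the normalization rules would need tuning against real failure data.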
