Objective
Implement a monitoring/notification mechanism to track infra issues (runners dying, network timeouts, storage issues, etc.) seen on PR jobs.
Background
The HUD is a great resource to spot infra issues for jobs running on main or other branches. However, there isn't a similarly convenient view available for PR-based jobs.
Solution
Offline comments from @clee2000:
For general PR failures, I'm not sure if we have a ClickHouse query specifically for it, but I imagine something similar could be done where we have a CH query, then we run it periodically, and when it has too many failures an issue gets created -> Slack message gets posted?
Aggregation of the failure signatures would be a requirement to ensure that one-off failures do not create noise.
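
The aggregation step could look roughly like the sketch below. It is only an illustration under stated assumptions, not an existing implementation: the `FailedJob` fields, the `normalize_signature` helper, and the `min_distinct_prs` threshold are all hypothetical, and the input rows are assumed to come from whatever periodic ClickHouse query ends up being written. The sketch groups failures by a normalized signature and counts distinct PRs rather than raw failures, so that one PR retrying a broken change does not trip the alert.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class FailedJob:
    """One failed PR job row (hypothetical shape of a query result)."""
    pr_number: int
    job_name: str
    failure_line: str


def normalize_signature(failure_line: str) -> str:
    """Collapse case and digit runs so near-identical errors share a key."""
    return re.sub(r"\d+", "<N>", failure_line.strip().lower())


def signatures_over_threshold(
    jobs: Iterable[FailedJob], min_distinct_prs: int = 3
) -> Dict[str, int]:
    """Return signatures seen on at least `min_distinct_prs` different PRs.

    The default of 3 is an arbitrary placeholder, not a tuned value.
    """
    prs_by_signature = defaultdict(set)
    for job in jobs:
        prs_by_signature[normalize_signature(job.failure_line)].add(job.pr_number)
    return {
        signature: len(prs)
        for signature, prs in prs_by_signature.items()
        if len(prs) >= min_distinct_prs
    }
```

Any signature returned here would be a candidate for the "issue gets created -> Slack message gets posted" step; a one-off failure on a single PR never reaches the threshold and is dropped.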