| 1 | +--- |
| 2 | +name: dyad:check-workflows |
| 3 | +description: Check GitHub Actions workflow runs from the past day, identify severe or consistent failures, and file an issue if actionable problems are found. |
| 4 | +--- |
| 5 | + |
| 6 | +# Check Workflows |
| 7 | + |
| 8 | +Check GitHub Actions workflow runs from the past day for severe or consistent failures and file a GitHub issue if actionable problems are found. |
| 9 | + |
| 10 | +## Arguments |
| 11 | + |
| 12 | +- `$ARGUMENTS`: (Optional) Number of hours to look back (default: 24) |
| 13 | + |
| 14 | +## Instructions |
| 15 | + |
| 16 | +### 1. Gather recent workflow runs |
| 17 | + |
| 18 | +Fetch all workflow runs from the past N hours (default 24): |
| 19 | + |
| 20 | +``` |
| 21 | +gh run list --limit 100 --json workflowName,status,conclusion,event,headBranch,createdAt,databaseId,url,name |
| 22 | +``` |
| 23 | + |
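| 24 | +Filter to only runs created within the lookback window by comparing each run's `createdAt` against the cutoff. If the oldest returned run still falls inside the window, the 100-run limit may have truncated the results, so rerun with a higher `--limit`. Group the remaining runs by workflow name. |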
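One way to apply the lookback window is to compute a UTC cutoff and let `gh`'s built-in `--jq` filter drop older runs. This is a minimal sketch rather than part of the original command: `lookback_cutoff` and `recent_runs` are hypothetical helper names, and it assumes `createdAt` is an ISO-8601 UTC string (so plain string comparison orders correctly) and that `date` accepts either the GNU `-d @epoch` or the BSD `-r epoch` form.

```shell
# Hypothetical helper: print the ISO-8601 UTC timestamp N hours before "now".
# $1 = hours to look back (default 24)
# $2 = "now" in epoch seconds (default: current time; handy for testing)
lookback_cutoff() {
  hours="${1:-24}"
  now="${2:-$(date -u +%s)}"
  cutoff=$((now - hours * 3600))
  # GNU date takes -d @epoch; BSD/macOS date takes -r epoch.
  date -u -d "@${cutoff}" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -r "${cutoff}" +%Y-%m-%dT%H:%M:%SZ
}

# Hypothetical helper: keep runs created at or after the cutoff,
# then group them by workflow name. Assumes createdAt is ISO-8601 UTC.
recent_runs() {
  cutoff="$(lookback_cutoff "${1:-24}")"
  gh run list --limit 100 \
    --json workflowName,status,conclusion,event,headBranch,createdAt,databaseId,url,name \
    --jq "map(select(.createdAt >= \"${cutoff}\")) | group_by(.workflowName)"
}
```

With these helpers, `recent_runs 48` would look back 48 hours instead of the default 24.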
| 25 | + |
| 26 | +### 2. Classify each failure |
| 27 | + |
| 28 | +For each failed run, determine if it is **expected** or **actionable** by checking these rules: |
| 29 | + |
| 30 | +#### Expected failures (IGNORE these): |
| 31 | + |
| 32 | +1. **Nightly Runner Cleanup**: This workflow intentionally reboots self-hosted macOS runners, which kills the runner process mid-job. It will almost always show as "failed" even when working correctly. **Always skip this workflow's failed conclusion.** The one exception is a failing verify step, which falls under infrastructure failures below. |
| 33 | + |
| 34 | +2. **Cascading failures from CI**: When the main CI workflow fails, these downstream workflows will also fail because they depend on CI artifacts (e.g. `html-report`, blob reports). This is noise, not an independent problem: |
| 35 | + - Playwright Report Comment (fails with "artifact not found") |
| 36 | + - Upload to Flakiness.io (fails when no flakiness reports exist) |
| 37 | + - Merge PR when ready (skipped/fails when CI hasn't passed) |
| 38 | + |
| 39 | +3. **CLA Assistant**: Failures just mean a contributor hasn't signed the CLA yet. This resolves on its own. |
| 40 | + |
| 41 | +4. **Cancelled runs**: Runs cancelled due to concurrency groups (newer push cancels older run) are normal. |
| 42 | + |
| 43 | +5. **`action_required` / `neutral` conclusions**: Standard GitHub behavior for fork PRs or first-time contributors needing manual approval. |
| 44 | + |
| 45 | +6. **CI failures on non-main branches**: Individual PR CI failures are expected — contributors may have formatting issues, lockfile mismatches, test failures, etc. These are the contributor's responsibility. |
| 46 | + |
| 47 | +7. **Claude Deflake E2E**: This workflow is expected to sometimes have long runs or partial failures as it investigates flaky tests. |
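The expected-failure rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not a definitive implementation: `is_expected_failure` is a hypothetical name, the main CI workflow is assumed to be called `CI`, and the caller is assumed to already know whether the upstream CI run failed.

```shell
# Hypothetical helper: exit 0 when a failed run matches an "expected" rule.
# $1 = workflow name, $2 = conclusion, $3 = head branch,
# $4 = "yes" if the upstream CI run also failed (default "no").
# Assumes the main CI workflow is named "CI"; adjust to the real name.
is_expected_failure() {
  workflow="$1"; conclusion="$2"; branch="$3"; ci_failed="${4:-no}"

  # Rules 4 and 5: cancelled runs and approval-gated conclusions.
  case "$conclusion" in
    cancelled|action_required|neutral) return 0 ;;
  esac

  case "$workflow" in
    # Rules 1, 3, and 7: workflows whose failed conclusion is expected.
    "Nightly Runner Cleanup"|"CLA Assistant"|"Claude Deflake E2E")
      return 0 ;;
    # Rule 2: downstream workflows are noise only when CI itself failed.
    "Playwright Report Comment"|"Upload to Flakiness.io"|"Merge PR when ready")
      [ "$ci_failed" = "yes" ] && return 0
      return 1 ;;
    # Rule 6: CI failures off main are the contributor's responsibility.
    "CI")
      [ "$branch" != "main" ] && return 0
      return 1 ;;
  esac
  return 1
}
```

Anything the predicate rejects still needs the actionable checks; it only filters out known noise.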
| 48 | + |
| 49 | +#### Actionable failures (FLAG these): |
| 50 | + |
| 51 | +1. **Permission errors**: Workflow can't access secrets, missing `GITHUB_TOKEN`, 403/401 errors on API calls that should be authenticated, `Resource not accessible by integration` errors. |
| 52 | + |
| 53 | +2. **Consistent CI failures on main branch**: If the CI workflow fails on 2+ consecutive pushes to main, something is likely broken. Check if different commits are failing for the same reason. |
| 54 | + |
| 55 | +3. **Infrastructure failures**: Self-hosted runners not coming back online (check if Nightly Runner Cleanup's verify steps are failing), runners consistently unavailable, disk space issues. |
| 56 | + |
| 57 | +4. **Repeated rate limiting**: GitHub API rate limiting that causes the same workflow to fail across multiple runs (not just a one-off). |
| 58 | + |
| 59 | +5. **Action version issues**: Deprecated or broken GitHub Action versions causing failures. |
| 60 | + |
| 61 | +6. **Workflow configuration errors**: YAML syntax errors, invalid inputs, missing required secrets (distinct from permission issues). |
| 62 | + |
| 63 | +7. **Scheduled workflow failures**: If a scheduled/cron workflow (other than Nightly Runner Cleanup) fails consistently, it likely indicates a systemic issue. |
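The "2+ consecutive pushes to main" rule can be checked by counting how many of the most recent runs failed in a row. The sketch below uses a hypothetical helper, `leading_failures`, and assumes its input lists one conclusion per line, newest first (which is the order `gh run list` is assumed to return).

```shell
# Hypothetical helper: read conclusions (newest first) on stdin and print
# how many consecutive failures lead the list.
leading_failures() {
  count=0
  while IFS= read -r conclusion; do
    # Runs still in progress have an empty conclusion; skip them.
    [ -z "$conclusion" ] && continue
    [ "$conclusion" = "failure" ] || break
    count=$((count + 1))
  done
  echo "$count"
}

# Example wiring (assumes the workflow is named "CI"):
#   gh run list --workflow CI --branch main --event push --limit 10 \
#     --json conclusion --jq '.[].conclusion' | leading_failures
```

A result of 2 or more matches the consistent-failure rule; the logs still need to be compared to confirm the runs fail for the same reason.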
| 64 | + |
| 65 | +### 3. Investigate actionable failures |
| 66 | + |
| 67 | +For each potentially actionable failure, get more details: |
| 68 | + |
| 69 | +``` |
| 70 | +gh run view <run_id> --log-failed 2>/dev/null | head -100 |
| 71 | +``` |
| 72 | + |
| 73 | +Look for: |
| 74 | + |
| 75 | +- The specific error message |
| 76 | +- Whether the failure is in a setup step (infrastructure) vs. a test/build step (code) |
| 77 | +- Whether the same failure appears across multiple runs |
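To see whether the same failure appears across multiple runs, it can help to strip timestamps from the failed-step logs and count identical error lines. This is a rough sketch: `error_signatures` is a hypothetical helper, and it assumes log lines carry an ISO-8601 timestamp and that the relevant lines contain the word "error".

```shell
# Hypothetical helper: read failed-step log lines on stdin, drop ISO-8601
# timestamps so lines from different runs compare equal, and print each
# distinct line containing "error" with its occurrence count, most frequent first.
error_signatures() {
  sed -E 's/[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9:.]+Z[[:space:]]*//' |
    grep -i 'error' |
    sort | uniq -c | sort -rn
}

# Example wiring (the run IDs are placeholders):
#   for id in <run_id_1> <run_id_2>; do
#     gh run view "$id" --log-failed 2>/dev/null | head -100
#   done | error_signatures
```

A count above 1 on the same line suggests a shared cause rather than unrelated one-off failures.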
| 78 | + |
| 79 | +### 4. Determine severity |
| 80 | + |
| 81 | +After investigation, categorize actionable failures: |
| 82 | + |
| 83 | +- **SEVERE**: Permission errors, infrastructure down, main branch consistently broken, workflow configuration errors |
| 84 | +- **MODERATE**: Repeated rate limiting, deprecated action warnings, intermittent infrastructure issues |
| 85 | +- **LOW**: One-off transient failures that resolved on retry |
| 86 | + |
| 87 | +Only proceed to file an issue if there are SEVERE or MODERATE findings. |
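The severity mapping and the filing threshold can be written down as a small lookup. The category slugs below (`permission`, `rate-limit`, and so on) are hypothetical labels invented for this sketch; the document itself only names the categories in prose.

```shell
# Hypothetical helper: map a failure category slug to a severity level.
# Unknown categories fall through to LOW, which is an assumption of this sketch.
severity_of() {
  case "$1" in
    permission|infrastructure-down|main-broken|config-error) echo SEVERE ;;
    rate-limit|deprecated-action|intermittent-infrastructure) echo MODERATE ;;
    *) echo LOW ;;
  esac
}

# Hypothetical helper: exit 0 if any category is SEVERE or MODERATE.
should_file_issue() {
  for category in "$@"; do
    [ "$(severity_of "$category")" != "LOW" ] && return 0
  done
  return 1
}
```

For example, `should_file_issue transient rate-limit` succeeds because one finding is MODERATE, while `should_file_issue transient` does not.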
| 88 | + |
| 89 | +### 5. Check for existing issues |
| 90 | + |
| 91 | +Before creating a new issue, check if there's already an open issue about workflow problems: |
| 92 | + |
| 93 | +``` |
| 94 | +gh issue list --label "workflow-health" --state open --json number,title,body |
| 95 | +``` |
| 96 | + |
| 97 | +If an existing issue covers the same problems, do not create a duplicate. Instead, add a comment to the existing issue with the latest findings. |
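The comment-or-create decision might look like the sketch below. `report_findings` is a hypothetical helper, and it simplifies one thing: it treats the first open `workflow-health` issue as covering the same problems, whereas deciding that properly still requires reading the issue's title and body.

```shell
# Hypothetical helper: comment on an open workflow-health issue if one exists,
# otherwise create a new issue. $1 = issue title, $2 = report body.
# Simplification: the first open issue is assumed to cover the same problems.
report_findings() {
  title="$1"; body="$2"
  existing="$(gh issue list --label "workflow-health" --state open \
    --json number --jq '.[0].number // empty')"
  if [ -n "$existing" ]; then
    gh issue comment "$existing" --body "$body"
  else
    gh issue create --title "$title" --label "workflow-health" --body "$body"
  fi
}
```

Keeping the decision in one place avoids filing a fresh issue every day for a problem that is already tracked.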
| 98 | + |
| 99 | +### 6. File a GitHub issue |
| 100 | + |
| 101 | +If there are SEVERE or MODERATE findings and no open issue already covers them, create a GitHub issue: |
| 102 | + |
| 103 | +``` |
| 104 | +gh issue create --title "Workflow issues: <X>, <Y>, and <Z>" --label "workflow-health" --body "$(cat <<'EOF' |
| 105 | +## Workflow Health Report |
| 106 | + |
| 107 | +**Period:** <start_time> to <end_time> |
| 108 | +**Total runs checked:** <N> |
| 109 | +**Failures found:** <N actionable> actionable, <N expected> expected (ignored) |
| 110 | + |
| 111 | +## Issues Found |
| 112 | + |
| 113 | +### <Issue 1 Title> |
| 114 | +- **Workflow:** <workflow name> |
| 115 | +- **Severity:** SEVERE / MODERATE |
| 116 | +- **Failed runs:** |
| 117 | + - [Run #<id>](<url>) — <date> |
| 118 | + - [Run #<id>](<url>) — <date> |
| 119 | +- **Error:** <brief error description> |
| 120 | +- **Suggested fix:** <how to resolve> |
| 121 | + |
| 122 | +### <Issue 2 Title> |
| 123 | +... |
| 124 | + |
| 125 | +## Expected Failures (Ignored) |
| 126 | +<Brief summary of expected failures that were skipped and why> |
| 127 | + |
| 128 | +--- |
| 129 | +*This issue was automatically created by the daily workflow health check.* |
| 130 | +EOF |
| 131 | +)" |
| 132 | +``` |
| 133 | + |
| 134 | +The issue title should list the specific problems found (e.g., "Workflow issues: CI permissions error, flakiness upload rate-limited"). Keep it concise but descriptive. |
| 135 | + |
| 136 | +### 7. Report results |
| 137 | + |
| 138 | +Summarize: |
| 139 | + |
| 140 | +- How many workflow runs were checked |
| 141 | +- How many were expected failures (and which categories) |
| 142 | +- How many were actionable (and what was found) |
| 143 | +- Whether an issue was filed (with link) or if everything looks healthy |
| 144 | +- If no actionable issues were found, report "All workflows healthy" and do not create an issue |