Commit 3b99e99

wwwillchen and claude authored
Add check-workflows skill for monitoring GitHub Actions workflow health (#2738)
## Summary

- Adds a new Claude Code skill that monitors GitHub Actions workflow runs from the past day
- Identifies severe or consistent failures and files GitHub issues for actionable problems
- Includes detailed instructions for classifying failures, investigating issues, and filing reports

## Test plan

- The skill can be manually triggered via `/dyad:check-workflows` or run on a daily schedule at 13:00 UTC
- Check the generated GitHub issue (if any) to verify the classification and reporting logic works correctly
- Verify that expected failures (like Nightly Runner Cleanup, cascading CI failures) are properly filtered out

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 49375c8 commit 3b99e99

2 files changed: +185 -0 lines changed

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
---
name: dyad:check-workflows
description: Check GitHub Actions workflow runs from the past day, identify severe or consistent failures, and file an issue if actionable problems are found.
---

# Check Workflows

Check GitHub Actions workflow runs from the past day for severe or consistent failures and file a GitHub issue if actionable problems are found.

## Arguments

- `$ARGUMENTS`: (Optional) Number of hours to look back (default: 24)
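
As an illustration of how the optional argument could be normalized, here is a minimal sketch. The helper name `lookback_hours` is hypothetical, and falling back to 24 on invalid input is an assumption; the skill itself only states that the default is 24.

```shell
# lookback_hours RAW: print RAW if it is a positive integer, otherwise 24.
# Hypothetical helper; the fallback on invalid input is an assumption.
lookback_hours() {
  case "${1:-}" in
    ''|*[!0-9]*|0*) echo 24 ;;   # empty, non-numeric, zero, or leading zero
    *) echo "$1" ;;
  esac
}
```

For example, `lookback_hours 48` prints `48`, while `lookback_hours` with no argument prints `24`.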

## Instructions

### 1. Gather recent workflow runs

Fetch all workflow runs from the past N hours (default 24):

```
gh run list --limit 100 --json workflowName,status,conclusion,event,headBranch,createdAt,databaseId,url,name
```

Filter to only runs created within the lookback window. Group runs by workflow name.
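
The filtering and grouping could be sketched roughly as follows. The function names are hypothetical, and the sketch assumes `createdAt` is a UTC ISO-8601 string (for example `2025-01-01T12:00:00Z`), so that plain string ordering matches time ordering. `recent_runs` needs an authenticated `gh` CLI and relies on its built-in `--jq` flag.

```shell
# cutoff_timestamp HOURS: print the UTC time HOURS ago as an ISO-8601 string.
cutoff_timestamp() {
  hours="${1:-24}"
  # Try GNU date (Linux) first, then fall back to BSD date (macOS).
  date -u -d "${hours} hours ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v "-${hours}H" +%Y-%m-%dT%H:%M:%SZ
}

# is_recent CREATED_AT CUTOFF: succeed if CREATED_AT is at or after CUTOFF.
# Assumes both are same-format UTC ISO-8601 strings, which sort
# lexicographically in chronological order.
is_recent() {
  first="$(printf '%s\n%s\n' "$2" "$1" | LC_ALL=C sort | head -n 1)"
  [ "$first" = "$2" ]
}

# recent_runs HOURS: list runs inside the window, grouped by workflow name.
# Only defined here, not called; it requires network access and gh auth.
recent_runs() {
  cutoff="$(cutoff_timestamp "${1:-24}")"
  gh run list --limit 100 \
    --json workflowName,status,conclusion,event,headBranch,createdAt,databaseId,url,name \
    --jq "[.[] | select(.createdAt >= \"${cutoff}\")] | group_by(.workflowName)"
}
```

Note that `--limit 100` caps the fetch, so on a busy repository the oldest runs inside the window may not be returned.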

### 2. Classify each failure

For each failed run, determine if it is **expected** or **actionable** by checking these rules:

#### Expected failures (IGNORE these):

1. **Nightly Runner Cleanup**: This workflow intentionally reboots self-hosted macOS runners, which kills the runner process mid-job. It will almost always show as "failed" even when working correctly. **Always skip this workflow entirely.**

2. **Cascading failures from CI**: When the main CI workflow fails, these downstream workflows will also fail because they depend on CI artifacts (e.g. `html-report`, blob reports). This is noise, not an independent problem:
   - Playwright Report Comment (fails with "artifact not found")
   - Upload to Flakiness.io (fails when no flakiness reports exist)
   - Merge PR when ready (skipped/fails when CI hasn't passed)

3. **CLA Assistant**: Failures just mean a contributor hasn't signed the CLA yet. This resolves on its own.

4. **Cancelled runs**: Runs cancelled due to concurrency groups (newer push cancels older run) are normal.

5. **`action_required` / `neutral` conclusions**: Standard GitHub behavior for fork PRs or first-time contributors needing manual approval.

6. **CI failures on non-main branches**: Individual PR CI failures are expected: contributors may have formatting issues, lockfile mismatches, test failures, etc. These are the contributor's responsibility.

7. **Claude Deflake E2E**: This workflow is expected to sometimes have long runs or partial failures as it investigates flaky tests.
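
The expected-failure rules above can be approximated with a small filter. This is a sketch under assumptions: the function name is hypothetical, the main CI workflow is assumed to be named `CI`, the caller is assumed to know whether CI failed for the same commit, and every Claude Deflake E2E failure is treated as expected, which is a simplification of rule 7.

```shell
# is_expected_failure WORKFLOW CONCLUSION BRANCH [CI_FAILED]
# Succeeds (exit 0) when the run matches one of the "expected" rules.
is_expected_failure() {
  # Rules 4 and 5: conclusions that are normal GitHub behavior.
  case "$2" in
    cancelled|action_required|neutral) return 0 ;;
  esac
  case "$1" in
    # Rules 1, 3, 7: workflows whose failures are always ignored here.
    "Nightly Runner Cleanup"|"CLA Assistant"|"Claude Deflake E2E") return 0 ;;
    # Rule 2: downstream workflows are noise only when CI itself failed.
    "Playwright Report Comment"|"Upload to Flakiness.io"|"Merge PR when ready")
      if [ "${4:-no}" = "yes" ]; then return 0; fi ;;
    # Rule 6: CI failures off main ("CI" is an assumed workflow name).
    "CI")
      if [ "$3" != "main" ]; then return 0; fi ;;
  esac
  return 1
}
```

For example, `is_expected_failure "CI" failure main` fails (the run needs a closer look), while `is_expected_failure "CI" failure feature-x` succeeds.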

#### Actionable failures (FLAG these):

1. **Permission errors**: Workflow can't access secrets, missing `GITHUB_TOKEN`, 403/401 errors on API calls that should be authenticated, `Resource not accessible by integration` errors.

2. **Consistent CI failures on main branch**: If the CI workflow fails on 2+ consecutive pushes to main, something is likely broken. Check if different commits are failing for the same reason.

3. **Infrastructure failures**: Self-hosted runners not coming back online (check if Nightly Runner Cleanup's verify steps are failing), runners consistently unavailable, disk space issues.

4. **Repeated rate limiting**: If GitHub API rate limiting is causing the same workflow to fail across multiple runs (not just a one-off).

5. **Action version issues**: Deprecated or broken GitHub Action versions causing failures.

6. **Workflow configuration errors**: YAML syntax errors, invalid inputs, missing required secrets (distinct from permission issues).

7. **Scheduled workflow failures**: If a scheduled/cron workflow (other than Nightly Runner Cleanup) fails consistently, it likely indicates a systemic issue.
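
The "2+ consecutive pushes" test in rule 2 comes down to finding the longest unbroken run of failures. A minimal sketch, with hypothetical function names, taking the CI conclusions for main in chronological order:

```shell
# max_failure_streak CONCLUSION...: print the longest run of consecutive
# "failure" values among the given conclusions (in chronological order).
max_failure_streak() {
  streak=0
  longest=0
  for conclusion in "$@"; do
    if [ "$conclusion" = "failure" ]; then
      streak=$((streak + 1))
      if [ "$streak" -gt "$longest" ]; then longest=$streak; fi
    else
      streak=0
    fi
  done
  echo "$longest"
}

# main_is_broken CONCLUSION...: succeed when there are 2+ consecutive failures.
main_is_broken() {
  [ "$(max_failure_streak "$@")" -ge 2 ]
}
```

This only counts the streak. Whether the failing commits share a cause still has to be checked from the logs.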

### 3. Investigate actionable failures

For each potentially actionable failure, get more details:

```
gh run view <run_id> --log-failed 2>/dev/null | head -100
```

Look for:

- The specific error message
- Whether the failure is in a setup step (infrastructure) vs. a test/build step (code)
- Whether the same failure appears across multiple runs
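
A first pass over the log text could be automated along these lines. The function name, the category labels, and the exact patterns are illustrative assumptions, not an exhaustive list of what GitHub emits. Rate limiting is checked first because rate-limited API calls can also surface as HTTP 403.

```shell
# classify_log: read log text on stdin and print a coarse category.
# Patterns and labels are illustrative assumptions.
classify_log() {
  log="$(cat)"
  if printf '%s\n' "$log" | grep -qi 'rate limit'; then
    echo rate-limit
  elif printf '%s\n' "$log" |
    grep -Eqi 'Resource not accessible by integration|Bad credentials|HTTP (401|403)'; then
    echo permission
  elif printf '%s\n' "$log" | grep -qi 'No space left on device'; then
    echo infrastructure
  else
    echo unknown
  fi
}
```

It could be fed from the command above, for example `gh run view <run_id> --log-failed 2>/dev/null | head -100 | classify_log`. An `unknown` result still needs a manual read.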

### 4. Determine severity

After investigation, categorize actionable failures:

- **SEVERE**: Permission errors, infrastructure down, main branch consistently broken, workflow configuration errors
- **MODERATE**: Repeated rate limiting, deprecated action warnings, intermittent infrastructure issues
- **LOW**: One-off transient failures that resolved on retry

Only proceed to file an issue if there are SEVERE or MODERATE findings.
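
The severity table and the filing rule can be expressed as a lookup plus a gate. The function names and category labels (`permission`, `main-broken`, and so on) are hypothetical names chosen for this sketch, and anything unrecognized is assumed to be LOW.

```shell
# severity_for CATEGORY: map a failure category to a severity level.
# Category labels are hypothetical names for the cases listed above.
severity_for() {
  case "$1" in
    permission|infrastructure-down|main-broken|config-error) echo SEVERE ;;
    rate-limit|deprecated-action|intermittent-infrastructure) echo MODERATE ;;
    *) echo LOW ;;
  esac
}

# should_file_issue SEVERITY...: succeed if any finding is SEVERE or MODERATE.
should_file_issue() {
  for severity in "$@"; do
    case "$severity" in
      SEVERE|MODERATE) return 0 ;;
    esac
  done
  return 1
}
```

With only LOW findings, `should_file_issue LOW LOW` fails and no issue is filed.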

### 5. Check for existing issues

Before creating a new issue, check if there's already an open issue about workflow problems:

```
gh issue list --label "workflow-health" --state open --json number,title,body
```

If an existing issue covers the same problems, do not create a duplicate. Instead, add a comment to the existing issue with the latest findings.

### 6. File a GitHub issue

If there are actionable findings (SEVERE or MODERATE), create a GitHub issue:

```
gh issue create --title "Workflow issues: <X>, <Y>, and <Z>" --label "workflow-health" --body "$(cat <<'EOF'
## Workflow Health Report

**Period:** <start_time> to <end_time>
**Total runs checked:** <N>
**Failures found:** <N actionable> actionable, <N expected> expected (ignored)

## Issues Found

### <Issue 1 Title>
- **Workflow:** <workflow name>
- **Severity:** SEVERE / MODERATE
- **Failed runs:**
  - [Run #<id>](<url>) - <date>
  - [Run #<id>](<url>) - <date>
- **Error:** <brief error description>
- **Suggested fix:** <how to resolve>

### <Issue 2 Title>
...

## Expected Failures (Ignored)
<Brief summary of expected failures that were skipped and why>

---
*This issue was automatically created by the daily workflow health check.*
EOF
)"
```

The issue title should list the specific problems found (e.g., "Workflow issues: CI permissions error, flakiness upload rate-limited"). Keep it concise but descriptive.
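
Steps 5 and 6 together amount to "comment if an open issue exists, otherwise create one". A rough sketch follows. Both function names are hypothetical, and `report_findings` assumes that any open `workflow-health` issue covers the same problems, whereas the skill asks for that to be judged from the issue's content.

```shell
# issue_title PROBLEM...: build the title from the specific problems found.
issue_title() {
  joined=""
  for problem in "$@"; do
    joined="${joined:+$joined, }$problem"
  done
  printf 'Workflow issues: %s\n' "$joined"
}

# report_findings TITLE BODY: comment on the first open workflow-health
# issue if there is one, otherwise create a new issue.
# Only defined here, not called; it requires network access and gh auth.
report_findings() {
  existing="$(gh issue list --label "workflow-health" --state open \
    --json number --jq '.[0].number // empty')"
  if [ -n "$existing" ]; then
    gh issue comment "$existing" --body "$2"
  else
    gh issue create --title "$1" --label "workflow-health" --body "$2"
  fi
}
```

For example, `issue_title "CI permissions error" "flakiness upload rate-limited"` prints `Workflow issues: CI permissions error, flakiness upload rate-limited`.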

### 7. Report results

Summarize:

- How many workflow runs were checked
- How many were expected failures (and which categories)
- How many were actionable (and what was found)
- Whether an issue was filed (with link) or if everything looks healthy
- If no actionable issues were found, report "All workflows healthy" and do not create an issue
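
The final summary could be produced by something like the following sketch. The function name and output wording are assumptions, apart from the "All workflows healthy" message, which comes from the list above.

```shell
# print_summary TOTAL EXPECTED ACTIONABLE [ISSUE_URL]
# Hypothetical helper that prints the end-of-run summary.
print_summary() {
  echo "Workflow runs checked: $1"
  echo "Expected failures (ignored): $2"
  echo "Actionable failures: $3"
  if [ "$3" -eq 0 ]; then
    echo "All workflows healthy"
  elif [ -n "${4:-}" ]; then
    echo "Issue filed: $4"
  else
    # Actionable findings existed but none reached MODERATE or SEVERE.
    echo "No issue filed"
  fi
}
```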
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
name: Claude Check Workflows

on:
  schedule:
    # Daily at 13:00 UTC (5 AM PST / 6 AM PDT)
    - cron: "0 13 * * *"
  workflow_dispatch:
    inputs:
      hours:
        description: "Number of hours to look back for workflow runs"
        required: false
        default: "24"
        type: string

jobs:
  check-workflows:
    environment: ai-bots
    runs-on:
      - self-hosted
      - macOS
      - ARM64
    permissions:
      issues: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v5

      - name: Check workflow health
        uses: anthropics/claude-code-action@v1
        env:
          GH_TOKEN: ${{ secrets.DEFLAKE_E2E_PAT_GITHUB_TOKEN }}
          CLAUDE_CODE_MAX_OUTPUT_TOKENS: 48000
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          github_token: ${{ secrets.DEFLAKE_E2E_PAT_GITHUB_TOKEN }}
          claude_args: --model claude-sonnet-4-5-20250929
          prompt: |
            /dyad:check-workflows ${{ inputs.hours || '24' }}
      - name: Cleanup (self-hosted macOS)
        if: always()
        run: bash scripts/ci-cleanup-macos.sh
