Fix CI safety check to evaluate latest workflow attempt #2062
…run conclusion

The ensure-master-docs-safety action was preventing docs-only commits from skipping CI when previous master commits had failing workflows. However, it was checking the overall workflow run conclusion, which GitHub marks as "failed" even if a manual rerun succeeded.

This created a problematic situation where:
1. A commit would fail a workflow
2. The workflow would be manually re-run and succeed
3. But subsequent commits would still be blocked because the run.conclusion was still "failure" (GitHub doesn't update this on rerun)

The fix:
- Fetch the jobs for each workflow run
- Find the latest attempt number (run_attempt)
- Check if any jobs in the LATEST attempt failed
- Only block if the latest attempt has failures

This allows the safety check to correctly recognize when a failure has been resolved via a manual rerun, while still preventing docs-only skips when there are genuine unresolved failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
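The steps in the fix can be sketched as a small pure function. This is a minimal illustration, not the code from the action itself: the helper name `latestAttemptHasFailure` is hypothetical, the `jobs` array is assumed to contain jobs from every attempt of a run (each with `run_attempt` and `conclusion`), and the set of blocking conclusions is taken from the review discussion on this PR. Treating an empty job list as unresolved is also an assumption.

```javascript
// Conclusions treated as blocking (assumed set: failure, timed_out,
// cancelled, action_required).
const BLOCKING_CONCLUSIONS = new Set([
  'failure',
  'timed_out',
  'cancelled',
  'action_required',
]);

// Hypothetical helper: true when the latest attempt of a run still has a
// blocking job. Earlier attempts are ignored, so a successful rerun clears
// the failure.
function latestAttemptHasFailure(jobs) {
  // No jobs at all: be conservative and treat the run as unresolved.
  if (jobs.length === 0) return true;

  const latestAttempt = Math.max(...jobs.map((job) => job.run_attempt));
  return jobs
    .filter((job) => job.run_attempt === latestAttempt)
    .some((job) => BLOCKING_CONCLUSIONS.has(job.conclusion));
}

// A run whose first attempt failed and whose manual rerun succeeded.
const rerunJobs = [
  { name: 'build', run_attempt: 1, conclusion: 'failure' },
  { name: 'build', run_attempt: 2, conclusion: 'success' },
];
console.log(latestAttemptHasFailure(rerunJobs)); // false
```

Keeping the check a pure function over job objects makes it easy to exercise the rerun scenarios without calling the GitHub API.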
Walkthrough
The workflow failure detection logic in the ensure-master-docs-safety action has been refactored. Instead of relying on the overall run conclusion, it now evaluates the latest job attempt for each workflow run, determining failure status based on individual job conclusions within that attempt.
Sequence Diagram(s)
sequenceDiagram
participant Action as Ensure Docs Safety
participant API as GitHub API
participant Check as Job Assessment
Action->>API: Fetch latest workflow runs
activate API
API-->>Action: Workflow run list
deactivate API
loop For each run
Action->>API: Fetch all jobs for run
activate API
API-->>Action: Job list with run_attempt
deactivate API
Note over Action: Identify latest run_attempt
rect rgb(200, 220, 255)
Note over Check: NEW: Analyze latest attempt jobs only
Action->>Check: Evaluate job conclusions<br/>(failure|timed_out|cancelled|action_required)
end
alt Any job failed in latest attempt
Action->>Action: Mark run as FAILED
else No jobs found
Action->>Action: Mark run as INCOMPLETE
else All jobs succeeded
Action->>Action: Mark run as SUCCESS
end
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Code Review
I've reviewed PR #2062 and the changes look solid overall. Here's my detailed feedback:

✅ Strengths

🔍 Potential Issues & Suggestions

1. API Rate Limiting Concern (Medium Priority)
Issue: The new code makes an additional API call per workflow run (lines 114-119). For repos with many workflows, this could hit rate limits.
Recommendation: Add error handling for rate limit responses with a try/catch block around the API call.

2. Pagination Not Handled for Jobs (High Priority)
Issue: Line 118 uses
Recommendation: Use pagination with

3. Edge Case: Empty Jobs Array (Low Priority)
Issue: Lines 123-127 treat no jobs as incomplete/failing. This might be overly conservative for workflows that were cancelled before any jobs started.
Consideration: Should this check

4. Array.prototype.map with Potential Empty Array
Issue: Line 130 could fail if
Recommendation: Filter out undefined values before calling Math.max, or add a check to ensure run_attempt exists on jobs.

🎯 Testing Recommendations
Since this is critical CI infrastructure:

📊 Performance Impact

🔒 Security Considerations
No security concerns identified. The code:

📝 Minor Nit
Line 158: The error message still shows

Summary
The fix is fundamentally sound and addresses the core issue. The main concerns are:

Great work documenting the problem and solution! The PR description is exemplary. 🎉
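Two of the suggestions in this review, paginating the jobs request and filtering out missing `run_attempt` values before `Math.max`, could look roughly like the sketch below. It assumes `github` is the Octokit client that actions/github-script passes to the script, and that `filter: 'all'` is used so that jobs from earlier attempts are returned as well. The helper names `fetchAllJobs` and `latestAttemptNumber` are hypothetical and do not come from the PR.

```javascript
// Hypothetical helper: collect every job for a run across all result pages.
// `github.paginate` keeps requesting pages until the list is exhausted.
async function fetchAllJobs(github, { owner, repo, run_id }) {
  return github.paginate(github.rest.actions.listJobsForWorkflowRun, {
    owner,
    repo,
    run_id,
    filter: 'all', // assumption: include jobs from every attempt
    per_page: 100,
  });
}

// Hypothetical helper: highest run_attempt among the jobs, or null when no
// job carries a usable attempt number. Filtering first means Math.max never
// receives undefined (which would yield NaN) or an empty argument list
// (which would yield -Infinity).
function latestAttemptNumber(jobs) {
  const attempts = jobs
    .map((job) => job.run_attempt)
    .filter((attempt) => Number.isInteger(attempt));
  return attempts.length > 0 ? Math.max(...attempts) : null;
}
```

Returning `null` rather than a number for the empty case leaves it to the caller to decide whether a run with no usable jobs should block.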
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/actions/ensure-master-docs-safety/action.yml (1)
157-159: Consider clarifying error message to reflect job-level determination.
The error message references `run.conclusion`, but the failure determination is now based on individual job conclusions from the latest attempt. For clarity, consider whether the message should indicate which specific jobs failed or note that it's checking the latest attempt's jobs. This will reduce confusion for developers investigating why their docs-only commit was blocked.
For example, you might include a summary like: "...failed workflows (from latest rerun attempt):" or collect job names from `latestJobs` that actually failed.
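One way to act on this nitpick is to build the message from the failing jobs of the latest attempt. The sketch below is only an illustration: `describeFailedJobs` is a hypothetical name, the message wording is invented, and `latestJobs` is assumed to hold the job objects of the latest attempt with `name`, `run_attempt`, and `conclusion` fields.

```javascript
// Conclusions counted as failures (assumed set, matching the review notes).
const FAILING_CONCLUSIONS = new Set([
  'failure',
  'timed_out',
  'cancelled',
  'action_required',
]);

// Hypothetical formatter: one header line per workflow, then one line per
// job that failed in the latest attempt.
function describeFailedJobs(workflowName, latestJobs) {
  const failed = latestJobs.filter((job) =>
    FAILING_CONCLUSIONS.has(job.conclusion)
  );
  const lines = failed.map(
    (job) => `  - ${job.name} (attempt ${job.run_attempt}): ${job.conclusion}`
  );
  return [
    `${workflowName}: ${failed.length} job(s) failed in the latest attempt`,
    ...lines,
  ].join('\n');
}

console.log(
  describeFailedJobs('CI', [
    { name: 'lint', run_attempt: 2, conclusion: 'success' },
    { name: 'rspec', run_attempt: 2, conclusion: 'timed_out' },
  ])
);
```

Naming the job and its attempt number tells a developer immediately whether the block comes from a stale first attempt or from the rerun itself.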
📒 Files selected for processing (1)
.github/actions/ensure-master-docs-safety/action.yml(2 hunks)
🔇 Additional comments (3)
.github/actions/ensure-master-docs-safety/action.yml (3)
107-150: ✓ Core logic correctly implements latest-attempt checking.
The refactor properly addresses the issue: fetching jobs, identifying the latest attempt via `Math.max(run_attempt)`, and checking job conclusions from that specific attempt. The handling of empty jobs (lines 123–127) prevents Math.max edge cases, and the conclusion checks (failure, timed_out, cancelled, action_required) reasonably categorize blocking failures vs. non-blocking outcomes (skipped, neutral).
114-119: No error handling on API calls; verify intent.
The `github.rest.actions.listJobsForWorkflowRun` call has no try-catch. If the API call fails, the entire action fails, which may be intentional as a fail-safe (conservatively block docs-only skips on uncertainty). Confirm this is the desired behavior, or add handling for transient API failures.
73-76: Excellent explanatory comments.
The new comments clearly document why fetching the latest attempt is necessary and explain the GitHub API quirk where `run.conclusion` persists even after reruns. This helps future maintainers understand the design decision.
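If the fail-safe behaviour raised in the comment on lines 114-119 is intended, it could be made explicit rather than left to an uncaught exception. The following is a sketch under that assumption. `runIsSafe` is a hypothetical wrapper, `fetchJobs` stands for whatever call retrieves the jobs of a run, and `isFailing` stands for the latest-attempt check; neither name comes from the PR.

```javascript
// Hypothetical wrapper: decide whether a run permits a docs-only skip.
// Any error while fetching jobs is treated as "not safe", so uncertainty
// blocks the skip instead of silently allowing it.
async function runIsSafe(fetchJobs, isFailing) {
  try {
    const jobs = await fetchJobs();
    return !isFailing(jobs);
  } catch (error) {
    console.warn(
      `Could not fetch jobs, blocking docs-only skip: ${error.message}`
    );
    return false;
  }
}
```

Logging the reason keeps the conservative outcome while making it clear in the action output that the block came from an API problem and not from a failing job.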
Code Review: PR #2062

Summary
This PR effectively fixes a critical issue where the CI safety check was incorrectly blocking docs-only commits due to stale workflow failure states. The solution correctly addresses the root cause by checking the latest attempt of workflow runs instead of the overall run conclusion.

✅ Strengths
…management

This commit adds comprehensive guidance to prevent the type of CI breakage we experienced with the package-scripts.yml path issue.

New modular documentation files (using @ references):
- .claude/docs/testing-build-scripts.md
  - Mandatory testing after package.json/package-scripts.yml changes
  - Manual yalc publish testing requirements
  - Real example of what went wrong (7-week silent failure)
- .claude/docs/master-health-monitoring.md
  - Immediate actions after PR merges
  - How to handle broken master
  - Understanding workflow reruns and circular dependencies
- .claude/docs/managing-file-paths.md
  - Path verification checklist
  - Post-refactor validation steps
  - Real example of the package-scripts.yml bug

Enhanced existing content:
- Updated Merge Conflict Resolution Workflow with path verification steps
- Added critical testing requirements after resolving build config conflicts

Documentation files included:
- CLAUDE_MD_UPDATES.md - Detailed implementation guide
- claude-md-improvements.md - Original analysis

These changes address the root causes identified in PR #2062 and #2065.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Added table of contents to CLAUDE_MD_UPDATES.md for easier navigation
- Converted PR mentions to clickable links in master-health-monitoring.md
  - PR #2062: CI safety check fix
  - PR #2065: Breaking the circular dependency example

This makes the documentation more user-friendly and easier to reference.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
## Summary

This PR breaks the circular dependency created by previous master commits that failed the CI safety check, and adds comprehensive documentation to prevent similar issues in the future.

## Problem

After PR #2062 merged, master is still in a broken state because:

1. Commit `beb70f009` and earlier commits have workflows that failed due to the circular dependency bug
2. When we re-run those workflows, they use the **action code from that commit** (with the old bug), not from current master
3. GitHub Actions always checks out workflow files from the commit being tested, not from master
4. This means the reruns still fail with the same circular dependency issue
5. The new safety check (now fixed) correctly detects these failures and blocks subsequent docs-only commits

## Solution

This PR includes two components:

### 1. Cycle-Breaking Change

Makes a trivial non-docs change that will:

- Trigger **full CI** (not be detected as docs-only)
- Use the **fixed version** of the ensure-master-docs-safety action from PR #2062
- Pass all tests (since the fix is now in place)
- Become the new "previous commit" for future safety checks
- Break the circular dependency cycle

### 2. Comprehensive Documentation Updates

Adds modular documentation (using @ references) to prevent similar issues:

**New Documentation Files:**

- `.claude/docs/testing-build-scripts.md`
  - Mandatory testing after package.json/package-scripts.yml changes
  - Manual yalc publish testing requirements
  - Real example of what went wrong (7-week silent failure)
- `.claude/docs/master-health-monitoring.md`
  - Immediate actions after PR merges
  - How to handle broken master
  - Understanding workflow reruns and circular dependencies
- `.claude/docs/managing-file-paths.md`
  - Path verification checklist before committing
  - Post-refactor validation steps
  - Real example of the package-scripts.yml bug

**Enhanced Existing Content:**

- Updated Merge Conflict Resolution Workflow with path verification steps
- Added critical testing requirements after resolving build config conflicts

**Supporting Documentation:**

- `CLAUDE_MD_UPDATES.md` - Detailed implementation guide for applying these updates
- `claude-md-improvements.md` - Original root cause analysis

## Why This Approach

We **cannot** fix the old commits by re-running their workflows because:

- Workflow reruns use the action code from the original commit
- The bug fix in PR #2062 only helps commits that come AFTER it
- We need a new commit with passing tests to establish a clean baseline

## Changes

### Code Changes

- Updated action description for clarity (minor wording improvement)

### Documentation Changes

- Added 3 new modular documentation files using @ references
- Enhanced merge conflict resolution workflow
- Included comprehensive guides to prevent future issues

## Root Cause Addressed

The documentation specifically addresses the failures that led to this issue:

- ❌ No manual testing of yalc publish → Now required after build config changes
- ❌ No path verification after structure changes → Now has checklist
- ❌ No master health monitoring → Now has immediate action plan
- ❌ Silent failures went unnoticed for weeks → Now has detection strategies

## Test Plan

- [x] Change is non-docs (triggers full CI)
- [ ] CI passes with the fixed action
- [ ] Future docs-only commits will check this commit and find it passed
- [ ] Circular dependency is broken
- [ ] Documentation @ references work correctly

## Related

- Fixes the remaining issues from PR #2062
- Implements the fix for the circular dependency bug
- Adds preventive documentation based on root cause analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Clarified CI action description to better reflect behavior for docs-only commits.
  * Added new guides on managing file paths, testing build/package scripts, and monitoring master branch health.
  * Expanded CLAUDE.md with conflict-resolution, testing, and architecture guidance; included an incident write-up and validation/implementation plans.
* **Chores**
  * Added comprehensive operational checklists and workflows to reduce CI breakage during large refactors and build/config changes.

<sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude <[email protected]>
…se-otp-timing

* origin/master: (27 commits)
  Fix doctor command false version mismatch for beta/prerelease versions (#2064)
  Fix beta/RC version handling in generator (#2066)
  Document Rails Engine development nuances and add tests for automatic rake task loading (#2067)
  Add /run-skipped-tests as alias for /run-skipped-ci (#XXXX) (#2068)
  Fix: Add Rails 5.2-6.0 compatibility for compact_blank (#2058)
  Break CI circular dependency with non-docs change (#2065)
  Fix CI safety check to evaluate latest workflow attempt (#2062)
  Fix yalc publish (#2054)
  Add Shakapacker 9.0+ private_output_path integration for server bundles (#2028)
  Consolidate all beta versions into v16.2.0.beta.10 (#2057)
  Improve reliability of CI debugging scripts (#2056)
  Clarify monorepo changelog structure in documentation (#2055)
  Bump version to 16.2.0.beta.10
  Bump version to 16.2.0.beta.9
  Fix duplicate rake task execution by removing explicit task loading (#2052)
  Simplify precompile hook and restore Pro dummy app to async loading (#2053)
  Add Shakapacker precompile hook with ReScript support to Pro dummy app (#1977)
  Guard master docs-only pushes and ensure full CI runs (#2042)
  Refactor: Extract JS dependency management into shared module (#2051)
  Add workflow to detect invalid CI command attempts (#2037)
  ...

# Conflicts:
#	rakelib/release.rake
Summary
Fixes the `ensure-master-docs-safety` GitHub Action to check the latest attempt of workflow runs instead of the overall run conclusion. This prevents false positives when workflows are manually re-run and succeed.
Problem
The safety check was blocking docs-only commits from skipping CI when previous master commits had workflows marked as "failed", even if those workflows had been successfully re-run. This happened because:
- The `run.conclusion` field is never updated to "success" when reruns succeed
- The check evaluated `run.conclusion`, not the actual latest attempt
This created a situation where:
1. A commit would fail a workflow
2. The workflow would be manually re-run and succeed
3. Subsequent commits would still be blocked because `run.conclusion` was still "failure"
Solution
Modified the action to:
- Fetch the jobs for each workflow run
- Use the `run_attempt` number to identify the latest attempt
- Check whether any jobs in the latest attempt failed
- Only block if the latest attempt has failures
This allows the safety check to correctly recognize when failures have been resolved via manual reruns, while still preventing docs-only skips when there are genuine unresolved failures.
Test Plan
🤖 Generated with Claude Code