Skip to content

Conversation

@justin808
Copy link
Member

@justin808 justin808 commented Nov 19, 2025

Summary

Fixes the ensure-master-docs-safety GitHub Action to check the latest attempt of workflow runs instead of the overall run conclusion. This prevents false positives when workflows are manually re-run and succeed.

Problem

The safety check was blocking docs-only commits from skipping CI when previous master commits had workflows marked as "failed", even if those workflows had been successfully re-run. This happened because:

  1. GitHub marks a workflow run as "failed" even after a manual rerun succeeds
  2. The run.conclusion field is never updated to "success" when reruns succeed
  3. The safety check was only looking at run.conclusion, not the actual latest attempt

This created a situation where:

  • Commit A fails a workflow
  • The workflow is manually re-run and succeeds
  • Commit B (docs-only) is blocked because commit A's workflow is still marked as "failed"
  • This continues indefinitely, blocking all subsequent commits

Solution

Modified the action to:

  1. Fetch the jobs for each workflow run via the GitHub API
  2. Find the maximum run_attempt number to identify the latest attempt
  3. Filter jobs to only those from the latest attempt
  4. Check if any jobs in the latest attempt have failed conclusions
  5. Only block docs-only commits if the latest attempt has actual failures

This allows the safety check to correctly recognize when failures have been resolved via manual reruns, while still preventing docs-only skips when there are genuine unresolved failures.

Test Plan

  • Code changes reviewed for correctness
  • Linting passes locally
  • CI passes on this PR (will verify the fix works)
  • After merge, verify that docs-only commits are no longer blocked by previously-fixed workflow failures

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Improved the accuracy of documentation safety checks by enhancing how workflow failures are detected and assessed, ensuring more precise identification of build issues during the latest workflow attempts.

✏️ Tip: You can customize this high-level summary in your review settings.

…run conclusion

The ensure-master-docs-safety action was preventing docs-only commits from
skipping CI when previous master commits had failing workflows. However, it
was checking the overall workflow run conclusion, which GitHub marks as
"failed" even if a manual rerun succeeded.

This created a problematic situation where:
1. A commit would fail a workflow
2. The workflow would be manually re-run and succeed
3. But subsequent commits would still be blocked because the run.conclusion
   was still "failure" (GitHub doesn't update this on rerun)

The fix:
- Fetch the jobs for each workflow run
- Find the latest attempt number (run_attempt)
- Check if any jobs in the LATEST attempt failed
- Only block if the latest attempt has failures

This allows the safety check to correctly recognize when a failure has been
resolved via a manual rerun, while still preventing docs-only skips when
there are genuine unresolved failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 19, 2025

Walkthrough

The workflow failure detection logic in the ensure-master-docs-safety action has been refactored. Instead of relying on the overall run conclusion, it now evaluates the latest job attempt for each workflow run, determining failure status based on individual job conclusions within that attempt.

Changes

Cohort / File(s) Summary
Workflow failure detection refactor
.github/actions/ensure-master-docs-safety/action.yml
Replaced run.conclusion-based failure checking with per-job latest-attempt assessment. Now fetches all jobs per run, identifies the latest run_attempt, and evaluates job conclusions from that attempt. Marks runs as failing if any job in the latest attempt has conclusions in {failure, timed_out, cancelled, action_required}. Treats runs with no jobs as incomplete.

Sequence Diagram(s)

sequenceDiagram
    participant Action as Ensure Docs Safety
    participant API as GitHub API
    participant Check as Job Assessment
    
    Action->>API: Fetch latest workflow runs
    activate API
    API-->>Action: Workflow run list
    deactivate API
    
    loop For each run
        Action->>API: Fetch all jobs for run
        activate API
        API-->>Action: Job list with run_attempt
        deactivate API
        
        Note over Action: Identify latest run_attempt
        
        rect rgb(200, 220, 255)
            Note over Check: NEW: Analyze latest attempt jobs only
            Action->>Check: Evaluate job conclusions<br/>(failure|timed_out|cancelled|action_required)
        end
        
        alt Any job failed in latest attempt
            Action->>Action: Mark run as FAILED
        else No jobs found
            Action->>Action: Mark run as INCOMPLETE
        else All jobs succeeded
            Action->>Action: Mark run as SUCCESS
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring extra attention:
    • Logic correctness for latest attempt identification and filtering
    • Handling of edge cases (missing jobs, no attempts, API response variations)
    • Async/await patterns in API call sequencing
    • Verification that all failure states {failure, timed_out, cancelled, action_required} are appropriately captured

Suggested labels

bug, review-needed

Poem

🐰 A rabbit hops through workflow logs,
Finding truth in each job's fate—
No more guessing by the run's old fog,
Just the latest attempt, accurate!
Failures caught with clearer sight,
Docs stay safe, day and night. ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix CI safety check to evaluate latest workflow attempt' directly and accurately summarizes the main change: updating the CI safety check to evaluate the latest workflow attempt instead of overall run conclusion.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch justin808/fix-ci-safety-loop

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link

claude bot commented Nov 19, 2025

Code Review

I've reviewed PR #2062 and the changes look solid overall. Here's my detailed feedback:

✅ Strengths

  1. Correct Problem Identification: The fix properly addresses a real GitHub API limitation where run.conclusion doesn't update after successful reruns.

  2. Well-Documented Logic: Excellent inline comments explaining the rationale (lines 74-76, 107-109).

  3. Preserves Existing Safety Checks: The failure condition list remains comprehensive (failure, timed_out, cancelled, action_required).

  4. Maintains Early Exit Logic: Still handles edge cases like no workflows found (line 63-66) and incomplete runs (line 92-105).

🔍 Potential Issues & Suggestions

1. API Rate Limiting Concern (Medium Priority)

Issue: The new code makes an additional API call per workflow run (line 114-119). For repos with many workflows, this could hit rate limits.

Recommendation: Add error handling for rate limit responses with try/catch block around the API call.

2. Pagination Not Handled for Jobs (High Priority)

Issue: Line 118 uses per_page: 100, but some workflow runs may have >100 jobs (especially matrix builds).

Recommendation: Use pagination with github.paginate.iterator similar to how workflow runs are fetched, or at minimum check jobsResponse.data.total_count and log a warning if it exceeds 100.

3. Edge Case: Empty Jobs Array (Low Priority)

Issue: Line 123-127 treats no jobs as incomplete/failing. This might be overly conservative for workflows that were cancelled before any jobs started.

Consideration: Should this check run.status === 'completed' to distinguish between "no jobs yet" vs "completed with no jobs"? This may be acceptable as-is, but worth documenting the rationale.

4. Array.prototype.map with Potential Empty Array

Issue: Line 130 could fail if jobs array exists but all jobs lack run_attempt field.

Recommendation: Filter out undefined values before calling Math.max, or add a check to ensure run_attempt exists on jobs.

🎯 Testing Recommendations

Since this is critical CI infrastructure:

  1. Manual Testing: Test with a workflow that has:

    • Initial failure → manual rerun → success
    • Multiple reruns (attempt 3+)
    • Matrix jobs with >100 total jobs
    • Cancelled runs with no jobs
  2. Consider Adding Tests: While GitHub Actions are hard to unit test, consider adding example workflow run scenarios in comments to document expected behavior.

📊 Performance Impact

  • Before: 1 API call (list workflow runs)
  • After: 1 + N API calls (where N = unique workflows on previous commit)
  • Typical Impact: For ~10 workflows, this is acceptable
  • Mitigation: Consider caching if this becomes an issue

🔒 Security Considerations

No security concerns identified. The code:

  • Uses authenticated GitHub API via github-script action
  • Doesn't expose sensitive data
  • Only reads public workflow information

📝 Minor Nit

Line 158: The error message still shows run.conclusion which might now be misleading since you're checking job conclusions. Consider updating the message to clarify that failing jobs in the latest attempt were detected.


Summary

The fix is fundamentally sound and addresses the core issue. The main concerns are:

  1. Must add: Pagination for jobs API (high priority)
  2. Should add: Error handling for rate limits (medium priority)
  3. Nice to have: Edge case handling for empty arrays (low priority)

Great work documenting the problem and solution! The PR description is exemplary. 🎉

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/actions/ensure-master-docs-safety/action.yml (1)

157-159: Consider clarifying error message to reflect job-level determination.

The error message references run.conclusion, but the failure determination is now based on individual job conclusions from the latest attempt. For clarity, consider whether the message should indicate which specific jobs failed or note that it's checking the latest attempt's jobs. This will reduce confusion for developers investigating why their docs-only commit was blocked.

For example, you might include a summary like: "...failed workflows (from latest rerun attempt):" or collect job names from latestJobs that actually failed.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between beb70f0 and 8a7e317.

📒 Files selected for processing (1)
  • .github/actions/ensure-master-docs-safety/action.yml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: claude-review
🔇 Additional comments (3)
.github/actions/ensure-master-docs-safety/action.yml (3)

107-150: ✓ Core logic correctly implements latest-attempt checking.

The refactor properly addresses the issue: fetching jobs, identifying the latest attempt via Math.max(run_attempt), and checking job conclusions from that specific attempt. The handling of empty jobs (lines 123–127) prevents Math.max edge cases, and the conclusion checks (failure, timed_out, cancelled, action_required) reasonably categorize blocking failures vs. non-blocking outcomes (skipped, neutral).


114-119: No error handling on API calls—verify intent.

The github.rest.actions.listJobsForWorkflowRun call has no try-catch. If the API call fails, the entire action fails, which may be intentional as a fail-safe (conservatively block docs-only skips on uncertainty). Confirm this is the desired behavior, or add handling for transient API failures.


73-76: Excellent explanatory comments.

The new comments clearly document why fetching the latest attempt is necessary and explain the GitHub API quirk where run.conclusion persists even after reruns. This helps future maintainers understand the design decision.

@justin808 justin808 merged commit 50f3baf into master Nov 19, 2025
28 checks passed
@justin808 justin808 deleted the justin808/fix-ci-safety-loop branch November 19, 2025 23:09
@claude
Copy link

claude bot commented Nov 19, 2025

Code Review: PR #2062

Summary

This PR effectively fixes a critical issue where the CI safety check was incorrectly blocking docs-only commits due to stale workflow failure states. The solution correctly addresses the root cause by checking the latest attempt of workflow runs instead of the overall run conclusion.


✅ Strengths

  1. Correct Problem Analysis: The PR description clearly identifies the GitHub API behavior where run.conclusion doesn't update after successful reruns.

  2. Well-Documented Code: Excellent inline comments explaining:

    • Why fetching jobs is necessary (lines 74-76, 107-109)
    • What each failure conclusion means (lines 136-142)
    • The logic for finding latest attempts (line 129)
  3. Proper API Usage: Correctly uses github.rest.actions.listJobsForWorkflowRun to get job-level details.

  4. Maintains Conservative Approach: Correctly treats edge cases (no jobs found) as failures, which is the safe default.


⚠️ Issues & Recommendations

🐛 Critical: Pagination Missing for Jobs API

Location: Line 114-119

Issue: The listJobsForWorkflowRun call uses per_page: 100 but doesn't handle pagination. Workflow runs with >100 jobs will have incomplete data, potentially missing failures in jobs beyond the first page.

Impact: False negatives - a failure in job #101 would be missed, allowing docs-only skips when they shouldn't be allowed.

Fix: Use pagination iterator similar to the workflow runs fetch (lines 35-61).

🔍 Performance: Multiple API Calls in Loop

Location: Lines 112-150

Issue: The code makes one API call per workflow run sequentially. For repos with many workflows, this could be slow and hit rate limits.

Current: O(n) API calls where n = number of unique workflows
Typical case: 3-5 workflows = 3-5 sequential API calls
Edge case: 10+ workflows could cause noticeable delays

Impact: Minor performance concern, but acceptable for most cases. GitHub Actions rate limit is 1000 req/hr, so this is unlikely to be a problem in practice.

Potential optimization (if needed in future):

  • Batch requests using Promise.all() for parallel execution
  • Cache job results if the same workflow is checked multiple times

🧪 Test Coverage

Missing: This PR doesn't include automated tests for the new logic.

Recommended test cases:

  1. Workflow with successful rerun should NOT block docs-only skip
  2. Workflow with failed rerun should block docs-only skip
  3. Workflow with multiple reruns - only latest attempt matters
  4. Workflow with >100 jobs (pagination test)
  5. Workflow with no jobs (edge case)

Note: Testing GitHub Actions can be challenging. Consider:

  • Integration test by triggering actual workflow runs
  • Unit test by mocking GitHub API responses
  • Manual verification (which was likely done)

📝 Minor: Inconsistent Error Message

Location: Line 158

Issue: The error message shows run.conclusion which might not accurately reflect the latest attempt's status anymore.

Suggestion: Update message to clarify it's based on latest attempt analysis rather than run.conclusion value.


🔒 Security Considerations

No security concerns identified. The code:

  • Uses authenticated GitHub API calls (implicit via github-script action)
  • Doesn't expose sensitive data
  • Properly validates inputs (checking for empty SHA, null checks)
  • Uses GitHub's official Octokit client

🎯 Performance Considerations

API Rate Limits:

  • Current: ~5 API calls per docs-only commit check (1 for runs + 1-4 for jobs)
  • GitHub Actions limit: 1000 requests/hour
  • Verdict: Acceptable. Would need 200+ docs-only commits/hour to hit limits.

Execution Time:

  • Sequential API calls: ~1-2 seconds per workflow
  • Typical case (3-5 workflows): 3-10 seconds total
  • Verdict: Minor delay but acceptable for safety check

📋 Checklist Review

From PR description:

  • ✅ Code changes reviewed for correctness
  • ✅ Linting passes locally
  • ⏳ CI passes on this PR (merged, so presumably passed)
  • ⏳ Post-merge verification pending

🎓 Best Practices Assessment

Excellent:

  • ✅ Clear problem statement with reproduction steps
  • ✅ Thorough inline documentation
  • ✅ Conservative failure handling
  • ✅ Maintains existing safety guarantees

Could be improved:

  • ⚠️ Add pagination for job fetching
  • ⚠️ Add automated tests
  • ⚠️ Update error message for clarity

🏁 Final Verdict

Overall Quality: ⭐⭐⭐⭐ (4/5)

This is a well-implemented fix that correctly solves the identified problem. The code is readable, well-documented, and handles most edge cases appropriately.

Primary concern: Missing pagination for jobs API could cause false negatives in edge cases.

Recommendation: Consider a follow-up PR to add pagination for complete robustness, though the current implementation should work fine for typical React on Rails CI workflows.


📚 References


Review conducted using repository's CLAUDE.md guidelines

justin808 added a commit that referenced this pull request Nov 20, 2025
…management

This commit adds comprehensive guidance to prevent the type of CI breakage
we experienced with the package-scripts.yml path issue.

New modular documentation files (using @ references):
- .claude/docs/testing-build-scripts.md
  - Mandatory testing after package.json/package-scripts.yml changes
  - Manual yalc publish testing requirements
  - Real example of what went wrong (7-week silent failure)

- .claude/docs/master-health-monitoring.md
  - Immediate actions after PR merges
  - How to handle broken master
  - Understanding workflow reruns and circular dependencies

- .claude/docs/managing-file-paths.md
  - Path verification checklist
  - Post-refactor validation steps
  - Real example of the package-scripts.yml bug

Enhanced existing content:
- Updated Merge Conflict Resolution Workflow with path verification steps
- Added critical testing requirements after resolving build config conflicts

Documentation files included:
- CLAUDE_MD_UPDATES.md - Detailed implementation guide
- claude-md-improvements.md - Original analysis

These changes address the root causes identified in PR #2062 and #2065.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
justin808 added a commit that referenced this pull request Nov 20, 2025
- Added table of contents to CLAUDE_MD_UPDATES.md for easier navigation
- Converted PR mentions to clickable links in master-health-monitoring.md
  - PR #2062: CI safety check fix
  - PR #2065: Breaking the circular dependency example

This makes the documentation more user-friendly and easier to reference.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
justin808 added a commit that referenced this pull request Nov 20, 2025
## Summary

This PR breaks the circular dependency created by previous master
commits that failed the CI safety check, and adds comprehensive
documentation to prevent similar issues in the future.

## Problem

After PR #2062 merged, master is still in a broken state because:

1. Commit `beb70f009` and earlier commits have workflows that failed due
to the circular dependency bug
2. When we re-run those workflows, they use the **action code from that
commit** (with the old bug), not from current master
3. GitHub Actions always checks out workflow files from the commit being
tested, not from master
4. This means the reruns still fail with the same circular dependency
issue
5. The new safety check (now fixed) correctly detects these failures and
blocks subsequent docs-only commits

## Solution

This PR includes two components:

### 1. Cycle-Breaking Change

Makes a trivial non-docs change that will:
- Trigger **full CI** (not be detected as docs-only)
- Use the **fixed version** of the ensure-master-docs-safety action from
PR #2062
- Pass all tests (since the fix is now in place)
- Become the new "previous commit" for future safety checks
- Break the circular dependency cycle

### 2. Comprehensive Documentation Updates

Adds modular documentation (using @ references) to prevent similar
issues:

**New Documentation Files:**
- `.claude/docs/testing-build-scripts.md`
  - Mandatory testing after package.json/package-scripts.yml changes
  - Manual yalc publish testing requirements
  - Real example of what went wrong (7-week silent failure)

- `.claude/docs/master-health-monitoring.md`
  - Immediate actions after PR merges
  - How to handle broken master
  - Understanding workflow reruns and circular dependencies

- `.claude/docs/managing-file-paths.md`
  - Path verification checklist before committing
  - Post-refactor validation steps
  - Real example of the package-scripts.yml bug

**Enhanced Existing Content:**
- Updated Merge Conflict Resolution Workflow with path verification
steps
- Added critical testing requirements after resolving build config
conflicts

**Supporting Documentation:**
- `CLAUDE_MD_UPDATES.md` - Detailed implementation guide for applying
these updates
- `claude-md-improvements.md` - Original root cause analysis

## Why This Approach

We **cannot** fix the old commits by re-running their workflows because:
- Workflow reruns use the action code from the original commit
- The bug fix in PR #2062 only helps commits that come AFTER it
- We need a new commit with passing tests to establish a clean baseline

## Changes

### Code Changes
- Updated action description for clarity (minor wording improvement)

### Documentation Changes
- Added 3 new modular documentation files using @ references
- Enhanced merge conflict resolution workflow
- Included comprehensive guides to prevent future issues

## Root Cause Addressed

The documentation specifically addresses the failures that led to this
issue:

- ❌ No manual testing of yalc publish → Now required after build config
changes
- ❌ No path verification after structure changes → Now has checklist
- ❌ No master health monitoring → Now has immediate action plan
- ❌ Silent failures went unnoticed for weeks → Now has detection
strategies

## Test Plan

- [x] Change is non-docs (triggers full CI)
- [ ] CI passes with the fixed action
- [ ] Future docs-only commits will check this commit and find it passed
- [ ] Circular dependency is broken
- [ ] Documentation @ references work correctly

## Related

- Fixes the remaining issues from PR #2062
- Implements the fix for the circular dependency bug
- Adds preventive documentation based on root cause analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Clarified CI action description to better reflect behavior for
docs-only commits.
* Added new guides on managing file paths, testing build/package
scripts, and monitoring master branch health.
* Expanded CLAUDE.md with conflict-resolution, testing, and architecture
guidance; included an incident write-up and validation/implementation
plans.

* **Chores**
* Added comprehensive operational checklists and workflows to reduce CI
breakage during large refactors and build/config changes.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude <[email protected]>
@justin808 justin808 mentioned this pull request Nov 20, 2025
66 tasks
justin808 added a commit that referenced this pull request Nov 20, 2025
…se-otp-timing

* origin/master: (27 commits)
  Fix doctor command false version mismatch for beta/prerelease versions (#2064)
  Fix beta/RC version handling in generator (#2066)
  Document Rails Engine development nuances and add tests for automatic rake task loading (#2067)
  Add /run-skipped-tests as alias for /run-skipped-ci (#XXXX) (#2068)
  Fix: Add Rails 5.2-6.0 compatibility for compact_blank (#2058)
  Break CI circular dependency with non-docs change (#2065)
  Fix CI safety check to evaluate latest workflow attempt (#2062)
  Fix yalc publish (#2054)
  Add Shakapacker 9.0+ private_output_path integration for server bundles (#2028)
  Consolidate all beta versions into v16.2.0.beta.10 (#2057)
  Improve reliability of CI debugging scripts (#2056)
  Clarify monorepo changelog structure in documentation (#2055)
  Bump version to 16.2.0.beta.10
  Bump version to 16.2.0.beta.9
  Fix duplicate rake task execution by removing explicit task loading (#2052)
  Simplify precompile hook and restore Pro dummy app to async loading (#2053)
  Add Shakapacker precompile hook with ReScript support to Pro dummy app (#1977)
  Guard master docs-only pushes and ensure full CI runs (#2042)
  Refactor: Extract JS dependency management into shared module (#2051)
  Add workflow to detect invalid CI command attempts (#2037)
  ...

# Conflicts:
#	rakelib/release.rake
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants