
Conversation

@jeremyeder
Collaborator

Summary

Add exponential backoff retry pattern to harden PR label addition against transient GitHub API failures.

Problem

The Amber workflow occasionally fails with RequestError [HttpError]: Unexpected end of JSON input, status: 500 when adding labels to PRs. This is a transient GitHub API error that should be retried.

Solution

  • Add retryWithBackoff() helper function with exponential backoff (sketched below)
  • 3 retry attempts with 1s, 2s, 4s delays between attempts
  • Only retries on retriable errors (HTTP 5xx, JSON parse errors)
  • Fails fast on client errors (4xx) since retrying won't help
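For reference, a minimal sketch of the helper as first described here, assembled from the bullets above (the authoritative version lives in .github/workflows/amber-issue-handler.yml, so treat this as an illustration rather than the merged code):

async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // First iteration: retry only on HTTP 5xx or JSON parse errors
      const isRetriable = error.status >= 500 || error.message.includes('JSON');
      if (!isRetriable || i === maxRetries) throw error;
      const delay = initialDelay * Math.pow(2, i); // 1s, 2s, 4s
      console.log(`Attempt ${i + 1} failed, retrying in ${delay}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage at the label-addition call site (issue number and label names are placeholders):
await retryWithBackoff(() =>
  github.rest.issues.addLabels({
    owner: context.repo.owner,
    repo: context.repo.repo,
    issue_number: prNumber, // placeholder: the PR number the workflow already holds
    labels: ['amber'],      // placeholder
  })
);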

Test Plan

  • Workflow logic reviewed
  • Test with intentional API failure (manual testing required)
  • Monitor next Amber automation run for successful label addition

Changes

  • Modified .github/workflows/amber-issue-handler.yml:
    • Added retry helper function (lines 324-342)
    • Wrapped label addition with retry logic (lines 379-387)

References

🤖 Generated with Claude Code (https://claude.com/claude-code)

…ber workflow

Add exponential backoff retry pattern to harden PR label addition against
transient GitHub API failures (HTTP 500 errors, JSON parse errors).

- Add retryWithBackoff() helper function (3 retries, 1s/2s/4s delays)
- Wrap github.rest.issues.addLabels() with retry logic
- Only retry on retriable errors (5xx, JSON), fail fast on client errors
- Log retry attempts for debugging

Fixes intermittent 'Unexpected end of JSON input, status: 500' errors
during PR creation workflow.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions


Improvements based on automated review:

1. Enhanced retry detection to handle network errors
   - Now retries on: !error.status (network errors), 5xx errors
   - Simplified logic: assume all errors retriable except 4xx

2. Added error details to retry logs
   - Include error status and message in console output
   - Format: 'Attempt N failed (status: message), retrying in Xms...'

3. Added explicit final throw for defensive coding
   - Covers the otherwise-unreachable path after the loop (should never occur)
   - Makes function behavior crystal clear

Addresses review feedback from Claude Code Review on PR ambient-code#388

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jeremyeder
Collaborator Author

Review Feedback Addressed ✅

I've addressed the high-priority feedback from the automated code review:

1. Enhanced Retry Detection for Network Errors

Before:

const isRetriable = error.status >= 500 || error.message.includes('JSON');

After:

// Retry on: network errors (undefined status), 5xx errors, or specific error patterns
const isRetriable = !error.status || error.status >= 500;

Why: Network errors (timeouts, connection resets, DNS failures) don't have a .status property. The new logic assumes all errors are retriable except 4xx client errors, which is more robust.

2. Added Error Details to Logs

Before:

console.log(`Attempt ${i + 1} failed, retrying in ${delay}ms...`);

After:

const errorMsg = error.message || 'Unknown error';
const errorStatus = error.status || 'network error';
console.log(`Attempt ${i + 1} failed (${errorStatus}: ${errorMsg}), retrying in ${delay}ms...`);

Why: Debugging is much easier when you can see exactly what error occurred on each retry attempt.

3. Added Explicit Final Throw

After the loop:

// Defensive: Should never reach here due to throw in loop, but explicit for clarity
throw new Error('retryWithBackoff: max retries exceeded');

Why: Makes the function's contract explicit and guarantees the function can never silently fall through after the loop.


These changes make the retry logic more robust against network failures while improving debuggability. Ready for re-review! 🚀
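Putting the three changes together, the revised helper looks roughly like this (assembled from the before/after snippets above; the workflow file remains the source of truth):

async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Retry on network errors (undefined status) and 5xx; fail fast on 4xx
      const isRetriable = !error.status || error.status >= 500;
      if (!isRetriable || i === maxRetries) throw error;
      const delay = initialDelay * Math.pow(2, i); // 1s, 2s, 4s
      const errorMsg = error.message || 'Unknown error';
      const errorStatus = error.status || 'network error';
      console.log(`Attempt ${i + 1} failed (${errorStatus}: ${errorMsg}), retrying in ${delay}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Defensive: should never reach here due to throw in loop, but explicit for clarity
  throw new Error('retryWithBackoff: max retries exceeded');
}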

@github-actions
Contributor

Claude Code Review

Summary

This PR implements a robust retry mechanism with exponential backoff to handle transient GitHub API failures when adding labels to PRs in the Amber workflow. The implementation has been refined in response to previous review feedback and now includes comprehensive error handling for network failures, detailed logging, and defensive coding practices.

Overall Assessment: APPROVED WITH CONFIDENCE - This is production-ready defensive code that addresses a real issue with minimal risk.


Issues by Severity

No blocking, critical, or major issues found. All previous concerns have been addressed.

🔵 Minor Issues

1. Retry Logic Could Be Applied to Other GitHub API Calls

The workflow makes several other GitHub API calls that could benefit from the same retry logic:

  • Line 258: github.rest.issues.createComment (report no changes)
  • Line 298: github.rest.issues.createComment (validation comment)
  • Line 352: github.rest.pulls.create (PR creation)
  • Line 396: github.rest.issues.createComment (link PR to issue)
  • Line 409: github.rest.issues.createComment (failure notification)
  • Line 434: github.rest.issues.createComment (report failure)

Impact: Low - These are in error handling paths or less critical flows. Label addition was the observed failure point.

Recommendation: Consider wrapping the PR creation call (line 352) with retry logic in a future enhancement, as it's the most critical operation; a sketch follows below. Comment additions are less critical but could be hardened as well.

Reference: Lines 258, 298, 352, 396, 409, 434.
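A hedged sketch of that future enhancement, reusing the helper this PR introduces (prTitle, branchName, and prBody are placeholders for whatever the workflow already builds):

// Hypothetical follow-up, not part of this PR: harden PR creation the same way
const pr = await retryWithBackoff(() =>
  github.rest.pulls.create({
    owner: context.repo.owner,
    repo: context.repo.repo,
    title: prTitle,    // placeholder
    head: branchName,  // placeholder
    base: 'main',
    body: prBody,      // placeholder
  })
);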


2. Retry Helper Function Is Duplicated in Workflow

The retryWithBackoff function is defined inline in this workflow. If retry logic is needed in other workflows (e.g., amber-dependency-sync.yml), this code would be duplicated.

Impact: Very Low - Currently only used in one place, and GitHub Actions doesn't have great patterns for sharing JavaScript code between workflows.

Recommendation: Document this as a reusable pattern if other workflows need similar reliability. For now, keeping it inline is pragmatic.


3. Max Retries Constant Is Embedded

The function signature has maxRetries = 3 as a default parameter, but every call site relies on the default, so it always retries 3 times. This is fine for the current use case but makes the parameter somewhat misleading.

Impact: Very Low - The default is sensible (3 retries = 1s + 2s + 4s = 7s total max delay).

Recommendation: Consider adding a comment explaining why 3 retries was chosen (based on typical transient API failure patterns).

Reference: Line 326.


Positive Highlights

Addresses Real Production Issue - Directly solves the transient API failure observed in PR #387 (RequestError [HttpError]: Unexpected end of JSON input, status: 500)

Comprehensive Error Handling - Now handles:

  • HTTP 5xx errors (server-side failures)
  • Network errors (no status property)
  • JSON parse errors (malformed responses)
  • All unexpected errors (defensive default)

Excellent Retry Strategy - Exponential backoff with sensible defaults:

  • 4 attempts total (1 initial + 3 retries)
  • 1s, 2s, 4s delays (max 7s total)
  • Fails fast on 4xx client errors

Detailed Logging - Retry attempts include error status and message for debugging

Defensive Coding - Explicit final throw ensures no unreachable code paths

Minimal Scope - Only touches the label addition logic, reducing risk

Well-Documented - Inline comments explain retry strategy and rationale

Follows Repository Patterns - Similar to retry logic in:

  • Frontend: components/frontend/src/lib/query-client.ts (React Query exponential backoff)
  • Backend: components/backend/handlers/helpers.go (RetryWithBackoff function)

No Security Concerns - No user input, no credential handling, no injection risks

No Performance Impact - Only adds latency on the failure path (where the workflow would otherwise fail outright)

Iterative Improvement - Author addressed all previous review feedback comprehensively


Recommendations

Before Merge (Optional)

  1. Add inline comment explaining retry count choice:
    // 3 retries = max 7s total delay (1s + 2s + 4s), balancing reliability vs workflow timeout
    async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {

Future Enhancements (Not This PR)

  1. Consider applying retry logic to PR creation (line 352) - most critical API call
  2. Monitor workflow runs - validate that retries work in production
  3. Document retry pattern in workflow README if reused elsewhere

Code Quality Assessment

  • JavaScript Best Practices: Proper async/await, const/let usage
  • Error Handling: Comprehensive, with graceful degradation
  • Maintainability: Clear function naming, well-commented
  • Testing: Difficult to unit test GitHub Actions, but logic is sound
  • Documentation: PR description clearly explains problem and solution
  • Security: No vulnerabilities introduced
  • Performance: Minimal impact, only on failure path
  • Compatibility: Works with existing Octokit error handling

Architecture & Design Notes

Pattern Alignment:
This implementation aligns well with existing retry patterns in the codebase:

  • Frontend (query-client.ts:19): retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 30000)
  • Backend (helpers.go:36): delay := time.Duration(float64(initialDelay) * math.Pow(2, float64(i)))

The workflow implementation uses the same exponential backoff formula (initialDelay * Math.pow(2, i)), ensuring consistency across the platform.
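Concretely, the shared formula produces the same delay sequence everywhere (values shown for initialDelay = 1000; the 30s cap is specific to the frontend):

// delay(i) = initialDelay * 2^i  ->  1000ms, 2000ms, 4000ms for i = 0, 1, 2
const delays = [0, 1, 2].map((i) => 1000 * Math.pow(2, i)); // [1000, 2000, 4000]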

Why This Approach:

  • GitHub Actions workflows don't have direct access to shared libraries
  • Inline implementation is standard practice for GitHub Actions scripts
  • The retry logic is specific to Octokit error handling patterns
  • Keeps the workflow self-contained and easy to debug

Test Plan Validation

From PR description:

  • ✅ Workflow logic reviewed
  • ⏳ Test with intentional API failure (manual testing required)
  • ⏳ Monitor next Amber automation run

Recommendation: The next Amber workflow run will validate this in production. Consider manually triggering a test issue with the amber:auto-fix label to verify behavior.


Final Verdict

This PR represents high-quality defensive programming that:

  1. Solves a documented production issue
  2. Follows established patterns in the codebase
  3. Includes comprehensive error handling
  4. Has minimal risk and no breaking changes
  5. Is well-documented and maintainable

The code is production-ready and ready to merge. 🚀

The only suggestions are minor documentation enhancements and future considerations for applying this pattern more broadly.


🤖 Generated by Claude Code Review

@jeremyeder merged commit f7288ce into ambient-code:main Dec 4, 2025
21 checks passed
