
Conversation

@jeremyeder
Collaborator

Summary

Add exponential backoff retry pattern to harden PR label addition against transient GitHub API failures.

Problem

The Amber workflow occasionally fails with RequestError [HttpError]: Unexpected end of JSON input, status: 500 when adding labels to PRs. This is a transient GitHub API error that should be retried.

Solution

  • Add retryWithBackoff() helper function with exponential backoff (sketched below)
  • 3 retry attempts with 1s, 2s, 4s delays between attempts
  • Only retries on retriable errors (HTTP 5xx, JSON parse errors)
  • Fails fast on client errors (4xx) since retrying won't help
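For reference, a minimal sketch of the helper as first described here, assembled from the bullets above (the authoritative version lives in .github/workflows/amber-issue-handler.yml, so treat this as an illustration rather than the merged code):

async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // First iteration: retry only on HTTP 5xx or JSON parse errors
      const isRetriable = error.status >= 500 || error.message.includes('JSON');
      if (!isRetriable || i === maxRetries) throw error;
      const delay = initialDelay * Math.pow(2, i); // 1s, 2s, 4s
      console.log(`Attempt ${i + 1} failed, retrying in ${delay}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage at the label-addition call site (issue number and label names are placeholders):
await retryWithBackoff(() =>
  github.rest.issues.addLabels({
    owner: context.repo.owner,
    repo: context.repo.repo,
    issue_number: prNumber, // placeholder: the PR number the workflow already holds
    labels: ['amber'],      // placeholder
  })
);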

Test Plan

  • Workflow logic reviewed
  • Test with intentional API failure (manual testing required)
  • Monitor next Amber automation run for successful label addition

Changes

  • Modified .github/workflows/amber-issue-handler.yml:
    • Added retry helper function (lines 324-342)
    • Wrapped label addition with retry logic (lines 379-387)

References

🤖 Generated with Claude Code (https://claude.com/claude-code)

…ber workflow

Add exponential backoff retry pattern to harden PR label addition against
transient GitHub API failures (HTTP 500 errors, JSON parse errors).

- Add retryWithBackoff() helper function (3 retries, 1s/2s/4s delays)
- Wrap github.rest.issues.addLabels() with retry logic
- Only retry on retriable errors (5xx, JSON), fail fast on client errors
- Log retry attempts for debugging

Fixes intermittent 'Unexpected end of JSON input, status: 500' errors
during PR creation workflow.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@github-actions


Improvements based on automated review:

1. Enhanced retry detection to handle network errors
   - Now retries on: !error.status (network errors), 5xx errors
   - Simplified logic: assume all errors retriable except 4xx

2. Added error details to retry logs
   - Include error status and message in console output
   - Format: 'Attempt N failed (status: message), retrying in Xms...'

3. Added explicit final throw for defensive coding
   - Covers the otherwise-unreachable path after the loop (should never occur)
   - Makes function behavior crystal clear

Addresses review feedback from Claude Code Review on PR ambient-code#388

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@jeremyeder
Collaborator Author

Review Feedback Addressed ✅

I've addressed the high-priority feedback from the automated code review:

1. Enhanced Retry Detection for Network Errors

Before:

const isRetriable = error.status >= 500 || error.message.includes('JSON');

After:

// Retry on: network errors (undefined status), 5xx errors, or specific error patterns
const isRetriable = !error.status || error.status >= 500;

Why: Network errors (timeouts, connection resets, DNS failures) don't have a .status property. The new logic assumes all errors are retriable except 4xx client errors, which is more robust.

2. Added Error Details to Logs

Before:

console.log(`Attempt ${i + 1} failed, retrying in ${delay}ms...`);

After:

const errorMsg = error.message || 'Unknown error';
const errorStatus = error.status || 'network error';
console.log(`Attempt ${i + 1} failed (${errorStatus}: ${errorMsg}), retrying in ${delay}ms...`);

Why: Debugging is much easier when you can see exactly what error occurred on each retry attempt.

3. Added Explicit Final Throw

After the loop:

// Defensive: Should never reach here due to throw in loop, but explicit for clarity
throw new Error('retryWithBackoff: max retries exceeded');

Why: Makes the function's contract explicit and guarantees the function can never silently fall through after the loop.


These changes make the retry logic more robust against network failures while improving debuggability. Ready for re-review! 🚀
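Putting the three changes together, the revised helper looks roughly like this (assembled from the before/after snippets above; the workflow file remains the source of truth):

async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await fn();
    } catch (error) {
      // Retry on network errors (undefined status) and 5xx; fail fast on 4xx
      const isRetriable = !error.status || error.status >= 500;
      if (!isRetriable || i === maxRetries) throw error;
      const delay = initialDelay * Math.pow(2, i); // 1s, 2s, 4s
      const errorMsg = error.message || 'Unknown error';
      const errorStatus = error.status || 'network error';
      console.log(`Attempt ${i + 1} failed (${errorStatus}: ${errorMsg}), retrying in ${delay}ms...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Defensive: should never reach here due to throw in loop, but explicit for clarity
  throw new Error('retryWithBackoff: max retries exceeded');
}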

@github-actions
Contributor

Claude Code Review

Summary

This PR implements a robust retry mechanism with exponential backoff to handle transient GitHub API failures when adding labels to PRs in the Amber workflow. The implementation has been refined in response to previous review feedback and now includes comprehensive error handling for network failures, detailed logging, and defensive coding practices.

Overall Assessment: APPROVED WITH CONFIDENCE - This is production-ready defensive code that addresses a real issue with minimal risk.


Issues by Severity

No blocking, critical, or major issues found. All previous concerns have been addressed.

🔵 Minor Issues

1. Retry Logic Could Be Applied to Other GitHub API Calls

The workflow makes several other GitHub API calls that could benefit from the same retry logic:

  • Line 258: github.rest.issues.createComment (report no changes)
  • Line 298: github.rest.issues.createComment (validation comment)
  • Line 352: github.rest.pulls.create (PR creation)
  • Line 396: github.rest.issues.createComment (link PR to issue)
  • Line 409: github.rest.issues.createComment (failure notification)
  • Line 434: github.rest.issues.createComment (report failure)

Impact: Low - These are in error handling paths or less critical flows. Label addition was the observed failure point.

Recommendation: Consider wrapping the PR creation call (line 352) with retry logic in a future enhancement, as it's the most critical operation; a sketch follows below. Comment additions are less critical but could be hardened as well.

Reference: Lines 258, 298, 352, 396, 409, 434.
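A hedged sketch of that future enhancement, reusing the helper this PR introduces (prTitle, branchName, and prBody are placeholders for whatever the workflow already builds):

// Hypothetical follow-up, not part of this PR: harden PR creation the same way
const pr = await retryWithBackoff(() =>
  github.rest.pulls.create({
    owner: context.repo.owner,
    repo: context.repo.repo,
    title: prTitle,    // placeholder
    head: branchName,  // placeholder
    base: 'main',
    body: prBody,      // placeholder
  })
);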


2. Retry Helper Function Is Duplicated in Workflow

The retryWithBackoff function is defined inline in this workflow. If retry logic is needed in other workflows (e.g., amber-dependency-sync.yml), this code would be duplicated.

Impact: Very Low - Currently only used in one place, and GitHub Actions doesn't have great patterns for sharing JavaScript code between workflows.

Recommendation: Document this as a reusable pattern if other workflows need similar reliability. For now, keeping it inline is pragmatic.


3. Max Retries Constant Is Embedded

The function signature has maxRetries = 3 as a default parameter, but every call site relies on the default, so it always retries 3 times. This is fine for the current use case but makes the parameter somewhat misleading.

Impact: Very Low - The default is sensible (3 retries = 1s + 2s + 4s = 7s total max delay).

Recommendation: Consider adding a comment explaining why 3 retries was chosen (based on typical transient API failure patterns).

Reference: Line 326.


Positive Highlights

Addresses Real Production Issue - Directly solves the transient API failure observed in PR #387 (RequestError [HttpError]: Unexpected end of JSON input, status: 500)

Comprehensive Error Handling - Now handles:

  • HTTP 5xx errors (server-side failures)
  • Network errors (no status property)
  • JSON parse errors (malformed responses)
  • All unexpected errors (defensive default)

Excellent Retry Strategy - Exponential backoff with sensible defaults:

  • 4 attempts total (1 initial + 3 retries)
  • 1s, 2s, 4s delays (max 7s total)
  • Fails fast on 4xx client errors

Detailed Logging - Retry attempts include error status and message for debugging

Defensive Coding - Explicit final throw ensures no unreachable code paths

Minimal Scope - Only touches the label addition logic, reducing risk

Well-Documented - Inline comments explain retry strategy and rationale

Follows Repository Patterns - Similar to retry logic in:

  • Frontend: components/frontend/src/lib/query-client.ts (React Query exponential backoff)
  • Backend: components/backend/handlers/helpers.go (RetryWithBackoff function)

No Security Concerns - No user input, no credential handling, no injection risks

No Performance Impact - Only adds latency on the failure path (where the workflow would otherwise fail outright)

Iterative Improvement - Author addressed all previous review feedback comprehensively


Recommendations

Before Merge (Optional)

  1. Add inline comment explaining retry count choice:
    // 3 retries = max 7s total delay (1s + 2s + 4s), balancing reliability vs workflow timeout
    async function retryWithBackoff(fn, maxRetries = 3, initialDelay = 1000) {

Future Enhancements (Not This PR)

  1. Consider applying retry logic to PR creation (line 352) - most critical API call
  2. Monitor workflow runs - validate that retries work in production
  3. Document retry pattern in workflow README if reused elsewhere

Code Quality Assessment

  • JavaScript Best Practices: Proper async/await, const/let usage
  • Error Handling: Comprehensive, with graceful degradation
  • Maintainability: Clear function naming, well-commented
  • Testing: Difficult to unit test GitHub Actions, but logic is sound
  • Documentation: PR description clearly explains problem and solution
  • Security: No vulnerabilities introduced
  • Performance: Minimal impact, only on failure path
  • Compatibility: Works with existing Octokit error handling

Architecture & Design Notes

Pattern Alignment:
This implementation aligns well with existing retry patterns in the codebase:

  • Frontend (query-client.ts:19): retryDelay: (attemptIndex) => Math.min(1000 * 2 ** attemptIndex, 30000)
  • Backend (helpers.go:36): delay := time.Duration(float64(initialDelay) * math.Pow(2, float64(i)))

The workflow implementation uses the same exponential backoff formula (initialDelay * Math.pow(2, i)), ensuring consistency across the platform.
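Concretely, the shared formula produces the same delay sequence everywhere (values shown for initialDelay = 1000; the 30s cap is specific to the frontend):

// delay(i) = initialDelay * 2^i  ->  1000ms, 2000ms, 4000ms for i = 0, 1, 2
const delays = [0, 1, 2].map((i) => 1000 * Math.pow(2, i)); // [1000, 2000, 4000]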

Why This Approach:

  • GitHub Actions workflows don't have direct access to shared libraries
  • Inline implementation is standard practice for GitHub Actions scripts
  • The retry logic is specific to Octokit error handling patterns
  • Keeps the workflow self-contained and easy to debug

Test Plan Validation

From PR description:

  • ✅ Workflow logic reviewed
  • ⏳ Test with intentional API failure (manual testing required)
  • ⏳ Monitor next Amber automation run

Recommendation: The next Amber workflow run will validate this in production. Consider manually triggering a test issue with the amber:auto-fix label to verify behavior.


Final Verdict

This PR represents high-quality defensive programming that:

  1. Solves a documented production issue
  2. Follows established patterns in the codebase
  3. Includes comprehensive error handling
  4. Has minimal risk and no breaking changes
  5. Is well-documented and maintainable

The code is production-ready and ready to merge. 🚀

The only suggestions are minor documentation enhancements and future considerations for applying this pattern more broadly.


🤖 Generated by Claude Code Review

@jeremyeder merged commit f7288ce into ambient-code:main Dec 4, 2025
21 checks passed
