add retries and timeouts#51

Merged
derekmisler merged 1 commit into docker:main from derekmisler:add-fallback-models-retries-and-timeouts
Feb 23, 2026

Conversation

@derekmisler derekmisler commented Feb 23, 2026

Summary

Adds retry logic with exponential backoff and timeout handling to improve reliability of agent execution. Updates cagent version to v1.23.6 across all workflows and documentation. Enhances PR review workflow with verification that reviews are actually posted and fallback error handling for token expiry scenarios.

Changes

  • action.yml: Implemented retry loop with exponential backoff (max 2 retries by default), added max-retries and retry-delay inputs, improved timeout handling using PIPESTATUS to distinguish timeout (124) from other failures
  • review-pr/action.yml: Added 20-minute timeout to prevent GitHub App token expiry, implemented review verification step to confirm bot review was posted, added fallback comments when review fails or isn't posted, improved reaction handling to use github.token (6h lifetime) instead of potentially expired App token
  • Version bumps: Updated cagent from v1.23.4 to v1.23.6 in all workflows, actions, and documentation
  • README.md: Updated retry-delay description to clarify exponential backoff behavior
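The retry-and-timeout mechanics described in the first bullet can be sketched roughly as follows. This is an illustrative reconstruction, not the action's actual code: `flaky_command`, the `STATE_FILE` counter, and the short demo delay are stand-ins, and the real action reads its inputs from `max-retries` and `retry-delay`.

```shell
#!/usr/bin/env bash
set -uo pipefail

MAX_RETRIES="${MAX_RETRIES:-2}"
RETRY_DELAY="${RETRY_DELAY:-1}"   # seconds; kept short for the demo
STATE_FILE="$(mktemp)"

# Stand-in for the real agent invocation: fails twice, then succeeds,
# purely so this demo exercises the retry path.
flaky_command() {
  n=$(( $(cat "$STATE_FILE" 2>/dev/null || echo 0) + 1 ))
  echo "$n" > "$STATE_FILE"
  echo "attempt $n"
  [ "$n" -ge 3 ]
}
export -f flaky_command
export STATE_FILE

attempt=0
while :; do
  # `timeout` exits 124 when the command times out. Read the command's
  # exit code from PIPESTATUS[0]: plain $? would report `tee` instead.
  timeout 30 bash -c flaky_command 2>&1 | tee run.log
  exit_code="${PIPESTATUS[0]}"

  [ "$exit_code" -eq 0 ] && break
  [ "$exit_code" -eq 124 ] && echo "command timed out" >&2

  if [ "$attempt" -ge "$MAX_RETRIES" ]; then
    echo "failed after $((attempt + 1)) attempts" >&2
    exit "$exit_code"
  fi
  # Exponential backoff: RETRY_DELAY, 2*RETRY_DELAY, 4*RETRY_DELAY, ...
  sleep $(( RETRY_DELAY * (2 ** attempt) ))
  attempt=$((attempt + 1))
done
echo "succeeded on attempt $((attempt + 1))"
```

With the defaults above, the demo fails twice, sleeps 1s then 2s, and succeeds on the third attempt.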

Breaking Changes

None

How to Test

  • Trigger a workflow that uses the action and verify it completes successfully
  • Test retry behavior by simulating transient failures (should retry up to 2 times with exponential backoff)
  • Test PR review workflow and verify that reviews are posted and fallback comments appear when reviews fail

Closes: https://github.com/docker/gordon/issues/171

@derekmisler derekmisler changed the title add fallback models, retries, and timeouts add retries and timeouts Feb 23, 2026
@derekmisler
Contributor Author

/describe


docker-agent bot commented Feb 23, 2026

✅ PR description has been generated and updated!

@derekmisler derekmisler force-pushed the add-fallback-models-retries-and-timeouts branch 3 times, most recently from 2d66210 to bf8b8fc Compare February 23, 2026 20:21
@derekmisler derekmisler marked this pull request as ready for review February 23, 2026 20:22
@derekmisler derekmisler requested a review from a team as a code owner February 23, 2026 20:22
docker-agent bot previously requested changes Feb 23, 2026

@docker-agent docker-agent bot left a comment


Critical Issues Found

This PR adds important retry and timeout functionality, but contains critical bugs in exit code handling that will prevent the retry logic from working correctly.

Summary

  • 2 HIGH severity issues: Exit codes are captured incorrectly in both timeout and non-timeout cases, causing all failures to be reported as success
  • 2 MEDIUM severity issues: Timestamp comparisons could miss reviews posted in the same second

The retry logic implementation is solid, but the exit code capture bugs will completely break its functionality.
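The exit-code bug class the bot is describing is easy to reproduce: after a pipeline such as `cmd | tee log`, `$?` reflects the last stage (`tee`), not the command whose failure matters. A minimal demonstration of the pitfall (illustrative, not the PR's actual diff):

```shell
# A failing command piped through a succeeding `tee`:
false | tee /dev/null
wrong="$?"                 # 0 -- tee's status; the failure is masked

# Same pipeline, but reading the first stage's status from PIPESTATUS.
# PIPESTATUS must be captured immediately after the pipeline runs.
false | tee /dev/null
right="${PIPESTATUS[0]}"   # 1 -- the real exit code of `false`

echo "wrong=$wrong right=$right"
```

This is why a retry loop keyed off `$?` would treat every failed agent run as a success and never retry.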

@derekmisler derekmisler force-pushed the add-fallback-models-retries-and-timeouts branch from bf8b8fc to b27edd7 Compare February 23, 2026 20:56
@derekmisler
Contributor Author

/review


@docker-agent docker-agent bot left a comment


Review Summary

This PR successfully adds retry logic with exponential backoff and review verification. However, I've identified 3 medium-severity issues in the error handling and verification logic that should be addressed:

  1. Clock skew vulnerability in review verification - The timestamp comparison could miss legitimate reviews if clocks are not perfectly synchronized
  2. API error masking - API failures are silently treated as "no reviews found", making it impossible to distinguish between legitimate zero-review results and API errors
  3. Silent fallback failures - If the fallback comment posting fails, users receive no notification

These issues don't prevent the core retry functionality from working, but they could cause confusion when verification or fallback mechanisms fail.
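A common mitigation for issue 1, sketched here as an assumption about how the verification could work (the workflow's actual comparison is not quoted in this thread): compare ISO-8601 UTC timestamps lexicographically, and pad the run-start timestamp with a skew buffer so a review posted in the same second, or with a slightly-behind API clock, still counts.

```shell
# review_is_fresh START REVIEW -> success if REVIEW was submitted at or
# after START. ISO-8601 UTC timestamps sort correctly as plain strings,
# so [[ ]] lexicographic comparison is sufficient.
review_is_fresh() {
  [[ "$2" > "$1" || "$2" == "$1" ]]
}

# Buffer precomputed for the demo: 60s before the hypothetical run start
# of 2026-02-23T21:00:00Z.
BUFFERED_START="2026-02-23T20:59:00Z"

# In the real workflow this timestamp would come from the GitHub API.
if review_is_fresh "$BUFFERED_START" "2026-02-23T20:59:30Z"; then
  REVIEW_VERIFIED=true
else
  REVIEW_VERIFIED=false
fi
echo "REVIEW_VERIFIED=$REVIEW_VERIFIED"
```

The 60-second buffer is an assumed value; any buffer larger than the expected clock skew works, at the cost of occasionally accepting a slightly stale review.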

Signed-off-by: Derek Misler <derek.misler@docker.com>
@derekmisler derekmisler force-pushed the add-fallback-models-retries-and-timeouts branch from b27edd7 to 3224dd6 Compare February 23, 2026 21:16
@derekmisler
Contributor Author

/review

@derekmisler
Contributor Author

/describe


docker-agent bot commented Feb 23, 2026

✅ PR description has been generated and updated!


@docker-agent docker-agent bot left a comment


Review Summary

I've identified one issue in the newly added review verification logic that should be addressed to improve robustness. The retry logic and timeout handling look good overall.

The main concern is that the timestamp generation lacks validation, which could lead to incorrect verification results if both date command variants fail (though this is extremely unlikely on standard runners).
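The validation being asked for might look like the following sketch. The workflow's actual pair of `date` invocations (GNU vs. BSD variants) is not quoted in this thread, so a single portable form stands in here; the point is failing loudly instead of silently comparing reviews against an empty timestamp.

```shell
# Generate the run-start timestamp, then validate its shape before it is
# ever used in a comparison.
TS="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

if ! [[ "$TS" =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$ ]]; then
  echo "could not produce a valid UTC timestamp: '$TS'" >&2
  exit 1
fi
echo "run started at $TS"
```

As the bot notes, both variants failing is extremely unlikely on standard runners, but the check turns a confusing verification result into an immediate, attributable error.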

fi

- name: Save reviewer memory
if: always()


Is `always()` necessary for some failure cases, or am I missing something? Feels odd, but then again this is GHA after all, so...

Contributor Author


Yeah, if the previous step failed, GHA would fail the whole thing and we would not save memory. But even if the run fails, I still want to save whatever memory accumulated during the run before it failed.

if [ "$REVIEW_VERIFIED" == "false" ]; then
# Agent succeeded but review wasn't posted (likely token expiry)
gh api "repos/${{ github.repository }}/issues/comments/${{ steps.resolve-context.outputs.comment-id }}/reactions" \
-X POST -f content='confused' || true


content='confused' 😂

@derekmisler derekmisler merged commit 509e721 into docker:main Feb 23, 2026
17 checks passed