
Conversation

@creatorrr (Contributor) commented Sep 17, 2025

User description

Summary

  • import tenacity's stop_after_attempt utility for the worker entrypoint retry decorator
  • cap retries at five attempts (~100 seconds) with an AIDEV note explaining the resiliency versus fail-fast balance

Testing

  • not run (not requested)

https://chatgpt.com/codex/tasks/task_b_68cb0221656483219bea456acd286bfa


PR Type

Enhancement


Description

  • Add retry limit to worker main function

  • Import stop_after_attempt from tenacity library

  • Cap retries at 5 attempts with explanatory comment


Diagram Walkthrough

flowchart LR
  A["Import stop_after_attempt"] --> B["Add stop policy to retry decorator"]
  B --> C["Limit retries to 5 attempts"]
  C --> D["Add AIDEV comment explaining policy"]

File Walkthrough

Relevant files
Enhancement
__main__.py
Add retry limit to worker retry decorator                               

src/agents-api/agents_api/worker/__main__.py

  • Import stop_after_attempt from tenacity library
  • Add stop=stop_after_attempt(5) to retry decorator
  • Add AIDEV comment explaining retry policy rationale
+4/-1     


Important

Adds a retry stop policy to the worker's main function in __main__.py, limiting retries to five attempts using stop_after_attempt.

  • Behavior:
    • Adds stop_after_attempt(5) to the @retry decorator in __main__.py to limit retries to five attempts (~100 seconds total).
    • Includes an AIDEV note explaining the balance between resiliency and fail-fast behavior.
  • Imports:
    • Imports stop_after_attempt from tenacity in __main__.py.
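Put together, the decorated entrypoint should look roughly like the sketch below. Only the decorator arguments and the AIDEV note come from the diff; the import grouping, the coroutine signature, and the elided body are assumptions for illustration.

import logging

from tenacity import (
    after_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,  # newly imported in this PR
    wait_fixed,
)

logger = logging.getLogger(__name__)


# AIDEV-NOTE: Allow up to five retries (~100s total) so transient Temporal outages
# can recover while still failing fast enough for orchestration to replace the worker.
@retry(
    wait=wait_fixed(20),
    retry=retry_if_exception_type(RuntimeError),
    stop=stop_after_attempt(5),
    after=after_log(logger, logging.DEBUG),
)
async def main() -> None:
    ...  # connect to Temporal and run the worker (unchanged by this PR)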

This description was created by Ellipsis for 6e9c489. You can customize this summary. It will automatically update as commits are pushed.

@qodo-merge-for-open-source (Contributor)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Retry Policy Fit

Confirm that capping retries at 5 with 20s wait (≈100s total) aligns with operational expectations for worker liveness and orchestration backoff, and that failing after 5 attempts does not prematurely terminate during longer transient outages.

# AIDEV-NOTE: Allow up to five retries (~100s total) so transient Temporal outages
# can recover while still failing fast enough for orchestration to replace the worker.
@retry(
    wait=wait_fixed(20),
    retry=retry_if_exception_type(RuntimeError),
    stop=stop_after_attempt(5),
    after=after_log(logger, logging.DEBUG),
)
Exception Scope

Validate that retrying only on RuntimeError is sufficient; if Temporal client/connectivity errors raise different exception types, they may bypass retries unexpectedly.

@retry(
    wait=wait_fixed(20),
    retry=retry_if_exception_type(RuntimeError),
    stop=stop_after_attempt(5),
    after=after_log(logger, logging.DEBUG),
)
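If the Temporal connection path raises SDK-specific errors rather than RuntimeError, the predicate could be widened along these lines. Treating temporalio.service.RPCError and ConnectionError as retryable is an assumption about the client library, not something established by this diff.

from temporalio.service import RPCError  # assumes the temporalio Python SDK is in use

@retry(
    wait=wait_fixed(20),
    retry=retry_if_exception_type((RuntimeError, RPCError, ConnectionError)),
    stop=stop_after_attempt(5),
    after=after_log(logger, logging.DEBUG),
)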

github-actions bot (Contributor) commented Sep 17, 2025

Claude finished @creatorrr's task.


Changelog Bot Analysis

Todo List:

  • Read diff.patch to understand changes
  • Determine if changes are user-visible
  • Create changelog entry if needed
  • Commit and push changes if applicable

Analysis:
The changes modify the worker retry logic by:

  • Adding stop_after_attempt(5) to limit retries to 5 attempts
  • Adding explanatory comment about the ~100s retry window

Decision: No changelog entry needed. This is an internal infrastructure change that improves worker resilience but doesn't affect user-facing functionality. Users don't directly interact with the worker retry mechanism - this is purely an operational improvement for system stability.

Result: No action taken - exiting without writing, committing, or making changes as per changelog bot rules for non-user-visible changes.


@ellipsis-dev bot (Contributor) left a comment

Important

Looks good to me! 👍

Reviewed everything up to 6e9c489 in 57 seconds.
  • Reviewed 26 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. src/agents-api/agents_api/worker/__main__.py:11
  • Draft comment:
    Good: Added stop_after_attempt import to limit retries.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. src/agents-api/agents_api/worker/__main__.py:25
  • Draft comment:
    AIDEV-NOTE clarifies retry rationale. Consider removing or adjusting internal notes in production.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
3. src/agents-api/agents_api/worker/__main__.py:30
  • Draft comment:
    Using stop_after_attempt(5) appropriately limits retries. Confirm that catching only RuntimeError is intended.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_yF95vU8rD09UVuWN

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@qodo-merge-for-open-source (Contributor)

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Suggestion: Expand retry exception types

Expand the retry condition to include more specific transient exceptions, such
as ConnectionError and TimeoutError, in addition to RuntimeError.

src/agents-api/agents_api/worker/__main__.py [27-32]

 @retry(
     wait=wait_fixed(20),
-    retry=retry_if_exception_type(RuntimeError),
+    retry=retry_if_exception_type((RuntimeError, ConnectionError, TimeoutError)),
     stop=stop_after_attempt(5),
     after=after_log(logger, logging.DEBUG),
 )
Suggestion importance[1-10]: 7


Why: The suggestion improves the robustness of the retry logic by proposing to catch more specific and relevant transient exceptions (ConnectionError, TimeoutError), which aligns well with the PR's goal of handling transient Temporal outages.

Impact: Medium

claude bot (Contributor) commented Sep 17, 2025

Code Review for PR #1568: Add stop policy to worker retry loop

Overall Assessment

This is a good improvement that adds necessary bounds to the retry logic in the worker's main function. The change prevents infinite retry loops while maintaining reasonable resilience against transient failures.

Strengths

  1. Good retry strategy: The 5-attempt limit (~100 seconds total with 20-second waits) strikes a reasonable balance between resilience and fail-fast behavior
  2. Clear documentation: The AIDEV-NOTE comment effectively explains the rationale behind the retry policy
  3. Consistent with codebase patterns: The implementation aligns with retry patterns used elsewhere in the codebase (e.g., rec_sum/generate.py also uses stop_after_attempt(5))
  4. Proper import: The stop_after_attempt import is correctly added to the existing tenacity imports

Observations & Suggestions

  1. Consider logging enhancement: when the retry limit is reached, it would help to log a more explicit message. The current after_log shows each retry attempt, but consider adding reraise=True so that, once all retries are exhausted, the original exception is re-raised instead of being wrapped in tenacity's RetryError.

  2. Configuration consideration: For production environments, you might want to make the retry count configurable via environment variables, similar to how other service parameters are configured:

    • WORKER_MAX_RETRIES = int(os.getenv('WORKER_MAX_RETRIES', '5'))
    • Then use: stop=stop_after_attempt(WORKER_MAX_RETRIES)
  3. Error specificity: Currently retrying only on RuntimeError. Consider whether other transient errors (such as ConnectionError or specific Temporal client errors) should also trigger retries; a combined sketch of these three suggestions follows this list.
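Taken together, the three suggestions above might look roughly like the sketch below. It is illustrative only: WORKER_MAX_RETRIES is a hypothetical environment variable, the extra exception types are assumptions about what is genuinely transient for this worker, and the body of main is elided.

import logging
import os

from tenacity import (
    after_log,
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_fixed,
)

logger = logging.getLogger(__name__)

# Hypothetical knob: retry count read from the environment, defaulting to 5.
WORKER_MAX_RETRIES = int(os.getenv("WORKER_MAX_RETRIES", "5"))


@retry(
    wait=wait_fixed(20),
    # Assumption: ConnectionError and TimeoutError are also transient here.
    retry=retry_if_exception_type((RuntimeError, ConnectionError, TimeoutError)),
    stop=stop_after_attempt(WORKER_MAX_RETRIES),
    after=after_log(logger, logging.DEBUG),
    # Re-raise the last exception instead of tenacity's RetryError once the
    # retries are exhausted, so the final failure is explicit in the logs.
    reraise=True,
)
async def main() -> None:
    ...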

No Critical Issues Found

  • No security vulnerabilities identified
  • No performance degradation expected
  • Change is backward compatible
  • Dependencies are properly managed (tenacity>=9.0.0 is already in pyproject.toml)

Test Coverage Note

While no direct tests for __main__.py were found (which is common for entry-point files), the retry behavior would be valuable to verify in integration tests, especially to ensure the worker fails properly after exhausting retries; a sketch of such a test follows below.
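For example, the stop policy could be exercised in isolation, without a real Temporal connection, by applying the same decorator parameters to a stub; the sketch below is an illustration, not a test that exists in this repository.

import pytest
from tenacity import RetryError, retry, retry_if_exception_type, stop_after_attempt, wait_fixed

calls = {"count": 0}


@retry(
    wait=wait_fixed(0),  # no real waiting needed in a test
    retry=retry_if_exception_type(RuntimeError),
    stop=stop_after_attempt(5),
)
def always_failing_worker():
    calls["count"] += 1
    raise RuntimeError("simulated Temporal outage")


def test_worker_gives_up_after_five_attempts():
    # Without reraise=True, tenacity wraps the last error in RetryError.
    with pytest.raises(RetryError):
        always_failing_worker()
    assert calls["count"] == 5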

Approval

This change improves the robustness of the worker service by preventing infinite retry loops while maintaining reasonable resilience. The implementation is clean and well-documented.

Recommendation: Ready to merge after considering the optional suggestions above.
