
Conversation

@samuelvkwong samuelvkwong (Collaborator) commented Jan 22, 2026

Summary

  • Fixed tasks getting stuck in PENDING status due to extremely long exponential backoff wait times for retry attempts
  • Switched from exponential to linear backoff for task retries (2min → 4min → 6min) for more predictable and reasonable retry timing
  • Also fixed an off-by-one error in the retry condition check (secondary issue)
  • Added test to verify tasks correctly transition to FAILURE after exhausting all retry attempts

Test plan

  • All existing tests pass (uv run cli test -- adit/core/tests/test_tasks.py -v)
  • New test test_process_dicom_task_transitions_to_failure_after_max_retries verifies the fix
  • Linter checks pass (uv run cli lint)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed retry logic for DICOM task processing so attempts are counted and final failure is reached correctly.
    • Improved final error messaging so the real error is shown when retries are exhausted.
  • Behavioral Changes

    • Switched task-level retry timing from exponential to a linear backoff pattern (base wait + per-attempt increment).
  • Tests

    • Added test validating transition to failure after max retries and final error message.


Fixed an off-by-one error in the retry logic where tasks would stay in
PENDING status with "Task failed, but will be retried" even after
exhausting all retry attempts.

The issue was in the condition checking whether to retry:
- Old: `context.job.attempts < settings.DICOM_TASK_MAX_ATTEMPTS`
- New: `context.job.attempts + 1 < settings.DICOM_TASK_MAX_ATTEMPTS`

Procrastinate's `attempts` is 0-indexed (counts previous attempts), so
on the final attempt (attempt 3 with max_attempts=3), attempts=2.
The old check `2 < 3` was True, incorrectly allowing another retry.
The fix `2 + 1 < 3` = False correctly transitions to FAILURE.
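For illustration, a minimal sketch of that decision point (paraphrased from the description above, not the verbatim code in adit/core/tasks.py; the surrounding status handling is abbreviated and the local variable names are assumptions):

try:
    processor.process()
except RetriableDicomError as err:
    # Procrastinate's attempts counts *previous* attempts (0-indexed),
    # so on the final attempt attempts == DICOM_TASK_MAX_ATTEMPTS - 1.
    if context.job.attempts + 1 < settings.DICOM_TASK_MAX_ATTEMPTS:
        dicom_task.status = DicomTask.Status.PENDING
        dicom_task.message = "Task failed, but will be retried"
        raise  # re-raise so Procrastinate schedules the next retry
    else:
        # Final attempt exhausted: surface the real error and fail the task.
        dicom_task.status = DicomTask.Status.FAILURE
        dicom_task.message = str(err)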

Also switched from exponential to linear backoff for task retries
(2min → 4min → 6min) for more predictable retry timing.
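Concretely, the task-level retry configuration then looks roughly like this (a sketch based on the change summary, using Procrastinate's standard RetryStrategy parameters; the task decorator wiring is omitted):

from django.conf import settings
from procrastinate import RetryStrategy

# Linear backoff: wait + linear_wait * attempts  ->  2 min, 4 min, 6 min.
retry_strategy = RetryStrategy(
    max_attempts=settings.DICOM_TASK_MAX_ATTEMPTS,
    wait=settings.DICOM_TASK_RETRY_WAIT,          # 120 s base wait
    linear_wait=settings.DICOM_TASK_LINEAR_WAIT,  # +120 s per previous attempt
)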

Added test to verify tasks correctly transition to FAILURE after
exhausting all retry attempts.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @samuelvkwong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly improves the reliability and predictability of task processing by addressing critical issues in the retry mechanism. It transitions from an exponential to a linear backoff strategy for task retries, preventing tasks from becoming stuck in a PENDING state due to excessively long delays. Additionally, it corrects an off-by-one error in the retry attempt tracking, ensuring that tasks reliably transition to a FAILURE state once their maximum retry attempts are exhausted. These changes enhance the system's ability to manage task failures and recovery effectively.

Highlights

  • Retry Strategy Update: The task retry mechanism has been switched from an exponential backoff to a linear backoff strategy. This change ensures more predictable and reasonable retry timings for tasks, moving from a 1min → 2min → 4min pattern to a 2min → 4min → 6min pattern.
  • Off-by-One Error Fix: An off-by-one error in the task retry condition within the _monitor_task function has been corrected. This fix prevents tasks from getting stuck in a PENDING status even after exhausting all allowed retry attempts, ensuring they correctly transition to a FAILURE state.
  • New Configuration Settings: New settings, DICOM_TASK_RETRY_WAIT and DICOM_TASK_LINEAR_WAIT, have been introduced in adit/settings/base.py to configure the new linear backoff strategy, replacing the previous DICOM_TASK_EXPONENTIAL_WAIT.
  • Test Coverage Enhancement: A new test case, test_process_dicom_task_transitions_to_failure_after_max_retries, has been added to adit/core/tests/test_tasks.py. This test specifically validates that tasks correctly transition to a FAILURE status once all retry attempts are exhausted, verifying the fix for the off-by-one error.

coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

The diff changes DICOM task retry behavior from an exponential backoff to a linear backoff (base wait + per-attempt increment), updates related settings/constants, adjusts the task retry conditional to shift when the final attempt is detected, and adds a test asserting failure after exhausting retries.

Changes

  • Configuration (adit/settings/base.py, adit/core/utils/retry_config.py): Removed DICOM_TASK_EXPONENTIAL_WAIT; added DICOM_TASK_RETRY_WAIT (120s) and DICOM_TASK_LINEAR_WAIT (120s). Updated task-layer retry documentation to describe the linear 2/4/6-minute pattern.
  • Core Retry Logic (adit/core/tasks.py): Replaced exponential_wait usage with wait and linear_wait in the process_dicom_task RetryStrategy. In _monitor_task, changed the retry condition from context.job.attempts < max to context.job.attempts + 1 < max, altering when the final attempt triggers failure.
  • Tests (adit/core/tests/test_tasks.py): Added test_process_dicom_task_transitions_to_failure_after_max_retries to assert that job and task transition to FAILURE when retries are exhausted and that the final message contains the actual error (not a retry prompt).

Sequence Diagram(s)

mermaid
sequenceDiagram
    participant TaskRunner as Task Runner
    participant Processor as DICOM Processor
    participant Monitor as _monitor_task
    participant Queue as Job Queue
    TaskRunner->>Processor: invoke processing
    Processor-->>TaskRunner: success / RetriableDicomError
    alt RetriableDicomError
        TaskRunner->>Monitor: report error + context
        Monitor->>Queue: read context.job.attempts
        Monitor-->>Monitor: if attempts + 1 < MAX then schedule retry (wait + linear_wait * attempts)
        Monitor->>Queue: update job state (RETRY) / schedule retry
        Monitor-->>TaskRunner: requeue for next attempt
    else Success or non-retriable
        Monitor->>Queue: mark job/task SUCCESS or FAILURE
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibbled at waits, then found a line,
Two, four, six minutes — a steady design.
When hops run out and attempts are through,
I ink the end with the error true. 🥕

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 25.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The PR title directly and accurately describes the main fix: addressing tasks stuck in PENDING status due to long exponential backoff retries. This aligns with the core problem being solved and the primary change from exponential to linear backoff.



claude bot commented Jan 22, 2026

PR Review: Fix task retry causing stuck PENDING status

Summary

This PR addresses a critical bug in task retry logic and improves retry timing behavior. The changes are well-thought-out and properly tested.

✅ Strengths

1. Excellent Bug Fix - Off-by-one Error

The fix in adit/core/tasks.py:138 correctly addresses the off-by-one error:

  • Root cause: Procrastinate's attempts is 0-indexed (counts previous attempts)
  • Old logic: context.job.attempts < settings.DICOM_TASK_MAX_ATTEMPTS incorrectly allowed retry on final attempt
  • New logic: context.job.attempts + 1 < settings.DICOM_TASK_MAX_ATTEMPTS correctly transitions to FAILURE
  • Example: With max_attempts=3, on the 3rd attempt attempts=2, the check 2 + 1 < 3 evaluates to False, correctly preventing further retries

The inline comment explaining this logic is excellent and will help future maintainers.

2. Improved Retry Strategy

Switching from exponential to linear backoff is a practical improvement:

  • Before: 1 min → 2 min → 4 min (exponential backoff, could grow very long)
  • After: 2 min → 4 min → 6 min (linear backoff, predictable timing)
  • Rationale: Makes sense given that Stamina already handles fast retries (5-10 attempts over 2-5 min). Procrastinate retries are for longer-term issues and benefit from predictable timing; the arithmetic is worked out in the quick check below.
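A quick check of the new timing, using the settings values and formula quoted elsewhere in this PR (wait = 120 s, linear_wait = 120 s):

WAIT = 120         # DICOM_TASK_RETRY_WAIT (seconds)
LINEAR_WAIT = 120  # DICOM_TASK_LINEAR_WAIT (seconds)

# "attempts" is Procrastinate's count of previous attempts (0-indexed).
for attempts in range(3):
    delay = WAIT + LINEAR_WAIT * attempts
    print(f"wait before retry {attempts + 1}: {delay // 60} min")
# -> 2 min, 4 min, 6 min (12 min total)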

3. Comprehensive Test Coverage

The new test test_process_dicom_task_transitions_to_failure_after_max_retries is excellent (a rough sketch of its shape follows the list below):

  • Directly tests the bug scenario
  • Simulates final attempt by setting attempts=max_attempts-1
  • Verifies both task status and message content
  • Well-documented with clear comments explaining the test logic
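For orientation, a rough sketch of that test's shape (not the actual code from this PR; imports and model module paths are omitted, the dicom_job/dicom_task fixtures and the run_worker helper are hypothetical stand-ins, while the process stub, queued_job_id, and the attempts manipulation mirror fragments quoted later in this thread):

def test_transitions_to_failure_after_max_retries(mocker, dicom_job, dicom_task, run_worker):
    error_message = "Connection refused by server"

    def process(_self):
        # Every processing attempt fails with a retriable error.
        raise RetriableDicomError(error_message)

    mocker.patch.object(ExampleProcessor, "process", process)

    # Pretend we are already on the final attempt (attempts is 0-indexed).
    final_attempt = settings.DICOM_TASK_MAX_ATTEMPTS - 1
    ProcrastinateJob.objects.filter(id=dicom_task.queued_job_id).update(attempts=final_attempt)

    run_worker()  # hypothetical helper that processes the queued job once

    dicom_task.refresh_from_db()
    assert dicom_task.status == DicomTask.Status.FAILURE
    assert dicom_task.message == error_message
    assert "Task failed, but will be retried" not in dicom_task.message

    dicom_job.refresh_from_db()
    assert dicom_job.status == DicomJob.Status.FAILURE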

4. Documentation Updates

  • retry_config.py: Updated comments to reflect new linear backoff strategy
  • base.py: Clear comments explaining the wait formula and rationale
  • All changes maintain consistency with the project's documentation standards

🔍 Minor Observations

1. Settings Variable Naming

The new settings variables in base.py are clear:

DICOM_TASK_RETRY_WAIT = 120  # 2 minute base wait
DICOM_TASK_LINEAR_WAIT = 120  # Add 2 minutes per retry

However, you might consider whether DICOM_TASK_RETRY_WAIT_BASE would be more descriptive than DICOM_TASK_RETRY_WAIT, since it's the base value for linear backoff. This is purely stylistic and not critical.

2. Test Assertion Order

In the new test (lines 206-213), consider asserting the DicomTask status before the DicomJob status for better locality:

# Current order:
dicom_job.refresh_from_db()
assert dicom_job.status == DicomJob.Status.FAILURE

dicom_task.refresh_from_db()
assert dicom_task.status == DicomTask.Status.FAILURE

# Suggested order:
dicom_task.refresh_from_db()
assert dicom_task.status == DicomTask.Status.FAILURE
assert dicom_task.message == error_message
assert "Task failed, but will be retried" not in dicom_task.message

dicom_job.refresh_from_db()
assert dicom_job.status == DicomJob.Status.FAILURE

This groups all task assertions together and all job assertions together, improving readability. Again, this is a minor style preference.

🎯 Code Quality Assessment

  • Logic: The fix is correct and addresses the root cause
  • Testing: Comprehensive test coverage with clear documentation
  • Documentation: Well-documented changes in both code and comments
  • Style: Follows project conventions (Google Python Style Guide)
  • Performance: Linear backoff provides more predictable retry timing
  • Security: No security concerns introduced
  • Backwards Compatibility: Changes are backward compatible (only affects retry timing)

📊 Impact Analysis

Positive Impact:

  • Tasks will no longer get stuck in PENDING status indefinitely
  • More predictable retry timing makes debugging easier
  • Better user experience with clearer task status transitions

Risk Assessment: Low

  • The change is well-tested
  • Linear backoff timing is still reasonable for recovery scenarios
  • No breaking changes to the API or data models

✨ Recommendation

APPROVE - This is a high-quality fix that addresses a real bug with proper testing and documentation. The code changes are minimal, focused, and correct.

The only suggestions above are minor style preferences that don't affect functionality. Great work on identifying and fixing this issue!


Review performed by Claude Code

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request effectively addresses the issue of tasks getting stuck in a PENDING state due to long exponential backoff periods. The switch to a linear backoff strategy is a sensible improvement for more predictable and manageable retry intervals. The fix for the off-by-one error in the retry attempt condition is also crucial and correctly implemented. The addition of a new test case to verify this fix is a great inclusion.

I've provided a couple of suggestions to improve the readability of the retry logic and the robustness of the new test case. Overall, this is a solid contribution that improves the reliability of the task processing system.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@adit/core/tests/test_tasks.py`:
- Around line 188-193: The local test stub for ExampleProcessor.process declares
an unused parameter named self causing Ruff ARG001; update the mock function
signature from def process(self): to def process(_self): (or another unused-name
starting with underscore) and patch ExampleProcessor.process with that renamed
function so the test still raises RetriableDicomError as before.
🧹 Nitpick comments (1)
adit/core/tests/test_tasks.py (1)

195-202: Target the queued job when forcing the final attempt.
This avoids unintentionally mutating other queued jobs in the test DB.

♻️ Suggested refinement
-    ProcrastinateJob.objects.filter(status="todo").update(attempts=final_attempt)
+    queued_job_id = dicom_task.queued_job_id
+    assert queued_job_id is not None
+    ProcrastinateJob.objects.filter(id=queued_job_id).update(attempts=final_attempt)

Comment on lines +188 to +193
error_message = "Connection refused by server"

def process(self):
raise RetriableDicomError(error_message)

mocker.patch.object(ExampleProcessor, "process", process)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file structure and verify the test file exists
fd test_tasks.py

Repository: openradx/adit

Length of output: 86


🏁 Script executed:

# Read the specific lines mentioned in the review (188-193 plus context)
head -193 adit/core/tests/test_tasks.py | tail -30

Repository: openradx/adit

Length of output: 1206


🏁 Script executed:

# Run Ruff on the test file to see what issues it flags
rg --version && ruff check adit/core/tests/test_tasks.py --select ARG

Repository: openradx/adit

Length of output: 2202


Fix unused self argument to satisfy Ruff ARG001.

The local process() function at line 190 declares a self parameter that is never used. Rename it to _self to follow Python conventions for intentionally unused parameters.

     error_message = "Connection refused by server"
 
-    def process(self):
+    def process(_self):
         raise RetriableDicomError(error_message)
🧰 Tools
🪛 Ruff (0.14.13)

190-190: Unused function argument: self

(ARG001)


Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
claude bot commented Jan 27, 2026

Pull Request Review: Fix Task Retry Logic

Summary

This PR addresses a critical bug where tasks would get stuck in PENDING status due to an off-by-one error in retry logic, and improves retry timing by switching from exponential to linear backoff. Overall, this is a well-implemented fix with excellent test coverage and documentation.


✅ Strengths

1. Correct Off-by-One Fix

The core fix correctly handles Procrastinate's 0-indexed attempts counter:

# Old: context.job.attempts < settings.DICOM_TASK_MAX_ATTEMPTS
# New: context.job.attempts + 1 < settings.DICOM_TASK_MAX_ATTEMPTS

Analysis: With DICOM_TASK_MAX_ATTEMPTS=3:

  • Attempt 1: attempts=0, check: 0+1 < 3 ✓ (retry)
  • Attempt 2: attempts=1, check: 1+1 < 3 ✓ (retry)
  • Attempt 3: attempts=2, check: 2+1 < 3 ✗ (FAILURE)

This correctly ensures tasks transition to FAILURE on the final attempt rather than staying PENDING with "will be retried" message.

2. Improved Retry Timing

Switching to linear backoff (2min → 4min → 6min) is more appropriate than exponential for this use case:

  • Exponential backoff with 60s base: 1min → 2min → 4min (7min total)
  • Linear backoff with 120s: 2min → 4min → 6min (12min total, formula: wait + linear_wait * attempts)

Rationale: Since Stamina already handles fast retries (5-10 attempts), Procrastinate retries target longer-term issues (server restarts, maintenance). Linear backoff provides predictable timing without excessive delays.

3. Excellent Test Coverage

The new test test_process_dicom_task_transitions_to_failure_after_max_retries is well-designed:

  • Simulates final retry attempt by setting attempts=max_attempts-1
  • Verifies task transitions to FAILURE (not PENDING)
  • Validates error message is the actual error (not "will be retried")
  • Includes comprehensive docstring explaining the fix

4. Thorough Documentation

  • Inline comments explain Procrastinate's 0-indexed behavior (lines 135-137 in tasks.py)
  • Updated retry_config.py docstring to reflect linear backoff
  • Updated settings comments to explain new configuration
  • Commit messages clearly describe the problem and solution

🔍 Code Quality Observations

Minor: Formula Documentation Clarity

In settings/base.py:413, the comment states:

# Formula: wait + linear_wait * attempts

Note: This describes Procrastinate's formula, where attempts is 0-indexed. The actual wait times are:

  • Retry 1: 120 + 120 * 0 = 120s (2 min)
  • Retry 2: 120 + 120 * 1 = 240s (4 min)
  • Retry 3: 120 + 120 * 2 = 360s (6 min)

The comment is correct but could optionally clarify that attempts is 0-indexed to avoid confusion.

Code Style: Follows Project Standards ✓

  • Adheres to Google Python Style Guide
  • Line length under 100 characters
  • Clear variable naming and comments
  • Proper exception handling preserved

🔒 Security Analysis

No security concerns identified. Changes are limited to:

  • Retry logic timing and condition checks
  • Configuration parameters for backoff strategy
  • Test coverage additions

No changes to authentication, authorization, data validation, or external interfaces.


⚡ Performance Considerations

Positive Impact

  • Linear backoff prevents exponential explosion: With the old exponential backoff, a misconfiguration or code bug could lead to extremely long retry delays (e.g., if max_attempts was increased to 5, delay would be 16 minutes for the 4th retry)
  • Predictable resource usage: Linear backoff makes it easier to capacity plan for worker resources

Neutral Impact

  • Total retry time increased from ~7 min to ~12 min, which is acceptable given the target use case (long-term outages)
  • Worker threads blocked for similar durations during retries

🧪 Testing Recommendations

Current Coverage: Excellent ✓

The new test covers the critical path (final retry → FAILURE transition).

Optional: Additional Edge Cases

Consider testing these scenarios in future work (not blocking for this PR):

  1. First retry behavior: Verify attempts=0 correctly transitions to PENDING
  2. Mid-retry behavior: Verify attempts=1 correctly transitions to PENDING
  3. Custom max_attempts: Test with DICOM_TASK_MAX_ATTEMPTS=1 and =5
  4. Non-retriable errors: Ensure they still fail immediately (existing test likely covers this)

📋 Checklist Review

Based on PR description:

  • ✅ All existing tests pass
  • ✅ New test added and passing
  • ✅ Linter checks pass
  • ✅ Clear documentation of changes

🎯 Final Recommendation

APPROVE

This PR successfully fixes a critical bug that could cause tasks to get stuck indefinitely. The implementation is correct, well-tested, and thoroughly documented. The switch to linear backoff is a sensible architectural improvement.

Suggested Action

Merge after confirming CI passes. No changes required.


📚 References

Great work on identifying and fixing this subtle off-by-one error! The detailed commit messages and test documentation make the fix easy to understand and maintain.

@medihack medihack merged commit 29b30c5 into main Jan 27, 2026
3 checks passed
@medihack medihack deleted the fix/task-retry-off-by-one-error branch January 27, 2026 13:58