# Fix task retry causing stuck PENDING status due to long exponential backoff #287
Fixed an off-by-one error in the retry logic where tasks would stay in PENDING status with "Task failed, but will be retried" even after exhausting all retry attempts.

The issue was in the condition checking whether to retry:

- Old: `context.job.attempts < settings.DICOM_TASK_MAX_ATTEMPTS`
- New: `context.job.attempts + 1 < settings.DICOM_TASK_MAX_ATTEMPTS`

Procrastinate's `attempts` is 0-indexed (it counts previous attempts), so on the final attempt (attempt 3 with `max_attempts=3`), `attempts=2`. The old check `2 < 3` was True, incorrectly allowing another retry. The fixed check `2 + 1 < 3` is False and correctly transitions the task to FAILURE.

Also switched from exponential to linear backoff for task retries (2 min → 4 min → 6 min) for more predictable retry timing.

Added a test to verify tasks correctly transition to FAILURE after exhausting all retry attempts.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
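To make the off-by-one concrete, here is a minimal sketch (not the actual ADIT code) that evaluates both conditions across Procrastinate's 0-indexed `attempts` values:

```python
# Simulates the retry condition for max_attempts = 3, where Procrastinate's
# `attempts` counts *previous* attempts (0 on the first run, 2 on the third).
MAX_ATTEMPTS = 3

def should_retry_old(attempts: int) -> bool:
    return attempts < MAX_ATTEMPTS      # buggy: still retries on the final run

def should_retry_new(attempts: int) -> bool:
    return attempts + 1 < MAX_ATTEMPTS  # fixed: final run transitions to FAILURE

for attempts in range(MAX_ATTEMPTS):
    print(attempts, should_retry_old(attempts), should_retry_new(attempts))
# On the final run (attempts=2): old -> True (stuck PENDING), new -> False
```

On the final run the old condition reports True, leaving the task marked "will be retried" with no retry ever scheduled to complete it.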
**Summary of Changes**

Hello @samuelvkwong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request improves the reliability and predictability of task processing by addressing issues in the retry mechanism. It transitions from an exponential to a linear backoff strategy for task retries, preventing tasks from becoming stuck in a PENDING state due to excessively long delays. Additionally, it corrects an off-by-one error in the retry attempt tracking, ensuring that tasks reliably transition to a FAILURE state once their maximum retry attempts are exhausted.
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
**📝 Walkthrough**

The diff changes DICOM task retry behavior from an exponential backoff to a linear backoff (base wait + per-attempt increment), updates related settings/constants, adjusts the task retry conditional to shift when the final attempt is detected, and adds a test asserting failure after exhausting retries.
**PR Review: Fix task retry causing stuck PENDING status**

**Summary**

This PR addresses a critical bug in task retry logic and improves retry timing behavior. The changes are well thought out and properly tested.

**✅ Strengths**

**1. Excellent bug fix: off-by-one error**

The fix in

The inline comment explaining this logic is excellent and will help future maintainers.

**2. Improved retry strategy**

Switching from exponential to linear backoff is a practical improvement:

**3. Comprehensive test coverage**

The new test

**4. Documentation updates**
**🔍 Minor Observations**

**1. Settings variable naming**

The new settings variables in

```python
DICOM_TASK_RETRY_WAIT = 120   # 2 minute base wait
DICOM_TASK_LINEAR_WAIT = 120  # Add 2 minutes per retry
```

However, you might consider whether

**2. Test assertion order**

In the new test (lines 206-213), consider asserting the DicomTask status before the DicomJob status for better locality:

```python
# Current order:
dicom_job.refresh_from_db()
assert dicom_job.status == DicomJob.Status.FAILURE
dicom_task.refresh_from_db()
assert dicom_task.status == DicomTask.Status.FAILURE

# Suggested order:
dicom_task.refresh_from_db()
assert dicom_task.status == DicomTask.Status.FAILURE
assert dicom_task.message == error_message
assert "Task failed, but will be retried" not in dicom_task.message
dicom_job.refresh_from_db()
assert dicom_job.status == DicomJob.Status.FAILURE
```

This groups all task assertions together and all job assertions together, improving readability. Again, this is a minor style preference.

**🎯 Code Quality Assessment**
**📊 Impact Analysis**

Positive impact:

Risk assessment: Low

**✨ Recommendation**

APPROVE - This is a high-quality fix that addresses a real bug with proper testing and documentation. The code changes are minimal, focused, and correct. The only suggestions above are minor style preferences that don't affect functionality. Great work on identifying and fixing this issue!

Review performed by Claude Code
Code Review
This pull request effectively addresses the issue of tasks getting stuck in a PENDING state due to long exponential backoff periods. The switch to a linear backoff strategy is a sensible improvement for more predictable and manageable retry intervals. The fix for the off-by-one error in the retry attempt condition is also crucial and correctly implemented. The addition of a new test case to verify this fix is a great inclusion.
I've provided a couple of suggestions to improve the readability of the retry logic and the robustness of the new test case. Overall, this is a solid contribution that improves the reliability of the task processing system.
Actionable comments posted: 1

**🤖 Fix all issues with AI agents**

In `@adit/core/tests/test_tasks.py`:

- Around lines 188-193: The local test stub for `ExampleProcessor.process` declares an unused parameter named `self`, causing Ruff ARG001. Update the mock function signature from `def process(self):` to `def process(_self):` (or another unused name starting with an underscore) and patch `ExampleProcessor.process` with the renamed function so the test still raises `RetriableDicomError` as before.
**🧹 Nitpick comments (1)**

`adit/core/tests/test_tasks.py` (1)

**195-202: Target the queued job when forcing the final attempt.**

This avoids unintentionally mutating other queued jobs in the test DB.

♻️ Suggested refinement:

```diff
- ProcrastinateJob.objects.filter(status="todo").update(attempts=final_attempt)
+ queued_job_id = dicom_task.queued_job_id
+ assert queued_job_id is not None
+ ProcrastinateJob.objects.filter(id=queued_job_id).update(attempts=final_attempt)
```
```python
error_message = "Connection refused by server"

def process(self):
    raise RetriableDicomError(error_message)

mocker.patch.object(ExampleProcessor, "process", process)
```
**🧩 Analysis chain**

🏁 Script executed:

```shell
# First, let's check the file structure and verify the test file exists
fd test_tasks.py
```

Repository: openradx/adit — Length of output: 86

🏁 Script executed:

```shell
# Read the specific lines mentioned in the review (188-193 plus context)
head -193 adit/core/tests/test_tasks.py | tail -30
```

Repository: openradx/adit — Length of output: 1206

🏁 Script executed:

```shell
# Run Ruff on the test file to see what issues it flags
rg --version && ruff check adit/core/tests/test_tasks.py --select ARG
```

Repository: openradx/adit — Length of output: 2202
**Fix unused `self` argument to satisfy Ruff ARG001.**

The local `process()` function at line 190 declares a `self` parameter that is never used. Rename it to `_self` to follow Python conventions for intentionally unused parameters.

```diff
 error_message = "Connection refused by server"

-def process(self):
+def process(_self):
     raise RetriableDicomError(error_message)
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

```python
error_message = "Connection refused by server"

def process(self):
    raise RetriableDicomError(error_message)

mocker.patch.object(ExampleProcessor, "process", process)
```

After:

```python
error_message = "Connection refused by server"

def process(_self):
    raise RetriableDicomError(error_message)

mocker.patch.object(ExampleProcessor, "process", process)
```
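As background for why the stub needs a first positional parameter at all: a plain function assigned onto a class behaves as a method, so Python passes the instance as the first argument at call time. A minimal self-contained sketch, using a stand-in class and `RuntimeError` in place of ADIT's `RetriableDicomError`:

```python
# Sketch: why the patched-in stub needs a first positional parameter.
# A plain function assigned to a class attribute acts as a method, so the
# instance is passed as the first argument; `_self` marks it as unused.
class ExampleProcessor:  # stand-in for the real ADIT processor class
    def process(self):
        return "processed"

def process(_self):
    raise RuntimeError("Connection refused by server")  # stand-in error type

ExampleProcessor.process = process  # roughly what mocker.patch.object does

try:
    ExampleProcessor().process()
except RuntimeError as exc:
    print(exc)  # prints: Connection refused by server
```

With `def process():` (no parameter) the call would instead fail with a `TypeError` about an unexpected positional argument, so `_self` is the minimal correct signature.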
**🧰 Tools**

🪛 Ruff (0.14.13)

190-190: Unused function argument: `self` (ARG001)
**🤖 Prompt for AI Agents**

In `@adit/core/tests/test_tasks.py` around lines 188-193: the local test stub for `ExampleProcessor.process` declares an unused parameter named `self`, causing Ruff ARG001. Update the mock function signature from `def process(self):` to `def process(_self):` (or another unused name starting with an underscore) and patch `ExampleProcessor.process` with the renamed function so the test still raises `RetriableDicomError` as before.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
**Pull Request Review: Fix Task Retry Logic**

**Summary**

This PR addresses a critical bug where tasks would get stuck in PENDING status due to an off-by-one error in retry logic, and improves retry timing by switching from exponential to linear backoff. Overall, this is a well-implemented fix with excellent test coverage and documentation.

**✅ Strengths**

**1. Correct off-by-one fix**

The core fix correctly handles Procrastinate's 0-indexed `attempts` counter:

```python
# Old: context.job.attempts < settings.DICOM_TASK_MAX_ATTEMPTS
# New: context.job.attempts + 1 < settings.DICOM_TASK_MAX_ATTEMPTS
```

Analysis: With
This correctly ensures tasks transition to FAILURE on the final attempt rather than staying PENDING with a "will be retried" message.

**2. Improved retry timing**

Switching to linear backoff (

Rationale: Since Stamina already handles fast retries (5-10 attempts), Procrastinate retries target longer-term issues (server restarts, maintenance). Linear backoff provides predictable timing without excessive delays.

**3. Excellent test coverage**

The new test

**4. Thorough documentation**
**🔍 Code Quality Observations**

**Minor: formula documentation clarity**

In

```python
# Formula: wait + linear_wait * attempts
```

Note: This describes Procrastinate's formula, where

The comment is correct but could optionally clarify that

**Code style: follows project standards ✓**
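To make the formula concrete, here is a small sketch (not ADIT code) that evaluates `wait + linear_wait * attempts` with the PR's settings values, reproducing the 2 min → 4 min → 6 min schedule described in the PR:

```python
# Evaluates the linear backoff formula from the settings comment with the
# PR's values (DICOM_TASK_RETRY_WAIT = 120, DICOM_TASK_LINEAR_WAIT = 120).
WAIT = 120          # 2 minute base wait, in seconds
LINEAR_WAIT = 120   # add 2 minutes per previous attempt

def retry_delay(attempts: int) -> int:
    # `attempts` counts previous attempts (0-indexed, as Procrastinate reports)
    return WAIT + LINEAR_WAIT * attempts

print([retry_delay(a) // 60 for a in range(3)])  # [2, 4, 6] minutes
```

Compared to an exponential schedule, the delay grows by a fixed 2 minutes per attempt, which is what keeps the total retry window short and predictable.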
**🔒 Security Analysis**

No security concerns identified. Changes are limited to:

No changes to authentication, authorization, data validation, or external interfaces.

**⚡ Performance Considerations**

Positive impact:

Neutral impact:

**🧪 Testing Recommendations**

**Current coverage: excellent ✓**

The new test covers the critical path (final retry → FAILURE transition).

**Optional: additional edge cases**

Consider testing these scenarios in future work (not blocking for this PR):
**📋 Checklist Review**

Based on the PR description:

**🎯 Final Recommendation**

APPROVE ✅

This PR successfully fixes a critical bug that could cause tasks to get stuck indefinitely. The implementation is correct, well-tested, and thoroughly documented. The switch to linear backoff is a sensible architectural improvement.

**Suggested action**

Merge after confirming CI passes. No changes required.

**📚 References**

Great work on identifying and fixing this subtle off-by-one error! The detailed commit messages and test documentation make the fix easy to understand and maintain.
**Summary**

**Test plan**

- `uv run cli test -- adit/core/tests/test_tasks.py -v`: `test_process_dicom_task_transitions_to_failure_after_max_retries` verifies the fix
- `uv run cli lint`

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Behavioral Changes
Tests