fix: prevent stale RQ STARTED from overwriting watchdog FAILURE #523
Merged
dolfim-ibm merged 7 commits into main on Mar 3, 2026
Conversation
Guard _get_task_from_rq_direct() against consulting RQ when the task is already in a terminal state. Without this, a watchdog-published FAILURE was silently overwritten on every subsequent poll because the temp-task swap pattern bypassed _update_task_from_rq()'s own is_completed() guard.

Also pins docling-jobkit to cau/rq-worker-add-heartbeat for the worker heartbeat thread and orchestrator watchdog task.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
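To make the failure mode concrete, here is a minimal, self-contained sketch of how a temp-task swap can bypass an `is_completed()` guard. The `Task` class, the `rq_status` and `tasks` dictionaries, and both `get_task_*` functions are hypothetical stand-ins, not the actual docling-jobkit or docling-serve code; only the names `_update_task_from_rq()` and `is_completed()` come from the description above.

```python
from dataclasses import dataclass

# Assumed set of terminal states for this illustration.
TERMINAL_STATES = {"SUCCESS", "FAILURE"}


@dataclass
class Task:
    task_id: str
    status: str = "PENDING"

    def is_completed(self) -> bool:
        return self.status in TERMINAL_STATES


# Stand-in for what RQ reports; a killed worker can leave a stale STARTED.
rq_status = {"t1": "STARTED"}
# Stand-in for the orchestrator's task store; the watchdog already failed t1.
tasks = {"t1": Task("t1", status="FAILURE")}


def _update_task_from_rq(task: Task) -> None:
    """Refresh a task from RQ unless it is already in a terminal state."""
    if task.is_completed():
        return
    task.status = rq_status[task.task_id]


def get_task_unguarded(task_id: str) -> Task:
    # Temp-task swap: the fresh temp task is never "completed", so the guard
    # inside _update_task_from_rq() cannot protect the stored FAILURE.
    temp = Task(task_id)
    _update_task_from_rq(temp)
    tasks[task_id] = temp
    return temp


def get_task_guarded(task_id: str) -> Task:
    # The idea of the fix: check the stored task before consulting RQ at all.
    stored = tasks.get(task_id)
    if stored is not None and stored.is_completed():
        return stored
    return get_task_unguarded(task_id)
```

In this model, `get_task_guarded("t1")` keeps returning the stored `FAILURE`, while `get_task_unguarded("t1")` replaces it with the stale `STARTED` on the first poll.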
✅ DCO Check Passed
Thanks @cau-git, all your commits are properly signed off. 🎉

Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
…before RQ

Override _on_task_status_changed() to write every pub/sub status update (including watchdog-published FAILURE) to Redis immediately, making terminal states durable across pod restarts and pub/sub delivery gaps.

Add a Redis terminal-state check at the top of task_status() before querying RQ, preventing stale STARTED entries in RQ's StartedJobRegistry from overwriting a watchdog-published FAILURE on fresh pods.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
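The write-through behaviour described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the real orchestrator: `FakeRedis` is an in-memory stand-in for the shared Redis instance, `Pod` and the `example:tasks:{task_id}:status` key layout are hypothetical, and RQ is modelled as a plain dictionary. Only `_on_task_status_changed()` and `task_status()` are names taken from the commit message.

```python
from typing import Optional

TERMINAL_STATES = {"SUCCESS", "FAILURE", "CANCELLED"}


class FakeRedis:
    """In-memory stand-in for the Redis instance shared by all pods."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class Pod:
    """One API pod: local state is lost on restart, Redis is shared."""

    def __init__(self, redis: FakeRedis, rq_status: dict[str, str]) -> None:
        self.redis = redis
        self.rq_status = rq_status  # stand-in for querying RQ

    def _status_key(self, task_id: str) -> str:
        return f"example:tasks:{task_id}:status"  # hypothetical key layout

    def _on_task_status_changed(self, task_id: str, status: str) -> None:
        # Persist every pub/sub update immediately so it survives restarts.
        self.redis.set(self._status_key(task_id), status)

    def task_status(self, task_id: str) -> Optional[str]:
        persisted = self.redis.get(self._status_key(task_id))
        if persisted in TERMINAL_STATES:
            return persisted  # terminal state wins; RQ is not consulted
        return self.rq_status.get(task_id, persisted)


shared_redis = FakeRedis()
stale_rq = {"t1": "STARTED", "t2": "STARTED"}

# The pod that received the watchdog's FAILURE persists it ...
old_pod = Pod(shared_redis, stale_rq)
old_pod._on_task_status_changed("t1", "FAILURE")

# ... and a fresh pod, which never saw the pub/sub message, still reports it.
fresh_pod = Pod(shared_redis, stale_rq)
```

Because the status lives in the shared store rather than in pod memory, `fresh_pod.task_status("t1")` returns `FAILURE` even though the RQ stand-in still says `STARTED`; a task with no persisted terminal state (`"t2"`) falls through to RQ as before.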
…t result-key cache
- Use deterministic `results_prefix:{task_id}` lookup in `task_result()` to make result retrieval cross-pod safe.
- Check Redis terminal state first, then verify RQ job existence before caching to preserve cleanup semantics.
- Persist/hydrate task lifecycle timestamps in Redis metadata; restore `processing_meta` defaults on read.
- Enable `StartedJobRegistry` cleanup in watchdog scans.
- Remove redundant `docling:tasks:{task_id}:result_key` read/write paths in `docling-serve`.
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
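A deterministic result key means any pod can compute where a result lives from the task id alone, without a separately stored pointer such as the removed `docling:tasks:{task_id}:result_key` entry. The sketch below assumes a dictionary in place of Redis and a made-up prefix value (`example:results`); the real prefix and the surrounding `task_result()` logic are not shown in this PR description.

```python
from typing import Optional


def result_key(results_prefix: str, task_id: str) -> str:
    """Derive the result key from the task id alone."""
    return f"{results_prefix}:{task_id}"


def task_result(
    store: dict[str, bytes], results_prefix: str, task_id: str
) -> Optional[bytes]:
    """Look up a result in `store`, a dict standing in for Redis."""
    return store.get(result_key(results_prefix, task_id))


# Hypothetical data: one finished task whose result was written by a worker.
store = {"example:results:abc": b"payload"}
```

Any pod calling `task_result(store, "example:results", "abc")` finds the payload, regardless of which pod originally accepted the task.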
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
…cau/rq-worker-add-heartbeat
Documentation Updates

1 document(s) were updated by changes in this PR:

How can I use Docling's REST API to convert a PDF from a URL to Markdown with OpenAI-generated image descriptions, and is base64 encoding required for URL sources?

@@ -71,7 +71,7 @@
When using asynchronous conversion endpoints, you can poll the task status endpoint to track conversion progress. The task status response provides information about the current state of your conversion task.
-Docling-serve includes automatic zombie task reconciliation that prevents infinite polling loops. When an RQ job expires or is lost (due to Redis eviction, worker restart, etc.) while metadata persists, the system automatically detects this condition and reconciles the state, marking orphaned tasks as `FAILURE` with a descriptive error message.
+Docling-serve includes automatic zombie task reconciliation and watchdog-based failure detection, which together prevent infinite polling loops. When an RQ job expires or is lost (due to Redis eviction, worker restart, etc.) while metadata persists, the system automatically detects this condition and reconciles the state, marking orphaned tasks as `FAILURE` with a descriptive error message.
### Task Status Response
@@ -81,6 +81,26 @@
- `task_position`: Queue position (if still queued)
- `task_meta`: Processing metadata (document counts, progress)
- `error_message`: Descriptive error information when tasks fail (available when `task_status` is `FAILURE`)
+
+### Task Status Resolution
+
+The system resolves task status using a three-step process designed to prevent stale RQ statuses from overwriting watchdog-published terminal states:
+
+1. **Redis terminal-state gate**: Redis is checked first for terminal states (`SUCCESS`, `FAILURE`, `CANCELLED`). If a terminal state is found in Redis, it is returned immediately without consulting RQ. This prevents stale RQ `STARTED` statuses (which can persist up to 4 hours after a worker kill) from overwriting watchdog-published `FAILURE` states.
+
+2. **RQ authoritative check**: RQ is consulted only for tasks not already in a terminal state. RQ provides the authoritative status for actively running jobs. Returns a `Task` object when the job is non-`PENDING`, `None` for `PENDING` jobs, or a special marker when the job is not found (NoSuchJobError).
+
+3. **Redis fallback**: Reached only when RQ had no useful answer (e.g., job is `PENDING` or has expired). Handles job-gone reconciliation and stale-status cross-checks using the same Redis key as step 1, but serving a different role in the resolution process.
+
+### Watchdog Failure Detection
+
+Docling-serve includes a watchdog mechanism (part of the docling-jobkit worker heartbeat thread and orchestrator watchdog task) that monitors worker health:
+
+- **Heartbeat monitoring**: Workers send periodic heartbeat signals to indicate they are actively processing tasks
+- **Automatic failure marking**: When a worker's heartbeat expires beyond the grace period, the watchdog automatically publishes a `FAILURE` state to Redis
+- **Terminal state protection**: Watchdog-published `FAILURE` states are authoritative and protected from being overwritten by stale RQ statuses through the terminal-state gate (step 1 above)
+
+This ensures that tasks whose workers have died or become unresponsive are promptly marked as failed, even if the RQ job metadata still shows `STARTED`.
### Automatic Zombie Task Detection
@@ -95,6 +115,7 @@
The `error_message` field provides detailed diagnostic information when tasks fail, helping you understand what went wrong:
+- **Watchdog failures**: Tasks marked as `FAILURE` by the watchdog mechanism when worker heartbeats expire, indicating the worker died or became unresponsive during processing
- **Task orphaning**: Automatically detected when RQ jobs expire while tasks show `PENDING` or `STARTED` status, with error messages indicating the likely cause (worker restart or Redis eviction)
- **Infrastructure issues**: Details about worker restarts, Redis connection problems, or job TTL expiry
- **General failures**: Descriptive error messages for task processing failures
@@ -102,8 +123,8 @@
### Best Practices for Async Conversions
1. **Poll for status**: Regularly check the task status endpoint to monitor conversion progress
-2. **Check error messages**: When a task returns `FAILURE` status, inspect the `error_message` field for diagnostic details
-3. **Automatic orphaned task handling**: The system automatically detects and marks orphaned tasks as `FAILURE` when RQ jobs expire, preventing infinite polling loops
+2. **Check error messages**: When a task returns `FAILURE` status, inspect the `error_message` field for diagnostic details, including watchdog-detected worker failures
+3. **Terminal state reliability**: Once a task reaches a terminal state (`SUCCESS`, `FAILURE`, `CANCELLED`), the status is authoritative and will not change, even if stale data exists in RQ
4. **Implement timeouts**: Set reasonable polling timeouts to avoid indefinite waiting
5. **Retry logic**: For infrastructure-related failures (indicated in `error_message`), consider implementing retry logic
@@ -152,4 +173,4 @@
## Summary
To generate Markdown with dynamic image alt text, set `image_alt_mode` to `static`, `caption`, or `description` in your API payload. For AI-generated alt text, use `description` and configure the OpenAI API options. Base64 encoding is only needed for direct file uploads, not for URLs.
-For asynchronous conversions, poll the task status endpoint to monitor progress. The system automatically detects and reconciles zombie/orphaned tasks, preventing infinite polling loops. Check the `error_message` field when tasks fail to diagnose issues such as task orphaning or infrastructure problems.
+For asynchronous conversions, poll the task status endpoint to monitor progress. The system uses a three-step status resolution process that protects terminal states from being overwritten by stale RQ data. Watchdog-based failure detection automatically marks tasks as failed when worker heartbeats expire, and zombie task reconciliation handles orphaned tasks. Check the `error_message` field when tasks fail to diagnose issues such as watchdog-detected worker failures, task orphaning, or infrastructure problems.
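The three-step status resolution described in the documentation diff above can be sketched roughly as follows. This is a simplified model under stated assumptions: `resolve_status`, `RqLookup`, and the dictionary standing in for Redis are hypothetical names, the RQ query is injected as a callable, and raising `KeyError` for a task unknown to both stores is an illustrative choice, not documented behaviour.

```python
from enum import Enum
from typing import Callable, Optional, Union

TERMINAL_STATES = {"SUCCESS", "FAILURE", "CANCELLED"}


class RqLookup(Enum):
    NOT_FOUND = "not_found"  # marker: RQ no longer knows this job


# RQ answer: a concrete status, None for PENDING, or the NOT_FOUND marker.
RqAnswer = Union[str, RqLookup, None]


def resolve_status(
    task_id: str,
    redis_status: dict[str, str],
    query_rq: Callable[[str], RqAnswer],
) -> str:
    # Step 1: Redis terminal-state gate. RQ is not consulted at all.
    persisted: Optional[str] = redis_status.get(task_id)
    if persisted in TERMINAL_STATES:
        return persisted

    # Step 2: RQ is authoritative for jobs that are actively running.
    rq_answer = query_rq(task_id)
    if isinstance(rq_answer, str):
        return rq_answer

    # Step 3: Redis fallback when RQ had no useful answer.
    if rq_answer is RqLookup.NOT_FOUND:
        if persisted in {"PENDING", "STARTED"}:
            return "FAILURE"  # job gone while metadata persists: orphaned
        raise KeyError(task_id)  # assumption: unknown everywhere
    return persisted or "PENDING"


def rq_must_not_be_called(task_id: str) -> RqAnswer:
    raise AssertionError("RQ was consulted for a terminal task")
```

Passing `rq_must_not_be_called` for a task whose persisted status is `FAILURE` shows that step 1 short-circuits before RQ is ever queried, which is the property that keeps a stale `STARTED` from resurfacing.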
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
dolfim-ibm approved these changes on Mar 3, 2026
See docling-project/docling-jobkit#102