fix: prevent stale RQ STARTED from overwriting watchdog FAILURE #523

Merged: dolfim-ibm merged 7 commits into main from cau/rq-worker-add-heartbeat on Mar 3, 2026

@cau-git cau-git commented Feb 25, 2026

Guard _get_task_from_rq_direct() against consulting RQ when the task is already in a terminal state. Without this, a watchdog-published FAILURE was silently overwritten on every subsequent poll because the temp-task swap pattern bypassed _update_task_from_rq()'s own is_completed() guard.

Also pins docling-jobkit to cau/rq-worker-add-heartbeat for the worker heartbeat thread and orchestrator watchdog task.

See docling-project/docling-jobkit#102
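
The guard described above can be illustrated with a minimal sketch. It is not the code from this PR: `TaskStatus`, `refresh_status`, and the `query_rq` callback are hypothetical names, and the set of terminal states is assumed from the documentation update further down.

```python
from enum import Enum
from typing import Callable, Optional


class TaskStatus(str, Enum):
    # Hypothetical status enum; the real one lives in docling-jobkit.
    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    CANCELLED = "CANCELLED"


TERMINAL = {TaskStatus.SUCCESS, TaskStatus.FAILURE, TaskStatus.CANCELLED}


def refresh_status(
    current: TaskStatus,
    query_rq: Callable[[], Optional[TaskStatus]],
) -> TaskStatus:
    """Return the status to report for one poll."""
    # Guard: a terminal status is final, so RQ is not consulted at all.
    if current in TERMINAL:
        return current
    # Non-terminal: take whatever RQ reports, if it reports anything.
    fresh = query_rq()
    return fresh if fresh is not None else current
```

Without the early return, a stale `STARTED` from `query_rq` would replace a watchdog-published `FAILURE` on every poll.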

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

github-actions bot commented Feb 25, 2026

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

mergify bot commented Feb 25, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
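
To check a title locally before pushing, the rule above can be reproduced with Python's `re` module. This is a sketch that assumes `~=` behaves like a regular-expression search; the helper name `is_conventional` is made up for illustration.

```python
import re

# Pattern copied from the merge protection rule above.
TITLE_RULE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """True when the title starts with a conventional-commit prefix."""
    return TITLE_RULE.search(title) is not None
```

The pattern is case-sensitive, so `Fix: ...` would not pass, while an optional scope and breaking-change marker such as `feat(api)!: ...` would.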

cau-git added 4 commits March 3, 2026 12:31
…before RQ

Override _on_task_status_changed() to write every pub/sub status update
(including watchdog-published FAILURE) to Redis immediately, making
terminal states durable across pod restarts and pub/sub delivery gaps.

Add a Redis terminal-state check at the top of task_status() before
querying RQ, preventing stale STARTED entries in RQ's StartedJobRegistry
from overwriting a watchdog-published FAILURE on fresh pods.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
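
A rough sketch of the idea in this commit, with an in-memory `StatusStore` standing in for Redis and a plain dict standing in for what RQ reports. All class and attribute names here are hypothetical and only the overall flow follows the commit message.

```python
from typing import Dict, Optional

TERMINAL = {"SUCCESS", "FAILURE", "CANCELLED"}


class StatusStore:
    """Stand-in for Redis: outlives any single orchestrator instance."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, task_id: str, status: str) -> None:
        self._data[task_id] = status

    def get(self, task_id: str) -> Optional[str]:
        return self._data.get(task_id)


class Orchestrator:
    def __init__(self, store: StatusStore, rq_view: Dict[str, str]) -> None:
        self.store = store
        self.rq_view = rq_view  # hypothetical snapshot of RQ job statuses

    def on_task_status_changed(self, task_id: str, status: str) -> None:
        # Persist every pub/sub update right away so it survives a restart.
        self.store.set(task_id, status)

    def task_status(self, task_id: str) -> Optional[str]:
        # Check the durable store first; a terminal state wins over RQ.
        stored = self.store.get(task_id)
        if stored in TERMINAL:
            return stored
        return self.rq_view.get(task_id, stored)
```

A second `Orchestrator` built on the same store models a fresh pod: it never saw the pub/sub message, yet it still reports the `FAILURE` that the first instance persisted.
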
…t result-key cache

- Use deterministic `results_prefix:{task_id}` lookup in `task_result()` to make result retrieval cross-pod safe.
- Check Redis terminal state first, then verify RQ job existence before caching to preserve cleanup semantics.
- Persist/hydrate task lifecycle timestamps in Redis metadata; restore `processing_meta` defaults on read.
- Enable `StartedJobRegistry` cleanup in watchdog scans.
- Remove redundant `docling:tasks:{task_id}:result_key` read/write paths in `docling-serve`.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
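
The deterministic lookup in the first bullet can be sketched as follows. The prefix value `docling:results` and both helper names are assumptions for illustration, and a dict stands in for Redis.

```python
from typing import Dict, Optional

RESULTS_PREFIX = "docling:results"  # assumed value, not taken from the PR


def result_key(task_id: str, prefix: str = RESULTS_PREFIX) -> str:
    """Build the result key from the task id alone."""
    return f"{prefix}:{task_id}"


def fetch_result(store: Dict[str, bytes], task_id: str) -> Optional[bytes]:
    # Any pod can compute the key, so no per-task pointer key is needed.
    return store.get(result_key(task_id))
```

Because the key depends only on the task id, a pod that did not run the task can still find its result, which is what makes a separate `result_key` entry redundant.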
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git marked this pull request as ready for review March 3, 2026 13:12

dosubot bot commented Mar 3, 2026

Documentation Updates

1 document(s) were updated by changes in this PR:

How can I use Docling's REST API to convert a PDF from a URL to Markdown with OpenAI-generated image descriptions, and is base64 encoding required for URL sources?
@@ -71,7 +71,7 @@
 
 When using asynchronous conversion endpoints, you can poll the task status endpoint to track conversion progress. The task status response provides information about the current state of your conversion task.
 
-Docling-serve includes automatic zombie task reconciliation that prevents infinite polling loops. When an RQ job expires or is lost (due to Redis eviction, worker restart, etc.) while metadata persists, the system automatically detects this condition and reconciles the state, marking orphaned tasks as `FAILURE` with a descriptive error message.
+Docling-serve includes automatic zombie task reconciliation and watchdog-based failure detection that prevents infinite polling loops. When an RQ job expires or is lost (due to Redis eviction, worker restart, etc.) while metadata persists, the system automatically detects this condition and reconciles the state, marking orphaned tasks as `FAILURE` with a descriptive error message.
 
 ### Task Status Response
 
@@ -81,6 +81,26 @@
 - `task_position`: Queue position (if still queued)
 - `task_meta`: Processing metadata (document counts, progress)
 - `error_message`: Descriptive error information when tasks fail (available when `task_status` is `FAILURE`)
+
+### Task Status Resolution
+
+The system resolves task status using a three-step process designed to prevent stale RQ statuses from overwriting watchdog-published terminal states:
+
+1. **Redis terminal-state gate**: Redis is checked first for terminal states (`SUCCESS`, `FAILURE`, `CANCELLED`). If a terminal state is found in Redis, it is returned immediately without consulting RQ. This prevents stale RQ `STARTED` statuses (which can persist up to 4 hours after a worker kill) from overwriting watchdog-published `FAILURE` states.
+
+2. **RQ authoritative check**: RQ is consulted only for tasks not already in a terminal state. RQ provides the authoritative status for actively running jobs. Returns a `Task` object when the job is non-`PENDING`, `None` for `PENDING` jobs, or a special marker when the job is not found (NoSuchJobError).
+
+3. **Redis fallback**: Reached only when RQ had no useful answer (e.g., job is `PENDING` or has expired). Handles job-gone reconciliation and stale-status cross-checks using the same Redis key as step 1, but serving a different role in the resolution process.
+
+### Watchdog Failure Detection
+
+Docling-serve includes a watchdog mechanism (part of the docling-jobkit worker heartbeat thread and orchestrator watchdog task) that monitors worker health:
+
+- **Heartbeat monitoring**: Workers send periodic heartbeat signals to indicate they are actively processing tasks
+- **Automatic failure marking**: When a worker's heartbeat expires beyond the grace period, the watchdog automatically publishes a `FAILURE` state to Redis
+- **Terminal state protection**: Watchdog-published `FAILURE` states are authoritative and protected from being overwritten by stale RQ statuses through the terminal-state gate (step 1 above)
+
+This ensures that tasks whose workers have died or become unresponsive are promptly marked as failed, even if the RQ job metadata still shows `STARTED`.
 
 ### Automatic Zombie Task Detection
 
@@ -95,6 +115,7 @@
 
 The `error_message` field provides detailed diagnostic information when tasks fail, helping you understand what went wrong:
 
+- **Watchdog failures**: Tasks marked as `FAILURE` by the watchdog mechanism when worker heartbeats expire, indicating the worker died or became unresponsive during processing
 - **Task orphaning**: Automatically detected when RQ jobs expire while tasks show `PENDING` or `STARTED` status, with error messages indicating the likely cause (worker restart or Redis eviction)
 - **Infrastructure issues**: Details about worker restarts, Redis connection problems, or job TTL expiry
 - **General failures**: Descriptive error messages for task processing failures
@@ -102,8 +123,8 @@
 ### Best Practices for Async Conversions
 
 1. **Poll for status**: Regularly check the task status endpoint to monitor conversion progress
-2. **Check error messages**: When a task returns `FAILURE` status, inspect the `error_message` field for diagnostic details
-3. **Automatic orphaned task handling**: The system automatically detects and marks orphaned tasks as `FAILURE` when RQ jobs expire, preventing infinite polling loops
+2. **Check error messages**: When a task returns `FAILURE` status, inspect the `error_message` field for diagnostic details, including watchdog-detected worker failures
+3. **Terminal state reliability**: Once a task reaches a terminal state (`SUCCESS`, `FAILURE`, `CANCELLED`), the status is authoritative and will not change, even if stale data exists in RQ
 4. **Implement timeouts**: Set reasonable polling timeouts to avoid indefinite waiting
 5. **Retry logic**: For infrastructure-related failures (indicated in `error_message`), consider implementing retry logic
 
@@ -152,4 +173,4 @@
 ## Summary
 To generate Markdown with dynamic image alt text, set `image_alt_mode` to `static`, `caption`, or `description` in your API payload. For AI-generated alt text, use `description` and configure the OpenAI API options. Base64 encoding is only needed for direct file uploads, not for URLs.
 
-For asynchronous conversions, poll the task status endpoint to monitor progress. The system automatically detects and reconciles zombie/orphaned tasks, preventing infinite polling loops. Check the `error_message` field when tasks fail to diagnose issues such as task orphaning or infrastructure problems.
+For asynchronous conversions, poll the task status endpoint to monitor progress. The system uses a three-step status resolution process that protects terminal states from being overwritten by stale RQ data. Watchdog-based failure detection automatically marks tasks as failed when worker heartbeats expire, and zombie task reconciliation handles orphaned tasks. Check the `error_message` field when tasks fail to diagnose issues such as watchdog-detected worker failures, task orphaning, or infrastructure problems.
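
The three-step resolution described in the documentation update can be sketched as below. This is an illustration under stated assumptions rather than the docling-serve implementation: two dicts stand in for Redis and RQ, and `JOB_GONE`, `rq_check`, and `resolve_status` are hypothetical names.

```python
from typing import Dict, Optional

TERMINAL = {"SUCCESS", "FAILURE", "CANCELLED"}
JOB_GONE = object()  # marker for "RQ no longer knows this job"


def rq_check(rq_jobs: Dict[str, str], task_id: str) -> object:
    """Return a status string, None for PENDING, or JOB_GONE."""
    if task_id not in rq_jobs:
        return JOB_GONE
    status = rq_jobs[task_id]
    return None if status == "PENDING" else status


def resolve_status(
    redis_status: Dict[str, str],
    rq_jobs: Dict[str, str],
    task_id: str,
) -> Optional[str]:
    stored = redis_status.get(task_id)
    # Step 1: terminal-state gate, RQ is never consulted.
    if stored in TERMINAL:
        return stored
    # Step 2: RQ is authoritative for jobs it is actively tracking.
    answer = rq_check(rq_jobs, task_id)
    if isinstance(answer, str):
        return answer
    # Step 3: fallback. The job is gone while metadata persists,
    # so the task is reconciled to FAILURE.
    if answer is JOB_GONE and stored is not None:
        redis_status[task_id] = "FAILURE"
        return "FAILURE"
    return stored
```

Note that the sketch reads the same stored value in steps 1 and 3, matching the description that both steps use the same Redis key in different roles.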

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
@cau-git cau-git requested a review from dolfim-ibm March 3, 2026 13:31
@dolfim-ibm dolfim-ibm merged commit f4c42f4 into main Mar 3, 2026
10 checks passed
@dolfim-ibm dolfim-ibm deleted the cau/rq-worker-add-heartbeat branch March 3, 2026 14:39
2 participants