fix: prevent stale RQ STARTED from overwriting watchdog FAILURE by cau-git · Pull Request #523 · docling-project/docling-serve

cau-git · 2026-02-25T16:23:19Z

Guard _get_task_from_rq_direct() against consulting RQ when the task is already in a terminal state. Without this, a watchdog-published FAILURE was silently overwritten on every subsequent poll because the temp-task swap pattern bypassed _update_task_from_rq()'s own is_completed() guard.

Also pins docling-jobkit to cau/rq-worker-add-heartbeat for the worker heartbeat thread and orchestrator watchdog task.

See docling-project/docling-jobkit#102
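
The guard described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the `Task` and `TaskStatus` types, the `rq_lookup` callable, and the exact set of states treated as terminal are assumptions made for illustration. Only the idea (return early when the task is already completed, before RQ is consulted) comes from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class TaskStatus(str, Enum):
    PENDING = "pending"
    STARTED = "started"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class Task:
    task_id: str
    task_status: TaskStatus

    def is_completed(self) -> bool:
        # Assumption: SUCCESS and FAILURE are the terminal states.
        return self.task_status in (TaskStatus.SUCCESS, TaskStatus.FAILURE)


def get_task_from_rq_direct(
    tasks: dict[str, Task],
    task_id: str,
    rq_lookup: Callable[[str], Optional[TaskStatus]],
) -> Optional[Task]:
    """Return the task, consulting RQ only when no terminal state is known."""
    known = tasks.get(task_id)
    if known is not None and known.is_completed():
        # Guard: a terminal state (e.g. a watchdog-published FAILURE)
        # is returned as-is and RQ is never asked.
        return known

    status = rq_lookup(task_id)  # hypothetical stand-in for the RQ query
    if status is None:
        return known
    task = Task(task_id=task_id, task_status=status)
    tasks[task_id] = task
    return task
```

Without the early return, a stale `STARTED` reported by `rq_lookup` would replace the stored `FAILURE` on every poll, which is the behaviour this PR fixes.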

Guard _get_task_from_rq_direct() against consulting RQ when the task is already in a terminal state. Without this, a watchdog-published FAILURE was silently overwritten on every subsequent poll because the temp-task swap pattern bypassed _update_task_from_rq()'s own is_completed() guard. Also pins docling-jobkit to cau/rq-worker-add-heartbeat for the worker heartbeat thread and orchestrator watchdog task. Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

github-actions · 2026-02-25T16:23:31Z

✅ DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

mergify · 2026-02-25T16:23:54Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

…before RQ

Override _on_task_status_changed() to write every pub/sub status update (including watchdog-published FAILURE) to Redis immediately, making terminal states durable across pod restarts and pub/sub delivery gaps.

Add a Redis terminal-state check at the top of task_status() before querying RQ, preventing stale STARTED entries in RQ's StartedJobRegistry from overwriting a watchdog-published FAILURE on fresh pods.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
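
As a rough sketch of the two changes in this commit message, the example below persists every status update as it arrives and checks the persisted state before asking RQ. The `StatusStore` class is an in-memory stand-in for Redis, and the key layout, function signatures, and status strings are hypothetical; none of them are taken from docling-serve.

```python
import json
from typing import Callable, Optional

# Assumption: these status strings mark terminal states in this sketch.
TERMINAL = {"success", "failure"}


class StatusStore:
    """In-memory stand-in for Redis, used so the sketch runs on its own."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def meta_key(task_id: str) -> str:
    # Hypothetical key layout, chosen only for this example.
    return f"example:tasks:{task_id}:metadata"


def on_task_status_changed(
    store: StatusStore,
    task_id: str,
    status: str,
    error_message: Optional[str] = None,
) -> None:
    """Write every pub/sub status update to the store immediately."""
    payload = {"status": status, "error_message": error_message}
    store.set(meta_key(task_id), json.dumps(payload))


def task_status(
    store: StatusStore,
    task_id: str,
    rq_status: Callable[[str], str],
) -> str:
    """Check the persisted terminal state first, then fall back to RQ."""
    raw = store.get(meta_key(task_id))
    if raw is not None:
        meta = json.loads(raw)
        if meta["status"] in TERMINAL:
            return meta["status"]  # durable terminal state wins
    return rq_status(task_id)  # hypothetical stand-in for the RQ query
```

Because the terminal state lives in the shared store rather than in process memory, a freshly started pod that never received the pub/sub message still sees the `failure` entry.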

…t result-key cache

- Use deterministic `results_prefix:{task_id}` lookup in `task_result()` to make result retrieval cross-pod safe.
- Check Redis terminal state first, then verify RQ job existence before caching to preserve cleanup semantics.
- Persist/hydrate task lifecycle timestamps in Redis metadata; restore `processing_meta` defaults on read.
- Enable `StartedJobRegistry` cleanup in watchdog scans.
- Remove redundant `docling:tasks:{task_id}:result_key` read/write paths in `docling-serve`.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
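
The first bullet can be illustrated with a small sketch, assuming a plain dictionary in place of Redis and a made-up prefix value. Since the key is derived only from the prefix and the task ID, any pod can compute it without a per-task pointer such as the removed `result_key` entry.

```python
from typing import Optional


def result_key(results_prefix: str, task_id: str) -> str:
    """Build the result key from the prefix and task ID alone."""
    return f"{results_prefix}:{task_id}"


def task_result(
    store: dict[str, bytes],
    results_prefix: str,
    task_id: str,
) -> Optional[bytes]:
    """Look up a result by its deterministic key; None if absent."""
    return store.get(result_key(results_prefix, task_id))


# Usage: the pod that wrote the result and the pod that reads it
# compute the same key independently ("example:results" is hypothetical).
shared_store = {result_key("example:results", "abc123"): b"converted-document"}
payload = task_result(shared_store, "example:results", "abc123")
```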

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

…cau/rq-worker-add-heartbeat

dosubot · 2026-03-03T13:16:22Z

Documentation Updates

1 document(s) were updated by changes in this PR:

How can I use Docling's REST API to convert a PDF from a URL to Markdown with OpenAI-generated image descriptions, and is base64 encoding required for URL sources?


@@ -71,7 +71,7 @@
 
 When using asynchronous conversion endpoints, you can poll the task status endpoint to track conversion progress. The task status response provides information about the current state of your conversion task.
 
-Docling-serve includes automatic zombie task reconciliation that prevents infinite polling loops. When an RQ job expires or is lost (due to Redis eviction, worker restart, etc.) while metadata persists, the system automatically detects this condition and reconciles the state, marking orphaned tasks as `FAILURE` with a descriptive error message.
+Docling-serve includes automatic zombie task reconciliation and watchdog-based failure detection that prevents infinite polling loops. When an RQ job expires or is lost (due to Redis eviction, worker restart, etc.) while metadata persists, the system automatically detects this condition and reconciles the state, marking orphaned tasks as `FAILURE` with a descriptive error message.
 
 ### Task Status Response
 
@@ -81,6 +81,26 @@
 - `task_position`: Queue position (if still queued)
 - `task_meta`: Processing metadata (document counts, progress)
 - `error_message`: Descriptive error information when tasks fail (available when `task_status` is `FAILURE`)
+
+### Task Status Resolution
+
+The system resolves task status using a three-step process designed to prevent stale RQ statuses from overwriting watchdog-published terminal states:
+
+1. **Redis terminal-state gate**: Redis is checked first for terminal states (`SUCCESS`, `FAILURE`, `CANCELLED`). If a terminal state is found in Redis, it is returned immediately without consulting RQ. This prevents stale RQ `STARTED` statuses (which can persist up to 4 hours after a worker kill) from overwriting watchdog-published `FAILURE` states.
+
+2. **RQ authoritative check**: RQ is consulted only for tasks not already in a terminal state. RQ provides the authoritative status for actively running jobs. Returns a `Task` object when the job is non-`PENDING`, `None` for `PENDING` jobs, or a special marker when the job is not found (NoSuchJobError).
+
+3. **Redis fallback**: Reached only when RQ had no useful answer (e.g., job is `PENDING` or has expired). Handles job-gone reconciliation and stale-status cross-checks using the same Redis key as step 1, but serving a different role in the resolution process.
+
+### Watchdog Failure Detection
+
+Docling-serve includes a watchdog mechanism (part of the docling-jobkit worker heartbeat thread and orchestrator watchdog task) that monitors worker health:
+
+- **Heartbeat monitoring**: Workers send periodic heartbeat signals to indicate they are actively processing tasks
+- **Automatic failure marking**: When a worker's heartbeat expires beyond the grace period, the watchdog automatically publishes a `FAILURE` state to Redis
+- **Terminal state protection**: Watchdog-published `FAILURE` states are authoritative and protected from being overwritten by stale RQ statuses through the terminal-state gate (step 1 above)
+
+This ensures that tasks whose workers have died or become unresponsive are promptly marked as failed, even if the RQ job metadata still shows `STARTED`.
 
 ### Automatic Zombie Task Detection
 
@@ -95,6 +115,7 @@
 
 The `error_message` field provides detailed diagnostic information when tasks fail, helping you understand what went wrong:
 
+- **Watchdog failures**: Tasks marked as `FAILURE` by the watchdog mechanism when worker heartbeats expire, indicating the worker died or became unresponsive during processing
 - **Task orphaning**: Automatically detected when RQ jobs expire while tasks show `PENDING` or `STARTED` status, with error messages indicating the likely cause (worker restart or Redis eviction)
 - **Infrastructure issues**: Details about worker restarts, Redis connection problems, or job TTL expiry
 - **General failures**: Descriptive error messages for task processing failures
@@ -102,8 +123,8 @@
 ### Best Practices for Async Conversions
 
 1. **Poll for status**: Regularly check the task status endpoint to monitor conversion progress
-2. **Check error messages**: When a task returns `FAILURE` status, inspect the `error_message` field for diagnostic details
-3. **Automatic orphaned task handling**: The system automatically detects and marks orphaned tasks as `FAILURE` when RQ jobs expire, preventing infinite polling loops
+2. **Check error messages**: When a task returns `FAILURE` status, inspect the `error_message` field for diagnostic details, including watchdog-detected worker failures
+3. **Terminal state reliability**: Once a task reaches a terminal state (`SUCCESS`, `FAILURE`, `CANCELLED`), the status is authoritative and will not change, even if stale data exists in RQ
 4. **Implement timeouts**: Set reasonable polling timeouts to avoid indefinite waiting
 5. **Retry logic**: For infrastructure-related failures (indicated in `error_message`), consider implementing retry logic
 
@@ -152,4 +173,4 @@
 ## Summary
 To generate Markdown with dynamic image alt text, set `image_alt_mode` to `static`, `caption`, or `description` in your API payload. For AI-generated alt text, use `description` and configure the OpenAI API options. Base64 encoding is only needed for direct file uploads, not for URLs.
 
-For asynchronous conversions, poll the task status endpoint to monitor progress. The system automatically detects and reconciles zombie/orphaned tasks, preventing infinite polling loops. Check the `error_message` field when tasks fail to diagnose issues such as task orphaning or infrastructure problems.
+For asynchronous conversions, poll the task status endpoint to monitor progress. The system uses a three-step status resolution process that protects terminal states from being overwritten by stale RQ data. Watchdog-based failure detection automatically marks tasks as failed when worker heartbeats expire, and zombie task reconciliation handles orphaned tasks. Check the `error_message` field when tasks fail to diagnose issues such as watchdog-detected worker failures, task orphaning, or infrastructure problems.
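
The three-step status resolution described in the documentation diff above can be sketched as a single function. This is a simplified illustration under stated assumptions: the `Status` enum, the `JOB_NOT_FOUND` marker, the `resolve_status` signature, and the error text are hypothetical, and the real stale-status cross-checks in step 3 are reduced to one job-gone case.

```python
from enum import Enum
from typing import Callable, Optional


class Status(str, Enum):
    PENDING = "pending"
    STARTED = "started"
    SUCCESS = "success"
    FAILURE = "failure"
    CANCELLED = "cancelled"


TERMINAL = {Status.SUCCESS, Status.FAILURE, Status.CANCELLED}

# Hypothetical marker for "RQ has no such job" (the NoSuchJobError case).
JOB_NOT_FOUND = object()


def resolve_status(
    redis_status: Optional[Status],
    rq_lookup: Callable[[], object],
) -> tuple[Optional[Status], Optional[str]]:
    """Return (status, error_message) following the three documented steps."""
    # Step 1: Redis terminal-state gate. RQ is not consulted at all.
    if redis_status in TERMINAL:
        return redis_status, None

    # Step 2: RQ authoritative check. The lookup yields a Status for
    # non-PENDING jobs, None for PENDING jobs, or JOB_NOT_FOUND.
    rq_answer = rq_lookup()
    if isinstance(rq_answer, Status):
        return rq_answer, None

    # Step 3: Redis fallback. If the job is gone while Redis still shows
    # an active state, reconcile the orphaned task to FAILURE.
    if rq_answer is JOB_NOT_FOUND and redis_status in (Status.PENDING, Status.STARTED):
        return Status.FAILURE, "RQ job not found; task was orphaned"
    return redis_status, None
```

Ordering is the key design choice: because the gate runs before the RQ lookup, a stale `STARTED` in RQ can never reach the caller once Redis holds a terminal state.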


Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

cau-git added 4 commits March 3, 2026 12:31

Clean up source dependencies

6f095b2

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>

Merge branch 'main' of github.com:docling-project/docling-serve into …

be44a77

…cau/rq-worker-add-heartbeat

cau-git marked this pull request as ready for review March 3, 2026 13:12

dolfim-ibm added 2 commits March 3, 2026 14:28

restore main files

3c9ad9a

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

restore main

bea0116

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

cau-git requested a review from dolfim-ibm March 3, 2026 13:31

dolfim-ibm approved these changes Mar 3, 2026


dolfim-ibm merged commit f4c42f4 into main Mar 3, 2026
10 checks passed

dolfim-ibm deleted the cau/rq-worker-add-heartbeat branch March 3, 2026 14:39

dolfim-ibm merged 7 commits into main from cau/rq-worker-add-heartbeat
