
fix: recover orphaned job status when worker pod is killed mid-execution#102

Merged

cau-git merged 7 commits into main from cau/rq-worker-add-heartbeat on Mar 3, 2026
Conversation

@cau-git
Member

@cau-git cau-git commented Feb 25, 2026

When a SimpleWorker process is killed (OOMKill, node eviction, etc.), the in-flight job is permanently frozen in STARTED state because the main thread is blocked during conversion and RQ's built-in heartbeat stops being refreshed. The job previously remained stuck for 4 hours (the job timeout), with no pub/sub notification ever reaching polling or WebSocket clients.

Two components fix this:

  1. Per-job heartbeat thread in CustomRQWorker.perform_job(): a daemon thread (_heartbeat_loop) is started when perform_job() begins and stopped in the finally block. Every 20 s it writes docling:job:alive:{job_id} with a 60 s TTL to Redis via a dedicated connection. If the process is killed, the thread dies with it and the key expires naturally.

  2. Watchdog asyncio task in RQOrchestrator.process_queue(): _watchdog_task() runs alongside the existing pub/sub listener. Every 30 s it scans all STARTED tasks older than the 90 s grace period and checks whether their liveness key still exists. A missing key means the worker is dead: a FAILURE _TaskUpdate is published to the docling:updates channel, which is received by _listen_for_updates(), persisted, and delivered to WebSocket subscribers, reducing the detection window from 4 hours to ~90 s.

Note that this fix does not add retries: when a worker is killed mid-execution, the task is still lost, but the conversion is now reported as failed instead of lingering as stale.

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@github-actions
Contributor

github-actions bot commented Feb 25, 2026

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Feb 25, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@cau-git cau-git changed the title from "Cau/rq worker add heartbeat" to "fix: recover orphaned jobs when worker pod is killed mid-execution" on Feb 25, 2026
cau-git added 2 commits March 3, 2026 12:23
Watchdog now queries RQ's StartedJobRegistry instead of per-pod
self.tasks to detect orphaned jobs, closing the coverage gap when
the enqueue pod is recycled during rolling updates.

Also adds _on_task_status_changed() hook (no-op by default) called
after every pub/sub status update, allowing subclasses to persist
terminal states to durable storage.
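The registry-based scan described in this commit can be sketched as follows. Writing it against a registry-shaped object keeps the sketch self-contained; in practice one would pass something like `rq.registry.StartedJobRegistry("docling", connection=redis_conn)` (the queue name here is an assumption).

```python
def orphan_candidates(registry) -> list:
    """List job ids currently in the STARTED registry, cluster-wide.

    Unlike a pod-local ``self.tasks`` map, RQ's StartedJobRegistry lives in
    Redis, so jobs enqueued by a pod that has since been recycled are still
    visible to whichever pod runs the watchdog.
    """
    # Prune entries whose lease has expired before reading, mirroring the
    # commit's "cleanup enabled" behaviour during scans.
    registry.cleanup()
    return registry.get_job_ids()
```

The watchdog would then intersect these ids with its liveness checks instead of iterating over in-memory task state.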

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
…hdog

- Make `task_result()` use deterministic fallback key (`{results_prefix}:{task_id}`) when in-memory result-key cache is missing.
- Return `None` when result blob is absent and cache resolved key when present.
- Run watchdog `StartedJobRegistry.get_job_ids()` with cleanup enabled to clear abandoned STARTED entries during scans.
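The fallback-key logic from the first two bullets can be sketched like this. The `{results_prefix}:{task_id}` key shape comes from the commit message; the prefix value, the dict-based cache, and the function signature are illustrative assumptions, not the actual `task_result()` implementation.

```python
RESULTS_PREFIX = "docling:results"  # hypothetical value of results_prefix

def task_result(redis_conn, result_keys: dict, task_id: str):
    """Fetch a result blob, falling back to the deterministic key.

    The in-memory cache can be empty when the enqueue pod was recycled,
    so a deterministic key derived from the task id is tried instead.
    """
    key = result_keys.get(task_id)
    if key is None:
        key = f"{RESULTS_PREFIX}:{task_id}"  # deterministic fallback
    blob = redis_conn.get(key)
    if blob is None:
        return None                  # result absent: report nothing found
    result_keys[task_id] = key       # cache the resolved key when present
    return blob
```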

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@codecov

codecov bot commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 52.11268% with 34 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling_jobkit/orchestrators/rq/orchestrator.py 40.81% 29 Missing ⚠️
docling_jobkit/orchestrators/rq/worker.py 77.27% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

@cau-git cau-git marked this pull request as ready for review March 3, 2026 12:28
@cau-git cau-git changed the title from "fix: recover orphaned jobs when worker pod is killed mid-execution" to "fix: recover orphaned job status when worker pod is killed mid-execution" on Mar 3, 2026
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
dolfim-ibm
dolfim-ibm previously approved these changes Mar 3, 2026
Member

@dolfim-ibm dolfim-ibm left a comment

nice!

Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
@cau-git cau-git merged commit 4738900 into main Mar 3, 2026
10 checks passed
@cau-git cau-git deleted the cau/rq-worker-add-heartbeat branch March 3, 2026 13:10