Feature: Support incremental output and task resumption for long documents #4736
Description
Summary
For long documents (for example, 270 pages) processed with the hybrid engine in sliding window mode, intermediate results are not persisted to disk until all windows complete. If the process is interrupted (timeout, crash, etc.), all progress is lost and processing must restart from scratch.
Problem
When processing a 270-page PDF textbook with hybrid-auto-engine, the CLI client times out while the server-side processing is still running. This results in:
- No incremental output: Results from completed windows are kept in memory only. If the process fails at window 3/5, results from windows 1-2 are lost.
- Client timeout kills the task: The CLI polling timeout not only marks the task as failed on the client side, but also terminates the server-side processing, wasting GPU compute time.
Suggested Improvements
- Incremental disk output: After each sliding window completes (e.g., every 64 pages), write the intermediate results to disk. This enables:
  - Recovery from failures without recomputing completed windows
  - Partial results available even if the full document fails
- Decouple client timeout from server processing: Client polling timeout should only affect the client, not terminate the server-side task. The server should continue processing until completion or a separate server-side timeout.
- Resume support: Add a `--resume` flag or API parameter to continue processing from the last completed window, skipping already-processed pages.
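To make the first and third suggestions concrete, here is a minimal sketch of per-window checkpointing with resume. It is not based on MinerU's actual internals: `process_window`, `checkpoint_dir`, the `window_NNNN.json` file naming, and the `resume` parameter are all hypothetical names chosen for illustration, and the window size of 64 is taken from the example above.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def page_windows(total_pages: int, window_size: int) -> Iterator[tuple[int, int]]:
    """Yield half-open (start, end) page ranges that cover the document."""
    for start in range(0, total_pages, window_size):
        yield start, min(start + window_size, total_pages)


def write_atomic(path: Path, payload: dict) -> None:
    """Write JSON to a temp file, then rename it into place.

    A crash mid-write leaves a stray .tmp file, never a truncated checkpoint.
    """
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, ensure_ascii=False)
    os.replace(tmp_name, path)


def process_document(
    total_pages: int,
    window_size: int,
    process_window: Callable[[int, int], dict],  # hypothetical per-window worker
    checkpoint_dir: Path,                        # hypothetical output location
    resume: bool = False,
) -> list[dict]:
    """Process a document window by window, persisting each result to disk."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for index, (start, end) in enumerate(page_windows(total_pages, window_size)):
        path = checkpoint_dir / f"window_{index:04d}.json"
        if resume and path.exists():
            # Reuse the completed window instead of recomputing it.
            results.append(json.loads(path.read_text(encoding="utf-8")))
            continue
        result = process_window(start, end)
        write_atomic(path, result)  # persisted before the next window starts
        results.append(result)
    return results
```

With 270 pages and a window of 64 pages this yields five windows, so a failure in window 3 would leave the first two checkpoint files on disk, and a rerun with `resume=True` would only recompute windows 3 to 5. A real implementation would also need to verify that the checkpoints belong to the same input file and settings before reusing them.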
Environment
- MinerU 3.0.8
- Windows 11, RTX 4060 Ti 8GB
- Python 3.13, PyTorch 2.11.0+cu126
- Processing a 270-page Chinese math textbook PDF
Workaround
Currently working around this by using `mineru-api` server directly with manual HTTP polling, avoiding the CLI client timeout issue.
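For reference, the polling side of this workaround can be sketched roughly as below. It also shows the behavior requested in the second suggestion: when the client deadline passes, the client simply stops waiting and sends nothing that would cancel the task. The `fetch_status` callable and the `"state"` key with values `"done"` / `"failed"` are assumptions for illustration only; they do not describe the real `mineru-api` response format, and the actual HTTP request would live inside `fetch_status`.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],  # hypothetical: performs one status request
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll a task until it reaches a terminal state or the client gives up.

    On timeout this only raises locally; it never asks the server to stop.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        # Assumed terminal states; adjust to whatever the server reports.
        if status.get("state") in ("done", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("client stopped waiting; the task was not cancelled")
        sleep(interval_s)
```

Because the timeout is purely client-side, the same task can be polled again later with a fresh call to `poll_until_done`.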
Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com