
Feature: Support incremental output and task resumption for long documents #4736

@kevinmw

Summary

For long documents (e.g., 270 pages) processed with the hybrid engine in sliding window mode, intermediate results are not persisted to disk until all windows complete. If the process is interrupted (timeout, crash, etc.), all progress is lost and processing must restart from scratch.

Problem

When processing a 270-page PDF textbook with hybrid-auto-engine, the CLI client times out while the server-side processing is still running. This results in:

  1. No incremental output: Results from completed windows are kept in memory only. If the process fails at window 3/5, results from windows 1-2 are lost.
  2. Client timeout kills the task: The CLI polling timeout not only marks the task as failed on the client side, but also terminates the server-side processing, wasting GPU compute time.
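
To make the cost concrete, here is a rough sketch of the window arithmetic for this document. It assumes non-overlapping windows of 64 pages; the engine's actual windowing may differ. `window_ranges` is a hypothetical helper written for illustration, not a MinerU function.

```python
def window_ranges(total_pages: int, window_size: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) page ranges, one per window."""
    return [
        (start, min(start + window_size - 1, total_pages))
        for start in range(1, total_pages + 1, window_size)
    ]


ranges = window_ranges(270, 64)
# 5 windows: (1, 64), (65, 128), (129, 192), (193, 256), (257, 270)

# A failure in window 3 discards everything already computed in windows 1-2.
lost_pages = sum(end - start + 1 for start, end in ranges[:2])
# lost_pages == 128
```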

Suggested Improvements

  1. Incremental disk output: After each sliding window completes (e.g., every 64 pages), write the intermediate results to disk. This enables:

    • Recovery from failures without recomputing completed windows
    • Partial results available even if the full document fails
  2. Decouple client timeout from server processing: Client polling timeout should only affect the client, not terminate the server-side task. The server should continue processing until completion or a separate server-side timeout.

  3. Resume support: Add a `--resume` flag or API parameter to continue processing from the last completed window, skipping already-processed pages.
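
A minimal sketch of how improvements 1 and 3 could fit together, under stated assumptions: windows do not overlap, each window's result is JSON-serializable, and `process_window` stands in for whatever the engine does per window. The function name, the checkpoint file layout, and the `resume` parameter are all hypothetical and do not reflect MinerU's internals.

```python
import json
from pathlib import Path
from typing import Callable


def process_with_checkpoints(
    total_pages: int,
    window_size: int,
    out_dir: Path,
    process_window: Callable[[int, int], dict],  # hypothetical per-window worker
    resume: bool = False,
) -> list[dict]:
    """Process pages window by window, persisting each result as it completes."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for index, start in enumerate(range(1, total_pages + 1, window_size)):
        end = min(start + window_size - 1, total_pages)
        checkpoint = out_dir / f"window_{index:04d}.json"
        if resume and checkpoint.exists():
            # Reuse the persisted result instead of recomputing this window.
            results.append(json.loads(checkpoint.read_text(encoding="utf-8")))
            continue
        result = process_window(start, end)
        # Write to a temp file first, then rename it, so an interrupted write
        # does not leave a truncated file under the final checkpoint name.
        tmp = checkpoint.with_suffix(".json.tmp")
        tmp.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
        tmp.replace(checkpoint)
        results.append(result)
    return results
```

With this shape, a run that fails in window 3 leaves the checkpoints for windows 1-2 on disk, and a second call with `resume=True` only invokes `process_window` for windows 3-5.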

Environment

  • MinerU 3.0.8
  • Windows 11, RTX 4060 Ti 8GB
  • Python 3.13, PyTorch 2.11.0+cu126
  • Processing a 270-page Chinese math textbook PDF

Workaround

Currently working around this by using `mineru-api` server directly with manual HTTP polling, avoiding the CLI client timeout issue.
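
For reference, the polling side of this workaround looks roughly like the sketch below. The HTTP call itself is left out on purpose: `fetch_status` is a placeholder for whatever request queries the task on the `mineru-api` server, and the `"done"` / `"failed"` status strings are assumptions, not the server's documented values. The point is that a client-side timeout only stops the polling loop and never asks the server to cancel.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    fetch_status: Callable[[], str],  # placeholder for the real HTTP status request
    interval_s: float = 5.0,
    timeout_s: Optional[float] = None,
) -> str:
    """Poll until the task reports a terminal state.

    A client-side timeout raises TimeoutError here; nothing is sent to the
    server, so the server-side task is left untouched.
    """
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in ("done", "failed"):  # hypothetical terminal states
            return status
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("stopped polling; the server-side task may still be running")
        time.sleep(interval_s)
```

Leaving `timeout_s` as `None` polls indefinitely, which suits long documents where the total processing time is hard to predict.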

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

Labels: enhancement