
feat: add progress_callback support for real-time conversion tracking #1851

Open
echavet wants to merge 1 commit into microsoft:main from echavet:feat/progress-callback


echavet commented Apr 30, 2026

Summary

Adds a progress_callback mechanism to MarkItDown converters, enabling callers to receive real-time progress updates during document conversion.

Problem Solved

When converting large documents (e.g. a 213-page PDF), convert() blocks for several minutes with no way to report progress to the user. Server-side applications, worker processes, and UIs cannot display meaningful feedback during long conversions.

Design

Two new public types in _base_converter.py:

  • ConversionProgress — a frozen @dataclass carrying current, total, unit (page/slide/chapter/sheet), and source (converter class name).
  • ProgressCallback — a @runtime_checkable Protocol that any callable matching (ConversionProgress) -> None satisfies (structural subtyping / duck typing).
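
As a rough illustration of the two types described above, the sketch below shows one way they could be declared. The field names come from this description; the field types, the 1-based meaning of `current`, and the helper `log_progress` are assumptions made for the example and may differ from the code in this PR.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class ConversionProgress:
    """Immutable snapshot of how far a conversion has advanced."""

    current: int  # assumed: 1-based index of the unit just processed
    total: int    # assumed: total number of units in the document
    unit: str     # "page", "slide", "chapter", or "sheet"
    source: str   # converter class name, e.g. "PdfConverter"


@runtime_checkable
class ProgressCallback(Protocol):
    """Structural type: any callable taking a ConversionProgress fits."""

    def __call__(self, progress: ConversionProgress) -> None: ...


def log_progress(progress: ConversionProgress) -> None:
    # Hypothetical callback used only for this illustration.
    print(f"{progress.source}: {progress.current}/{progress.total} {progress.unit}s")


event = ConversionProgress(current=1, total=213, unit="page", source="PdfConverter")

# isinstance() on a runtime_checkable Protocol only checks that __call__
# exists; it does not verify the parameter or return types.
is_callback = isinstance(log_progress, ProgressCallback)

# frozen=True makes attribute assignment raise FrozenInstanceError.
try:
    event.current = 2  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because the runtime check is shallow, a static type checker is still the place where a mismatched callback signature would be caught.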

The callback is passed through the existing **kwargs chain (convert() → _convert() → converter.convert()). Converters that do not support progress simply ignore the kwarg.
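
The kwargs pass-through can be pictured with a stripped-down model. `Dispatcher`, `PlainConverter`, and `ChunkedConverter` below are hypothetical stand-ins, not MarkItDown classes, and the event is simplified to a `(current, total)` tuple; the point is only that a converter accepting `**kwargs` can ignore the callback without any signature change.

```python
from typing import Any, Callable, List, Optional, Tuple

Progress = Tuple[int, int]  # (current, total); simplified stand-in for the event type


class PlainConverter:
    """Hypothetical converter without progress support."""

    def convert(self, text: str, **kwargs: Any) -> str:
        # progress_callback lands in kwargs and is never read.
        return text.upper()


class ChunkedConverter:
    """Hypothetical converter that reports progress when asked to."""

    def convert(self, text: str, **kwargs: Any) -> str:
        callback: Optional[Callable[[Progress], None]] = kwargs.get("progress_callback")
        words = text.split()
        for index in range(1, len(words) + 1):
            if callback is not None:
                callback((index, len(words)))
        return " ".join(words)


class Dispatcher:
    """Hypothetical model of the convert() -> _convert() -> converter.convert() chain."""

    def __init__(self, converter: Any) -> None:
        self._converter = converter

    def convert(self, text: str, **kwargs: Any) -> str:
        return self._convert(text, **kwargs)

    def _convert(self, text: str, **kwargs: Any) -> str:
        # kwargs are forwarded untouched, so new options need no new parameters.
        return self._converter.convert(text, **kwargs)


events: List[Progress] = []
plain_result = Dispatcher(PlainConverter()).convert("a b c", progress_callback=events.append)
events_after_plain = list(events)  # still empty: the kwarg was ignored
chunked_result = Dispatcher(ChunkedConverter()).convert("a b c", progress_callback=events.append)
```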

Converters Updated

| Converter | Unit | Loop |
| --- | --- | --- |
| PdfConverter | page | per-page in pdfplumber loop |
| PptxConverter | slide | per-slide |
| EpubConverter | chapter | per-spine-item |
| XlsxConverter | sheet | per-sheet |
| XlsConverter | sheet | per-sheet |
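
Each converter in the table follows the same pattern: iterate over its logical units and emit one event per unit. The sketch below shows that pattern for the per-page case, with a plain list of strings standing in for real PDF pages. The function `convert_pages`, the stand-in dataclass, and the choice to emit after a page is processed are assumptions for illustration, not the code in this PR.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class ConversionProgress:
    # Stand-in for the type added by this PR; field types are assumed.
    current: int
    total: int
    unit: str
    source: str


def convert_pages(
    pages: Sequence[str],
    progress_callback: Optional[Callable[[ConversionProgress], None]] = None,
) -> str:
    """Hypothetical per-page loop that reports progress after each page."""
    total = len(pages)
    chunks: List[str] = []
    for index, page_text in enumerate(pages, start=1):
        chunks.append(page_text.strip())  # placeholder for real page extraction
        if progress_callback is not None:
            progress_callback(
                ConversionProgress(current=index, total=total, unit="page", source="PdfConverter")
            )
    return "\n\n".join(chunks)


events: List[ConversionProgress] = []
markdown = convert_pages([" first ", "second", " third"], progress_callback=events.append)
silent = convert_pages(["only page"])  # no callback: behaves as before
```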

Usage Example

from markitdown import MarkItDown, ConversionProgress

def on_progress(p: ConversionProgress) -> None:
    print(f"[{p.source}] {p.current}/{p.total} {p.unit}s")

md = MarkItDown()
result = md.convert("large_report.pdf", progress_callback=on_progress)
# Output:
# [PdfConverter] 1/213 pages
# [PdfConverter] 2/213 pages
# ...
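
Since any callable fits, a callback can also be an object that keeps state between calls. The sketch below is a hypothetical `PercentReporter` that records a message only when the whole-number percentage changes, which may suit a UI that should not redraw on every page. It is driven here by simulated events built with `SimpleNamespace`, not by a real conversion.

```python
from types import SimpleNamespace
from typing import Any, List


class PercentReporter:
    """Hypothetical stateful callback that throttles updates to whole percents."""

    def __init__(self) -> None:
        self.messages: List[str] = []
        self._last_percent = -1

    def __call__(self, p: Any) -> None:
        # Only the attributes named in this PR (current, total, unit, source) are read.
        percent = (p.current * 100) // p.total if p.total else 100
        if percent != self._last_percent:
            self._last_percent = percent
            self.messages.append(
                f"[{p.source}] {percent}% ({p.current}/{p.total} {p.unit}s)"
            )


reporter = PercentReporter()
# Simulate the 213 per-page events of the example above.
for page in range(1, 214):
    reporter(SimpleNamespace(current=page, total=213, unit="page", source="PdfConverter"))
```

With a real conversion, the same object would be passed as `progress_callback=reporter`.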

Backward Compatibility

  • ✅ All existing functionality unchanged
  • ✅ Callback is optional — omitting it preserves current behavior
  • ✅ No new dependencies
  • ✅ No changes to existing method signatures
  • graphrag-input and other consumers work without modification

Type of Change

  • New feature (non-breaking change)

Testing

  • All existing tests pass (no modifications needed)
  • Manual verification with PDF, PPTX, and EPUB files
  • ConversionProgress dataclass and ProgressCallback Protocol validated at runtime

Add ConversionProgress dataclass and ProgressCallback Protocol to enable
real-time progress reporting during document conversion.

Converters emit progress events for each logical unit processed:
- PdfConverter: per page
- PptxConverter: per slide
- EpubConverter: per chapter
- XlsxConverter / XlsConverter: per sheet

The callback is optional and passed via kwargs (progress_callback).
Converters that do not support progress simply ignore it.
Fully backward-compatible — no changes to existing API signatures.

Signed-off-by: Eric Chavet <echavet@gmail.com>

echavet commented Apr 30, 2026

@microsoft-github-policy-service agree
