fix(pdf): handle partially numbered lists in PDF conversion (#68) #1519

pallaprolus · 2025-12-30T15:32:05Z

Fixes #68

This PR addresses the issue where partially numbered lists (common in MasterFormat documents, e.g., .1, .2) are extracted as plain text lines indistinguishable from regular paragraphs.

Changes:

Adds a lightweight regex post-processing step in _pdf_converter.py that identifies lines starting with .Number and converts them into Markdown list items (- .Number).
This keeps the solution dependency-free, as requested by maintainers.

Verification:

Verified that lines like .1 Item are now converted to - .1 Item.
Ran standard tests to ensure no regressions.
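
For illustration, the post-processing step described above could look roughly like the sketch below. This is not the code from the PR: the function name `mark_partial_numbered_lists` and the exact pattern are assumptions chosen to match the described behavior (.1 Item becomes - .1 Item).

```python
import re

# Assumed pattern (not taken from the PR): optional indentation, a dot
# followed by digits, at least one space or tab, then the item text.
# [ \t] is used instead of \s so a match never spans a line break.
PARTIAL_NUMBER_RE = re.compile(r"^([ \t]*)(\.\d+)[ \t]+(\S.*)$", re.MULTILINE)


def mark_partial_numbered_lists(text: str) -> str:
    """Prefix lines such as '.1 Item' with '- ' so they render as Markdown list items."""
    return PARTIAL_NUMBER_RE.sub(r"\1- \2 \3", text)


if __name__ == "__main__":
    sample = "Section\n.1 Item\n.2 Another item\nPlain paragraph.\n"
    print(mark_partial_numbered_lists(sample))
```

One limitation of a pattern like this: a paragraph line that happens to begin with a bare decimal (for example ".5 mm clearance") would also be turned into a list item, so a real implementation may need a stricter rule.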

berryblessingb-blip

markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"

fix(pdf): handle partially numbered lists using regex (microsoft#68)

76a674a

pallaprolus force-pushed the fix/issue-68-pdf-lists branch from 87d7b54 to 76a674a on December 30, 2025 15:33

berryblessingb-blip approved these changes Jan 7, 2026
