@pallaprolus

Fixes #68

This PR addresses the issue where partially numbered lists (common in MasterFormat documents, e.g., .1, .2) are extracted as plain text lines indistinguishable from regular paragraphs.

Changes:

  • Adds a lightweight regex post-processing step in _pdf_converter.py that identifies lines starting with .Number (a dot followed by digits) and converts them into Markdown list items (- .Number).
  • This keeps the solution lightweight and free of new dependencies, as requested by maintainers.
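
For readers who want a concrete picture of the post-processing step described above, here is a minimal sketch. It is not the code from this PR: the function name `convert_partial_numbering`, the pattern `_PARTIAL_NUMBER_RE`, and the exact regex are assumptions made for illustration, and the real change in _pdf_converter.py may differ.

```python
import re

# Hypothetical pattern (an assumption, not taken from the PR):
# optional indentation, a dot, one or more digits, whitespace, then item text.
_PARTIAL_NUMBER_RE = re.compile(r"^(\s*)(\.\d+)\s+(\S.*)$")


def convert_partial_numbering(text: str) -> str:
    """Rewrite lines such as '.1 Item' as Markdown list items '- .1 Item'.

    Lines that do not start with a dot-number marker are returned unchanged.
    """
    converted = []
    for line in text.split("\n"):
        match = _PARTIAL_NUMBER_RE.match(line)
        if match:
            indent, marker, body = match.groups()
            # Preserve indentation and the original marker; only add the bullet.
            converted.append(f"{indent}- {marker} {body}")
        else:
            converted.append(line)
    return "\n".join(converted)


if __name__ == "__main__":
    sample = "PART 1 GENERAL\n.1 Item\n.2 Another item\nA regular paragraph."
    print(convert_partial_numbering(sample))
```

Requiring whitespace and at least one non-space character after the marker is a deliberate choice in this sketch: it leaves a bare `.5` or a decimal such as `0.5 mm` untouched, so only lines that look like list entries are rewritten.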

Verification:

  • Verified that lines like .1 Item are now converted to - .1 Item.
  • Ran standard tests to ensure no regressions.

@pallaprolus pallaprolus force-pushed the fix/issue-68-pdf-lists branch from 87d7b54 to 76a674a Compare December 30, 2025 15:33

@berryblessingb-blip berryblessingb-blip left a comment

markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"

Development

Successfully merging this pull request may close these issues.

PDF parsing doesn't support partially numbered lists
