Commit 251dddc

[MS] Update PDF table extraction to support aligned Markdown (#1499)
* Added PDF table extraction feature with aligned Markdown (#1419)
* Add PDF test files and enhance extraction tests
  - Added a medical report scan PDF for testing scanned PDF handling.
  - Included a retail purchase receipt PDF to validate receipt extraction functionality.
  - Introduced a multipage invoice PDF to test extraction of complex invoice structures.
  - Added a borderless table PDF for testing inventory reconciliation report extraction.
  - Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
  - Enhanced existing tests to validate the order and presence of extracted content across various PDF types.
* fix: update dependencies for PDF processing and improve table extraction logic
* Bumped version of pdfminer.six

---------

Authored-by: Ashok <ashh010101@gmail.com>
1 parent dde250a commit 251dddc

8 files changed: +1289 -21 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ coverage.xml
 .hypothesis/
 .pytest_cache/
 cover/
+.test-logs/

 # Translations
 *.mo

packages/markitdown/pyproject.toml

Lines changed: 4 additions & 3 deletions
@@ -41,19 +41,20 @@ all = [
 "openpyxl",
 "xlrd",
 "lxml",
-"pdfminer.six>=20251107",
+"pdfminer.six>=20251230",
+"pdfplumber>=0.11.9",
 "olefile",
 "pydub",
 "SpeechRecognition",
 "youtube-transcript-api~=1.0.0",
 "azure-ai-documentintelligence",
-"azure-identity"
+"azure-identity",
 ]
 pptx = ["python-pptx"]
 docx = ["mammoth~=1.11.0", "lxml"]
 xlsx = ["pandas", "openpyxl"]
 xls = ["pandas", "xlrd"]
-pdf = ["pdfminer.six"]
+pdf = ["pdfminer.six>=20251230", "pdfplumber>=0.11.9"]
 outlook = ["olefile"]
 audio-transcription = ["pydub", "SpeechRecognition"]
 youtube-transcription = ["youtube-transcript-api"]
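
The converter changes themselves are not shown in this part of the diff, so the following is only a rough sketch of what "aligned Markdown" output for an extracted table might look like, not the code from this commit. It assumes the table arrives as a list of rows of optional strings, which is the shape returned by pdfplumber's `page.extract_tables()`. The function name `rows_to_aligned_markdown` is hypothetical.

```python
from typing import List, Optional, Sequence


def rows_to_aligned_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows (first row is the header) as a column-aligned Markdown table.

    Hypothetical helper for illustration; not taken from the commit.
    """
    if not rows:
        return ""
    # Normalize cells: None becomes "", whitespace runs (including newlines)
    # collapse to one space, and literal pipes are escaped.
    cleaned: List[List[str]] = [
        [" ".join((cell or "").split()).replace("|", "\\|") for cell in row]
        for row in rows
    ]
    # Pad short rows so every row has the same number of columns.
    ncols = max(len(row) for row in cleaned)
    if ncols == 0:
        return ""
    cleaned = [row + [""] * (ncols - len(row)) for row in cleaned]
    # Column width is the widest cell, with a minimum of 3 for the "---" rule.
    widths = [max(3, *(len(row[i]) for row in cleaned)) for i in range(ncols)]

    def fmt(row: List[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    header, *body = cleaned
    separator = "| " + " | ".join("-" * w for w in widths) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


if __name__ == "__main__":
    print(rows_to_aligned_markdown([["Item", "Qty"], ["Widget", "2"], ["Bolt", None]]))
```

With pdfplumber installed, rows of this shape could come from iterating `pdfplumber.open(path).pages` and calling `extract_tables()` on each page. Padding every cell to its column width keeps the raw Markdown readable as plain text, which is presumably the point of the "aligned" output.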
