@lesyk (Contributor) commented Dec 10, 2025

Results of processed PDFs:

================================================================================
FILE: MEDRPT-2024-PAT-3847_medical_report_scan.pdf
================================================================================


================================================================================
FILE: RECEIPT-2024-TXN-98765_retail_purchase.pdf
================================================================================
TECHMART ELECTRONICS
4567 Innovation Blvd
San Francisco, CA 94103
(415) 555-0199

===================================

Store #0342 - Downtown SF
11/23/2024 14:32:18 PST
TXN: TXN-98765-2024
Cashier: Emily Rodriguez
Register: POS-07

-----------------------------------

Wireless Noise-Cancelling
Headphones - Premium Black
AUDIO-5521 1 @ $349.99
Member Discount $-50.00
$299.99
USB-C Hub 7-in-1 Adapter
with HDMI & Ethernet
ACC-8834 2 @ $79.99
$159.98
Portable SSD 2TB
Thunderbolt 3 Compatible
STOR-2241 1 @ $289.00
Member Discount $-29.00
$260.00
Ergonomic Wireless Mouse
Rechargeable Battery
ACC-9012 1 @ $59.99
$59.99
Screen Cleaning Kit
Professional Grade
CARE-1156 3 @ $12.99
$38.97
HDMI 2.1 Cable 6ft
8K Resolution Support
CABLE-7789 2 @ $24.99
Member Discount $-5.00
$44.98
-----------------------------------

SUBTOTAL $863.91
Member Discount (15%)-$84.00
Sales Tax (8.5%) $66.23
Rewards Applied -$25.00
===================================
TOTAL $821.14
===================================

PAYMENT METHOD
Visa Card ending in 4782
Auth: 847392
Ref: REF-20241123-98765

-----------------------------------

REWARDS MEMBER
Sarah Mitchell
ID: TM-447821
Points Earned: 821
Total Points: 3,247
Next Reward: $50 gift card
at 5,000 pts (1,753 to go)

-----------------------------------

RETURN POLICY
Returns within 30 days
Receipt required
Electronics must be unopened

*TXN98765202411231432*

Thank you for shopping!
www.techmart.example.com

===================================



================================================================================
FILE: REPAIR-2022-INV-001_multipage.pdf
================================================================================
ZAVA AUTO REPAIR
Certified Collision Repair
123 Main Street, Redmond, WA 98052
Phone: (425) 000-0000
Preliminary Estimate (ID: EST-1008)
| Customer Information |                     |     | Vehicle Information |                   |
| -------------------- | ------------------- | --- | ------------------- | ----------------- |
| Insured name         | Gabriel Diaz        |     | Year                | 2022              |
| Claim #              | SF-1008             |     | Make                | Jeep              |
| Policy #             | POL-2022-555        |     | Model               | Grand Cherokee    |
| Phone                | (425) 111-1111      |     | Trim                | Limited           |
| Email                | [email protected] |     | VIN                 | 1C4RJFBG2NC123456 |
|                      |                     |     | Color               | White             |
|                      |                     |     | Odometer            | 9,800             |
| Repair Order #       | RO-20221108         |     | Estimator           | Ellis Turner      |
Estimate Totals
|                  |     | Hours | Rate | Cost  |
| ---------------- | --- | ----- | ---- | ----- |
| Parts            |     |       |      | 2,100 |
| Body Labor       |     | 2     | 150  | 300   |
| Paint Labor      |     | 1.5   | 150  | 225   |
| Mechanical Labor |     | -     | -    | -     |
Supplies
|               | Paint Supplies           |     |        | 60     |
| ------------- | ------------------------ | --- | ------ | ------ |
|               | Body Supplies            |     |        | 30     |
| Other Charges |                          |     |        | 15     |
| Subtotal      |                          |     |        | 2,730  |
| Sales Tax     |                          |     | 10.20% | 278.46 |
| GRAND TOTAL   |                          |     |        | 5,738  |
| Note          | Minor rear bumper repair |     |        |        |
This is a preliminary estimate for the visible damage of the vehicle. Additional damage / repairs / parts may be found
after the vehicle has been disassembled and damaged parts have been removed. Suspension damages may be
present, but can not be determined until an alignment on the vehicle has been done. Parts Prices may vary due to
models and vehicle maker price updates. Please be advised if vehicle owner elects to have vehicle sent to service for
any mechanical concerns, ALL service departments charge a vehicle diagnostic charge. If the mechanical concern is
deemed not related to an insurance claim, vehicle owner will be reponsible for charges.

ZAVA AUTO REPAIR
Certified Collision Repair
123 Main Street, Redmond, WA 98052
Phone: (425) 000-0000
Preliminary Estimate (ID: EST-1008)
Customer Information Vehicle Information
| Insured name   | Bruce Wayne                |     | Year      | 2025         |
| -------------- | -------------------------- | --- | --------- | ------------ |
| Claim #        |

================================================================================
FILE: SPARSE-2024-INV-1234_borderless_table.pdf
================================================================================
INVENTORY RECONCILIATION REPORT
Report ID: SPARSE-2024-INV-1234
Warehouse: Distribution Center East
Report Date: 2024-11-15
Prepared By: Sarah Martinez
| Product Code | Location | Expected | Actual | Variance | Status   |
| ------------ | -------- | -------- | ------ | -------- | -------- |
| SKU-8847     | A-12     | 450      |        |          |          |
|              | B-07     |          | 289    | -23      |          |
| SKU-9201     |          | 780      | 778    |          | OK       |
|              | C-15     |          |        | +15      |          |
| SKU-4563     | D-22     |          | 156    |          | CRITICAL |
|              |          | 180      |        | -24      |          |
| SKU-7728     | A-08     | 920      |        |          |          |
|              |          |          | 935    | +15      | OK       |
Variance Analysis:
Summary Statistics:
Total Variance Cost: $4,287.50
Critical Items: 1
Overall Accuracy: 97.2%
Detailed Analysis by Category:
The inventory reconciliation reveals several key findings. The primary variance driver is SKU-4563,
which shows a -24 unit discrepancy requiring immediate investigation. Location B-07 handling of
SKU-8847 also demonstrates significant variance. Cross-location verification protocols should be

reviewed to prevent future discrepancies. The overall accuracy rate of 97.2% meets our target
threshold, but critical items require expedited resolution to maintain operational efficiency.
Extended Inventory Review:
| Product Code | Category    | Unit Cost | Total Value | Last Audit | Notes      |
| ------------ | ----------- | --------- | ----------- | ---------- | ---------- |
| SKU-8847     | Electronics | $45.00    | $13,005.00  | 2024-10-15 |            |
| SKU-9201     | Hardware    | $32.50    | $25,285.00  | 2024-10-22 | Verified   |
| SKU-4563     | Software    | $120.00   | $18,720.00  |            | Critical   |
| SKU-7728     | Accessories | $15.75    | $14,726.25  | 2024-11-01 |            |
| SKU-3345     | Electronics | $67.00    | $22,445.00  | 2024-10-18 |            |
| SKU-5512     | Hardware    | $89.00    | $31,150.00  |            | Pending    |
| SKU-6678     | Software    | $200.00   | $42,000.00  | 2024-10-25 | High Value |
| SKU-7789     | Accessories | $8.50     | $5,950.00   | 2024-11-05 |            |
| SKU-2234     | Electronics | $125.00   | $35,000.00  |            |            |
| SKU-1123     | Hardware    | $55.00    | $27,500.00  | 2024-10-30 | Verified   |
Recommendations:
1. Immediate review of SKU-4563 handling procedures. 2. Implement additional verification for critical
items. 3. Schedule follow-up audit for high-value products (SKU-6678, SKU-2234).
Approval:

================================================================================
FILE: test.pdf
================================================================================
1

Introduction

Large language models (LLMs) are becoming a crucial building block in developing powerful agents
that utilize LLMs for reasoning, tool usage, and adapting to new observations (Yao et al., 2022; Xi
et al., 2023; Wang et al., 2023b) in many real-world tasks. Given the expanding tasks that could
benefit from LLMs and the growing task complexity, an intuitive approach to scale up the power of
agents is to use multiple agents that cooperate. Prior work suggests that multiple agents can help
encourage divergent thinking (Liang et al., 2023), improve factuality and reasoning (Du et al., 2023),
and provide validation (Wu et al., 2023). In light of the intuition and early evidence of promise, it is
intriguing to ask the following question: how can we facilitate the development of LLM applications
that could span a broad spectrum of domains and complexities based on the multi-agent approach?

Our insight is to use multi-agent conversations to achieve it. There are at least three reasons con-
firming its general feasibility and utility thanks to recent advances in LLMs: First, because chat-
optimized LLMs (e.g., GPT-4) show the ability to incorporate feedback, LLM agents can cooperate
through conversations with each other or human(s), e.g., a dialog where agents provide and seek rea-
soning, observations, critiques, and validation. Second, because a single LLM can exhibit a broad
range of capabilities (especially when configured with the correct prompt and inference settings),
conversations between differently configured agents can help combine these broad LLM capabilities
in a modular and complementary manner. Third, LLMs have demonstrated ability to solve complex
tasks when the tasks are broken into simpler subtasks. Multi-agent conversations can enable this
partitioning and integration in an intuitive manner. How can we leverage the above insights and
support different applications with the common requirement of coordinating multiple agents, poten-
tially backed by LLMs, humans, or tools exhibiting different capacities? We desire a multi-agent
conversation framework with generic abstraction and effective implementation that has the flexibil-
ity to satisfy different application needs. Achieving this requires addressing two critical questions:
(1) How can we design individual agents that are capable, reusable, customizable, and effective in
multi-agent collaboration? (2) How can we develop a straightforward, unified interface that can
accommodate a wide range of agent conversation patterns? In practice, applications of varying
complexities may need distinct sets of agents with specific capabilities, and may require different
conversation patterns, such as single- or multi-turn dialogs, different human involvement modes, and
static vs. dynamic conversation. Moreover, developers may prefer the flexibility to program agent
interactions in natural language or code. Failing to adequately address these two questions would
limit the framework’s
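
A dump in the format above can be produced with a short script. The sketch below is an assumption about how such results might be generated, not the script used for this PR: `dump_results` and its `convert` parameter are hypothetical names, and the 80-character banner simply mirrors the output shown. The conversion call in the usage comment, `convert(path).text_content` on a `MarkItDown` instance, is the one the tests in this PR use.

```python
import os
from typing import Callable, Iterable

BANNER = "=" * 80


def dump_results(paths: Iterable[str], convert: Callable[[str], str]) -> str:
    """Return one banner-delimited section per file, in the order given."""
    sections = []
    for path in paths:
        header = f"{BANNER}\nFILE: {os.path.basename(path)}\n{BANNER}"
        sections.append(f"{header}\n{convert(path)}\n")
    return "\n".join(sections)


# Usage with MarkItDown (requires the markitdown package with PDF support):
#
#   from markitdown import MarkItDown
#   md = MarkItDown()
#   print(dump_results(pdf_paths, lambda p: md.convert(p).text_content))
```

Passing the converter in as a callable keeps the formatting logic independent of the PDF backend, so the dump can be exercised without any test PDFs on disk.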

@lesyk changed the title from "Added PDF table extraction feature with aligned Markdown (#1419)" to "[MS] Update PDF table extraction to support aligned Markdown" on Dec 10, 2025
lesyk and others added 4 commits December 10, 2025 19:36
- Added a medical report scan PDF for testing scanned PDF handling.
- Included a retail purchase receipt PDF to validate receipt extraction functionality.
- Introduced a multipage invoice PDF to test extraction of complex invoice structures.
- Added a borderless table PDF for testing inventory reconciliation report extraction.
- Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
- Enhanced existing tests to validate the order and presence of extracted content across various PDF types.
@lesyk marked this pull request as ready for review December 11, 2025 16:15
@afourney (Member) commented Jan 7, 2026

Generally this looks good, and it passes in the CI. Locally, I am getting the following errors (that might be unique to my environment), but I want to understand them before merging:

====================================================================== test session starts =======================================================================
platform linux -- Python 3.10.12, pytest-8.4.2, pluggy-1.6.0
rootdir: /home/afourney/repos/tmp/markitdown/packages/markitdown
configfile: pyproject.toml
plugins: mock-3.15.1, anyio-4.12.1, xdist-3.8.0, rerunfailures-14.0
collected 191 items

tests/test_cli_misc.py ..                                                                                                                                  [  1%]
tests/test_cli_vectors.py ..................................................                                                                               [ 27%]
tests/test_docintel_html.py ..                                                                                                                             [ 28%]
tests/test_module_misc.py .............                                                                                                                    [ 35%]
tests/test_module_vectors.py .............................................................................................................                 [ 92%]
tests/test_pdf_tables.py F...F..FFFF...F                                                                                                                   [100%]

============================================================================ FAILURES ============================================================================
____________________________________________________ TestPdfTableExtraction.test_borderless_table_extraction _____________________________________________________

self = <tests.test_pdf_tables.TestPdfTableExtraction object at 0x7468df010640>, markitdown = <markitdown._markitdown.MarkItDown object at 0x7468dcbc6500>

    def test_borderless_table_extraction(self, markitdown):
        """Test extraction of borderless tables from SPARSE inventory PDF.

        Expected output structure:
        - Header: INVENTORY RECONCILIATION REPORT with Report ID, Warehouse, Date, Prepared By
        - Pipe-separated rows with inventory data
        - Text section: Variance Analysis with Summary Statistics
        - More pipe-separated rows with extended inventory review
        - Footer: Recommendations section
        """
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Validate document header content
        expected_strings = [
            "INVENTORY RECONCILIATION REPORT",
            "Report ID: SPARSE-2024-INV-1234",
            "Warehouse: Distribution Center East",
            "Report Date: 2024-11-15",
            "Prepared By: Sarah Martinez",
        ]
        validate_strings(result, expected_strings)

        # Validate pipe-separated format is used
>       assert "|" in text_content, "Should have pipe separators for form-style data"
E       AssertionError: Should have pipe separators for form-style data
E       assert '|' in 'INVENTORY RECONCILIATION REPORT\n\nReport ID: SPARSE-2024-INV-1234\nWarehouse: Distribution Center East\nReport Date:...cation for critical\nitems. 3. Schedule follow-up audit for high-value products (SKU-6678, SKU-2234).\n\nApproval:\n\n'

tests/test_pdf_tables.py:138: AssertionError
____________________________________________________ TestPdfTableExtraction.test_multipage_invoice_extraction ____________________________________________________

self = <tests.test_pdf_tables.TestPdfTableExtraction object at 0x7468df011270>, markitdown = <markitdown._markitdown.MarkItDown object at 0x7468de7826b0>

    def test_multipage_invoice_extraction(self, markitdown):
        """Test extraction of multipage invoice PDF with form-style layout.

        Expected output: Pipe-separated format with clear cell boundaries.
        Form data should be extracted with pipes indicating column separations.
        """
        pdf_path = os.path.join(TEST_FILES_DIR, "REPAIR-2022-INV-001_multipage.pdf")

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Validate basic content is extracted
        expected_strings = [
            "ZAVA AUTO REPAIR",
            "Collision Repair",
            "Redmond, WA",
            "Gabriel Diaz",
            "Jeep",
            "Grand Cherokee",
            "Parts",
            "Body Labor",
            "Paint Labor",
            "GRAND TOTAL",
            # Second page content
            "Bruce Wayne",
            "Batmobile",
        ]
        validate_strings(result, expected_strings)

        # Validate pipe-separated table format
        # Form-style documents should use pipes to separate cells
>       assert "|" in text_content, "Form-style PDF should contain pipe separators"
E       AssertionError: Form-style PDF should contain pipe separators
E       assert '|' in 'ZAVA AUTO REPAIR\nCertified Collision Repair\n123 Main Street, Redmond, WA 98052\nPhone: (425) 000-0000\n\nPreliminar...the mechanical concern is\ndeemed not related to an insurance claim, vehicle owner will be reponsible for charges.\n\n'

tests/test_pdf_tables.py:531: AssertionError
_________________________________________________ TestPdfTableMarkdownFormat.test_markdown_table_has_pipe_format _________________________________________________

self = <tests.test_pdf_tables.TestPdfTableMarkdownFormat object at 0x7468df011540>, markitdown = <markitdown._markitdown.MarkItDown object at 0x7468df2ea470>

    def test_markdown_table_has_pipe_format(self, markitdown):
        """Test that form-style PDFs have pipe-separated format."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Find rows with pipes
        lines = text_content.split("\n")
        pipe_rows = [
            line for line in lines if line.startswith("|") and line.endswith("|")
        ]

>       assert len(pipe_rows) > 0, "Should have pipe-separated rows"
E       AssertionError: Should have pipe-separated rows
E       assert 0 > 0
E        +  where 0 = len([])

tests/test_pdf_tables.py:683: AssertionError
_______________________________________________ TestPdfTableMarkdownFormat.test_markdown_table_columns_have_pipes ________________________________________________

self = <tests.test_pdf_tables.TestPdfTableMarkdownFormat object at 0x7468df0100a0>, markitdown = <markitdown._markitdown.MarkItDown object at 0x7468dc72afe0>

    def test_markdown_table_columns_have_pipes(self, markitdown):
        """Test that form-style PDF columns are separated with pipes."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Find table rows and verify column structure
        lines = text_content.split("\n")
        table_rows = [
            line for line in lines if line.startswith("|") and line.endswith("|")
        ]

>       assert len(table_rows) > 0, "Should have markdown table rows"
E       AssertionError: Should have markdown table rows
E       assert 0 > 0
E        +  where 0 = len([])

tests/test_pdf_tables.py:707: AssertionError
________________________________________________ TestPdfTableStructureConsistency.test_borderless_table_structure ________________________________________________

self = <tests.test_pdf_tables.TestPdfTableStructureConsistency object at 0x7468df5a0ee0>
markitdown = <markitdown._markitdown.MarkItDown object at 0x7468dc5b8100>

    def test_borderless_table_structure(self, markitdown):
        """Test that borderless table PDF has pipe-separated structure."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Should have pipe-separated content
>       assert "|" in text_content, "Borderless table PDF should have pipe separators"
E       AssertionError: Borderless table PDF should have pipe separators
E       assert '|' in 'INVENTORY RECONCILIATION REPORT\n\nReport ID: SPARSE-2024-INV-1234\nWarehouse: Distribution Center East\nReport Date:...cation for critical\nitems. 3. Schedule follow-up audit for high-value products (SKU-6678, SKU-2234).\n\nApproval:\n\n'

tests/test_pdf_tables.py:737: AssertionError
____________________________________________ TestPdfTableStructureConsistency.test_multipage_invoice_table_structure _____________________________________________

self = <tests.test_pdf_tables.TestPdfTableStructureConsistency object at 0x7468df5a0fa0>
markitdown = <markitdown._markitdown.MarkItDown object at 0x7468dc217f10>

    def test_multipage_invoice_table_structure(self, markitdown):
        """Test that multipage invoice PDF has pipe-separated format."""
        pdf_path = os.path.join(TEST_FILES_DIR, "REPAIR-2022-INV-001_multipage.pdf")

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        text_content = result.text_content

        # Should have pipe-separated content
>       assert "|" in text_content, "Invoice PDF should have pipe separators"
E       AssertionError: Invoice PDF should have pipe separators
E       assert '|' in 'ZAVA AUTO REPAIR\nCertified Collision Repair\n123 Main Street, Redmond, WA 98052\nPhone: (425) 000-0000\n\nPreliminar...the mechanical concern is\ndeemed not related to an insurance claim, vehicle owner will be reponsible for charges.\n\n'

tests/test_pdf_tables.py:755: AssertionError
_____________________________________________ TestPdfTableStructureConsistency.test_borderless_table_data_integrity ______________________________________________

self = <tests.test_pdf_tables.TestPdfTableStructureConsistency object at 0x7468df5a3340>
markitdown = <markitdown._markitdown.MarkItDown object at 0x7468dcb90fa0>

    def test_borderless_table_data_integrity(self, markitdown):
        """Test that borderless table extraction preserves data integrity."""
        pdf_path = os.path.join(
            TEST_FILES_DIR, "SPARSE-2024-INV-1234_borderless_table.pdf"
        )

        if not os.path.exists(pdf_path):
            pytest.skip(f"Test file not found: {pdf_path}")

        result = markitdown.convert(pdf_path)
        tables = extract_markdown_tables(result.text_content)

>       assert len(tables) >= 2, "Should have at least 2 tables"
E       AssertionError: Should have at least 2 tables
E       assert 0 >= 2
E        +  where 0 = len([])

tests/test_pdf_tables.py:862: AssertionError
======================================================================== warnings summary ========================================================================
tests/test_pdf_tables.py:12
  /home/afourney/repos/tmp/markitdown/packages/markitdown/tests/test_pdf_tables.py:12: PytestUnknownMarkWarning: Unknown pytest.mark.unittests - is this a typo?  You can register custom marks to avoid this warning - for details, see https://docs.pytest.org/en/stable/how-to/mark.html
    pytestmark = pytest.mark.unittests

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================== short test summary info =====================================================================
FAILED tests/test_pdf_tables.py::TestPdfTableExtraction::test_borderless_table_extraction - AssertionError: Should have pipe separators for form-style data
FAILED tests/test_pdf_tables.py::TestPdfTableExtraction::test_multipage_invoice_extraction - AssertionError: Form-style PDF should contain pipe separators
FAILED tests/test_pdf_tables.py::TestPdfTableMarkdownFormat::test_markdown_table_has_pipe_format - AssertionError: Should have pipe-separated rows
FAILED tests/test_pdf_tables.py::TestPdfTableMarkdownFormat::test_markdown_table_columns_have_pipes - AssertionError: Should have markdown table rows
FAILED tests/test_pdf_tables.py::TestPdfTableStructureConsistency::test_borderless_table_structure - AssertionError: Borderless table PDF should have pipe separators
FAILED tests/test_pdf_tables.py::TestPdfTableStructureConsistency::test_multipage_invoice_table_structure - AssertionError: Invoice PDF should have pipe separators
FAILED tests/test_pdf_tables.py::TestPdfTableStructureConsistency::test_borderless_table_data_integrity - AssertionError: Should have at least 2 tables
====================================================== 7 failed, 184 passed, 1 warning in 211.57s (0:03:31) ======================================================
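
Every failure above comes down to the same condition: the converted text contains no line that both starts and ends with `|`. The sketch below shows that check in isolation. `find_pipe_rows` mirrors the list comprehension in the failing tests; `group_tables` is a hypothetical stand-in for the `extract_markdown_tables` helper, whose body does not appear in the log, so its behaviour here (consecutive pipe rows form one table) is an assumption.

```python
from typing import List


def find_pipe_rows(text: str) -> List[str]:
    """Lines that start and end with a pipe, as the tests filter them."""
    return [ln for ln in text.split("\n") if ln.startswith("|") and ln.endswith("|")]


def group_tables(text: str) -> List[List[str]]:
    """Group consecutive pipe rows into tables (assumed behaviour)."""
    tables: List[List[str]] = []
    current: List[str] = []
    for line in text.split("\n"):
        if line.startswith("|") and line.endswith("|"):
            current.append(line)
        elif current:
            # A non-table line closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


# Illustrative inputs: one with aligned Markdown tables, one with plain text.
with_tables = (
    "Report\n"
    "| Product Code | Location |\n"
    "| ------------ | -------- |\n"
    "| SKU-8847     | A-12     |\n"
    "Variance Analysis:\n"
    "| Product Code | Category    |\n"
    "| ------------ | ----------- |\n"
    "| SKU-8847     | Electronics |\n"
)
plain_text = "Report\nProduct Code Location\nSKU-8847 A-12\n"

print(len(find_pipe_rows(with_tables)), len(group_tables(with_tables)))
print(len(find_pipe_rows(plain_text)), len(group_tables(plain_text)))
```

With plain-text output, as in the local run, both functions return empty results, which matches the `assert 0 > 0` and `assert 0 >= 2` messages in the log.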

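The `PytestUnknownMarkWarning` in the warnings summary is separate from the failures. If the `unittests` mark is not already registered elsewhere in the project's configuration, one way to register it is a `pytest_configure` hook in `conftest.py`. This is a sketch: the mark description is an assumption, and the project may prefer to declare marks in `pyproject.toml` instead.

```python
# conftest.py (sketch): register the custom mark so pytest recognises it.


def pytest_configure(config):
    """Declare the `unittests` mark used by `pytestmark = pytest.mark.unittests`."""
    config.addinivalue_line(
        "markers",
        # The description text after the colon is an assumption.
        "unittests: unit tests for the markitdown package",
    )
```
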
@afourney (Member) commented Jan 8, 2026

Looks like the issues are related to an older version of "hatch" on my system. Updated and the problems cleared up.
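
For anyone hitting the same failures, a quick way to see which version of a tool the current interpreter can import is `importlib.metadata`. The sketch below uses a hypothetical helper, `installed_version`. It only reports packages installed in the active Python environment, so a `hatch` installed separately (for example through a standalone installer) would show as absent here.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print("hatch:", installed_version("hatch") or "not installed in this environment")
```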

@afourney merged commit 251dddc into microsoft:main Jan 8, 2026
3 checks passed
@lesyk deleted the u/vilesyk/table_fix branch January 8, 2026 13:31