To improve accuracy of converting pdf to md file #2156

akashsinghkunwar · 2025-08-28T19:45:48Z

Hi,
I am converting a PDF to a Markdown file and want to improve the accuracy of the extraction.

My PDF is digitally generated, but it also has an embedded image in the footer, and I need to extract the text from that footer. How can I process it?
Some tables are also not extracted correctly: when a column name spans two rows, the columns get mixed up.

Conversion result:

County	Territory Hurricane	Application Rule	Distance to Coast Eligibility
51	Mandatory: 2%	Not Eligible	Dukes, Nantucket All Risks
37	Mandatory: 2%

It would be great if you could provide some insight on this.

More context: I am converting the PDF to a Markdown file, converting the tables inside the Markdown to HTML, and then feeding the result to a Bedrock knowledge base.
I am using this simple code, but I have also tried other examples that use OCR (with force_full_page_ocr set to both true and false):

```
pipeline = PdfPipelineOptions(do_ocr=False)
return DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
```

@akashsinghkunwar · 2025-08-28T19:48:15Z

dosubot[bot]
bot Aug 28, 2025

Hi @akashsinghkunwar! I'm Dosu and I’m helping the docling team.

To improve PDF to Markdown extraction accuracy in Docling, especially for footers with embedded images and complex tables:

Extracting text from image footers:
Enable OCR by setting do_ocr=True in PdfPipelineOptions. For images embedded in footers, you can force OCR on the entire page—even if a text layer exists—by setting ocr_options.force_full_page_ocr=True. Docling supports several OCR engines (EasyOCR, Tesseract, RapidOCR, and new transformer-based models like GOT2_TRANSFORMERS and DOLPHIN_TRANSFORMERS) that can improve text extraction from images. You can also adjust the scale parameter to improve accuracy for small text in images. For best results with RapidOCR, make sure to install the latest package:
pip install rapidocr onnxruntime
Improving table extraction (multi-row headers):
Set do_table_structure=True and table_structure_options.do_cell_matching=True in your pipeline options. Use table_structure_options.mode=TableFormerMode.ACCURATE for better handling of complex tables. Also, set generate_parsed_pages=True to keep detailed page parsing data, which helps with accurate extraction and post-processing.
Region-specific extraction (e.g., footers):
Docling’s backend now supports extracting text from specific regions using methods like get_text_in_rect(bbox) on parsed pages. For custom extraction (such as always grabbing footer text), you can write a script that identifies the footer region’s bounding box and extracts text or runs OCR on just that area.
Best practices:
Enable hardware acceleration if available, batch process large PDFs, and parallelize Markdown export for speed. If you encounter conversion hangs, switch to the pypdfium2 backend.

Example config:

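from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions, PdfPipelineOptions, TableFormerMode, TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    # Use typed option objects rather than plain dicts
    table_structure_options=TableStructureOptions(
        do_cell_matching=True,
        mode=TableFormerMode.ACCURATE,
    ),
    # EasyOCR is shown as one choice; other *OcrOptions classes can be swapped in
    ocr_options=EasyOcrOptions(force_full_page_ocr=True, lang=["en"]),
    generate_parsed_pages=True,
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)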

For advanced needs (like extracting only footer regions or patching multi-row table headers), consider custom scripting using Docling’s document models and region extraction utilities.
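
As an illustration of what patching a multi-row table header could look like, here is a minimal sketch in plain Python. It assumes the table has already been pulled out as a list of rows of strings and that you know how many leading rows belong to the header. The function name `merge_header_rows` and the sample table are hypothetical and not part of Docling.

```python
from typing import List


def merge_header_rows(rows: List[List[str]], header_rows: int = 2) -> List[List[str]]:
    """Collapse the first `header_rows` rows into a single header row.

    Cells in the same column are joined with a space; empty cells are skipped.
    """
    if header_rows < 1 or len(rows) < header_rows:
        return rows
    width = max(len(r) for r in rows)
    # Pad short rows so every row has the same number of columns
    padded = [r + [""] * (width - len(r)) for r in rows]
    header = []
    for col in range(width):
        parts = [padded[i][col].strip() for i in range(header_rows)]
        header.append(" ".join(p for p in parts if p))
    return [header] + padded[header_rows:]


# Hypothetical table whose column names are split across two rows
table = [
    ["Region", "Deductible", "Distance"],
    ["", "(percent)", "to coast"],
    ["North", "2", "Any"],
]
merged = merge_header_rows(table, header_rows=2)
print(merged[0])
```

This only helps when the cells were split correctly across columns in the first place; if the extraction already shifted values into the wrong columns, the table structure settings above are the place to look first.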

To reply, just mention @dosu.

2 replies

akashsinghkunwar Aug 28, 2025
Author

@dosu could you write the complete script? This snippet is not working.

dosubot[bot] bot Aug 28, 2025

Here's a complete script for converting a PDF to Markdown using Docling, with improved accuracy for extracting text from footers (including images) and handling complex tables. This script enables OCR (with force_full_page_ocr) and accurate table extraction. It also shows how to export the results and includes troubleshooting tips.

import time
from pathlib import Path

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, RapidOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

def main():
    # Path to your PDF file
    input_doc_path = "your_file.pdf"  # <-- change this to your PDF path

    # Configure pipeline options for best accuracy
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True  # Enable OCR for images (e.g., footers)
    pipeline_options.do_table_structure = True  # Enable table extraction
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # Best for multi-row headers

    # Use RapidOCR for best image text extraction (install with: pip install rapidocr onnxruntime)
    pipeline_options.ocr_options = RapidOcrOptions(force_full_page_ocr=True, lang=["en"])
    # For other OCR engines, use EasyOcrOptions, TesseractOcrOptions, etc.

    # Enable hardware acceleration if available
    pipeline_options.accelerator_options = AcceleratorOptions(num_threads=4, device=AcceleratorDevice.AUTO)

    # Instantiate the converter
    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    # Convert and export
    start_time = time.time()
    conv_result = doc_converter.convert(input_doc_path)
    print(f"Conversion took {time.time() - start_time:.2f} seconds")

    # Export Markdown
    output_dir = Path("scratch")
    output_dir.mkdir(parents=True, exist_ok=True)
    doc_filename = Path(input_doc_path).stem
    with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
        fp.write(conv_result.document.export_to_markdown())

    # Print Markdown to console
    print(conv_result.document.export_to_markdown())

if __name__ == "__main__":
    main()

Troubleshooting tips:

Make sure you have installed all dependencies (pip install docling rapidocr onnxruntime).
If you get import errors, check that your script is not named docling.py and is not in a folder named docling.
For best OCR results on footers/images, use RapidOCR and set force_full_page_ocr=True as shown above.
If you want to extract only the footer region, you can use Docling's region extraction utilities (get_text_in_rect(bbox)).
For other OCR engines, swap out RapidOcrOptions for EasyOcrOptions, TesseractOcrOptions, etc.

This script is based on the official Docling example (custom_convert.py) and incorporates best practices for accuracy and performance.
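
Since the question mentions turning the extracted tables into HTML before indexing, the following is a rough standard-library sketch of that step. It assumes each table is available as a list of rows of strings with the header in the first row. The helper `rows_to_html` is a hypothetical name, not a Docling API.

```python
from html import escape
from typing import List


def rows_to_html(rows: List[List[str]]) -> str:
    """Render rows as a compact HTML table; the first row is treated as the header."""
    if not rows:
        return "<table></table>"
    head, *body = rows
    parts = ["<table>", "<thead><tr>"]
    # Escape cell text so characters like < and & stay valid HTML
    parts += [f"<th>{escape(cell)}</th>" for cell in head]
    parts.append("</tr></thead>")
    parts.append("<tbody>")
    for row in body:
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</tbody></table>")
    return "".join(parts)


# Hypothetical two-column table
html_table = rows_to_html([["Code", "Rule"], ["A1", "x < 5"]])
print(html_table)
```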


akashsinghkunwar · 2025-08-28T22:29:40Z

@dosu I tried RapidOCR as well as EasyOCR, but neither works. For the image I shared in the question, the resulting conversion is:

County	Territory Hurricane	Application Rule	Distance to Coast Eligibility
51	Mandatory: 2%	Not Eligible	Dukes, Nantucket All Risks
37	Mandatory: 2%

9 replies

dosubot[bot] bot Aug 29, 2025

Here's a script that uses Docling for region-based extraction with increased image scale and force_full_page_ocr enabled. This example shows how to extract text from a specific region (like a footer) on a PDF page using bounding box coordinates. Adjust the bbox values to match your region of interest.

from pathlib import Path
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions, RapidOcrOptions, TableFormerMode
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import BoundingBox  # BoundingBox lives in docling-core

# Path to your PDF
input_doc_path = "your_file.pdf"  # <-- change this

# Configure pipeline for best OCR accuracy
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    table_structure_options={
        "do_cell_matching": True,
        "mode": TableFormerMode.ACCURATE
    },
    ocr_options=RapidOcrOptions(force_full_page_ocr=True, lang=["en"]),
    images_scale=2.0,  # Increase for better OCR on small text
    generate_page_images=True,
    generate_parsed_pages=True
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Convert the document
conv_result = converter.convert(input_doc_path)

# Example: extract text from a region (e.g., footer) on the first page
# You need to set these coordinates (l, t, r, b) to match your region
# Units are in PDF points (1/72 inch); use a PDF viewer to estimate
bbox = BoundingBox(l=0, t=700, r=600, b=800)  # Example: bottom 100 points of an 800pt-tall page

page_backend = conv_result.input._backend.load_page(0)  # 0-based page index
footer_text = page_backend.get_text_in_rect(bbox)
print("Extracted footer text:", footer_text)

To find the right bounding box, use a PDF viewer to get page dimensions and visually estimate the region. You can also call page_backend.get_size() to get the page width and height.
If you want to extract from a different page, change the index in load_page(n).
Make sure you have installed all dependencies: pip install docling rapidocr onnxruntime.

This approach uses Docling's backend API for region extraction. For more details on pipeline options and OCR configuration, see the pipeline options docs.
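
Rather than hard-coding the footer coordinates, the box could be derived from the page size. The sketch below is plain Python and makes two assumptions: coordinates are in PDF points with a top-left origin (so `t` is smaller than `b`), and the footer occupies a fixed-height strip at the bottom of the page. The helper name `footer_bbox` and the 72-point default are hypothetical.

```python
from typing import Tuple


def footer_bbox(
    page_width: float, page_height: float, footer_height: float = 72.0
) -> Tuple[float, float, float, float]:
    """Return (l, t, r, b) for a strip at the bottom of the page.

    Assumes a top-left coordinate origin and units of PDF points (1/72 inch).
    """
    if footer_height <= 0 or footer_height > page_height:
        raise ValueError("footer_height must be between 0 and the page height")
    return (0.0, page_height - footer_height, page_width, page_height)


# US Letter is 612 x 792 points; take the bottom inch as the footer region
l, t, r, b = footer_bbox(612.0, 792.0, footer_height=72.0)
print(l, t, r, b)
```

The four values can then be passed as `BoundingBox(l=l, t=t, r=r, b=b)`. If the backend reports coordinates from the bottom-left instead, the `t` and `b` values would need to be flipped accordingly.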


akashsinghkunwar Aug 29, 2025
Author

@dosu it's not working. Can you request another team member to look into this?

dosubot[bot] bot Aug 29, 2025

I've documented your issue and the steps you've taken (including region-based extraction, multiple OCR engines, VLMs, and the correct pipeline options such as generate_parsed_pages=True), but the extraction is still failing. This may be due to a regression or an undocumented edge case in the recent region extraction or OCR pipeline updates. I'll flag this for the maintainers so another team member can investigate further. If you have any error logs or specific failure details, sharing those could help the team diagnose the problem faster.


akashsinghkunwar Sep 2, 2025
Author

@dosu how can I get text from the footer of the page? It's a digitally printed PDF.

dosubot[bot] bot Sep 2, 2025

For digitally printed PDFs, Docling labels footer text as page_footer in the furniture content layer. To extract footer text, iterate over the document items and include the FURNITURE layer:

from docling_core.types.doc.document import ContentLayer
from docling_core.types.doc.labels import DocItemLabel

doc = conv_result.document  # the document returned by DocumentConverter.convert(...)
for item, _ in doc.iterate_items(included_content_layers={ContentLayer.BODY, ContentLayer.FURNITURE}):
    # The label value is "page_footer" (underscore), exposed as DocItemLabel.PAGE_FOOTER
    if getattr(item, "label", None) == DocItemLabel.PAGE_FOOTER:
        print(item.text)  # This is your footer text

By default, Docling skips furniture items (headers/footers) unless you include them as above. This method works as long as the footer is present in the PDF's text layer and correctly labeled by Docling’s layout processor.

If you still don’t see footer text, check your conversion pipeline and inspect the document model to confirm the footer is labeled and present. If the footer is actually an embedded image, you’ll need to use OCR or region-based extraction instead.

