To improve accuracy of converting pdf to md file #2156
-
Hi @akashsinghkunwar! I'm Dosu and I’m helping the docling team. To improve PDF to Markdown extraction accuracy in Docling, especially for footers with embedded images and complex tables:
Example config:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
    TableStructureOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    do_table_structure=True,
    # Use the typed option classes rather than plain dicts.
    table_structure_options=TableStructureOptions(
        do_cell_matching=True,
        mode=TableFormerMode.ACCURATE,
    ),
    # EasyOCR is shown here; other engines (e.g. RapidOcrOptions) accept the same two fields.
    ocr_options=EasyOcrOptions(
        force_full_page_ocr=True,
        lang=["en"],
    ),
    generate_parsed_pages=True,
)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```

For advanced needs (like extracting only footer regions or patching multi-row table headers), consider custom scripting using Docling’s document models and region extraction utilities.
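As one illustration of patching multi-row table headers, here is a minimal post-processing sketch in plain Python. It does not use any Docling API: `merge_header_rows` is a hypothetical helper that works on a table already pulled out as a list of rows. It assumes that a header spanning several columns shows up as one filled cell followed by empty cells, which may not match every extraction.

```python
def merge_header_rows(rows, header_rows=2, sep=" "):
    """Collapse the first `header_rows` rows of a table into one header row.

    `rows` is a list of rows, each a list of cell strings.
    Assumption: an empty cell in an upper header row belongs to the nearest
    non-empty cell on its left (i.e. that cell spanned several columns).
    """
    if not rows or len(rows) < header_rows:
        return rows

    # Pad every row to the same width so columns line up.
    width = max(len(r) for r in rows)
    padded = [list(r) + [""] * (width - len(r)) for r in rows]
    head = padded[:header_rows]

    # Forward-fill spanning cells in every header row except the last.
    for row in head[:-1]:
        last = ""
        for i, cell in enumerate(row):
            if cell.strip():
                last = cell.strip()
            else:
                row[i] = last

    # Join the header fragments of each column, skipping blanks and repeats.
    merged = []
    for col in range(width):
        parts = []
        for row in head:
            text = row[col].strip()
            if text and text not in parts:
                parts.append(text)
        merged.append(sep.join(parts))

    return [merged] + padded[header_rows:]
```

For example, a header split as `["Region", "Sales", ""]` over `["", "Q1", "Q2"]` becomes `["Region", "Sales Q1", "Sales Q2"]`, and the data rows are left unchanged.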
-
@dosu I tried RapidOCR as well as EasyOCR, but neither works. For the image I shared in the question, the resulting conversion is:
|
-
Hi
I am converting a PDF to a Markdown file and want to improve the accuracy of the extraction.
My PDF is digitally generated, but it also has an embedded image in the footer, and I need to extract the text from that footer. How can I process it?
Some of the tables are not extracted correctly. For example, when a column name spans two rows, the columns get mixed up.
Conversion result:
It would be great if you could provide some insight on this.
More context: I am converting the PDF to a Markdown file, converting the tables inside the Markdown to HTML, and then feeding the result to a Bedrock knowledge base.
I am using the simple code below, but I have also tried other examples that enable OCR (with `force_full_page_ocr` set to both true and false).
```python
pipeline = PdfPipelineOptions(do_ocr=False)
return DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline)}
)
```
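For the Markdown-to-HTML table step described above, a minimal sketch of what that conversion could look like in plain Python. `md_table_to_html` is a hypothetical helper, not part of Docling or Bedrock. It assumes a simple pipe table with one header row and one separator row, and it does not handle escaped pipes inside cells.

```python
from html import escape


def _split_cells(line):
    """Split one pipe-table line into stripped cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def md_table_to_html(md):
    """Convert a simple Markdown pipe table to an HTML <table> string."""
    lines = [ln for ln in md.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header row and a separator row")
    # The second line must be the separator, e.g. |---|:---:|
    if set(lines[1].strip()) - set("|-: "):
        raise ValueError("second line is not a table separator")

    header = _split_cells(lines[0])
    body = [_split_cells(ln) for ln in lines[2:]]

    out = ["<table>", "<thead>"]
    out.append("<tr>" + "".join(f"<th>{escape(c)}</th>" for c in header) + "</tr>")
    out += ["</thead>", "<tbody>"]
    for row in body:
        out.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    out += ["</tbody>", "</table>"]
    return "\n".join(out)
```

Cell text is HTML-escaped, so a value such as `A&B` is emitted as `A&amp;B` rather than breaking the markup.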