Hi @paulohss! I'm Dosu and I'm helping the docling team.

Extracting tables with assembly-like or pseudo-code formatting from PDFs is tricky because most OCR tools (including EasyOCR, RapidOCR, and Tesserocr) are optimized for standard tabular layouts and may not recognize code-like structures as tables. Docling's modular architecture lets you build custom chunkers or parsers to post-process OCR output, but there isn't an out-of-the-box solution for this specific case.

A practical approach is to use EasyOCR for the initial text extraction, then apply a custom parser (for example, regex or line-based heuristics) to reconstruct the table structure from the raw OCR output. Docling-core supports extensibility, so you can integrate your parser into the workflow to handle these complex tables and turn loosely structured markdown into a more accurate representation of your table data.

To automate this, consider writing a Python script that processes the OCR output line by line, detects patterns (like FDB directives or signal group entries), and assembles them into structured data. Docling's chunking and transformation modules provide a solid foundation for plugging in such custom logic. EasyOCR is included by default in Docling's auto-ocr workflows, so you can focus on enhancing the post-processing step rather than the OCR itself.
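As a rough illustration of the line-by-line idea, here is a minimal sketch. The line format it assumes (`[LABEL] FDB operand[,operand...] [; comment]`) is a guess made for illustration only, since the actual table from the PDF is not shown in this thread. The names `DIRECTIVE_RE`, `parse_directive_lines`, and `rows_to_markdown` are hypothetical helpers and are not part of Docling.

```python
import re
from typing import Dict, List

# ASSUMPTION: each table row looks like
#   [LABEL]  FDB  operand[,operand...]  [; comment]
# Adjust the pattern to whatever the real OCR output looks like.
DIRECTIVE_RE = re.compile(
    r"^(?:(?P<label>[A-Za-z_]\w*)\s+)?"   # optional label
    r"(?P<directive>FDB)\s+"              # the directive keyword
    r"(?P<operands>[^;]*?)\s*"            # comma-separated operands
    r"(?:;\s*(?P<comment>.*))?$"          # optional trailing comment
)


def parse_directive_lines(text: str) -> List[Dict[str, object]]:
    """Collect the lines that match the assumed directive format."""
    rows: List[Dict[str, object]] = []
    for raw in text.splitlines():
        # Drop surrounding whitespace and leading/trailing markdown pipes.
        line = raw.strip().strip("|").strip()
        match = DIRECTIVE_RE.match(line)
        if match is None:
            continue  # not a directive line; skip it
        operands = [op.strip() for op in match["operands"].split(",") if op.strip()]
        rows.append(
            {
                "label": match["label"] or "",
                "directive": match["directive"],
                "operands": operands,
                "comment": (match["comment"] or "").strip(),
            }
        )
    return rows


def rows_to_markdown(rows: List[Dict[str, object]]) -> str:
    """Render the parsed rows as a plain markdown table."""
    lines = [
        "| label | directive | operands | comment |",
        "| --- | --- | --- | --- |",
    ]
    for r in rows:
        lines.append(
            f"| {r['label']} | {r['directive']} | {', '.join(r['operands'])} | {r['comment']} |"
        )
    return "\n".join(lines)
```

The input is just a string, so it can be fed with whatever text the conversion step produces. In Docling that would typically be the markdown returned by `result.document.export_to_markdown()`. Lines that do not match the pattern are skipped silently, so you may want to log them while tuning the regex.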
Nice! @dosu But how do I integrate this with the pipeline? I mean, where is the "line by line" stream returned by a pipeline like this:

#--------------------------------------------------------
# Function to convert document with image annotation
#--------------------------------------------------------
Hello guys!
I've been using Docling to process some PDFs for a RAG system, and it generally works well. What I'm struggling with is extracting tables like these:

The resulting markdown is like this:

I've tried 'easyocr', 'rapidocr', and 'tesserocr'. EasyOCR is the best one so far for table extraction, but it does not work well with the type of table I shared above.
Any suggestions?
Cheers