Trick "table" structure to be extracted #2496

paulohss · 2025-10-20T04:05:40Z

paulohss
Oct 20, 2025

Hello guys!

I've been using Docling to process some PDFs for a RAG system, and it's generally good. What I'm struggling with is extracting tables like these:

The resulting markdown is like this:

I've tried 'easyocr', 'rapidocr', and 'tesserocr'. EasyOCR is the best one so far for table extraction, but it does not work well with the type of table I shared above.

Any suggestions?

Cheers

@paulohss · 2025-10-20T04:07:49Z

dosubot[bot]
bot Oct 20, 2025

Hi @paulohss! I'm Dosu and I’m helping the docling team.

Extracting tables with assembly-like or pseudo-code formatting from PDFs is tricky because most OCR tools—including EasyOCR, RapidOCR, and Tesserocr—are optimized for standard tabular layouts and may not recognize code-like structures as tables. Docling's modular architecture lets you build custom chunkers or parsers to post-process OCR output, but there isn't an out-of-the-box solution for this specific case.

A practical approach is to use EasyOCR for initial text extraction, then apply a custom parser (for example, using regex or line-based heuristics) to reconstruct the table structure from the raw OCR output. Docling-core supports extensibility, so you can integrate your parser into the workflow to handle these complex tables. This way, you can transform loosely structured markdown into a more accurate representation of your table data.
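As a rough illustration of that last step, the sketch below turns comma-separated OCR lines into a markdown table. The sample lines, the `lines_to_markdown` helper, and the generic `field_N` column names are assumptions for illustration and are not part of Docling.

```python
def lines_to_markdown(lines, delimiter=","):
    """Render delimiter-separated lines as a markdown table, padding short rows."""
    rows = [
        [cell.strip() for cell in line.split(delimiter)]
        for line in lines
        if line.strip()
    ]
    if not rows:
        return ""
    # The widest row decides the column count; shorter rows get empty cells
    width = max(len(row) for row in rows)
    header = ["directive"] + [f"field_{i}" for i in range(1, width)]
    out = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    for row in rows:
        padded = row + [""] * (width - len(row))
        out.append("| " + " | ".join(padded) + " |")
    return "\n".join(out)


# Illustrative input only; replace with lines taken from your own OCR output
sample = ["FDB, 1, RED, INAMB, SET", "FDB, 2, SGAR, OFF"]
markdown_table = lines_to_markdown(sample)
print(markdown_table)
```

Padding short rows keeps every row at the same column count, which most markdown renderers and downstream table parsers expect.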

If you need to automate this, consider writing a Python script that processes the OCR output line-by-line, detects patterns (like FDB directives or signal group entries), and assembles them into structured data. Docling's chunking and transformation modules provide a solid foundation for plugging in such custom logic.

EasyOCR is included by default in Docling's auto-ocr workflows, so you can focus on enhancing the post-processing step rather than the OCR itself (reference).

To reply, just mention @dosu.


2 replies

paulohss Oct 21, 2025
Author

Hi @dosu,

Do you have sample code for a generic script that processes the OCR output line by line?

dosubot[bot] bot Oct 21, 2025

Here’s a generic Python script that processes OCR output line-by-line and can be adapted for pseudo-code or assembly-like tables. This approach works well with EasyOCR, which typically returns a list of text lines. You can customize the pattern matching section to fit your table structure:

import re

# Example: OCR output as a list of lines (replace with your actual OCR result)
ocr_lines = [
    "FDB, 1, RED, INAMB, SET",
    "FDB, 2, SGAR, OFF",
    "CALLC, Fire Station, Group 3",
    # ... more lines
]

# Example pattern for pseudo-code table rows (customize as needed)
row_pattern = re.compile(r"^FDB,\s*(\d+),\s*(\w+),\s*(\w+),\s*(\w+)$")

table_rows = []
for line in ocr_lines:
    match = row_pattern.match(line)
    if match:
        # Extract fields from the line
        signal_group = match.group(1)
        color = match.group(2)
        flag1 = match.group(3)
        flag2 = match.group(4)
        table_rows.append({
            "signal_group": signal_group,
            "color": color,
            "flag1": flag1,
            "flag2": flag2
        })
    else:
        # Handle non-matching lines (comments, other table types, etc.)
        pass

# Output structured table
for row in table_rows:
    print(row)

You can adapt the row_pattern regex and extraction logic to match your specific table format. For more advanced chunking or integration with Docling, you can subclass or extend the chunker classes in docling-core, which are designed for customizable line-by-line processing and chunk recognition (reference).
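One limitation worth noting: the `row_pattern` above expects exactly four fields after `FDB`, so a shorter line such as `"FDB, 2, SGAR, OFF"` falls through to the `else` branch. A possible alternative, sketched below, splits on commas and groups rows by their leading directive. The `parse_directive_lines` name, the extra `"Page 12"` sample line, and the output shape are assumptions for illustration.

```python
from collections import defaultdict


def parse_directive_lines(lines):
    """Group comma-separated lines by their first token (the directive)."""
    grouped = defaultdict(list)
    skipped = []
    for raw in lines:
        parts = [part.strip() for part in raw.split(",")]
        # Require an alphabetic directive plus at least one field;
        # everything else is set aside for manual review
        if len(parts) < 2 or not parts[0].isalpha():
            skipped.append(raw)
            continue
        grouped[parts[0]].append(parts[1:])
    return dict(grouped), skipped


ocr_lines = [
    "FDB, 1, RED, INAMB, SET",
    "FDB, 2, SGAR, OFF",
    "CALLC, Fire Station, Group 3",
    "Page 12",
]
tables, skipped = parse_directive_lines(ocr_lines)
print(tables)
print(skipped)
```

Keeping the skipped lines rather than discarding them makes it easier to spot rows that OCR mangled and that the heuristic therefore missed.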

Let me know if you want an example tailored to a specific table format or need help integrating this with Docling workflows!


paulohss · 2025-10-21T00:37:05Z

paulohss
Oct 21, 2025
Author

nice! @dosu

But how do I integrate this with the pipeline? I mean, where is the "line by line" stream returned in a pipeline like this:

import logging
import os

# Import paths below match recent Docling releases; adjust them to your installed version
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    EasyOcrOptions,
    PictureDescriptionApiOptions,
    TableFormerMode,
    TableStructureOptions,
    ThreadedPdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)
# IMAGE_PROMPT and post_process_pseudocode_table are not shown here

#--------------------------------------------------------
# Function to convert document with image annotation
#--------------------------------------------------------
def convert_with_image_annotation(input_doc_path):
    try:
        # Force CPU execution
        import torch
        torch.cuda.is_available = lambda: False
        os.environ["CUDA_VISIBLE_DEVICES"] = ""

        api_key = os.environ.get("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OPENAI_API_KEY environment variable is required")

        model = "gpt-4o-mini"
        picture_desc_api_option = PictureDescriptionApiOptions(
            url="https://api.openai.com/v1/chat/completions",
            prompt=IMAGE_PROMPT,
            params=dict(
                model=model,
            ),
            headers={
                "Authorization": "Bearer " + api_key,
            },
            timeout=60,
        )

        # PDF
        pipeline_options = ThreadedPdfPipelineOptions(
            do_ocr=True,
            ocr_options=EasyOcrOptions(
                lang=["en"]
            ),
            table_structure_options=TableStructureOptions(
                mode=TableFormerMode.ACCURATE,
                #do_cell_matching=True
            ),
            #force_backend_text = False,
            do_table_structure=True,
        )

        # Pictures
        #pipeline_options.picture_description_options=picture_desc_api_option
        #pipeline_options.enable_remote_services = True
        #pipeline_options.generate_picture_images = True
        #pipeline_options.do_picture_description = True
        #pipeline_options.images_scale = 2
        pipeline_options.accelerator_options = AcceleratorOptions(
            num_threads=4, device=AcceleratorDevice.AUTO
        )

        # Converter
        doc_converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
            }
        )
        conv_res = doc_converter.convert(source=input_doc_path)

        # hypothetical example: detect pseudo-code blocks by label
        # (`blocks`, `type` and the label value are placeholders, not confirmed Docling API)
        for block in conv_res.document.blocks:
            if block.type == "Code" or block.label == "PseudoCodeTable":
                parsed = post_process_pseudocode_table(block.text)
                print("Parsed pseudo-code records:", parsed)

        return conv_res

    except Exception as e:
        print(f"Error during conversion: {e}")
        _log.error("Conversion failed", exc_info=True)
        return None
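One possible way to obtain the line stream, sketched under assumptions: `DoclingDocument.iterate_items()` yields `(item, level)` pairs, and text-bearing items expose a `text` attribute, so the lines can be collected after `convert()` returns. The `iter_text_lines` helper and the stub classes used to exercise it are hypothetical; with a real conversion result you would pass `conv_res.document` instead of the stub.

```python
def iter_text_lines(document):
    """Yield non-empty, stripped text lines from every item that has a `text` attribute."""
    for item, _level in document.iterate_items():
        text = getattr(item, "text", None)
        if not text:
            continue  # items without text (for example tables or pictures) are skipped
        for line in text.splitlines():
            if line.strip():
                yield line.strip()


# Stand-ins so the sketch runs without Docling installed (hypothetical, not Docling classes)
class StubItem:
    def __init__(self, text=None):
        if text is not None:
            self.text = text


class StubDocument:
    def __init__(self, items):
        self._items = items

    def iterate_items(self):
        for item in self._items:
            yield item, 0


doc = StubDocument([
    StubItem("FDB, 1, RED, INAMB, SET\nFDB, 2, SGAR, OFF"),
    StubItem(),       # no text attribute at all
    StubItem("   "),  # whitespace only
])
lines = list(iter_text_lines(doc))
print(lines)
```

In `convert_with_image_annotation`, this could be called as `list(iter_text_lines(conv_res.document))` right after `doc_converter.convert(...)`, with the resulting lines handed to a line-by-line parser. If the OCR text ends up inside a recognized table instead, it may be easier to start from `conv_res.document.export_to_markdown().splitlines()`.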

0 replies
