RapidOCR sometimes outputs English text without spaces on low-resolution images #636

@huolongguo2021

Problem Description

When using RapidOCR integrated via Docling, the OCR result for English text frequently misses spaces between words, causing multiple words to be concatenated into a single token.

The problem is reproducible when parsing JPG documents containing English tables.
It appears to be an OCR recognition / spacing issue rather than a markdown or post-processing formatting problem.

Example output (parsed result):
Millionsofdollarsandsharesexceptpersharedata
Productsales
Totalrevenue
Operatingcostsandexpenses

Expected output:

Millions of dollars and shares except per share data
Product sales
Total revenue
Operating costs and expenses
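As a stopgap while the recognition issue is investigated, the fused strings above can in principle be re-split with a dictionary-based word segmenter. The following is only a minimal sketch of that idea, not part of RapidOCR or Docling: the `segment` helper and the tiny `VOCAB` set are hypothetical and cover just the words in this example, so a real workaround would need a proper English word list and would still fail on out-of-vocabulary tokens.

```python
def segment(text, vocab):
    """Split `text` into vocabulary words using the fewest words possible.

    Returns the space-joined result (original casing preserved), or None
    if the text cannot be fully covered by words from `vocab`.
    """
    lower = text.lower()
    n = len(lower)
    max_len = max(map(len, vocab))
    inf = float("inf")
    cost = [inf] * (n + 1)   # cost[i]: fewest words needed to cover text[:i]
    back = [-1] * (n + 1)    # back[i]: start index of the last word in that split
    cost[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if cost[start] + 1 < cost[end] and lower[start:end] in vocab:
                cost[end] = cost[start] + 1
                back[end] = start
    if cost[n] == inf:
        return None
    # Walk the back-pointers from the end to recover the words in order.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words))


# Hypothetical vocabulary: only the words that occur in the example above.
VOCAB = {
    "millions", "of", "dollars", "and", "shares", "except", "per", "share",
    "data", "product", "sales", "total", "revenue", "operating", "costs",
    "expenses",
}

for line in ["Productsales", "Totalrevenue", "Operatingcostsandexpenses"]:
    print(segment(line, VOCAB))
```

This only masks the symptom in post-processing; it does not address why the recognition model drops the spaces in the first place.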

Runtime Environment

Docling version: 2.72.0

OCR engine: RapidOCR (integrated via Docling)

RapidOCR backend: ONNX

Models:

Detection: ch_PP-OCRv5_server_det.onnx

Recognition: ch_PP-OCRv5_rec_server_infer.onnx

Classification: ch_ppocr_mobile_v2.0_cls_infer.onnx

Runtime: CPU-only (AWS Lambda–compatible environment)

Input format: JPG

Language: English

Reproduction Code

(The reproduction code is copied from the Docling example at
https://docling-project.github.io/docling/examples/rapidocr_with_custom_models/)

import os
from modelscope import snapshot_download

from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def main():
    source = "https://arxiv.org/pdf/2408.09869v4"

    # Download RapidOCR models
    download_path = snapshot_download(repo_id="RapidAI/RapidOCR")

    det_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "det", "ch_PP-OCRv5_server_det.onnx"
    )
    rec_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "rec", "ch_PP-OCRv5_rec_server_infer.onnx"
    )
    cls_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv4", "cls", "ch_ppocr_mobile_v2.0_cls_infer.onnx"
    )

    ocr_options = RapidOcrOptions(
        det_model_path=det_model_path,
        rec_model_path=rec_model_path,
        cls_model_path=cls_model_path,
    )

    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options,
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            ),
        },
    )

    conversion_result: ConversionResult = converter.convert(source=source)
    doc = conversion_result.document
    md = doc.export_to_markdown()
    print(md)


if __name__ == "__main__":
    main()
