RapidOCR sometimes outputs English text without spaces on low-resolution images #638

huolongguo2021 · 2026-02-06T15:51:10Z


Problem Description

When using RapidOCR integrated via Docling, the OCR result for English text frequently misses spaces between words, causing multiple words to be concatenated into a single token.

The problem is reproducible when parsing JPG documents containing English tables.
It appears to be an OCR recognition / spacing issue rather than a markdown or post-processing formatting problem.

Example output (parsed result):
Millionsofdollarsandsharesexceptpersharedata
Productsales
Totalrevenue
Operatingcostsandexpenses

Expected output:

Millions of dollars and shares except per share data
Product sales
Total revenue
Operating costs and expenses
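A quick way to confirm that the lines above differ only in spacing, and not in the recognized characters, is to compare them with all whitespace removed. The following is a minimal sketch; the helper name `differs_only_by_spaces` is hypothetical and not part of RapidOCR or Docling.

```python
def differs_only_by_spaces(actual: str, expected: str) -> bool:
    """Return True when the strings differ, but match once whitespace is removed."""
    def squash(text: str) -> str:
        # str.split() with no arguments splits on any run of whitespace.
        return "".join(text.split())

    return actual != expected and squash(actual) == squash(expected)


# Parsed output paired with the expected text, both taken from the report above.
pairs = [
    ("Millionsofdollarsandsharesexceptpersharedata",
     "Millions of dollars and shares except per share data"),
    ("Productsales", "Product sales"),
    ("Totalrevenue", "Total revenue"),
    ("Operatingcostsandexpenses", "Operating costs and expenses"),
]

for actual, expected in pairs:
    print(differs_only_by_spaces(actual, expected), actual)
```

For the four pairs from the report, every check returns True, which is consistent with the characters being recognized correctly and only the spaces being dropped.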

Runtime Environment

Docling version: 2.72.0

OCR engine: RapidOCR (integrated via Docling)

RapidOCR backend: ONNX

Models:

Detection: ch_PP-OCRv5_server_det.onnx

Recognition: ch_PP-OCRv5_rec_server_infer.onnx

Classification: ch_ppocr_mobile_v2.0_cls_infer.onnx

Runtime: CPU-only (AWS Lambda–compatible environment)

Input format: JPG

Language: English

Reproduction Code

(The reproduction code is copied from the Docling example at
https://docling-project.github.io/docling/examples/rapidocr_with_custom_models/)

import os
from modelscope import snapshot_download

from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def main():
    source = "https://arxiv.org/pdf/2408.09869v4"

    # Download RapidOCR models
    download_path = snapshot_download(repo_id="RapidAI/RapidOCR")

    det_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "det", "ch_PP-OCRv5_server_det.onnx"
    )
    rec_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "rec", "ch_PP-OCRv5_rec_server_infer.onnx"
    )
    cls_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv4", "cls", "ch_ppocr_mobile_v2.0_cls_infer.onnx"
    )

    ocr_options = RapidOcrOptions(
        det_model_path=det_model_path,
        rec_model_path=rec_model_path,
        cls_model_path=cls_model_path,
    )

    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options,
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            ),
        },
    )

    conversion_result: ConversionResult = converter.convert(source=source)
    doc = conversion_result.document
    md = doc.export_to_markdown()
    print(md)


if __name__ == "__main__":
    main()

SWHL · 2026-02-08T01:03:24Z

Maintainer

This is a known model issue. The reason is that there is limited data for mixed Chinese-English text, especially when English words are interspersed within Chinese text; as a result, the model fails to recognize and retain spaces between the English words.

Currently, there is no good solution for handling mixed Chinese-English text.

If your application involves only English, it is recommended to switch to an 'en' model, which will prevent the loss of spacing.

3 replies

huolongguo2021 Feb 8, 2026
Author

Thank you so much for answering my question so quickly. Yes, my application involves only English.

My current approach uses the Chinese models with ONNX Runtime and PP-OCRv5. Could you tell me which English model to specify in the code below? Thank you in advance.

    det_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "det", "ch_PP-OCRv5_server_det.onnx"
    )
    rec_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "rec", "ch_PP-OCRv5_rec_server_infer.onnx"
    )
    cls_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv4", "cls", "ch_ppocr_mobile_v2.0_cls_infer.onnx"
    )

SWHL Feb 8, 2026
Maintainer

Confirm that the English model en_PP-OCRv5_rec_mobile_infer.onnx is available locally.

    det_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "det", "ch_PP-OCRv5_server_det.onnx"
    )
    rec_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "rec", "en_PP-OCRv5_rec_mobile_infer.onnx"
    )
    cls_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv4", "cls", "ch_ppocr_mobile_v2.0_cls_infer.onnx"
    )
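To act on the "available locally" check before building `RapidOcrOptions`, the three paths can be verified up front. This is a minimal sketch using only the standard library; the helper name `rapidocr_model_paths` and the `MODEL_FILES` table are hypothetical, and the recognition file name is written with the `.onnx` extension given in the maintainer's sentence above.

```python
import os

# Relative locations inside the downloaded RapidAI/RapidOCR snapshot,
# as used in this thread (detection and classification stay unchanged).
MODEL_FILES = {
    "det": ("onnx", "PP-OCRv5", "det", "ch_PP-OCRv5_server_det.onnx"),
    "rec": ("onnx", "PP-OCRv5", "rec", "en_PP-OCRv5_rec_mobile_infer.onnx"),
    "cls": ("onnx", "PP-OCRv4", "cls", "ch_ppocr_mobile_v2.0_cls_infer.onnx"),
}


def rapidocr_model_paths(download_path: str) -> dict:
    """Build the det/rec/cls model paths and fail early if any file is missing."""
    paths = {
        name: os.path.join(download_path, *parts)
        for name, parts in MODEL_FILES.items()
    }
    missing = [path for path in paths.values() if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError(
            "Missing RapidOCR model files: " + ", ".join(missing)
        )
    return paths
```

The returned dictionary could then feed the options from the reproduction code, for example `RapidOcrOptions(det_model_path=paths["det"], rec_model_path=paths["rec"], cls_model_path=paths["cls"])`, so that a missing English model surfaces as a clear error rather than a failure inside the OCR pipeline.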

huolongguo2021 Feb 8, 2026
Author

@SWHL It worked perfectly with this English model. Thank you so much for helping me solve this problem.

Answer selected by huolongguo2021
