RapidOCR sometimes outputs English text without spaces on low-resolution images #636

#### Problem Description

When RapidOCR is used through Docling, the OCR result for English text frequently drops the spaces between words, so several words are concatenated into a single token.

The problem is reproducible when parsing JPG documents containing English tables.
It appears to be an OCR recognition / spacing issue rather than a markdown or post-processing formatting problem.

Example output (parsed result):
Millionsofdollarsandsharesexceptpersharedata
Productsales
Totalrevenue
Operatingcostsandexpenses

Expected output:

Millions of dollars and shares except per share data
Product sales
Total revenue
Operating costs and expenses
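
To make the symptom concrete, here is a minimal sketch that checks whether the parsed lines above differ from the expected lines only in whitespace. The helper names `strip_whitespace` and `differs_only_by_spaces` are hypothetical and not part of Docling or RapidOCR; the strings are copied from the example and expected output above.

```python
def strip_whitespace(text: str) -> str:
    """Remove every whitespace character so only the glyphs remain."""
    return "".join(text.split())


def differs_only_by_spaces(actual: str, expected: str) -> bool:
    """Return True when both strings match once whitespace is ignored."""
    return strip_whitespace(actual) == strip_whitespace(expected)


# Lines copied from the example output and the expected output above.
actual_lines = [
    "Millionsofdollarsandsharesexceptpersharedata",
    "Productsales",
    "Totalrevenue",
    "Operatingcostsandexpenses",
]
expected_lines = [
    "Millions of dollars and shares except per share data",
    "Product sales",
    "Total revenue",
    "Operating costs and expenses",
]

for actual, expected in zip(actual_lines, expected_lines):
    # Number of spaces present in the expected text but absent from the OCR text.
    missing = expected.count(" ") - actual.count(" ")
    only_spaces = differs_only_by_spaces(actual, expected)
    print(f"only_spaces_differ={only_spaces} missing_spaces={missing} text={actual!r}")
```

If every pair passes this check, the characters themselves are recognized correctly and only the word boundaries are lost, which is consistent with a spacing problem rather than a general recognition failure.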

#### Runtime Environment

- Docling version: 2.72.0
- OCR engine: RapidOCR (integrated via Docling)
- RapidOCR backend: ONNX
- Models:
  - Detection: ch_PP-OCRv5_server_det.onnx
  - Recognition: ch_PP-OCRv5_rec_server_infer.onnx
  - Classification: ch_ppocr_mobile_v2.0_cls_infer.onnx
- Runtime: CPU-only (AWS Lambda–compatible environment)
- Input format: JPG
- Language: English

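#### Reproduction Code

The reproduction code is copied from the Docling example at
https://docling-project.github.io/docling/examples/rapidocr_with_custom_models/.
Note that the example as copied converts a PDF, while the issue itself was observed on JPG input.
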
```python
import os
from modelscope import snapshot_download

from docling.datamodel.base_models import InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def main():
    source = "https://arxiv.org/pdf/2408.09869v4"

    # Download RapidOCR models
    download_path = snapshot_download(repo_id="RapidAI/RapidOCR")

    det_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "det", "ch_PP-OCRv5_server_det.onnx"
    )
    rec_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv5", "rec", "ch_PP-OCRv5_rec_server_infer.onnx"
    )
    cls_model_path = os.path.join(
        download_path, "onnx", "PP-OCRv4", "cls", "ch_ppocr_mobile_v2.0_cls_infer.onnx"
    )

    ocr_options = RapidOcrOptions(
        det_model_path=det_model_path,
        rec_model_path=rec_model_path,
        cls_model_path=cls_model_path,
    )

    pipeline_options = PdfPipelineOptions(
        ocr_options=ocr_options,
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            ),
        },
    )

    conversion_result: ConversionResult = converter.convert(source=source)
    doc = conversion_result.document
    md = doc.export_to_markdown()
    print(md)


if __name__ == "__main__":
    main()
```



