PaddleOCR OCR analyzes Left-to-Right instead of Right-to-Left for Arabic- How to process RTL languages correctly? #14971

roostaeecs · 2025-03-27T22:16:24Z

🔎 Search before asking

I have searched the PaddleOCR Docs and found no similar bug report.
I have searched the PaddleOCR Issues and found no similar bug report.
I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (Problem description)

I am using PaddleOCR with the Arabic language model (lang='ar') to perform OCR on Arabic images. While PaddleOCR correctly recognizes the Arabic characters, it processes the text in a Left-to-Right (LTR) order, which is incorrect for Arabic, a Right-to-Left (RTL) language. This results in the words and sentences being in reverse order.

I have reviewed the paddleocr --help output to see if there are any options to explicitly set the text direction or handle RTL languages like Arabic.

My question is:

Is there a specific option in PaddleOCR, possibly using ocr_order_method or another parameter, to correctly handle Right-to-Left languages like Arabic and ensure the output text is in the correct RTL order?

If there isn't a built-in option, what are the recommended workarounds to post-process the OCR output to reorder the text correctly for RTL languages in Python?

Any guidance or solutions on how to get PaddleOCR to output Arabic text in the correct Right-to-Left order would be greatly appreciated.

🏃‍♂️ Environment (Runtime environment)

OS: Windows 11
Environment: PyCharm
Python: 3.8
PaddleOCR: 2.10

🌰 Minimal Reproducible Example (Minimal demo that reproduces the problem)

from paddleocr import PaddleOCR, draw_ocr

# Language code written in lowercase, matching lang='ar' in the description above
ocr = PaddleOCR(use_angle_cls=True, lang='ar')
img_path = 'image5.jpg'
result = ocr.ocr(img_path, cls=True)
for idx in range(len(result)):
    res = result[idx]
    for line in res:
        print(line)

# draw result

from PIL import Image
result = result[0]
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]       # four corner points per detection
txts = [line[1][0] for line in result]     # recognized text
scores = [line[1][1] for line in result]   # recognition confidence
im_show = draw_ocr(image, boxes, txts, scores, font_path='./doc/fonts/arabic.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')

GreatV (Maintainer) · 2025-04-01T13:33:34Z

PaddleOCR currently does not fully support automatic handling of Right-to-Left (RTL) languages like Arabic in terms of reading order. While the model can accurately recognize Arabic characters, the OCR output is returned in a Left-to-Right (LTR) sequence, which leads to reversed word and sentence structures.

As of PaddleOCR v2.10, there is no built-in parameter such as ocr_order_method or an automatic RTL layout analyzer that handles RTL script flow natively. This is a known limitation for languages like Arabic, Persian, and Hebrew.

📌 Workaround: Post-process the output to reverse the order of text

You can manually adjust the text direction using Python after obtaining the OCR results. Here's a suggested approach:

After running OCR, group detected lines or blocks based on their vertical position (y-coordinate).
Within each line/block, sort results from right to left instead of left to right.
Concatenate the texts in that right-to-left order to form the correct RTL string.

Here is an example post-processing snippet:

# Example post-processing for RTL (Arabic)
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='ar')
img_path = 'image5.jpg'
# The first element can be None when no text is detected, so fall back to an empty list
result = ocr.ocr(img_path, cls=True)[0] or []

# Group lines based on y-coordinate (with some threshold)
lines = {}
threshold = 10

for box in result:
    position = box[0]
    text = box[1][0]
    y_center = (position[0][1] + position[2][1]) / 2
    added = False
    for key in lines:
        if abs(y_center - key) < threshold:
            lines[key].append((position, text))
            added = True
            break
    if not added:
        lines[y_center] = [(position, text)]

# Sort lines top to bottom
sorted_lines = sorted(lines.items(), key=lambda x: x[0])

for y, line_items in sorted_lines:
    # Sort boxes right to left based on x-coordinate of the rightmost point
    sorted_line = sorted(line_items, key=lambda x: -max(p[0] for p in x[0]))
    rtl_line = ' '.join([text for _, text in sorted_line])
    print(rtl_line)
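
The grouping and sorting logic above can be tried without running a model. The following is a minimal sketch, not part of PaddleOCR: the function name rtl_reading_order and its tolerance parameter are hypothetical, and it assumes the same [box, (text, score)] entries that the snippet above reads. Instead of a fixed 10-pixel threshold it derives the line threshold from the median box height, which is an assumption that may suit images of different resolutions better. The sample texts are Latin placeholders so the resulting order is easy to read.

```python
from statistics import median


def rtl_reading_order(result, tolerance=0.5):
    """Group OCR boxes into lines and read each line right to left.

    `result` is a list of [box, (text, score)] entries. `tolerance` is a
    hypothetical knob: the fraction of the median box height within which
    a box is treated as belonging to the current line.
    """
    if not result:
        return []

    items = []
    for box, (text, _score) in result:
        xs = [p[0] for p in box]
        ys = [p[1] for p in box]
        items.append({
            "text": text,
            "right": max(xs),                    # rightmost x of the box
            "y_center": (min(ys) + max(ys)) / 2,
            "height": max(ys) - min(ys),
        })

    threshold = tolerance * median(item["height"] for item in items)

    # Walk the boxes top to bottom; start a new line whenever the vertical
    # distance to the first box of the current line exceeds the threshold.
    lines = []
    for item in sorted(items, key=lambda it: it["y_center"]):
        if lines and abs(item["y_center"] - lines[-1]["y"]) <= threshold:
            lines[-1]["items"].append(item)
        else:
            lines.append({"y": item["y_center"], "items": [item]})

    # Within each line, the box with the largest right edge comes first.
    ordered = []
    for line in lines:
        right_to_left = sorted(line["items"], key=lambda it: -it["right"])
        ordered.append(" ".join(it["text"] for it in right_to_left))
    return ordered


# Synthetic boxes: "B" is left of "A" on the first line, "C" is on the second.
sample = [
    [[[10, 10], [90, 10], [90, 40], [10, 40]], ("B", 0.99)],
    [[[110, 12], [190, 12], [190, 42], [110, 42]], ("A", 0.98)],
    [[[60, 60], [190, 60], [190, 90], [60, 90]], ("C", 0.97)],
]
print(rtl_reading_order(sample))
```

With real output, the call would be rtl_reading_order(ocr.ocr(img_path, cls=True)[0]). Like the snippet above, this still assumes roughly horizontal, single-column text.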

📝 Notes:

The sorting strategy above assumes relatively horizontal text lines. For more complex layouts (e.g., curved or multi-column text), further layout analysis may be needed.
You may use libraries like python-bidi or arabic-reshaper along with an appropriate Arabic font to render visually correct RTL text, especially when displaying or printing.
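
One gap in the workaround is that it sorts every line right to left, including lines that contain only Latin text or numbers. A minimal sketch of deciding the direction per line, using only the standard library's unicodedata module, is shown here. The helper name is_rtl_text is hypothetical, and the majority rule it applies is a simplification for illustration, not the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def is_rtl_text(text):
    """Return True when RTL letters outnumber LTR letters in `text`.

    Only strongly directional characters are counted: bidi classes "R"
    (e.g. Hebrew) and "AL" (Arabic letters) versus "L" (e.g. Latin).
    Digits, spaces, and punctuation are ignored.
    """
    rtl = ltr = 0
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            rtl += 1
        elif bidi_class == "L":
            ltr += 1
    return rtl > ltr


for candidate in ["مرحبا", "PaddleOCR 2.10", "نسخة 2.10"]:
    print(repr(candidate), is_rtl_text(candidate))
```

A line's boxes could then be sorted right to left only when is_rtl_text returns True for the joined text of that line, and left to right otherwise.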

🔮 Future Support:

While PaddleOCR supports multilingual OCR including Arabic in recognition, it currently lacks native document layout analysis or direction correction for RTL scripts. It’s recommended to follow or raise a feature request in the PaddleOCR GitHub Issues for improved RTL language support.

No currently available Issues or Discussions reference this particular RTL text order feature in PaddleOCR directly. You may consider opening a new Feature Request on GitHub.

Response generated by 🤖 feifei-bot | chatgpt-4o-latest

⚠️ This is an AI-generated response. Please verify any critical information.

cqray1990 · 2025-05-23T03:30:29Z

How should real data be annotated when the reading order is right to left? The cropped images used for training do not match their labels.
@GreatV @roostaeecs
