PaddleOCR analyzes text Left-to-Right instead of Right-to-Left for Arabic - How to process RTL languages correctly? #14947
Replies: 2 comments
-
Currently, PaddleOCR does not offer a built-in option (such as ocr_order_method) to explicitly specify or automatically detect Right-to-Left (RTL) text direction for languages like Arabic. This means that even if the Arabic characters are recognized correctly (as you've observed), the final output is often ordered Left-to-Right (LTR), which is wrong for rendering Arabic text.

Recommended Workaround: Post-Processing the OCR Output
To produce properly ordered Arabic text, handle the reordering in a post-processing step in Python.

Note: The order of the detection boxes returned by PaddleOCR depends on where the boxes sit in the image. You may need to sort boxes manually (e.g. top-to-bottom, then right-to-left within lines) for more accurate sentence ordering in complex layouts.

Conclusion:
Post-processing along these lines should help you present OCR results in the correct Right-to-Left order for Arabic and other RTL languages.

Response generated by 🤖 feifei-bot | chatgpt-4o-latest
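The manual box sorting mentioned in the note (top-to-bottom, then right-to-left within lines) could look roughly like the sketch below. It is not part of PaddleOCR: `sort_boxes_rtl` and `line_tol` are hypothetical names, the input is assumed to be a list of (polygon, text) pairs that you build yourself from the OCR result, and the "same line" test is a simple heuristic that will need tuning for skewed or multi-column pages.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Sequence[Point]  # polygon corners as (x, y) pairs


def _center_y(box: Box) -> float:
    return sum(p[1] for p in box) / len(box)


def _right_x(box: Box) -> float:
    return max(p[0] for p in box)


def _height(box: Box) -> float:
    ys = [p[1] for p in box]
    return max(ys) - min(ys)


def sort_boxes_rtl(items: List[Tuple[Box, str]], line_tol: float = 0.5) -> List[Tuple[Box, str]]:
    """Order (box, text) pairs top-to-bottom, then right-to-left within a line.

    Assumption: a box belongs to the current line when its vertical center is
    within line_tol * (height of the line's first box) of that first box's center.
    """
    lines: List[List[Tuple[Box, str]]] = []
    for item in sorted(items, key=lambda it: _center_y(it[0])):
        if lines:
            ref_box = lines[-1][0][0]
            close = abs(_center_y(item[0]) - _center_y(ref_box))
            if close < line_tol * max(_height(ref_box), 1.0):
                lines[-1].append(item)
                continue
        lines.append([item])

    ordered: List[Tuple[Box, str]] = []
    for line in lines:
        # Rightmost box first, because RTL reading starts at the right edge.
        ordered.extend(sorted(line, key=lambda it: _right_x(it[0]), reverse=True))
    return ordered
```

Joining the texts of the returned pairs with spaces (and line breaks between lines, if you keep the grouping) gives the reading order for simple single-column layouts.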
-
Stumbled across this today as well. This is kind of tricky. Arabic reshaper and the bidi algorithm did not work for me because the results from PaddleOCR are in visual order and first need to be converted to the correct logical order. Bidi already expects logical order as input, so it doesn't help here.

import json
import re
from paddleocr import PaddleOCR

# Converts text from the OCR's non-standard 'reversed visual' order to the correct 'logical' order.
def convert_visual_to_logical(text: str) -> str:
    ARABIC_CHARS = re.compile(r'[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF]+')
    ARABIC_TRAILING_PUNCT = re.compile(r'([،؟؛!,.:?()\'"]+)$')
    words = text.split()
    fixed_words = []
    arabic_words = []
    for w in words:
        if ARABIC_CHARS.search(w):
            m = ARABIC_TRAILING_PUNCT.search(w)
            if m:
                punct = m.group(1)
                core_word = w[:-len(punct)]
            else:
                punct = ''
                core_word = w
            reversed_core = core_word[::-1]
            arabic_words.append(reversed_core + punct)
        else:
            if arabic_words:
                fixed_words.extend(arabic_words[::-1])
                arabic_words = []
            fixed_words.append(w)
    if arabic_words:
        fixed_words.extend(arabic_words[::-1])
    return ' '.join(fixed_words)

ocr = PaddleOCR(
    lang='ar',
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_textline_orientation=False
)
img_path = r"path\to\your\image"
results = ocr.predict(img_path)
all_data = []
if results and results[0]:
    ocr_data = results[0]
    texts = ocr_data.get('rec_texts', [])
    scores = ocr_data.get('rec_scores', [])
    polygons = ocr_data.get('rec_polys', [])
    for text, score, poly in zip(texts, scores, polygons):
        all_data.append({
            # Convert text
            "text": convert_visual_to_logical(text),
            # float() guards against NumPy scalar types that json cannot serialize
            "confidence": float(score),
            "points": poly.tolist()
        })
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(all_data, f, ensure_ascii=False, indent=2)
print("Saved converted OCR results to output.json")

I hope it helps someone else.
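As a variation on the regex in the reply above, here is a minimal sketch that detects RTL words with the standard library's `unicodedata.bidirectional` instead of hard-coded Arabic ranges, so the same logic would also cover Hebrew. It assumes the same "reversed visual" input described in the reply. `is_rtl_token` and `reverse_rtl_runs` are hypothetical names, and the trailing-punctuation handling from the original function is left out to keep it short.

```python
import unicodedata


def is_rtl_token(token: str) -> bool:
    """True if the token contains a character of bidi class R or AL."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in token)


def reverse_rtl_runs(text: str) -> str:
    """Reverse each RTL word, and the word order inside each run of RTL words.

    Non-RTL tokens (Latin words, numbers) keep their characters and position,
    and they end the current RTL run.
    """
    out, run = [], []
    for token in text.split():
        if is_rtl_token(token):
            run.append(token[::-1])
        else:
            out.extend(reversed(run))
            run = []
            out.append(token)
    out.extend(reversed(run))
    return " ".join(out)


# "hello world" in Arabic, written with escapes so the logical order is unambiguous.
logical = "\u0645\u0631\u062d\u0628\u0627 \u0639\u0627\u0644\u0645"
# Simulated OCR output: the whole string in reversed (visual) order.
visual = logical[::-1]
assert reverse_rtl_runs(visual) == logical
```

For a line that is purely Arabic this is equivalent to reversing the whole string; the run handling only matters when Latin words or numbers are mixed in.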
-
I am using PaddleOCR with the Arabic language model (lang='ar') to perform OCR on Arabic images. While PaddleOCR correctly recognizes the Arabic characters, it processes the text in a Left-to-Right (LTR) order, which is incorrect for Arabic, a Right-to-Left (RTL) language. This results in the words and sentences being in reverse order.
I have reviewed the paddleocr --help output to see if there are any options to explicitly set the text direction or handle RTL languages like Arabic.
My question is:
Is there a specific option in PaddleOCR, possibly using ocr_order_method or another parameter, to correctly handle Right-to-Left languages like Arabic and ensure the output text is in the correct RTL order?
If there isn't a built-in option, what are the recommended workarounds to post-process the OCR output to reorder the text correctly for RTL languages in Python?
Any guidance or solutions on how to get PaddleOCR to output Arabic text in the correct Right-to-Left order would be greatly appreciated.
I tried to use the following code:
from paddleocr import PaddleOCR, draw_ocr

ocr = PaddleOCR(use_angle_cls=True, lang='ar')
img_path = 'image5.jpg'
result = ocr.ocr(img_path, cls=True)
for idx in range(len(result)):
    res = result[idx]
    for line in res:
        print(line)

# draw result
from PIL import Image

result = result[0]
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, font_path='./doc/fonts/arabic.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')