
Arabic text reversed with connected letters not reshaped correctly #69

@AnasAG

I have a script that extracts Arabic text from PDFs, using the pdfminer library for PDF parsing. In the extracted Arabic text, the sentences come out reversed, but the letters in each word are still connected.

Original text in PDF: "وضح المقصود بكل من المصطلحات التالية"
Extracted text from PDF: "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"

When I pass this extracted text through arabic_reshaper, the result is not shaped correctly.

Sample Code:

import arabic_reshaper
from bidi.algorithm import get_display

text = "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"

reshaped_text = arabic_reshaper.reshape(text)    # correct its shape
print(reshaped_text)
# result: ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ

bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: ﻮﻀﺣ ﻼﻤﻘﺻﻭﺩ ﺐﻜﻟ ﻢﻧ ﻼﻤﺼﻄﻠﺣﺎﺗ ﻼﺗﻼﻳﺓ
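One way to see what the reshaper is actually receiving is to dump the code points of the extracted string. The sketch below is only a diagnostic aid and uses nothing but the standard library; the helper names `describe_codepoints` and `uses_presentation_forms` are made up for illustration and are not part of arabic_reshaper or pdfminer. If the extracted text already consists of Arabic Presentation Forms characters (glyph-level code points) rather than base letters, that might explain why reshaping appears to change nothing.

```python
import unicodedata


def describe_codepoints(text):
    """Return (code point, Unicode name) pairs for every non-space character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if not ch.isspace()
    ]


def uses_presentation_forms(text):
    """True if any character lies in Arabic Presentation Forms-A or Forms-B."""
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC
        for ch in text
    )


if __name__ == "__main__":
    # "\uFEFB" is a LAM WITH ALEF ligature, "\u0644\u0627" are the base letters.
    for sample in ("\uFEFB", "\u0644\u0627"):
        print(uses_presentation_forms(sample), describe_codepoints(sample))
```

Running this on the two extracted strings above should show whether both inputs are made of presentation forms, and which ligature code points appear in each.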

However, with Arabic text that is reversed in the same way but whose letters are isolated (not connected), arabic_reshaper followed by get_display works properly.

Original text in PDF: "على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة"
Extracted text from PDF: "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"

Sample code:

import arabic_reshaper
from bidi.algorithm import get_display

text = "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"

reshaped_text = arabic_reshaper.reshape(text)    # correct its shape
print(reshaped_text)
# result:  ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ

bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة

I couldn't find out why it behaves this way. I also tried using the ArabicReshaper class with a custom configuration, changing options such as use_unshaped_instead_of_isolated and support_ligatures, but the behavior was the same.
The PDF font affects the extracted text, which may also be why the text is sometimes extracted with connected letters and sometimes with isolated ones. In general, though, I'm not sure whether this is a bug, something related to ligatures, or something else.
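A possible workaround, sketched under assumptions rather than confirmed as the cause: if the extracted string already contains presentation-form characters, Unicode NFKC normalization maps them back to base letters, after which the line can be reversed into logical order. The function name `visual_to_logical` and its flag are hypothetical. The flag reflects a guess that in the first example the lam-alef ligatures were formed on the already reversed letters, so normalization has to happen before reversal there, while correctly shaped visual-order text needs the opposite order. The plain reversal only suits lines that are entirely Arabic; mixed lines such as the one with "(n-l-m-s)" would still need a bidi step like get_display. Note that NFKC also folds other compatibility characters, which may or may not be wanted.

```python
import unicodedata


def visual_to_logical(text, ligatures_from_reversed=True):
    """Convert a purely Arabic visual-order line to logical-order base letters.

    ligatures_from_reversed=True  : assume ligatures were built on the reversed
                                    letters, so decompose first, then reverse.
    ligatures_from_reversed=False : assume the text was shaped correctly and
                                    then reversed, so reverse first, then decompose.
    Both cases are assumptions about how the PDF text was produced.
    """
    if ligatures_from_reversed:
        return unicodedata.normalize("NFKC", text)[::-1]
    return unicodedata.normalize("NFKC", text[::-1])


if __name__ == "__main__":
    # Presentation forms resembling the first word of the first extracted line.
    word = "\uFE93\uFEF3\uFEFB\uFE97\uFEFB"
    print(visual_to_logical(word))
```

The result contains only base letters, so it can then be passed to arabic_reshaper.reshape and get_display in the usual way if shaped output is needed for display.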
