Problem with non-overlapping matches of ligature pattern in string when ligature is skipped due to its form mismatch  #86

@jurajmichalak1

When I render the text "مين" (b'\u0645\u064a\u0646') with PIL using its default libraqm layout, where libraqm reshapes the text, I get:
[image]
The text was transformed into b'\ufee3\uFC94'. That is the expected behavior: "\u064a\u0646" was replaced by the ligature "\uFC94", and the leading b'\u0645' was converted to its initial form b'\ufee3'.
Note: in b'\u0645\u064a\u0646' all letters are in their unshaped form.

But when I use arabic_reshaper(text) I get:
[image]
Note: this image was created using PIL with basic_layout, which means PIL renders letter by letter from left to right. The shaping is therefore done entirely by arabic_reshaper, and it fails in this case.
The text I get from the reshaper is b'\ufee3\ufef4\ufee6' instead of the expected b'\ufee3\uFC94'.

The source of the problem: the ligature regex is applied and the first match is '\u0645\u064A' (ARABIC LIGATURE MEEM WITH YEH), which has only an isolated form, so you correctly skip it (continue) here:
https://github.com/mpcabd/python-arabic-reshaper/blob/master/arabic_reshaper/arabic_reshaper.py#L220
But because the subsequent ligature "\u064a\u0646" overlaps with the previous match "\u0645\u064a", it is not returned by finditer and is therefore never applied.
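
The finditer behavior can be reproduced in isolation. This is a minimal sketch, assuming a two-alternative pattern as a stand-in for the library's real (much larger) ligature regex; the name `pattern` is hypothetical and not taken from the library.

```python
import re

# Hypothetical stand-in for the ligature regex: MEEM+YEH or YEH+NOON.
pattern = re.compile('(\u0645\u064a)|(\u064a\u0646)')
text = '\u0645\u064a\u0646'

# finditer only yields non-overlapping matches, so once MEEM+YEH has
# consumed positions 0..2, YEH+NOON (positions 1..3) is never reported.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)

# Searching again from one character after the start of the first match
# does find the overlapping candidate.
second = pattern.search(text, 1)
print(second.span())
```

If the first match is then skipped because it has no suitable form, the second candidate is lost unless the search is restarted.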

I have a simple fix:

diff --git a/arabic_reshaper/arabic_reshaper.py b/arabic_reshaper/arabic_reshaper.py
index 4721a6a..a94cd0f 100644
--- a/arabic_reshaper/arabic_reshaper.py
+++ b/arabic_reshaper/arabic_reshaper.py
@@ -186,14 +186,17 @@ class ArabicReshaper(object):
             if delete_tatweel:
                 text = text.replace(TATWEEL, '')
 
-            for match in re.finditer(self._ligatures_re, text):
+            regex_start = 0
+            matchIt = re.finditer(self._ligatures_re, text)
+            match = next(matchIt, None)
+            while match:
                 group_index = next((
                     i for i, group in enumerate(match.groups()) if group
                 ), -1)
                 forms = self._get_ligature_forms_from_re_group_index(
                     group_index
                 )
-                a, b = match.span()
+                a, b = tuple(i+regex_start for i in match.span())
                 a_form = output[a][FORM]
                 b_form = output[b - 1][FORM]
                 ligature_form = None
@@ -218,9 +221,13 @@ class ArabicReshaper(object):
                     else:
                         ligature_form = MEDIAL
                 if not forms[ligature_form]:
+                    regex_start = a+1
+                    matchIt = re.finditer(self._ligatures_re, text[regex_start:])
+                    match = next(matchIt, None)
                     continue
                 output[a] = (forms[ligature_form], NOT_SUPPORTED)
                 output[a+1:b] = repeat(('', NOT_SUPPORTED), b - 1 - a)
+                match = next(matchIt, None)
 
         result = []
         if not delete_harakat and -1 in positions_harakat:
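
The restart idea in the patch can also be sketched as a standalone loop. This is only an illustration, not the library's code: `accepted_spans` and `is_usable` are hypothetical names, and the two-alternative pattern again stands in for the real ligature regex. It uses `pattern.search(text, pos)` instead of slicing the text, so the spans are already absolute and no offset has to be added.

```python
import re

def accepted_spans(pattern, text, is_usable):
    """Collect spans of usable matches, retrying overlapped candidates.

    After a rejected match starting at `a`, the search resumes at `a + 1`
    so a candidate overlapping the rejected one can still be found.
    """
    spans = []
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            break
        a, b = match.span()
        if is_usable(match):
            spans.append((a, b))
            # Continue after the accepted match; max() guards against
            # an empty match causing an endless loop.
            pos = max(b, a + 1)
        else:
            pos = a + 1
    return spans

# Hypothetical stand-in pattern: MEEM+YEH or YEH+NOON.
pattern = re.compile('(\u0645\u064a)|(\u064a\u0646)')
text = '\u0645\u064a\u0646'

# Pretend MEEM+YEH (group 1) has no form for its position and must be skipped.
skip_first = accepted_spans(pattern, text, lambda m: m.group(1) is None)
# If every match is usable, the result is the same as plain finditer.
accept_all = accepted_spans(pattern, text, lambda m: True)
print(skip_first, accept_all)
```

With the first ligature rejected, the overlapping YEH+NOON candidate at positions 1..3 is picked up, which is the outcome the patch aims for.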
