Problem with non-overlapping matches of ligature pattern in string when ligature is skipped due to its form mismatch 

When I render the text "مين" (b'\\u0645\\u064a\\u0646') with PIL using its default libraqm layout, where the text is shaped by the [libraqm library](https://github.com/HOST-Oman/libraqm), I get:
![image](https://user-images.githubusercontent.com/23061228/195609698-01dac1e5-cb42-4a0e-95ef-f1f38a545e53.png)
The text was transformed into b'\\ufee3\\uFC94'. That is the expected behavior: "\\u064a\\u0646" became the ligature "\\uFC94", and the leading b'\\u0645' became its initial form b'\\ufee3'.
Note: in b'\\u0645\\u064a\\u0646' all letters are in their unshaped form.
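For reference, the code points quoted in this report can be inspected with the standard `unicodedata` module. This is only a small sketch: the variable names and the `describe` helper are mine and are not part of PIL or arabic_reshaper.

```python
import unicodedata

unshaped = "\u0645\u064a\u0646"          # input text: meem, yeh, noon
raqm_result = "\ufee3\ufc94"             # text as shaped by the libraqm layout
reshaper_result = "\ufee3\ufef4\ufee6"   # text as returned by arabic_reshaper

def describe(text):
    """Return (code point, Unicode name) pairs for each character (illustrative helper)."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")) for ch in text]

for label, text in [
    ("unshaped", unshaped),
    ("libraqm", raqm_result),
    ("arabic_reshaper", reshaper_result),
]:
    print(label)
    for code, name in describe(text):
        print("  ", code, name)
```

The libraqm result has two code points (a letter form plus a ligature), while the reshaper result has three separate letter forms.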

But when I use `arabic_reshaper.reshape(text)` I get:
![image](https://user-images.githubusercontent.com/23061228/195609972-a5078ea8-dfe9-46b2-8741-edc2bb4ac502.png)
NOTE: this image was created with PIL using the basic layout. That means PIL renders the text letter by letter from left to right and relies on arabic_reshaper for the shaping, which fails in this case.
The text I get from the reshaper is b'\\ufee3\\ufef4\\ufee6' instead of the expected b'\\ufee3\\uFC94'.

The source of the problem: the ligature regex is applied and the first match is '\\u0645\\u064A' (ARABIC LIGATURE MEEM WITH YEH), which has only an isolated form, so you correctly skip it (`continue`) here:
https://github.com/mpcabd/python-arabic-reshaper/blob/master/arabic_reshaper/arabic_reshaper.py#L220
But the subsequent ligature "\\u064a\\u0646" overlaps with the previous match "\\u0645\\u064a", so the finditer function never returns it and it is never applied.
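The behavior can be reproduced without the library. The sketch below uses a simplified two-alternative pattern as a stand-in for the real ligature regex (the actual `_ligatures_re` has many more alternatives; `MEEM_YEH`, `YEH_NOON` and `pattern` are names I made up for the demonstration):

```python
import re

# Simplified stand-in for the library's ligature regex: only two alternatives.
MEEM_YEH = "\u0645\u064a"   # has only an isolated form, so it gets skipped
YEH_NOON = "\u064a\u0646"   # has a final form, so it should be applied
pattern = re.compile("({})|({})".format(MEEM_YEH, YEH_NOON))

text = "\u0645\u064a\u0646"

# finditer yields non-overlapping matches: after matching (0, 2) it resumes at
# index 2, so the YEH+NOON candidate starting at index 1 is never reported.
spans = [m.span() for m in pattern.finditer(text)]
print(spans)

# Searching again from index 1 does find the overlapping candidate.
retry = pattern.search(text, 1)
print(retry.span())
```

Here `spans` contains only `(0, 2)`, while the second search returns the span `(1, 3)` that the reshaper needs.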

I have a simple fix:
```
diff --git a/arabic_reshaper/arabic_reshaper.py b/arabic_reshaper/arabic_reshaper.py
index 4721a6a..a94cd0f 100644
--- a/arabic_reshaper/arabic_reshaper.py
+++ b/arabic_reshaper/arabic_reshaper.py
@@ -186,14 +186,17 @@ class ArabicReshaper(object):
             if delete_tatweel:
                 text = text.replace(TATWEEL, '')
 
-            for match in re.finditer(self._ligatures_re, text):
+            regex_start = 0
+            matchIt = re.finditer(self._ligatures_re, text)
+            match = next(matchIt, None)
+            while match:
                 group_index = next((
                     i for i, group in enumerate(match.groups()) if group
                 ), -1)
                 forms = self._get_ligature_forms_from_re_group_index(
                     group_index
                 )
-                a, b = match.span()
+                a, b = tuple(i+regex_start for i in match.span())
                 a_form = output[a][FORM]
                 b_form = output[b - 1][FORM]
                 ligature_form = None
@@ -218,9 +221,13 @@ class ArabicReshaper(object):
                     else:
                         ligature_form = MEDIAL
                 if not forms[ligature_form]:
+                    regex_start = a+1
+                    matchIt = re.finditer(self._ligatures_re, text[regex_start:])
+                    match = next(matchIt, None)
                     continue
                 output[a] = (forms[ligature_form], NOT_SUPPORTED)
                 output[a+1:b] = repeat(('', NOT_SUPPORTED), b - 1 - a)
+                match = next(matchIt, None)
 
         result = []
         if not delete_harakat and -1 in positions_harakat:
```
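The same idea can also be written without slicing the text or tracking an offset, by calling `pattern.search(text, pos)` in a loop. This is only a sketch of the scanning strategy, not a drop-in patch: `scan_with_retry` and the `accept` callback are hypothetical names, and the two-alternative pattern is again a simplified stand-in for the real ligature regex.

```python
import re

def scan_with_retry(pattern, text, accept):
    """Return spans of accepted matches.

    A rejected match consumes only one character, so a candidate that
    overlaps with it can still be found on the next search.
    """
    accepted = []
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            break
        start, end = match.span()
        if accept(match):
            accepted.append((start, end))
            # Continue after the match; guard against zero-width matches.
            pos = max(end, start + 1)
        else:
            # Skipped ligature: retry from the next character.
            pos = start + 1
    return accepted

# Simplified stand-in pattern: group 1 is MEEM+YEH, group 2 is YEH+NOON.
demo_pattern = re.compile("(\u0645\u064a)|(\u064a\u0646)")

# Pretend that group 1 has no usable form in this context and must be skipped.
result = scan_with_retry(demo_pattern, "\u0645\u064a\u0646", lambda m: m.lastindex != 1)
print(result)
```

With the rejected MEEM+YEH match, the loop resumes at index 1 and accepts the YEH+NOON span `(1, 3)`. Because the positions come straight from `match.span()`, no `regex_start` correction is needed.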

