Description
When I render the text "مين" (b'\u0645\u064a\u0646') using PIL with its default libraqm layout, where the text is shaped by the libraqm library, I get:

The text was transformed into b'\ufee3\uFC94'. That is the expected behavior: "\u064a\u0646" was replaced by the ligature "\uFC94", and the leading b'\u0645' was converted to its initial form b'\ufee3'.
Note: in b'\u0645\u064a\u0646' all letters are in their unshaped form.
But when I use arabic_reshaper(text) I get:

NOTE: the image was created using PIL with basic layout. This means PIL renders the text letter by letter from left to right, so whatever arabic_reshaper produces is drawn as-is, and in this case its output is wrong.
The text I get from the reshaper is b'\ufee3\ufef4\ufee6' instead of the expected b'\ufee3\uFC94'.
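To make the difference between the two outputs easier to read, here is a small sketch that prints the Unicode name of each code point in the input, the expected output, and the actual output. It uses only the standard library; the helper name `describe` and the `samples` labels are my own and are not part of arabic_reshaper.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        'U+%04X %s' % (ord(ch), unicodedata.name(ch, '<unnamed>'))
        for ch in text
    ]


# The three strings quoted in this report.
samples = {
    'input': '\u0645\u064a\u0646',
    'expected (libraqm)': '\ufee3\ufc94',
    'actual (arabic_reshaper)': '\ufee3\ufef4\ufee6',
}

for label, value in samples.items():
    print(label)
    for line in describe(value):
        print('   ', line)
```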
The source of the problem: the ligature regex is applied and the first match is '\u0645\u064A' (ARABIC LIGATURE MEEM WITH YEH), which has only an isolated form, so you correctly skip it (continue) here:
https://github.com/mpcabd/python-arabic-reshaper/blob/master/arabic_reshaper/arabic_reshaper.py#L220
But because the subsequent ligature "\u064a\u0646" overlaps the previous match "\u0645\u064a", it is not returned by the finditer function and is therefore never applied.
I have a simple fix:
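The non-overlapping behavior of `re.finditer` can be reproduced in isolation. The sketch below uses a two-alternative pattern that I made up for illustration; the real `_ligatures_re` in the library is much larger. A lookahead variant is included only to show that the second candidate does exist in the text.

```python
import re

# Hypothetical reduced pattern: MEEM+YEH or YEH+NOON.
LIGATURES_RE = re.compile('(\u0645\u064A)|(\u064A\u0646)')
text = '\u0645\u064A\u0646'

# finditer consumes the first match, so the overlapping one is never seen.
consumed = [m.span() for m in LIGATURES_RE.finditer(text)]
print(consumed)  # [(0, 2)]

# A zero-width lookahead reports every start position, overlaps included.
LOOKAHEAD_RE = re.compile('(?=(\u0645\u064A|\u064A\u0646))')
candidates = [m.span(1) for m in LOOKAHEAD_RE.finditer(text)]
print(candidates)  # [(0, 2), (1, 3)]
```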
diff --git a/arabic_reshaper/arabic_reshaper.py b/arabic_reshaper/arabic_reshaper.py
index 4721a6a..a94cd0f 100644
--- a/arabic_reshaper/arabic_reshaper.py
+++ b/arabic_reshaper/arabic_reshaper.py
@@ -186,14 +186,17 @@ class ArabicReshaper(object):
             if delete_tatweel:
                 text = text.replace(TATWEEL, '')
 
-            for match in re.finditer(self._ligatures_re, text):
+            regex_start = 0
+            matchIt = re.finditer(self._ligatures_re, text)
+            match = next(matchIt, None)
+            while match:
                 group_index = next((
                     i for i, group in enumerate(match.groups()) if group
                 ), -1)
                 forms = self._get_ligature_forms_from_re_group_index(
                     group_index
                 )
-                a, b = match.span()
+                a, b = tuple(i+regex_start for i in match.span())
                 a_form = output[a][FORM]
                 b_form = output[b - 1][FORM]
                 ligature_form = None
@@ -218,9 +221,13 @@ class ArabicReshaper(object):
                     else:
                         ligature_form = MEDIAL
                 if not forms[ligature_form]:
+                    regex_start = a+1
+                    matchIt = re.finditer(self._ligatures_re, text[regex_start:])
+                    match = next(matchIt, None)
                     continue
                 output[a] = (forms[ligature_form], NOT_SUPPORTED)
                 output[a+1:b] = repeat(('', NOT_SUPPORTED), b - 1 - a)
+                match = next(matchIt, None)
 
         result = []
         if not delete_harakat and -1 in positions_harakat:
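The control flow of the patch can also be tried outside the library. This is a minimal standalone sketch, not the library's code: the `LIGATURES` table and the `apply_ligatures` function are hypothetical, each entry holds a single replacement (or `None` when the needed form is missing), and no letter shaping is done, so MEEM stays unshaped in the result. It also assumes `Pattern.search(text, pos)` as an alternative to slicing the text, which avoids the offset bookkeeping done with `regex_start`.

```python
import re

# Hypothetical, simplified table: source sequence -> replacement glyph,
# or None when the ligature has no glyph in the form required here.
LIGATURES = {
    '\u0645\u064A': None,       # MEEM WITH YEH: no usable form in this position
    '\u064A\u0646': '\uFC94',   # YEH WITH NOON
}
PATTERN = re.compile('|'.join(re.escape(seq) for seq in LIGATURES))


def apply_ligatures(text):
    """Replace ligature sequences, rescanning after a skipped match."""
    out = list(text)
    pos = 0
    while True:
        match = PATTERN.search(text, pos)
        if match is None:
            break
        a, b = match.span()
        replacement = LIGATURES[match.group()]
        if replacement is None:
            # Skipped match: resume one character later so that an
            # overlapping ligature can still be found.
            pos = a + 1
            continue
        out[a] = replacement
        out[a + 1:b] = [''] * (b - 1 - a)
        pos = b
    return ''.join(out)


print(apply_ligatures('\u0645\u064A\u0646') == '\u0645\uFC94')  # True
```

Resuming at `a + 1` only after a skipped match keeps the behavior for applied ligatures the same as with a plain `finditer` loop.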