Commit 8292c71

[BugFix] Deal with greek letter "sigma" when return offset_mapping (#2897)
* deal with greek letter sigma
* update comments
1 parent 912e027 commit 8292c71

1 file changed: +8 -1 lines changed

paddlenlp/transformers/tokenizer_utils.py

Lines changed: 8 additions & 1 deletion
@@ -1363,7 +1363,14 @@ def get_offset_mapping(self, text):
             if token in self.all_special_tokens:
                 token = token.lower() if hasattr(
                     self, "do_lower_case") and self.do_lower_case else token
-            start = text[offset:].index(token) + offset
+            # The greek letter "sigma" has 2 forms of lowercase, σ and ς respectively.
+            # When used as a final letter of a word, the final form (ς) is used. Otherwise, the form (σ) is used.
+            # https://latin.stackexchange.com/questions/6168/how-and-when-did-we-get-two-forms-of-sigma
+            if "σ" in token or "ς" in token:
+                start = text[offset:].replace("ς", "σ").index(
+                    token.replace("ς", "σ")) + offset
+            else:
+                start = text[offset:].index(token) + offset
 
             end = start + len(token)
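
One way the mismatch addressed by this change can arise: Python's `str.lower()` applies the Unicode final-sigma rule, so lowercasing a whole word and lowercasing an isolated capital sigma can produce different characters. The snippet below is a minimal standalone sketch of that situation. `find_token_start` is a hypothetical helper that mirrors the patched branch (it is not part of PaddleNLP), and the sample word is made up for illustration.

```python
def find_token_start(text: str, token: str, offset: int = 0) -> int:
    """Hypothetical helper: locate `token` in `text` at or after `offset`,
    treating the two lowercase sigma forms (σ and ς) as the same character."""
    haystack = text[offset:]
    if "σ" in token or "ς" in token:
        # Each form is a single code point, so the replacement keeps every
        # character position unchanged and the returned index stays valid.
        haystack = haystack.replace("ς", "σ")
        token = token.replace("ς", "σ")
    return haystack.index(token) + offset


# Lowercasing the whole word turns the trailing capital sigma into the final form.
text = "ΟΔΟΣ".lower()
# Lowercasing a capital sigma on its own gives the non-final form.
token = "Σ".lower()

# The plain lookup used before the patch cannot find the token.
try:
    text.index(token)
except ValueError:
    plain_lookup_failed = True
else:
    plain_lookup_failed = False

# The normalized lookup finds it and yields a usable (start, end) offset pair.
start = find_token_start(text, token)
end = start + len(token)
```

Normalizing both sides only for the search, and not rewriting `text` itself, means the offsets still point into the original string, so `text[start:end]` returns the character as it actually appears there.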
