can't find tokenizer rule that keeps "A.B.C." together but splits "a.b.c." into "a.b.c" and "." #13732
jefhil started this conversation in Language Support
Replies: 1 comment
-
With a blank pipeline (`from spacy import blank`):
"A.B.C." -> [('TOKEN', 'A.B.C.')]
"a.b.c." -> [('TOKEN', 'a.b.c'), ('SUFFIX', '.')]
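That output format matches spaCy's `tokenizer.explain`, so a minimal reproduction sketch might look like this (assuming spaCy v3 and a blank English pipeline):

```python
from spacy import blank

nlp = blank("en")

# tokenizer.explain() reports which tokenizer rule produced each piece
print(nlp.tokenizer.explain("A.B.C."))  # [('TOKEN', 'A.B.C.')]
print(nlp.tokenizer.explain("a.b.c."))  # [('TOKEN', 'a.b.c'), ('SUFFIX', '.')]
```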
-
I want to treat the upper- and lower-case forms of "a.b.c." the same, but I can't figure out where the rule that splits or keeps the trailing period is located.
TIA
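A sketch of where such a rule typically lives and how it can be adjusted, assuming spaCy v3 defaults: the suffix patterns defined in `spacy.lang.punctuation` are exposed as `nlp.Defaults.suffixes`, and `compile_suffix_regex` and `add_special_case` are the documented customization points.

```python
from spacy import blank
from spacy.symbols import ORTH
from spacy.util import compile_suffix_regex

nlp = blank("en")

# 1) List the default suffix patterns that involve a period. One of them
#    splits a trailing "." after a lowercase letter, while uppercase
#    abbreviations like "A.B.C." are left intact.
period_suffixes = [p for p in nlp.Defaults.suffixes if "\\." in p]
for pattern in period_suffixes:
    print(pattern)

# 2a) Per-string fix: register a special case so this exact lowercase form
#     is always kept as a single token.
nlp.tokenizer.add_special_case("a.b.c.", [{ORTH: "a.b.c."}])

# 2b) Rule-level fix: edit the suffix list and recompile the suffix regex.
#     Dropping every period-splitting pattern is too blunt for real text
#     (sentence-final periods would stay attached); in practice, remove or
#     rewrite only the pattern identified in step 1.
suffixes = [p for p in nlp.Defaults.suffixes if p not in period_suffixes]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

print(nlp.tokenizer.explain("a.b.c."))  # "a.b.c." should now stay together
```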