tokenizer_exceptions problem with Persian #1772
mosynaq started this conversation in Language Support
Replies: 1 comment

I am trying to train a Persian model with spaCy. One of the problems is in tokenizer_exceptions.py: spaCy expects the concatenation of the orths to form the word itself, like do + n't = don't, but for Persian this expectation is not valid in some cases. For example, the verb "بر نخواهد گشت" (= s/he will not return) is made up of "بر" + "نـ" + "خواهد گشت".
("نـ" negates a Persian verb. Most of the time the negation prefix comes at the beginning, but in some cases, like this one, it comes in between.)
As you can see, you cannot simply concatenate the orths to form the full form; a sketch of the constraint follows below.
Should spaCy's expectation be changed? Or should I handle this some other way?
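For readers unfamiliar with the mechanism, here is a minimal sketch of the contract the question describes, using spaCy's public `Tokenizer.add_special_case` API (the same rules that `tokenizer_exceptions.py` entries are loaded into). It assumes spaCy v2+; the second, failing call is a hypothetical entry added purely for illustration.

```python
# Minimal sketch of the tokenizer-exception contract, assuming spaCy v2+.
import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.blank("en")

# Accepted: "do" + "n't" concatenates back to the key "don't" exactly.
nlp.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}])
print([t.text for t in nlp("don't")])  # ['do', "n't"]

# Rejected (hypothetical entry, for illustration): the pieces do not
# concatenate back to the key, so recent spaCy versions refuse the rule.
# This is exactly the Persian situation above: "بر" + "نـ" + "خواهد گشت"
# is not a character-for-character concatenation of the surface form
# "بر نخواهد گشت".
try:
    nlp.tokenizer.add_special_case("don't", [{ORTH: "do"}, {ORTH: "not"}])
except ValueError as err:
    print("rejected:", err)
```

Because exceptions can only split a string into substrings that concatenate back to it exactly, a negation infix like the one in the Persian example cannot be expressed through this mechanism.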
Reply:
It's true that we don't really have a good solution to this for Persian, Arabic, and other similar languages. I'm still not positive what the best strategy should be.