Trainable Tokenizer #2220
mosynaq
started this conversation in
New Features & Project Ideas
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Hi there! I am trying to train Persian models for spaCy and have done a lot so far. But the tokenization part suffers some deficiencies: It cannot recognize clitics and split them and since those clitics connect to a large number of words, this phenomenon cannot be described using RegEx. Conllu tree banks provide a good deal of information about tokens and their boundaries. See below for a good example
The above sample in picture:

The sentence means literally : "His/her-face object-marker-token I-saw." meaning "I saw his/her face.".
In the first aolumn you can see "1-2" with the token and after that you can see the token properly tokenized. I believe this tree and others are a great source of information for a trainable tokenizer.
Questions
p.s.
Your Environment
Beta Was this translation helpful? Give feedback.
All reactions