Tamil models on spaCy #10412
Unanswered
koaning
asked this question in
Help: Coding & Implementations
Replies: 1 comment 5 replies
I'm starting a thread here so that the discussion can move from Twitter to GitHub.
It seems that the goal is to train a Tamil model so that others may use it. I do not speak Tamil, but since spaCy does have a tokenizer for it (see the language guide), I'd argue that nothing is stopping anyone from training their own model. One might even go as far as pre-populating some vectors: I cannot gauge their quality, but both fastText and BPEmb seem to support Tamil.
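For anyone who wants to try the tokenizer before thinking about training, here is a minimal sketch. It assumes spaCy v3 is installed and that Tamil is available under the language code `ta`; the helper names `build_blank_tamil` and `tokenize` and the sample sentence are only illustrative, not part of spaCy.

```python
def build_blank_tamil():
    """Return a blank spaCy pipeline for Tamil (tokenizer only, no trained components)."""
    import spacy  # imported here so spaCy is only required when the function is called

    # Assumption: "ta" is the language code spaCy uses for Tamil.
    return spacy.blank("ta")


def tokenize(nlp, text):
    """Run the pipeline's tokenizer on text and return the token strings."""
    return [token.text for token in nlp(text)]


if __name__ == "__main__":
    nlp = build_blank_tamil()
    # Illustrative sentence; any Tamil text will do.
    print(tokenize(nlp, "நான் தமிழ் படிக்கிறேன்."))
```

A blank pipeline like this only splits text into tokens. If you want to pre-populate vectors, `python -m spacy init vectors ta <vectors_file> <output_dir>` should be the place to start, where the vectors file would be something you download yourself (for example from fastText); I have not checked the quality of any Tamil vectors.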
The goal is for this to be a place to discuss how to train a Tamil model. I personally think it’s better to worry about having a representative dataset first. It's very hard to make a meaningful general model, so I'm inclined to first focus on a specific corpus that might yield a model that can be re-used. This is a personal gut feeling, though, and I do not speak Tamil, so I'll gladly hear it if other folks have suggestions.