Turkish lang support beta #6331
Replies: 2 comments
-
Hi @DuyguA, the last time I tried to train a Turkish model was for v2, here: #3056 (comment). The missing part was the tag map; after I completed it, I could train successfully, but accuracy was low. After I added the cc.tr.300.vec.gz vectors to the training pipeline (with --prune-vectors 50000), I got the following result: Time 1.96 s
Now I have tried v3, changing UD_English-EWT to UD_Turkish-IMST here (in project.yml): TOK 99.89
With the IMST data set, Stanza shows the following result:
My understanding is that to improve accuracy I should use pretraining (https://nightly.spacy.io/usage/embeddings-transformers#pretraining) with plain text and add vectors to the training pipeline. The tag map is still missing in v3; I guess the one I prepared can be used? https://github.com/explosion/spaCy/tree/v3.0.0rc2/spacy/lang/tr
Do you have any recommendation other than the fastText cc.tr.300.vec.gz vectors to improve accuracy?
One other thing about Turkish support in spaCy: as far as I know, the spaCy team is not willing to share a model whose training data has a restrictive license.
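For readers unfamiliar with the vector file mentioned above, here is a minimal sketch of what limiting a fastText `.vec` file to its first N entries looks like. It assumes the usual text layout (a header line with word count and dimension, then one word per line followed by its values) and that rows are ordered by descending frequency. The helper name `read_top_vectors` and the sample data are made up for illustration. This only truncates the table; spaCy's own pruning additionally remaps the dropped words to their closest remaining vector, which is not shown here.

```python
import io
from typing import Dict, List, TextIO


def read_top_vectors(handle: TextIO, keep: int) -> Dict[str, List[float]]:
    """Read at most `keep` vectors from a fastText-style .vec text stream."""
    # Header line: "<number of words> <vector dimension>".
    _, dim = handle.readline().split()
    width = int(dim)
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if len(vectors) >= keep:
            break
        parts = line.split()
        # Skip blank or malformed rows instead of failing on them.
        if len(parts) != width + 1:
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Tiny made-up sample; the real file has 300 values per word.
SAMPLE = "3 2\nve 0.1 0.2\nbir 0.3 0.4\nbu 0.5 0.6\n"
top = read_top_vectors(io.StringIO(SAMPLE), keep=2)
```

For the real file, the stream could be opened with `gzip.open("cc.tr.300.vec.gz", "rt", encoding="utf-8")` and passed in with `keep=50000`.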
-
Hello @mehmetilker, yes, unfortunately we cannot use Turkish-IMST due to licence issues, so I'll look for another treebank. The tag map is not actually missing: since IMST is not usable, we left it as the standard UD tag map for now. When we come up with a commercially licensed treebank, we can use its tag map 😉
I indeed haven't seen any base work for Turkish 😊 The task is basically to port an external morphological analyzer that produces a lemma and a morphological analysis; the lookup should be done by POS.
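To make the POS-keyed lookup concrete, here is a minimal sketch, not the actual spaCy implementation. The table `ANALYSES` stands in for the output of an external morphological analyzer; its entries, the feature strings, and the function name `lemmatize` are hand-written assumptions for illustration only.

```python
from typing import Dict, Tuple

# Hypothetical analyzer output: (surface form, UD POS) -> (lemma, features).
# The feature strings are illustrative, not taken from any treebank.
ANALYSES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("gelir", "NOUN"): ("gelir", "Case=Nom|Number=Sing"),  # "income"
    ("gelir", "VERB"): ("gel", "Number=Sing|Person=3"),    # "(he/she) comes"
    ("evler", "NOUN"): ("ev", "Case=Nom|Number=Plur"),     # "houses"
}


def lemmatize(form: str, pos: str) -> str:
    """Return the lemma for `form` given its POS tag.

    Falls back to the surface form when the analyzer has no entry.
    """
    entry = ANALYSES.get((form, pos))
    return entry[0] if entry is not None else form


noun_lemma = lemmatize("gelir", "NOUN")
verb_lemma = lemmatize("gelir", "VERB")
```

The same surface form "gelir" resolves to different lemmas depending on the tag, which is why the lookup key includes the POS rather than the form alone.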
-
Feature description
This is not a real issue, but rather an update for everyone on the progress of Turkish language support 😊
Here is the current status:
Other than the lemmatizer issue, I don't see a good reason why we shouldn't have Turkish statistical models 😊 The to-do list is as follows:
Please add anything I forgot and feel free to comment. Looking forward to getting Turkish support together 🎊 ❤️