Turkish lang support beta #6331

DuyguA · 2020-11-02T23:19:53Z

Feature description

This is not a real issue, but rather an update for everyone on the progress of Turkish language support 😊

Here is the current status:

stopwords: in good shape; I'll do one more review before the beta nevertheless
tokenizer exceptions: in good shape
prefix, infix, suffix tokens: not applicable here
norm exceptions for v2: in acceptable shape, but more would be better
lexical attributes: in good shape
syntax iterators: implemented recently and in good shape
tag map: default tag map, so ready to go
morph rules for v2: in good shape
unit tests: in good shape; I added as many as I could, though more contributions would be welcome
lemmatizer: not in a usable shape. Finding the lemma requires a full morphological analysis, so lemmatization via a lookup table is not really suitable for Turkish.
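
To illustrate the lemmatizer point, here is a minimal Python sketch of a lookup-table lemmatizer. The table entries and word forms are hand-picked examples for this note (all forms of "ev", house), not data from spaCy's Turkish language files.

```python
# Toy lookup-table lemmatizer. The entries are illustrative only and are
# not taken from spaCy's Turkish language data.
LOOKUP = {
    "evler": "ev",  # houses
    "evde": "ev",   # in the house
    "evden": "ev",  # from the house
}


def lookup_lemma(token: str) -> str:
    """Return the lemma from the table, or the token itself on a miss."""
    return LOOKUP.get(token, token)


# Every item below is an inflected form of "ev". Suffixes stack, so a table
# would need one entry per combination of plural, possessive and case.
forms = ["evler", "evde", "evlerimiz", "evlerimizden", "evlerimizdeki"]

# Forms the table cannot resolve come back unchanged.
misses = [form for form in forms if lookup_lemma(form) == form]
print(misses)
```

The three longer forms fall through the table even though they share the same stem, which is why a morphological analyzer is the more natural fit here.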

Other than the lemmatizer issue, I don't see a good reason why we shouldn't have Turkish statistical models 😊 The to-do list is as follows:

Port a proper morphological analyzer
Prepare language model packages and share them

Please add anything that I forgot and feel free to comment. Looking forward to getting Turkish support done together 🎊 ❤️

mehmetilker · 2020-11-04T07:35:41Z

Hi @DuyguA,

I last tried to train a Turkish model for v2 here: #3056 (comment)

The missing part was the tag map; after I completed the necessary parts, I could train successfully.
https://github.com/mehmetilker/spacy-tr/blob/master/tr/tag_map.py

But accuracy was low. After I added the cc.tr.300.vec.gz vectors to the training pipeline (with --prune-vectors 50000), I got the following result:
================================== Results ==================================

Time 1.96 s
Words 10029
Words/s 5124
TOK 100.00
POS 92.97
UAS 66.82
LAS 56.38

Now I have tried v3, changing the treebank from UD_English-EWT to UD_Turkish-IMST in the project.yml of this project:
https://github.com/explosion/projects/tree/v3/pipelines/tagger_parser_ud
And I got the following result:
================================== Results ==================================

TOK 99.89
TAG 88.17
POS 89.42
MORPH 81.38
UAS 61.09
LAS 49.76
SENT P 95.21
SENT R 97.15
SENT F 96.17
SPEED 3174

On the IMST data set, Stanza reports the following results:
https://stanfordnlp.github.io/stanza/performance.html

Tokens | Sentences | Words | UPOS | XPOS | UFeats | AllTags | Lemmas | UAS | LAS | CLAS | MLAS | BLEX
99.89  | 97.62    | 98.07  | 94.21 | 93.43 | 92.08 | 90.27    | 94.92 | 70.78 | 64.5 | 61.62 | 56.04 | 59.6
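
For a quick side-by-side, the following Python sketch computes the gap between the v3 run and the Stanza figures quoted above. The pairing of metric names (POS with UPOS, MORPH with UFeats) is an assumption made here; the two tools may not score these in exactly the same way, so treat the differences as rough.

```python
# Scores copied from the two result listings above (UD Turkish-IMST).
spacy_v3 = {"POS": 89.42, "MORPH": 81.38, "UAS": 61.09, "LAS": 49.76}
stanza = {"UPOS": 94.21, "UFeats": 92.08, "UAS": 70.78, "LAS": 64.5}

# Assumed pairing of metric names: (spaCy name, Stanza name).
pairs = [("POS", "UPOS"), ("MORPH", "UFeats"), ("UAS", "UAS"), ("LAS", "LAS")]

# Gap in percentage points, rounded to two decimals.
gaps = {s_name: round(stanza[t_name] - spacy_v3[s_name], 2) for s_name, t_name in pairs}

for name, gap in gaps.items():
    print(f"{name}: Stanza is ahead by {gap:.2f} points")
```

On these numbers the largest gap is in LAS, which suggests the parser has the most to gain from pretraining or better vectors.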

My understanding is that, to improve accuracy, I should run pretraining (https://nightly.spacy.io/usage/embeddings-transformers#pretraining) on plain text and add vectors to the training pipeline.

The tag map is still missing in v3; I guess the one I prepared could be used? https://github.com/explosion/spaCy/tree/v3.0.0rc2/spacy/lang/tr
(Additional note: when I tried the trained model, I could see POS and tag details, yet there is no tag map in the repository.)
I will prepare some data for NER training as well.
I do not know how much text pretraining needs or what shape it should have (split by sentence or by paragraph?), but I will run some experiments.

Do you have any recommendations other than fastText cc.tr.300.vec.gz for vectors to improve accuracy?
And is there any base work I can look at for "Port a proper morphological analyzer"?

Another point about Turkish support in spaCy: as far as I know, the spaCy team is not willing to share a model trained on data with a restrictive license.
And the IMST license is listed as "v1.3 License: CC BY-NC-SA" (https://github.com/UniversalDependencies/UD_Turkish-IMST)

DuyguA · 2020-11-14T19:30:20Z

Hello @mehmetilker,

Yes, unfortunately we cannot use Turkish-IMST due to licence issues, so I'll look for another treebank.

The tag map is not actually missing; since IMST is not usable, we left it as the standard UD tag map for now. Once we find a treebank whose licence allows commercial use, we can use its tag map 😉

> And is there a base work I can look at for "Port a proper morphological analyzer" ?

I haven't actually seen any base work for Turkish 😊 The task is basically to port an external morphological analyzer that produces a lemma and a morphological analysis for each word; the lookup should then be done by POS.
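
As a rough illustration of "lookup by POS", here is a minimal Python sketch. The `FAKE_ANALYZER` table, the `Analysis` type and the `lemmatize` helper are all hypothetical names invented for this example, and the morphological feature strings are illustrative; a real port would call the external analyzer instead of reading a hand-written table.

```python
from typing import Dict, List, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str
    morph: str  # illustrative feature string, not a verified UD annotation


# Hypothetical stand-in for an external morphological analyzer: each surface
# form maps to all of its candidate analyses.
FAKE_ANALYZER: Dict[str, List[Analysis]] = {
    "gelir": [
        Analysis("gelir", "NOUN", "Case=Nom|Number=Sing"),  # "income"
        Analysis("gel", "VERB", "Number=Sing|Person=3"),    # "(he/she) comes"
    ],
    "yazar": [
        Analysis("yazar", "NOUN", "Case=Nom|Number=Sing"),  # "author"
        Analysis("yaz", "VERB", "Number=Sing|Person=3"),    # "(he/she) writes"
    ],
}


def lemmatize(token: str, pos: str) -> str:
    """Pick the lemma of the first analysis whose POS matches the tagger's POS.

    Falls back to the surface form when no analysis matches.
    """
    for analysis in FAKE_ANALYZER.get(token, []):
        if analysis.pos == pos:
            return analysis.lemma
    return token


print(lemmatize("gelir", "NOUN"))
print(lemmatize("gelir", "VERB"))
```

The same surface form gets a different lemma depending on the predicted POS, which is the disambiguation step a plain lookup table cannot do.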
