Evaluation of German model on Universal Dependencies or Tiger #11134
Hi, Explosion! Recently you added a new German model
In case you are interested in reproducing, I composed a little spaCy project to reproduce the evaluation.
My teammate says that neither option is really wrong; they just use different approaches to lemmatization. Maybe it is a known problem that you've already encountered. So, here are my questions:
Replies: 1 comment 1 reply
The TIGER dependency conversion is in an older CoNLL-2009 format that predates UD (https://ufal.mff.cuni.cz/conll2009-st/task-description.html). It won't contain UD POS tags, and the morphological features are also in an older format, but if you adjust the token IDs and rearrange the columns to correspond to CoNLL-U, then you should be able to convert it with spacy convert.
I think the CoNLL-2009 columns are: ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL FILLPRED PRED APREDs
And CoNLL-U has: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
You should convert POS->XPOS, and I think you'll still need some valid UPOS tag in the UPOS column, so just for testing you could initially use this table: https://universaldependencies.org/tagset-conversion/de-stts-uposf.html
Be aware that there are some distinctions that this table doesn't quite get right when an STTS tag could map to more than one UPOS tag.
In terms of the lemmas, the statistical lemmatizer definitely makes some mistakes (see #10953 for more examples), but the cases you mention are mostly related to conventions in the training corpora. The UD corpora are mostly conversions of existing corpora that were developed by different research groups. Although there are obviously shared guidelines for UPOS and UD dependency annotation, I don't think they've tried to develop universal annotation guidelines for lemmas. As a result, there's a lot of variation between corpora. German corpora differ in how they handle punctuation, pronouns, articles, separable prefix verbs, sentence-initial capitalization, etc. TIGER uses …
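The column rearrangement described above could be sketched roughly like this. It is only an illustration under a few assumptions that are not from the thread: that the TIGER token IDs look like `<sentence>_<token>` (for example `1_1`), that the old-style FEAT values are simply dropped instead of being converted to UD features, and that unknown STTS tags fall back to `X`. The function names and the deliberately partial STTS-to-UPOS table are hypothetical; for anything beyond a quick test, the full table linked above would be needed.

```python
from typing import Iterable, Iterator

# CoNLL-2009 column names in file order (any trailing APRED columns are ignored here).
CONLL09_COLS = [
    "ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED",
]

# Deliberately partial STTS -> UPOS table, for illustration only.
# Tags that can map to more than one UPOS tag are not handled correctly.
STTS_TO_UPOS = {
    "NN": "NOUN", "NE": "PROPN", "ART": "DET", "ADJA": "ADJ", "ADJD": "ADJ",
    "ADV": "ADV", "APPR": "ADP", "CARD": "NUM", "KON": "CCONJ", "PPER": "PRON",
    "VVFIN": "VERB", "VAFIN": "AUX", "$.": "PUNCT", "$,": "PUNCT", "$(": "PUNCT",
}


def conll09_to_conllu_line(line: str) -> str:
    """Rearrange one CoNLL-2009 token line into the ten CoNLL-U columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError(f"expected at least 11 tab-separated columns, got {len(cols)}")
    row = dict(zip(CONLL09_COLS, cols))
    # Assumption: IDs like "1_1" (sentence_token); keep only the token part.
    token_id = row["ID"].rsplit("_", 1)[-1]
    xpos = row["POS"]                    # the STTS tag becomes XPOS
    upos = STTS_TO_UPOS.get(xpos, "X")   # fallback for unmapped tags
    return "\t".join([
        token_id, row["FORM"], row["LEMMA"], upos, xpos,
        "_",                             # FEATS: old-style features dropped
        row["HEAD"], row["DEPREL"],      # head and label passed through unchanged
        "_", "_",                        # DEPS, MISC
    ])


def convert(lines: Iterable[str]) -> Iterator[str]:
    """Convert token lines, keeping blank lines as sentence separators."""
    for line in lines:
        yield conll09_to_conllu_line(line) if line.strip() else ""
```

Writing the converted lines to a file with a `.conllu` extension and then running something like `python -m spacy convert tiger.conllu ./corpus --converter conllu` should produce a `.spacy` file for evaluation; the file and directory names here are placeholders.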