Edit Tree Lemmatizer: Some Questions #9758
-
Thanks for trying the edit tree lemmatizer!
If longer prefixes are relevant, you may want to try something other than …
You can use any of the string-valued attributes of Token as the backoff, such as …. Another possibility is to set the backoff to None and add a lookup lemmatizer after the edit tree lemmatizer, so that tokens for which no edit tree applies fall back to the lookup table (see the sketch below).
For regular forms, the knowledge should be transferred to OOV forms. How well this works depends on the complexity of the morphology of a language.
Interesting! Do you freeze the tok2vec layer? Did you try to train the models from scratch?
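For reference, a minimal sketch of the setup suggested above (no backoff plus a lookup fallback). This is illustrative, not the exact configuration from this thread; it assumes spaCy v3.3+ and that a lemma lookup table for the language is available (e.g. via the spacy-lookups-data package) when the pipeline is initialized:

```python
import spacy

# Illustrative sketch: edit tree lemmatizer with no backoff, followed by a
# lookup lemmatizer that only fills in lemmas for tokens the edit tree
# lemmatizer left unset. The edit tree component still needs to be trained.
nlp = spacy.blank("grc")
nlp.add_pipe(
    "trainable_lemmatizer",
    config={"backoff": None, "min_tree_freq": 3, "top_k": 1},
)
# mode="lookup" requires a lemma lookup table at initialization time;
# overwrite=False keeps the edit tree predictions and only fills the gaps.
nlp.add_pipe("lemmatizer", config={"mode": "lookup", "overwrite": False})

print(nlp.pipe_names)
```

In a training config this roughly corresponds to `backoff = null` for the trainable_lemmatizer component with the lookup lemmatizer added after it; since the lookup component is not trained, it can also be appended to an already trained pipeline.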
-
I trained a model with transformers and edit_tree with backoff set to null, and added the spaCy lemmatizer configured to use the lookup table. With this setup, lemmatization accuracy improved from 88 to 95, so it is an improvement over the lookup table alone. Interestingly, edit_tree with backoff set to orth stays around 88 even with a transformer, and there is no improvement if I do not use the lookup table. I imagine this has to do with the fact that grc is a highly inflected language, so orth is unlikely to be the lemma. Another interesting thing is that I get better results with a pretrained tok2vec layer and fastText vectors: in that case, lemmatization accuracy improved from 88 to 96.
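As an aside, here is a rough sketch of how a lemma accuracy figure like the ones above can be computed against a held-out corpus in .spacy format; the paths are placeholders, not the ones used in these experiments:

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# Placeholder paths: a trained pipeline and a dev corpus in .spacy format.
nlp = spacy.load("./training/model-best")
doc_bin = DocBin().from_disk("./corpus/dev.spacy")

# Pair untouched copies of the texts with the gold-annotated docs.
examples = [
    Example(nlp.make_doc(gold.text), gold)
    for gold in doc_bin.get_docs(nlp.vocab)
]
scores = nlp.evaluate(examples)  # runs the pipeline and scores the predictions
print(scores["lemma_acc"])       # the same number `spacy evaluate` reports for lemmas
```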
-
Hi, I have some good news to report. I got the surprising result of 99.21 with different settings: backoff = null, and no lookup tables this time. This benchmark beats both Stanza (88.26) and Trankit (88.52). I was not using transformers, but resources of the type used in canonical _lg models. The corpus was grc UD Perseus with my own augmentations. I imagine similar settings will also work for Modern Greek. The link to the experiment is here: https://wandb.ai/jmyerston/ud_perseus_lg?workspace=user-jmyerston
J.
-
Hi @danieldk,
I am experimenting with the new lemmatizer and Ancient Greek, and have already trained some models. I think grc is a good test case because of the language's complex morphology and the fact that Ancient Greek also uses a verbal prefix to mark past tenses (like the German and Dutch ge-, but extended to all past-tense forms of the indicative).
I will share some benchmarks and models soon but I have some questions.
The first is: what are the options for backoff besides orth? Is it possible to use the lookup table as a backoff option?
The other question is how a model trained with this lemmatizer handles OOV terms. Will the model be able to transfer what it has learned to words that were not in the training data?
By the way, with the default settings the pretrained tok2vec layer does not seem to have any effect on the lemmatizer's accuracy; my _sm and _lg models report the same accuracy.
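For what it's worth, a rough sketch of where the two relevant knobs live in a training config, edited programmatically; the paths and component names are placeholders (it assumes a quickstart-style config that exposes paths.init_tok2vec and contains a trainable_lemmatizer component), not the actual setup from this thread:

```python
from spacy import util

# Load a generated config (e.g. from `spacy init config`) without interpolating
# the ${...} variables, tweak it, and write it back out.
config = util.load_config("./config.cfg", interpolate=False)

# Initialize the shared tok2vec layer from weights produced by `spacy pretrain`.
config["paths"]["init_tok2vec"] = "./pretrain/model999.bin"

# No string backoff: tokens without an applicable edit tree keep no lemma.
config["components"]["trainable_lemmatizer"]["backoff"] = None

# Optionally keep the pretrained tok2vec weights fixed during training; note
# that a frozen tok2vec with listeners may also need to be listed under
# training.annotating_components (or have its listeners replaced).
# config["training"]["frozen_components"] = ["tok2vec"]

config.to_disk("./config_grc_lemmatizer.cfg")
```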
Best,
Jacobo