Edit Tree Lemmatizer: Some Questions #9758
-
Thanks for trying the edit tree lemmatizer!
If longer prefixes are relevant, you may want to try something other than …
You can use any of the string-valued attributes of Token as the backoff, such as …. Another possibility is to set the backoff to None and add a lookup lemmatizer after the edit tree lemmatizer, so that tokens for which no edit tree applies fall back to the lookup table (see the sketch below).
For regular forms, the knowledge should be transferred to OOV forms. How well this works depends on the complexity of the morphology of a language.
Interesting! Do you freeze the tok2vec layer? Did you try to train the models from scratch?
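For reference, a minimal sketch of the setup suggested above (no backoff plus a lookup fallback). This is illustrative, not the exact configuration from this thread; it assumes spaCy v3.3+ and that a lemma lookup table for the language is available (e.g. via the spacy-lookups-data package) when the pipeline is initialized:

```python
import spacy

# Illustrative sketch: edit tree lemmatizer with no backoff, followed by a
# lookup lemmatizer that only fills in lemmas for tokens the edit tree
# lemmatizer left unset. The edit tree component still needs to be trained.
nlp = spacy.blank("grc")
nlp.add_pipe(
    "trainable_lemmatizer",
    config={"backoff": None, "min_tree_freq": 3, "top_k": 1},
)
# mode="lookup" requires a lemma lookup table at initialization time;
# overwrite=False keeps the edit tree predictions and only fills the gaps.
nlp.add_pipe("lemmatizer", config={"mode": "lookup", "overwrite": False})

print(nlp.pipe_names)
```

In a training config this roughly corresponds to `backoff = null` for the trainable_lemmatizer component with the lookup lemmatizer added after it; since the lookup component is not trained, it can also be appended to an already trained pipeline.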
-
I trained a model with transformers and edit_tree with backoff set to null, and added the spaCy lemmatizer configured to use the lookup table. With this setup, lemmatization accuracy improved from 88 to 95, so it is an improvement over the lookup table alone. Interestingly, edit_tree with backoff set to orth stays around 88 even with a transformer, and there is no improvement if I do not use the lookup table. I imagine this has to do with the fact that grc is a highly inflected language, so orth is unlikely to be the lemma. Another interesting thing is that I get better results with a pretrained tok2vec layer and fastText vectors: in that case, lemmatization accuracy improved from 88 to 96.
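As an aside, here is a rough sketch of how a lemma accuracy figure like the ones above can be computed against a held-out corpus in .spacy format; the paths are placeholders, not the ones used in these experiments:

```python
import spacy
from spacy.tokens import DocBin
from spacy.training import Example

# Placeholder paths: a trained pipeline and a dev corpus in .spacy format.
nlp = spacy.load("./training/model-best")
doc_bin = DocBin().from_disk("./corpus/dev.spacy")

# Pair untouched copies of the texts with the gold-annotated docs.
examples = [
    Example(nlp.make_doc(gold.text), gold)
    for gold in doc_bin.get_docs(nlp.vocab)
]
scores = nlp.evaluate(examples)  # runs the pipeline and scores the predictions
print(scores["lemma_acc"])       # the same number `spacy evaluate` reports for lemmas
```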
-
Hi, I have some good news to report. I got the surprising result of 99.21 with different settings: backoff = null, and no lookup tables this time. This benchmark beats both Stanza (88.26) and Trankit (88.52). I was not using transformers, but resources of the type used in canonical _lg models. The corpus was grc UD Perseus with my own augmentations. I imagine similar settings will also work for Modern Greek. The link to the experiment is here: https://wandb.ai/jmyerston/ud_perseus_lg?workspace=user-jmyerston
J.
-
Hi @danieldk,
I am experimenting with the new lemmatizer and Ancient Greek, and have already trained some models. I think grc is a good test case because of the language's complex morphology and the fact that Ancient Greek also uses a verbal prefix to mark past tenses (like the German and Dutch ge-, but extended to all past-tense forms of the indicative).
I will share some benchmarks and models soon but I have some questions.
The first is: what are the options for backoff besides orth? Is it possible to use the lookup table as a backoff option?
The other question is how a model trained with this lemmatizer handles OOV terms. Will the model be able to transfer what it has learned to words that were not in the training data?
By the way, with the default settings the pretrained tok2vec layer does not seem to have any effect on the lemmatizer's accuracy; my _sm and _lg models report the same accuracy.
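For what it's worth, a rough sketch of where the two relevant knobs live in a training config, edited programmatically; the paths and component names are placeholders (it assumes a quickstart-style config that exposes paths.init_tok2vec and contains a trainable_lemmatizer component), not the actual setup from this thread:

```python
from spacy import util

# Load a generated config (e.g. from `spacy init config`) without interpolating
# the ${...} variables, tweak it, and write it back out.
config = util.load_config("./config.cfg", interpolate=False)

# Initialize the shared tok2vec layer from weights produced by `spacy pretrain`.
config["paths"]["init_tok2vec"] = "./pretrain/model999.bin"

# No string backoff: tokens without an applicable edit tree keep no lemma.
config["components"]["trainable_lemmatizer"]["backoff"] = None

# Optionally keep the pretrained tok2vec weights fixed during training; note
# that a frozen tok2vec with listeners may also need to be listed under
# training.annotating_components (or have its listeners replaced).
# config["training"]["frozen_components"] = ["tok2vec"]

config.to_disk("./config_grc_lemmatizer.cfg")
```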
Best,
Jacobo