Variety in lemmatizer output #10632
-
I am training an NER model that also needs a lemmatizer. I initialized the lemmatizer from the source model en_core_web_sm, since it works well as-is and does not need to be trained for my use case. The NER part works well; however, I have some questions about the lemmatizer's behaviour.
So I was wondering: can this behaviour be explained by the lemmatizer being context dependent, or are these artifacts of some mistake during training?
Replies: 1 comment
-
Hello,
you also need to freeze the `tok2vec` component, since all the other components sourced from en_core_web_sm are listening to it. The lemmatizer depends on the `tagger`, so you can also verify whether the POS tags are correct.
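To make this concrete, here is a sketch of what the relevant parts of a spaCy v3 training config could look like when sourcing components from en_core_web_sm and freezing them so that only the NER component is updated. The exact pipeline components are an assumption based on the standard en_core_web_sm pipeline; adjust to match your actual config.

```
# Hypothetical excerpt of a spaCy v3 training config (illustrative, not
# your exact setup). Components reused from en_core_web_sm are sourced,
# then listed in frozen_components so training only updates "ner".

[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","attribute_ruler","lemmatizer","ner"]

[components.tok2vec]
source = "en_core_web_sm"

[components.tagger]
source = "en_core_web_sm"

[components.attribute_ruler]
source = "en_core_web_sm"

[components.lemmatizer]
source = "en_core_web_sm"

[training]
# Freeze everything sourced from en_core_web_sm, including tok2vec,
# since the sourced tagger and lemmatizer listen to it.
frozen_components = ["tok2vec","tagger","attribute_ruler","lemmatizer"]
```

If the sourced `tok2vec` is not frozen, its weights change during NER training and the components listening to it (such as the `tagger`) degrade, which in turn affects the rule-based lemmatizer through incorrect POS tags.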