Does lemmatization of training data help to improve real performance? #8407
Pandalei97 started this conversation in Help: Best practices
Replies: 1 comment

Reply:
By default, lemmas are not used in training any models. I think it's possible to specify them as a feature to use in the tok2vec, though you'd have to put a lemmatizer before the tok2vec in the pipeline to make them available. (Technically the extra attributes are meant for subword features, but I think any token attribute can be used.) If you're working with English, the types of words that are most useful for NER or textcat don't usually inflect, so using lemmas doesn't help you much. An exception would be if you have noisy data with a lot of case issues, in which case you can just use LOWER instead.
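A minimal sketch of what that could look like as excerpted pieces of a spaCy v3 training config, assuming the MultiHashEmbed embedding layer, a lookup-mode lemmatizer, and spaCy v3.1+ for `annotating_components`; the row sizes, widths, and pipeline layout are illustrative rather than a recommended setup:

```ini
[nlp]
lang = "en"
# Lemmatizer comes before tok2vec so lemmas are already set when features are extracted.
pipeline = ["lemmatizer","tok2vec","ner"]

[components.lemmatizer]
factory = "lemmatizer"
# Lookup mode avoids needing a tagger for POS tags; it requires the
# spacy-lookups-data package to be installed.
mode = "lookup"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
# Should match the encoder width used elsewhere in the config.
width = 96
# LEMMA added to the default attribute set; LOWER could be used instead
# for noisy, badly cased text.
attrs = ["NORM","PREFIX","SUFFIX","SHAPE","LEMMA"]
rows = [5000,2500,2500,2500,5000]
include_static_vectors = false

[training]
# Let the (non-trainable) lemmatizer annotate the training docs, so the
# tok2vec actually sees lemmas during training (spaCy v3.1+).
annotating_components = ["lemmatizer"]
```

The `annotating_components` setting is what lets the lemmatizer set lemmas on the training docs, so the tok2vec sees the same attribute values at training time as it will at prediction time, when the whole pipeline runs in order.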
Original post by Pandalei97:
Hi! I'm thinking of training NER and textcat models on lemmatized data in order to improve my models' scores.
When the NER/textcat models make predictions, do they use the lemmatized form or the raw form of the doc? If they only use the raw form, then even if my models get good scores, they may not be useful in practice.
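If the lemmatization is done as a separate preprocessing step on the training texts, the same step has to be applied to every text at prediction time, which is the mismatch the question is pointing at. A minimal Python sketch under that assumption; `en_core_web_sm`, `lemmatize_text`, and `nlp_trained` are illustrative names, not anything from the thread:

```python
import spacy

# Assumed: a pipeline that can produce lemmas (en_core_web_sm includes a lemmatizer).
nlp_lemmas = spacy.load("en_core_web_sm")

def lemmatize_text(text: str) -> str:
    """Replace every token by its lemma, preserving the original whitespace."""
    doc = nlp_lemmas(text)
    return "".join(token.lemma_ + token.whitespace_ for token in doc)

# Hypothetical usage: both the training texts and any text sent to the trained
# NER/textcat pipeline (nlp_trained, assumed to exist) must go through the same
# preprocessing, otherwise the model is evaluated on text it was never trained on.
# train_texts = [lemmatize_text(t) for t in raw_train_texts]
# doc = nlp_trained(lemmatize_text("The cats were chasing mice in the gardens."))
```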