Components Required for Lemmatizer #9194

alyserecord · 2021-09-13T00:58:53Z

Based on the diagram located here in your documentation, I believe I am reading that the tok2vec, tagger, and attribute_ruler all must be enabled in the pipeline in order to utilize the built-in lemmatizer in the small English model. Is my understanding correct?

I was a little bit confused by the description below the diagram that says "...requires token.pos annotation from either tagger+attribute_ruler or morphologizer" because it did not mention the tok2vec component as a dependency. From what I've seen the lemmatizer does not produce lemmas without the tok2vec?

The reasoning behind my question is that after upgrading from the v2 to the v3 small English model and enabling the tok2vec, tagger, and attribute_ruler components before the lemmatizer component in my pipeline, I am noticing a big increase in the processing time. I am looking for any way to trim the processing time down, so I am reviewing which components are actually needed.

Answered by polm

polm · 2021-09-13T04:49:03Z

> I believe I am reading that the tok2vec, tagger, and attribute_ruler all must be enabled in the pipeline in order to utilize the built-in lemmatizer in the small English model. Is my understanding correct?

Yes.

> ... because it did not mention the tok2vec component as a dependency. From what I've seen the lemmatizer does not produce lemmas without the tok2vec?

The lemmatizer doesn't depend on the tok2vec directly, but the tagger needs the tok2vec in order to work. If you had some other way to get POS tags without the tok2vec, the lemmatizer would happily use them.

> I am noticing a big increase in the processing time.

What kind of documents are you working with (length/volume), and how much slower is it? We had some slowdown related to the Matcher, and I wouldn't be surprised if some other parts of the pipeline had slowed a little over time, but I don't think we've had reports of major slowdown for stuff like the tagger before.

3 replies

alyserecord Sep 13, 2021
Author

Thanks for getting back to me so quickly.

My data is product reviews, so the documents vary a bit in size but are not terribly long. I've been comparing the runtime for the first 50,000 records in my dataset. Getting the lemmas took less than 6 seconds in v2, but in v3, because of all the components I have to enable, it is taking 2+ minutes for the same set of data.

Code for v2 (this is strictly performance-testing code to measure how long it takes to get the lemmas):

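import spacy

# `model` is the v2 pipeline name and `df` the reviews DataFrame (both defined elsewhere)
nlp = spacy.load(model, disable=["tagger", "parser", "ner"])
print(nlp.pipe_names)
for doc in nlp.pipe(df.description[:50000]):
    lemmas = [token.lemma_ for token in doc]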

And for v3:

import spacy

# `df` is the same reviews DataFrame as in the v2 snippet
nlp = spacy.load("en_core_web_sm", exclude=["ner", "senter"])
print(nlp.pipe_names)
for doc in nlp.pipe(df.description[:50000]):
    lemmas = [token.lemma_ for token in doc]

Let me know if there is anything I'm missing and there is a way to speed up getting the lemmas in v3.
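
The comparison described above can be made repeatable with a small timing helper. This is a minimal sketch using only the standard library; `time_lemmatization` and `whitespace_lemmas` are hypothetical names introduced here, and the whitespace splitter merely stands in for a real pipeline so the snippet runs without any model installed.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_lemmatization(
    make_lemmas: Callable[[List[str]], Iterable[List[str]]],
    texts: List[str],
) -> Tuple[float, int]:
    """Return (seconds elapsed, tokens processed) for one pass over texts."""
    start = time.perf_counter()
    # Consume the whole iterable so lazy generators are actually executed.
    n_tokens = sum(len(lemmas) for lemmas in make_lemmas(texts))
    return time.perf_counter() - start, n_tokens


def whitespace_lemmas(texts: List[str]) -> Iterable[List[str]]:
    """Stand-in for a pipeline: lowercased whitespace tokens, not real lemmas."""
    return (text.lower().split() for text in texts)


seconds, n_tokens = time_lemmatization(
    whitespace_lemmas, ["Great product", "Works well enough"]
)
print(f"{n_tokens} tokens in {seconds:.4f}s")
```

To time a spaCy pipeline instead, one could pass something like `lambda texts: ([t.lemma_ for t in doc] for doc in nlp.pipe(texts))`, assuming `nlp` is already loaded.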

Also curious if you could tell me if there is an accuracy improvement of the lemmatizer in v3? I was looking around the website for metrics on lemmatization accuracy but couldn't find any.

polm Sep 14, 2021

The main difference between your v2 and v3 code is that in v3 you haven't disabled the parser. If you aren't using the parser, disable it; that should make things faster. There may be more you can do, but that would be the first thing.

Also, it may not make a difference here, but exclude and disable are different things - see the docs.
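
In short, a disabled component is still loaded but skipped (and can be re-enabled later), while an excluded component is not loaded at all. Here is a minimal sketch of that difference, assuming spaCy v3 is installed. To avoid depending on a downloaded model, it saves a blank English pipeline with a `sentencizer` to a temporary directory and reloads it both ways; `load_variants` is a hypothetical helper name.

```python
import tempfile

import spacy


def load_variants():
    """Reload the same saved pipeline once with disable and once with exclude."""
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    with tempfile.TemporaryDirectory() as path:
        nlp.to_disk(path)
        disabled = spacy.load(path, disable=["sentencizer"])
        excluded = spacy.load(path, exclude=["sentencizer"])
    return disabled, excluded


disabled, excluded = load_variants()

# Disabled: the component is loaded but not run.
print("disable:", disabled.pipe_names, disabled.component_names, disabled.disabled)
# Excluded: the component was never loaded.
print("exclude:", excluded.pipe_names, excluded.component_names)

# Only the disabled variant can switch the component back on.
disabled.enable_pipe("sentencizer")
print("after enable_pipe:", disabled.pipe_names)
```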

> Also curious if you could tell me if there is an accuracy improvement of the lemmatizer in v3? I was looking around the website for metrics on lemmatization accuracy but couldn't find any.

We don't calculate any kind of scores for the lemmatizer, partly because it's rule-based. I'm not even sure our training data includes lemmas to check against.

adrianeboyd Sep 14, 2021

Let me jump in here:

This is a confusing part of v2 pipelines, and it's not related to the parser. With tagger enabled for en_core_web_sm v2.3.0, you get rule-based lemmas. With the tagger disabled, you get default/backoff lookup lemmas from a lookup table, which is much much faster.
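
To illustrate why the two modes differ in cost, here is a toy sketch in plain Python. It is not spaCy's implementation: the table, the suffix rules, and the function names are all made up for illustration. The point is only that a lookup needs nothing but the token text, while rules are selected by part of speech, which is why the rule-based mode needs a tagger upstream.

```python
# Toy data, invented for this example only.
LOOKUP_TABLE = {"reviews": "review", "was": "be", "better": "good"}

SUFFIX_RULES = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
}


def lookup_lemma(word: str) -> str:
    """Lookup mode: one dictionary access, no POS tag required."""
    return LOOKUP_TABLE.get(word, word)


def rule_lemma(word: str, pos: str) -> str:
    """Rule mode: the applicable suffix rules depend on the POS tag."""
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


print(lookup_lemma("was"))
# The same surface form gets different lemmas depending on its tag.
print(rule_lemma("meeting", "NOUN"), rule_lemma("meeting", "VERB"))
```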

In v3 there are more explicit lemmatizer settings and no default/backoff behavior. en_core_web_sm v3.1.0 only includes the rule-based lemmatizer and there's no lookup lemma table. If you want lookup lemmas in v3, first install the package spacy-lookups-data and then:

import spacy
nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()
doc = nlp("...")

This is a fast pipeline that just produces lookup lemmas and doesn't do anything else (other than tokenize).

Our training data for en_core pipelines doesn't include lemmas, so we don't have an official accuracy score here. You'd have to run the pipeline on your own corpus with lemmas to evaluate it.
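
Such an evaluation could be as simple as token-level accuracy against gold lemmas. This is a minimal sketch, assuming the predicted and gold lemmas are already aligned token by token; `lemma_accuracy` is a hypothetical helper, not part of spaCy, and the sample data is invented.

```python
from typing import List, Sequence


def lemma_accuracy(
    predicted_docs: Sequence[List[str]], gold_docs: Sequence[List[str]]
) -> float:
    """Fraction of tokens whose predicted lemma equals the gold lemma."""
    if len(predicted_docs) != len(gold_docs):
        raise ValueError("different number of documents")
    correct = 0
    total = 0
    for predicted, gold in zip(predicted_docs, gold_docs):
        if len(predicted) != len(gold):
            raise ValueError("tokenization mismatch between prediction and gold")
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total if total else 0.0


# Invented example: one of four lemmas disagrees with the gold annotation.
predicted = [["the", "review", "be", "good"]]
gold = [["the", "review", "be", "well"]]
score = lemma_accuracy(predicted, gold)
print(f"lemma accuracy: {score:.2f}")
```

With a spaCy pipeline, the predicted lists would come from something like `[[t.lemma_ for t in doc] for doc in nlp.pipe(texts)]`, provided the gold corpus uses the same tokenization.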

alyserecord · 2021-09-15T19:34:47Z

Got it. Thank you both for the information!

0 replies
