Spanish lemmatizer doesn't work for future tense verbs #10376

buhrmann · 2022-02-24T11:57:15Z


How to reproduce the behaviour

The following applies to the attached text: Castilla y León Programa electoral 2022.txt

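import pandas as pd
from IPython.display import display

# `es` is the loaded Spanish pipeline and `txt` the contents of the attached file
doc = es(txt)
# Collect every token whose lemma ends in "rer"
lemmas = [(t.i, t, t.lemma_, t.pos_, t.morph) for t in doc if t.lemma_.endswith("rer")]
df = pd.DataFrame(lemmas, columns=["idx", "text", "lemma", "pos", "morph"]).head(25)
with pd.option_context("display.max_rows", 100, "display.width", None, "display.max_colwidth", None):
    display(df)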

As you can see, almost all of these lemmas are wrong: an extra "er" suffix is appended to the correct lemma form.

Here is more detail on one particular verb in context:

import spacy
import spacy_lookups_data

token = doc[3352]
print(token.sent, "\n")
print(f"{token} ({token.tag_}): {token.lemma_} \n")

lemmatizer = es.get_pipe("lemmatizer")
print(lemmatizer.rule_lemmatize(token))
print(lemmatizer.lookup_lemmatize(token), "\n")

print(es.get_factory_meta("lemmatizer"), "\n")

print(f"{spacy.__version__=}")
print(f"{spacy_lookups_data.__version__=}")

Your Environment

Operating System: Linux
Python Version Used: 3.8.8
spaCy Version Used: 3.2.0

Environment Information:

es.meta: es_meta.json.txt

es.config: es_config.json.txt

adrianeboyd · 2022-02-25T08:06:03Z


I had a look at the underlying lemmatizer rules and I think this case would work correctly if the token.morph info were correct.

import spacy
nlp = spacy.load("es_core_news_sm")
doc = nlp("trabajaremos")
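# The morphologizer mislabels this future-tense form as present tense
assert str(doc[0].morph) == "Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin"
# Overwrite the analysis with the correct future-tense features
doc[0].set_morph("Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin")
# With the corrected morphology, the rule lemmatizer returns the right lemma
assert nlp.get_pipe("lemmatizer").rule_lemmatize(doc[0]) == ['trabajar']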

The training corpus probably does not contain a lot of 1st person or future verbs, so the morphologizer doesn't learn how to tag these verbs correctly, which cascades into lemmatizer errors. In a rough count, only 0.2% of verbs have this morph tag in the training corpus.
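A rough count like the one mentioned above could be sketched as follows, assuming the training corpus is available in CoNLL-U format. The `SAMPLE` data and the helper name `verb_morph_shares` are made up for illustration, and only tokens tagged `VERB` are counted; this is not necessarily how the figure above was obtained.

```python
from collections import Counter

# Hypothetical three-token sample in CoNLL-U format (10 tab-separated columns);
# a real count would read the training corpus file instead.
SAMPLE = "\n".join([
    "# text = Trabajaremos y ganamos",
    "1\tTrabajaremos\ttrabajar\tVERB\t_\tMood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin\t0\troot\t_\t_",
    "2\ty\ty\tCCONJ\t_\t_\t3\tcc\t_\t_",
    "3\tganamos\tganar\tVERB\t_\tMood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin\t1\tconj\t_\t_",
])


def verb_morph_shares(conllu_text):
    """Return {FEATS string: share of VERB tokens carrying it}."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        if cols[3] == "VERB":  # column 4 is UPOS, column 6 is FEATS
            counts[cols[5]] += 1
    total = sum(counts.values())
    return {feats: n / total for feats, n in counts.items()} if total else {}


shares = verb_morph_shares(SAMPLE)
target = "Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin"
print(f"{shares.get(target, 0.0):.1%}")
```

On a real corpus, a very small share for `target` would be consistent with the morphologizer seeing too few examples to learn the tag.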

I think that improving the morph tags would improve the lemmas for this kind of case, but unfortunately it's not as simple as fixing the lemmatizer rules.
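To illustrate how a wrong tense tag can cascade into the extra "er" seen in the report, here is a toy sketch. The rule table `TOY_RULES` and the function `toy_lemmatize` are hypothetical and are not spaCy's actual Spanish rules; they only mimic the idea that suffix rules are selected by the morphological analysis.

```python
# Hypothetical suffix rules keyed by the Tense feature; these are NOT spaCy's
# actual Spanish rule tables, only an illustration of morph-dependent rules.
TOY_RULES = {
    "Pres": [("emos", "er"), ("amos", "ar"), ("imos", "ir")],
    "Fut": [("emos", "")],  # future = infinitive + ending, so just strip it
}


def toy_lemmatize(form, morph):
    """Apply the first matching suffix rule for the Tense found in `morph`."""
    feats = dict(f.split("=", 1) for f in morph.split("|") if "=" in f)
    for suffix, replacement in TOY_RULES.get(feats.get("Tense"), []):
        if form.endswith(suffix):
            return form[: -len(suffix)] + replacement
    return form  # no rule matched: fall back to the surface form


# Same token, two analyses: the mislabelled present vs. the correct future
wrong = toy_lemmatize("trabajaremos", "Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin")
right = toy_lemmatize("trabajaremos", "Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin")
print(wrong, right)  # trabajarer trabajar
```

Under these assumed rules, the present-tense analysis treats "-emos" as the ending of an "-er" verb and produces "trabajarer", while the future-tense analysis strips the ending from the infinitive and gives "trabajar".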

4 replies

adrianeboyd Feb 25, 2022

As a note, the experimental edit tree lemmatizer does get this lemma correct in this trf model: https://huggingface.co/explosion/es_udv25_spanishancora_trf

In this pipeline the lemmatizer doesn't depend directly on the morphologizer output, either, but both have the correct tags, in this simple example anyway.

buhrmann Mar 9, 2022
Author

Hi, thanks for looking into this! I didn't even notice the wrong tense being inferred in my own examples... Looking forward to trying the edit tree lemmatizer once it's not experimental anymore.

nlovell1 Jun 6, 2022

hi, interested in trying out the experimental edit tree lemmatizer... could someone link me to where in the docs there would be an example of its use? can't seem to find it on my own

adrianeboyd Jun 20, 2022

It's no longer experimental and is now called trainable_lemmatizer in spacy: https://spacy.io/api/edittreelemmatizer

You can add it to your pipeline in the training quickstart or with spacy init config -p trainable_lemmatizer.
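As a minimal sketch of the Python side, assuming spaCy 3.3 or later is installed, the component can also be added to a blank Spanish pipeline by its factory name. The component is untrained at this point and still needs training data (for example via `spacy train`) before it predicts lemmas.

```python
import spacy

# Assumes spaCy >= 3.3, where the "trainable_lemmatizer" factory is built in.
nlp = spacy.blank("es")
# Adds an untrained edit tree lemmatizer; it must be trained before use.
nlp.add_pipe("trainable_lemmatizer")
print(nlp.pipe_names)
```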
