Incorrect lemmas for Spanish: spaCy returns feminine lemmas #12645
Hello,

For example, we observe this behaviour for the word 'clienta', whose lemma we expect to be 'cliente'. See our code and results below:

Code and results for es_dep_news_trf:
Code and results for es_core_news_sm:
Code and results for es_core_news_md:
Code and results for es_core_news_lg:

It seems odd to us that the model recognizes the gender of the word correctly but does not produce the lemma we expect. Is there a lemmatizing function that we are missing?

Environment: Operating System: Linux
Replies: 2 comments
Lots of decisions about lemma annotation (for example, how to lemmatize pronouns or punctuation) are task-specific or corpus-specific. (Using the masculine plural for nouns sounds unusual to me, though. Is there a particular Spanish corpus or dictionary that does this?)

Looking at UD_Spanish-AnCora, which we're using to evaluate the rule-based Spanish lemmatizer in the es_* pipelines, it looks like the feminine forms of similar words are lemmatized to the feminine singular.

If you need different lemmas, you could modify the rules+exceptions for the current rule-based lemmatizer, or you could potentially use the trainable lemmatizer with training data that uses the alternate forms. The data behind the rule-based lemmatizer is available under https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data
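To make the "rules+exceptions" idea concrete, here is a rough, self-contained sketch of how an exceptions table can take priority over suffix rules. It is a toy illustration only, not spaCy's implementation and not the format of the spacy-lookups-data tables. The names `RULES`, `CUSTOM_EXCEPTIONS` and `lemmatize`, and the single suffix rule, are made up for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical suffix rules per part of speech: (suffix, replacement).
# A single toy rule that strips a plural "s"; real rule tables are larger.
RULES: Dict[str, List[Tuple[str, str]]] = {
    "NOUN": [("s", "")],
}

# Hypothetical exceptions that map feminine forms to a masculine lemma.
# These entries are assumptions for illustration, not shipped data.
CUSTOM_EXCEPTIONS: Dict[str, Dict[str, str]] = {
    "NOUN": {"clienta": "cliente", "clientas": "cliente"},
}


def lemmatize(
    word: str,
    pos: str,
    rules: Dict[str, List[Tuple[str, str]]],
    exceptions: Dict[str, Dict[str, str]],
) -> str:
    """Return a lemma: exceptions are checked first, then suffix rules."""
    key = word.lower()
    pos_exceptions = exceptions.get(pos, {})
    if key in pos_exceptions:
        # An exception entry overrides whatever the rules would produce.
        return pos_exceptions[key]
    for suffix, replacement in rules.get(pos, []):
        if key.endswith(suffix) and len(key) > len(suffix):
            return key[: len(key) - len(suffix)] + replacement
    # No exception and no matching rule: fall back to the lowercased form.
    return key


# Without exceptions, the feminine plural reduces to the feminine singular.
print(lemmatize("clientas", "NOUN", RULES, {}))
# With the custom exceptions, the same word maps to the masculine form.
print(lemmatize("clientas", "NOUN", RULES, CUSTOM_EXCEPTIONS))
```

With an empty exceptions table the toy rules keep 'clienta' as the lemma, which mirrors the feminine singular convention described above. Adding exception entries is what switches the output to 'cliente'.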
Thank you very much for the rapid response! My mistake, I was thinking of the masculine SINGULAR as the base form of words in Spanish, not the plural. Sorry about that. Thanks a lot for the information about the UD_Spanish-AnCora corpus. Indeed, the feminine nouns are lemmatized as their feminine singular form. We were not aware of that. We'll try to use the rule-based lemmatizer. Best,