Italian lemmatizer low performance on agglutinated verbs #12910

@ferrixio

Hi,

I recently used spaCy 3.4.4 to classify Italian verbs, but ran into the following problem with the pretrained model it_core_news_lg:

```
sentence = "aprimi la porta"
--- output ---

text    lemma   pos   tag
aprimi  aprimo  ADJ   A
la      il      DET   RD
porta   porta   NOUN  S
```
Sadly, the pipeline tags the verb "aprimi" as an adjective, and in other cases the lemmatizer fails to recover the right infinitive: for the sentence "leggimi un libro", spaCy reported that "leggimi" comes from the verb "leggimare".
In general, spaCy seems to have difficulty with agglutinated verbs that carry attached pronouns. I also updated spaCy to version 3.6.1, but the problem persists.
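For reference, a minimal script along these lines should reproduce the behaviour for both sentences. It is only a sketch: it assumes the it_core_news_lg model is installed, and the helper names `analyze`, `format_rows` and `main` are made up for illustration rather than part of spaCy.

```python
def analyze(nlp, text):
    """Return (text, lemma, pos, tag) for every token the pipeline produces."""
    return [(t.text, t.lemma_, t.pos_, t.tag_) for t in nlp(text)]


def format_rows(rows, header=("text", "lemma", "pos", "tag")):
    """Render the rows as a left-aligned, space-padded table."""
    table = [header, *rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    )


def main():
    # Imported here so the helpers above can be used without spaCy installed.
    import spacy

    # Assumes the model was installed beforehand, e.g. with
    #   python -m spacy download it_core_news_lg
    nlp = spacy.load("it_core_news_lg")
    for sentence in ("aprimi la porta", "leggimi un libro"):
        print(f'sentence = "{sentence}"')
        print(format_rows(analyze(nlp, sentence)))
        print()


if __name__ == "__main__":
    try:
        main()
    except (ImportError, OSError) as err:
        # spaCy missing, or the model package could not be found.
        print(f"spaCy or it_core_news_lg is not available: {err}")
```

The same two helpers can be pointed at any other loaded pipeline to compare how it handles these verb forms.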

Is there a known reason for this?

Many thanks!

Your Environment

  • Operating System: Windows 10 and Windows 11
  • Python Version Used: 3.9.6 and 3.11.1
  • spaCy Version Used: 3.4.4 and 3.6.1

Labels

  • feat / lemmatizer (Feature: Rule-based and lookup lemmatization)
  • lang / it (Italian language data and models)
