Improved Italian lemmatizer: ongoing work or plans? #7824

gtoffoli · 2021-04-19T11:08:32Z

I sorely miss a good Italian lemmatizer in spaCy.
The reasons for that have been given in the past for similar languages: see, for example, #2710 and the work of Guadalupe Romero: https://twitter.com/_guadiromero/status/1213211033541758979.

Almost 2 years ago I wrote a general post (#3801) on improving support for Italian in spaCy.
Now I would like to know whether any work on the lemmatizer is ongoing.
If not, I would try to do it myself; however, I don't have much time, so it might take me a few months.

I believe I have essentially two or three options:

1. adapt to Italian the rule-based lemmatizer that was added for Spanish in the latest version, following the work of Guadalupe Romero;
2. compile a look-up table in which each word form occurs several times, associated with the different possible POS-tags;
3. develop and train a statistical lemmatizer; but I have no competence to do this.

It seems to me that option 2 would imply a hybrid between the look-up approach and the rule-based approach. Each look-up table entry would include, in addition to the POS-tag, morphological attributes taken from a morphological lexicon and appropriately converted.
For this option, I might have to ask Professor Marco Baroni or another rights holder for permission to use the morph-it morphological lexicon, and also for permission to use part of the ITWAC (WaCky for Italian) corpus if I wanted to add to each table entry frequency information extracted from that corpus; references:
https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it
https://wacky.sslmit.unibo.it/doku.php?id=corpora

Actually, option 2 could be both a real alternative and a first step towards the development of a rule-based lemmatizer (option 1), but this is not yet clear to me.
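
To make option 2 more concrete, here is a minimal sketch of how a POS-keyed look-up table might be compiled from a morphological lexicon. It assumes tab-separated lines of the form `form<TAB>lemma<TAB>features`, where the features start with a coarse tag such as `VER` or `NOUN-F`; the `TAG_MAP` conversion, the function name and the sample entries are all hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict

# Hypothetical mapping from lexicon tags to UD POS tags (an assumption,
# not the actual conversion table).
TAG_MAP = {"VER": "VERB", "NOUN": "NOUN", "ADJ": "ADJ"}


def build_pos_tables(lines):
    """Group form -> lemma entries into one table per UD POS tag."""
    tables = defaultdict(dict)
    for line in lines:
        form, lemma, features = line.rstrip("\n").split("\t")
        # Keep only the coarse tag: "NOUN-F:s" -> "NOUN", "VER:ind+pres+3+s" -> "VER".
        coarse = features.split(":")[0].split("-")[0]
        pos = TAG_MAP.get(coarse)
        if pos is not None:
            # If a form has several lemmas under the same POS, the first one wins.
            tables[pos].setdefault(form.lower(), lemma)
    return dict(tables)


# Toy entries, made up for illustration.
sample = [
    "porta\tporta\tNOUN-F:s",
    "porta\tportare\tVER:ind+pres+3+s",
    "belle\tbello\tADJ:pos+f+p",
]
tables = build_pos_tables(sample)
print(tables["NOUN"]["porta"], tables["VERB"]["porta"])
```

The same word form ("porta") ends up in two tables with two different lemmas, which is exactly the disambiguation that a single flat look-up table cannot express.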

I would appreciate any information and suggestions. Thanks.

gtoffoli · 2021-04-20T20:27:03Z

Author

Sorry, I started this discussion under the wrong category. It should be moved to "Language support and models", but I don't know how to do that.

1 reply

polm Apr 21, 2021

Might be a permissions issue; I went ahead and moved it.

cayorodriguez · 2021-04-21T10:58:46Z

Is there any update on this discussion? We are also working on a Catalan lemmatizer that can assign lemmas from an existing lookup table that has POS disambiguation ... Thanks

7 replies

adrianeboyd Apr 26, 2021

Hi, this looks like a nice contribution!

If we include the new lemmatizer in the existing it_core_news pipelines, we would evaluate it on the same corpus used for tagging+parsing: https://github.com/UniversalDependencies/UD_Italian-ISDT

You can convert it with:

spacy convert -n 10 -T file.conllu .

And evaluate with:

spacy evaluate it_core_news_sm_with_pos_lemmatizer file.spacy

Instead of a duplicate legacy table, would it be possible to try to use the existing lemma_lookup table as a backoff instead? It would be better if these huge tables aren't duplicated in spacy-lookups-data, which is already quite large.
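
The backoff idea could look roughly like the sketch below: try the POS-specific table first, then fall back to the existing flat table, and finally to the word form itself. This is plain Python, not the actual spaCy lemmatizer code; the function name, the table layout and the toy entries are assumptions made for illustration.

```python
def lemmatize(form, pos, pos_tables, legacy_lookup):
    """Return a lemma from the POS table, else the legacy table, else the form."""
    key = form.lower()
    # 1. POS-aware lookup: one dict per POS tag.
    lemma = pos_tables.get(pos, {}).get(key)
    if lemma is not None:
        return lemma
    # 2. Backoff to the existing POS-agnostic table (exact form, then lowercased).
    for candidate in (form, key):
        if candidate in legacy_lookup:
            return legacy_lookup[candidate]
    # 3. Last resort: leave the form unchanged.
    return form


# Toy tables, made up for illustration.
pos_tables = {"VERB": {"porta": "portare"}}
legacy_lookup = {"porte": "porta"}

print(lemmatize("porta", "VERB", pos_tables, legacy_lookup))
print(lemmatize("porte", "NOUN", pos_tables, legacy_lookup))
```

With this layout the POS tables only need to hold entries that differ by POS, so the large legacy table does not have to be copied.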

We would need to look into the details about the licensing to be sure this is something that we could redistribute with the pretrained pipelines.

gtoffoli Apr 28, 2021
Author

Hi Adriane, thanks for your suggestions!
Following them, I've evaluated the Italian with_pos_lemmatizer and got the following results:

78.10 and 78.28, using it_isdt-ud-test.conllu, for it_core_news_sm_with_pos_lemmatizer and it_core_news_md_with_pos_lemmatizer respectively
78.77 and 78.78, using it_isdt-ud-train.conllu

At first, I was quite disappointed, since the results obtained at the application level (KWIC, keywords in context) had led me to expect a much greater improvement over the values 74 and 74 declared in the documentation of the Italian model.
In fact, the same tests run with the current lemmatizer give me the values 72.50 and 72.50, but this doesn't change things substantially.

The good thing is that, from a first visual inspection of the conllu corpus, I got the impression that this discrepancy is explained to a large extent by the different treatment, between conllu and our lexicons, of articles, prepositions and articulated prepositions, with and without elision. I intend to look into the matter further.

As to keeping the existing lemma_lookup table as a backoff, without partially duplicating it, this is ok for me.

As to the licence and copyrights, I'll forward by email the kind response of the first author of morph-it.

gtoffoli Apr 28, 2021
Author

After fixing an entry in the tag_map (lookup_pos => lexicon_pos) used to generate the POS-based lookup tables, and restoring the entire "legacy" lookup table as a "backup", I got slightly better scores:

79.31 and 79.50, using it_isdt-ud-test.conllu, for it_core_news_sm_with_pos_lemmatizer and it_core_news_md_with_pos_lemmatizer respectively;
80.13 and 80.14, using it_isdt-ud-train.conllu.

From the visual inspection of the conllu corpus, I realized that:

the lemmatization criteria of conllu and of our lexicons (it_lemma_lookup.json and morph-it) are different for articles; for example, conllu normalizes to number S and gender M, while our lexicons normalize only to number S;
when articles, prepositions and articulated prepositions undergo elision (one or more final letters being replaced by the apostrophe), in conllu and in the "legacy" lookup table they occur normalized to the form without elision, while morph-it keeps the form with the apostrophe;
unlike the lexicons, conllu splits the articulated preposition (ADP+DET) into its two parts, and then assigns a lemma to each of them.

My provisional conclusions are that:

because conllu and the available lexicons treat definite and indefinite articles and articulated prepositions inconsistently, there is no hope of further significant gains in the evaluation accuracy score for the Italian lookup lemmatizer without modifying the upstream components in the pipeline and/or the evaluation criteria;
removing or reducing the inconsistencies is perhaps not worth the effort required, considering also that they do not constitute significant problems for many text analysis applications that depend on lemmatization;
evaluation criteria that "sterilize" the effects of these inconsistencies would make it easier to measure precisely the effects of what are, from my point of view, more significant improvements to the lemmatization algorithms: improvements to be obtained by revising the morphological lexicon and possibly by adding some rules.
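
One possible reading of "sterilizing" evaluation criteria is to compute lemma accuracy twice: once over all tokens, and once leaving out the POS tags affected by the inconsistencies. The sketch below is not how `spacy evaluate` scores lemmas; it is a hypothetical stand-alone helper, and the sample triples are invented toy data, not corpus results.

```python
def lemma_accuracy(tokens, exclude_pos=frozenset()):
    """Share of tokens whose predicted lemma equals the gold lemma.

    tokens: iterable of (pos, gold_lemma, predicted_lemma) triples.
    exclude_pos: POS tags to leave out of the score.
    """
    scored = [(gold, pred) for pos, gold, pred in tokens if pos not in exclude_pos]
    if not scored:
        return 0.0
    return sum(gold == pred for gold, pred in scored) / len(scored)


# Toy data, made up for illustration: (POS, gold lemma, predicted lemma).
sample = [
    ("DET", "il", "la"),
    ("NOUN", "porta", "porta"),
    ("VERB", "aprire", "aprire"),
    ("ADP", "di", "della"),
]

overall = lemma_accuracy(sample)
filtered = lemma_accuracy(sample, exclude_pos={"DET", "ADP"})
print(overall, filtered)
```

Reporting both numbers side by side would show how much of the error comes from articles and prepositions and how much from the rest of the vocabulary.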

gtoffoli Apr 30, 2021
Author

I made further progress in refining the Italian POS-based lemmatizer: I was able to address a few issues, mainly related to the different ways of lemmatizing prepositions and articles in conllu and morph-it.

After adding some special treatment of ADP, DET, PRON and ADJ, I've got the following scores:

84.33 and 84.54, using it_isdt-ud-test.conllu, for it_core_news_sm_with_pos_lemmatizer and it_core_news_md_with_pos_lemmatizer respectively
85.15 and 85.14, using it_isdt-ud-train.conllu

This means a gain of more than 10 points over the current lookup version.
I can make further progress along this path, but much less.

As already noted, the big issue is the treatment of the articulated preposition: unlike the old and new lexicons, conllu splits the articulated preposition (ADP+DET) into its two parts, and then assigns a lemma to each of them.
It seems that spaCy makes it difficult to comply with the UD Tokenization guidelines.

From the documentation of the Italian conllu (https://universaldependencies.org/docs/it/pos/ADP.html):

Italian distinguishes between simple and articulated prepositions: note however that to comply with the UD Tokenization guidelines the latter are systematically splitted into the following sequence of part-of-speech tags, ADP and DET (e.g. nello “in the” is splitted into in ADP lo DET).

From the documentation of the spaCy Tokenizer, method Tokenizer.add_special_case (https://spacy.io/api/tokenizer):

token_attrs: A sequence of dicts, where each dict describes a token and its attributes. The ORTH fields of the attributes must exactly match the string when they are concatenated.

Thus, I find an intrinsic obstacle in the pipeline architecture of spaCy, which I otherwise like very much. Any idea on how to overcome it? How has it been addressed in other languages with articulated prepositions, like French (du = de le) and Spanish (del = de el)? If the Spanish lemmatizer achieves such a high accuracy score, it must have solved this problem!
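
The constraint quoted from the Tokenizer documentation can be illustrated without spaCy itself. The helper below is hypothetical and simply re-implements the stated rule (the ORTH values must concatenate to the original string); the idea of putting the UD form in a NORM attribute is an untested assumption, not something confirmed in this thread.

```python
def orth_matches(surface, token_attrs):
    """Check the documented rule: ORTH values, concatenated, must equal the string."""
    return "".join(attrs["ORTH"] for attrs in token_attrs) == surface


# Split that follows the UD analysis (in ADP + lo DET): the pieces do not
# concatenate back to "nello", so it violates the rule.
split_by_ud = [{"ORTH": "in"}, {"ORTH": "lo"}]

# Split that preserves the surface string; the NORM value carrying the UD
# form is an assumption for illustration only.
split_by_surface = [{"ORTH": "nel", "NORM": "in"}, {"ORTH": "lo"}]

print(orth_matches("nello", split_by_ud))
print(orth_matches("nello", split_by_surface))
```

A surface-preserving split would still leave the lemmatizer with the piece "nel" rather than "in", so the mismatch with the conllu lemmas would have to be handled in the look-up tables or in a later pipeline component.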

gtoffoli May 12, 2021
Author

After doing some cleaning of the data and of the code, I made

a pull request in the repository explosion / spacy-lookups-data: Add lookup tables for Italian by POS. spacy-lookups-data#23
a pull request in this spaCy repository: Added Italian POS-aware lemmatizer. #8079.

Although I've always developed software and have been using GitHub for many years, I have made only a couple of pull requests in the past and have no experience in contributing to such a complex shared project.
Thus, I apologize in advance for requiring some action by the project maintainers in order to make good use of my commits. Thanks.
