Improved Italian lemmatizer: ongoing work or plans? #7824
gtoffoli
started this conversation in
Language Support
Replies: 2 comments 8 replies
-
Sorry, I started this discussion under the wrong category. It should be moved to "Language support and models", but I don't know how to do that.
1 reply
-
Is there any update on this discussion? We are also working on a Catalan lemmatizer that can assign lemmas from an existing lookup table with POS disambiguation ... Thanks
7 replies
-
I strongly miss a good Italian lemmatizer in spaCy.
The reasons for that have been given in the past for similar languages: see, for example, #2710 and the work of Guadalupe Romero: https://twitter.com/_guadiromero/status/1213211033541758979.
Almost 2 years ago I wrote a general post ( #3801 ) on improving support for Italian in spaCy.
Now, I would like to know if any activities are ongoing concerning the lemmatizer.
If not, I would try to do it myself, although I don't have much time, so it might take me a few months.
I believe I have essentially two or three options:
It seems to me that option 2 would imply a hybrid between the look-up approach and the rule-based approach. Each look-up table entry would include, in addition to the POS-tag, morphological attributes taken from a morphological lexicon and appropriately converted.
For this option, I might have to ask Professor Marco Baroni, or another rights holder, for permission to use the morph-it morphological lexicon, and also for permission to use part of the ITWAC (WaCky for Italian) corpus if I wanted to add frequency information extracted from that corpus to each table entry. References:
https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it
https://wacky.sslmit.unibo.it/doku.php?id=corpora
Actually, option 2 could be both a real alternative and a first step towards the development of a rule-based lemmatizer (option 1), but this is not yet clear to me.
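As a rough illustration of option 2, a hybrid table could look something like the sketch below. Everything in it is an assumption made up for the example: the names `Entry`, `TABLE` and `lemmatize`, the UD-style POS tags and feature strings (morph-it uses its own tagset, which would need converting), and the frequency counts, which are placeholders and not real ITWAC data. It is plain Python, not spaCy's lemmatizer API.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    lemma: str
    pos: str    # coarse POS tag (UD-style, assumed)
    morph: str  # feature string such as "Gender=Fem|Number=Plur" (assumed format)
    freq: int   # placeholder count, NOT taken from any real corpus


# Hypothetical hand-written entries; a real table would be generated
# from a morphological lexicon after converting its tags.
TABLE = {
    "porta": [
        Entry("porta", "NOUN", "Gender=Fem|Number=Sing", 120),
        Entry("portare", "VERB", "Mood=Ind|Number=Sing|Person=3|Tense=Pres", 80),
    ],
    "sale": [
        Entry("sale", "NOUN", "Gender=Masc|Number=Sing", 50),
        Entry("sala", "NOUN", "Gender=Fem|Number=Plur", 30),
        Entry("salire", "VERB", "Mood=Ind|Number=Sing|Person=3|Tense=Pres", 20),
    ],
}


def lemmatize(form: str, pos: str, morph: Optional[str] = None) -> str:
    """Pick a lemma for `form` given its POS tag and, optionally, its features."""
    candidates = [e for e in TABLE.get(form.lower(), []) if e.pos == pos]
    if not candidates:
        # Unknown form or POS: fall back to the lowercased form itself.
        return form.lower()
    wanted = set(morph.split("|")) if morph else set()

    def score(entry: Entry):
        # Prefer entries sharing more features with the tagger's output,
        # then break ties by the (placeholder) frequency.
        return (len(wanted & set(entry.morph.split("|"))), entry.freq)

    return max(candidates, key=score).lemma


print(lemmatize("porta", "VERB"))
print(lemmatize("sale", "NOUN", "Gender=Fem|Number=Plur"))
```

The POS tag alone separates "porta" (noun) from "portare" (verb), while "sale" needs the morphological attributes to choose between "sale" and "sala"; when no attributes are supplied, the frequency decides.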
I would appreciate any information and suggestions. Thanks.