Exploiting available linguistic resources for the Italian language #3801
gtoffoli started this conversation in Language Support
Thanks!
Currently I have no resources to contribute to the spaCy project, but I think it could be useful to point out some linguistic resources available for Italian. The most interesting one I am aware of is the free morphological lexicon morph-it, which was compiled in part from a large annotated corpus covering many years of the national daily newspaper Repubblica. I had the opportunity to use both the lexicon and the corpus a few years ago, while learning to train some NLTK POS taggers and chunkers.
morph-it contains about 500,000 word forms, annotated with POS tags and other features, while the current Italian lemmatizer in spaCy contains about 333,000 word forms.
A classic example of an ambiguous Italian sentence, containing several ambiguous words, is "La vecchia porta la sbarra" (The old woman carries the bar / The old door bars it). I therefore looked up the word form "porta" and found only one (improbable) entry in the spaCy lemmatizer map, versus 5 entries (4 lemmata) in morph-it. References:
https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it
https://github.com/giodegas/morphit-lemmatizer/tree/master/master
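The lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the lexicon is a plain-text file with three tab-separated columns (word form, lemma, features); that layout, the helper names `load_lexicon` and `lemmata`, and the sample lines are assumptions made for the example, and the sample lines are not copied from morph-it.

```python
import io
from collections import defaultdict

# Illustrative lines only, NOT copied from morph-it. The three-column,
# tab-separated layout (form, lemma, features) is an assumption.
SAMPLE = (
    "porta\tporta\tNOUN-F:s\n"
    "porta\tportare\tVER:ind+pres+3+s\n"
    "sbarra\tsbarra\tNOUN-F:s\n"
    "sbarra\tsbarrare\tVER:ind+pres+3+s\n"
)


def load_lexicon(lines):
    """Map each word form to a list of (lemma, features) analyses."""
    lexicon = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        form, lemma, features = parts
        lexicon[form].append((lemma, features))
    return lexicon


def lemmata(lexicon, form):
    """Return the sorted distinct lemmata recorded for a word form."""
    return sorted({lemma for lemma, _ in lexicon.get(form, [])})


lexicon = load_lexicon(io.StringIO(SAMPLE))
print(lexicon["porta"])            # every analysis stored for the form
print(lemmata(lexicon, "porta"))   # distinct lemmata for the form
```

To try this on a real lexicon file, pass an open file object to `load_lexicon` instead of the in-memory sample, using whatever encoding the downloaded file actually has. Keeping all analyses per form, rather than a single lemma, is what makes ambiguous forms such as "porta" visible.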
The corpus annotation is good, although not perfect; it was produced both manually and automatically. Years ago the corpus wasn't open, but I obtained access without difficulty by explaining that I needed it to train some algorithms. Moreover, I think the corpus was annotated using an approach similar to the one used in the "WaCky - The Web-As-Corpus" multilingual project, which you probably already know; the products of that project are open. References:
https://wacky.sslmit.unibo.it/doku.php