A spaCy pipeline component that annotates tokens with their corresponding BabelNet synsets (and lemmas).
Synsets are searched only in the specified language, but other languages can still be retrieved through BabelNet.
If a token has a POS annotation, only synsets with that POS are searched.
Note that BabelNet uses only the four main POS tags: NOUN, ADJ, VERB, ADV.
Only tokens that map to one of these four types are looked up in BabelNet.
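The POS restriction described above can be sketched as a small lookup table. This is a minimal illustration, not the component's actual code: the names SPACY_TO_BABELNET_POS and babelnet_pos are hypothetical, and the specific mapping choices (for example PROPN to NOUN and AUX to VERB) are assumptions.

```python
from typing import Optional

# Hypothetical mapping from spaCy's coarse-grained POS tags (token.pos_)
# to the four POS categories used by BabelNet. The extra entries
# (PROPN, AUX) are assumptions made for illustration.
SPACY_TO_BABELNET_POS = {
    "NOUN": "NOUN",
    "PROPN": "NOUN",
    "VERB": "VERB",
    "AUX": "VERB",
    "ADJ": "ADJ",
    "ADV": "ADV",
}


def babelnet_pos(spacy_pos: str) -> Optional[str]:
    """Return the BabelNet POS for a spaCy tag, or None if the token should be skipped."""
    return SPACY_TO_BABELNET_POS.get(spacy_pos)
```

With a table like this, a pronoun or determiner yields None and is never looked up, which matches the empty results shown for such tokens in the example output.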
The babelnet module is a Python wrapper for the BabelNet API jars.
- install the jcc and openjdk packages
      conda install jcc openjdk
  or else
      pip install jcc
  and install openjdk for your Linux distribution
- edit Makefile and set JNI_DIR to the directory containing include/jni.h
- download and unzip the BabelNet-API archive version 5.0
      make get_api
- build and install the babelnet module
      make babelnet
      make spacy-babelnet
  or else
      python setup.py install
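Before editing the Makefile, it can help to confirm that a candidate JNI_DIR really contains include/jni.h. The following is a minimal sketch using only the standard library; the helper name has_jni_header and the use of the JAVA_HOME environment variable as a starting guess are assumptions, not part of this project.

```python
import os
from pathlib import Path


def has_jni_header(jni_dir: str) -> bool:
    """Return True if <jni_dir>/include/jni.h exists."""
    return (Path(jni_dir) / "include" / "jni.h").is_file()


if __name__ == "__main__":
    # JAVA_HOME is only a guess at where the JDK lives; adjust as needed.
    candidate = os.environ.get("JAVA_HOME", "")
    if candidate and has_jni_header(candidate):
        print(f"JNI_DIR = {candidate}")
    else:
        print("jni.h not found; set JNI_DIR in the Makefile by hand")
```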
The babelnet module looks for the config directory in the current working directory.
Copy the config directory there and edit config/babelnet.var.properties to add your BabelNet API key.
- download the BabelNet indices from BabelNet.org
- unzip them
- edit the config/babelnet.var.properties file to set the indices directory
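Because the configuration is resolved relative to the current directory, a quick pre-flight check can catch a wrong working directory early. The sketch below is an assumption-laden illustration: read_properties and load_babelnet_config are hypothetical helpers, and the parser handles only plain key=value lines, not the full Java properties syntax.

```python
from pathlib import Path
from typing import Dict


def read_properties(path: Path) -> Dict[str, str]:
    """Parse simple key=value lines from a Java-style .properties file.

    Blank lines and lines starting with '#' or '!' are ignored. This is a
    simplified reader: it does not handle line continuations or escapes.
    """
    props: Dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def load_babelnet_config(base_dir: str = ".") -> Dict[str, str]:
    """Read config/babelnet.var.properties under base_dir, or raise FileNotFoundError."""
    path = Path(base_dir) / "config" / "babelnet.var.properties"
    if not path.is_file():
        raise FileNotFoundError(f"expected {path} (run from the directory that holds config/)")
    return read_properties(path)
```

Calling load_babelnet_config() before loading the pipeline shows which values were actually picked up, for instance whether the indices directory points where you expect.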
The wrapper adds the 'babelnet' extension attribute to tokens (accessed as token._.babelnet), containing a Babelnet object that can be used to retrieve the token's synsets or lemmas:
import spacy
import babelnet
from spacy_babelnet import BabelnetAnnotator

nlp = spacy.load('it_core_news_lg')  # spaCy 3

# with spaCy 3
nlp.add_pipe('babelnet', config={
    # 'domain': babelnet.BabelDomain.BUSINESS_INDUSTRY_AND_FINANCE.value,
    'source': babelnet.BabelSenseSource.OMWN_IT.toString()
})

doc = nlp('Mi piace la pizza')  # I like pizza
for token in doc:
    print(token, token.pos_, token._.babelnet.synsets(), token._.babelnet.lemmas(), sep='\n\t')
This produces the following output:
Mi
    PRON
    []
    []
piace
    VERB
    [<BabelSynset: bn:00090362v__like#v#2>, <BabelSynset: bn:00086519v__delight#v#1>, <BabelSynset: bn:00090363v__like#v#3>, <BabelSynset: bn:00087646v__enjoy#v#3>, <BabelSynset: bn:00084526v__wish#v#2>]
    [<BabelLemma: piacere>, <BabelLemma: contentare>, <BabelLemma: provare_gioia>, <BabelLemma: piacere>, <BabelLemma: assecondare>, <BabelLemma: dilettare>, <BabelLemma: preferire>, <BabelLemma: compiacere>, <BabelLemma: piacere>, <BabelLemma: godere>, <BabelLemma: accontentare>, <BabelLemma: appagare>, <BabelLemma: piace>, <BabelLemma: deliziare>, <BabelLemma: gratificare>, <BabelLemma: piacere>, <BabelLemma: prediligere>, <BabelLemma: piacere>, <BabelLemma: amare>, <BabelLemma: soddisfare>]
la
    DET
    []
    []
pizza
    NOUN
    [<BabelSynset: bn:00062694n__pizza#n#1>, <BabelSynset: bn:00066766n__reel#n#1>, <BabelSynset: bn:00012225n__bore#n#1>]
    [<BabelLemma: pizza_al_taglio>, <BabelLemma: mattone>, <BabelLemma: pizzetta>, <BabelLemma: pizza_surgelata>, <BabelLemma: impiastro>, <BabelLemma: pizza_a_taglio>, <BabelLemma: pizza_congelata>, <BabelLemma: pizza_pie>, <BabelLemma: pizze>, <BabelLemma: palla>, <BabelLemma: pasta_della_pizza>, <BabelLemma: bobina>, <BabelLemma: pizza>, <BabelLemma: noioso>, <BabelLemma: pizza>, <BabelLemma: pizza>, <BabelLemma: pizza_classica>, <BabelLemma: noia>, <BabelLemma: forno_da_pizza>, <BabelLemma: lagna>, <BabelLemma: pasta_per_pizza>, <BabelLemma: cataplasma>, <BabelLemma: pizza_tonda>, <BabelLemma: pizzette>, <BabelLemma: pizza_in_teglia>, <BabelLemma: pittima>, <BabelLemma: forno_per_la_pizza>, <BabelLemma: pizza_alla_pala>]
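Since the lemma lists above contain repeats (piacere appears several times), a small helper can collapse them. The sketch below relies only on the token._.babelnet.synsets() and token._.babelnet.lemmas() calls shown in the example; the helper names unique_in_order and summarize, and the use of str() to obtain a comparable key for each object, are assumptions.

```python
from typing import Any, Dict, Iterable, List


def unique_in_order(items: Iterable[Any]) -> List[str]:
    """Convert items to strings and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(str(item) for item in items))


def summarize(doc: Iterable[Any]) -> Dict[str, Dict[str, List[str]]]:
    """Map each annotated token's text to its unique synsets and lemmas.

    Tokens with no synsets (for example pronouns and determiners) are skipped.
    Tokens that share the same text overwrite each other in this simple version.
    """
    summary: Dict[str, Dict[str, List[str]]] = {}
    for token in doc:
        synsets = token._.babelnet.synsets()
        if not synsets:
            continue
        summary[token.text] = {
            "synsets": unique_in_order(synsets),
            "lemmas": unique_in_order(token._.babelnet.lemmas()),
        }
    return summary
```

Given the doc from the example, summarize(doc) would contain entries only for piace and pizza, each with its lemma list deduplicated.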
- add tests
- find a better location for the babelnet configuration (~/babelnet_data?)
- speed up synsets retrieval
- convert BN to a simpler/faster database?
- build a lemma-synset external index?
- added domain and source parameters
- added (word,pos) caching of synsets shared at class level, with json save/load
- mapped other spacy POS types to Babelnet POS types
- filter out tokens with unmapped POS
- moved to faster LKB queries
- added source selection
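The (word, pos) cache with JSON save/load mentioned in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's implementation: the class name SynsetCache, its methods, and the choice to store synset IDs as strings are all hypothetical.

```python
import json
from typing import ClassVar, Dict, List, Optional


class SynsetCache:
    """Class-level cache of synset IDs keyed by (word, pos)."""

    # Shared by all instances. JSON object keys must be strings, so the
    # (word, pos) pair is joined with a tab character.
    _cache: ClassVar[Dict[str, List[str]]] = {}

    @staticmethod
    def _key(word: str, pos: str) -> str:
        return f"{word}\t{pos}"

    @classmethod
    def get(cls, word: str, pos: str) -> Optional[List[str]]:
        """Return the cached synset IDs, or None on a cache miss."""
        return cls._cache.get(cls._key(word, pos))

    @classmethod
    def put(cls, word: str, pos: str, synset_ids: List[str]) -> None:
        """Store a copy of the synset IDs for this (word, pos) pair."""
        cls._cache[cls._key(word, pos)] = list(synset_ids)

    @classmethod
    def save(cls, path: str) -> None:
        """Write the whole cache to a JSON file."""
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(cls._cache, fh, ensure_ascii=False)

    @classmethod
    def load(cls, path: str) -> None:
        """Merge entries from a JSON file into the shared cache."""
        with open(path, encoding="utf-8") as fh:
            cls._cache.update(json.load(fh))
```

Keeping the dictionary at class level means every pipeline instance in the same process benefits from lookups already made, and saving it to JSON lets later runs skip repeated BabelNet queries for the same word and POS.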