Wrong location detection in Spanish #8778

rjgarciar · 2021-07-21T08:12:22Z

How to reproduce the behaviour

With this simple code:

import spacy

text = """El auge de la variante Delta -originaria de la India- preocupa cada vez más por su alta transmisibilidad"""
nlp = spacy.load("es_core_news_lg")
doc = nlp(text)
# Print each detected entity as text|label, one per line
print(*[f'{ent.text}|{ent.label_}' for ent in doc.ents], sep='\n')

the output is:

Delta|ORG
India-|LOC

As you can see, it detects "India-" (with the trailing hyphen) instead of "India" as a location.

Your Environment

Operating System: Windows 10 Pro
Python Version Used: 3.7.5
spaCy Version Used: 3.1.0
Environment Information: conda 4.7.12

polm · 2021-07-21T08:39:25Z

First, for general notes about wrong model predictions, please see #3052. It's important to understand that the models are statistical and will be wrong sometimes, even in apparently simple cases.

This is a bit of a special case because this isn't really a wrong prediction so much as undesirable tokenizer behavior. spaCy is designed with newspaper articles as the default model of text, and the way your text is punctuated is kind of unusual: the hyphens are used like parenthetical dashes, attached directly to a word on one side with a space on the other. It looks like the behavior is the same in English and Spanish for hyphens like you have, and that means that "India-" is one token here.
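One way to confirm this is to look at the tokens directly. The following is a minimal sketch, assuming a blank Spanish pipeline (spacy.blank("es")) uses the same default tokenizer rules as es_core_news_lg; it is used here only so that no model download is needed. The variable names are illustrative.

```python
import spacy

# Assumption: a blank "es" pipeline shares the default Spanish tokenizer
# rules with the trained pipeline, so it is enough to inspect tokenization.
nlp_default = spacy.blank("es")

text = "El auge de la variante Delta -originaria de la India- preocupa cada vez más"

# Plain token texts produced by the default tokenizer
default_tokens = [token.text for token in nlp_default(text)]
print(default_tokens)

# tokenizer.explain() reports which rule produced each token,
# which helps when deciding what to change
for rule, token_text in nlp_default.tokenizer.explain(text):
    print(rule, repr(token_text))
```

If "India-" shows up as a single entry in the list, the entity recognizer has no way to label only the "India" part.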

spaCy can't apply entity labels to sub-parts of a token, so you need to modify the way the tokenizer works to fix this. I would take a good look at the tokenizer docs. I think you can fix this by adding - as a prefix/suffix character.
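A minimal sketch of that suggestion, under some assumptions: it uses spacy.blank("es") so it runs without a model download (the same two assignments should apply to the tokenizer of a loaded es_core_news_lg pipeline), and the prefix pattern deliberately skips hyphens followed by a digit so that something like "-5" is not split, which is a choice made here and not part of the original advice.

```python
import spacy
from spacy.util import compile_prefix_regex, compile_suffix_regex

# Assumption: blank pipeline for illustration; with a trained pipeline,
# apply the same changes to nlp.tokenizer after spacy.load(...).
nlp = spacy.blank("es")

# Extend the default rules: "-" as a prefix (unless followed by a digit)
# and "-" as a suffix.
prefixes = list(nlp.Defaults.prefixes) + [r"-(?![0-9])"]
suffixes = list(nlp.Defaults.suffixes) + [r"-"]

nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

text = "El auge de la variante Delta -originaria de la India- preocupa cada vez más"
tokens = [token.text for token in nlp(text)]
print(tokens)
```

With this change "India" and "-" should come out as separate tokens, so the entity recognizer can label "India" on its own. Note that a trained model saw the default tokenization during training, so its predictions may shift slightly after the tokenizer is modified.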

1 reply

rjgarciar Jul 21, 2021
Author

Yes, I know the way my text is punctuated is unusual (this text was extracted from a news article published by a Spanish newspaper). I will try your suggestion about the tokenizer.

Thanks
