Something wrong with Italian stopwords #13150
Replies: 4 comments 1 reply
-
Plain text is actually much better in this case because we can see exactly which characters are used underneath. Can you paste all the code and output as code blocks (using three backticks) instead of using screenshots?

The tokenizer uses regular expressions and exceptions that don't necessarily handle all possible apostrophe characters the same way. Whitespace will also make a difference for the tokenizer output. The current Italian tokenizer defaults are designed to work well for UD Italian ISDT, and you might need to customize it for your own data/task. It looks like these two apostrophe characters are treated as "elision" characters in the middle of a word.

The stop words are a list of strings, and they also don't necessarily include different apostrophe variations. You can customize the stop words too, but it's not quite as easy as the tokenizer settings, because you'd need a custom language class unless you want to modify them programmatically every time you load a new pipeline. But if there are general changes that make sense for …
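A minimal sketch of the programmatic route, assuming spaCy is installed. The entry `dell’` (with the typographic apostrophe U+2019) is only an illustrative addition, not part of the shipped stop-word list:

```python
import spacy

# Sketch: add a stop word at runtime instead of defining a custom
# language class. This has to be repeated each time a pipeline is loaded.
nlp = spacy.blank("it")

word = "dell\u2019"  # elided article written with U+2019 (typographic apostrophe)
nlp.Defaults.stop_words.add(word)
nlp.vocab[word].is_stop = True  # is_stop is also cached on the lexeme, so set it here too

print(nlp.vocab[word].is_stop)  # True
```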
-
Thank you for your answer. The problem occurs only when processing text extracted from PDF files, where the apostrophe "\u2019" is used in place of the corresponding standard ASCII character. I am not expert enough to suggest whether a modification is needed or not. For the time being, I will substitute the "wrong" occurrences with the expected character.
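The substitution described above can be done with a plain `str.replace` before the text reaches the pipeline; `normalize_apostrophes` is just an illustrative helper name:

```python
# Normalize typographic apostrophes (common in PDF-extracted text)
# to the ASCII apostrophe before feeding the text to the pipeline.
APOSTROPHES = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
}

def normalize_apostrophes(text: str) -> str:
    for fancy, plain in APOSTROPHES.items():
        text = text.replace(fancy, plain)
    return text

print(normalize_apostrophes("dell\u2019asino"))  # dell'asino
```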
-
After some more tests, I think there is something to be revised in the Italian stopwords. For example, consider this sentence:
The output is:
So … Additionally, the term … If I compare the lists of Italian stopwords in spaCy and NLTK, I see that spaCy contains many more stopwords, but does not contain the following:
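A comparison like this boils down to a set difference. The sketch below uses tiny invented sample sets, not the real lists (those live in `spacy.lang.it.stop_words.STOP_WORDS` and `nltk.corpus.stopwords.words("italian")`, the latter after downloading the NLTK stopwords corpus):

```python
# Illustrative sketch only: the two sets below are small invented samples,
# not the actual spaCy or NLTK Italian stop-word lists.
spacy_stops = {"di", "a", "della", "dell'"}
nltk_stops = {"di", "a", "dell", "sull"}

# Entries present in NLTK's list but missing from spaCy's:
missing = sorted(nltk_stops - spacy_stops)
print(missing)  # ['dell', 'sull']
```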
-
That makes sense. Thank you for your answer.
-
Please consider the example in the screenshots. I am sharing pictures instead of plain text so that you can see the apostrophe leading to this strange behaviour.
with output:
The first string is split as `[dell', asino]`, but `dell'` is not considered a stopword. The apostrophe is a Unicode character that does not correspond to the one commonly used, but it is still an apostrophe.

The second string is split as `[dell, ', asino]`. It is the same apostrophe as in the first string, and in this case it is separated from `dell`.

The third string is split correctly, since it uses the standard apostrophe.

I cannot understand the rationale behind the first two splits: in the first case the apostrophe is kept with `dell`, in the second it is separated. Is this the expected behaviour? To me it seems a bit weird...
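For reference, the two apostrophe characters can also be told apart in plain text by printing their code points and official Unicode names with the standard library:

```python
import unicodedata

# Print the code point and the official Unicode name of each
# apostrophe-like character, so the difference is visible in plain text.
for ch in ("'", "\u2019"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0027  APOSTROPHE
# U+2019  RIGHT SINGLE QUOTATION MARK
```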