Something wrong with Italian stopwords #13150
Replies: 4 comments 1 reply
-
Plain text is actually much better in this case because we can see exactly which characters are used underneath. Can you paste all the code and output as code blocks (using three backticks) instead of using screenshots?

The tokenizer uses regular expressions and exceptions that don't necessarily handle all possible apostrophe characters the same way. Whitespace will also make a difference for the tokenizer output. The current Italian tokenizer defaults are designed to work well for UD Italian ISDT, and you might need to customize it for your own data/task. It looks like these two apostrophe characters are treated as "elision" characters in the middle of a word.

The stop words are a list of strings, and they also don't necessarily include different apostrophe variations. You can customize the stop words too, but it's not quite as easy as the tokenizer settings, because you'd need a custom language class unless you want to modify them programmatically every time you load a new pipeline. But if there are general changes that make sense for …
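A minimal sketch of the programmatic route, assuming spaCy is installed. The entry `dell’` (with the typographic apostrophe U+2019) is only an illustrative addition, not part of the shipped stop-word list:

```python
import spacy

# Sketch: add a stop word at runtime instead of defining a custom
# language class. This has to be repeated each time a pipeline is loaded.
nlp = spacy.blank("it")

word = "dell\u2019"  # elided article written with U+2019 (typographic apostrophe)
nlp.Defaults.stop_words.add(word)
nlp.vocab[word].is_stop = True  # is_stop is also cached on the lexeme, so set it here too

print(nlp.vocab[word].is_stop)  # True
```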
-
Thank you for your answer. The problem occurs only when processing text extracted from PDF files, where the apostrophe "\u2019" is used in place of the corresponding standard ASCII character. I am not expert enough to suggest whether a modification is needed or not. For the time being, I will substitute the "wrong" occurrences with the expected character.
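The substitution described above can be done with a plain `str.replace` before the text reaches the pipeline; `normalize_apostrophes` is just an illustrative helper name:

```python
# Normalize typographic apostrophes (common in PDF-extracted text)
# to the ASCII apostrophe before feeding the text to the pipeline.
APOSTROPHES = {
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
}

def normalize_apostrophes(text: str) -> str:
    for fancy, plain in APOSTROPHES.items():
        text = text.replace(fancy, plain)
    return text

print(normalize_apostrophes("dell\u2019asino"))  # dell'asino
```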
-
After some more tests, I think there is something to be revised in the Italian stopwords. For example, consider this sentence:
The output is:
So … Additionally, the term … If I compare the lists of Italian stopwords in spaCy and NLTK, I see that spaCy contains many more stopwords, but does not contain the following:
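A comparison like this boils down to a set difference. The sketch below uses tiny invented sample sets, not the real lists (those live in `spacy.lang.it.stop_words.STOP_WORDS` and `nltk.corpus.stopwords.words("italian")`, the latter after downloading the NLTK stopwords corpus):

```python
# Illustrative sketch only: the two sets below are small invented samples,
# not the actual spaCy or NLTK Italian stop-word lists.
spacy_stops = {"di", "a", "della", "dell'"}
nltk_stops = {"di", "a", "dell", "sull"}

# Entries present in NLTK's list but missing from spaCy's:
missing = sorted(nltk_stops - spacy_stops)
print(missing)  # ['dell', 'sull']
```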
-
That makes sense. Thank you for your answer.
-
Please consider the example in the screenshots. I am sharing pictures instead of plain text so that you can see the apostrophe leading to this strange behaviour.
with output:
The first string is split as `[dell', asino]`, but `dell'` is not considered a stopword. The apostrophe is a Unicode character that does not correspond to the one commonly used, but it is still an apostrophe.

The second string is split as `[dell, ', asino]`. It is the same apostrophe as in the first string, and in this case it is separated from `dell`.

The third string is split correctly, since it uses the standard apostrophe.

I cannot understand the rationale behind the first two splits: in the first case the apostrophe is kept with `dell`, in the second it is separated. Is this the expected behaviour? To me it seems a bit weird...
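For reference, the two apostrophe characters can also be told apart in plain text by printing their code points and official Unicode names with the standard library:

```python
import unicodedata

# Print the code point and the official Unicode name of each
# apostrophe-like character, so the difference is visible in plain text.
for ch in ("'", "\u2019"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0027  APOSTROPHE
# U+2019  RIGHT SINGLE QUOTATION MARK
```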