SpaCy removes a portion of the string when it contains a stopword separated by digits #12550

JarClass · 2023-04-19T15:30:21Z

I'm trying to use spaCy to remove stopwords from a pandas DataFrame created from a CSV. I need to account for strings that mix letters and digits.

My issue:

If digits split a word so that one part of it is a stopword, that part is deleted when I loop through to remove stopwords. How I'm currently removing stopwords:

df[col] = df[col].apply(
    lambda text: "".join(token.lemma_ for token in nlp(text) if not token.is_stop)
)

Through experimentation with different strings I've found:

Ex. With a stopword at the end
    Input: 'co555in'
    Breaks up the word, separating it into 'co555' + 'in'
    Removes 'in' because it is a stopword.
    Output: 'co555'

Ex. Without a stopword at the end. This is the behavior I'd like for 'co555in' as well.
    Input: 'co555inn'
    Does not separate and passes 'co555inn'
    Will not remove 'inn' because it is not a stopword.
    Output: 'co555inn'

Ex. Regular word being passed
    Input: 'coin'
    Does not separate and passes 'coin'
    Will not remove anything because there are no stopwords.
    Output: 'coin'

Oddly enough, it also does this:
Ex. Stopword on both sides of the digits
    Input: 'in555in'
    Breaks up the word during tokenization, separating it into 'in555' + 'in'
    Removes the second 'in' because it is a stopword.
    Output: 'in555'

I understand that spaCy treats "wont" as "won't", which is conceptually two tokens: "will" and "not". For some reason it does something similar here, but only when a number precedes the stopword.

Answered by rmitsch

Apr 20, 2023

Hi @JarClass! This is happening because of spaCy's tokenization: the second "in" in "co555in" is split off as a separate token, and that token is then removed because it is a stopword. Note that "co555in" is not a word you'd expect in a natural language corpus, which is why the tokenization might not work the way you'd want it to.
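One possible explanation for the exact pattern reported above, offered as an assumption rather than something confirmed in this thread: spaCy's default suffix rules appear to split unit abbreviations (such as "in" for inches) off a preceding digit. The sketch below is a simplified pure-Python model of such a rule, not spaCy's actual code; `UNITS` is a hypothetical, heavily shortened list and `split_unit_suffix` is an illustrative helper. It reproduces the four observations from the question.

```python
import re

# Hypothetical, shortened stand-in for a list of unit abbreviations.
# This is NOT spaCy's real data, only an illustration of the mechanism.
UNITS = ["in", "ft", "km", "kg"]

# A unit counts as a suffix only when it ends the string and directly follows a digit.
UNIT_SUFFIX = re.compile(r"(?<=[0-9])(?:" + "|".join(UNITS) + r")$")


def split_unit_suffix(word):
    """Split a trailing unit abbreviation off a word if a digit precedes it."""
    match = UNIT_SUFFIX.search(word)
    if match is None:
        return [word]
    return [word[: match.start()], match.group()]


for example in ["co555in", "co555inn", "coin", "in555in"]:
    print(example, split_unit_suffix(example))
```

Under this model, 'co555inn' and 'coin' stay whole because "inn" is not in the unit list and the "in" in 'coin' does not follow a digit, while only the trailing "in" of 'in555in' is split off.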

I recommend looking at our docs for customizing the tokenizer - that should help you to modify the tokenization rules.
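As a minimal sketch of one such customization, assuming spaCy v3 and a blank English pipeline (no trained model): the tokenizer's `token_match` can be set to a function that tells it to keep matching strings whole. The regular expression and the helper `remove_stopwords` are illustrative assumptions, and assigning `token_match` replaces any matcher already configured there, so adapt this to your data.

```python
import re

import spacy

# Two independent blank English pipelines, so the default and the customized
# tokenization can be compared side by side.
default_nlp = spacy.blank("en")
custom_nlp = spacy.blank("en")

# Assumed pattern: letters, then digits, then letters (e.g. "co555in").
# Strings that match are kept as a single token instead of being split.
keep_whole = re.compile(r"^[A-Za-z]+[0-9]+[A-Za-z]+$")
custom_nlp.tokenizer.token_match = keep_whole.match


def remove_stopwords(text):
    """Drop stopword tokens and re-join the remaining token texts with spaces."""
    # token.text is used here because a blank pipeline has no lemmatizer.
    return " ".join(token.text for token in custom_nlp(text) if not token.is_stop)


print([token.text for token in default_nlp("co555in")])
print([token.text for token in custom_nlp("co555in")])
print(remove_stopwords("co555in"))
```

Because the pattern requires leading letters, a plain measurement such as "5in" does not match and is still tokenized by the default rules. With a trained pipeline, `token.lemma_` could be used in place of `token.text`, as in the original snippet.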

