SpaCy removes a portion of the string when it contains a stopword separated by digits #12550
I'm trying to use spaCy to remove stopwords from a pandas DataFrame created from a CSV. I need to account for strings that mix letters and numbers. The problem: if digits separate a string so that one part of it matches a stopword, that part is deleted when I loop through to remove stopwords. How I'm removing stopwords currently:
Through experimentation with different strings I've found:
I understand that spaCy recognizes "wont" as "won't", which is conceptually two tokens - "will" and "not". For some reason it is doing something similar, but only when there is a number preceding the stopword.
Replies: 1 comment
Hi @JarClass! This is happening because of spaCy's tokenization - the second "in" in "co555in" is determined to be a separate token, which is why it's removed due to being a stopword. Note that "co555in" is not a word you'd expect in a natural language corpus, which is why the tokenization might not work the way you'd want it to.
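If you want to see which rule causes the split, the tokenizer can report it. This is only a minimal sketch: it assumes a blank English pipeline, and `explain_tokens` is a helper name made up for illustration. My guess is that the suffix rule for units (a digit followed by something like "in" for inches) is responsible, but check the printed output rather than taking that on faith.

```python
import spacy

# Assumption: a blank English pipeline is enough here, since only the
# tokenizer and the stop word list are involved (no trained model needed).
nlp = spacy.blank("en")


def explain_tokens(text):
    """Return (rule_name, token_text) pairs from the tokenizer's debug helper."""
    return nlp.tokenizer.explain(text)


# Print each piece, the rule that produced it, and whether it is a stopword.
for rule, piece in explain_tokens("co555in"):
    print(rule, repr(piece), nlp.vocab[piece].is_stop)
```

Any piece reported with `True` in the last column is what a stopword filter would drop.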
I recommend looking at our docs for customizing the tokenizer - that should help you to modify the tokenization rules.
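As one possible starting point (a sketch under assumptions, not the only way to do it): if your strings only need to be split on whitespace, you can swap in a tokenizer that has no prefix, suffix, or infix rules at all. The function name `remove_stopwords` is hypothetical. The trade-off is that punctuation stays attached to words, so "in," would no longer be recognized as a stopword.

```python
import spacy
from spacy.tokenizer import Tokenizer

# Assumption: a blank English pipeline, used only for its vocab and stop list.
nlp = spacy.blank("en")

# A Tokenizer built from the vocab alone has no prefix/suffix/infix rules,
# so it splits on whitespace only and leaves strings like "co555in" intact.
nlp.tokenizer = Tokenizer(nlp.vocab)


def remove_stopwords(text: str) -> str:
    """Drop stopword tokens and join the remaining tokens with single spaces."""
    doc = nlp(text)
    return " ".join(tok.text for tok in doc if not tok.is_stop)


print(remove_stopwords("co555in is in the box"))
```

On a DataFrame this could be applied per cell, e.g. `df["text"].apply(remove_stopwords)`, where `df` and the column name `"text"` are placeholders for your own data. If you do need punctuation handling, the alternative is to keep the default tokenizer and adjust only the suffix rules, as described in the tokenizer customization docs.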