token.is_stop not the same as token.lemma_.lower() in nlp.Defaults.stop_words #8247
-
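How to reproduce the behaviour

```python
import spacy

nlp = spacy.load("en_core_web_sm")
example = 'Both go and goes should be removed as stopwords.'
doc = nlp(example)  # run the pipeline on the example text

# Filter on the surface form via token.is_stop
result_1 = [token for token in doc if not token.is_stop]
# Filter on the lowercased lemma instead
result_2 = [token.lemma_ for token in doc if token.lemma_.lower() not in nlp.Defaults.stop_words]
```

Your Environment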
Replies: 2 comments
-
To give a concrete example: given that "go" is a stop word, I assumed `token.is_stop` would be True when `token.text` is "goes", and so on.
-
Hi, the `is_stop` attribute is designed so that it works without any additional pipeline components like a `tagger` or a `lemmatizer`, so by default it currently only checks whether the lowercase form of the token is in the stop word list. You can extend the stop word list by customizing the language defaults before loading the model or by creating a custom language. See: https://spacy.io/usage/linguistic-features#language-subclass

In general, for modern NLP techniques, it's not helpful to remove stop words, so you may not need this step at all. Some relevant discussions: #7228 (comment), #7637 (comment)
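To make the difference concrete, here is a minimal plain-Python sketch of the two checks being compared in this thread. It does not call spaCy at all: `STOP_WORDS`, `LEMMAS`, `is_stop_surface` and `is_stop_lemma` are hypothetical stand-ins invented for illustration, and the tiny word set and lemma table are assumptions, not spaCy's real data. It only mirrors the behaviour described above (lowercase surface form looked up in the list) next to the lemma-based lookup from the question.

```python
# Toy stand-ins (hypothetical): a tiny stop word set and lemma table,
# NOT spaCy's real data.
STOP_WORDS = {"both", "go", "and", "should", "be", "as"}
LEMMAS = {"goes": "go", "removed": "remove", "stopwords": "stopword"}


def is_stop_surface(text, stop_words=STOP_WORDS):
    """Mimic the default check: only the lowercase surface form is looked up."""
    return text.lower() in stop_words


def is_stop_lemma(text, stop_words=STOP_WORDS, lemmas=LEMMAS):
    """Lemma-aware variant: map the token to its lemma, then look that up."""
    lowered = text.lower()
    lemma = lemmas.get(lowered, lowered)
    return lemma in stop_words


tokens = ["Both", "go", "and", "goes", "should", "be",
          "removed", "as", "stopwords", "."]

kept_surface = [t for t in tokens if not is_stop_surface(t)]
kept_lemma = [t for t in tokens if not is_stop_lemma(t)]

print(kept_surface)  # ['goes', 'removed', 'stopwords', '.']
print(kept_lemma)    # ['removed', 'stopwords', '.']

# Adding the inflected form to the list makes the surface check catch it too.
extended = STOP_WORDS | {"goes"}
kept_extended = [t for t in tokens if not is_stop_surface(t, extended)]
print(kept_extended)  # ['removed', 'stopwords', '.']
```

With the surface check, "goes" survives because only "go" is in the list. Either looking up the lemma or adding the inflected form to the list removes it, which is the trade-off the reply describes: the first needs a lemmatizer in the pipeline, the second only needs a longer list.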