token.is_stop not the same as token.lemma_.lower() in nlp.Defaults.stop_words #8247

ariansajina · 2021-05-28T22:23:19Z

How to reproduce the behaviour

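import spacy

nlp = spacy.load("en_core_web_sm")

example = 'Both go and goes should be removed as stopwords.'

# Run the pipeline on the example sentence
d = nlp(example)

# Tokens kept when filtering on the surface form via token.is_stop
result_1 = [token for token in d if not token.is_stop]

# Lemmas kept when filtering on the lowercased lemma instead
result_2 = [token.lemma_ for token in d if token.lemma_.lower() not in nlp.Defaults.stop_words]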

Your Environment

spaCy version: 3.0.6
Platform: macOS-10.16-x86_64-i386-64bit
Python version: 3.8.5
Pipelines: en_core_web_sm (3.0.0)

Answered by adrianeboyd

May 31, 2021

Hi, the is_stop attribute is designed so that it works without any additional pipeline components like a tagger or a lemmatizer, so by default it currently only checks whether the lowercase form of the token is in the stop word list. You can extend the stop word list by customizing the language defaults before loading the model or by creating a custom language. See: https://spacy.io/usage/linguistic-features#language-subclass
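
To make the difference concrete, here is a minimal pure-Python sketch of the two checks being compared. It does not call spaCy: the stop list, the hand-written lemmas, and the helper names keep_by_surface and keep_by_lemma are all assumptions made up for illustration, so the real stop list and lemmatizer output may differ.

```python
# Toy stop list and hand-written lemmas: illustrative assumptions only,
# not spaCy's real stop list or the lemmatizer's actual output.
STOP_WORDS = {"both", "go", "and", "should", "be", "as"}

# (surface text, assumed lemma) pairs for the sentence from the question.
TOKENS = [
    ("Both", "both"), ("go", "go"), ("and", "and"), ("goes", "go"),
    ("should", "should"), ("be", "be"), ("removed", "remove"),
    ("as", "as"), ("stopwords", "stopword"), (".", "."),
]


def keep_by_surface(tokens, stop_words):
    """Keep tokens whose lowercased text is not in the stop list."""
    return [text for text, _ in tokens if text.lower() not in stop_words]


def keep_by_lemma(tokens, stop_words):
    """Keep tokens whose lowercased lemma is not in the stop list."""
    return [text for text, lemma in tokens if lemma.lower() not in stop_words]


surface_kept = keep_by_surface(TOKENS, STOP_WORDS)
lemma_kept = keep_by_lemma(TOKENS, STOP_WORDS)
# Adding the inflected form to the list makes the surface check catch it too.
extended_kept = keep_by_surface(TOKENS, STOP_WORDS | {"goes"})

print(surface_kept)   # ['goes', 'removed', 'stopwords', '.']
print(lemma_kept)     # ['removed', 'stopwords', '.']
print(extended_kept)  # ['removed', 'stopwords', '.']
```

In this sketch "goes" survives the surface check because only its lemma "go" is in the list, which mirrors the mismatch in the question. With a real pipeline the pairs would come from token.text and token.lemma_, and the set from nlp.Defaults.stop_words.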

In general, for modern NLP techniques, it's not helpful to remove stop words, so you may not need this step at all. Some relevant discussions: #7228 (comment), #7637 (comment)


ariansajina · 2021-05-28T22:27:10Z

To give a concrete example: since "go" is a stop word, I assumed token.is_stop would also be True for a token whose text is "goes", and likewise for other inflected forms.


