Hi @JarClass! This is happening because of spaCy's tokenization: the second "in" in "co555in" is split off as a separate token, and it is then removed because it's a stop word. Note that "co555in" is not a word you'd expect in a natural language corpus, which is why the tokenization might not work the way you'd want it to.

I recommend looking at our docs for customizing the tokenizer - that should help you to modify the tokenization rules.
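
As a rough illustration of one way to customize the tokenizer, here is a minimal sketch that registers a tokenizer special case so the exact string "co555in" stays in one piece. It assumes a blank English pipeline (`spacy.blank("en")`) rather than whatever pipeline you are actually loading, and a special case only covers that one literal string, so treat it as a starting point rather than a general fix.

```python
import spacy
from spacy.symbols import ORTH

# Assumption: a blank English pipeline, which ships the default English
# tokenizer rules and stop-word list without needing a trained model.
nlp = spacy.blank("en")

# Register a special case: the exact string "co555in" is kept as one token
# instead of being split by the default suffix rules.
nlp.tokenizer.add_special_case("co555in", [{ORTH: "co555in"}])

doc = nlp("co555in")

# All tokens produced by the customized tokenizer.
tokens = [token.text for token in doc]

# Tokens that survive stop-word filtering.
kept = [token.text for token in doc if not token.is_stop]

print(tokens)
print(kept)
```

If you have many strings like this, a special case per string won't scale, and adjusting the suffix rules or setting a `token_match` function may be a better fit. Calling `nlp.tokenizer.explain("co555in")` before adding the special case should show which rule is responsible for the split.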

Answer selected by JarClass