spaCy doesn't remove spaces. For ASCII spaces, the first one after a word is stored on that word and can be recovered using the token.whitespace_ attribute. Other kinds of whitespace (like full-width spaces) and any further spaces in a row are preserved as separate tokens. You asked about this before in #8879.
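
To make that concrete, here is a minimal sketch that prints each token next to its trailing whitespace. It assumes a blank English pipeline (`spacy.blank("en")`), picked only because it needs no model download. The helper `show_whitespace` is a hypothetical name used for illustration, and a Chinese pipeline with a word segmenter may draw token boundaries differently.

```python
import spacy

# Assumption: a blank English pipeline, used here only because it
# requires no trained model. Token boundaries may differ for "zh".
nlp = spacy.blank("en")


def show_whitespace(text):
    """Return (token text, trailing whitespace) pairs for `text`.

    Hypothetical helper for illustration, not part of spaCy.
    """
    doc = nlp(text)
    # Tokenization is non-destructive: joining each token with its
    # trailing whitespace rebuilds the original string.
    assert "".join(token.text_with_ws for token in doc) == text
    return [(token.text, token.whitespace_) for token in doc]


# One ASCII space: stored on the preceding token as whitespace_.
single = show_whitespace("hello world")
# Two ASCII spaces: the extra space should appear as its own token.
double = show_whitespace("hello  world")
# Full-width space (U+3000): should appear as its own token.
fullwidth = show_whitespace("hello\u3000world")

for pairs in (single, double, fullwidth):
    print(pairs)
```

If you need the original string back, `doc.text` or joining `token.text_with_ws` over the doc should give it to you unchanged.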

Answer selected by adrianeboyd
Labels: lang / zh (Chinese language data and models), feat / tokenizer (Feature: Tokenizer)