Can the white space in Chinese segmentation be retained? #11203
For example:
Although doc.text keeps the white space, the token list doesn't. Instead, the start position of the second token (心动) is shifted by one (start+1) to indicate that a white space precedes it. On Chinese social media, especially Weibo, users often use a white space as a delimiter to conceptually separate words, which can be a good clue for word boundaries, as shown in the example above. Once the white space is removed, how can I conveniently check that there is a space after the token "峻霖"? Is it possible to retain white spaces as normal tokens for Chinese?
Replies: 1 comment 3 replies
spaCy doesn't remove spaces - for ASCII spaces, the first one after a word can be recovered using the token.whitespace_ attribute. For other kinds of spaces (like full-width spaces), or multiple spaces in a row, they are preserved as tokens. You asked about this before in #8879.
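
As a rough illustration of checking for a trailing space with token.whitespace_, here is a minimal sketch. The Doc is built by hand so that it does not depend on any particular Chinese segmenter; the two-token split of "峻霖 心动" is an assumption based on the tokens named in the question, and the helper names `has_trailing_space` and `with_space_tokens` are hypothetical, not part of spaCy.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built Doc standing in for tokenizer output. The segmentation
# ["峻霖", "心动"] is an assumption, not the result of a real zh pipeline.
doc = Doc(Vocab(), words=["峻霖", "心动"], spaces=[True, False])


def has_trailing_space(token):
    """Hypothetical helper: True if an ASCII space directly follows the token."""
    return token.whitespace_ == " "


def with_space_tokens(doc):
    """Hypothetical helper: token texts with each trailing space as its own item."""
    pieces = []
    for token in doc:
        pieces.append(token.text)
        if token.whitespace_:
            pieces.append(token.whitespace_)
    return pieces


print(doc.text)                                   # original text, space included
print([(t.text, t.whitespace_) for t in doc])     # per-token trailing whitespace
print(with_space_tokens(doc))                     # spaces expanded into list items
```

If the space is needed as a separate item downstream, expanding it after tokenization as in `with_space_tokens` may be simpler than changing the tokenizer, since the Doc already stores the information.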