Can the white space in Chinese segmentation be retained? #11203

lingvisa · 2022-07-25T18:01:03Z

lingvisa
Jul 25, 2022

For example:

贺峻霖 心动

Although doc.text keeps the white space, the token list doesn't. Instead, the start position of the 2nd token (心动) is shifted by one (start+1) to indicate that a white space precedes it. On Chinese social media, especially Weibo, users often use a white space as a delimiter to conceptually isolate words, which can be a good clue for a word boundary, as in the example above. Once the white space is dropped from the token list, how can I conveniently check that there is a space after the token "峻霖"? Is it possible to retain white spaces as normal tokens for Chinese?

Answered by polm

polm · 2022-07-26T03:18:32Z

polm
Jul 26, 2022

spaCy doesn't remove spaces - for ASCII spaces, the first one after a word can be recovered using the token.whitespace_ attribute. For other kinds of spaces (like full-width spaces), or multiple spaces in a row, they are preserved as tokens. You asked about this before in #8879.
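A minimal sketch of the check described above, not a definitive recipe. It assumes spaCy v3, where a blank "zh" pipeline falls back to character segmentation and needs no extra segmenter package; with jieba or pkuseg the token boundaries would differ, but the attributes should behave the same way. The helper name `show` is made up for illustration.

```python
import spacy

# Assumption: spaCy v3, where spacy.blank("zh") segments per character.
nlp = spacy.blank("zh")


def show(text):
    """Return (token text, trailing whitespace) pairs for every token."""
    return [(t.text, t.whitespace_) for t in nlp(text)]


ascii_pairs = show("贺峻霖 心动")       # one ASCII space
wide_pairs = show("贺峻霖\u3000心动")   # one full-width (ideographic) space

# Tokens that are directly followed by an ASCII space.
followed = [text for text, ws in ascii_pairs if ws]

# Tokens that are themselves whitespace (the full-width space is kept as a token).
space_tokens = [text for text, ws in wide_pairs if text.isspace()]

# The original string can be rebuilt from the tokens and their trailing whitespace.
rebuilt = "".join(text + ws for text, ws in ascii_pairs)
```

So `bool(token.whitespace_)` answers "is there a space right after this token?" without the space having to be a token of its own.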

3 replies

lingvisa Jul 26, 2022
Author

I understand now. For the example above, the segmentation would be:
`[贺峻霖 space 心动]`

So bool(token[1].whitespace_) is True, and it is False for every other token.

That's good enough and thanks!

lingvisa Sep 16, 2022
Author

One more question on this: for '我们是' (We are), the idx of 是 is 2, but for '我们 是', the idx of 是 is 3 because of the white space. Is this intended?

polm Sep 18, 2022

Yes, that's intentional - idx is the index in the original string. For any token, if text is the input to nlp, token.text[0] == text[token.idx].
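To make the offsets concrete, here is a small sketch under the same assumption as before (spaCy v3 with a blank "zh" pipeline, which segments per character). The helper name `offsets` is hypothetical.

```python
import spacy

# Assumption: spaCy v3, where spacy.blank("zh") segments per character.
nlp = spacy.blank("zh")


def offsets(text):
    """Map each token's text to its character offset (token.idx) in `text`."""
    doc = nlp(text)
    # token.idx points into the original string, so slicing there gives the token back.
    assert all(text[t.idx:t.idx + len(t.text)] == t.text for t in doc)
    return {t.text: t.idx for t in doc}


no_space = offsets("我们是")
with_space = offsets("我们 是")

# The space shifts 是 one character to the right in the original string.
shift = with_space["是"] - no_space["是"]
```

Because idx counts characters in the input, any whitespace before a token moves its idx, whether or not that whitespace shows up as a token.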
