Chinese tokenization is bad #9860
Replies: 3 comments 4 replies
- Note that this cannot be repro'ed with displaCy because of #9857.
- By default the Chinese tokenizer does character tokenization, so this is the expected behavior (see https://spacy.io/usage/models#chinese). The multi-language tokenizer is a rule-based tokenizer for languages with whitespace between tokens and is not intended for use with Chinese or Japanese.
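  For reference, this is roughly the setup the linked docs describe: a sketch assuming spaCy v3 with the `jieba` and `spacy-pkuseg` packages installed.

  ```python
  from spacy.lang.zh import Chinese

  # Default: character segmentation, one token per character
  nlp = Chinese()

  # Jieba word segmentation (requires the jieba package)
  nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})

  # PKUSeg word segmentation (requires spacy-pkuseg)
  nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "pkuseg"}}})
  nlp.tokenizer.initialize(pkuseg_model="mixed")
  ```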
- I love having the option of different tokenizers, but when I choose Jieba or Pkuseg in the way recommended at https://spacy.io/usage/models#chinese, I lose almost all word data, such as POS tags. Is there a way to tokenize with Jieba or Pkuseg while still getting all that word data?
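  Not an answer from this thread, just a sketch of one likely route: the `spacy.blank`-style setup from the docs creates a pipeline with only a tokenizer and no trained components, so nothing ever assigns POS. A trained pipeline such as `zh_core_web_sm` does include a tagger and parser (and, as far as I know, already uses word-level pkuseg segmentation rather than per-character splitting).

  ```python
  import spacy

  # Trained Chinese pipeline: word-level segmentation plus tagger/parser/NER,
  # so tokens come back with POS and other attributes filled in.
  nlp = spacy.load("zh_core_web_sm")

  doc = nlp("这是一个测试句子")  # illustrative sentence, not from this thread
  for token in doc:
      print(token.text, token.pos_, token.dep_)
  ```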

-
Both the Chinese-specific and the multilanguage tokenizers are so bad as to be unusable.
Noticed by @phasmik
How to reproduce the behaviour
Chinese
Actual output:
i.e. it splits on every character boundary.
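A minimal reproduction along these lines shows the reported behaviour (the input sentence is illustrative, not the one from the original report):

```python
import spacy

nlp = spacy.blank("zh")        # default Chinese tokenizer: character segmentation
doc = nlp("这是一个测试句子")    # illustrative input
print([t.text for t in doc])   # one token per character
```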
Multilanguage
Actual output:
i.e. no tokenization happened.
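The same illustrative input through the multi-language pipeline:

```python
import spacy

nlp = spacy.blank("xx")        # multi-language, whitespace/rule-based tokenizer
doc = nlp("这是一个测试句子")    # no whitespace, so there is nothing to split on
print([t.text for t in doc])   # typically comes back as a single token
```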
Expected behaviour
Using https://pypi.org/project/jieba/
Input:
Output:
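As an illustration of the expected word-level behaviour (again with a sample sentence rather than the original input):

```python
import jieba

# jieba returns multi-character words instead of single characters
print(list(jieba.cut("这是一个测试句子")))
```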
Your Environment