Chinese tokenization is bad #9860
Replies: 3 comments 4 replies
- Note that this cannot be repro'ed with displaCy because of #9857.
- By default the Chinese tokenizer does character tokenization, so this is the expected behavior (see https://spacy.io/usage/models#chinese). The multi-language tokenizer is a rule-based tokenizer for languages with whitespace between tokens and is not intended for use with Chinese or Japanese.
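  For reference, this is roughly the setup the linked docs describe: a sketch assuming spaCy v3 with the `jieba` and `spacy-pkuseg` packages installed.

  ```python
  from spacy.lang.zh import Chinese

  # Default: character segmentation, one token per character
  nlp = Chinese()

  # Jieba word segmentation (requires the jieba package)
  nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})

  # PKUSeg word segmentation (requires spacy-pkuseg)
  nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "pkuseg"}}})
  nlp.tokenizer.initialize(pkuseg_model="mixed")
  ```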
- I love having the option of different tokenizers, but when I choose Jieba or Pkuseg in the way recommended at https://spacy.io/usage/models#chinese, I lose almost all word data, such as POS tags. Is there a way to tokenize with Jieba or Pkuseg while still getting all that word data?
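  Not an answer from this thread, just a sketch of one likely route: the `spacy.blank`-style setup from the docs creates a pipeline with only a tokenizer and no trained components, so nothing ever assigns POS. A trained pipeline such as `zh_core_web_sm` does include a tagger and parser (and, as far as I know, already uses word-level pkuseg segmentation rather than per-character splitting).

  ```python
  import spacy

  # Trained Chinese pipeline: word-level segmentation plus tagger/parser/NER,
  # so tokens come back with POS and other attributes filled in.
  nlp = spacy.load("zh_core_web_sm")

  doc = nlp("这是一个测试句子")  # illustrative sentence, not from this thread
  for token in doc:
      print(token.text, token.pos_, token.dep_)
  ```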

-
Both the Chinese-specific and the multilanguage tokenizers are so bad as to be unusable.
Noticed by @phasmik
How to reproduce the behaviour
Chinese
Actual output:
i.e. it splits on every character boundary.
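A minimal reproduction along these lines shows the reported behaviour (the input sentence is illustrative, not the one from the original report):

```python
import spacy

nlp = spacy.blank("zh")        # default Chinese tokenizer: character segmentation
doc = nlp("这是一个测试句子")    # illustrative input
print([t.text for t in doc])   # one token per character
```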
Multilanguage
Actual output:
i.e. no tokenization happened.
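The same illustrative input through the multi-language pipeline:

```python
import spacy

nlp = spacy.blank("xx")        # multi-language, whitespace/rule-based tokenizer
doc = nlp("这是一个测试句子")    # no whitespace, so there is nothing to split on
print([t.text for t in doc])   # typically comes back as a single token
```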
Expected behaviour
Using https://pypi.org/project/jieba/
Input:
Output:
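As an illustration of the expected word-level behaviour (again with a sample sentence rather than the original input):

```python
import jieba

# jieba returns multi-character words instead of single characters
print(list(jieba.cut("这是一个测试句子")))
```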
Your Environment