Chinese tokenization is bad

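Both the Chinese-specific tokenizer and the multi-language tokenizer produce output that is unusable for Chinese text: the first splits on every character, and the second does not split at all.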

Noticed by @phasmik

## How to reproduce the behaviour

### Chinese
```
from spacy.lang.zh import Chinese
nlp_ch = Chinese()
print(*nlp_ch('2017年主要经济指标如下：国内生产总值（GDP）：1.72万亿欧元，人均国内生产总值：28290欧元'), sep='\n')
```
Actual output:
```
2
0
1
7
年
主
要
经
济
指
标
如
下
：
国
内
生
产
总
值
（
G
D
P
）
：
1
.
7
2
万
亿
欧
元
，
人
均
国
内
生
产
总
值
：
2
8
2
9
0
欧
元
```
i.e. it splits on every character boundary.

### Multilanguage
```
from spacy.lang.xx import MultiLanguage
nlp = MultiLanguage()
print(*nlp('2017年主要经济指标如下：国内生产总值（GDP）：1.72万亿欧元，人均国内生产总值：28290欧元'), sep='\n')
```

Actual output:
```
2017年主要经济指标如下：国内生产总值（GDP）：1.72万亿欧元，人均国内生产总值：28290欧元
```
i.e. no tokenization happened.
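
One possible explanation, offered as an assumption rather than a confirmed diagnosis: a tokenizer that starts from whitespace-delimited chunks has nothing to split on here, because the sentence contains no whitespace at all. The standard-library sketch below (not spaCy's code) shows that plain whitespace splitting returns the input as a single chunk.

```python
TEXT = '2017年主要经济指标如下：国内生产总值（GDP）：1.72万亿欧元，人均国内生产总值：28290欧元'

# str.split() with no arguments splits on runs of Unicode whitespace.
chunks = TEXT.split()

# Full-width punctuation such as '：' and '，' is not whitespace,
# so the whole sentence stays in one chunk.
print(len(chunks))
print(chunks[0] == TEXT)
```

If this is what happens inside `MultiLanguage`, any further splitting would have to come from punctuation rules rather than from whitespace.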

## Expected behaviour

Word-level segmentation comparable to what jieba (https://pypi.org/project/jieba/) produces.

Input:
```
import jieba
print(*(word for word, start, end in jieba.tokenize('2017年主要经济指标如下：国内生产总值（GDP）：1.72万亿欧元，人均国内生产总值：28290欧元')), sep='\n')
```

Output:
```
2017
年
主要
经济指标
如下
：
国内
生产总值
（
GDP
）
：
1.72
万亿
欧元
，
人均
国内
生产总值
：
28290
欧元
```
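
For comparison, the spaCy v3 documentation describes a `segmenter` setting for the Chinese tokenizer, with per-character segmentation as the default and `jieba` as an alternative. The following is a minimal sketch of selecting it, assuming spaCy v3 and the `jieba` package are installed. The names `JIEBA_CONFIG`, `build_jieba_pipeline` and `tokenize` are hypothetical helpers introduced only for this example, and the sketch has not been checked against 3.0.3 specifically.

```python
# Sketch only: assumes spaCy v3 and the jieba package are installed.
TEXT = '2017年主要经济指标如下：国内生产总值（GDP）：1.72万亿欧元，人均国内生产总值：28290欧元'

# Config override selecting the jieba segmenter instead of the
# default per-character one.
JIEBA_CONFIG = {"nlp": {"tokenizer": {"segmenter": "jieba"}}}


def build_jieba_pipeline():
    """Return a blank Chinese pipeline that segments with jieba."""
    # Imported lazily so this module loads even where spaCy is absent.
    from spacy.lang.zh import Chinese
    return Chinese.from_config(JIEBA_CONFIG)


def tokenize(text):
    """Return the token texts produced by the jieba-backed pipeline."""
    nlp = build_jieba_pipeline()
    return [token.text for token in nlp(text)]
```

Calling `print(*tokenize(TEXT), sep='\n')` would then print one token per line. Whether the result matches the jieba output above token for token is not verified here.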


## Your Environment

- **spaCy version:** 3.0.3
- **Platform:** Darwin-19.6.0-x86_64-i386-64bit
- **Python version:** 3.6.12


Chinese tokenization is bad #9856

How to reproduce the behaviour

Chinese

Multilanguage

Expected behaviour

Your Environment

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Chinese tokenization is bad #9856

Description

How to reproduce the behaviour

Chinese

Multilanguage

Expected behaviour

Your Environment

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions