Errors in Chinese PKUSEG handling ascii characters #8670

lingvisa · 2021-07-09T20:31:57Z

PKUSEG consistently segments badly at the boundary between Chinese and ASCII characters. A few typical examples are below, with the wrong segments listed under each sentence. This should be fixed in PKUSEG's preprocessor.

回复@我是老血老血少女呀:不可以[怒]崔元明不够帅
']崔' & '[怒'
回复@loll_花萝不是花螺:(怎么突然!哪里有果茶!
'茶!'
18接个吻,开一枪烟袋斜街@意外制造者_KUN纯音乐哈哈哈哈
'KU' & 'N'
王一博跳舞好@UNIQ-王一博
'UN' & 'IQ'
摩登兄弟mdxd刘宇宁lyn棚主宁哥一直在一直爱
'刘宇宁ly' & 'n棚'
回复@赤发与千17:baobei
'千17:baobe'
SKINFOOD新推的矿物质甜糖
'SKIN' & 'FOOD'
高科技元素中iphone和ipad功能强大啊
'中iphone和ipad'
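
To illustrate the kind of preprocessing meant here, below is a minimal sketch of a pre-splitting step that cuts the input wherever ASCII and non-ASCII characters meet, before any statistical segmenter sees it. This is not PKUSEG's actual preprocessor: the function name `split_ascii_boundaries` and the regular expression are assumptions made up for this example.

```python
import re

# Hypothetical pre-splitting pattern (not PKUSEG's actual preprocessor):
#   1. a run of ASCII letters/digits,
#   2. a run of non-ASCII, non-space characters (e.g. Chinese text),
#   3. any other single non-space character (ASCII punctuation, "_", "@").
_CHUNK = re.compile(r"[A-Za-z0-9]+|[^\x00-\x7F\s]+|[^\sA-Za-z0-9]")


def split_ascii_boundaries(text):
    """Split text into chunks so that no chunk mixes ASCII and non-ASCII."""
    return _CHUNK.findall(text)


print(split_ascii_boundaries("王一博跳舞好@UNIQ-王一博"))
print(split_ascii_boundaries("SKINFOOD新推的矿物质甜糖"))
```

With a step like this, only the non-ASCII chunks would need to go through the Chinese segmenter, so runs such as 'UNIQ' or 'SKINFOOD' could not be split in the middle.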

lingvisa · 2021-07-10T01:13:18Z

The above are 8 random examples I took from my regression corpus, which compares against jieba. I also ran the following 5 models and counted the number of correct segmentations with regard to ASCII characters. The results are below:

pkuseg-mixed: 回复 @ 我是老血老血少女呀 : 不可以 [怒 ] 崔元明不够帅
pkuseg-ctb8: 回复 @ 我是老血老血少女呀 : 不可以 [怒 ]崔元明不够帅
pkuseg-weibo: 回复 @ 我是老血老血少女呀 : 不可以 [怒 ]崔元明不够帅
jieba: 回复 @ 我是老血老血少女呀 : 不可以 [ 怒 ] 崔元明不够帅
spacy-pkuseg: 回复 @ 我是老血老血少女呀 : 不可以 [怒 ]崔元明不够帅

pkuseg-mixed: 回复 @ loll _ 花萝不是花螺 : ( 怎么突然 ! 哪里有果茶 !
pkuseg-ctb8: 回复 @ loll _ 花萝不是花螺:( 怎么突然 ! 哪里有果茶 !
pkuseg-weibo: 回复 @ loll _ 花萝不是花螺 : ( 怎么突然 ! 哪里有果茶 !
jieba: 回复 @ loll _ 花萝不是花螺 : ( 怎么突然 ! 哪里有果茶 !
spacy-pkuseg: 回复 @ loll _ 花萝不是花螺 : ( 怎么突然 ! 哪里有果茶!

pkuseg-mixed: 18 接个吻 , 开一枪烟袋斜街 @ 意外制造者 _ KUN纯音乐哈哈哈哈
pkuseg-ctb8: 18 接个吻 , 开一枪烟袋斜街 @ 意外制造者 _ KUN 纯音乐哈哈哈哈
pkuseg-weibo: 18 接个吻 , 开一枪烟袋斜街 @ 意外制造者 _ KUN纯音乐哈哈哈哈
jieba: 18 接个吻 , 开一枪烟袋斜街 @ 意外制造者 _ KUN 纯音乐哈哈哈哈
spacy-pkuseg: 18 接个吻 , 开一枪烟袋斜街 @ 意外制造者 _ K U N 纯音乐哈哈哈哈

pkuseg-mixed: 王一博跳舞好 @ UNIQ - 王一博
pkuseg-ctb8: 王一博跳舞好@un IQ - 王一博
pkuseg-weibo: 王一博跳舞好 @ UNIQ - 王一博
jieba: 王一博跳舞好 @ UNIQ - 王一博
spacy-pkuseg: 王一博跳舞好 @ UN IQ - 王一博

pkuseg-mixed: 摩登兄弟 mdxd 刘宇宁 ly n棚主宁哥一直在一直爱
pkuseg-ctb8: 摩登兄弟 mdxd 刘宇宁lyn棚主宁哥一直在一直爱
pkuseg-weibo: 摩登兄弟 mdxd 刘宇宁l yn 棚主宁哥一直在一直爱
jieba: 摩登兄弟 mdxd 刘宇宁 lyn 棚主宁哥一直在一直爱
spacy-pkuseg: 摩登兄弟 mdxd 刘宇宁ly n棚主宁哥一直在一直爱

pkuseg-mixed: 回复 @ 赤发与千17:baobei
pkuseg-ctb8: 回复 @ 赤发与千17 : baobei
pkuseg-weibo: 回复 @ 赤发与千17:baobei
jieba: 回复 @ 赤发与千 17 : baobei
spacy-pkuseg: 回复 @ 赤发与千17:baobe i

pkuseg-mixed: SKIN FOOD 新推的矿物质甜糖
pkuseg-ctb8: SKIN FOOD 新推的矿物质甜糖
pkuseg-weibo: SKINF OOD 新推的矿物质甜糖
jieba: SKINFOOD 新推的矿物质甜糖
spacy-pkuseg: SKIN FOOD 新推的矿物质甜糖

pkuseg-mixed: 高科技元素中 iphone 和 ipad 功能强大啊
pkuseg-ctb8: 高科技元素中 iphone 和 ipad 功能强大啊
pkuseg-weibo: 高科技元素中 iphone 和 ipad 功能强大啊
jieba: 高科技元素中 iphone 和 ipad 功能强大啊
spacy-pkuseg: 高科技元素中iphone和ipad 功能强大啊

Correct segmentations with regard to ASCII characters:
pkuseg-mixed: 6
pkuseg-ctb8: 6
pkuseg-weibo: 5
jieba: 12
spacy-pkuseg: 1
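
One way such a check could be automated is sketched below. It is not the counting method used for the numbers above; it is one possible criterion, under which a segmentation passes only if every run of ASCII letters/digits in the input comes out as exactly one token. The helper names are hypothetical, and the sketch assumes the input contains no whitespace and that the tokens concatenate back to the input.

```python
import re

ASCII_RUN = re.compile(r"[A-Za-z0-9]+")


def token_spans(tokens):
    """Return the set of (start, end) character offsets covered by each token."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def ascii_runs_intact(text, tokens):
    """True if every ASCII letter/digit run in text is exactly one token."""
    if "".join(tokens) != text:
        raise ValueError("tokens do not concatenate back to the input text")
    spans = token_spans(tokens)
    return all(m.span() in spans for m in ASCII_RUN.finditer(text))


text = "王一博跳舞好@UNIQ-王一博"
# Token lists copied from the jieba and spacy-pkuseg outputs above.
print(ascii_runs_intact(text, "王一博跳舞好 @ UNIQ - 王一博".split()))
print(ascii_runs_intact(text, "王一博跳舞好 @ UN IQ - 王一博".split()))
```

Summing the result over a corpus for each model would give a per-model count comparable in spirit to the table above, though not necessarily the same numbers.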

Although the examples were taken at random, they are biased in favor of jieba, because they come from a regression corpus built by comparing spacy-pkuseg (I am using zh_core_web_lg) against jieba. Still, they illustrate that spacy-pkuseg needs to improve in this respect.

lingvisa · 2021-07-10T01:35:05Z

To run a quick test and see whether some of the errors would go away, I unpacked both the pkuseg-mixed model and zh_core_web_lg-3.1.0.tar.gz, and copied:

pkuseg-python/models/mixed/features.pkl
pkuseg-python/models/mixed/weights.npz

to
zh_core_web_lg-3.1.0/zh_core_web_lg/zh_core_web_lg-3.1.0/tokenizer/pkuseg_model/

That essentially overwrites spacy-pkuseg's own files:

features.msgpack
weights.npz

I removed 'features.msgpack' because its name differs from the 'features.pkl' in the original pkuseg package. I then recompressed the model with:
tar -zcvf zh_core_web_lg-3.1.0.tar.gz zh_core_web_lg-3.1.0

However, when I tried to install the new model, I got an error message:

pip install zh_core_web_lg-3.1.0.tar.gz 
Processing ./zh_core_web_lg-3.1.0.tar.gz
ERROR: File "setup.py" not found for legacy project file:///Users/congminmin/nlp/spaCy/models/zh_core_web_lg-3.1.0.tar.gz.

'setup.py' is present in the original package, but pip still complained. I suspect one of the following reasons:

The two files 'features.pkl' and 'features.msgpack' are named differently.
The model needs to be packaged some other way, not with 'tar -zcvf'.

How can I run a quick test by switching the pkuseg models in spacy-pkuseg?
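
For reference, spaCy v3's documentation for the Chinese tokenizer describes selecting a pkuseg model at initialization time rather than by editing the packaged files. The sketch below follows that documented pattern as I understand it. It builds a blank Chinese pipeline with only the tokenizer, so it is not a drop-in replacement for zh_core_web_lg. The helper name `load_chinese_with_pkuseg` is hypothetical, and it is an assumption that the "mixed" model can be fetched in your environment (spacy-pkuseg must be installed).

```python
def load_chinese_with_pkuseg(pkuseg_model="mixed"):
    """Build a blank Chinese pipeline whose tokenizer uses the given pkuseg model.

    pkuseg_model can be a model name such as "mixed" or "spacy_ontonotes",
    or a path to a local pkuseg model directory.
    """
    # Imported lazily so that defining this helper does not require spaCy.
    from spacy.lang.zh import Chinese

    nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "pkuseg"}}})
    nlp.tokenizer.initialize(pkuseg_model=pkuseg_model)
    return nlp


if __name__ == "__main__":
    nlp = load_chinese_with_pkuseg("mixed")
    print([t.text for t in nlp("王一博跳舞好@UNIQ-王一博")])
```

Whether a tokenizer re-initialized this way can be combined with the trained components of zh_core_web_lg without hurting their accuracy is not something this sketch addresses.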
