Errors in Chinese PKUSEG handling ascii characters #8670
Replies: 2 comments
The above are 8 random examples I took from my regression corpus when comparing against jieba. Further, I ran the following 5 models and counted the number of correct segmentations with regard to ASCII characters; the results are below:

pkuseg-mixed: 回复 @ 我 是 老血 老血 少女 呀 : 不 可以 [怒 ] 崔 元明 不 够帅
pkuseg-mixed: 回复 @ loll _ 花萝 不是 花螺 : ( 怎么 突然 ! 哪里 有 果茶 !
pkuseg-mixed: 18 接 个 吻 , 开 一 枪 烟袋斜街 @ 意外 制造者 _ KUN纯 音乐 哈哈哈哈
pkuseg-mixed: 王一博 跳舞 好 @ UNIQ - 王一博
pkuseg-mixed: 摩登 兄弟 mdxd 刘 宇宁 ly n棚 主宁哥 一直 在 一直 爱
pkuseg-mixed: 回复 @ 赤发 与 千17:baobei
pkuseg-mixed: SKIN FOOD 新 推 的 矿物质 甜糖
pkuseg-mixed: 高科技 元素 中 iphone 和 ipad 功能 强大 啊

Correct segmentations with regard to ASCII characters:

Although the examples were taken at random, they are biased in favor of jieba, because they come from a regression corpus built by comparing pkuseg-spacy (I am using zh_core_web_lg) against jieba. Still, they illustrate that pkuseg-spacy needs to improve in this respect.
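For readers who want to reproduce this kind of count, here is a minimal sketch of one possible scoring rule. It is an assumption, not necessarily the criterion used for the numbers in this comment: a segmentation counts as correct when every run of ASCII letters and digits in the input comes out as exactly one token of its own. The helper names `ascii_runs_intact` and `count_correct` are hypothetical, and the sample text is reconstructed by joining the tokens of one example above.

```python
import re

# A run of ASCII letters/digits, e.g. "UNIQ" or "17".
_ASCII_RUN = re.compile(r"[A-Za-z0-9]+")


def ascii_runs_intact(text, tokens):
    """Return True if each ASCII alphanumeric run in `text` is exactly one token.

    Strict by design (an assumed criterion): a token that splits a run
    ("UN", "IQ") or glues it to other characters ("中iphone和ipad") fails.
    """
    expected = _ASCII_RUN.findall(text)
    # Keep only the tokens that contain at least one ASCII letter or digit.
    observed = [tok for tok in tokens if _ASCII_RUN.search(tok)]
    return expected == observed


def count_correct(samples):
    """Count the (text, tokens) pairs whose ASCII runs are segmented intact."""
    return sum(ascii_runs_intact(text, tokens) for text, tokens in samples)


# Illustrative input: the text is the joined tokens of one example above.
text = "王一博跳舞好@UNIQ-王一博"
good = ["王一博", "跳舞", "好", "@", "UNIQ", "-", "王一博"]
bad = ["王一博", "跳舞", "好", "@", "UN", "IQ", "-", "王一博"]
samples = [(text, good), (text, bad)]
```

Running `count_correct` once per model over the same samples gives one comparable number per model. A looser rule, for instance one that tolerates "@UNIQ" as a single token, would need a different comparison.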
To run a quick test and see whether some of the errors would go away, I unzipped the pkuseg-mixed model and zh_core_web_lg-3.1.0.tar.gz, and copied:
to
That essentially overwrites the spacy-pkuseg model:
I removed 'features.msgpack', since its name differs from the 'features.pkl' in the original package. I recompressed the model by:
However, when I tried to install the new model, I got an error message:
'setup.py' is present in the original package, but the installation still complained. I suspect it might be due to one of the following reasons:
How can I run a quick test by switching the pkuseg models in spacy-pkuseg?
PKUSEG consistently mis-segments text at the boundary between Chinese and ASCII characters. A few typical examples are below, with the wrong segmentations listed for each sentence. This should be fixed in PKUSEG's preprocessor.
']崔' & '[怒'
'茶!'
'KU' & 'N'
'UN' & 'IQ'
'刘宇宁ly' & 'n棚'
'千17:baobe'
'SKIN' & 'FOOD'
'中iphone和ipad'
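To make the preprocessor suggestion concrete, here is a minimal sketch of a pre-splitting step. It is not PKUSEG's actual preprocessor, and the function name `split_script_runs` is hypothetical. It cuts the input into maximal runs of ASCII letters and digits, maximal runs of CJK ideographs (only the basic U+4E00 to U+9FFF block, an assumption), and single characters for everything else, so that a model would only ever have to segment the pure-Chinese chunks.

```python
import re

# Alternatives, tried in order: an ASCII alphanumeric run, a run of CJK
# ideographs (basic block only), or any other single non-space character.
_CHUNK = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]+|\S")


def split_script_runs(text):
    """Split `text` at every boundary between ASCII runs, CJK runs, and symbols.

    Whitespace is dropped. CJK chunks are returned unsegmented; they would
    still need to be passed to a word segmenter afterwards.
    """
    return _CHUNK.findall(text)


chunks = split_script_runs("中iphone和ipad")
```

With this step in front, fragments such as 'KU' & 'N' or '中iphone和ipad' cannot occur, because a Latin run is never split and never merged with neighboring Chinese characters. Whether space-separated names like 'SKIN' and 'FOOD' should be rejoined is a separate decision that this sketch does not make.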