Tokenization quality looks worrying
#771
Replies: 4 comments 3 replies
-
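Yes, this tokenizer is GLM's own, and that certainly affects matching, especially if you have specialized vocabulary, but normal conversation is not affected.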
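To make the point about specialized vocabulary concrete, here is a minimal sketch in plain Python. It assumes "matching" means looking for a term among individual tokens, which is only one possible reading of the reply. The pieces are two excerpts copied from the EncodeAsPieces output posted in this thread; the helper names token_match and text_match are hypothetical and are not part of sentencepiece or ChatGLM.

```python
# Two excerpts copied from the EncodeAsPieces output quoted in this thread.
pieces = ['百度', '公司', 'CT', 'O', '王', '海', '峰', '博士',
          'P', 'add', 'le', 'P', 'add', 'le', '版本']


def token_match(term, pieces):
    """Hypothetical helper: True only if the term survives as a single piece."""
    return term in pieces


def text_match(term, pieces):
    """Hypothetical helper: True if the term appears in the re-joined text."""
    return term in ''.join(pieces)


for term in ['百度', 'CTO', '王海峰', 'PaddlePaddle']:
    print(term, token_match(term, pieces), text_match(term, pieces))
```

Common words such as 百度 stay whole, while rarer terms such as PaddlePaddle or the name 王海峰 are split into several pieces, so a token-level lookup misses them even though the joined text still contains them.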
0 replies
-
@zRzRzRzRzRzRzR
2 replies
-
If GLM trained its tokenizer on a large-scale corpus
1 reply
-
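I have fed input that is very unfriendly to tokenization into some much older models (crude stuff from the BERT era). The segmentation was just as much of a mess, but the final results were still acceptable.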
0 replies
-
import sentencepiece

# Load the SentencePiece model from the local chatglm3-6b-base snapshot.
sp = sentencepiece.SentencePieceProcessor()
sp.load('/home/wencan/.cache/huggingface/hub/models--THUDM--chatglm3-6b-base/snapshots/f91a1de587fdc692073367198e65369669a0b49d/tokenizer.model')
# Split a sample sentence into subword pieces.
sp.EncodeAsPieces('百度官方出品,百度公司CTO王海峰博士作序,张钹院士、李未院士、百度集团副总裁吴甜联袂推荐!结合新近PaddlePaddle版本,融合大量实践案例,让你从“零基础”到“全精通”,深入掌握深度学习的知识')
Output:
['▁', '百度', '官方', '出品', ',', '百度', '公司', 'CT', 'O', '王', '海', '峰', '博士', '作', '序', ',', '张', '钹', '院士', '、', '李', '未', '院士', '、', '百度', '集团', '副总裁', '吴', '甜', '联', '袂', '推荐', '!', '结合', '新', '近', 'P', 'add', 'le', 'P', 'add', 'le', '版本', ',', '融合', '大量', '实践', '案例', ',', '让你', '从', '“', '零', '基础', '”', '到', '“', '全', '精通', '”,', '深入', '掌握', '深度', '学习的', '知识']
I'm a beginner.
It looks like ChatGLM's tokenizer model was obtained by unsupervised training with google/sentencepiece.
I would like to ask: