The short answer is: don't do this. The statistical models are trained on text segmented by that particular segmenter, so using a different segmentation will degrade their performance. Adding a handful of custom exceptions is not going to make a big difference, and using pkuseg's `mixed` model instead of spaCy's pkuseg model would probably be only slightly worse overall because they're trained on fairly similar data, but switching to jieba would lead to extremely poor annotation from the statistical models.

If you want to use a different segmenter, you'll need to train a new pipeline from scratch.
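
To make the mismatch concrete, here is a minimal sketch that measures how many of the tokens the models were trained on survive when a different segmenter is swapped in. The helper names `token_spans` and `span_agreement` and both example segmentations are hypothetical, made up for illustration; they are not part of spaCy and not real pkuseg or jieba output.

```python
def token_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def span_agreement(trained_tokens, other_tokens):
    """Fraction of the training segmenter's tokens that the other
    segmenter reproduces with exactly the same boundaries."""
    if "".join(trained_tokens) != "".join(other_tokens):
        raise ValueError("both segmentations must cover the same text")
    trained = token_spans(trained_tokens)
    other = token_spans(other_tokens)
    if not trained:
        return 1.0  # empty text: nothing to disagree on
    return len(trained & other) / len(trained)


# Hypothetical segmentations of the same sentence, invented for
# illustration (NOT actual output from pkuseg or jieba).
seg_a = ["我", "喜欢", "自然", "语言", "处理"]
seg_b = ["我", "喜欢", "自然语言", "处理"]

print(span_agreement(seg_a, seg_b))  # 0.6
```

Every token whose boundaries differ is a unit the tagger, parser and NER never saw in that form during training, which is why a low agreement score on your own data is a sign that the downstream annotations will suffer.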

Answer selected by svlandeg
Labels: lang / zh (Chinese language data and models), feat / tokenizer (Feature: Tokenizer)
This discussion was converted from issue #8576 on July 02, 2021 07:20.