How to control the segmenter when loading full Chinese models in spaCy 3.0.6? #8577
-
If I want to choose a particular segmenter, how do I do that? I can't do the following: Alternatively, I create the tokenizer first:
Then how do I load the other components (tagger, parser, NER, etc.) through this nlp object?
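For context, the documented way to select a segmenter on a blank Chinese pipeline looks roughly like this (a sketch based on the spaCy docs, not necessarily the exact code I tried, and it doesn't answer how to attach a pretrained pipeline's tagger/parser/NER afterwards):

```python
from spacy.lang.zh import Chinese

# Character segmentation (the default for a blank Chinese pipeline)
nlp = Chinese()

# Jieba segmentation, selected through the tokenizer config
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
```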
-
The short answer is: don't do this. The statistical models are trained using that particular segmenter, so using different segmentation will degrade the performance. Adding a handful of custom exceptions is not going to make a big difference, and using `mixed` instead of spaCy's pkuseg model would probably be only slightly worse overall because they're trained on fairly similar data, but switching to `jieba` would lead to extremely poor annotation from the statistical models. If you want to use a different segmenter, you'll need to train a new pipeline from scratch.
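For concreteness, a sketch (following the spaCy docs) of pointing a blank pipeline's tokenizer at pkuseg's `mixed` model; the statistical components of a pipeline created this way would still need to be trained:

```python
from spacy.lang.zh import Chinese

# Blank Chinese pipeline whose tokenizer uses the pkuseg segmenter
# with the "mixed" model provided by pkuseg (requires spacy-pkuseg).
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="mixed")
```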
-
@adrianeboyd That makes sense. The reason I may want to switch to another segmenter is that one of the use cases for my pipeline is real-time applications where speed matters most, and I have a large domain lexicon customized for jieba (see the sketch at the end of this post), which won't work with the pkuseg segmenter. The mechanisms for handling user lexicons in pkuseg and jieba are very different. For other scenarios, I would want to use pkuseg. I remember the jieba segmenter is about 3 times faster than pkuseg. Is that right? PKUSEG has two problems that I want to fix before using it (the author no longer maintains it):

1. Adding a very large user dictionary (on the order of a million entries) makes segmentation much worse.
2. English words embedded in Chinese text get split incorrectly.
I would mainly want to use spaCy for these scenarios:
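A minimal sketch of how I imagine plugging the jieba lexicon in, assuming spaCy's jieba segmenter goes through jieba's default module-level tokenizer (the lexicon file name is hypothetical):

```python
import jieba
from spacy.lang.zh import Chinese

# Hypothetical domain lexicon in jieba's userdict format (one entry per line).
# Loading it into jieba's default tokenizer should also affect spaCy's
# jieba-based segmentation, assuming spaCy delegates to the module-level API.
jieba.load_userdict("domain_lexicon.txt")

cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
```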
-
The two issues basically say that when on the order of a million user entries (about 50 MB) are added to the user dictionary, segmentation becomes much worse. The author did reply once, saying that all entries in the user dictionary are first split into characters, and adding a lot of them messes up the machine-learning-based logic, since those entries don't occur in the training data. One possible solution is to adjust the dictionary-based segmentation, but how to do that remains an open question. In jieba, each word is associated with a count computed from training data, but the two segmenters work fundamentally differently, so it won't be a straightforward change. Instead of trying to fix this inside PKUSEG, it may be better to enhance the user-dictionary functionality in spaCy itself by merging and re-splitting tokens (roughly as sketched below). For a large vocabulary, the entries would need to be specified in a file, not in Python code. This could also work for other languages with segmentation needs similar to Chinese. I can open an issue about this in spaCy's tracker when I get to this part.

The second problem, English words being split incorrectly, should be a simpler fix in PKUSEG. There is also an existing issue: lancopku/pkuseg-python#112. That particular example seems to have been fixed, but lots of English words are still randomly split. This needs more work in pkuseg's Preprocess `__init__.py`: English words need to be recognized first and kept together in later stages. I will create an issue once I test more examples.
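To make the merging half of that idea concrete, here is a rough sketch (not existing spaCy functionality): it loads a hypothetical lexicon file, finds dictionary entries whose characters align with existing token boundaries using PhraseMatcher, and merges each match into a single token with the retokenizer. Re-splitting tokens that the segmenter merged across an entry boundary would need additional logic on top of this.

```python
import spacy
from spacy.lang.zh import Chinese
from spacy.matcher import PhraseMatcher

# Hypothetical lexicon file: one dictionary entry per line.
USER_DICT_PATH = "user_dict.txt"

# Blank Chinese pipeline (character segmentation by default; in practice
# you would configure the jieba or pkuseg segmenter as shown earlier).
nlp = Chinese()

# Build phrase patterns from the lexicon file.
matcher = PhraseMatcher(nlp.vocab)
with open(USER_DICT_PATH, encoding="utf8") as f:
    patterns = [nlp.make_doc(line.strip()) for line in f if line.strip()]
matcher.add("USER_DICT", patterns)

def merge_user_dict(doc):
    # Collect matched spans, drop overlapping ones, and merge each
    # remaining span into a single token.
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        for span in spacy.util.filter_spans(spans):
            retokenizer.merge(span)
    return doc

doc = merge_user_dict(nlp("这里是一段中文示例文本"))
print([t.text for t in doc])
```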