How to control the segmenter when loading full Chinese models in spaCy 3.0.6? #8577
-
If I want to choose a particular segmenter, how do I do that? I can't do the following: Alternatively, I create the tokenizer first:
Then how do I load the other components (tagger, parser, NER, etc.) through this nlp object?
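For context, the documented way to select a segmenter on a blank Chinese pipeline looks roughly like this (a sketch based on the spaCy docs, not necessarily the exact code I tried, and it doesn't answer how to attach a pretrained pipeline's tagger/parser/NER afterwards):

```python
from spacy.lang.zh import Chinese

# Character segmentation (the default for a blank Chinese pipeline)
nlp = Chinese()

# Jieba segmentation, selected through the tokenizer config
cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
```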
-
The short answer is: don't do this. The statistical models are trained using that particular segmenter, so using different segmentation will degrade the performance. Adding a handful of custom exceptions is not going to make a big difference, and using `mixed` instead of spaCy's pkuseg model would probably be only slightly worse overall because they're trained on fairly similar data, but switching to `jieba` would lead to extremely poor annotation from the statistical models. If you want to use a different segmenter, you'll need to train a new pipeline from scratch.
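For concreteness, a sketch (following the spaCy docs) of pointing a blank pipeline's tokenizer at pkuseg's `mixed` model; the statistical components of a pipeline created this way would still need to be trained:

```python
from spacy.lang.zh import Chinese

# Blank Chinese pipeline whose tokenizer uses the pkuseg segmenter
# with the "mixed" model provided by pkuseg (requires spacy-pkuseg).
cfg = {"segmenter": "pkuseg"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
nlp.tokenizer.initialize(pkuseg_model="mixed")
```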
-
@adrianeboyd That makes sense. The reason I may want to switch to another segmenter is that one of the use cases for my pipeline is real-time applications where speed matters most, and I have a large domain lexicon customized for jieba (see the sketch at the end of this post), which won't work with the pkuseg segmenter. The mechanisms for handling user lexicons in pkuseg and jieba are very different. For other scenarios, I would want to use pkuseg. I remember the jieba segmenter is about 3 times faster than pkuseg. Is that right? PKUSEG has two problems that I want to fix before using it (the author no longer maintains it):

1. Adding a very large user dictionary (on the order of a million entries) makes segmentation much worse.
2. English words embedded in Chinese text get split incorrectly.
I would mainly want to use spaCy for these scenarios:
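A minimal sketch of how I imagine plugging the jieba lexicon in, assuming spaCy's jieba segmenter goes through jieba's default module-level tokenizer (the lexicon file name is hypothetical):

```python
import jieba
from spacy.lang.zh import Chinese

# Hypothetical domain lexicon in jieba's userdict format (one entry per line).
# Loading it into jieba's default tokenizer should also affect spaCy's
# jieba-based segmentation, assuming spaCy delegates to the module-level API.
jieba.load_userdict("domain_lexicon.txt")

cfg = {"segmenter": "jieba"}
nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
```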
-
The two issues basically say that when on the order of a million user entries (about 50 MB) are added to the user dictionary, segmentation becomes much worse. The author did reply once, saying that all entries in the user dictionary are first split into characters, and adding a lot of them messes up the machine-learning-based logic, since those entries don't occur in the training data. One possible solution is to adjust the dictionary-based segmentation, but how to do that remains an open question. In jieba, each word is associated with a count computed from training data, but the two segmenters work fundamentally differently, so it won't be a straightforward change. Instead of trying to fix this inside PKUSEG, it may be better to enhance the user-dictionary functionality in spaCy itself by merging and re-splitting tokens (roughly as sketched below). For a large vocabulary, the entries would need to be specified in a file, not in Python code. This could also work for other languages with segmentation needs similar to Chinese. I can open an issue about this in spaCy's tracker when I get to this part.

The second problem, English words being split incorrectly, should be a simpler fix in PKUSEG. There is also an existing issue: lancopku/pkuseg-python#112. That particular example seems to have been fixed, but lots of English words are still randomly split. This needs more work in pkuseg's Preprocess `__init__.py`: English words need to be recognized first and kept together in later stages. I will create an issue once I test more examples.
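To make the merging half of that idea concrete, here is a rough sketch (not existing spaCy functionality): it loads a hypothetical lexicon file, finds dictionary entries whose characters align with existing token boundaries using PhraseMatcher, and merges each match into a single token with the retokenizer. Re-splitting tokens that the segmenter merged across an entry boundary would need additional logic on top of this.

```python
import spacy
from spacy.lang.zh import Chinese
from spacy.matcher import PhraseMatcher

# Hypothetical lexicon file: one dictionary entry per line.
USER_DICT_PATH = "user_dict.txt"

# Blank Chinese pipeline (character segmentation by default; in practice
# you would configure the jieba or pkuseg segmenter as shown earlier).
nlp = Chinese()

# Build phrase patterns from the lexicon file.
matcher = PhraseMatcher(nlp.vocab)
with open(USER_DICT_PATH, encoding="utf8") as f:
    patterns = [nlp.make_doc(line.strip()) for line in f if line.strip()]
matcher.add("USER_DICT", patterns)

def merge_user_dict(doc):
    # Collect matched spans, drop overlapping ones, and merge each
    # remaining span into a single token.
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        for span in spacy.util.filter_spans(spans):
            retokenizer.merge(span)
    return doc

doc = merge_user_dict(nlp("这里是一段中文示例文本"))
print([t.text for t in doc])
```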