How to make custom Language and Tokenizer class without changing spacy source code? #11036
-
I used to compile the spaCy source code locally and set up my own spaCy build as a dependency of my project (call it KBMatcher), so I could install my locally compiled spaCy into KBMatcher's virtual environment. One of the primary reasons I need to modify spaCy is that I am using the Chinese models and I need two versions of the tokenizer, char-based and word-based, in the same pipeline: char tokenization so the PhraseMatcher can match on characters, and then Jieba tokenization to disambiguate some of the char-based matches. The way I currently achieve this is as follows: at line 296 (in the Chinese tokenizer code) I add a function to update the segmenter:
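The code block that originally followed is not preserved here. A minimal sketch of what such a helper might look like, assuming it is added to `ChineseTokenizer` in `spacy/lang/zh/__init__.py` (the method name `update_segmenter` and the re-initialization details are the poster's own / illustrative, not part of spaCy; only the `Segmenter` enum and the `segmenter` attribute are real):

```python
# Illustrative sketch only: a method added inside spaCy's ChineseTokenizer
# (spacy/lang/zh/__init__.py) to switch segmentation mode at runtime.
from spacy.lang.zh import Segmenter


def update_segmenter(self, segmenter: str) -> None:
    """Switch between char-, jieba-, or pkuseg-based segmentation."""
    self.segmenter = segmenter
    if segmenter == Segmenter.jieba and getattr(self, "jieba_seg", None) is None:
        import jieba  # lazily set up jieba the first time it is needed
        self.jieba_seg = jieba
```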
Before I add the Dictionary component to the pipeline, I set the segmenter to char tokenization so that the PhraseMatcher works on char-based tokens. After that, I set the segmenter back to word-based tokenization, as below:
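The original snippet is also missing here; a rough sketch of the kind of toggling described, assuming the `update_segmenter` helper above and a custom component registered as `"dictionary"` (both names are the poster's own / hypothetical, not spaCy built-ins):

```python
from spacy.lang.zh import Segmenter

# char-level tokens so the PhraseMatcher-based dictionary component matches on characters
nlp.tokenizer.update_segmenter(Segmenter.char)
nlp.add_pipe("dictionary")  # hypothetical custom component built on PhraseMatcher

# ...later, switch back to word-level (jieba) segmentation
nlp.tokenizer.update_segmenter(Segmenter.jieba)
```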
This worked for me locally. Now I want to `pip install spacy` directly, without changing the spaCy source code locally, because there is currently an issue in our Docker environment. Is there a way to achieve the same purpose somehow? For example, in my local KBMatcher I could create a file spacy_zh.py like this:
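A sketch of the idea being described (class and method names are illustrative): subclass spaCy's `Chinese` and put the segmenter-switching helper there instead of patching spaCy itself.

```python
# spacy_zh.py -- illustrative only
from spacy.lang.zh import Chinese, Segmenter


class MyChinese(Chinese):
    def update_segmenter(self, segmenter: str) -> None:
        # delegate to the tokenizer; assumes setting `segmenter` on the
        # ChineseTokenizer is enough (jieba/pkuseg may need extra initialization)
        self.tokenizer.segmenter = segmenter
```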
This is like writing a custom version of the Chinese class by inheritance to override the one in the spaCy code base, even if the two are otherwise identical. But spacy.load() always returns the original Chinese object, right?
Another consideration: without using inheritance, is there a way to get two versions of the Chinese tokenizer from the nlp object in my own code?
-
That is a pretty complicated use of the pipeline! I am glad it worked for you, but I have never heard of anyone else changing the tokenizer on a pipeline in the same document, and I am surprised you didn't run into issues.

Regarding the issue with spacy.load: what is happening is that when you load the pipeline, it is also calling `import spacy.lang.zh`, and has no way to look at the custom code you defined. You should be able to define a custom language, use it at training/assemble time, and save a pipeline that way, though I think there are better ways to do this.

It sounds like you only need the jieba segmentation for comparison, right? In that case I think it makes sense to have a separate pipeline that you use in a component. So you could do something like this:
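The example that followed is not preserved; a sketch of the suggested approach (the component name `jieba_compare` and the `jieba_doc` extension are made up for illustration; the `Chinese.from_config` call for a jieba-segmented pipeline follows the spaCy docs):

```python
from spacy.lang.zh import Chinese
from spacy.language import Language
from spacy.tokens import Doc

if not Doc.has_extension("jieba_doc"):
    Doc.set_extension("jieba_doc", default=None)


@Language.factory("jieba_compare")
def create_jieba_compare(nlp, name):
    # a bare Chinese pipeline configured to segment with jieba
    jieba_nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})
    return JiebaCompare(jieba_nlp)


class JiebaCompare:
    def __init__(self, jieba_nlp):
        self.jieba_nlp = jieba_nlp

    def __call__(self, doc):
        # re-tokenize the raw text with jieba and keep the result for comparison
        doc._.jieba_doc = self.jieba_nlp.make_doc(doc.text)
        return doc
```

The main (char-based) pipeline would then add this with `nlp.add_pipe("jieba_compare")`, and any downstream component can compare `doc` against `doc._.jieba_doc`.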
You could put the jieba doc in an underscore attribute, for example, or do the comparison you need right in your custom component. This is similar to how spacy-transformers handles the HuggingFace tokenizer data internally.
The tokenizer object should be publicly accessible (as nlp.tokenizer), so you should just be able to work with it directly.
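The snippet that originally followed is cut off; presumably something along these lines (`nlp.tokenizer` and the `segmenter` attribute are spaCy's public API, the rest is illustrative):

```python
# nlp.tokenizer is the pipeline's tokenizer; for Chinese pipelines it is a
# ChineseTokenizer whose `segmenter` attribute says how it splits text.
print(type(nlp.tokenizer), nlp.tokenizer.segmenter)

# it can also be called directly to tokenize without running the other components
doc = nlp.tokenizer("北京大学的学生")
```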
-
@polm I had this question because our spaCy project is having an issue in our Docker environment when it is served via HTTP requests. The issue may be due to my code, our production environment, or something else. When tested locally it runs smoothly, but once it is put into the production environment some requests take an extremely long time, so the QPS is very low. Earlier I was afraid I might have modified the spaCy source code incorrectly and caused the problem. However, after asking the question above I simply pip installed spacy==3.2.4 (CPU only) without modifying spaCy in any way, and the local test is extremely fast with all normal behavior. But when it is put into the production environment and served with Flask, the problem persists. Just wondering, has any other spaCy user reported similar symptoms in the past?
-
We eventually found the issue. It is not due to spaCy, but a setting related to 'monitor' in the HTTP server. It's also good to know that upgrading to Python 3.8 is the solution for issue #7774. Thanks.