How to make custom Language and Tokenizer class without changing spacy source code? #11036

lingvisa · 2022-06-27T19:40:51Z

lingvisa
Jun 27, 2022

I used to compile Spacy source code locally and set up my-own-spacy as a dependency of my project, for example, KBMatcher. Then I can install my local compiled spacy into my virtual environment of KBMatcher. One of the primary reasons I need to modify Spacy is that I am using the Chinese models and I need to use two versions of tokenizer: char- and token- based tokenization in the same pipeline. I need to use char tokenization to do PhraseMatcher but then use Jieba tokenizations to disambiguate some of the char-based matches. The way I currently realize this purpose is the following:
In this line: https://github.com/explosion/spaCy/blob/v3.2.4/spacy/lang/zh/__init__.py

Line: 296, I add a function to update the segmenter:

class Chinese(Language):
    lang = "zh"
    Defaults = ChineseDefaults

    def update_segmenter(self, val):
        self.tokenizer.segmenter = val

Before I add Dictionary component to the pipeline, I set the segmenter to be char tokenization, so that PhraseMatcher can do char based tokenization. After that, I set the segmenter back to token based tokenization as below:

     self.nlp.update_segmenter(Segmenter.char)
        if 'dictionary' not in self.nlp.pipe_names:
            self.nlp.add_pipe('dictionary')
        self.nlp.update_segmenter(Segmenter.jieba)

This worked for me locally. Now I want to "pip install spacy" directly without locally changing spacy source code, because currently there is an issue in our docker environment. Is there a way to achieve the same purpose somehow? For example, in my local KBMatcher if I create this file spacy_zh.py:

from spacy import Language
from spacy.lang.zh import ChineseDefaults

class Chinese(Language):
    lang = "zh"
    Defaults = ChineseDefaults

    def make_doc(self, text, merge_ascii=False):
        return self.tokenizer(text, merge_ascii)

    def update_segmenter(self, val):
        self.tokenizer.segmenter = val

This is like writing a custom version of the Chinese class by inheritance and overwrite the one in the spacy code base, although they are the same, but the load() function always returns an original 'Chinese' object, right:

nlp = spacy.load(model)

Another consideration is that without using inheritance, is there a way to get two versions of Chinese tokenizers from the 'nlp' object in my own code?

Answered by polm

Jun 28, 2022

That is a pretty complicated use of the pipeline! I am glad it worked for you, but I have never heard of anyone else changing the tokenizer on a pipeline in the same document, and I am surprised you didn't run into issues.

Regarding the issue with spacy.load, what is happening is that when you load the pipeline, it is also calling import spacy.lang.zh, and has no way to look at the custom code you defined. You should be able to define a custom language, use it at training/assemble time, and save a pipeline that way, though I think there are better ways to do this.

It sounds like you only need the jieba segmentation for comparison, right? In that case I think it makes sense to have a separ…

View full answer

polm · 2022-06-28T05:04:20Z

polm
Jun 28, 2022

That is a pretty complicated use of the pipeline! I am glad it worked for you, but I have never heard of anyone else changing the tokenizer on a pipeline in the same document, and I am surprised you didn't run into issues.

Regarding the issue with spacy.load, what is happening is that when you load the pipeline, it is also calling import spacy.lang.zh, and has no way to look at the custom code you defined. You should be able to define a custom language, use it at training/assemble time, and save a pipeline that way, though I think there are better ways to do this.

It sounds like you only need the jieba segmentation for comparison, right? In that case I think it makes sense to have a separate pipeline that you use in a component. So you could do something like this:

import spacy

nlp = spacy.load(...) # main nlp
jieba_nlp = spacy.load(...) # jieba nlp object

@Language.component("jieba_seg")
def my_component(doc):
    jieba_doc = jieba_nlp(doc.text)
    # ... do something ...
    return doc

You could put the jieba doc in an underscore attribute, for example, or do the comparison you need right in your custom component. This is similar to how spacy-transformers handles the HuggingFace tokenizer data internally.

Another consideration is that without using inheritance, is there a way to get two versions of Chinese tokenizers from the 'nlp' object in my own code?

The tokenizer object should be publicly accessible, so you should just be able to do nlp.tokenizer = .... I don't think we ever really check that usage, so it might exhibit strange behavior (I would be especially careful with word vectors), but it should work.

2 replies

lingvisa Jun 28, 2022
Author

Without changing the original Spacy code and compiling myself, I ended up with defining a simplified version of the original ChineseTokenizer() class from zh/init.py in my own code. because I need to change a bit in the original code. Then as you suggested, I reset the tokenizer before and after I add the Dictionary class: self.nlp.tokenizer = ChineseTokenizer(self.nlp.vocab, Segmenter.char); self.nlp.add_pipe('dictionary'); self.nlp.tokenizer = ChineseTokenizer(self.nlp.vocab, Segmenter.jieba). This seems to be working for me. I should really rename the ChineseTokenizer to distinguish it from the original class inside Spacy, but I just reused the same name.

polm Jun 28, 2022

Well, if it works that's the most important thing - glad you figured it out!

lingvisa · 2022-06-28T17:15:37Z

lingvisa
Jun 28, 2022
Author

@polm I had this question because our Spacy project is having an issue in our docker environment when it's served with http requests. The issue may be due to my code, our production environment or something else. When locally tested, it runs all smoothly, but some of the requests take extremely long times when it is put into the production environment, so the QPS is very low. Earlier I was afraid I might modify the Spacy source code not correctly and may cause the problem. However, this time after this question above, I just pip install spacy==3.2.4 with cpu only, and didn't modify the original Spacy any way, and the local test is extremely fast with all normal behaviors. But when it's put into production environment and served with Flask, the problem persists.

Just wondering does some other Spacy user report similar symptoms by any chance in the past?

1 reply

polm Jun 29, 2022

We have had reports of poor performance in web servers sometimes. I think it is almost always a code issue though - for example, you shouldn't call spacy.load on every request. So check that nothing like that is going on.

I looked over old issues and one thing that came up was #7774, where a specific version of importlib was causing performance issues. But I think that was also only when loading the models, not during normal processing. Still, it may help to make sure you're using the most recent spaCy version.

If you still have issues and can give us an example we can run we'd be glad to check it.

lingvisa · 2022-06-29T19:42:22Z

lingvisa
Jun 29, 2022
Author

We eventually found the issue. It's not due to Spacy, but some setting related to 'monitor' in http server. It's good to know the solution of upgrading to Python 3.8 for the issue #7774. Thanks.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

How to make custom Language and Tokenizer class without changing spacy source code? #11036

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

How to make custom Language and Tokenizer class without changing spacy source code? #11036

Uh oh!

Uh oh!

lingvisa Jun 27, 2022

Replies: 3 comments · 3 replies

Uh oh!

polm Jun 28, 2022

Uh oh!

Uh oh!

lingvisa Jun 28, 2022 Author

Uh oh!

polm Jun 28, 2022

Uh oh!

lingvisa Jun 28, 2022 Author

Uh oh!

polm Jun 29, 2022

Uh oh!

lingvisa Jun 29, 2022 Author

lingvisa
Jun 27, 2022

Replies: 3 comments 3 replies

polm
Jun 28, 2022

lingvisa Jun 28, 2022
Author

lingvisa
Jun 28, 2022
Author

lingvisa
Jun 29, 2022
Author