How to use multiprocessing when there is an app parameter to be passed in to custom components? #10701
-
I created a standard spaCy custom component:

Here I passed an `app_params` dict as an additional parameter to the `__call__` method, and get a `text_id` and `uid` value from the parameter dict. In single-process mode, I use the custom component this way:

If I want to use multiprocessing like below:

How can I pass the `component_cfg` to the `pipe()` function? The `component_cfg` represents a set of app parameters specific to a single doc, but the `pipe` method takes a batch of texts as input.
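The original code snippets didn't survive in this thread, but the single-process pattern described above could look something like this minimal sketch. This is an assumption-laden illustration, not the asker's actual code: the component name `kb_matcher`, the `app_params` keys, and the use of `doc.user_data` are all hypothetical, and it assumes spaCy v3's `Language.factory` registration.

```python
import spacy
from spacy.language import Language


class KBMatcher:
    """Hypothetical custom component that takes per-doc app parameters."""

    def __init__(self, nlp):
        pass

    def __call__(self, doc, app_params=None):
        # Per-doc values arrive via component_cfg when calling nlp() directly.
        app_params = app_params or {}
        doc.user_data["text_id"] = app_params.get("text_id")
        doc.user_data["uid"] = app_params.get("uid")
        return doc


@Language.factory("kb_matcher")
def create_kb_matcher(nlp, name):
    return KBMatcher(nlp)


nlp = spacy.blank("en")
nlp.add_pipe("kb_matcher")

# Single-doc call: component_cfg routes the dict to the component as a
# keyword argument -- this is the part that has no per-doc equivalent
# in nlp.pipe(), which is the question being asked.
doc = nlp(
    "Some text to process.",
    component_cfg={"kb_matcher": {"app_params": {"text_id": 42, "uid": "a1"}}},
)
```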
-
Hey @lingvisa - thanks for the question.

I think to get where you need to go, we'll need to refactor your current solution slightly. If I understand your problem right, what you're trying to do is look up some document-specific information within your component. Currently you're doing this by passing a document-specific dictionary to `component_cfg`, but it doesn't really belong there because the information isn't really a property of the component, but a property of the document -- this just happens to work because you're reconfiguring the component every time it's called to work with the specific doc it's called on.

Typically the way to solve this problem is to attach that information to the document itself when it's created as a custom attribute, then access that information within the component. Here's a potential solution with that in mind. I'm assuming in this case you have a list of input dicts, each with `content`, `tid`, and `uid` keys:

```python
from spacy.tokens import Doc

Doc.set_extension("text_id", default=None)
Doc.set_extension("uid", default=None)

docs = []
for input_doc in input_docs:
    doc = nlp.make_doc(input_doc['content'])
    doc._.text_id = input_doc['tid']
    doc._.uid = input_doc['uid']
    docs.append(doc)
```

Now you might need to do something with this information inside a custom component - here's what that looks like:

```python
import time

class KBMatcher(object):
    def __init__(self, nlp, disable):
        ...

    def __call__(self, doc):
        start = time.time()
        text_id = doc._.text_id
        print(f"This doc's text_id is {text_id}")
        ...
```

Now if you do that, when you call …
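Putting the pieces above together, here is a runnable end-to-end sketch of the suggested approach: per-doc data is attached as custom extension attributes, and the component reads it from the `Doc` rather than from `component_cfg`, so nothing changes when you switch to `nlp.pipe()`. The component name `kb_matcher` and the input keys are assumptions carried over from the snippets above, and it assumes spaCy v3.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the per-doc attributes once, at import time.
Doc.set_extension("text_id", default=None)
Doc.set_extension("uid", default=None)


@Language.factory("kb_matcher")
def create_kb_matcher(nlp, name):
    def kb_matcher(doc):
        # The component reads per-doc data from the extension, not from
        # component_cfg, so it works unchanged under nlp.pipe().
        _ = doc._.text_id
        return doc
    return kb_matcher


nlp = spacy.blank("en")
nlp.add_pipe("kb_matcher")

input_docs = [
    {"content": "first text", "tid": 1, "uid": "a"},
    {"content": "second text", "tid": 2, "uid": "b"},
]

docs = []
for input_doc in input_docs:
    doc = nlp.make_doc(input_doc["content"])
    doc._.text_id = input_doc["tid"]
    doc._.uid = input_doc["uid"]
    docs.append(doc)

# nlp.pipe() accepts pre-made Doc objects, so the extension values travel
# with each doc. Adding n_process=2 here would enable multiprocessing
# (register extensions at import time so worker processes also have them).
results = list(nlp.pipe(docs))
```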
-
@pmbaumgartner Please suggest an alternative solution. For some reason, it is getting stuck at `pip install -r requirements.txt` on an Ubuntu machine in a conda environment. I may have to revert back to 3.10 for the moment. I am building from the spaCy 2.24 version. I can't simply install spaCy from the build because I have some minor modifications to the code for my needs. I am also attaching my requirements.txt, which I modified from the original spaCy requirements.txt.
-
@pmbaumgartner The upgrade was successful. I am now having another problem with multiprocessing. In my pipeline, I have a bunch of custom components defined by spaCy's conventions. One of them is a classifier trained with a BERT model, and this DocClassifier component simply loads the BERT model and calls its classify() function. I didn't use spaCy's internal training tools. This all works fine with single processing. But with multiprocessing on, the execution gets stuck at the classify() function without giving any message. It hangs forever. If each model is loaded into a different process through the nlp.pipe() function, this shouldn't happen. Any idea why this would happen? If I remove the classifier component, multiprocessing works fine.