How to use multiprocessing when there is an app parameter to be passed in to custom components? #10701
-
I created a standard spaCy custom component:

Here I passed an `app_params` dict as an additional parameter to the `__call__` method, and get a `text_id` and `uid` value from the parameter dict. In single-process mode, I use the custom component this way:

If I want to use multiprocessing like below:

How can I pass the `component_cfg` to the `pipe()` function? The `component_cfg` represents a set of app parameters specific to a single doc, but the `pipe` method takes a batch of texts as input.
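The original code snippets didn't survive in this thread, but the single-process pattern described above could look something like this minimal sketch. This is an assumption-laden illustration, not the asker's actual code: the component name `kb_matcher`, the `app_params` keys, and the use of `doc.user_data` are all hypothetical, and it assumes spaCy v3's `Language.factory` registration.

```python
import spacy
from spacy.language import Language


class KBMatcher:
    """Hypothetical custom component that takes per-doc app parameters."""

    def __init__(self, nlp):
        pass

    def __call__(self, doc, app_params=None):
        # Per-doc values arrive via component_cfg when calling nlp() directly.
        app_params = app_params or {}
        doc.user_data["text_id"] = app_params.get("text_id")
        doc.user_data["uid"] = app_params.get("uid")
        return doc


@Language.factory("kb_matcher")
def create_kb_matcher(nlp, name):
    return KBMatcher(nlp)


nlp = spacy.blank("en")
nlp.add_pipe("kb_matcher")

# Single-doc call: component_cfg routes the dict to the component as a
# keyword argument -- this is the part that has no per-doc equivalent
# in nlp.pipe(), which is the question being asked.
doc = nlp(
    "Some text to process.",
    component_cfg={"kb_matcher": {"app_params": {"text_id": 42, "uid": "a1"}}},
)
```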
-
Hey @lingvisa - thanks for the question.

I think to get where you need to go, we'll need to refactor your current solution slightly. If I understand your problem right, what you're trying to do is look up some document-specific information within your component. Currently you're doing this by passing a document-specific dictionary to `component_cfg`, but it doesn't really belong there because the information isn't really a property of the component, but a property of the document -- this just happens to work because you're reconfiguring the component every time it's called to work with the specific doc it's called on.

Typically the way to solve this problem is to attach that information to the document itself when it's created as a custom attribute, then access that information within the component. Here's a potential solution with that in mind. I'm assuming in this case you have a list of input dicts, each with `content`, `tid`, and `uid` keys:

```python
from spacy.tokens import Doc

Doc.set_extension("text_id", default=None)
Doc.set_extension("uid", default=None)

docs = []
for input_doc in input_docs:
    doc = nlp.make_doc(input_doc['content'])
    doc._.text_id = input_doc['tid']
    doc._.uid = input_doc['uid']
    docs.append(doc)
```

Now you might need to do something with this information inside a custom component - here's what that looks like:

```python
import time

class KBMatcher(object):
    def __init__(self, nlp, disable):
        ...

    def __call__(self, doc):
        start = time.time()
        text_id = doc._.text_id
        print(f"This doc's text_id is {text_id}")
        ...
```

Now if you do that, when you call …
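Putting the pieces above together, here is a runnable end-to-end sketch of the suggested approach: per-doc data is attached as custom extension attributes, and the component reads it from the `Doc` rather than from `component_cfg`, so nothing changes when you switch to `nlp.pipe()`. The component name `kb_matcher` and the input keys are assumptions carried over from the snippets above, and it assumes spaCy v3.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the per-doc attributes once, at import time.
Doc.set_extension("text_id", default=None)
Doc.set_extension("uid", default=None)


@Language.factory("kb_matcher")
def create_kb_matcher(nlp, name):
    def kb_matcher(doc):
        # The component reads per-doc data from the extension, not from
        # component_cfg, so it works unchanged under nlp.pipe().
        _ = doc._.text_id
        return doc
    return kb_matcher


nlp = spacy.blank("en")
nlp.add_pipe("kb_matcher")

input_docs = [
    {"content": "first text", "tid": 1, "uid": "a"},
    {"content": "second text", "tid": 2, "uid": "b"},
]

docs = []
for input_doc in input_docs:
    doc = nlp.make_doc(input_doc["content"])
    doc._.text_id = input_doc["tid"]
    doc._.uid = input_doc["uid"]
    docs.append(doc)

# nlp.pipe() accepts pre-made Doc objects, so the extension values travel
# with each doc. Adding n_process=2 here would enable multiprocessing
# (register extensions at import time so worker processes also have them).
results = list(nlp.pipe(docs))
```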
-
@pmbaumgartner Please suggest an alternative solution. For some reason, it is getting stuck at `pip install -r requirements.txt` on an Ubuntu machine in a conda environment. I may have to revert back to 3.10 for the moment. I am building from the spaCy 2.24 version. I can't simply install spaCy from the build because I have some minor modifications to the code for my needs. I am also attaching my requirements.txt, which I modified from the original spaCy requirements.txt.
-
@pmbaumgartner The upgrade was successful. I am now having another problem with multiprocessing. In my pipeline, I have a bunch of custom components defined by spaCy's conventions. One of them is a classifier trained with a BERT model, and this DocClassifier component simply loads the BERT model and calls its classify() function. I didn't use spaCy's internal training tools. This all works fine with single processing. But with multiprocessing on, the execution gets stuck at the classify() function without giving any message. It hangs forever. If each model is loaded into a different process through the nlp.pipe() function, this shouldn't happen. Any idea why this would happen? If I remove the classifier component, multiprocessing works fine.