How to implement the best stream of texts in multiprocessing with spaCy? #10838
Replies: 2 comments
-
Sorry you've been having trouble with this. You are correct that the […]
This will not cause an error, but […]
-
My needs are mostly for multiprocessing purposes. I can receive texts either one by one or in bursts, hence my need to process them in parallel. I want to use the pipe method to keep memory usage low: with the classic method, if I'm not mistaken, I have to instantiate two objects, which takes more memory. I have created a PR to show the changes I would make to get the stream of documents working.
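For illustration, a minimal sketch of the contrast being drawn here (the model name and the generator are assumptions for the example, not from this thread):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def text_generator():
    # Placeholder for texts arriving one by one or in bursts.
    yield from ["First incoming text.", "Second incoming text."]

# "Classic" approach: one nlp(...) call per text. Parallelising this
# yourself means instantiating and managing extra objects/processes.
doc = nlp("A single incoming text.")

# Streaming approach: one long-lived nlp.pipe(...) over a generator,
# letting spaCy batch the texts and fan them out to worker processes.
for doc in nlp.pipe(text_generator(), n_process=2):
    print(doc.ents)
```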
-
I want to put an API in front of spaCy and process a continuous flow of texts to be analysed. For optimization purposes, I also want my service to use multiprocessing so it can handle several documents simultaneously.

So I create a single pipe with `nlp.pipe()` (to avoid recreating it on every call, which is expensive) and pass it a generator that continuously yields the texts as soon as they arrive from an API call. I have a problem because I think the `.pipe()` method was not designed for this use:
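For context, a rough sketch of this kind of setup (the queue, the `handle_request` hook, and the model name are illustrative assumptions, not from the thread):

```python
import queue
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

incoming = queue.Queue()

def handle_request(text):
    # Called by the API layer for each incoming request.
    incoming.put(text)

def text_stream():
    # Continuously yields texts as soon as they arrive; blocks while
    # waiting for the next API call.
    while True:
        yield incoming.get()

# A single pipe, created once and consumed for the service's lifetime.
for doc in nlp.pipe(text_stream(), n_process=2, batch_size=1):
    print(doc.ents)  # hand the analysis back to the caller
```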
Indeed, I encounter a first problem as soon as I launch it: the pipe does not run until I have yielded 4 documents. This is because the `_multiprocessing_pipe` method makes 2 calls to `sender.send()`, which means it has to loop over `self.data` (the flow of texts) and pull the first 4 texts (because `n_process = 2` and there are 2 calls to `.send()`). As long as those 4 texts have not been retrieved, the thread remains blocked at this point.

spaCy/spacy/language.py, lines 1613 to 1616 in 7ce3460
spaCy/spacy/language.py, lines 2212 to 2218 in 7ce3460
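A minimal way to observe this first problem, assuming `en_core_web_sm` is installed (the generator here is just an illustration):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def arriving_texts():
    for i in range(10):
        print(f"yielded text {i}")
        yield f"Text number {i}."

# With n_process=2 and batch_size=1, no Doc comes out until 4 texts
# have been pulled from the generator: 2 initial sender.send() calls,
# each dispatching one batch to each of the 2 worker processes.
for doc in nlp.pipe(arriving_texts(), n_process=2, batch_size=1):
    print("processed:", doc.text)
```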
I encounter a second problem after the previous one, but it is related: the processed documents are only returned in pairs (in the case where `n_process = 2`). Once the condition `if i % batch_size == 0:` is true, the `.step()` method calls `.send()`, which reads the generator.

spaCy/spacy/language.py, lines 1646 to 1648 in 7ce3460
This means that if only one document is yielded, the method remains stuck waiting for a second document, and consequently never returns the previously yielded (and processed) document: the thread is blocked in `sender.step()` and therefore cannot loop over `byte_tuples`.

spaCy/spacy/language.py, lines 1634 to 1648 in 7ce3460
Here is a code sample:
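The original sample isn't preserved in this thread; a minimal reconstruction of the behaviour described above might look like this (again assuming `en_core_web_sm` is installed):

```python
import time
import spacy

nlp = spacy.load("en_core_web_sm")

def trickle():
    # Enough texts to get past the initial priming (the first problem)...
    for i in range(4):
        yield f"Initial text {i}."
    # ...then a single further text arrives, with nothing after it.
    yield "A lone fifth text."
    time.sleep(3600)  # simulate: no further API call for a long time

# With n_process=2, the fifth document is processed but never yielded
# while the generator is stalled: the pipe is blocked in sender.step(),
# waiting for a second text to fill the next batch, and so never loops
# over byte_tuples to return the Doc that is already done.
for doc in nlp.pipe(trickle(), n_process=2, batch_size=1):
    print("processed:", doc.text)  # the fifth text never appears here
```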
One way to solve the problem is to put the sender in a dedicated thread. It works, but I don't know if this is the right way to do it.
Is it good practice, and should I create a PR?
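For reference, here is a simplified, self-contained illustration of that idea, not spaCy's actual code: the blocking send loop is moved into its own thread, so the consuming side can yield each result as soon as a worker finishes it.

```python
import multiprocessing as mp
import threading

def worker(in_q, out_q):
    # Stand-in for a spaCy worker process: "processes" each text.
    for text in iter(in_q.get, None):
        out_q.put(text.upper())
    out_q.put(None)  # tell the consumer this worker is done

def pipe(texts, n_process=2):
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(in_q, out_q))
             for _ in range(n_process)]
    for p in procs:
        p.start()

    def send_all():
        # Dedicated sender thread: if the input generator stalls, only
        # this thread blocks; finished results still flow out of out_q.
        for text in texts:
            in_q.put(text)
        for _ in procs:
            in_q.put(None)  # one poison pill per worker

    threading.Thread(target=send_all, daemon=True).start()

    finished = 0
    while finished < n_process:
        item = out_q.get()
        if item is None:
            finished += 1
        else:
            yield item
    for p in procs:
        p.join()

if __name__ == "__main__":  # needed for multiprocessing on spawn platforms
    for result in pipe(iter(["one", "two", "three"])):
        print(result)
```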