How to implement the best stream of texts in multiprocessing with spaCy? #10838
Replies: 2 comments
-
Sorry you've been having trouble with this. You are correct that the […]
This will not cause an error, but […]
-
My needs are mostly for multiprocessing purposes. I can receive texts either one by one or in bursts, hence my need to process them in parallel. I want to use the pipe method to keep memory usage low: with the classic method, if I'm not mistaken, I have to instantiate two objects, which takes more memory. I have created a PR to show the changes I would make to get the stream of documents working.
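For illustration, a minimal sketch of the contrast being drawn here (the model name and the generator are assumptions for the example, not from this thread):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def text_generator():
    # Placeholder for texts arriving one by one or in bursts.
    yield from ["First incoming text.", "Second incoming text."]

# "Classic" approach: one nlp(...) call per text. Parallelising this
# yourself means instantiating and managing extra objects/processes.
doc = nlp("A single incoming text.")

# Streaming approach: one long-lived nlp.pipe(...) over a generator,
# letting spaCy batch the texts and fan them out to worker processes.
for doc in nlp.pipe(text_generator(), n_process=2):
    print(doc.ents)
```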
-
I want to put an API in front of spaCy and process a continuous flow of texts to be analysed. For optimization purposes, I also want my service to use multiprocessing so it can handle several documents simultaneously.

So I create a single pipe with `nlp.pipe()` (to avoid recreating it on every call, which is expensive) and pass it a generator that continuously yields the texts as soon as they arrive from an API call. I have a problem because I think the `.pipe()` method was not designed for this use:
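For context, a rough sketch of this kind of setup (the queue, the `handle_request` hook, and the model name are illustrative assumptions, not from the thread):

```python
import queue
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

incoming = queue.Queue()

def handle_request(text):
    # Called by the API layer for each incoming request.
    incoming.put(text)

def text_stream():
    # Continuously yields texts as soon as they arrive; blocks while
    # waiting for the next API call.
    while True:
        yield incoming.get()

# A single pipe, created once and consumed for the service's lifetime.
for doc in nlp.pipe(text_stream(), n_process=2, batch_size=1):
    print(doc.ents)  # hand the analysis back to the caller
```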
Indeed, I encounter a first problem as soon as I launch it: the pipe does not run until I have yielded 4 documents. This is because the `_multiprocessing_pipe` method makes 2 calls to `sender.send()`, which means it has to loop over `self.data` (the flow of texts) and pull the first 4 texts (because `n_process = 2` and there are 2 calls to `.send()`). As long as those 4 texts have not been retrieved, the thread remains blocked at this point.

spaCy/spacy/language.py, lines 1613 to 1616 in 7ce3460
spaCy/spacy/language.py, lines 2212 to 2218 in 7ce3460
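A minimal way to observe this first problem, assuming `en_core_web_sm` is installed (the generator here is just an illustration):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def arriving_texts():
    for i in range(10):
        print(f"yielded text {i}")
        yield f"Text number {i}."

# With n_process=2 and batch_size=1, no Doc comes out until 4 texts
# have been pulled from the generator: 2 initial sender.send() calls,
# each dispatching one batch to each of the 2 worker processes.
for doc in nlp.pipe(arriving_texts(), n_process=2, batch_size=1):
    print("processed:", doc.text)
```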
I encounter a second problem after the previous one, but it is related: the processed documents are only returned in pairs (in the case where `n_process = 2`). Once the condition `if i % batch_size == 0:` is true, the `.step()` method calls `.send()`, which reads the generator.

spaCy/spacy/language.py, lines 1646 to 1648 in 7ce3460
This means that if only one document is yielded, the method remains stuck waiting for a second document, and consequently never returns the previously yielded (and processed) document: the thread is blocked in `sender.step()` and therefore cannot loop over `byte_tuples`.

spaCy/spacy/language.py, lines 1634 to 1648 in 7ce3460
Here is a code sample:
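The original sample isn't preserved in this thread; a minimal reconstruction of the behaviour described above might look like this (again assuming `en_core_web_sm` is installed):

```python
import time
import spacy

nlp = spacy.load("en_core_web_sm")

def trickle():
    # Enough texts to get past the initial priming (the first problem)...
    for i in range(4):
        yield f"Initial text {i}."
    # ...then a single further text arrives, with nothing after it.
    yield "A lone fifth text."
    time.sleep(3600)  # simulate: no further API call for a long time

# With n_process=2, the fifth document is processed but never yielded
# while the generator is stalled: the pipe is blocked in sender.step(),
# waiting for a second text to fill the next batch, and so never loops
# over byte_tuples to return the Doc that is already done.
for doc in nlp.pipe(trickle(), n_process=2, batch_size=1):
    print("processed:", doc.text)  # the fifth text never appears here
```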
One way to solve the problem is to put the sender in a dedicated thread. It works, but I don't know if this is the right way to do it.
Is it good practice, and should I create a PR?
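For reference, here is a simplified, self-contained illustration of that idea, not spaCy's actual code: the blocking send loop is moved into its own thread, so the consuming side can yield each result as soon as a worker finishes it.

```python
import multiprocessing as mp
import threading

def worker(in_q, out_q):
    # Stand-in for a spaCy worker process: "processes" each text.
    for text in iter(in_q.get, None):
        out_q.put(text.upper())
    out_q.put(None)  # tell the consumer this worker is done

def pipe(texts, n_process=2):
    in_q, out_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(in_q, out_q))
             for _ in range(n_process)]
    for p in procs:
        p.start()

    def send_all():
        # Dedicated sender thread: if the input generator stalls, only
        # this thread blocks; finished results still flow out of out_q.
        for text in texts:
            in_q.put(text)
        for _ in procs:
            in_q.put(None)  # one poison pill per worker

    threading.Thread(target=send_all, daemon=True).start()

    finished = 0
    while finished < n_process:
        item = out_q.get()
        if item is None:
            finished += 1
        else:
            yield item
    for p in procs:
        p.join()

if __name__ == "__main__":  # needed for multiprocessing on spawn platforms
    for result in pipe(iter(["one", "two", "three"])):
        print(result)
```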