Memory leak when processing a large number of documents with Spacy transformers #12093
-
I have a spaCy DistilBERT transformer model trained for NER. When I use this model for predictions on a large corpus of documents, RAM usage spikes quickly and then keeps increasing over time until I run out of memory and my process gets killed. The following code can be used to reproduce the error with the en_core_web_trf model.
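A minimal sketch of this kind of reproduction (not the original script), assuming the public en_core_web_trf pipeline and a placeholder generator standing in for the real stream of web pages:

```python
import spacy

# Sketch only: en_core_web_trf stands in for the custom DistilBERT NER model,
# and the generator below stands in for the real corpus of web-page texts.
nlp = spacy.load("en_core_web_trf")

def text_stream():
    for _ in range(1_000_000):
        yield (
            "Apple is looking at buying a U.K. startup for $1 billion. "
            "The deal was reported in London on Tuesday. "
        ) * 10

for doc in nlp.pipe(text_stream(), batch_size=32):
    # Consume the entities and discard the Doc; resident memory still climbs
    # steadily over time (profiled here with Memray, as noted below).
    _ = [(ent.text, ent.label_) for ent in doc.ents]
```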
Environment:
Additional info: Using Memray for memory usage analysis:
-
Thanks for the report and the code! A couple of questions:
-
Hi @shadeMe, I have attached three files:
About the documents being processed: I am running this on a stream of English web pages. I am quite confident that these are all English documents (barring some minimal noise) because of the way I source them. I am not sure whether this also happens with other languages. It is a stream of new content, and the shape of the memory usage graph is always as seen in model_predict_stream_long_run.
-
Thanks for the extra context. After some extensive testing, we were able to reproduce the same memory behaviour, but the potential causes do not seem to point to a memory leak. Let's move this to the discussion forum, as the underlying issue is not a bug per se.

Background

A bit of background on how the `transformer` pipeline works during inference: the user passes in strings or `Doc` instances to the model's `pipe` method. In the case of the former, the model first runs the tokenizer on the strings and constructs `Doc` objects, since pipeline components only work with `Doc` inputs.

When a batch of documents is passed to the `predict` method of the `Transformer` pipe, it has to split the t…

As you can imagine, the complexity of the above process requires us to maintain additional state for book-keeping and lazy-evaluation purposes. When combined with the transformer representations, each …

Profiling results

During our testing, we only noticed ballooning memory usage when the … The above graph is from a profiling session where the …

Re. vocab length: given that you're processing web pages, there's a high chance of the model encountering novel tokens that are not found in its pre-trained vocabulary. This results in their being added to its string store. While this also contributes to the increase in memory usage, it will likely be eclipsed by the transformer data. Nevertheless, periodically reloading the model should reset its vocabulary.

Misc

One further point of note: …
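As a rough illustration of the periodic-reload suggestion above, a long-running stream can be processed in fixed-size chunks with a fresh pipeline per chunk; the chunk size below is an arbitrary illustrative value, and ner_stream is a hypothetical helper, not part of spaCy:

```python
import itertools

import spacy

CHUNK_SIZE = 10_000  # arbitrary illustrative interval, not a recommended value

def ner_stream(texts):
    texts = iter(texts)
    while True:
        chunk = list(itertools.islice(texts, CHUNK_SIZE))
        if not chunk:
            break
        # A fresh pipeline per chunk: the string store (and any other state
        # accumulated while streaming) is discarded along with the old object.
        nlp = spacy.load("en_core_web_trf")
        for doc in nlp.pipe(chunk, batch_size=32):
            yield [(ent.text, ent.label_) for ent in doc.ents]
        print(f"strings in store for this chunk: {len(nlp.vocab.strings)}")
```

Reloading trades the one-off cost of loading the transformer weights against keeping the vocabulary bounded; as noted above, the transformer data attached to the processed documents is likely the larger contributor, so this only addresses part of the growth.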