Memory leak when processing a large number of documents with Spacy transformers #12093
-
I have a spaCy DistilBERT transformer model trained for NER. When I use this model for predictions on a large corpus of documents, RAM usage spikes quickly and then keeps increasing over time until I run out of memory and my process gets killed. The following code can be used to reproduce the error with the en_core_web_trf model.
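A minimal sketch of this kind of reproduction (not the original script), assuming the public en_core_web_trf pipeline and a placeholder generator standing in for the real stream of web pages:

```python
import spacy

# Sketch only: en_core_web_trf stands in for the custom DistilBERT NER model,
# and the generator below stands in for the real corpus of web-page texts.
nlp = spacy.load("en_core_web_trf")

def text_stream():
    for _ in range(1_000_000):
        yield (
            "Apple is looking at buying a U.K. startup for $1 billion. "
            "The deal was reported in London on Tuesday. "
        ) * 10

for doc in nlp.pipe(text_stream(), batch_size=32):
    # Consume the entities and discard the Doc; resident memory still climbs
    # steadily over time (profiled here with Memray, as noted below).
    _ = [(ent.text, ent.label_) for ent in doc.ents]
```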
Environment:
Additional info: Using Memray for memory usage analysis:
-
Thanks for the report and the code! A couple of questions:
-
Hi @shadeMe, I have attached three files:
About the documents being processed: I am running this on a stream of English web pages. I am quite confident that these are all English documents (barring some minimal noise) because of the way I source them. I am not sure whether this also happens with other languages. It is a stream of new content, and the shape of the memory usage graph is always as seen in model_predict_stream_long_run.
-
Thanks for the extra context. After some extensive testing, we were able to reproduce the same memory behaviour, but the potential causes do not seem to point to a memory leak. Let's move this to the discussion forum, as the underlying issue is not a bug per se.

Background

A bit of background on how the `transformer` pipeline works during inference: the user passes in strings or `Doc` instances to the model's `pipe` method. In the case of the former, the model first runs the tokenizer on the strings and constructs `Doc` objects, since pipeline components only work with `Doc` inputs.

When a batch of documents is passed to the `predict` method of the `Transformer` pipe, it has to split the t…

As you can imagine, the complexity of the above process requires us to maintain additional state for book-keeping and lazy-evaluation purposes. When combined with the transformer representations, each …

Profiling results

During our testing, we only noticed ballooning memory usage when the … The above graph is from a profiling session where the …

Re. vocab length: given that you're processing web pages, there's a high chance of the model encountering novel tokens that are not found in its pre-trained vocabulary. This results in their being added to its string store. While this also contributes to the increase in memory usage, it will likely be eclipsed by the transformer data. Nevertheless, periodically reloading the model should reset its vocabulary.

Misc

One further point of note: …
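As a rough illustration of the periodic-reload suggestion above, a long-running stream can be processed in fixed-size chunks with a fresh pipeline per chunk; the chunk size below is an arbitrary illustrative value, and ner_stream is a hypothetical helper, not part of spaCy:

```python
import itertools

import spacy

CHUNK_SIZE = 10_000  # arbitrary illustrative interval, not a recommended value

def ner_stream(texts):
    texts = iter(texts)
    while True:
        chunk = list(itertools.islice(texts, CHUNK_SIZE))
        if not chunk:
            break
        # A fresh pipeline per chunk: the string store (and any other state
        # accumulated while streaming) is discarded along with the old object.
        nlp = spacy.load("en_core_web_trf")
        for doc in nlp.pipe(chunk, batch_size=32):
            yield [(ent.text, ent.label_) for ent in doc.ents]
        print(f"strings in store for this chunk: {len(nlp.vocab.strings)}")
```

Reloading trades the one-off cost of loading the transformer weights against keeping the vocabulary bounded; as noted above, the transformer data attached to the processed documents is likely the larger contributor, so this only addresses part of the growth.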