Recommendation for processing very large documents #9170
Replies: 2 comments
-
Yes, it's probably running out of RAM. In terms of the linguistic annotation there's no benefit to processing a large document as a single unit (all the features are relatively local, usually within the same paragraph), so the solution is to break your input text into smaller segments in the most sensible way you can (section breaks, paragraph breaks, etc.) and use a maximum text length that's appropriate for your pipeline and environment. You can set `nlp.max_length` (which defaults to 1,000,000 characters) to enforce that limit, because the memory usage of components like the parser and NER grows quickly with document length.
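The chunking advice above can be sketched without any spaCy dependency. This is a minimal helper (the name `chunk_text` and the 100,000-character default are illustrative, not from the original reply) that splits on paragraph breaks first and only hard-splits a paragraph when it is itself too long:

```python
def chunk_text(text, max_length=100_000):
    """Split text into chunks no longer than max_length characters,
    preferring paragraph breaks and falling back to hard splits."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        candidate = current + "\n\n" + para if current else para
        if len(candidate) <= max_length:
            # Paragraph still fits in the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than max_length gets hard-split.
            while len(para) > max_length:
                chunks.append(para[:max_length])
                para = para[max_length:]
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `nlp.pipe(...)` so only one segment's annotations are in memory at a time. A smarter splitter would use your documents' actual section markers instead of blank lines.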
-
I work with a lot of large documents as well. I break a single document up into smaller units, like adrianeboyd said, and then combine the resulting `Doc` objects into a new single `Doc` object as per the docs here.
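A minimal sketch of that recombination step, assuming spaCy v3+ is installed (a blank English pipeline stands in for whatever trained model you actually use, and the chunks are illustrative):

```python
import spacy
from spacy.tokens import Doc

# A blank pipeline for illustration; swap in your trained model,
# e.g. spacy.load("en_core_web_sm").
nlp = spacy.blank("en")

chunks = [
    "First section of the document.",
    "Second section of the document.",
]

# Process each chunk separately so peak memory stays bounded
# by the largest chunk rather than the whole document.
docs = list(nlp.pipe(chunks))

# Recombine into one Doc. By default Doc.from_docs inserts a space
# between docs when needed (ensure_whitespace=True).
merged = Doc.from_docs(docs)
print(merged.text)
```

Annotations set by pipeline components on the individual docs are carried over into the merged `Doc`, so downstream code can treat it as if the document had been processed in one pass.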
-
I'm encountering an issue with very large texts. My document is over 300,000 characters long (it came from a PDF of approximately 40 pages), so a lot of text.
The issue is that when I pass all this text as a single string into the model, it crashes the Flask web service it runs in. If I reduce the size to something like 275,000 characters, it processes just fine.
When I tested by running the Python code on its own (not in a web service) on a more powerful machine with a lot more memory, I was able to pass the full text to spaCy and get a result. I monitored memory usage and saw that it used approximately 3.8 GB of RAM while processing the text through the model. I'm guessing that the machine running the web service didn't have enough memory available to process the text.
My question is: what recommendations, if any, do you have for dealing with very large text documents when the machine's memory is fixed and we don't know in advance how large the text could get (and therefore how much memory to allocate)?