Recommendation for processing very large documents #9170
Replies: 2 comments
-
Yes, it's probably running out of RAM. In terms of the linguistic annotation there's no benefit to processing a large document as a single unit (all the features are relatively local, usually within the same paragraph), so the solution is to break your input text into smaller segments in the most sensible way you can (section breaks, paragraph breaks, etc.) and use a maximum text length that's appropriate for your pipeline and environment. You can set `nlp.max_length` (which defaults to 1,000,000 characters) to enforce that limit, because the memory usage of components like the parser and NER grows quickly with document length.
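The chunking advice above can be sketched without any spaCy dependency. This is a minimal helper (the name `chunk_text` and the 100,000-character default are illustrative, not from the original reply) that splits on paragraph breaks first and only hard-splits a paragraph when it is itself too long:

```python
def chunk_text(text, max_length=100_000):
    """Split text into chunks no longer than max_length characters,
    preferring paragraph breaks and falling back to hard splits."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        candidate = current + "\n\n" + para if current else para
        if len(candidate) <= max_length:
            # Paragraph still fits in the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than max_length gets hard-split.
            while len(para) > max_length:
                chunks.append(para[:max_length])
                para = para[max_length:]
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to `nlp.pipe(...)` so only one segment's annotations are in memory at a time. A smarter splitter would use your documents' actual section markers instead of blank lines.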
-
I work with a lot of large documents as well. I break a single document up into smaller units, like adrianeboyd said, and then combine the resulting `Doc` objects into a new single `Doc` object as per the docs here.
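A minimal sketch of that recombination step, assuming spaCy v3+ is installed (a blank English pipeline stands in for whatever trained model you actually use, and the chunks are illustrative):

```python
import spacy
from spacy.tokens import Doc

# A blank pipeline for illustration; swap in your trained model,
# e.g. spacy.load("en_core_web_sm").
nlp = spacy.blank("en")

chunks = [
    "First section of the document.",
    "Second section of the document.",
]

# Process each chunk separately so peak memory stays bounded
# by the largest chunk rather than the whole document.
docs = list(nlp.pipe(chunks))

# Recombine into one Doc. By default Doc.from_docs inserts a space
# between docs when needed (ensure_whitespace=True).
merged = Doc.from_docs(docs)
print(merged.text)
```

Annotations set by pipeline components on the individual docs are carried over into the merged `Doc`, so downstream code can treat it as if the document had been processed in one pass.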
-
I'm encountering an issue with very large texts. My document is over 300,000 characters long (it came from a PDF of approximately 40 pages), so a lot of text.
The issue is that when I pass all this text as a single string into the model, it crashes the Flask web service it runs in. If I reduce the size to something like 275,000 characters, it processes just fine.
When I tested by running the Python code on its own (not in a web service) on a more powerful machine with a lot more memory, I was able to pass the full text to spaCy and get a result. I monitored memory usage and saw that it used approximately 3.8 GB of RAM while processing the text through the model. I'm guessing that the machine running the web service didn't have enough memory available to process the text.
My question is: what recommendations, if any, do you have for dealing with very large text documents when the machine's memory is fixed and we don't know in advance how large the text could get (and therefore how much memory to allocate)?