Batch processing of a large document #9514
-
I would like to ask for tips on batch processing a large document. Currently, I'm running out of memory.
My use case (and hopefully a benefit) is to process annotated sentences one by one, separately. So far, I need part-of-speech tags and dependency parsing, so I don't need to keep the whole annotated document in memory at the same time. On the other hand, I assume that spaCy gives better results when annotating paragraphs of sentences rather than separate sentences. The question is how to split the text into reasonable parts – can spaCy help me here? For example, split the text into chunks of a number of characters that is still safe to process, without splitting the last sentence? Something like:
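(Roughly this – just a sketch of the idea; `chunk_text` is a name I made up, and the naive split on `". "` is only a placeholder for proper sentence splitting, which is exactly the part I'm unsure about:)

```python
def chunk_text(text, max_chars=100_000):
    """Yield pieces of `text` that stay under `max_chars`,
    never cutting the last sentence of a piece in half.
    The split on ". " is only a placeholder for real sentence splitting."""
    chunk, size = [], 0
    for sentence in text.split(". "):
        if size + len(sentence) > max_chars and chunk:
            yield ". ".join(chunk) + "."
            chunk, size = [], 0
        chunk.append(sentence)
        size += len(sentence) + 2
    if chunk:
        yield ". ".join(chunk)
```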
To provide complete info, this is my configuration:
-
You are correct that spaCy can have better results working with paragraphs than individual sentences, but it doesn't benefit from context much longer than a paragraph.
There are some tools kind of like that for Transformers (see span_getters), but not for the CPU models. In particular for the CPU models, the text that's safe to process is going to be much longer than the longest segment that's useful to process. Can you just split your input into paragraphs, perhaps by splitting on double newlines?
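For example, something along these lines (just a sketch – the file name, pipeline name, and batch size are placeholders for whatever you're actually using):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # whichever pipeline you use

def paragraphs(path):
    """Yield paragraphs from a large text file, splitting on blank lines."""
    with open(path, encoding="utf8") as f:
        buf = []
        for line in f:
            if line.strip():
                buf.append(line)
            elif buf:
                yield "".join(buf)
                buf = []
        if buf:
            yield "".join(buf)

# nlp.pipe streams the paragraphs, so only one batch of docs is held
# in memory at a time; each doc can be processed and then discarded.
for doc in nlp.pipe(paragraphs("large_document.txt"), batch_size=32):
    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)
```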
-
Thanks for the hint! Yes, we can split the paragraphs by double newlines. I first want to clarify whether spaCy does this preprocessing itself or not, so we don't apply the same method to the same text twice.