Batch processing of a large document #9514
-
I would like to ask for tips on batch processing a large document. Currently, I'm running out of memory.
My use case (and hopefully a benefit) is to process annotated sentences one by one, separately. So far, I need part-of-speech tags and dependency parsing, so I don't need to keep the whole annotated document in memory at the same time. On the other hand, I assume that spaCy gives better results when annotating paragraphs of sentences rather than separate sentences. The question is how to split the text into reasonable parts – can spaCy help me here? For example, split the text into chunks of a number of characters that is still safe to process, without splitting the last sentence? Something like:
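(Roughly this – just a sketch of the idea; `chunk_text` is a name I made up, and the naive split on `". "` is only a placeholder for proper sentence splitting, which is exactly the part I'm unsure about:)

```python
def chunk_text(text, max_chars=100_000):
    """Yield pieces of `text` that stay under `max_chars`,
    never cutting the last sentence of a piece in half.
    The split on ". " is only a placeholder for real sentence splitting."""
    chunk, size = [], 0
    for sentence in text.split(". "):
        if size + len(sentence) > max_chars and chunk:
            yield ". ".join(chunk) + "."
            chunk, size = [], 0
        chunk.append(sentence)
        size += len(sentence) + 2
    if chunk:
        yield ". ".join(chunk)
```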
To provide complete info, this is my configuration:
-
You are correct that spaCy can have better results working with paragraphs than individual sentences, but it doesn't benefit from context much longer than a paragraph.
There are some tools kind of like that for Transformers (see span_getters), but not for the CPU models. In particular for the CPU models, the text that's safe to process is going to be much longer than the longest segment that's useful to process. Can you just split your input into paragraphs, perhaps by splitting on double newlines?
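For example, something along these lines (just a sketch – the file name, pipeline name, and batch size are placeholders for whatever you're actually using):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # whichever pipeline you use

def paragraphs(path):
    """Yield paragraphs from a large text file, splitting on blank lines."""
    with open(path, encoding="utf8") as f:
        buf = []
        for line in f:
            if line.strip():
                buf.append(line)
            elif buf:
                yield "".join(buf)
                buf = []
        if buf:
            yield "".join(buf)

# nlp.pipe streams the paragraphs, so only one batch of docs is held
# in memory at a time; each doc can be processed and then discarded.
for doc in nlp.pipe(paragraphs("large_document.txt"), batch_size=32):
    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)
```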
-
Thanks for the hint! Yes, we can split the paragraphs by double newlines. I first want to clarify whether spaCy does this preprocessing itself or not, so we don't apply the same method to the same text twice.