Have you tried passing a `batch_size` to `nlp.pipe`?
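
A minimal sketch of what I mean, assuming an English pipeline like `en_core_web_sm` and a list of raw strings called `texts` (both placeholders for whatever you actually use):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # swap in your own pipeline

texts = ["..."]  # your raw document strings (placeholder)

# Smaller batches trade some throughput for a lower peak memory
# footprint; the default varies by pipeline, so tune this to your RAM.
for doc in nlp.pipe(texts, batch_size=50):
    # Consume each Doc as it streams out instead of keeping them all
    # in a list at once.
    print(len(doc))
```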

Not sure what environment you're developing in, but note that Jupyter kernels often set a memory limit lower than the system memory, so you might want to look into adjusting that.
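
If you want to check what limit the kernel process is actually running under, one quick diagnostic (Unix only, and just a sketch; a hosted setup like JupyterHub may enforce limits at the container level instead, where this won't show anything):

```python
import resource

# RLIMIT_AS caps the process's virtual address space.
# resource.RLIM_INFINITY means no kernel-imposed cap.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
for name, value in (("soft", soft), ("hard", hard)):
    if value == resource.RLIM_INFINITY:
        print(f"{name} limit: unlimited")
    else:
        print(f"{name} limit: {value / 1e9:.1f} GB")
```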

> upper hundreds of documents, total ~1.5M words.

Does that mean each document is around 10,000 words? That's pretty long, so you might find it easier to slice the documents into paragraphs or other sub-units before processing (see the sketch below).
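
For example, a rough way to split on blank lines before piping (the assumption that paragraphs are separated by blank lines is mine; adjust the splitting rule to your data):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder pipeline

def paragraphs(text):
    # Treat runs of blank lines as paragraph boundaries
    # (assumption: your documents actually use blank lines this way).
    for para in text.split("\n\n"):
        para = para.strip()
        if para:
            yield para

documents = ["..."]  # your ~10k-word documents (placeholder)

# Flatten all documents into a stream of paragraph-sized texts.
units = (para for doc_text in documents for para in paragraphs(doc_text))

for doc in nlp.pipe(units, batch_size=50):
    ...  # each Doc is now paragraph-sized, so peak memory stays low
```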
