Best Practice for Large Documents #8561
morankyle started this conversation in Help: Best practices
Replies: 1 comment 1 reply
I currently have a spaCy pipeline that includes various built-in components such as ner, parser, etc., as well as a custom component for extracting SVOs and SpacyTextBlob. The model runs into issues with very large documents, and I've read that spaCy needs roughly 1GB of memory per 100,000 characters. Is there any best practice for splitting up documents before sending them through the pipeline? Or is it possible to somehow split the doc into spans and send those through the pipeline individually?
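For reference, a minimal sketch of the kind of pipeline described above, assuming the spacytextblob extension and a hypothetical custom component registered as "svo_extractor" (the real component name and extraction logic will differ):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc
from spacytextblob.spacytextblob import SpacyTextBlob  # importing registers the "spacytextblob" factory

# Custom extension to hold the extracted triples; "svos" is a hypothetical name.
Doc.set_extension("svos", default=None, force=True)

@Language.component("svo_extractor")  # hypothetical custom component
def svo_extractor(doc):
    # Placeholder: a real implementation would walk the dependency parse
    # and collect (subject, verb, object) triples here.
    doc._.svos = []
    return doc

nlp = spacy.load("en_core_web_sm")        # ships with tagger, parser, ner, ...
nlp.add_pipe("spacytextblob")             # polarity/subjectivity via SpacyTextBlob
nlp.add_pipe("svo_extractor", last=True)  # custom SVO extraction at the end
```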
-
If you're running into memory errors due to large individual docs, you should usually be able to split them into paragraphs without a loss in performance. This is mentioned in the same comment that suggests 1GB per 100,000 characters, actually.
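A minimal sketch of that approach, assuming paragraphs are separated by blank lines (swap in whatever delimiter your documents actually use):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # or your full custom pipeline

def split_into_paragraphs(text):
    """Split raw text into non-empty paragraph strings on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def process_large_document(text, batch_size=50):
    """Yield one Doc per paragraph instead of building a single huge Doc.

    Each paragraph stays well under nlp.max_length (1,000,000 characters by
    default), so memory use is bounded by the batch rather than the document.
    """
    paragraphs = split_into_paragraphs(text)
    yield from nlp.pipe(paragraphs, batch_size=batch_size)

# Usage: collect whatever you need from the per-paragraph Docs, e.g. entities.
# with open("big_document.txt", encoding="utf-8") as f:
#     docs = list(process_large_document(f.read()))
```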