Best Practice for Large Documents #8561
morankyle started this conversation in Help: Best practices
Replies: 1 comment 1 reply
I currently have a spaCy pipeline that includes various built-in components such as ner, parser, etc., as well as a custom component for extracting SVOs and SpacyTextBlob. The model runs into issues with very large documents, and I've read that spaCy needs roughly 1GB of memory per 100,000 characters. Is there any best practice for splitting up documents before sending them through the pipeline? Or is it possible to somehow split the doc into spans and send those through the pipeline individually?
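For reference, a minimal sketch of the kind of pipeline described above, assuming the spacytextblob extension and a hypothetical custom component registered as "svo_extractor" (the real component name and extraction logic will differ):

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc
from spacytextblob.spacytextblob import SpacyTextBlob  # importing registers the "spacytextblob" factory

# Custom extension to hold the extracted triples; "svos" is a hypothetical name.
Doc.set_extension("svos", default=None, force=True)

@Language.component("svo_extractor")  # hypothetical custom component
def svo_extractor(doc):
    # Placeholder: a real implementation would walk the dependency parse
    # and collect (subject, verb, object) triples here.
    doc._.svos = []
    return doc

nlp = spacy.load("en_core_web_sm")        # ships with tagger, parser, ner, ...
nlp.add_pipe("spacytextblob")             # polarity/subjectivity via SpacyTextBlob
nlp.add_pipe("svo_extractor", last=True)  # custom SVO extraction at the end
```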
-
If you're running into memory errors due to large individual docs, you should usually be able to split them into paragraphs without a loss in performance. This is mentioned in the same comment that suggests 1GB per 100,000 characters, actually.
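A minimal sketch of that approach, assuming paragraphs are separated by blank lines (swap in whatever delimiter your documents actually use):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # or your full custom pipeline

def split_into_paragraphs(text):
    """Split raw text into non-empty paragraph strings on blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def process_large_document(text, batch_size=50):
    """Yield one Doc per paragraph instead of building a single huge Doc.

    Each paragraph stays well under nlp.max_length (1,000,000 characters by
    default), so memory use is bounded by the batch rather than the document.
    """
    paragraphs = split_into_paragraphs(text)
    yield from nlp.pipe(paragraphs, batch_size=batch_size)

# Usage: collect whatever you need from the per-paragraph Docs, e.g. entities.
# with open("big_document.txt", encoding="utf-8") as f:
#     docs = list(process_large_document(f.read()))
```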