NER on large documents #8303
-
It sounds like you should be preprocessing your text to split it into smaller documents, perhaps on the order of a few paragraphs, and then feeding those to spaCy. There isn't a setting to do this, but it should be very easy to do by looking for blank lines or something similar. You can set max_length, though that won't handle segmentation for you; it'll just throw an error if a document is too long. Usually when we talk about the "tokenizer" in spaCy we mean how text is split into words, like how "don't" becomes "do" and "n't", not how documents are segmented.
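As a rough sketch of the preprocessing I mean (the blank-line regex, the pipeline name, and the file path here are just placeholders for whatever suits your data):

```python
import re

import spacy

nlp = spacy.load("en_core_web_sm")  # or whichever pipeline you're using

def split_into_paragraphs(text):
    # Split on one or more blank lines and drop empty pieces.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

with open("big_document.txt") as f:  # placeholder path
    text = f.read()

# Feed the smaller pieces to spaCy instead of the whole document.
for paragraph in split_into_paragraphs(text):
    doc = nlp(paragraph)
    print([(ent.text, ent.label_) for ent in doc.ents])
```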
I'm not sure what you mean by the "Index" part, though. Do you mean there's a table of contents in the document?
-
Thanks for the help. max_length may help, but let me try to explain the problem a little better. Index example...
-
Is there any way to configure the tokenizer of a pre-trained model?
I'm using en_core_web_trf to do NER (and matching) on large documents (up to 100 pages), some of which have 50% of the document taken up by an Index, which causes OOM errors (often without text="Index"). I was unable to find a way to configure the pre-trained model to truncate the input or avoid these OOM errors without hacking the source code; it seems only a model I train myself can be fully configured?
What's the best way to handle this, please? Should I chunk up my text and use nlp.pipe, or am I missing something obvious?
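For reference, this is roughly the kind of chunking I had in mind (the blank-line split, batch size, and file path are just guesses on my part, not a settled approach):

```python
import re

import spacy

nlp = spacy.load("en_core_web_trf")

def chunks(text):
    # Break the document on blank lines so no single piece is huge,
    # and skip anything that would still exceed max_length.
    for piece in re.split(r"\n\s*\n", text):
        piece = piece.strip()
        if piece and len(piece) < nlp.max_length:
            yield piece

with open("large_document.txt") as f:  # placeholder path
    text = f.read()

entities = []
for doc in nlp.pipe(chunks(text), batch_size=8):
    entities.extend((ent.text, ent.label_) for ent in doc.ents)
```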