NER on large documents #8303
-
It sounds like you should be preprocessing your text to split it into smaller documents, perhaps on the order of a few paragraphs, and then feeding those to spaCy. There isn't a setting to do this, but it should be very easy to do by looking for blank lines or something similar. You can set max_length, though that won't handle segmentation for you; it'll just throw an error if a document is too long. Usually when we talk about the "tokenizer" in spaCy we mean how text is split into words, like how "don't" becomes "do" and "n't", not how documents are segmented.
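As a rough sketch of the preprocessing I mean (the blank-line regex, the pipeline name, and the file path here are just placeholders for whatever suits your data):

```python
import re

import spacy

nlp = spacy.load("en_core_web_sm")  # or whichever pipeline you're using

def split_into_paragraphs(text):
    # Split on one or more blank lines and drop empty pieces.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

with open("big_document.txt") as f:  # placeholder path
    text = f.read()

# Feed the smaller pieces to spaCy instead of the whole document.
for paragraph in split_into_paragraphs(text):
    doc = nlp(paragraph)
    print([(ent.text, ent.label_) for ent in doc.ents])
```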
I'm not sure what you mean by the "Index" part, though. Do you mean there's a table of contents in the document?
-
Thanks for the help. max_length may help, but let me try to explain the problem a little better. Index example...
-
Is there any way to configure the tokenizer of a pre-trained model?
I'm using en_core_web_trf to do NER (and matching) on large documents (up to 100 pages), some of which have 50% of the document taken up by an Index, which causes OOM errors (often without text="Index"). I was unable to find a way to configure the pre-trained model to truncate the input or avoid these OOM errors without hacking the source code; it seems only a model I train myself can be fully configured?
What's the best way to handle this, please? Should I chunk up my text and use nlp.pipe, or am I missing something obvious?
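For reference, this is roughly the kind of chunking I had in mind (the blank-line split, batch size, and file path are just guesses on my part, not a settled approach):

```python
import re

import spacy

nlp = spacy.load("en_core_web_trf")

def chunks(text):
    # Break the document on blank lines so no single piece is huge,
    # and skip anything that would still exceed max_length.
    for piece in re.split(r"\n\s*\n", text):
        piece = piece.strip()
        if piece and len(piece) < nlp.max_length:
            yield piece

with open("large_document.txt") as f:  # placeholder path
    text = f.read()

entities = []
for doc in nlp.pipe(chunks(text), batch_size=8):
    entities.extend((ent.text, ent.label_) for ent in doc.ents)
```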