Textcat on large documents - best practice for performance #10174
-
Hey @mr-bjerre - typically with longer documents, I take the approach of breaking things down to the sentence level, classifying there, and then aggregating those predictions up to the whole-document level. The assumption is that there is almost always at least one sentence that indicates how a document should be classified. You could also use the predicted probabilities of each sentence to do something more sophisticated when labeling the entire document. I should note that in general it's hard to give recommendations for something like this; it can depend heavily on the content of your documents. Do you have any more details about your problem that you can share?
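A minimal sketch of that sentence-level approach, assuming a trained textcat pipeline (the `./textcat_model` path and the max-score aggregation rule are illustrative choices, not part of the original answer):

```python
import spacy

# Hypothetical path to your trained textcat pipeline; a sentencizer is
# added for segmentation if nothing in the pipeline sets sentence boundaries.
nlp = spacy.load("./textcat_model")
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer", first=True)

def classify_document(text: str) -> dict:
    # Segment first, without running textcat over the whole document.
    with nlp.select_pipes(enable=["sentencizer"]):
        sentences = [sent.text for sent in nlp(text).sents]
    # Batch the sentences through the pipeline; nlp.pipe is much faster
    # than calling nlp() once per sentence.
    doc_scores: dict = {}
    for sent_doc in nlp.pipe(sentences):
        for label, score in sent_doc.cats.items():
            # Max-aggregation: one confident sentence labels the document.
            # Mean scores or majority voting are reasonable alternatives.
            doc_scores[label] = max(doc_scores.get(label, 0.0), score)
    return doc_scores
```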
-
Hi,
I have a textcat task on large documents where speed/performance is important. As the Explosion team points out in the docs, textcat can struggle on large documents, so it might be a better idea to split the document up into smaller paragraphs and do some aggregation. My concern is that it's slower to create a Doc for each paragraph. I see the following possibilities:

1. Create a Doc for each paragraph and run the textcat on each doc.
2. Keep a single Doc and run a trained SpanCategorizer on each paragraph using a custom suggester function (see the sketch below).

I'd like to get some feedback on what is generally recommended, if that's possible.
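For reference on option 2, a suggester is a registered function that returns a Ragged array of (start, end) token offsets per doc. A rough sketch of a paragraph suggester, where the registry name and the blank-line paragraph heuristic are both illustrative assumptions:

```python
from typing import Iterable, Optional

from thinc.api import Ops, get_current_ops
from thinc.types import Ragged
from spacy.tokens import Doc
from spacy.util import registry

@registry.misc("paragraph_suggester.v1")  # hypothetical name
def build_paragraph_suggester():
    def paragraph_suggester(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []    # (start, end) token offsets across all docs
        lengths = []  # number of suggested spans per doc
        for doc in docs:
            n_before = len(spans)
            start = 0
            for i, token in enumerate(doc):
                # Naive heuristic: treat a blank line as a paragraph boundary.
                if "\n\n" in token.text or "\n\n" in token.whitespace_:
                    spans.append((start, i + 1))
                    start = i + 1
            if start < len(doc):
                spans.append((start, len(doc)))
            lengths.append(len(spans) - n_before)
        return Ragged(
            ops.asarray(spans, dtype="i").reshape((-1, 2)),
            ops.asarray(lengths, dtype="i"),
        )
    return paragraph_suggester
```

You would then point the `[components.spancat.suggester]` block of the config at `@misc = "paragraph_suggester.v1"` in place of the built-in ngram suggester.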