Textcat on large documents - best practice for performance #10174
-
Hey @mr-bjerre - typically with longer documents, I take the approach of breaking things down to the sentence level, classifying there, and then aggregating those predictions up to the whole-document level. The assumption is that there is almost always at least one sentence that indicates how a document should be classified. You could also use the predicted probabilities of each sentence to do something more sophisticated when labeling the entire document. I should note that in general it's hard to give recommendations for something like this; it can depend heavily on the content of your documents. Do you have any more details about your problem that you can share?
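A minimal sketch of that sentence-level approach, assuming a trained textcat pipeline (the `./textcat_model` path and the max-score aggregation rule are illustrative choices, not part of the original answer):

```python
import spacy

# Hypothetical path to your trained textcat pipeline; a sentencizer is
# added for segmentation if nothing in the pipeline sets sentence boundaries.
nlp = spacy.load("./textcat_model")
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer", first=True)

def classify_document(text: str) -> dict:
    # Segment first, without running textcat over the whole document.
    with nlp.select_pipes(enable=["sentencizer"]):
        sentences = [sent.text for sent in nlp(text).sents]
    # Batch the sentences through the pipeline; nlp.pipe is much faster
    # than calling nlp() once per sentence.
    doc_scores: dict = {}
    for sent_doc in nlp.pipe(sentences):
        for label, score in sent_doc.cats.items():
            # Max-aggregation: one confident sentence labels the document.
            # Mean scores or majority voting are reasonable alternatives.
            doc_scores[label] = max(doc_scores.get(label, 0.0), score)
    return doc_scores
```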
-
Hi,
I have a textcat task on large documents where speed/performance is important. As the Explosion team points out in the docs, textcat can struggle on large documents, so it might be a better idea to split the document up into smaller paragraphs and do some aggregation. My concern is that it's slower to create a Doc for each paragraph. I see the following possibilities:

1. Create a Doc for each paragraph and run the textcat on each doc.
2. Keep a single Doc and run a trained SpanCategorizer on each paragraph using a custom suggester function (see the sketch below).

I'd like to get some feedback on what is generally recommended, if that's possible.
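For reference on option 2, a suggester is a registered function that returns a Ragged array of (start, end) token offsets per doc. A rough sketch of a paragraph suggester, where the registry name and the blank-line paragraph heuristic are both illustrative assumptions:

```python
from typing import Iterable, Optional

from thinc.api import Ops, get_current_ops
from thinc.types import Ragged
from spacy.tokens import Doc
from spacy.util import registry

@registry.misc("paragraph_suggester.v1")  # hypothetical name
def build_paragraph_suggester():
    def paragraph_suggester(docs: Iterable[Doc], *, ops: Optional[Ops] = None) -> Ragged:
        if ops is None:
            ops = get_current_ops()
        spans = []    # (start, end) token offsets across all docs
        lengths = []  # number of suggested spans per doc
        for doc in docs:
            n_before = len(spans)
            start = 0
            for i, token in enumerate(doc):
                # Naive heuristic: treat a blank line as a paragraph boundary.
                if "\n\n" in token.text or "\n\n" in token.whitespace_:
                    spans.append((start, i + 1))
                    start = i + 1
            if start < len(doc):
                spans.append((start, len(doc)))
            lengths.append(len(spans) - n_before)
        return Ragged(
            ops.asarray(spans, dtype="i").reshape((-1, 2)),
            ops.asarray(lengths, dtype="i"),
        )
    return paragraph_suggester
```

You would then point the `[components.spancat.suggester]` block of the config at `@misc = "paragraph_suggester.v1"` in place of the built-in ngram suggester.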