Open
Description
During evaluation of the document classifier, we found that the ground truth data for the "unknown" class contains a number of pages that should be labelled "title page". Examples of mislabelled pages include: 11613_1.pdf, 18774_1.pdf, 19115_1.pdf, 19462_1.pdf, 20993_1.pdf, 28777_1.pdf, and 35787_1.pdf. More generally, there appears to be a gray area between "unknown" and other classes, which suggests the "unknown" class is inconsistently defined.
Task
Rather than treating "unknown" as a proper class with dedicated training data, we propose removing it as an explicit class and instead using a confidence threshold approach:
- Train the XGBoost classifier on the well-defined classes only (excluding "unknown").
- Extract a confidence score from the classifier for each prediction.
- If the confidence score falls below a defined threshold, return "unknown" instead of the top-scoring class.
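The three steps above could be sketched roughly as follows. This is a minimal illustration, not a settled design. It assumes the confidence score is the highest class probability in each row, for example as returned by `XGBClassifier.predict_proba`. The function name `predict_with_rejection`, the default threshold of 0.6, and the class names other than "title page" are hypothetical placeholders.

```python
from typing import List, Sequence, Tuple

UNKNOWN_LABEL = "unknown"


def predict_with_rejection(
    probabilities: Sequence[Sequence[float]],
    class_names: Sequence[str],
    threshold: float = 0.6,  # hypothetical default; should be tuned on validation data
) -> List[Tuple[str, float]]:
    """Map per-class probabilities to labels, falling back to "unknown".

    `probabilities` holds one row per page and one column per trained class
    (assumed here to come from something like XGBClassifier.predict_proba).
    The confidence score is taken to be the top class probability.
    """
    results = []
    for row in probabilities:
        if len(row) != len(class_names):
            raise ValueError("each row needs one probability per class")
        # Index of the top-scoring class for this page.
        best_index = max(range(len(row)), key=row.__getitem__)
        confidence = float(row[best_index])
        # Below the threshold, return "unknown" instead of the top class.
        label = class_names[best_index] if confidence >= threshold else UNKNOWN_LABEL
        results.append((label, confidence))
    return results


# Hypothetical class list: "unknown" is deliberately not a trained class.
class_names = ["title page", "text", "table"]
example_probabilities = [
    [0.92, 0.05, 0.03],  # confident prediction
    [0.40, 0.35, 0.25],  # ambiguous page, falls below the threshold
]
predictions = predict_with_rejection(example_probabilities, class_names)
```

The threshold itself would need to be chosen empirically, for instance by checking on a held-out set how many correctly classified pages are rejected at each candidate value.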