Rethink handling of the "unknown" class in document classifier #101

@christianabbet

Description

During evaluation of the document classifier, we found that the ground truth data for the "unknown" class contains a number of pages that should be labelled "title page". Examples of mislabelled pages include: 11613_1.pdf, 18774_1.pdf, 19115_1.pdf, 19462_1.pdf, 20993_1.pdf, 28777_1.pdf, and 35787_1.pdf. There also appears to be overlap between "unknown" and other classes more generally, which suggests the "unknown" class is inconsistently defined.

Task

Rather than treating "unknown" as a proper class with dedicated training data, we propose removing it as an explicit class and instead using a confidence threshold approach:

  • Train the XGBoost classifier on the well-defined classes only (excluding "unknown").
  • Extract a confidence score from the classifier for each prediction.
  • If the confidence score falls below a defined threshold, return "unknown" instead of the top-scoring class.
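
The fallback step above could look roughly like the sketch below. It assumes the per-class probabilities have already been obtained from the trained model (for example via `XGBClassifier.predict_proba`) and takes the highest class probability as the confidence score. The function name `predict_with_threshold`, the default threshold of 0.6, and the class names other than "title page" are hypothetical and only serve the illustration.

```python
from typing import Sequence

UNKNOWN_LABEL = "unknown"


def predict_with_threshold(
    probabilities: Sequence[Sequence[float]],
    class_names: Sequence[str],
    threshold: float = 0.6,  # hypothetical value, to be tuned on validation data
) -> list[str]:
    """Map per-class probabilities to labels, falling back to "unknown".

    `probabilities` holds one row per page with one probability per class,
    in the same order as `class_names` (assumed to be the classifier output).
    """
    labels = []
    for row in probabilities:
        if len(row) != len(class_names):
            raise ValueError("one probability per class is required")
        # Confidence score: the highest class probability for this page.
        best_index = max(range(len(row)), key=row.__getitem__)
        confidence = row[best_index]
        if confidence >= threshold:
            labels.append(class_names[best_index])
        else:
            # No class is confident enough, so report "unknown" instead.
            labels.append(UNKNOWN_LABEL)
    return labels


if __name__ == "__main__":
    # Hypothetical class names and probabilities, for illustration only.
    classes = ["title page", "text", "table"]
    probs = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]]
    print(predict_with_threshold(probs, classes))
```

The threshold would need to be chosen empirically, for instance by checking how the pages currently labelled "unknown" are scored once the mislabelled title pages have been corrected.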
