Add predefined title page keywords as a feature for page classification #99

@christianabbet

Description

Follow-up to #95 (review)

The current approach relies on multiple features for page classification. It would be worthwhile to add a new feature based on the predefined list of keywords in src/language_detection/pages_to_ignore.py. These keywords could help improve the classification performance of our page detection model, especially for title and section header pages.

# Existing features
{
  "features": [
    "KeywordsList",  <- Example of new feature
    "Words Per Line",
    "Word Density",
    "Mean Left",
    "Text Width",
    "Indent Std Dev",
    "Capitalization Ratio",
    "Num Map Keyword Lines",
    "Grid Line Length Sum",
    "Non Grid Line Length Sum",
    "Line Angle Entropy",
    "Line Score",
    "Num Geo Profile Keywords",
    "Num Unit Keyword",
    "Y Scale OK",
    "X Scale OK",
    "Num Valid Borehole Descriptions",
    "Num Strip Logs",
    "Num Tables",
    "Num Boreholes",
    "Num Good Sidebars",
    "Best Sidebar Score",
    "Num Long or Horizontal Lines",
    "Text Line Count"
  ]
}
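
As a rough illustration of the proposed feature, the sketch below counts how many predefined keywords appear in a page's text. The keyword entries, the constant name `TITLE_PAGE_KEYWORDS`, and the function name `count_title_keywords` are all hypothetical; the actual list in src/language_detection/pages_to_ignore.py may be named and structured differently.

```python
import re
from typing import Iterable

# Hypothetical stand-in for the keyword list in
# src/language_detection/pages_to_ignore.py; real names and entries may differ.
TITLE_PAGE_KEYWORDS = ("table of contents", "inhaltsverzeichnis", "appendix")


def count_title_keywords(
    page_text: str, keywords: Iterable[str] = TITLE_PAGE_KEYWORDS
) -> int:
    """Count the distinct keywords found in the page text.

    Matching is case-insensitive and restricted to whole words or phrases.
    """
    text = page_text.lower()
    count = 0
    for keyword in keywords:
        # Escape the keyword so it is matched literally, bounded by word breaks.
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        if re.search(pattern, text):
            count += 1
    return count
```

The resulting integer could then be stored alongside the other per-page features, for example under the "KeywordsList" name suggested above. Counting distinct keywords rather than total occurrences is one possible design choice; a boolean flag or a normalized ratio might work equally well.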
