Open
Description
Follow-up to #95 (review)
The current approach relies on multiple features for page classification. It would be interesting to add a new feature based on the predefined list of keywords in src/language_detection/pages_to_ignore.py. These keywords could help improve the classification performance of our page detection model, especially for titles and section headers.
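A minimal sketch of what such a feature could look like, assuming the page text is available as a list of lines. The keyword tuple, the function name `keyword_list_feature`, and the `top_n` parameter are all hypothetical; the real keywords live in src/language_detection/pages_to_ignore.py and may be structured differently. Restricting the search to the first few lines is one possible way to focus on titles and section headers.

```python
import re

# Hypothetical keywords for illustration only; the actual list is defined in
# src/language_detection/pages_to_ignore.py and is not reproduced here.
EXAMPLE_KEYWORDS = ("table of contents", "appendix", "references")


def keyword_list_feature(lines, keywords=EXAMPLE_KEYWORDS, top_n=5):
    """Count how many of the first `top_n` lines contain at least one keyword.

    Matching is case-insensitive and respects word boundaries, so
    "appendix" does not match inside a longer word.
    """
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    head = lines[:top_n]  # assumption: titles and headers sit near the top
    return sum(1 for line in head if any(p.search(line) for p in patterns))


if __name__ == "__main__":
    page = ["Table of Contents", "1. Introduction", "Appendix A"]
    print(keyword_list_feature(page))
```

The resulting count could then sit alongside the other numeric features as "KeywordsList"; whether a raw count, a binary flag, or a normalized ratio works best would need to be checked against the model.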
# Existing features
{
"features": [
"KeywordsList", <- Example of new feature
"Words Per Line",
"Word Density",
"Mean Left",
"Text Width",
"Indent Std Dev",
"Capitalization Ratio",
"Num Map Keyword Lines",
"Grid Line Length Sum",
"Non Grid Line Length Sum",
"Line Angle Entropy",
"Line Score",
"Num Geo Profile Keywords",
"Num Unit Keyword",
"Y Scale OK",
"X Scale OK",
"Num Valid Borehole Descriptions",
"Num Strip Logs",
"Num Tables",
"Num Boreholes",
"Num Good Sidebars",
"Best Sidebar Score",
"Num Long or Horizontal Lines",
"Text Line Count"
]
}