Add predefined title page keywords as a feature for page classification #99

Follow-up to https://github.com/swisstopo/swissgeol-assets-dataextraction/pull/95#pullrequestreview-3820107138

The current approach relies on multiple features for page classification. It would be interesting to add a new feature based on the predefined list of keywords in `src/language_detection/pages_to_ignore.py`. These keywords could help improve the classification performance of our page detection model, especially for title and section header pages.

```json
# Existing features
{
  "features": [
    "KeywordsList",  <- Example of new feature
    "Words Per Line",
    "Word Density",
    "Mean Left",
    "Text Width",
    "Indent Std Dev",
    "Capitalization Ratio",
    "Num Map Keyword Lines",
    "Grid Line Length Sum",
    "Non Grid Line Length Sum",
    "Line Angle Entropy",
    "Line Score",
    "Num Geo Profile Keywords",
    "Num Unit Keyword",
    "Y Scale OK",
    "X Scale OK",
    "Num Valid Borehole Descriptions",
    "Num Strip Logs",
    "Num Tables",
    "Num Boreholes",
    "Num Good Sidebars",
    "Best Sidebar Score",
    "Num Long or Horizontal Lines",
    "Text Line Count"
  ]
}
```
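
As a starting point for discussion, here is a minimal sketch of how such a feature could be computed. Everything in it is an assumption: the function name `count_title_keyword_lines` is hypothetical, and the `TITLE_PAGE_KEYWORDS` tuple is only a placeholder, since the real list would be imported from `src/language_detection/pages_to_ignore.py`. The sketch counts the text lines of a page that contain at least one keyword, similar in spirit to the existing "Num Map Keyword Lines" feature.

```python
import re
from collections.abc import Iterable

# Placeholder keywords for illustration only (assumption); the real list
# would come from src/language_detection/pages_to_ignore.py.
TITLE_PAGE_KEYWORDS = ("table of contents", "inhaltsverzeichnis", "annex")


def count_title_keyword_lines(
    lines: Iterable[str],
    keywords: Iterable[str] = TITLE_PAGE_KEYWORDS,
) -> int:
    """Count the lines that contain at least one keyword.

    Matching is case-insensitive and restricted to whole words, so
    "annexation" does not match the keyword "annex".
    """
    patterns = [
        re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
        for keyword in keywords
    ]
    return sum(
        1 for line in lines if any(p.search(line) for p in patterns)
    )


if __name__ == "__main__":
    page_lines = [
        "INHALTSVERZEICHNIS",
        "1. Einleitung",
        "Annex A",
        "annexation of land",
    ]
    print(count_title_keyword_lines(page_lines))  # prints 2
```

A line count keeps the feature numeric like the others in the list. A boolean flag or a count normalized by "Text Line Count" would be alternatives worth comparing.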
