
Commit 33d258c

Merge pull request #95 from swisstopo/feat/issue-89/title-page-behaviour
Change behaviour of Title page detection
2 parents 2c636e3 + 7628fda commit 33d258c

File tree

16 files changed

+5464
-6508
lines changed


.gitignore

Lines changed: 2 additions & 1 deletion
```diff
@@ -22,7 +22,8 @@ minio
 
 # IDE config
 .idea/
-.vscode/
+.vscode/*
+!.vscode/launch.json.template.jsonc
 
 # Package metadata
 *.egg-info/
```

.vscode/launch.json.template.jsonc

Lines changed: 29 additions & 0 deletions
```jsonc
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: Run dataset split (single pages)",
      "type": "debugpy",
      "request": "launch",
      "program": "src/scripts/split_data.py",
      "console": "integratedTerminal",
      "args": ["-i", "data/single_pages", "-o", "data/single_pages_splits", "-rv", "0.2", "-rt", "0.0"],
      "python": "${workspaceFolder}/venv/bin/python3",
      "env": {
        "PYTHONPATH": "${workspaceFolder}"
      }
    },
    {
      "name": "Python: Train XGBoost classification (single pages)",
      "type": "debugpy",
      "request": "launch",
      "program": "src/models/treebased/train.py",
      "console": "integratedTerminal",
      "args": ["--config-file-path", "config/xgboost_config.yml", "--out-directory", "models/xgboost_model"],
      "python": "${workspaceFolder}/venv/bin/python3",
      "env": {
        "PYTHONPATH": "${workspaceFolder}"
      }
    }
  ]
}
```

CLASSIFICATION.md

Lines changed: 122 additions & 0 deletions
# XGBoost Classification

To train the classifier, the development package needs to be installed and MLflow tracking activated.

The dataset used to train the provided model (`models/stable/model.joblib`) is internal and not publicly available. It is stored in a private S3 bucket (`stijnvermeeren-assets-data`) accessible only to the project team. The dataset is composed of 1011 labeled single-page PDFs across 9 classes, with ground truth available under `data/gt_single_pages_2026.json`. The distribution of the pages is listed below.

| Class           | Number | Percentage |
|-----------------|-------:|-----------:|
| boreprofile     |    115 |       11.4 |
| diagram         |    106 |       10.5 |
| geo_profile     |     74 |        7.3 |
| map             |    126 |       12.5 |
| section_header  |     93 |        9.2 |
| table           |     60 |        5.9 |
| text            |    202 |       20.0 |
| title_page      |    109 |       10.8 |
| unknown         |    126 |       12.5 |

The classification results on the validation set are reported below.

| Class           | Precision | Recall | F1-score |
|-----------------|----------:|-------:|---------:|
| boreprofile     |      96.7 |   87.9 |     92.1 |
| diagram         |      84.6 |   84.6 |     84.6 |
| geo_profile     |      55.6 |   71.4 |     62.5 |
| map             |      63.6 |   80.8 |     71.2 |
| section_header  |      64.7 |   73.3 |     68.8 |
| table           |      90.9 |   83.3 |     87.0 |
| text            |      84.4 |   88.4 |     86.4 |
| title_page      |      95.0 |   95.0 |     95.0 |
| unknown         |      57.9 |   39.3 |     46.8 |
| Overall (macro) |      77.0 |   78.2 |     77.1 |
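The macro scores in the last row are plain unweighted means of the per-class values; a quick sanity check (numbers copied from the validation table above):

```python
# Per-class precision and recall from the validation table above.
precision = [96.7, 84.6, 55.6, 63.6, 64.7, 90.9, 84.4, 95.0, 57.9]
recall = [87.9, 84.6, 71.4, 80.8, 73.3, 83.3, 88.4, 95.0, 39.3]

def macro(scores: list[float]) -> float:
    """Unweighted (macro) average, rounded to one decimal like the table."""
    return round(sum(scores) / len(scores), 1)

print(macro(precision))  # 77.0
print(macro(recall))     # 78.2
```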
## Train with your own data

### 1. Prepare the folder structure

Organize your labeled single-page images with one subfolder per class:

```
data/single_pages/
├── boreprofile/
├── diagram/
├── geo_profile/
├── map/
├── section_header/
├── table/
├── text/
├── title_page/
└── unknown/
```
### 2. Prepare the ground truth

The ground truth file is a JSON list of labeled documents. Follow the same format as `data/gt_single_pages.json`:

```jsonc
[
  {
    "filename": "24911_1.pdf",  // file name relative to the train/validation folder
    "metadata": {
      "page_count": 1           // total number of pages in the document
    },
    "pages": [
      {
        "page": 1,                // page number (1-indexed)
        "classification": {       // one-hot encoding of the page class
          "text": 0,
          "boreprofile": 0,
          "map": 0,
          "geo_profile": 0,
          "title_page": 1,
          "diagram": 0,
          "table": 0,
          "unknown": 0,
          "section_header": 0
        }
      }
    ]
  }
]
```
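Reading this format back is straightforward; a small sketch, assuming each one-hot dict has exactly one class set to 1 (the helper `label_for_page` is illustrative, not part of the repository):

```python
import json

def label_for_page(entry: dict, page_number: int) -> str:
    """Return the class whose one-hot value is 1 for the given page."""
    for page in entry["pages"]:
        if page["page"] == page_number:
            # Pick the single class flagged with 1 in the one-hot dict.
            return max(page["classification"], key=page["classification"].get)
    raise KeyError(f"page {page_number} not found in {entry['filename']}")

# One document in the ground-truth format shown above.
gt = json.loads("""[
  {"filename": "24911_1.pdf",
   "metadata": {"page_count": 1},
   "pages": [{"page": 1,
              "classification": {"text": 0, "boreprofile": 0, "map": 0,
                                 "geo_profile": 0, "title_page": 1, "diagram": 0,
                                 "table": 0, "unknown": 0, "section_header": 0}}]}
]""")
print(label_for_page(gt[0], 1))  # title_page
```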
### 3. Split into train and validation sets

Split the dataset with an 80/20 train-validation ratio based on filename:

```bash
python src/scripts/split_data.py \
  -i data/single_pages \
  -o data/single_pages_splits \
  -rv 0.2 \
  -rt 0.0
```
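The actual logic lives in `src/scripts/split_data.py`; a deterministic filename-based split could be sketched as follows (the hashing scheme here is an assumption for illustration, not the script's actual implementation):

```python
import hashlib

def assign_split(filename: str, val_ratio: float = 0.2) -> str:
    """Deterministically assign a file to 'train' or 'validation'.

    Hashing the filename (rather than shuffling) keeps the assignment
    stable across runs, so pages of the same file always land in the
    same split.
    """
    digest = hashlib.md5(filename.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # map the hash to a bucket in 0..99
    return "validation" if bucket < val_ratio * 100 else "train"

print(assign_split("24911_1.pdf"))
```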
### 4. Update the config

Edit `config/xgboost_config.yml` to point to your data:

```yaml
# Path to the training set
train_folder_path: "data/single_pages_splits/train"
# Path to the validation set
val_folder_path: "data/single_pages_splits/validation"
# Ground truth for model training and validation
ground_truth_file_path: "data/gt_single_pages.json"
```
### 5. Train the model

```bash
python -m src.models.treebased.train \
  --config-file-path config/xgboost_config.yml \
  --out-directory models/xgboost_model
```

The trained model will be saved under `models/xgboost_model`. On macOS, if you encounter OpenMP issues, install the library via Homebrew first:

```bash
brew install libomp
```

api/app/v1/schemas.py

Lines changed: 17 additions & 11 deletions
```diff
@@ -1,18 +1,22 @@
-from enum import Enum
-from typing import TypeAlias
+from enum import StrEnum
 
 from pydantic import BaseModel, ConfigDict, Field
 from pydantic.alias_generators import to_pascal
 
 from src.page_classes import PageClasses
 
-# dynamically created Enum to expose PascalCase class names to the API.
-PascalPageClasses = Enum(
-    "PascalPageClasses",
-    {name: to_pascal(value) for name, value in PageClasses.__members__.items()},
-    type=str,
-)
-PascalPageClasses: TypeAlias = PascalPageClasses  # pyright: ignore[reportInvalidTypeForm]
+
+class PascalPageClasses(StrEnum):
+    """Enum for classifying pages into page types."""
+
+    BOREPROFILE = "Boreprofile"
+    DIAGRAM = "Diagram"
+    GEO_PROFILE = "GeoProfile"
+    MAP = "Map"
+    TABLE = "Table"
+    TEXT = "Text"
+    TITLE_PAGE = "TitlePage"
+    UNKNOWN = "Unknown"
 
 
 class MetaDataSchema(BaseModel):
@@ -63,7 +67,7 @@ class PredictionSchema(BaseModel):
     pages: list[PagePrediction]
 
     @classmethod
-    def from_prediction(cls, prediction: dict[dict]):
+    def from_prediction(cls, prediction: dict):
         return cls(
             filename=prediction["filename"],
             metadata=MetaDataSchema.from_prediction(prediction["metadata"]),
@@ -107,9 +111,11 @@ def create_response(cls, predictions: list[dict]):
 def predicted_class(classification: PageClasses) -> PascalPageClasses:
     """Parse the predicted class from a one-hot encoded classification dictionary.
 
-    The values of the dict are the sting representation of each class in the PageClasses enum.
+    The values of the dict are the string representation of each class in the PageClasses enum.
     """
     try:
+        # Cast detected pages to Pascal equivalent
         return PascalPageClasses(to_pascal(classification))
     except ValueError:
+        # Other undefined classes such as Section Header
         return PascalPageClasses.UNKNOWN
```

config/xgboost_config.yml

Lines changed: 2 additions & 3 deletions
```diff
@@ -1,7 +1,7 @@
 model_type: xgboost
 
 train_folder_path: "data/single_pages_splits/train"
-val_folder_path: "data/single_pages_splits/val"
+val_folder_path: "data/single_pages_splits/validation"
 ground_truth_file_path: "data/gt_single_pages.json"
 
 # Feature names to track (23 features total)
@@ -34,7 +34,6 @@ feature_names:
   - Num Long or Horizontal Lines
   - Text Line Count
 
-
 hyperparameters:
   n_estimators: 600
   max_depth: 6
@@ -51,4 +50,4 @@ tuning:
   max_depth: [4, 5, 6, 7, 8]
   learning_rate: [0.01, 0.03, 0.05, 0.1, 0.2]
   subsample: [0.6, 0.8, 1.0]
-  colsample_bytree: [0.6, 0.8, 1.0]
+  colsample_bytree: [0.6, 0.8, 1.0]
```
