Page Classification for Geological Documents

This repository provides a classification pipeline that categorises PDF pages from geological reports into document classes, with the goal of supporting document understanding and metadata extraction in the Assets platform. The pipeline can be run from the command line or as a standalone API.

Classifying individual pages makes it easier to identify borehole profiles and maps in PDFs, which in turn helps to link documents on Assets with borehole profiles on Boreholes.

Features:

Classifies individual PDF pages into 8 document classes: Text, Boreprofile, Maps, TitlePage, GeoProfile, Table, Diagram, Unknown.
Two classifier backends: feature-based XGBoost (default) and Pixtral Large (via Amazon Bedrock).
REST API with versioned endpoints (V1, V2) and batch processing support.
SHAP-based model explainability for tree-based classifiers.
MLflow experiment tracking (optional).
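
For downstream code, the eight class labels listed above can be held in an enumeration. The sketch below is illustrative only: the names `PageClass` and `parse_label` are hypothetical, and the repository may represent its labels differently.

```python
from enum import Enum


class PageClass(str, Enum):
    """The eight page classes named in the feature list (hypothetical type)."""

    TEXT = "Text"
    BOREPROFILE = "Boreprofile"
    MAPS = "Maps"
    TITLE_PAGE = "TitlePage"
    GEO_PROFILE = "GeoProfile"
    TABLE = "Table"
    DIAGRAM = "Diagram"
    UNKNOWN = "Unknown"


def parse_label(label: str) -> PageClass:
    """Map a label string to a PageClass, falling back to UNKNOWN."""
    try:
        return PageClass(label)
    except ValueError:
        return PageClass.UNKNOWN
```

Falling back to `UNKNOWN` for unrecognised strings is a design choice of this sketch, not documented behaviour of the pipeline.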

Usage

1. Installation

Python >=3.11 is required. Example using a virtual environment:

python -m venv venv
source venv/bin/activate

2. Install dependencies

To install base dependencies:

pip install .

For development, install all optional tools:

pip install '.[all]'

3. Configuration

Copy the environment template and configure your settings:

cp .env.template .env

4. Running as CLI

The pipeline is run with the following command:

python main.py -i <input_path> [-g <ground_truth_path>] [-c <classifier_name>] [-p <model_path>] [-w]

The input path -i is mandatory and can be either a single PDF file or a directory. In the latter case, all PDF files in that directory will be processed. To simply obtain predictions, use -w to write the results to data/prediction.json.

python main.py -i path/to/document.pdf -w
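
The file-or-directory rule for `-i` can be sketched as follows. This is not the repository's code: the function name `collect_pdfs` is hypothetical, and the sketch assumes that directories are scanned non-recursively and that files are matched by a case-insensitive `.pdf` suffix.

```python
from pathlib import Path


def collect_pdfs(input_path: str) -> list[Path]:
    """Return the PDF files implied by an input path.

    Assumptions (not confirmed by the README): directories are scanned
    non-recursively, and PDFs are recognised by their file suffix.
    """
    path = Path(input_path)
    if path.is_file():
        if path.suffix.lower() != ".pdf":
            raise ValueError(f"Not a PDF file: {path}")
        return [path]
    if path.is_dir():
        # Sort for a deterministic processing order.
        return sorted(
            p for p in path.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
        )
    raise FileNotFoundError(f"Input path does not exist: {path}")
```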

If no classifier (-c) is specified, the default treebased classifier is used. The model path (-p / --model_path) defaults to models/stable/model.joblib. The ground truth file (-g) is optional and only required to compute accuracy metrics:

python main.py -i data/single_pages/ -g data/gt_single_pages.json

See Model Training for the ground truth file format.

| Classifier | Description |
| --- | --- |
| `treebased` | Default. Feature-based XGBoost model |
| `pixtral` | Uses Pixtral Large via Amazon Bedrock |
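
The flags and defaults described in this section could be declared with `argparse` roughly as shown here. This is a sketch, not the contents of `main.py`: only the short flags and `--model_path` come from this README, while the other long option names and the help texts are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (long names mostly assumed)."""
    parser = argparse.ArgumentParser(
        description="Classify pages of geological PDF reports."
    )
    # Mandatory: a single PDF file or a directory of PDFs.
    parser.add_argument("-i", "--input_path", required=True)
    # Optional: only needed to compute accuracy metrics.
    parser.add_argument("-g", "--ground_truth_path", default=None)
    # The two backends listed in the classifier table.
    parser.add_argument(
        "-c", "--classifier", default="treebased", choices=["treebased", "pixtral"]
    )
    # Documented default model location.
    parser.add_argument("-p", "--model_path", default="models/stable/model.joblib")
    # Write predictions to data/prediction.json when set.
    parser.add_argument("-w", "--write_result", action="store_true")
    return parser


args = build_parser().parse_args(["-i", "path/to/document.pdf", "-w"])
print(args.classifier, args.model_path, args.write_result)
# prints: treebased models/stable/model.joblib True
```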

5. Running as API

uvicorn api.api:app --reload --host 0.0.0.0 --port 8000

For detailed endpoint documentation, output formats, and local S3 setup, see the API Usage Guide.
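
Once the server is running, one way to discover the available endpoints is to read the OpenAPI document. FastAPI serves it at `/openapi.json` by default; whether this project keeps that default path, or serves a separate spec per API version, is an assumption here, so treat the following as a sketch and rely on the API Usage Guide for the authoritative routes. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def list_routes(spec: dict) -> list[str]:
    """Return "METHOD /path" strings from an OpenAPI document."""
    http_methods = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}
    routes = []
    for path, operations in sorted(spec.get("paths", {}).items()):
        for method in sorted(operations):
            # Path items may also hold non-operation keys such as "parameters".
            if method in http_methods:
                routes.append(f"{method.upper()} {path}")
    return routes


def fetch_spec(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI document (assumes FastAPI's default /openapi.json)."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)


# Usage, with the API running locally on port 8000:
#     print("\n".join(list_routes(fetch_spec())))
```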

Documentation

| Document | Description |
| --- | --- |
| API Architecture | API versioning and OpenAPI spec |
| API Usage Guide | Endpoints, output formats, MinIO setup |
| Docker Deployment | Building and running Docker images |
| Model Overview | Stable model features and usage |
| Model Explainability | SHAP interpretation for tree-based models |
| Model Training | Data, XGBoost training, hyperparameter tuning |
| Pixtral Setup | AWS Bedrock configuration |

Repository Structure

api/: FastAPI application
config/: YAML configs (models, matching, prediction profiles)
data/: Input data, predictions and ground truths
docs/: Detailed documentation
evaluation/: Evaluation and metrics
models/: Trained models (TreeBased)
prompts/: Pixtral prompts
src/: Core logic and utility scripts
tests/: Unit tests
main.py: CLI entry point

Contributing

We use pre-commit hooks with Ruff for code formatting. After installing dependencies, run:

pre-commit install

This needs to be done only once. After installing, hooks will run automatically on each git commit.

Governance

This repository is managed by the Swiss Federal Office of Topography swisstopo. The project lead and primary maintainer is Stijn Vermeeren (@stijnvermeeren-swisstopo). Support has come from external contractors at Visium and EBP. Individual contributors are listed on GitHub's Contributors page.

We welcome suggestions, bug reports and code contributions from third parties. However, the priority of any external request will have to be evaluated based on compatibility with our legal mandate as a government agency.

License

This project is released as open-source software, under the principle of "public money, public code", in accordance with the 2023 federal law "EMBAG", and following the guidance of the tools for OSS published by the Federal Chancellery.

The source code is licensed under the AGPL-3.0-only License. This is due to the licensing of certain dependencies, most notably PyMuPDF, which is only available under either the AGPL license or a commercial license. If this dependency is removed in the future, we will switch to a more permissive license for this project.
