This repository provides a classification pipeline to categorise PDF pages from geological reports into document classes, with the goal of supporting document understanding and metadata extraction in the Assets platform. The solution can be used as a standalone API.
Classifying each page individually makes it easier to identify borehole profiles and maps in PDFs, so that documents on Assets can be linked to boreprofiles on Boreholes.
Features:
- Classifies individual PDF pages into 8 document classes: Text, Boreprofile, Maps, TitlePage, GeoProfile, Table, Diagram, Unknown.
- Two classifier backends: feature-based XGBoost (default) and Pixtral Large (via Amazon Bedrock).
- REST API with versioned endpoints (V1, V2) and batch processing support.
- SHAP-based model explainability for tree-based classifiers.
- MLflow experiment tracking (optional).
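Because classification is done per page, a common follow-up step is to collapse the per-page labels into contiguous page ranges, for example to find where the borehole profiles sit in a report. The sketch below is only an illustration: `group_pages` and its input format (a plain list with one class name per page) are assumptions, not part of this repository's API. The class names are the ones listed above.

```python
from itertools import groupby

# The eight document classes listed in the features above.
DOCUMENT_CLASSES = (
    "Text", "Boreprofile", "Maps", "TitlePage",
    "GeoProfile", "Table", "Diagram", "Unknown",
)


def group_pages(labels):
    """Collapse per-page labels into (class, first_page, last_page) ranges.

    `labels` is a hypothetical list with one class name per page, in page
    order. Page numbers in the result are 1-indexed and inclusive.
    """
    ranges = []
    page = 1
    for label, run in groupby(labels):
        if label not in DOCUMENT_CLASSES:
            raise ValueError(f"unknown document class: {label!r}")
        length = sum(1 for _ in run)  # number of consecutive pages in this run
        ranges.append((label, page, page + length - 1))
        page += length
    return ranges
```

Filtering the result for `"Boreprofile"` or `"Maps"` then gives the page ranges of interest.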
Python >=3.11 is required. Example using a virtual environment:

    python -m venv venv
    source venv/bin/activate

To install base dependencies:

    pip install .

For development, install all optional tools:

    pip install '.[all]'

Copy the environment template and configure your settings:

    cp .env.template .env

The pipeline is run with the following command:

    python main.py -i <input_path> [-g <ground_truth_path>] [-c <classifier_name>] [-p <model_path>] [-w]

The input path -i is mandatory and can be either a single PDF file or a directory. In the latter case, all PDF files in that directory are processed. To simply obtain predictions, use -w to write the results to data/prediction.json.

    python main.py -i path/to/document.pdf -w

If no classifier (-c) is specified, the default treebased classifier is used. The model path (-p / --model_path) defaults to models/stable/model.joblib. The ground truth file (-g) is optional and only required to compute accuracy metrics:

    python main.py -i data/single_pages/ -g data/gt_single_pages.json

See Model Training for the ground truth file format.
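When the pipeline is driven from another script, the command line described above can be assembled programmatically. The following is a minimal sketch: `build_command` is a hypothetical helper, not something shipped in this repository, and it only uses the flags documented above (-i, -g, -c, -p, -w).

```python
import sys


def build_command(input_path, ground_truth=None, classifier=None,
                  model_path=None, write=False):
    """Build the argument list for main.py (hypothetical helper).

    Only the mandatory -i flag is always emitted. Optional flags are added
    when a value is given, so the CLI defaults apply otherwise.
    """
    cmd = [sys.executable, "main.py", "-i", str(input_path)]
    if ground_truth is not None:
        cmd += ["-g", str(ground_truth)]  # only needed for accuracy metrics
    if classifier is not None:
        cmd += ["-c", classifier]         # "treebased" or "pixtral"
    if model_path is not None:
        cmd += ["-p", str(model_path)]
    if write:
        cmd.append("-w")                  # write results to data/prediction.json
    return cmd


# Usage (run from the repository root):
#   import subprocess
#   subprocess.run(build_command("path/to/document.pdf", write=True), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.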
| Classifier | Description |
|---|---|
| treebased | Default. Feature-based XGBoost model |
| pixtral | Uses Pixtral Large via Amazon Bedrock |
Start the API server locally with:

    uvicorn api.api:app --reload --host 0.0.0.0 --port 8000

For detailed endpoint documentation, output formats, and local S3 setup, see the API Usage Guide.
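Once the server is running, one way to see which versioned endpoints it exposes is to read its OpenAPI document. The sketch below assumes that the app keeps FastAPI's default `/openapi.json` route and that the server listens on `localhost:8000` as in the command above. `fetch_spec` and `list_operations` are illustrative names, not part of this repository.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def fetch_spec(base_url="http://localhost:8000"):
    """Download the OpenAPI document (assumes the default /openapi.json route)."""
    with urlopen(f"{base_url}/openapi.json") as response:
        return json.load(response)


def list_operations(spec):
    """Return (METHOD, path) pairs from an OpenAPI document, sorted by path."""
    operations = []
    for path, item in sorted(spec.get("paths", {}).items()):
        for key in item:
            # Path items may also hold non-operation keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return operations


# Usage, with the server running:
#   for method, path in list_operations(fetch_spec()):
#       print(method, path)
```

The actual endpoint names and payloads are described in the API Usage Guide. This sketch only lists whatever the running server reports.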
| Document | Description |
|---|---|
| API Architecture | API versioning and OpenAPI spec |
| API Usage Guide | Endpoints, output formats, MinIO setup |
| Docker Deployment | Building and running Docker images |
| Model Overview | Stable model features and usage |
| Model Explainability | SHAP interpretation for tree-based models |
| Model Training | Data, XGBoost training, hyperparameter tuning |
| Pixtral Setup | AWS Bedrock configuration |
- api/: FastAPI application
- config/: YAML configs (models, matching, prediction profiles)
- data/: Input data, predictions and ground truths
- docs/: Detailed documentation
- evaluation/: Evaluation and metrics
- models/: Trained models (TreeBased)
- prompts/: Pixtral prompts
- src/: Core logic and utility scripts
- tests/: Unit tests
- main.py: CLI entry point
We use pre-commit hooks with Ruff for code formatting. After installing dependencies, run:

    pre-commit install

This needs to be done only once. After installing, hooks will run automatically on each git commit.
This repository is managed by the Swiss Federal Office of Topography swisstopo. The project lead and primary maintainer is Stijn Vermeeren (@stijnvermeeren-swisstopo). Support has come from external contractors at Visium and EBP. Individual contributors are listed on GitHub's Contributors page.
We welcome suggestions, bug reports and code contributions from third parties. However, the priority of any external request will have to be evaluated based on compatibility with our legal mandate as a government agency.
This project is released as open-source software, under the principle of "public money, public code", in accordance with the 2023 federal law "EMBAG", and following the guidance of the tools for OSS published by the Federal Chancellery.
The source code is licensed under the AGPL-3.0-only License. This is due to the licensing of certain dependencies, most notably PyMuPDF, which is only available under either the AGPL license or a commercial license. If this dependency is removed in the future, we will switch to a more permissive license for this project.