sbb_ner_hf

Description

This tool finetunes and evaluates Named Entity Recognition (NER) models for German historical newspaper content using the HuggingFace Transformers library. It is implemented so that different pretrained models from HuggingFace can be trained and tested on a variety of preprocessed and optionally combined datasets.

In its current state, this repository mainly serves the purpose of training and evaluating NER models; inference and automatic annotation of input datasets could, however, be added with further development. Also have a look at our previously developed BERT-based solutions for entity recognition, disambiguation and linking, sbb_ner and sbb_ned.

  • License: see the license file in this repository
  • Related resources: ZEFYS2025 dataset on GitHub and on Zenodo (DOI: 10.5281/zenodo.15771823)

Installation

  • Install Python 3.10.14, preferably in a virtual environment (e.g. created with pyenv-virtualenv).
  • Clone this repo.
  • Install the requirements from the requirements.txt file.

Overall, GPU access is beneficial, though depending on the choice of models and parameters it may not be strictly necessary.
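
A possible step-by-step setup (the repository URL is assumed from the project name; adapt environment names and paths as needed):

pyenv install 3.10.14
pyenv virtualenv 3.10.14 sbb_ner_hf     # create a dedicated virtual environment
pyenv activate sbb_ner_hf
git clone https://github.com/qurator-spk/sbb_ner_hf.git
cd sbb_ner_hf
pip install -r requirements.txt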

Usage

Preprocessing

Most of the datasets used here required additional preprocessing to bring them into a common format. The table below summarizes how each dataset was acquired and processed in this first step.

dataset                   preprocessing steps
zefys2025                 run preprocess_zefys2025.py
hisGermaNER               run preprocess_hisgermaner.py
hipe2020                  run preprocess_hipe_hipe2020.py
neiss (arendt, sturm)     run preprocess_neiss.py for each subset
europeana (lft, onb)      transform into the zefys2025 data format, then run preprocess_zefys2025.py for each subset
conll2003                 none (loaded directly via HF)
germeval2014              none (loaded directly via HF)

Each preprocessing script uses the datasets.Dataset.save_to_disk() function to save the .hf dataset, including train/test/validation splits, in Apache Arrow format for simple reloading (datasets.Dataset.load_from_disk()); see the sketch after the list below. Each dataset has three columns:

  • id: identifier for each sentence in the dataset
  • tokens: nested list of all the tokens per sentence
  • ner_tags: nested list of all the NER tags per sentence
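
As a minimal sketch (the path below is a made-up example; the actual output directories are created by the preprocessing scripts), a saved dataset can be reloaded and inspected like this:

from datasets import load_from_disk

# Reload a preprocessed dataset from disk (hypothetical path).
# Because the preprocessing scripts store train/test/validation splits,
# this typically returns a DatasetDict.
dataset = load_from_disk("data/zefys2025.hf")
print(dataset)

# id, tokens and ner_tags are aligned per sentence.
example = dataset["train"][0]
print(example["id"])
print(list(zip(example["tokens"], example["ner_tags"])))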

Introduction

For a first, broad overview of the functionality in config.py, train.py, merge_datasets.py and eval_opt.py, see main.ipynb. HuggingFace's token_classification.ipynb notebook served as the starting point for the code in these files. To run the code cells in main.ipynb, Jupyter Notebook needs to be installed.
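
The actual training and evaluation logic of this repository lives in the files listed above; purely as orientation, a minimal HuggingFace token-classification fine-tuning loop in the spirit of that notebook could look roughly like the sketch below. This is not the repository's own code: the dataset path and model name are placeholders, and ner_tags is assumed to be stored as a ClassLabel sequence.

from datasets import load_from_disk
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

dataset = load_from_disk("data/zefys2025.hf")            # placeholder path
label_list = dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")  # placeholder model

def tokenize_and_align(batch):
    # Re-tokenize into sub-words and align the word-level NER tags;
    # special tokens and continuation sub-words get the ignore index -100.
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous = None
        label_ids = []
        for word_id in tokenized.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                label_ids.append(-100)
            else:
                label_ids.append(tags[word_id])
            previous = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized = dataset.map(tokenize_and_align, batched=True)
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-german-cased", num_labels=len(label_list)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()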

Experiments

To experiment with multiple training configurations at once, experiment.py and the Makefile were developed. Experimental results are saved as .pkl files and can then be accessed as shown in experiments.ipynb.

experiment.py provides the following command-line interface. The Makefile shows how it was used to obtain the results published in the paper; the tables in the paper were generated with experiments.ipynb.

python experiment.py --help
Usage: experiment.py [OPTIONS] RESULT_FILE

Options:
  --max-epochs INTEGER            Maximum number of epochs to train. Default
                                  30.
  --exp-type [single|merged|historical|contemporary]
  --batch-size INTEGER            Can be supplied multiple times. Batch size
                                  to try.
  --learning-rate FLOAT           Can be supplied multiple times. Learning
                                  rate to try.
  --weight-decay FLOAT            Can be supplied multiple times. Weight decay
                                  to try.
  --warmup-step INTEGER           Can be supplied multiple times. Warmup steps
                                  to try.
  --use-data-config TEXT          Can be supplied multiple times. Run only on
                                  these training configs.
  --use-test-config TEXT          Can be supplied multiple times. Test each
                                  trained model on these configs.
  --pretrain-config-file PATH     Train on pretrained models defined in this
                                  result file (from a previous experiment.py
                                  run).
  --pretrain-path PATH            Load the pretrained models checkpoints
                                  (EXP_... directories) from this directory.
                                  Default './'
  --model-storage-path PATH       Store the models checkpoints (EXP_...
                                  directories) in this directory.
  --dry-run                       Dry run only. Do not train or test but just
                                  check if everything runs through.
  --help                          Show this message and exit.
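
For illustration only (the option values, experiment type and result file name below are made up and not taken from the Makefile), a hyperparameter sweep could be started like this:

python experiment.py \
    --max-epochs 30 \
    --exp-type historical \
    --batch-size 16 --batch-size 32 \
    --learning-rate 2e-5 --learning-rate 5e-5 \
    --model-storage-path ./models \
    results_historical.pkl

The resulting results_historical.pkl can then be loaded (e.g. with pickle.load or pandas.read_pickle) and inspected in the same way as in experiments.ipynb.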

How to cite

Dataset

@dataset{schneider_2025_15771823,
  author       = {Schneider, Sophie and
                  Förstel, Ulrike and
                  Labusch, Kai and
                  Lehmann, Jörg and
                  Neudecker, Clemens},
  title        = {ZEFYS2025: A German Dataset for Named Entity
                  Recognition and Entity Linking for Historical
                  Newspapers},
  month        = jul,
  year         = 2025,
  publisher    = {Staatsbibliothek zu Berlin - Berlin State Library},
  version      = 1,
  doi          = {10.5281/zenodo.15771823},
  url          = {https://doi.org/10.5281/zenodo.15771823},
}

Publication

[will be added soon]
