sbb_ner_hf

Description

This tool finetunes and evaluates Named Entity Recognition (NER) models for German historical newspaper content using the HuggingFace Transformers library. It is implemented so that different pretrained models from HuggingFace can be trained and tested on a variety of preprocessed and optionally combined datasets.

In its current state, this repository mainly serves the purpose of training and evaluating NER models; inference and automatic annotation of input datasets could, however, be added with further development. Also have a look at our previously developed BERT-based solutions for entity recognition, disambiguation and linking, sbb_ner and sbb_ned.

  • License: see the license file in this repository
  • Related resources: ZEFYS2025 dataset on GitHub and on Zenodo (DOI: 10.5281/zenodo.15771823)

Installation

  • Install Python 3.10.14, preferably in a virtual environment (e.g. created with pyenv-virtualenv).
  • Clone this repo.
  • Install the requirements from the requirements.txt file.

Overall, GPU access is beneficial, though depending on the choice of models and parameters it may not be strictly necessary.
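
A possible step-by-step setup (the repository URL is assumed from the project name; adapt environment names and paths as needed):

pyenv install 3.10.14
pyenv virtualenv 3.10.14 sbb_ner_hf     # create a dedicated virtual environment
pyenv activate sbb_ner_hf
git clone https://github.com/qurator-spk/sbb_ner_hf.git
cd sbb_ner_hf
pip install -r requirements.txt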

Usage

Preprocessing

Most of the datasets used here required additional preprocessing to bring them into a common format. The table below summarizes how each dataset was acquired and processed in this first step.

dataset                   preprocessing steps
zefys2025                 run preprocess_zefys2025.py
hisGermaNER               run preprocess_hisgermaner.py
hipe2020                  run preprocess_hipe_hipe2020.py
neiss (arendt, sturm)     run preprocess_neiss.py for each subset
europeana (lft, onb)      transform into the zefys2025 data format, then run preprocess_zefys2025.py for each subset
conll2003                 none (loaded directly via HF)
germeval2014              none (loaded directly via HF)

Each preprocessing script uses the datasets.Dataset.save_to_disk() function to save the .hf dataset, including train/test/validation splits, in Apache Arrow format for simple reloading (datasets.Dataset.load_from_disk()); see the sketch after the list below. Each dataset has three columns:

  • id: identifier for each sentence in the dataset
  • tokens: nested list of all the tokens per sentence
  • ner_tags: nested list of all the NER tags per sentence
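
As a minimal sketch (the path below is a made-up example; the actual output directories are created by the preprocessing scripts), a saved dataset can be reloaded and inspected like this:

from datasets import load_from_disk

# Reload a preprocessed dataset from disk (hypothetical path).
# Because the preprocessing scripts store train/test/validation splits,
# this typically returns a DatasetDict.
dataset = load_from_disk("data/zefys2025.hf")
print(dataset)

# id, tokens and ner_tags are aligned per sentence.
example = dataset["train"][0]
print(example["id"])
print(list(zip(example["tokens"], example["ner_tags"])))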

Introduction

For a first, broad overview of the functionality in config.py, train.py, merge_datasets.py and eval_opt.py, see main.ipynb. HuggingFace's token_classification.ipynb notebook served as the starting point for the code in these files. To run the code cells in main.ipynb, Jupyter Notebook needs to be installed.
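
The actual training and evaluation logic of this repository lives in the files listed above; purely as orientation, a minimal HuggingFace token-classification fine-tuning loop in the spirit of that notebook could look roughly like the sketch below. This is not the repository's own code: the dataset path and model name are placeholders, and ner_tags is assumed to be stored as a ClassLabel sequence.

from datasets import load_from_disk
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

dataset = load_from_disk("data/zefys2025.hf")            # placeholder path
label_list = dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")  # placeholder model

def tokenize_and_align(batch):
    # Re-tokenize into sub-words and align the word-level NER tags;
    # special tokens and continuation sub-words get the ignore index -100.
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous = None
        label_ids = []
        for word_id in tokenized.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                label_ids.append(-100)
            else:
                label_ids.append(tags[word_id])
            previous = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized = dataset.map(tokenize_and_align, batched=True)
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-german-cased", num_labels=len(label_list)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()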

Experiments

To experiment with multiple training configurations at once, experiment.py and the Makefile were developed. Experimental results are saved as .pkl files and can then be accessed as shown in experiments.ipynb.

experiment.py provides the following command-line interface. The Makefile shows how it was used to obtain the results published in the paper; the tables in the paper were generated with experiments.ipynb.

python experiment.py --help
Usage: experiment.py [OPTIONS] RESULT_FILE

Options:
  --max-epochs INTEGER            Maximum number of epochs to train. Default
                                  30.
  --exp-type [single|merged|historical|contemporary]
  --batch-size INTEGER            Can be supplied multiple times. Batch size
                                  to try.
  --learning-rate FLOAT           Can be supplied multiple times. Learning
                                  rate to try.
  --weight-decay FLOAT            Can be supplied multiple times. Weight decay
                                  to try.
  --warmup-step INTEGER           Can be supplied multiple times. Warmup steps
                                  to try.
  --use-data-config TEXT          Can be supplied multiple times. Run only on
                                  these training configs.
  --use-test-config TEXT          Can be supplied multiple times. Test each
                                  trained model on these configs.
  --pretrain-config-file PATH     Train on pretrained models defined in this
                                  result file (from a previous experiment.py
                                  run).
  --pretrain-path PATH            Load the pretrained models checkpoints
                                  (EXP_... directories) from this directory.
                                  Default './'
  --model-storage-path PATH       Store the models checkpoints (EXP_...
                                  directories) in this directory.
  --dry-run                       Dry run only. Do not train or test but just
                                  check if everything runs through.
  --help                          Show this message and exit.
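
For illustration only (the option values, experiment type and result file name below are made up and not taken from the Makefile), a hyperparameter sweep could be started like this:

python experiment.py \
    --max-epochs 30 \
    --exp-type historical \
    --batch-size 16 --batch-size 32 \
    --learning-rate 2e-5 --learning-rate 5e-5 \
    --model-storage-path ./models \
    results_historical.pkl

The resulting results_historical.pkl can then be loaded (e.g. with pickle.load or pandas.read_pickle) and inspected in the same way as in experiments.ipynb.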

How to cite

Dataset

@dataset{schneider_2025_15771823,
  author       = {Schneider, Sophie and
                  Förstel, Ulrike and
                  Labusch, Kai and
                  Lehmann, Jörg and
                  Neudecker, Clemens},
  title        = {ZEFYS2025: A German Dataset for Named Entity
                  Recognition and Entity Linking for Historical
                  Newspapers},
  month        = jul,
  year         = 2025,
  publisher    = {Staatsbibliothek zu Berlin - Berlin State Library},
  version      = 1,
  doi          = {10.5281/zenodo.15771823},
  url          = {https://doi.org/10.5281/zenodo.15771823},
}

Publication

[will be added soon]
