- Dataset for historical German named entity recognition (NER) and entity linking (EL)
- Manually annotated data from the Berlin State Library digitised newspaper collection (ZEFYS)
- 100 pages (84 pages OCR + 16 pages ground truth) from historical (1837-1940) German newspapers
- Related resources: GitHub code repository sbb_ner_hf and dataset on Zenodo
Dataset statistics in comparison to similar datasets for German NER
dataset | # tokens | # PER | # LOC | # ORG | # links |
---|---|---|---|---|---|
ZEFYS2025 | 348,307 | 4,389 | 6,049 | 3,223 | 10,341 |
CoNLL-2003 | 310,318 | 5,369 | 6,579 | 4,441 | - |
GermEval2014 | 591,006 | 10,500 | 12,165 | 7,175 | - |
NewsEye | 527,756 | 3,500 | 5,904 | 3,370 | 2,279 |
HIPE2020 | 153,875 | 1,910 | 3,006 | 660 | 5,066 |
- Entity tagset: `PER` (person), `LOC` (location), `ORG` (organization)
- Nesting: yes, for an additional level (`NE-EMB` column)
- Linking: yes, for NE-TAG entities (`ID` column)
- Guidelines: derived from ; main changes: restricted to entity types listed under entity tagset, components not considered
- Annotation tool: neat
- Format based on GermEval TSV/CoNLL data format
- Tagging scheme: IOB2, indicating beginning (`B-`)/inside (`I-`)/outside (`O`) position of an entity
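As a rough illustration of the IOB2 scheme, the following sketch groups a tag sequence into entity spans. The function name `iob2_spans` is hypothetical and not part of any tooling shipped with the dataset; stray `I-` tags without a preceding `B-` are simply ignored here, which is one possible reading of strict IOB2.

```python
def iob2_spans(tokens, tags):
    """Return (entity_type, start, end_exclusive, text) tuples from IOB2 tags."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag of the same entity type.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-"):
            # B- always opens a new span.
            start, etype = i, tag[2:]
    if start is not None:
        # Close a span that runs to the end of the sequence.
        spans.append((etype, start, len(tags), " ".join(tokens[start:])))
    return spans


tokens = ["Dr", ".", "H", ".", "Klee", ".", "Berlin", ","]
tags = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "O"]
print(iob2_spans(tokens, tags))
```

The same function can be applied separately to the `NE-TAG` and `NE-EMB` columns to obtain outer and embedded entities.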
Example:
No. TOKEN NE-TAG NE-EMB ID url_id left right top bottom
# https://content.staatsbibliothek-berlin.de/zefys/SNP11614109-18820501-0-1-0-0/left,top,width,height/full/0/default.jpg
0 Neueſte O O - 0 495 1379 267 505
1 Mittheitungen O O - 0 1466 3154 309 525
2 . O O - 0 1466 3154 309 525
0 O O - 0 1204 1649 553 624
1 Verantwortlicher O O - 0 1204 1649 553 624
2 Herausgeber O O - 0 1685 2037 562 633
3 : O O - 0 1685 2037 562 633
4 Dr O O - 0 2076 2158 568 622
5 . O O - 0 2076 2158 568 622
0 H B-PER O NIL 0 2198 2266 570 640
1 . I-PER O NIL 0 2198 2266 570 640
2 Klee I-PER O NIL 0 2307 2439 571 631
3 . O O - 0 2307 2439 571 631
0 O O - 0 1490 1655 659 721
1 Berlin B-LOC O Q64 0 1490 1655 659 721
2 , O O - 0 1490 1655 659 721
3 den O O - 0 1691 1767 665 711
4 1 O O - 0 1804 1842 669 710
5 . O O - 0 1804 1842 669 710
0 Mai O O - 0 1879 1976 670 716
1 1882 O O - 0 2013 2143 672 719
2 . O O - 0 2013 2143 672 719
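A file in this layout could be read roughly as follows. This is a minimal sketch that assumes the columns are tab-separated, that the first line is the header row, and that a new sentence starts whenever `No.` resets to `0`; the function name `read_sentences` and the shortened `SAMPLE` string are illustrative only.

```python
import csv
import io

# A few rows from the example above, written with explicit tab separators.
SAMPLE = (
    "No.\tTOKEN\tNE-TAG\tNE-EMB\tID\turl_id\tleft\tright\ttop\tbottom\n"
    "# https://content.staatsbibliothek-berlin.de/zefys/"
    "SNP11614109-18820501-0-1-0-0/left,top,width,height/full/0/default.jpg\n"
    "0\tH\tB-PER\tO\tNIL\t0\t2198\t2266\t570\t640\n"
    "1\t.\tI-PER\tO\tNIL\t0\t2198\t2266\t570\t640\n"
    "2\tKlee\tI-PER\tO\tNIL\t0\t2307\t2439\t571\t631\n"
    "3\t.\tO\tO\t-\t0\t2307\t2439\t571\t631\n"
    "0\t\tO\tO\t-\t0\t1490\t1655\t659\t721\n"
    "1\tBerlin\tB-LOC\tO\tQ64\t0\t1490\t1655\t659\t721\n"
)


def read_sentences(handle):
    """Return (urls, sentences); each sentence is a list of row dicts."""
    # QUOTE_NONE keeps quotation marks in historical text as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    urls, sentences, current = [], [], []
    for row in reader:
        if not row:
            continue
        if row[0].startswith("#"):
            # Comment line carrying a source URL.
            urls.append(row[0].lstrip("# ").strip())
            continue
        record = dict(zip(header, row))
        if record["No."] == "0" and current:
            # Token counter reset: the previous sentence is complete.
            sentences.append(current)
            current = []
        current.append(record)
    if current:
        sentences.append(current)
    return urls, sentences


urls, sentences = read_sentences(io.StringIO(SAMPLE))
for sentence in sentences:
    print([(tok["TOKEN"], tok["NE-TAG"], tok["ID"]) for tok in sentence])
```

Splitting on tabs rather than on whitespace matters because some rows carry an empty `TOKEN` field, which whitespace splitting would silently drop.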
Columns:
1. `No.`: token position within current sentence, based on sentence splitting performed with page2tsv; `#` indicates the start of a comment, including the url(s) to the source material
2. `TOKEN`: token text
3. `NE-TAG`: outer entity span annotation
4. `NE-EMB`: embedded entity span annotation
5. `ID`: link for `NE-TAG` in authority file (QID from Wikidata), NIL if none found
6. `url_id`: indicates which url to use for iiif Image API support
7. `left`, `right`, `top`, `bottom`: pixel coordinates for the token

- Columns listed under 6. and 7. are used for displaying image snippets for the tokens from the digitized version (see neat screenshot)
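To illustrate how the coordinate columns might be combined with the comment URL, the sketch below fills the `left,top,width,height` placeholder from the example URL with an IIIF Image API region (`x,y,w,h`). That the placeholder is meant to be substituted this way, and that `right`/`bottom` are absolute pixel positions rather than sizes, are assumptions based on the example rows; `snippet_url` is a hypothetical helper name.

```python
def snippet_url(url_template, left, right, top, bottom):
    """Build an image snippet URL for one token from its pixel coordinates."""
    # IIIF regions are given as x,y,width,height, so derive the size
    # from the absolute right/bottom positions (assumed, see above).
    width = right - left
    height = bottom - top
    region = f"{left},{top},{width},{height}"
    return url_template.replace("left,top,width,height", region)


TEMPLATE = (
    "https://content.staatsbibliothek-berlin.de/zefys/"
    "SNP11614109-18820501-0-1-0-0/left,top,width,height/full/0/default.jpg"
)

# Coordinates of the token "Berlin" from the example above.
print(snippet_url(TEMPLATE, left=1490, right=1655, top=659, bottom=721))
```

When several comment URLs are present, `url_id` would select which template to pass in.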
- HIPE (conversion to HIPE format)
- SoNAR-IDH Goldstandard
- Europeana Newspapers
- Contributors: Markus Bierkoch, Knut Lohse
Creative Commons Attribution 4.0 International CC BY 4.0
@dataset{schneider_2025_15771823,
author = {Schneider, Sophie and
Förstel, Ulrike and
Labusch, Kai and
Lehmann, Jörg and
Neudecker, Clemens},
title = {ZEFYS2025: A German Dataset for Named Entity
Recognition and Entity Linking for Historical
Newspapers
},
month = jul,
year = 2025,
publisher = {Staatsbibliothek zu Berlin - Berlin State Library},
version = 1,
doi = {10.5281/zenodo.15771823},
url = {https://doi.org/10.5281/zenodo.15771823},
}
[will be added soon]