qurator-spk/ZEFYS2025

ZEFYS2025

Introduction

  • Dataset for historical German named entity recognition (NER) and entity linking (EL)
  • Manually annotated data from the Berlin State Library digitised newspaper collection (ZEFYS)
  • 100 pages (84 pages OCR + 16 pages ground truth) from historical (1837-1940) German newspapers
  • Related resources: GitHub code repository sbb_ner_hf and dataset on Zenodo (DOI: 10.5281/zenodo.15771823)

Statistics

Dataset statistics in comparison to similar datasets for German NER

dataset        # tokens     # PER     # LOC     # ORG   # links
ZEFYS2025       348,307     4,389     6,049     3,223    10,341
CoNLL-2003      310,318     5,369     6,579     4,441         -
GermEval2014    591,006    10,500    12,165     7,175         -
NewsEye         527,756     3,500     5,904     3,370     2,279
HIPE2020        153,875     1,910     3,006       660     5,066

Annotation

  • Entity tagset: PER (person), LOC (location), ORG (organization)
  • Nesting: yes, for one additional level of embedded entities (NE-EMB column)
  • Linking: yes, for NE-TAG entities (ID column)
  • Guidelines: derived from DOI; main changes: restricted to entity types listed under entity tagset, components not considered
  • Annotation tool: neat

Format

  • Format based on GermEval TSV/CoNLL data format
  • Tagging scheme: IOB2, indicating beginning (B-)/inside (I-)/outside (O) position of an entity
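
To illustrate how IOB2 tags map to entity spans, here is a minimal Python sketch. The function name `iob2_spans` is hypothetical and not part of this repository or of sbb_ner_hf. Treating a stray `I-` tag (one without a preceding `B-`) as the start of a span is an assumption made for robustness, not something the dataset guidelines prescribe.

```python
def iob2_spans(tags):
    """Return (entity_type, start, end) triples for a list of IOB2 tags.

    `end` is exclusive, so tokens[start:end] covers the entity.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # An open span continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Assumption: a stray I- tag is leniently treated as a span start.
            start, etype = i, tag[2:]
    if start is not None:
        # Close a span that runs to the end of the sequence.
        spans.append((etype, start, len(tags)))
    return spans


print(iob2_spans(["B-PER", "I-PER", "I-PER", "O"]))
```

The same function can be applied to the NE-TAG column and to the NE-EMB column separately, since each column carries its own IOB2 sequence.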

Example:

No.	TOKEN	NE-TAG	NE-EMB	ID	url_id	left	right	top	bottom
# https://content.staatsbibliothek-berlin.de/zefys/SNP11614109-18820501-0-1-0-0/left,top,width,height/full/0/default.jpg
0	Neueſte	O	O	-	0	495	1379	267	505
1	Mittheitungen	O	O	-	0	1466	3154	309	525
2	.	O	O	-	0	1466	3154	309	525
0		O	O	-	0	1204	1649	553	624
1	Verantwortlicher	O	O	-	0	1204	1649	553	624
2	Herausgeber	O	O	-	0	1685	2037	562	633
3	:	O	O	-	0	1685	2037	562	633
4	Dr	O	O	-	0	2076	2158	568	622
5	.	O	O	-	0	2076	2158	568	622
0	H	B-PER	O	NIL	0	2198	2266	570	640
1	.	I-PER	O	NIL	0	2198	2266	570	640
2	Klee	I-PER	O	NIL	0	2307	2439	571	631
3	.	O	O	-	0	2307	2439	571	631
0		O	O	-	0	1490	1655	659	721
1	Berlin	B-LOC	O	Q64	0	1490	1655	659	721
2	,	O	O	-	0	1490	1655	659	721
3	den	O	O	-	0	1691	1767	665	711
4	1	O	O	-	0	1804	1842	669	710
5	.	O	O	-	0	1804	1842	669	710
0	Mai	O	O	-	0	1879	1976	670	716
1	1882	O	O	-	0	2013	2143	672	719
2	.	O	O	-	0	2013	2143	672	719
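
A file in this format could be read along the following lines. This is a sketch rather than the project's own loader: `read_sentences` is a hypothetical helper, and it assumes that the first line is the tab-separated header, that lines starting with `#` are comments, and that a `No.` value of 0 opens a new sentence. Lines are split on tabs directly instead of going through a CSV parser, so that quote characters in tokens are kept literally.

```python
def read_sentences(tsv_text):
    """Group token rows into sentences, each a list of column-name -> value dicts."""
    lines = tsv_text.splitlines()
    header = lines[0].split("\t")
    sentences, current = [], []
    for line in lines[1:]:
        # Skip blank lines and comment lines (which carry the source URLs).
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(header, line.split("\t")))
        # Assumption: the token counter restarts at 0 for every sentence.
        if row["No."] == "0" and current:
            sentences.append(current)
            current = []
        current.append(row)
    if current:
        sentences.append(current)
    return sentences


# A few rows taken from the example above.
sample = (
    "No.\tTOKEN\tNE-TAG\tNE-EMB\tID\turl_id\tleft\tright\ttop\tbottom\n"
    "# https://content.staatsbibliothek-berlin.de/zefys/SNP11614109-18820501-0-1-0-0/"
    "left,top,width,height/full/0/default.jpg\n"
    "0\tH\tB-PER\tO\tNIL\t0\t2198\t2266\t570\t640\n"
    "1\t.\tI-PER\tO\tNIL\t0\t2198\t2266\t570\t640\n"
    "2\tKlee\tI-PER\tO\tNIL\t0\t2307\t2439\t571\t631\n"
    "3\t.\tO\tO\t-\t0\t2307\t2439\t571\t631\n"
    "0\t\tO\tO\t-\t0\t1490\t1655\t659\t721\n"
    "1\tBerlin\tB-LOC\tO\tQ64\t0\t1490\t1655\t659\t721\n"
)
sentences = read_sentences(sample)
for sentence in sentences:
    print([row["TOKEN"] for row in sentence])
```

Note that some rows have an empty TOKEN field, as in the example, so a reader should not drop rows based on the token text alone.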

Columns:

  1. No.: token position within current sentence, based on sentence splitting performed with page2tsv
    • # indicates the start of a comment, including the URL(s) to the source material
  2. TOKEN: token text
  3. NE-TAG: outer entity span annotation
  4. NE-EMB: embedded entity span annotation
  5. ID: link for the NE-TAG entity in the authority file (QID from Wikidata); NIL if none was found, - for tokens outside an entity
  6. url_id: indicates which URL to use for IIIF Image API support
  7. left, right, top, bottom: pixel coordinates for the token
    • Columns listed under 6. and 7. are used for displaying image snippets for the tokens from the digitized version (see neat screenshot)
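
As a sketch of how columns 6 and 7 might be turned into an image snippet request, the comment-line URL can be treated as a template whose `left,top,width,height` placeholder is the IIIF Image API region (x,y,w,h). The helper `snippet_url` is hypothetical. It assumes that left/right and top/bottom are the edges of the token's bounding box, so that width and height are their differences, and that url_id selects the comment-line URL that serves as the template.

```python
def snippet_url(template, left, right, top, bottom):
    """Fill the region placeholder of a comment-line URL for one token box."""
    # Assumption: right/bottom are box edges, not width/height.
    width = right - left
    height = bottom - top
    region = f"{left},{top},{width},{height}"
    return template.replace("left,top,width,height", region)


template = (
    "https://content.staatsbibliothek-berlin.de/zefys/"
    "SNP11614109-18820501-0-1-0-0/left,top,width,height/full/0/default.jpg"
)
# Coordinates of the token "Berlin" from the example above.
url = snippet_url(template, left=1490, right=1655, top=659, bottom=721)
print(url)
```

Column values read from the TSV are strings, so they need to be converted with `int()` before being passed in.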

Acknowledgments

License

Creative Commons Attribution 4.0 International (CC BY 4.0)

How to cite

Dataset

@dataset{schneider_2025_15771823,
  author       = {Schneider, Sophie and
                  Förstel, Ulrike and
                  Labusch, Kai and
                  Lehmann, Jörg and
                  Neudecker, Clemens},
  title        = {ZEFYS2025: A German Dataset for Named Entity
                   Recognition and Entity Linking for Historical
                   Newspapers
                  },
  month        = jul,
  year         = 2025,
  publisher    = {Staatsbibliothek zu Berlin - Berlin State Library},
  version      = 1,
  doi          = {10.5281/zenodo.15771823},
  url          = {https://doi.org/10.5281/zenodo.15771823},
}

Publication

[will be added soon]
