TUB Sign Language Corpus Collection

We present a collection of parallel corpora of 12 sign languages in video format, together with subtitles in the dominant spoken languages of the corresponding countries. The entire collection includes more than 1,300 hours in 4,381 video files, accompanied by 1.3 M subtitles containing 14 M tokens. Most notably, it includes the first consistent parallel corpora for 8 Latin American sign languages, while the German Sign Language corpora are ten times the size of the previously available corpora. The collection was created by gathering and processing videos of multiple sign languages from various online sources, mainly broadcast material of news shows, governmental bodies and educational channels. The preparation involved several stages, including data collection, informing the content creators and seeking usage approvals, scraping, and cropping. The paper provides statistics on the collection and an overview of the methods used to collect the data.

The links to the corpora entries can be found in the "catalogue" folder.

The full paper describing the collection can be found here

If you use the corpus, please cite it as follows:

Full text:

Eleftherios Avramidis, Vera Czehmann, Fabian Deckert, Lorenz Hufe, Aljoscha Lipski, Yuni Amaloa Quintero Villalobos, Tae Kwon Rhee, Mengqian Shi, Lennart Stölting, Fabrizio Nunnari, et al. The TUB Sign Language Corpus Collection. In ACM International Conference on Intelligent Virtual Agents (IVA Adjunct ’25). International Workshop on Sign Language Translation and Avatar Technology (SLTAT-2025), 9th International Workshop on Sign Language Translation and Avatar Technology, September 16, Berlin, Germany. ACM, September 2025.

BibTeX:

@inproceedings{avramidis2025tub,
  title = {The TUB Sign Language Corpus Collection},
  author = {Avramidis, Eleftherios and Czehmann, Vera and Deckert, Fabian and Hufe, Lorenz and Lipski, Aljoscha and Amaloa Quintero Villalobos, Yuni and Rhee, Tae Kwon and Shi, Mengqian and St{\"o}lting, Lennart and Nunnari, Fabrizio and M{\"o}ller, Sebastian},
  booktitle = {ACM International Conference on Intelligent Virtual Agents (IVA Adjunct ’25). International Workshop on Sign Language Translation and Avatar Technology (SLTAT-2025), 9th International Workshop on Sign Language Translation and Avatar Technology, September 16, Berlin, Germany},
  year = {2025},
  month = {9},
  publisher = {ACM},
}

Update 08.09.2025

Keep-out data:

Please note that for the corpus part "Heute Journal" we are preparing a manually annotated test set using the subtitles of the broadcasts from the following dates:

29.06.2022
16.11.2022
12.08.2022
09.10.2022

We kindly ask you to refrain from training/tuning on these videos.
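One way to honour this request when assembling a training set is to filter the "Heute Journal" videos by broadcast date before training or tuning. The sketch below is only an illustration: the function `split_keep_out`, the `(name, broadcast_date)` pair structure and the file names are hypothetical and not part of this repository, so adapt them to however your local copy records the broadcast date of each video.

```python
from datetime import date, datetime

# Keep-out broadcast dates for "Heute Journal", as listed above (DD.MM.YYYY).
KEEP_OUT_DATES = {
    datetime.strptime(d, "%d.%m.%Y").date()
    for d in ("29.06.2022", "16.11.2022", "12.08.2022", "09.10.2022")
}


def split_keep_out(videos):
    """Split (name, broadcast_date) pairs into (train, held_out) name lists.

    The pair structure is an assumption made for this sketch; the
    repository does not prescribe any particular record format.
    """
    train, held_out = [], []
    for name, broadcast_date in videos:
        if broadcast_date in KEEP_OUT_DATES:
            held_out.append(name)
        else:
            train.append(name)
    return train, held_out


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    videos = [
        ("heute_journal_a.mp4", date(2022, 6, 29)),
        ("heute_journal_b.mp4", date(2022, 7, 1)),
    ]
    train, held_out = split_keep_out(videos)
    print("train:", train)
    print("held out:", held_out)
```

Apply the filter only to the "Heute Journal" part of the collection, since the keep-out dates refer to that corpus part alone.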
