Segmentation Corpus -- Medieval Languages -- v1.0.0

@matgille released this 17 Oct 10:17 · 22 commits to main since this release

Overview

This release provides the segmented dataset.
It is the frozen reference version used as the starting point for subsequent experiments.

Contents

  • Heterogeneous set of segmented texts
  • Languages: multiple Romance languages, as well as English and Latin
  • Format: raw and segmented .txt files for training and evaluation

Language Statistics (v0)

Language          Texts   Segmented Tokens   Segments (£)   Train/Dev/Test?
Latin (la)          557             85,888          8,366
French (fr)       1,526            160,472         11,774
English (en)        152             27,072          2,315
Portuguese (pt)     987            101,565         10,477
Catalan (ca)        388             38,441          2,879
Italian (it)      2,649             85,290          6,347
Castilian (es)    1,436            111,811          8,091
Total             7,695            610,539         50,249

Legend:

  • Texts = total number of annotated examples (i.e. segmented lines)
  • Segmented Tokens = total number of tokens (excluding £)
  • Segments (£) = total number of £ delimiters, each marking a segment boundary
  • Train/Dev/Test? = indicates whether train.json, dev.json, and test.json are all present
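
The counts defined in the legend can be reproduced with a short script. The following is a minimal sketch, not the project's own tooling: it assumes one annotated example per line and whitespace-separated tokens, and the names line_stats and corpus_stats are hypothetical helpers introduced here for illustration.

```python
from typing import Iterable

DELIMITER = "£"  # segment delimiter named in the legend


def line_stats(line: str) -> tuple[int, int]:
    """Return (token_count, delimiter_count) for one segmented line."""
    delimiters = line.count(DELIMITER)
    # Assumption: tokens are whitespace-separated. Delimiters are replaced
    # by spaces first, so they are never counted as tokens, whether they
    # stand alone or are attached to a word.
    tokens = len(line.replace(DELIMITER, " ").split())
    return tokens, delimiters


def corpus_stats(lines: Iterable[str]) -> dict[str, int]:
    """Aggregate Texts / Segmented Tokens / Segments (£) over lines."""
    stats = {"texts": 0, "tokens": 0, "segments": 0}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines; they are not annotated examples
        tokens, delimiters = line_stats(line)
        stats["texts"] += 1
        stats["tokens"] += tokens
        stats["segments"] += delimiters
    return stats


# Made-up example lines, not taken from the corpus.
example = [
    "En el nombre de Dios £ amen",
    "Qui bien aime £ tard oublie £",
]
print(corpus_stats(example))  # {'texts': 2, 'tokens': 11, 'segments': 3}
```

If the corpus tokenizes differently (for instance, splitting punctuation), the token count here will not match the table exactly; only the delimiter count is independent of that assumption.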

Purpose

This dataset is intended as the reference version for comparison with augmented corpora.
It ensures reproducibility of experiments and consistency across model evaluations.

How to use

  • Check out this tag locally:
    git checkout v1.0.0
  • Or download the source snapshot directly from this release page.

Citation

If you use this dataset in academic work, please cite:

ProMeText, Multilingual Segmentation Dataset – Version 1.0.0, GitHub, 2025.
https://github.com/ProMeText/multilingual-segmentation-dataset/releases/tag/v1.0.0