Segmentation Corpus -- Medieval Languages -- v1.0.0 (ProMeText/multilingual-segmentation-dataset)

Overview

This release provides the segmented dataset in its frozen reference version.
It serves as the starting point for subsequent experiments.

Heterogeneous set of segmented texts
Languages: multiple Romance languages, as well as English and Latin
Format: raw and segmented .txt files for training and evaluation

Language Statistics (v0)

| Language | Texts | Segmented Tokens | Segments (`£`) | Train/Dev/Test? |
|---|---|---|---|---|
| Latin (`la`) | 557 | 85,888 | 8,366 | ✅ |
| French (`fr`) | 1,526 | 160,472 | 11,774 | ✅ |
| English (`en`) | 152 | 27,072 | 2,315 | ✅ |
| Portuguese (`pt`) | 987 | 101,565 | 10,477 | ✅ |
| Catalan (`ca`) | 388 | 38,441 | 2,879 | ✅ |
| Italian (`it`) | 2,649 | 85,290 | 6,347 | ✅ |
| Castilian (`es`) | 1,436 | 111,811 | 8,091 | ✅ |
| Total | 7,695 | 610,539 | 50,249 | ✅ |

Legend:

Texts = total number of annotated examples (i.e. segmented lines)
Segmented Tokens = total number of tokens (excluding £)
Segments (£) = total number of £ delimiters, i.e. the number of segments
Train/Dev/Test? = indicates whether train.json, dev.json, and test.json are all present
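
The counting rules in the legend can be sketched in Python. This is a minimal sketch, not the project's own tooling: it assumes tokens are separated by whitespace and that `£` may stand alone or be attached to a neighbouring word. The release does not state which tokenization was actually used, and `count_line` is a hypothetical helper name.

```python
DELIMITER = "£"

def count_line(line: str) -> tuple[int, int]:
    """Return (tokens, segments) for one segmented line.

    Assumption: whitespace tokenization; the corpus may use another scheme.
    """
    # Every delimiter is counted as one segment, as in the legend.
    segments = line.count(DELIMITER)
    # Replace delimiters with spaces before splitting so that they are
    # excluded from the token count.
    tokens = len(line.replace(DELIMITER, " ").split())
    return tokens, segments
```

For an invented line such as `"£ in principio £ creavit deus"`, `count_line` returns `(4, 2)`: four tokens and two `£` delimiters.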

Purpose

This dataset is intended as the reference version for comparison with augmented corpora.
It ensures reproducibility of experiments and consistency across model evaluations.

How to use

Check out this tag locally:
```
git checkout v1.0.0
```
Or download the source snapshot directly from this release page.
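
After checking out the tag, the Train/Dev/Test column can be verified locally. The following is a minimal sketch under stated assumptions: it supposes that each language has its own directory containing `train.json`, `dev.json`, and `test.json`. That layout and the helper names `has_all_splits` and `load_splits` are hypothetical, and the JSON schema is not documented here, so the loader simply returns whatever `json.load` yields.

```python
import json
from pathlib import Path

# Split names taken from the legend (train.json, dev.json, test.json).
SPLITS = ("train", "dev", "test")

def has_all_splits(lang_dir):
    """Return True if all three split files exist in lang_dir (assumed layout)."""
    lang_dir = Path(lang_dir)
    return all((lang_dir / f"{name}.json").is_file() for name in SPLITS)

def load_splits(lang_dir):
    """Load the three splits into a dict keyed by split name."""
    lang_dir = Path(lang_dir)
    data = {}
    for name in SPLITS:
        with open(lang_dir / f"{name}.json", encoding="utf-8") as fh:
            data[name] = json.load(fh)
    return data
```

Calling `has_all_splits` before `load_splits` avoids a `FileNotFoundError` if a language directory turns out to be incomplete or laid out differently.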

Citation

If you use this dataset in academic work, please cite:

ProMeText, Multilingual Segmentation Dataset – Version 1.0.0, GitHub, 2025.
https://github.com/ProMeText/multilingual-segmentation-dataset/releases/tag/v1.0.0
