Releases: ProMeText/multilingual-segmentation-dataset
Segmentation Corpus -- Medieval Languages -- v1.0.0
Overview
This release provides the segmented dataset.
It represents the frozen reference version used as a starting point for subsequent experiments.
Contents
- Heterogeneous set of segmented texts
- Languages: multiple Romance languages, as well as English and Latin
- Format: raw and segmented `.txt` files for training and evaluation
Language Statistics (v0)
| Language | Texts | Segmented Tokens | Segments (£) | Train/Dev/Test? |
|---|---|---|---|---|
| Latin (la) | 557 | 85,888 | 8,366 | ✅ |
| French (fr) | 1,526 | 160,472 | 11,774 | ✅ |
| English (en) | 152 | 27,072 | 2,315 | ✅ |
| Portuguese (pt) | 987 | 101,565 | 10,477 | ✅ |
| Catalan (ca) | 388 | 38,441 | 2,879 | ✅ |
| Italian (it) | 2,649 | 85,290 | 6,347 | ✅ |
| Castilian (es) | 1,436 | 111,811 | 8,091 | ✅ |
| Total | 7,695 | 610,539 | 50,249 | ✅ |
Legend:
- Texts = total number of annotated examples (i.e. segmented lines)
- Segmented Tokens = total number of tokens (excluding `£`)
- Segments (£) = total number of `£` delimiters, i.e. segments
- Train/Dev/Test? = indicates whether `train.json`, `dev.json`, and `test.json` are all present
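The three counts defined in the legend can be reproduced with a short script. The following is a minimal sketch, not the project's own tooling: it assumes whitespace tokenization and one annotated example per line, the function names `line_stats` and `corpus_stats` are hypothetical, and the sample lines are made up for illustration rather than taken from the corpus.

```python
DELIMITER = "£"  # segment delimiter named in the legend


def line_stats(line: str) -> tuple[int, int]:
    """Return (token_count, delimiter_count) for one segmented line."""
    # Pad each delimiter with spaces so a '£' glued to a word still splits off.
    pieces = line.replace(DELIMITER, f" {DELIMITER} ").split()
    delimiters = sum(1 for piece in pieces if piece == DELIMITER)
    # Tokens are counted excluding the delimiter itself.
    return len(pieces) - delimiters, delimiters


def corpus_stats(lines):
    """Aggregate Texts / Segmented Tokens / Segments over non-empty lines."""
    texts = tokens = segments = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        line_tokens, line_delimiters = line_stats(line)
        texts += 1
        tokens += line_tokens
        segments += line_delimiters
    return {"texts": texts, "tokens": tokens, "segments": segments}


# Made-up sample lines, for illustration only.
sample = [
    "In principio £ creavit Deus caelum £ et terram",
    "Fiat lux £ et facta est lux",
]
print(corpus_stats(sample))  # {'texts': 2, 'tokens': 13, 'segments': 3}
```

If the corpus tokenizes differently (for example, by splitting punctuation), the token counts from this sketch will not match the table exactly.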
Purpose
This dataset is intended as the reference version for comparison with augmented corpora.
It ensures reproducibility of experiments and consistency across model evaluations.
How to use
- Check out this tag locally: `git checkout v1.0.0`
- Or download the source snapshot directly from this release page.
Citation
If you use this dataset in academic work, please cite:
ProMeText, Multilingual Segmentation Dataset – Version 1.0.0, GitHub, 2025.
https://github.com/ProMeText/multilingual-segmentation-dataset/releases/tag/v1.0.0
Segmented Baseline 2025
Overview
This release provides the segmented baseline dataset before any data augmentation.
It represents the frozen reference version used as a starting point for subsequent experiments.
Contents
- Heterogeneous set of segmented texts
- Languages: multiple Romance languages, as well as English and Latin
- Format: raw and segmented `.txt` files for training and evaluation
Language Statistics (v0)
| Language | Texts | Segmented Tokens | Segments (£) | Train/Dev/Test? |
|---|---|---|---|---|
| Latin (la) | 557 | 85,888 | 8,366 | ✅ |
| French (fr) | 1,526 | 160,472 | 11,774 | ✅ |
| English (en) | 152 | 27,072 | 2,315 | ✅ |
| Portuguese (pt) | 987 | 101,565 | 10,477 | ✅ |
| Catalan (ca) | 388 | 38,441 | 2,879 | ✅ |
| Italian (it) | 2,649 | 85,290 | 6,347 | ✅ |
| Castilian (es) | 1,436 | 111,811 | 8,091 | ✅ |
| Total | 7,695 | 610,539 | 50,249 | ✅ |
Legend:
- Texts = total number of annotated examples (i.e. segmented lines)
- Segmented Tokens = total number of tokens (excluding `£`)
- Segments (£) = total number of `£` delimiters, i.e. segments
- Train/Dev/Test? = indicates whether `train.json`, `dev.json`, and `test.json` are all present
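As a sanity check on the statistics table, the Total row can be recomputed from the per-language rows. The sketch below copies the figures from the table; the variable names are illustrative, and the tokens-per-delimiter ratio is a derived quantity that the release itself does not report.

```python
# Per-language figures copied from the table: (texts, segmented tokens, £ delimiters).
rows = {
    "la": (557, 85_888, 8_366),
    "fr": (1_526, 160_472, 11_774),
    "en": (152, 27_072, 2_315),
    "pt": (987, 101_565, 10_477),
    "ca": (388, 38_441, 2_879),
    "it": (2_649, 85_290, 6_347),
    "es": (1_436, 111_811, 8_091),
}
reported_total = (7_695, 610_539, 50_249)  # the table's Total row

# Sum each column across all languages.
computed_total = tuple(sum(column) for column in zip(*rows.values()))
print(computed_total == reported_total)  # True

# Derived figure (not part of the release): average tokens per £ delimiter.
for language, (_, tokens, delimiters) in rows.items():
    print(language, round(tokens / delimiters, 2))
```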
Purpose
This baseline is intended as the reference version for comparison with augmented corpora.
It ensures reproducibility of experiments and consistency across model evaluations.
How to use
- Check out this tag locally: `git checkout segmented-baseline-2025`
- Or download the source snapshot directly from this release page.
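For scripted pipelines, the source snapshot can also be fetched programmatically. This is a minimal sketch that assumes GitHub's standard tag-archive URL pattern (`/archive/refs/tags/<tag>.zip`); the helper names `tag_archive_url` and `download_tag` are hypothetical, and the download call is left commented out so the example runs without network access.

```python
import urllib.request

REPO = "ProMeText/multilingual-segmentation-dataset"


def tag_archive_url(tag: str, repo: str = REPO) -> str:
    """Build the zip archive URL for a tag (assumes GitHub's usual URL layout)."""
    return f"https://github.com/{repo}/archive/refs/tags/{tag}.zip"


def download_tag(tag: str, destination: str) -> str:
    """Download the tag's source snapshot to `destination` and return the path."""
    urllib.request.urlretrieve(tag_archive_url(tag), destination)
    return destination


print(tag_archive_url("segmented-baseline-2025"))
# Uncomment to actually download (requires network access):
# download_tag("segmented-baseline-2025", "segmented-baseline-2025.zip")
```

Pinning the tag name in the script, rather than a branch, keeps experiments tied to the frozen reference version.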
Citation
If you use this dataset in academic work, please cite:
ProMeText, Multilingual Segmentation Dataset – Baseline 2025, GitHub, 2025.
https://github.com/ProMeText/multilingual-segmentation-dataset/releases/tag/segmented-baseline-2025