Overview
This release provides the segmented dataset.
It represents the frozen reference version used as a starting point for subsequent experiments.
Contents
- Heterogeneous set of segmented texts
- Languages: multiple Romance languages, as well as English and Latin
- Format: raw and segmented `.txt` files for training and evaluation
Language Statistics (v0)
| Language | Texts | Segmented Tokens | Segments (£) | Train/Dev/Test? |
|---|---|---|---|---|
| Latin (la) | 557 | 85,888 | 8,366 | ✅ |
| French (fr) | 1,526 | 160,472 | 11,774 | ✅ |
| English (en) | 152 | 27,072 | 2,315 | ✅ |
| Portuguese (pt) | 987 | 101,565 | 10,477 | ✅ |
| Catalan (ca) | 388 | 38,441 | 2,879 | ✅ |
| Italian (it) | 2,649 | 85,290 | 6,347 | ✅ |
| Castilian (es) | 1,436 | 111,811 | 8,091 | ✅ |
| Total | 7,695 | 610,539 | 50,249 | ✅ |
Legend:
- Texts = total number of annotated examples (i.e. segmented lines)
- Segmented Tokens = total number of tokens (excluding `£`)
- Segments (`£`) = total number of `£` delimiters, i.e. segments
- Train/Dev/Test? = indicates whether `train.json`, `dev.json`, and `test.json` are all present
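
The three counts defined in the legend can be sketched as follows. This is a minimal illustration and not the script used to produce the table: it assumes one annotated example per line, whitespace-separated tokens, and `£` as the only segment delimiter. The names `CorpusStats` and `count_stats` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable

DELIMITER = "£"  # segment delimiter named in the legend


@dataclass
class CorpusStats:
    texts: int = 0             # annotated examples (segmented lines)
    segmented_tokens: int = 0  # tokens, excluding the delimiter
    segments: int = 0          # number of delimiters


def count_stats(lines: Iterable[str]) -> CorpusStats:
    """Count texts, tokens, and delimiters over segmented lines."""
    stats = CorpusStats()
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are not annotated examples
        stats.texts += 1
        stats.segments += line.count(DELIMITER)
        # Assumption: tokens are whitespace-separated; delimiters are
        # replaced by spaces so they are never counted as tokens.
        stats.segmented_tokens += len(line.replace(DELIMITER, " ").split())
    return stats


# Illustrative input only, not taken from the dataset.
example = [
    "arma virumque cano £ Troiae qui primus ab oris",
    "Italiam fato profugus £ Laviniaque venit £ litora",
]
stats = count_stats(example)
print(stats)
```

Counting delimiters with `str.count` rather than by token comparison means a `£` attached to a neighbouring word is still counted as a delimiter; whether that matches the dataset's conventions should be checked against the actual files.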
Purpose
This dataset is intended as the reference version for comparison with augmented corpora.
It ensures reproducibility of experiments and consistency across model evaluations.
How to use
- Check out this tag locally: `git checkout v1.0.0`
- Or download the source snapshot directly from this release page.
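
For scripted setups, the snapshot download can be sketched in Python. This is a hedged example: the repository name and tag are taken from the release URL in the Citation section, while the archive URL follows GitHub's usual pattern for tag archives and is an assumption here, not something this release states. The helpers `snapshot_url` and `download_snapshot` are hypothetical names.

```python
import urllib.request
from pathlib import Path

REPO = "ProMeText/multilingual-segmentation-dataset"  # from the release URL
TAG = "v1.0.0"


def snapshot_url(repo: str = REPO, tag: str = TAG) -> str:
    """Build the archive URL (assumed GitHub tag-archive pattern)."""
    return f"https://github.com/{repo}/archive/refs/tags/{tag}.zip"


def download_snapshot(dest: Path, repo: str = REPO, tag: str = TAG) -> Path:
    """Download the zip snapshot to `dest` (requires network access)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(snapshot_url(repo, tag), str(dest))
    return dest


# Only the URL is printed here; call download_snapshot(Path("data/v1.0.0.zip"))
# to actually fetch the archive.
print(snapshot_url())
```

Pinning the tag in a constant keeps experiments tied to the frozen reference version rather than to whatever the default branch currently contains.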
Citation
If you use this dataset in academic work, please cite:
ProMeText, Multilingual Segmentation Dataset – Version 1.0.0, GitHub, 2025.
https://github.com/ProMeText/multilingual-segmentation-dataset/releases/tag/v1.0.0