Releases · ProMeText/multilingual-segmentation-dataset

17 Oct 10:17

matgille

v1.0.0

3df7815

Segmentation Corpus -- Medieval Languages -- v1.0.0 Latest

Overview

This release provides the segmented dataset.
It represents the frozen reference version used as a starting point for subsequent experiments.

Contents

Heterogeneous set of segmented texts
Languages: multiple Romance languages, as well as English and Latin
Format: raw and segmented .txt files for training and evaluation

Language Statistics (v0)

| Language | Texts | Segmented Tokens | Segments (`£`) | Train/Dev/Test? |
|---|---:|---:|---:|:---:|
| Latin (`la`) | 557 | 85,888 | 8,366 | ✅ |
| French (`fr`) | 1,526 | 160,472 | 11,774 | ✅ |
| English (`en`) | 152 | 27,072 | 2,315 | ✅ |
| Portuguese (`pt`) | 987 | 101,565 | 10,477 | ✅ |
| Catalan (`ca`) | 388 | 38,441 | 2,879 | ✅ |
| Italian (`it`) | 2,649 | 85,290 | 6,347 | ✅ |
| Castilian (`es`) | 1,436 | 111,811 | 8,091 | ✅ |
| Total | 7,695 | 610,539 | 50,249 | ✅ |

Legend:

Texts = total number of annotated examples (i.e. segmented lines)
Segmented Tokens = total number of tokens (excluding £)
Segments (£) = total number of £ delimiters, i.e. the number of segments
Train/Dev/Test? = indicates whether train.json, dev.json, and test.json are all present
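
The three counts defined in the legend can be reproduced with a few lines of Python. The following is a minimal sketch, not the project's own counting script: it assumes that tokens are separated by whitespace and that every `£` character is a delimiter, whether or not it is surrounded by spaces. The names `CorpusStats` and `count_stats` are hypothetical, and the sample line in the usage note is made up for illustration.

```python
from dataclasses import dataclass

DELIMITER = "£"  # segment delimiter named in the legend


@dataclass
class CorpusStats:
    texts: int = 0     # annotated examples (non-empty segmented lines)
    tokens: int = 0    # tokens, excluding the delimiter
    segments: int = 0  # number of delimiters


def count_stats(lines):
    """Count texts, tokens and delimiters in an iterable of segmented lines.

    Assumption: tokens are whitespace-separated. The delimiter is padded
    with spaces first so that "a£b" is read as two tokens and one delimiter.
    """
    stats = CorpusStats()
    for line in lines:
        pieces = line.replace(DELIMITER, f" {DELIMITER} ").split()
        if not pieces:
            continue  # skip blank lines
        delimiters = pieces.count(DELIMITER)
        stats.texts += 1
        stats.segments += delimiters
        stats.tokens += len(pieces) - delimiters
    return stats
```

For instance, `count_stats(["in principio £ creavit deus caelum £ et terram"])` counts one text, seven tokens and two delimiters. If the dataset tokenizes differently (for example on punctuation), the token count from this sketch will differ from the published figures.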

Purpose

This dataset is intended as the reference version for comparison with augmented corpora.
It ensures reproducibility of experiments and consistency across model evaluations.

How to use

Check out this tag locally:
```
git checkout v1.0.0
```
Or download the source snapshot directly from this release page.
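
After checking out the tag, it can be useful to confirm that a language directory contains the three split files named in the legend. The sketch below is an assumption-laden helper rather than part of the repository: the release notes do not state the directory layout, so the function simply takes a directory path, and `missing_splits` is a hypothetical name.

```python
from pathlib import Path

# File names taken from the "Train/Dev/Test?" legend entry.
SPLIT_FILES = ("train.json", "dev.json", "test.json")


def missing_splits(language_dir):
    """Return the split files that are absent from `language_dir`.

    An empty list means train.json, dev.json and test.json are all present.
    """
    directory = Path(language_dir)
    return [name for name in SPLIT_FILES if not (directory / name).is_file()]
```

Calling `missing_splits("path/to/latin")` with the actual location of a language's files (the path here is a placeholder) returns an empty list when all three splits exist.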

Citation

If you use this dataset in academic work, please cite:

ProMeText, Multilingual Segmentation Dataset – Version 1.0.0, GitHub, 2025.
https://github.com/ProMeText/multilingual-segmentation-dataset/releases/tag/v1.0.0

25 Aug 15:32

carolisteia

segmented-baseline-2025

744269d

Segmented Baseline 2025

Overview

This release provides the segmented baseline dataset before any data augmentation.
It represents the frozen reference version used as a starting point for subsequent experiments.

Contents

Heterogeneous set of segmented texts
Languages: multiple Romance languages, as well as English and Latin
Format: raw and segmented .txt files for training and evaluation

Language Statistics (v0)

| Language | Texts | Segmented Tokens | Segments (`£`) | Train/Dev/Test? |
|---|---:|---:|---:|:---:|
| Latin (`la`) | 557 | 85,888 | 8,366 | ✅ |
| French (`fr`) | 1,526 | 160,472 | 11,774 | ✅ |
| English (`en`) | 152 | 27,072 | 2,315 | ✅ |
| Portuguese (`pt`) | 987 | 101,565 | 10,477 | ✅ |
| Catalan (`ca`) | 388 | 38,441 | 2,879 | ✅ |
| Italian (`it`) | 2,649 | 85,290 | 6,347 | ✅ |
| Castilian (`es`) | 1,436 | 111,811 | 8,091 | ✅ |
| Total | 7,695 | 610,539 | 50,249 | ✅ |

Legend:

Texts = total number of annotated examples (i.e. segmented lines)
Segmented Tokens = total number of tokens (excluding £)
Segments (£) = total number of £ delimiters, i.e. the number of segments
Train/Dev/Test? = indicates whether train.json, dev.json, and test.json are all present
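
As a consistency check on the table above, the Total row can be recomputed from the per-language rows, and a tokens-per-delimiter ratio can be derived for each language. This is a small sketch using only the published figures; the names `ROWS`, `column_totals` and `tokens_per_segment` are hypothetical.

```python
# (texts, segmented tokens, segments) per language, copied from the table.
ROWS = {
    "la": (557, 85_888, 8_366),
    "fr": (1_526, 160_472, 11_774),
    "en": (152, 27_072, 2_315),
    "pt": (987, 101_565, 10_477),
    "ca": (388, 38_441, 2_879),
    "it": (2_649, 85_290, 6_347),
    "es": (1_436, 111_811, 8_091),
}
REPORTED_TOTAL = (7_695, 610_539, 50_249)  # the table's Total row


def column_totals(rows):
    """Sum each column (texts, tokens, segments) across all languages."""
    return tuple(sum(column) for column in zip(*rows.values()))


def tokens_per_segment(rows):
    """Average number of tokens per £ delimiter, by language code."""
    return {lang: tokens / segments for lang, (_, tokens, segments) in rows.items()}


if __name__ == "__main__":
    print("totals match:", column_totals(ROWS) == REPORTED_TOTAL)
    for lang, ratio in tokens_per_segment(ROWS).items():
        print(f"{lang}: {ratio:.1f} tokens per delimiter")
```

With the figures as published, the recomputed sums agree with the Total row.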

Purpose

This baseline is intended as the reference version for comparison with augmented corpora.
It ensures reproducibility of experiments and consistency across model evaluations.

How to use

Check out this tag locally:
```
git checkout segmented-baseline-2025
```
Or download the source snapshot directly from this release page.

Citation

If you use this dataset in academic work, please cite:

ProMeText, Multilingual Segmentation Dataset – Baseline 2025, GitHub, 2025.
https://github.com/ProMeText/multilingual-segmentation-dataset/releases/tag/segmented-baseline-2025
