This document describes the data preparation pipeline used to build the multilingual segmentation dataset.
It details the steps taken from raw text collection to the production of segmented, model-ready files.
Texts were collected from a wide range of sources, including digital editions, OCR outputs, and manually transcribed files.
Preferred formats included:
- .txt (plain text)
- .xml (TEI, custom formats)
- .html / .pdf (converted to plain text using external tools)
All files underwent basic cleaning, mostly using regular expressions, to normalize:
- markup and tags (removal or replacement)
- inconsistent whitespace and line breaks
- encoding anomalies (e.g., non-UTF-8)
⚠️ Note: This step does not involve orthographic normalization.
Original spelling, punctuation, and diacritics were preserved to maintain historical variation.
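As a rough illustration of this cleaning step, the sketch below applies the kinds of regular expressions described above (the exact patterns used in the pipeline are an assumption); note that it normalizes markup, whitespace, and Unicode form only, never spelling or diacritics:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Basic cleaning: strip markup and normalize whitespace/encoding.
    No orthographic normalization -- original spelling, punctuation,
    and diacritics are preserved."""
    text = unicodedata.normalize("NFC", raw)  # repair encoding anomalies
    text = re.sub(r"<[^>]+>", " ", text)      # remove markup and tags
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)    # normalize excess line breaks
    return text.strip()

clean_text("<p>Lorem   ipsum</p>")  # → "Lorem ipsum"
```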
Segmentation was performed using a hybrid approach combining:
- manual annotation by trained annotators
- rule-based heuristics to accelerate the process (e.g., splitting at discourse markers; see
word_delimiters.md)
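A minimal sketch of such a heuristic is shown below; the marker list is purely illustrative (the real rules live in word_delimiters.md), and its output is a candidate segmentation that annotators then confirm or correct:

```python
import re

# Hypothetical marker list -- the actual language-specific rules
# are documented in word_delimiters.md.
DISCOURSE_MARKERS = ["et", "mais", "car", "donc"]

def suggest_boundaries(text: str, markers=DISCOURSE_MARKERS) -> str:
    """Insert a candidate £ boundary before each discourse marker.
    This only pre-fills splits for manual review, it is not final."""
    pattern = r"\s+(?=(?:" + "|".join(map(re.escape, markers)) + r")\b)"
    return re.sub(pattern, " £ ", text)

suggest_boundaries("il vint et il vit")  # → "il vint £ et il vit"
```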
Segment boundaries are marked using the £ symbol.
The £ delimiter is consistent across formats (.txt, .json) and enables straightforward tokenization for machine learning tasks.
It also supports binary labeling of tokens: 1 for a segmentation point (i.e., when a token is followed by £), and 0 otherwise.
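The labeling scheme above can be sketched as follows, assuming the £ delimiter is whitespace-separated from the surrounding tokens (adjust the tokenization if it is attached):

```python
def tokens_and_labels(segment_line: str):
    """Convert a £-annotated line into (tokens, labels):
    label 1 if the token is followed by a £ boundary, else 0."""
    tokens, labels = [], []
    for tok in segment_line.split():
        if tok == "£":
            if labels:
                labels[-1] = 1  # previous token precedes a boundary
        else:
            tokens.append(tok)
            labels.append(0)
    return tokens, labels

tokens_and_labels("Lorem ipsum £ dolor sit £")
# → (["Lorem", "ipsum", "dolor", "sit"], [0, 1, 0, 1])
```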
Files with annotated segments (using £) are stored in the pre_split/ folder.
They are not yet split into train/dev/test.
Formats:
- .txt (plain text with £)
- .json (structured format for ML input)
Files are then partitioned into the split/ folder, with the following structure:
- train/, dev/, and test/ subfolders per language
- Each subfolder contains:
  - .txt files with £ delimiters
  - .json files with segmented data in ML-friendly format
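The document does not specify how files are assigned to the subsets, so the sketch below is only one plausible approach, a deterministic hash-based assignment (ratios and function name are assumptions), which keeps the partition stable across reruns:

```python
import hashlib
from pathlib import Path

def assign_split(path: Path, ratios=(0.8, 0.1, 0.1)) -> str:
    """Deterministically assign a file to train/dev/test by hashing
    its name, so repeated runs produce the same partition.
    The 80/10/10 ratios here are illustrative only."""
    bucket = int(hashlib.sha1(path.name.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < ratios[0] * 100:
        return "train"
    if bucket < (ratios[0] + ratios[1]) * 100:
        return "dev"
    return "test"
```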
- raw/: Cleaned, unsegmented texts
- segmented/pre_split/: Fully segmented excerpts, not partitioned
- segmented/split/: Segmented and partitioned into ML subsets
- word_delimiters.json – Language-specific word boundary rules
- segmentation_criteria.md – Sentence segmentation principles and heuristics
- segmentation_exemples.md – Annotated examples in multiple languages
🔙 Back to README
