This document summarizes key notes on text acquisition, language-specific sourcing variability, and metadata management across the Multilingual Segmentation Corpus.
The process of text acquisition varied considerably depending on the language.
- For some languages, such as French, acquisition was straightforward thanks to well-structured resources like the BFM Corpus.
- For Portuguese, English, and Italian, we used structured corpora that provided valuable data but required more intensive preparation:
  - CTA – Corpus de Textos Antigos
  - LAEME – Linguistic Atlas of Early Middle English
  - Biblioteca Italiana
- In contrast, resources like the OVI (Italian) and CICA (Catalan) offered limited public access, so texts were recovered from critical editions or scraped from the web when necessary.
This variability required a flexible, language-sensitive approach to both sourcing and preprocessing.
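One way to picture this language-sensitive approach is a small source registry that maps each resource to an access level and a list of acquisition steps. The sketch below is illustrative only: the `Access` categories, the `SOURCES` table, and the helper names are hypothetical, and the pairing of CTA, LAEME, and Biblioteca Italiana with Portuguese, English, and Italian is our reading of the list above, not a description of the project's actual scripts.

```python
from enum import Enum


class Access(Enum):
    """Hypothetical access levels, mirroring the three cases described above."""
    DIRECT = "direct"          # well-structured resource, usable almost as-is
    NEEDS_PREP = "needs_prep"  # structured corpus that needs heavier preparation
    LIMITED = "limited"        # restricted public access


# Assumed language/resource pairing, for illustration only.
SOURCES = [
    ("French", "BFM Corpus", Access.DIRECT),
    ("Portuguese", "CTA", Access.NEEDS_PREP),
    ("English", "LAEME", Access.NEEDS_PREP),
    ("Italian", "Biblioteca Italiana", Access.NEEDS_PREP),
    ("Italian", "OVI", Access.LIMITED),
    ("Catalan", "CICA", Access.LIMITED),
]

# Acquisition steps per access level (descriptive labels, not real commands).
STEPS = {
    Access.DIRECT: ["retrieve from corpus"],
    Access.NEEDS_PREP: ["retrieve from corpus", "intensive preparation"],
    Access.LIMITED: ["recover from critical edition or scrape", "preparation"],
}


def plans_for(language):
    """Return (resource, steps) pairs for every registered source of a language."""
    return [(name, STEPS[access]) for lang, name, access in SOURCES if lang == language]
```

Keeping the per-resource differences in data rather than in branching code means a new language can be added by extending the table.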
To ensure transparency, consistency, and reproducibility, an internal application form was created to standardize metadata collection for each text in the segmentation corpus.
This form captured key details such as:
- 📌 Source type (digital edition, manuscript, OCR, etc.)
- 📚 Edition or manuscript reference (bibliographic citation)
- 🌍 Linguistic variety and chronological range
- 🗂️ Format and structure of the original file
- 📝 Reuse/licensing conditions
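As a rough illustration of what one form submission could look like once structured, here is a minimal Python sketch of a metadata record. The form's real schema is not public, so the class name `TextMetadata`, every field name, and the validation rules are assumptions made for this example.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class TextMetadata:
    """Hypothetical record mirroring the form fields listed above."""
    source_type: str             # e.g. "digital edition", "manuscript", "OCR"
    reference: str               # bibliographic citation of the edition or manuscript
    variety: str                 # linguistic variety
    date_range: Tuple[int, int]  # chronological range as (start_year, end_year)
    file_format: str             # format and structure of the original file
    reuse_conditions: str        # reuse/licensing conditions

    def __post_init__(self) -> None:
        # Reject blank text fields so incomplete submissions surface early.
        for name in ("source_type", "reference", "variety",
                     "file_format", "reuse_conditions"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        start, end = self.date_range
        if start > end:
            raise ValueError("date_range must satisfy start_year <= end_year")
```

A record built this way can be turned into a plain dictionary with `asdict(record)`, which is convenient when writing rows to a CSV file.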
Although the form itself is not public, we provide access to the processed metadata and the scripts used during compilation.
- 📂 Data Processing Repository – Corpus Temporis App: Streamlit app and scripts used to structure, validate, and convert incoming texts and metadata.
- 📊 Compiled Metadata Table – data.csv: a centralized CSV that lists all processed texts with the following fields:
  - Language
  - Title
  - Edition or source
  - Format
  - License/reuse status
  - File location references
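For readers who want to inspect the compiled table programmatically, the following sketch reads the CSV with Python's standard `csv` module and counts texts per language. The header names in `EXPECTED_COLUMNS` are assumptions derived from the field list above; the actual `data.csv` may label its columns differently, so adjust them before use.

```python
import csv
from collections import Counter

# Assumed header names; check them against the real data.csv.
EXPECTED_COLUMNS = ["language", "title", "edition", "format", "license", "file_location"]


def read_metadata(lines):
    """Parse CSV lines (an open file or any iterable of strings) into row dicts.

    Raises ValueError if any expected column is missing from the header.
    """
    reader = csv.DictReader(lines)
    header = reader.fieldnames or []
    missing = [column for column in EXPECTED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def texts_per_language(rows):
    """Count how many processed texts are listed for each language."""
    return Counter(row["language"] for row in rows)


# Typical use, assuming data.csv sits in the working directory:
#     with open("data.csv", newline="", encoding="utf-8") as handle:
#         rows = read_metadata(handle)
#     print(texts_per_language(rows))
```

Checking the header before reading rows makes a renamed or missing column fail loudly instead of producing silently incomplete records.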
📬 For contributions or metadata corrections, feel free to open an issue or pull request.