Tracing Lancelot’s journey not only through lands and courts, but through languages, manuscripts, and time.
An aligned multilingual corpus of the Lancelot en prose, processed with Aquilign.
This repository provides training and evaluation data for the study of the multilingual medieval textual tradition of the Lancelot en prose, including both manually corrected alignments and automatically generated outputs.
It features a curated selection of witnesses from the Lancelot en prose, one of the core cycles of the Arthurian literary corpus. The texts are presented in multiple medieval languages—Old French, Old Spanish (Castilian), and Old Italian—and have been automatically segmented and aligned using the Aquilign pipeline, a BERT-based tool for multilingual clause-level alignment of medieval texts.
In addition to alignment outputs, the repository includes manual corrections and evaluation files to support philological analysis and computational validation.
This project is part of a broader study on computational multilingual alignment and stemmatological analysis of medieval texts. The case study focuses on the Lancelot en prose, a major prose romance composed anonymously in the early 13th century. With at least 126 surviving manuscript witnesses and numerous translations, the Lancelot represents one of the most widespread and complex textual traditions of medieval Europe.
This corpus includes aligned segments from the Medieval Romance tradition—specifically:
- French (source),
- Castilian, and
- Italian.
Due to the fragmentary and unstable nature of the textual transmission, only comparable segments from selected witnesses were retained. Alignment is performed at the clause (“sentence”) level, preceded by a segmentation step using a custom-trained model. The aligned outputs are then used for variant classification and stemmatological exploration.
The corpus was processed using Aquilign, a multilingual alignment tool developed by the ProMeText team.
-
Segmentation: Performed at the clause level using custom token classification models trained on historical data.
For detailed information about the segmentation guidelines, training data, and language coverage, see the Multilingual Segmentation Dataset repository, particularly its/docs/folder. -
Alignment: Based on contextual embeddings using LaBSE.
-
Languages: Medieval French, Castilian, and Italian
Aquilign repository: https://github.com/ProMeText/Aquilign
Example of multilingual alignment table: 👉 View aligned chapter
Contributions to the project are highly encouraged, whether they be additional data, bug fixes, or enhancements to the analysis scripts. To contribute:
- Fork the Repository – Start by forking the repository and cloning it locally.
- Create a Branch – Make your changes in a new branch named after the feature or fix.
- Submit a Pull Request – After pushing your changes to your fork, open a pull request for discussion and review.
If you use the corpus or refer to the alignment methodology, please cite:
Gille Levenson, M., Ing, L., & Camps, J.-B. (2024).
Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the "Lancelot en prose" (Medieval French, Castilian, Italian).
In W. Haverals, M. Koolen, & L. Thompson (Eds.), Proceedings of the Computational Humanities Research Conference 2024 (Vol. 3834, pp. 65–92). CEUR.
https://ceur-ws.org/Vol-3834/#paper104
@inproceedings{gillelevenson_TextualTransmissionBorders_2024a,
title = {Textual {{Transmission}} without {{Borders}}: {{Multiple Multilingual Alignment}} and {{Stemmatology}} of the ``{{Lancelot}} En Prose'' ({{Medieval French}}, {{Castilian}}, {{Italian}})},
shorttitle = {Textual {{Transmission}} without {{Borders}}},
booktitle = {Proceedings of the {{Computational Humanities}} {{Research Conference}} 2024},
author = {Gille Levenson, Matthias and Ing, Lucence and Camps, Jean-Baptiste},
editor = {Haverals, Wouter and Koolen, Marijn and Thompson, Laure},
date = {2024},
series = {{{CEUR Workshop Proceedings}}},
volume = {3834},
pages = {65--92},
publisher = {CEUR},
location = {Aarhus, Denmark},
issn = {1613-0073},
url = {https://ceur-ws.org/Vol-3834/#paper104},
urldate = {2024-12-09},
eventtitle = {Computational {{Humanities Research}} 2024},
langid = {english},
file = {/home/mgl/Bureau/Travail/Bibliotheque_zoteros/storage/CIH7IAHV/Levenson et al. - 2024 - Textual Transmission without Borders Multiple Multilingual Alignment and Stemmatology of the ``Lanc.pdf}
}This repository is part of a broader ecosystem of tools and corpora developed for the study of medieval multilingual textual traditions:
-
Aquilign
A clause-level multilingual alignment engine based on contextual embeddings (LaBSE), developed for medieval texts. -
Multilingual Segmentation Dataset Source texts and their segmented versions in multiple medieval Romance languages, as well as in Latin and English, used for training and evaluating clause segmentation models.
-
Parallelium – an aligned scriptures dataset
A multilingual dataset of aligned Biblical and Qur’anic texts — spanning medieval and modern languages — designed for training and evaluating multilingual alignment models, especially in historical and philological contexts. -
Multilingual Aegidius
A parallel corpus of translations of Aegidius Romanus’ De regimine principum in Latin, Medieval Romance languages, and English, segmented and aligned using the same pipeline.
This work benefited from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+).
Ce travail a bénéficié d'une aide de l’État gérée par l’Agence Nationale de la Recherche au titre du programme d’Investissements d’avenir portant la référence ANR-21-ESRE-0005 (Biblissima+).
This project is licensed under the CC BY-NC-SA 4.0 license.
This license allows users to adapt, remix, and build upon the work for non-commercial purposes, provided that they credit the original authors and share any derivative works under the same license.

