|
1 | | -# AQUILIGN -- Mutilingual aligner and collator |
| 1 | +# 📐 AQUILIGN – Multilingual Aligner and Collator |
2 | 2 |
|
3 | 3 | [](https://codecov.io/github/ProMeText/Aquilign) |
| 4 | +[](https://github.com/ProMeText/Aquilign/commits/main) |
| 5 | +[](https://github.com/ProMeText/Aquilign) |
| 6 | +[](https://github.com/ProMeText/Aquilign/issues) |
| 7 | +[](https://github.com/ProMeText/Aquilign/network/members) |
| 8 | +[](https://github.com/ProMeText/Aquilign/stargazers) |
4 | 9 |
|
| 10 | +**AQUILIGN** is a multilingual alignment and collation engine designed for **historical and philological corpora**. |
| 11 | +It performs **clause-level alignment** of parallel texts using a combination of **regular-expression and BERT-based segmentation**, and supports multilingual workflows across medieval Romance, Latin, and Middle English texts. |
5 | 12 |
|
6 | | -This repo contains a set of scripts to align (and soon collate) a multilingual medieval corpus. Its designers are Matthias Gille Levenson, Lucence Ing and Jean-Baptiste Camps. |
| 13 | +🧪 Developed by [Matthias Gille Levenson](https://github.com/matgille), [Lucence Ing](https://cv.hal.science/lucence-ing), and [Jean-Baptiste Camps](https://github.com/Jean-Baptiste-Camps). |
| 14 | +Originally presented at the *Computational Humanities Research Conference (CHR 2024)*; see [citation](#-citation) for the full reference. |
7 | 15 |
|
8 | | -It is based on a fork of the automatic multilingual sentence aligner Bertalign. |
9 | 16 |
|
10 | | -The scripts relies on a prior phase of text segmentation at syntagm level using regular expressions or bert-based segmentation to match grammatical syntagms and produce a more precise alignment. |
| 17 | +--- |
11 | 18 |
|
12 | | -## Installation |
| 19 | +## 💡 Key Features |
13 | 20 |
|
14 | | -**Caveat**: the code is being tested on Python 3.9 and 3.10 due to some libraries limitations. |
| 21 | +- 🔀 **Multilingual clause-level alignment** using contextual embeddings |
| 22 | +- ✂️ **Trainable segmentation module** (BERT-based or regex-based) |
| 23 | +- 🧩 **Collation-ready architecture** (stemmatology support in development) |
| 24 | +- 📚 Optimized for **premodern and historical corpora** |
15 | 25 |
|
16 | | -`pip3 install -r requirements.txt` |
| 26 | +AQUILIGN builds on a fork of [Bertalign](https://github.com/bfsujason/bertalign), customized for historical languages and alignment evaluation. |
17 | 27 |
|
| 28 | +--- |
18 | 29 |
|
19 | | -## Training the segmenter |
| 30 | +## ⚙️ Installation |
20 | 31 |
|
21 | | -The segmenter we use is based on a Bert AutoModelForTokenClassification that is trainable. |
| 32 | +AQUILIGN is tested on **Python 3.9 and 3.10** only; other versions may fail because of dependency constraints. |
22 | 33 |
|
23 | | -Example of use: |
| 34 | +```bash |
| 35 | +git clone https://github.com/ProMeText/Aquilign.git |
| 36 | +cd Aquilign |
| 37 | +pip install -r requirements.txt |
| 38 | +``` |
| 39 | +## 🧠 Training the Segmenter |
| 40 | + |
| 41 | +The segmenter is based on a trainable `BertForTokenClassification` model from Hugging Face’s `transformers` library. |
| 42 | + |
| 43 | +We fine-tune this model to detect custom sentence delimiters (`£`) in historical texts from the **[Multilingual Segmentation Dataset](https://github.com/carolisteia/multilingual-segmentation-dataset)**. |
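To make the role of the delimiter concrete, the sketch below turns a `£`-marked string into whitespace tokens and binary labels, where `1` marks a token that opens a new segment. The function name `delimiter_to_labels`, the whitespace tokenization, and the 0/1 label scheme are assumptions made for this illustration; the repository's actual preprocessing may differ.

```python
def delimiter_to_labels(text, delimiter="£"):
    """Split on whitespace and label each token: 1 if it opens a segment, else 0.

    Hypothetical helper for illustration only, not part of the Aquilign API.
    """
    tokens, labels = [], []
    pending = False  # True when a delimiter was seen and the next token opens a segment
    for raw in text.split():
        if raw.startswith(delimiter):
            pending = True
            raw = raw[len(delimiter):]
        if not raw:
            # A delimiter standing alone: the mark carries over to the next token.
            continue
        tokens.append(raw)
        labels.append(1 if pending else 0)
        pending = False
    return tokens, labels


tokens, labels = delimiter_to_labels("que mi padre me diese £por muger a un su fijo del Rey")
print(list(zip(tokens, labels)))
```

A token classifier trained on such pairs learns to predict where a delimiter should be placed in unsegmented text.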
24 | 44 |
|
25 | | -`python3 train_tokenizer.py -m google-bert/bert-base-multilingual-cased -t ../Multilingual_Aegidius/data/segmentation_data/split/multilingual/train.json -d ../Multilingual_Aegidius/data/segmentation_data/split/multilingual/dev.json -e ../Multilingual_Aegidius/data/segmentation_data/split/multilingual/test.json -ep 100 -b 128 --device cuda:0 -bf16 -n multilingual_model -s 2 -es 10` |
| 45 | +--- |
| 46 | + |
| 47 | +### 🔧 Example Command |
| 48 | + |
| 49 | +```bash |
| 50 | +python3 train_tokenizer.py \ |
| 51 | + -m google-bert/bert-base-multilingual-cased \ |
| 52 | + -t multilingual-segmentation-dataset/data/Multilingual_Aegidius/segmented/split/multilingual/train.json \ |
| 53 | + -d multilingual-segmentation-dataset/data/Multilingual_Aegidius/segmented/split/multilingual/dev.json \ |
| 54 | + -e multilingual-segmentation-dataset/data/Multilingual_Aegidius/segmented/split/multilingual/test.json \ |
| 55 | + -ep 100 \ |
| 56 | + -b 128 \ |
| 57 | + --device cuda:0 \ |
| 58 | + -bf16 \ |
| 59 | + -n multilingual_model \ |
| 60 | + -s 2 \ |
| 61 | + -es 10 |
| 62 | +``` |
| 63 | +This command fine-tunes the `bert-base-multilingual-cased` model with the following configuration: |
26 | 64 |
|
27 | | -For finetuning a multilingual model from the `bert-base-multilingual-cased` model, on 100 epochs, a batch size of 128, |
28 | | -on the GPU, using bf16 mixed precision, saving the model every two epochs and with and early stopping value of 10. |
| 65 | +- **Epochs**: `100` |
| 66 | +- **Batch size**: `128` |
| 67 | +- **Device**: `cuda:0` (GPU) |
| 68 | +- **Precision**: `bf16` (bfloat16 mixed precision) |
| 69 | +- **Checkpointing**: Saves the model every 2 epochs |
| 70 | +- **Early stopping**: Stops after 10 epochs without improvement |
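The interaction between periodic checkpointing (`-s 2`) and early stopping (`-es 10`) can be sketched as follows. This is a simplified simulation under stated assumptions: the monitored quantity is taken to be a per-epoch dev loss, and `simulate_schedule` is a hypothetical name; the training script may monitor a different metric.

```python
def simulate_schedule(dev_losses, save_every=2, patience=10):
    """Replay a list of per-epoch dev losses.

    Returns (epochs_run, checkpoint_epochs). Illustration only: the metric
    used by the real training script is an assumption here.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    checkpoints = []
    for epoch, loss in enumerate(dev_losses, start=1):
        if epoch % save_every == 0:
            checkpoints.append(epoch)  # a checkpoint would be written here
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch, checkpoints  # early stop
    return len(dev_losses), checkpoints


# With a patience of 2, training stops at epoch 4 after two epochs without improvement.
print(simulate_schedule([1.0, 0.8, 0.9, 0.9, 0.9], save_every=2, patience=2))
```

With the values from the example command, training would run for at most 100 epochs and stop earlier once 10 consecutive epochs bring no improvement.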
29 | 71 |
|
30 | | -The training data must follow the following structure and will be validated against a specific JSON schema. |
| 72 | +--- |
31 | 73 |
|
32 | | -```JSON |
33 | | -{"metadata": |
34 | | - { |
| 74 | +### 🗂️ Input Format: JSON Schema |
| 75 | + |
| 76 | +Training data must follow a structured JSON format, including both metadata and examples. |
| 77 | + |
| 78 | +```json |
| 79 | +{ |
| 80 | + "metadata": { |
35 | 81 | "lang": ["la", "it", "es", "fr", "en", "ca", "pt"], |
36 | | - "centuries": [13, 14, 15, 16], "delimiter": "£" |
| 82 | + "centuries": [13, 14, 15, 16], |
| 83 | + "delimiter": "£" |
37 | 84 | }, |
38 | | -"examples": |
39 | | - [ |
40 | | - {"example": "que mi padre me diese £por muger a un su fijo del Rey", |
41 | | - "lang": "es"}, |
42 | | - {"example": "Per fé, disse Lion, £i v’andasse volentieri, £ma i vo veggio £qui", |
43 | | - "lang": "it"} |
44 | | - ] |
| 85 | + "examples": [ |
| 86 | + { |
| 87 | + "example": "que mi padre me diese £por muger a un su fijo del Rey", |
| 88 | + "lang": "es" |
| 89 | + }, |
| 90 | + { |
| 91 | + "example": "Per fé, disse Lion, £i v’andasse volentieri, £ma i vo veggio £qui", |
| 92 | + "lang": "it" |
| 93 | + } |
| 94 | + ] |
45 | 95 | } |
46 | 96 | ``` |
47 | | -The metadata is used for describing the corpus and will be parsed in search for the delimiter. It is the only mandatory |
48 | | -information. |
| 97 | +- The `metadata` block must include: |
| 98 | + |
| 99 | + - `"lang"`: a list of ISO 639-1 codes representing the languages in the dataset |
| 100 | +  - `"centuries"`: the centuries covered by the examples (descriptive metadata, possibly used for filtering) |
| 101 | + - `"delimiter"`: the segmentation marker token (default: `£`), predicted by the model |
49 | 102 |
|
50 | | -We recommend using the ISO codes for the target languages. |
51 | | -The codes must match the language codes that are in the [`aquilign/preproc/delimiters.json`](aquilign/preproc/delimiters.json) file, used for the |
52 | | -regexp tokenization that can be used as a baseline. |
| 103 | +- The `examples` block is an array of training samples, each containing: |
53 | 104 |
|
54 | | -## Use of the aligner |
| 105 | + - `"example"`: a string of text including segmentation markers |
| 106 | + - `"lang"`: the ISO code of the language the text belongs to |
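Before launching a training run, it can help to check a file against the structure described above. The following is a minimal sketch, not the JSON schema used by the project: `check_training_data` is a hypothetical helper, and the specific checks (delimiter present, each example carrying `example` and `lang`, languages declared in the metadata) are assumptions derived from the description.

```python
import json


def check_training_data(data):
    """Return a list of problems; an empty list means the basic checks passed.

    Hypothetical helper for illustration, not the project's schema validation.
    """
    problems = []
    metadata = data.get("metadata", {})
    if not metadata.get("delimiter"):
        problems.append("metadata.delimiter is missing")
    declared = set(metadata.get("lang", []))
    for i, item in enumerate(data.get("examples", [])):
        if "example" not in item or "lang" not in item:
            problems.append(f"example {i}: missing 'example' or 'lang'")
            continue
        if declared and item["lang"] not in declared:
            problems.append(f"example {i}: language {item['lang']!r} not declared in metadata")
    return problems


sample = """
{
  "metadata": {"lang": ["es", "it"], "centuries": [13, 14], "delimiter": "£"},
  "examples": [
    {"example": "que mi padre me diese £por muger a un su fijo del Rey", "lang": "es"}
  ]
}
"""
print(check_training_data(json.loads(sample)))  # prints []
```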
55 | 107 |
|
56 | | -`python3 main.py -o lancelot -i data/extraitsLancelot/ii-48/ -mw data/extraitsLancelot/ii-48/fr/micha-ii-48.txt -d |
57 | | -cuda:0 -t bert-based` to perform alignment with our bert-based segmenter, choosing Micha edition as base witness, |
58 | | -on the GPU. The results will be saved in `result_dir/lancelot` |
| 108 | +--- |
59 | 109 |
|
60 | | -`python3 main.py --help` to print help. |
| 110 | +📖 For more details, see the full documentation: |
| 111 | +➡️ [segmentation_model.md](https://github.com/carolisteia/multilingual-segmentation-dataset/blob/main/docs/segmentation_model.md) |
61 | 112 |
|
62 | | -Files must be sorted by language, using the ISO_639-1 language code as parent directory name (`es`, `fr`, `it`, `en`, etc). |
63 | | -## Citation |
64 | 113 |
|
65 | | -Gille Levenson, M., Ing, L., & Camps, J.-B. (2024). Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the ``Lancelot en prose’’ (Medieval French, Castilian, Italian). In W. Haverals, M. Koolen, & L. Thompson (Eds.), Proceedings of the Computational Humanities Research Conference 2024 (Vol. 3834, pp. 65–92). CEUR. https://ceur-ws.org/Vol-3834/#paper104 |
| 114 | +## 🧮 Using the Aligner |
66 | 115 |
|
| 116 | +To align a set of parallel texts using the BERT-based segmenter, run: |
67 | 117 |
|
| 118 | +```bash |
| 119 | +python3 main.py \ |
| 120 | + -o lancelot \ |
| 121 | + -i data/extraitsLancelot/ii-48/ \ |
| 122 | + -mw data/extraitsLancelot/ii-48/fr/micha-ii-48.txt \ |
| 123 | + -d cuda:0 \ |
| 124 | + -t bert-based |
68 | 125 | ``` |
| 126 | +This will: |
| 127 | + |
| 128 | +- ✅ Align the multilingual files found in `data/extraitsLancelot/ii-48/` |
| 129 | +- 📚 Use the **Micha edition** (French) as the **base witness** |
| 130 | +- ⚙️ Run on the **GPU** (`cuda:0`) |
| 131 | +- 💾 Save results to: `result_dir/lancelot/` |
| 132 | + |
| 133 | + |
| 134 | +> 📂 Files must be sorted by language, using the ISO 639-1 language code |
| 135 | +> as the **parent directory name** (`es/`, `fr/`, `it/`, `en/`, etc.). |
| 136 | +
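The expected directory layout can be checked with a few lines of Python. This sketch assumes plain-text witnesses with a `.txt` extension (as in the Micha example path) stored one level below the input directory; `witnesses_by_language` is a hypothetical helper, not a function of the aligner.

```python
from collections import defaultdict
from pathlib import Path


def witnesses_by_language(input_dir):
    """Group witness files by the name of their parent directory (the language code).

    Illustration only: assumes files are laid out as <input_dir>/<lang>/<file>.txt.
    """
    groups = defaultdict(list)
    for path in sorted(Path(input_dir).glob("*/*.txt")):
        groups[path.parent.name].append(path.name)
    return dict(groups)


if __name__ == "__main__":
    # Path taken from the example command above; adjust to your own corpus.
    for lang, files in witnesses_by_language("data/extraitsLancelot/ii-48/").items():
        print(lang, files)
```

An empty result, or a key that is not a language code, usually means the files are not sorted into per-language directories.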
|
| 137 | +To view all available options: |
| 138 | + |
| 139 | +```bash |
| 140 | +python3 main.py --help |
| 141 | +``` |
| 142 | +--- |
| 143 | +## 📚 Citation |
| 144 | + |
| 145 | +If you use this tool in your research, please cite: |
| 146 | + |
| 147 | +Gille Levenson, M., Ing, L., & Camps, J.-B. (2024). |
| 148 | +**Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the _Lancelot en prose_ (Medieval French, Castilian, Italian).** |
| 149 | +In W. Haverals, M. Koolen, & L. Thompson (Eds.), *Proceedings of the Computational Humanities Research Conference 2024* (Vol. 3834, pp. 65–92). CEUR. |
| 150 | +🔗 [https://ceur-ws.org/Vol-3834/#paper104](https://ceur-ws.org/Vol-3834/#paper104) |
| 151 | + |
| 152 | +### 📄 BibTeX |
| 153 | + |
| 154 | +```bibtex |
69 | 155 | @inproceedings{gillelevenson_TextualTransmissionBorders_2024a, |
70 | | - title = {Textual {{Transmission}} without {{Borders}}: {{Multiple Multilingual Alignment}} and {{Stemmatology}} of the ``{{Lancelot}} En Prose'' ({{Medieval French}}, {{Castilian}}, {{Italian}})}, |
71 | | - shorttitle = {Textual {{Transmission}} without {{Borders}}}, |
72 | | - booktitle = {Proceedings of the {{Computational Humanities}} {{Research Conference}} 2024}, |
| 156 | + title = {Textual Transmission without Borders: Multiple Multilingual Alignment and Stemmatology of the ``Lancelot En Prose'' (Medieval French, Castilian, Italian)}, |
| 157 | + shorttitle = {Textual Transmission without Borders}, |
| 158 | + booktitle = {Proceedings of the Computational Humanities Research Conference 2024}, |
73 | 159 | author = {Gille Levenson, Matthias and Ing, Lucence and Camps, Jean-Baptiste}, |
74 | 160 | editor = {Haverals, Wouter and Koolen, Marijn and Thompson, Laure}, |
75 | 161 | date = {2024}, |
76 | | - series = {{{CEUR Workshop Proceedings}}}, |
| 162 | + series = {CEUR Workshop Proceedings}, |
77 | 163 | volume = {3834}, |
78 | 164 | pages = {65--92}, |
79 | 165 | publisher = {CEUR}, |
80 | 166 | location = {Aarhus, Denmark}, |
81 | 167 | issn = {1613-0073}, |
82 | 168 | url = {https://ceur-ws.org/Vol-3834/#paper104}, |
83 | 169 | urldate = {2024-12-09}, |
84 | | - eventtitle = {Computational {{Humanities Research}} 2024}, |
85 | | - langid = {english}, |
86 | | - file = {/home/mgl/Bureau/Travail/Bibliotheque_zoteros/storage/CIH7IAHV/Levenson et al. - 2024 - Textual Transmission without Borders Multiple Multilingual Alignment and Stemmatology of the ``Lanc.pdf} |
| 170 | + eventtitle = {Computational Humanities Research 2024}, |
| 171 | + langid = {english} |
87 | 172 | } |
88 | | -
|
89 | 173 | ``` |
90 | 174 |
|
| 175 | +## 🔗 Related Projects |
| 176 | + |
| 177 | +**Aquilign** is part of a broader ecosystem of tools and corpora developed for the computational study of medieval multilingual textual traditions. The following repositories provide aligned datasets, segmentation resources, and use cases for the Aquilign pipeline: |
| 178 | + |
| 179 | +- [Multilingual Segmentation Data](https://github.com/ProMeText/multilingual-segmentation-data) |
| 180 | + Sentence- and clause-level segmentation datasets in seven medieval languages, used to train and evaluate the segmentation model integrated into Aquilign. |
| 181 | + |
| 182 | +- [Parallelium – an aligned scriptures dataset](https://github.com/carolisteia/parallelium-scriptures-alignment-dataset) |
| 183 | + A multilingual dataset of aligned Biblical and Qur’anic texts (medieval and modern), used for benchmarking multilingual alignment in diverse historical settings. |
| 184 | + |
| 185 | +- [Lancelot par maints langages](https://github.com/carolisteia/lancelot-par-maints-langages) |
| 186 | + A parallel corpus of *Lancelot en prose* in French, Castilian, and Italian. First testbed for Aquilign’s multilingual alignment and stemmatological comparison. |
| 187 | + |
| 188 | +- [Multilingual Aegidius](https://github.com/ProMeText/Multilingual_Aegidius) |
| 189 | + A corpus of *De regimine principum* and its translations in Latin, Romance vernaculars, and Middle English. Built using the Aquilign segmentation and alignment workflow. |
| 190 | + |
| 191 | +--- |
| 192 | + |
| 193 | +## 🚧 Project Status & Future Directions |
| 194 | + |
| 195 | +**Aquilign** is under active development and currently supports: |
| 196 | + |
| 197 | +- ✅ Sentence- and clause-level alignment across multiple languages |
| 198 | +- ✅ Integration with BERT-based and regex-based segmenters |
| 199 | +- ✅ Alignment evaluation and output export in tabular format |
| 200 | +- ✅ Compatibility with multilingual historical corpora (e.g. *Lancelot*, *De Regimine Principum*) |
| 201 | + |
| 202 | +--- |
| 203 | + |
| 204 | +### 🔮 Planned Features |
| 205 | + |
| 206 | +- 🧬 **Collation Module**: |
| 207 | + Automatic generation of collation tables across aligned witnesses for textual variant analysis |
| 208 | + |
| 209 | +- 🏛️ **Stemmatic Analysis Integration**: |
| 210 | + Tools for stemmatological inference based on alignment structure and textual divergence |
| 211 | + |
| 212 | +- 📊 **Interactive Visualization Tools**: |
| 213 | + Visualization of alignment, variant graphs, and stemma hypotheses |
| 214 | + |
| 215 | +- 🌐 **Support for Additional Languages**: |
| 216 | + Extending tokenization and alignment capabilities to new premodern languages and scripts |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +If you're interested in contributing to any of these areas or proposing enhancements, see [Contact & Contributions](#-contact--contributions). |
| 221 | + |
| 222 | +--- |
| 223 | + |
| 224 | +## 📫 Contact & Contributions |
| 225 | + |
| 226 | +We welcome questions, feedback, and contributions to improve the Aquilign pipeline. |
| 227 | + |
| 228 | +- 🛠️ Found a bug or have a feature request? |
| 229 | + ➡️ [Open an issue](https://github.com/ProMeText/Aquilign/issues) |
| 230 | + |
| 231 | +- 🔄 Want to contribute code or improvements? |
| 232 | + ➡️ Fork the repo and submit a pull request |
91 | 233 |
|
92 | | -## Licence |
| 234 | +- 🎓 For academic collaboration or project inquiries: |
| 235 | + ➡️ Reach out via [GitHub Discussions](https://github.com/ProMeText/Aquilign/discussions) or contact the authors directly |
93 | 236 |
|
94 | | -This fork is released under the [GNU General Public License v3.0](./LICENCE) |
| 237 | +--- |
| 238 | +## 💰 Funding |
95 | 239 |
|
96 | | -## Funding |
| 240 | +This work benefited from national funding managed by the **Agence Nationale de la Recherche** |
| 241 | +under the *Investissements d'avenir* programme with the reference: |
| 242 | +**ANR-21-ESRE-0005 (Biblissima+)** |
97 | 243 |
|
98 | | -This work benefited́ from national funding managed by the Agence Nationale de la Recherche under the Investissements d'avenir programme with the reference ANR-21-ESRE-0005 (Biblissima+). |
| 244 | +> Ce travail a bénéficié d'une aide de l’État gérée par l’**Agence Nationale de la Recherche** |
| 245 | +> au titre du programme d’**Investissements d’avenir**, référence **ANR-21-ESRE-0005 (Biblissima+)**. |
99 | 246 |
|
100 | | -Ce travail a bénéficié́ d'une aide de l’État gérée par l'Agence Nationale de la Recherche au titre du programme d’Investissements d’avenir portant la référence ANR-21-ESRE-0005 (Biblissima+) |
| 247 | +<p align="center"> |
| 248 | + <img src="https://github.com/user-attachments/assets/915c871f-fbaa-45ea-8334-2bf3dde8252d" alt="Biblissima+ Logo" width="600"/> |
| 249 | +</p> |
101 | 250 |
|
102 | | - |
| 251 | +## ⚖️ License |
103 | 252 |
|
| 253 | +This project is released under the **[GNU General Public License v3.0](./LICENCE)**. |
| 254 | +You are free to use, modify, and redistribute the code under the same license conditions. |