This repository contains code used in the Language Model Adaptation for Low-Resource African Languages project.
[📄 Dissertation]
The corresponding trained and adapted tokenizers and models are available on the project's HuggingFace site.
- evaluation/ - Code used for model evaluation on downstream tasks. Also contains processed results.
- modelling/ - Functions for model embedding matrix modifications.
- scripts/ - Bash scripts for dataset processing, tokenizer and model training, and model adaptation. Scripts come with SGE scheduler flags.
- tokenization/ - Functions for tokenizer adaptation.
- training/ - Functions for training dataset pre-processing and model training.
- fertility_analysis/ - Fertility evaluation results of selected tokenizers.
- add_tokens.py - Tokenizer adaptation through token addition.
- replace_tokens.py - Tokenizer adaptation through token replacement.
- add_embeddings.py - Model embedding matrix modification through embedding addition.
- replace_embeddings.py - Model embedding matrix modification through embedding replacement.
- fertility_evaluation.py - Script used for tokenizer fertility evaluation on WURA validation sets.
- train_model.py - Model training script.
- train_wura_tokenizer.py - Script used for training language-dedicated tokenizers using the WURA dataset.
- requirements.txt - List of required Python pip packages.
- README.md - This file :)
Download data:
- Download the WURA dataset and place it in a ./data/wura directory.
To reproduce the tokenizer fertility results, run the following scripts:
- Train language-dedicated tokenizers using scripts/train_wura_tokenizers_opt.qsub.sh.
- Run add_tokens.py and replace_tokens.py to produce adapted tokenizers.
- Specify paths to desired tokenizers and run fertility_evaluation.py.
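
For orientation, tokenizer fertility is commonly measured as the average number of tokens a tokenizer produces per word. The sketch below illustrates that calculation only; it is not the code in fertility_evaluation.py, which may define words or aggregate results differently. The function names `fertility` and `toy_tokenize` are hypothetical, and the toy tokenizer stands in for a real trained or adapted tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Assumption: words are obtained by splitting on whitespace and each
    word is tokenized on its own.
    """
    n_words = 0
    n_tokens = 0
    for text in texts:
        words = text.split()
        n_words += len(words)
        n_tokens += sum(len(tokenize(word)) for word in words)
    if n_words == 0:
        raise ValueError("no words found in the input texts")
    return n_tokens / n_words


def toy_tokenize(word: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer:
    # splits a word into chunks of at most three characters.
    return [word[i:i + 3] for i in range(0, len(word), 3)]


if __name__ == "__main__":
    sample = ["habari ya asubuhi"]
    print(f"fertility = {fertility(sample, toy_tokenize):.2f}")
```

A lower value means the tokenizer splits words into fewer pieces, which is why fertility is a convenient way to compare the original, language-dedicated and adapted tokenizers on the same validation text.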
To reproduce model adaptation results, run the above and the following:
- Run add_embeddings.py and replace_embeddings.py to create models with modified embeddings.
- Run all scripts from scripts/dataset_processing to pre-process, tokenize and group training samples.
- Train models using scripts from scripts/model_training.
- Run scripts/download_evaluation_data_repos.sh to download evaluation datasets.
- Evaluate models:
  - Generate model answers by running scripts in scripts/model_evaluation.
  - Aggregate and compute metrics using scripts from scripts/evaluation_results_processing.
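
To illustrate the first step in the list above, the sketch below shows one way embedding addition can work: the embedding matrix gains one row per newly added token. It is an assumption-laden illustration, not the logic of add_embeddings.py. In particular, initialising each new row to the mean of the existing rows is only one common choice, and the repository's scripts may use a different scheme. The name `add_embeddings` and the plain-list matrix representation are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def add_embeddings(embeddings: Matrix, num_new_tokens: int) -> Matrix:
    """Return a copy of `embeddings` with `num_new_tokens` rows appended.

    Assumption: every new row is initialised to the mean of the
    existing rows; other initialisation schemes are equally possible.
    """
    if not embeddings:
        raise ValueError("embedding matrix must contain at least one row")
    dim = len(embeddings[0])
    # Column-wise mean over all existing token embeddings.
    mean_row = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    extended = [list(row) for row in embeddings]  # leave the input untouched
    extended.extend(list(mean_row) for _ in range(num_new_tokens))
    return extended


if __name__ == "__main__":
    original = [[1.0, 2.0], [3.0, 4.0]]
    print(add_embeddings(original, num_new_tokens=2))
```

Whatever the initialisation, the number of rows must match the vocabulary size of the adapted tokenizer, so the tokenizer and embedding steps need to be run as a pair.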
