Language Model Adaptation for Low-Resource African Languages 🌍

This repository contains code used in the Language Model Adaptation for Low-Resource African Languages project.

The trained and adapted tokenizers and models are available on the project's HuggingFace page.

Structure of the repository:

evaluation/ - Code used for model evaluation on downstream tasks. Also contains processed results.
modelling/ - Functions for model embedding matrix modifications.
scripts/ - Bash scripts for dataset processing, tokenizer training, model training and model adaptation. The scripts include SGE scheduler flags.
tokenization/ - Functions for tokenizer adaptation.
training/ - Functions for training dataset pre-processing and model training.
fertility_analysis/ - Fertility evaluation results of selected tokenizers.
add_tokens.py - Tokenizer adaptation through token addition.
replace_tokens.py - Tokenizer adaptation through token replacement.
add_embeddings.py - Model embedding matrix modification through embedding addition.
replace_embeddings.py - Model embedding matrix modification through embedding replacement.
fertility_evaluation.py - Script used for tokenizer fertility evaluation on WURA validation sets.
train_model.py - Model training script.
train_wura_tokenizer.py - Script used for training language-dedicated tokenizers using the WURA dataset.
requirements.txt - List of Python packages required by the project, installable with pip.
README.md - This file :)

Download data:

To reproduce the tokenizer fertility results, run the following scripts:

Train language-dedicated tokenizers using scripts/train_wura_tokenizers_opt.qsub.sh.
Run add_tokens.py and replace_tokens.py to produce adapted tokenizers.
Specify paths to desired tokenizers and run fertility_evaluation.py.
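
For orientation, tokenizer fertility is commonly defined as the average number of tokens a tokenizer produces per word, so lower values mean less fragmentation. The sketch below illustrates that definition only. It assumes whitespace-separated words, and both the `fertility` helper and the toy tokenizer are hypothetical names introduced here; the exact computation in fertility_evaluation.py may differ.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_char_pair_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts every word into 2-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


if __name__ == "__main__":
    # Two words are split into three pieces ("ab", "cd", "ef").
    print(fertility(["abcd ef"], toy_char_pair_tokenizer))
```

To use the same helper with a real tokenizer, pass any callable that maps a string to a list of tokens in place of the toy tokenizer.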

To reproduce model adaptation results, run the above and the following:

Run add_embeddings.py and replace_embeddings.py to create models with modified embeddings.
Run all scripts from scripts/dataset_processing to pre-process, tokenize and group training samples.
Train models using scripts from scripts/model_training.
Run scripts/download_evaluation_data_repos.sh to download evaluation datasets.
Evaluate models:
  1. Generate model answers by running scripts in scripts/model_evaluation.
  2. Aggregate and compute metrics using scripts from scripts/evaluation_results_processing.
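
The embedding modification step can be pictured with a small sketch. It assumes one common initialization strategy: the embedding of each new token is the mean of the embeddings of the old-vocabulary pieces it was previously split into. This is an illustrative assumption, not necessarily what add_embeddings.py and replace_embeddings.py do, and the function names below are hypothetical. Plain Python lists stand in for the model's embedding matrix.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def add_embedding_rows(matrix: List[Vector], new_token_pieces: List[List[int]]) -> List[Vector]:
    """Addition: append one row per new token, so the vocabulary grows.

    new_token_pieces[k] holds the old token ids that the k-th new token
    was split into by the original tokenizer (assumed to be known).
    """
    extended = [row[:] for row in matrix]
    for piece_ids in new_token_pieces:
        extended.append(mean_vector([matrix[i] for i in piece_ids]))
    return extended


def replace_embedding_rows(matrix: List[Vector], replacements: Dict[int, List[int]]) -> List[Vector]:
    """Replacement: overwrite existing rows, so the vocabulary size is unchanged.

    replacements maps the id being reused to the old token ids whose
    embeddings are averaged to initialize the new token.
    """
    replaced = [row[:] for row in matrix]
    for target_id, piece_ids in replacements.items():
        replaced[target_id] = mean_vector([matrix[i] for i in piece_ids])
    return replaced


if __name__ == "__main__":
    old = [[0.0, 2.0], [2.0, 4.0], [4.0, 0.0]]
    print(len(add_embedding_rows(old, [[0, 1]])))        # one extra row
    print(len(replace_embedding_rows(old, {2: [0, 1]})))  # same number of rows
```

The difference between the two variants is only where the new vector is written: addition enlarges the matrix, while replacement keeps its shape and reuses the row of a discarded token.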
