📈 Results
Data augmentation, where novel examples are created from existing data, has been an effective strategy in NLP, particularly in low-resource settings. Augmentation typically relies on simple perturbations, such as concatenating, permuting, and replacing substrings. However, these transformations generally fail to preserve linguistic validity, resulting in augmented examples that resemble valid examples but are themselves ungrammatical or unnatural. Other research uses linguistic knowledge to constrain newly-created augmented examples to (hypothetically) grammatical instances.
This study provides a head-to-head comparison of linguistically-naive and linguistically-motivated data augmentation strategies. We conduct a case study on two low-resource languages, Uspanteko and Arapaho, evaluating machine translation and interlinear gloss prediction.
Both IGT datasets are publicly available on Huggingface:
Set up environment:
```bash
python -m venv .venv  # Please use Python >=3.12
source .venv/bin/activate
pip install -r requirements.txt
```
> [!WARNING]
> If you are trying to install this on an Apple Silicon machine, `mlconjug3` won't install its dependencies correctly. You can fix this by running `pip install defusedxml scikit-learn`.
Run experiments:
```bash
source .venv/bin/activate
python src/train.py --direction "transc->transl" --sample_train_size 50 --seed 0
```
We utilize the Mayan language Uspanteko, which has ~10k examples of interlinear glossed text (IGT), a data format that combines a transcription, segmentation, morphological glossing, and translation. For example:
```
\t o sey xtok rixoqiil          # the transcription in Uspanteko
\m o' sea x-tok r-ixóqiil       # the transcription, segmented into morphemes
\p CONJ ADV COM-VT E3S-S        # part-of-speech tags for each morpheme
\g o sea COM-buscar E3S-esposa  # interlinear glosses for each morpheme
\l O sea busca esposa.          # Spanish translation
```
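To make this structure concrete, here is a minimal sketch of reading a raw IGT record into a dictionary. The `parse_igt` helper and the field names are illustrative assumptions, not the repository's actual data-loading code.

```python
# Minimal sketch: parse a raw IGT record (as in the example above) into a
# dictionary. `parse_igt` and the field names are illustrative, not the
# repository's actual data-loading code.
FIELDS = {
    "\\t": "transcription",
    "\\m": "segmentation",
    "\\p": "pos_tags",
    "\\g": "glosses",
    "\\l": "translation",
}

def parse_igt(raw: str) -> dict[str, str]:
    example = {}
    for line in raw.strip().splitlines():
        marker, _, content = line.strip().partition(" ")
        if marker in FIELDS:
            example[FIELDS[marker]] = content.strip()
    return example

record = r"""
\t o sey xtok rixoqiil
\m o' sea x-tok r-ixóqiil
\p CONJ ADV COM-VT E3S-S
\g o sea COM-buscar E3S-esposa
\l O sea busca esposa.
"""
print(parse_igt(record)["glosses"])  # -> "o sea COM-buscar E3S-esposa"
```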
The richness of IGT enables us to evaluate several tasks with the same dataset. Specifically, we use:
| Task | Inputs -> Outputs |
|---|---|
| Gloss generation | transcription -> glosses |
| Translation | transcription -> translation |
| Reverse translation | translation -> transcription |
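As a rough illustration, each task amounts to selecting a different pair of fields from the same IGT record as the model's input and output. The field names below mirror the parsing sketch above; the actual interface in `src/train.py` may differ.

```python
# Sketch: build (input, output) text pairs for each task from a parsed IGT
# record. Field names mirror the parsing sketch above and are illustrative.
TASKS = {
    "gloss_generation": ("transcription", "glosses"),
    "translation": ("transcription", "translation"),
    "reverse_translation": ("translation", "transcription"),
}

def make_pair(example: dict[str, str], task: str) -> tuple[str, str]:
    source_field, target_field = TASKS[task]
    return example[source_field], example[target_field]

example = {
    "transcription": "o sey xtok rixoqiil",
    "glosses": "o sea COM-buscar E3S-esposa",
    "translation": "O sea busca esposa.",
}
print(make_pair(example, "reverse_translation"))
# -> ('O sea busca esposa.', 'o sey xtok rixoqiil')
```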
We consider linguistically-motivated and linguistically-naive strategies for both languages; a code sketch of the naive strategies follows the list below.
- `UPD-TAM`: Updates the aspect marker of the verb
- `INS-CONJ`: Inserts a random conjunction at the start of the sentence
- `INS-NOISE`: Inserts a random word at the start of the sentence
- `DEL`: Randomly deletes a word by index
- `DEL-EXCL`: Randomly deletes a word by index, excluding verbs
- `DUP`: Randomly duplicates a word by index
- `INS-INTJ`: Inserts a random interjection at the start of the sentence
- `INS-NOISE`: Inserts a random word at the start of the sentence
- `PERM`: Produces up to 10 permutations of the original word order
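As a minimal sketch, three of the linguistically-naive strategies could be implemented on a whitespace-tokenized sentence as below. The helper names are illustrative, not the repository's implementation; in practice the transcription, glosses, and tags must be transformed in parallel to keep the IGT fields aligned.

```python
import itertools
import random

# Sketch of three linguistically-naive strategies on a whitespace-tokenized
# sentence. Helper names are illustrative, not the repository's implementation.

def del_word(tokens: list[str], rng: random.Random) -> list[str]:
    """DEL: randomly delete a word by index."""
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

def dup_word(tokens: list[str], rng: random.Random) -> list[str]:
    """DUP: randomly duplicate a word by index."""
    i = rng.randrange(len(tokens))
    return tokens[:i + 1] + tokens[i:]

def perm_words(tokens: list[str], max_perms: int = 10) -> list[list[str]]:
    """PERM: produce up to `max_perms` permutations of the word order."""
    perms = itertools.permutations(tokens)
    return [list(p) for p in itertools.islice(perms, max_perms)]

rng = random.Random(0)
tokens = "o sey xtok rixoqiil".split()
print(" ".join(del_word(tokens, rng)))  # the sentence with one word removed
print(" ".join(dup_word(tokens, rng)))  # the sentence with one word doubled
print(len(perm_words(tokens)))          # -> 10 (4! = 24 orderings, capped at 10)
```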
As augmentation has been shown to primarily benefit low-resource settings, we evaluate over several training set sizes. In each case, we sample a number of training examples, create augmented examples using only that sample, and evaluate on the same evaluation set.
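A compact sketch of this protocol is below; the callables and the sizes shown are placeholders standing in for the repository's actual training and evaluation code, not its real interface.

```python
import random

# Sketch of the evaluation protocol: for each training-set size, sample that
# many examples, build augmented examples only from that sample, and score
# against the same fixed evaluation set.

def run_sweep(train_data, eval_data, augment, train_model, evaluate,
              sizes=(50, 100, 500), seed=0):
    results = {}
    for size in sizes:
        rng = random.Random(seed)
        sampled = rng.sample(train_data, size)                       # subsample training data
        augmented = sampled + [augment(ex, rng) for ex in sampled]   # augment only the sample
        model = train_model(augmented)
        results[size] = evaluate(model, eval_data)                   # evaluation set never changes
    return results
```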