This is the code base for CharLOTTE, a system that leverages character correspondences between related languages in low-resource NMT.
CharLOTTE stands for Character-Level Orthographic Transfer for Token Embeddings.
The CharLOTTE system assumes that the phenomenon of systematic sound correspondence in linguistics is reflected in character correspondences in orthography. For example, the j-lh and h-f correspondences between Spanish and Portuguese can be seen in word pairs such as:
- ojo, olho
- ajo, alho
- hierro, ferro
- horno, forno
- hijo, filho
CharLOTTE learns these character correspondences with what we call SC models and trains tokenizers and NMT models that exploit them so as to increase vocabulary overlap between related high- and low-resource languages. CharLOTTE takes a language-agnostic approach, requiring only the NMT parallel training, validation, and test data, though additional language-specific sets of known cognates can also be provided.
From the root directory, run the following:
cd CopperMT
git clone https://github.com/clefourrier/CopperMT.git
cd ../CopperMTfiles
python move_files.py
The code for running the main experiments is in the Pipeline directory. Skip to the documentation for each of the main Pipeline scripts as needed:
- Pipeline/train_SC.sh - For training (and scoring) an SC model.
- Pipeline/pred_SC.sh - For running inference with an SC model.
- Pipeline/train_srctgt_tokenizer.sh - For training an NMT tokenizer.
SC Configs are the backbone of the pipeline. They are used both by Pipeline/train_SC.sh and Pipeline/pred_SC.sh. An overview of SC Configs is given first, followed by documentation of the main Pipeline scripts. Pipeline/train_srctgt_tokenizer.sh utilizes its own config, which will be described in its own section of this documentation.
See Pipeline/cfg/SC for the .cfg files for all 10 scenarios of these experiments. They contain the following parameters. If one of these parameters is not relevant to your setup (most should be used), set it to null. An illustrative config sketch is given after this list.
- MODULE_HOME_DIR: the system path to the code folder of this module, depending on where you cloned it on your system, e.g. ~/path/to/Cognate/code
- NMT_SRC: source language in the low-resource (LR) direction we ultimately want to translate. Used by make_nmt_configs.py to make NMT config .yaml files. Not used by train_SC.sh or pred_SC.sh or tokenizer training scripts.
- NMT_TGT: target language in the low-resource (LR) direction we ultimately want to translate. Used by make_nmt_configs.py to make NMT config .yaml files. Not used by train_SC.sh or pred_SC.sh or tokenizer training scripts.
- AUG_SRC: source language of the high-resource (HR) direction we want to leverage. Should be a high-resource (HR) language closely related to NMT_SRC. Used by make_nmt_configs.py to make NMT config .yaml files. Not used by train_SC.sh or pred_SC.sh or tokenizer training scripts.
- AUG_TGT: target language of the high-resource direction we want to leverage. Should be THE SAME AS NMT_TGT. Used by make_nmt_configs.py to make NMT config .yaml files. Not used by train_SC.sh or pred_SC.sh or tokenizer training scripts.
- SRC: the source language of the cognate prediction model. This should be the same as AUG_SRC. The goal is to use the resulting cognate prediction model to make AUG_SRC look more like NMT_SRC based on character correspondences.
- TGT: the target language of the cognate prediction model. This should be the same as NMT_SRC. The goal is to use the resulting cognate prediction model to make AUG_SRC look more like NMT_SRC based on character correspondences.
- SEED: a random seed used in different scripts, such as for randomizing data order
- PARALLEL_(TRAIN|VAL|TEST): Parallel train / val / test data .csv files. These are the parallel data used to train NMT models, and from which cognates will be extracted to train the cognate prediction model.
- APPLY_TO: list (comma-delimited, no space) of more data .csv files to apply the cognate prediction model to. Not used by train_SC.sh but by pred_SC.sh.
- NO_GROUPING: Keep this set to True. It is used when extracting the cognate list from the Fast Align results: if False, then "grouping" is applied (see the discussion of make_word_alignments.py below), which has not been experimented with.
- SC_MODEL_TYPE: 'RNN' or 'SMT'. Determines what kind of model will be trained to predict cognates.
- COGNATE_TRAIN: Directory where Fast Align results and cognate word lists are written. The final training data, however, is created in COPPERMT_DATA_DIR (a known inefficiency: the data is copied in multiple places).
- COGNATE_THRESH: the normalized edit distance threshold used to determine cognates. Parallel translation data is given to FastAlign, which creates word pairs. Word pairs whose normalized edit distance is less than or equal to COGNATE_THRESH are considered cognates.
- COPPERMT_DATA_DIR: Directory where the cognate training data, model checkpoints, and predictions for each scenario will be saved. Each scenario will have its own subdirectory in this directory called {SRC}_{TGT}_{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}_S-{SEED}, e.g., fr_mfe_RNN-0_S-0.
- COPPERMT_DIR: The directory where the CopperMT repo was cloned, e.g, /home/hatch5o6/Cognate/code/CopperMT/CopperMT.
- PARAMETERS_DIR: A folder to save the CopperMT parameters files
- RNN_HYPERPARAMS: A folder containing RNN hyperparameter files (each containing a hyperparameter set) and a manifest.json file mapping an id to each hyperparameter set (file) (RNNs only).
- RNN_HYPERPARAMS_ID: The RNN hyperparameter set (see RNN_HYPERPARAMS) to use to train an RNN model (RNNs only).
- BEAM: The number of beams used in beam-search decoding (RNNs only).
- NBEST: The number of hypotheses to generate. This should just be 1 (Not sure why it's even parameterized). (RNNs only).
- REVERSE_SRC_TGT_COGNATES: If true, the data passed into FastAlign will be prepared in the format {target language sentence} ||| {source language sentence}, rather than in the format {source language sentence} ||| {target language sentence}. This results in slightly different data, but likely will not affect results much.
- SC_MODEL_ID: an ID given to the resulting cognate prediction model. This ID is used in other pipelines. Not used by train_SC.sh, but used by pred_SC.sh to label the resulting normalized high-resource (norm HR) file (the file in which every word of the HR file has been replaced with its predicted cognate).
- ADDITIONAL_TRAIN_COGNATES_(SRC|TGT): Parallel cognate files if wanting to add data from other sources, such as CogNet or EtymDB, to the training data. If not using, set to 'null'
- (VAL|TEST)_COGNATES_(SRC|TGT): Set these to the validation/test src/tgt files. If not passed, you should set COGNATE_(TRAIN|VAL|TEST)_RATIO to make train / val / test splits instead. If not using, set to 'null'. Should use either this or COGNATE_(TRAIN|VAL|TEST)_RATIO.
- COGNATE_(TRAIN|VAL|TEST)_RATIO: If not passing (VAL|TEST)_COGNATES_(SRC|TGT), then these are the train / val / test ratios for splitting the cognate data. The three should add to 1. If not using, set to 'null'. Should use either this or (VAL|TEST)_COGNATES_(SRC|TGT).
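To make the parameter list concrete, here is a minimal sketch of what an SC config might look like, assuming the KEY=value style of a bash-sourced .cfg file. Every value below is a placeholder chosen for illustration; refer to the actual files in Pipeline/cfg/SC for the authoritative format and values.

```bash
# Hypothetical SC config sketch -- every value is a placeholder, not a real scenario.
MODULE_HOME_DIR=~/path/to/Cognate/code
NMT_SRC=mfe                     # source language of the low-resource direction we ultimately translate
NMT_TGT=en
AUG_SRC=fr                      # related high-resource language being leveraged
AUG_TGT=en                      # same as NMT_TGT
SRC=fr                          # cognate model source (= AUG_SRC)
TGT=mfe                         # cognate model target (= NMT_SRC)
SEED=0
PARALLEL_TRAIN=/path/to/train.csv
PARALLEL_VAL=/path/to/val.csv
PARALLEL_TEST=/path/to/test.csv
APPLY_TO=null
NO_GROUPING=True
SC_MODEL_TYPE=RNN
COGNATE_TRAIN=/path/to/cognate_train
COGNATE_THRESH=0.5
COPPERMT_DATA_DIR=/path/to/coppermt_data
COPPERMT_DIR=/path/to/CopperMT/CopperMT
PARAMETERS_DIR=/path/to/parameters
RNN_HYPERPARAMS=/path/to/rnn_hyperparams
RNN_HYPERPARAMS_ID=0
BEAM=5
NBEST=1
REVERSE_SRC_TGT_COGNATES=false
SC_MODEL_ID=fr2mfe
ADDITIONAL_TRAIN_COGNATES_SRC=null
ADDITIONAL_TRAIN_COGNATES_TGT=null
VAL_COGNATES_SRC=null
VAL_COGNATES_TGT=null
TEST_COGNATES_SRC=null
TEST_COGNATES_TGT=null
COGNATE_TRAIN_RATIO=0.8
COGNATE_VAL_RATIO=0.1
COGNATE_TEST_RATIO=0.1
```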
This documentation is designed to walk you through the Pipeline/train_SC.sh script. You should read this documentation and the train_SC.sh script together. This documentation will refer to sections of the train_SC.sh code with numbers like 2.2 and 2.3.1.
Pipeline/train_SC.sh trains the character correspondence (SC) models. We call them SC models, where SC stands for "sound correspondence", but what we are actually detecting are character correspondences, since we apply this to orthography rather than phones.
Pipeline/train_SC.sh is run from /Cognate/code, and takes a single positional argument, one of the .cfg config files described above, e.g.:
bash Pipeline/train_SC.sh /home/hatch5o6/Cognate/code/Pipeline/cfg/SC/fr-mfe.cfg
Parallel Data .csv files - .csv files defining the NMT parallel training, validation, and test data are referenced in the .cfg config files and this script. These files MUST contain the header src_lang, tgt_lang, src_path, tgt_path where:
- src_lang is the source language code
- tgt_lang is the target language code
- src_path is the path to the source parallel data text file
- tgt_path is the path to the target parallel data text file
src_path and tgt_path must be parallel to each other, with src_path containing one sentence per line and tgt_path containing the corresponding translations on each line.
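As a concrete illustration of this format, the sketch below writes a tiny parallel data .csv; the language codes and paths are made-up placeholders, not files from the actual experiments.

```bash
# Hypothetical example of a parallel data .csv (codes and paths are placeholders).
cat > /path/to/parallel_train.csv <<'EOF'
src_lang,tgt_lang,src_path,tgt_path
fr,en,/data/fr-en/train.fr,/data/fr-en/train.en
mfe,en,/data/mfe-en/train.mfe,/data/mfe-en/train.en
EOF
```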
It uses these parameters from the SC Config file:
- MODULE_HOME_DIR
- SRC
- TGT
- PARALLEL_TRAIN
- PARALLEL_VAL
- PARALLEL_TEST
- COGNATE_TRAIN
- NO_GROUPING
- SC_MODEL_TYPE
- SEED
- COGNATE_THRESH
- COPPERMT_DATA_DIR
- COPPERMT_DIR
- PARAMETERS_DIR
- RNN_HYPERPARAMS
- RNN_HYPERPARAMS_ID
- BEAM
- NBEST
- REVERSE_SRC_TGT_COGNATES
- ADDITIONAL_TRAIN_COGNATES_SRC
- ADDITIONAL_TRAIN_COGNATES_TGT
- VAL_COGNATES_SRC
- VAL_COGNATES_TGT
- TEST_COGNATES_SRC
- TEST_COGNATES_TGT
- COGNATE_TRAIN_RATIO
- COGNATE_TEST_RATIO
- COGNATE_VAL_RATIO
We add SC_MODEL_TYPE, RNN_HYPERPARAMS_ID, and SEED to the COGNATE_TRAIN directory name. From here on, when COGNATE_TRAIN is mentioned, it will refer to {COGNATE_TRAIN}_{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}_S-{SEED}.
If it exists, COGNATE_TRAIN is destroyed and recreated. The COGNATE_TRAIN directory is where the cognate detection parallel data and results get written and saved. It has two subdirectories:
- cognate: Contains the parallel data from which cognates are extracted. The path to this directory is set to COGNATE_DIR in train_SC.sh. The src and tgt parallel data are saved to files {COGNATE_DIR}/train.{SRC} and {COGNATE_DIR}/train.{TGT}, as explained in 2.2.
- fastalign: This is where the Fast Align results and the final list of cognates extracted from the parallel data in the cognate subdirectory are written. The path to this directory is set to FASTALIGN_DIR in train_SC.sh. This directory is discussed in 2.3.
Again, note that the PARALLEL_TRAIN, PARALLEL_VAL, and PARALLEL_TEST .csv files define the NMT training, validation, and test data -- NOT training data for cognate prediction. We will extract cognates from ALL of the NMT training, validation, and testing data to create the cognate prediction training data.
The name of the Pipeline/make_SC_training_data.py script is a bit of a misnomer. It simply reads the PARALLEL_TRAIN, PARALLEL_VAL, and PARALLEL_TEST .csv files and writes the parallel data to {COGNATE_TRAIN}/cognate/train.{SRC} and {COGNATE_TRAIN}/cognate/train.{TGT}. ONLY parallel data for the src-tgt pair provided through the --src and --tgt command-line arguments is written. Other pairs in the .csvs, if they exist, are ignored. An example invocation is sketched after the argument list below.
Pipeline/make_SC_training_data.py
- --train_csv: Parallel Data .csv file defining the NMT training data.
- --val_csv: Parallel Data .csv file defining the NMT validation data.
- --test_csv: Parallel Data .csv file defining the NMT test data.
- --src: the source language code
- --tgt: the target language code
- --src_out: the file path of the source sentences of the parallel data from which cognates will be extracted. Should be {COGNATE_TRAIN}/cognate/train.{SRC}.
- --tgt_out: the file path of the target sentences of the parallel data from which cognates will be extracted. Should be {COGNATE_TRAIN}/cognate/train.{TGT}.
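A sketch of an invocation using the arguments above, with $SRC, $TGT, and the other variables standing in for values from the SC config; the exact call made by train_SC.sh may differ slightly.

```bash
# Illustrative only; paths come from the SC config.
python Pipeline/make_SC_training_data.py \
    --train_csv $PARALLEL_TRAIN \
    --val_csv $PARALLEL_VAL \
    --test_csv $PARALLEL_TEST \
    --src $SRC \
    --tgt $TGT \
    --src_out $COGNATE_TRAIN/cognate/train.$SRC \
    --tgt_out $COGNATE_TRAIN/cognate/train.$TGT
```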
Now that we have written all of our parallel data to files, we can run it through Fast Align to get word pair alignments.
Here, we create our file paths for our aligned word list files, depending on whether NO_GROUPING is True / False. NO_GROUPING should probably be True. These files are discussed in 2.4.1 and 2.4.2.
We need to format the inputs for fast_align. This is done by the word_alignments/prepare_for_fastalign.py script.
The input files to this script are the output files from Pipeline/make_SC_training_data.py, i.e., {COGNATE_TRAIN}/cognate/train.{SRC} and {COGNATE_TRAIN}/cognate/train.{TGT}.
This script writes the result to {COGNATE_TRAIN}/fastalign/{SRC}-{TGT}.txt, with each sentence pair on its own line in the format {source language sentence} ||| {target language sentence}.
If REVERSE_SRC_TGT_COGNATES is set to true, then the source and target sentences will be flipped: {target language sentence} ||| {source language sentence}. This setting results in slightly different cognate training data, but should not have a significant impact on results. REVERSE_SRC_TGT_COGNATES should probably just be kept set to false. An example invocation is sketched after the argument list below.
word_alignments/prepare_for_fastalign.py
- --src: The path to the source parallel data from which cognates will be extracted. Should be {COGNATE_TRAIN}/cognate/train.{SRC}.
- --tgt: The path to the target parallel data from which cognates will be extracted. Should be {COGNATE_TRAIN}/cognate/train.{TGT}.
- --out: The path to the formatted sentence pairs. Should be {COGNATE_TRAIN}/fastalign/{SRC}-{TGT}.txt.
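An illustrative invocation, again with the config variables standing in for concrete values (the actual call in train_SC.sh may differ):

```bash
# Illustrative only; inputs are the outputs of make_SC_training_data.py.
python word_alignments/prepare_for_fastalign.py \
    --src $COGNATE_TRAIN/cognate/train.$SRC \
    --tgt $COGNATE_TRAIN/cognate/train.$TGT \
    --out $COGNATE_TRAIN/fastalign/$SRC-$TGT.txt
```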
Here we run Fast Align on the parallel sentences to get aligned word pairs. We want the symmetricized alignment, so we have to run a forward and reverse alignment first, that is, we run three Fast Align commands: (1) forward alignment, (2) reverse alignment, (3) retrieving a symmetricized alignment from the forward and reverse alignments (using grow-diag-final-and algorithm).
The forward alignment is saved to {COGNATE_TRAIN}/fastalign/{SRC}-{TGT}.forward.align, the reverse alignment to {COGNATE_TRAIN}/fastalign/{SRC}-{TGT}.reverse.align, and the symmetricized alignment to {COGNATE_TRAIN}/fastalign/{SRC}-{TGT}.sym.align.
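For reference, a typical command sequence for these three steps with the standard fast_align and atools binaries looks like the sketch below; the exact flags used inside train_SC.sh are not asserted here.

```bash
# Forward, reverse, and symmetricized alignments with fast_align / atools.
INPUT=$COGNATE_TRAIN/fastalign/$SRC-$TGT.txt
FWD=$COGNATE_TRAIN/fastalign/$SRC-$TGT.forward.align
REV=$COGNATE_TRAIN/fastalign/$SRC-$TGT.reverse.align

fast_align -i $INPUT -d -o -v     > $FWD
fast_align -i $INPUT -d -o -v -r  > $REV
atools -i $FWD -j $REV -c grow-diag-final-and > $COGNATE_TRAIN/fastalign/$SRC-$TGT.sym.align
```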
We then need to extract the word pairs from the Fast Align results, which is done with either the word_alignments/make_word_alignments_no_grouping.py or the word_alignments/make_word_alignments.py script, depending on whether NO_GROUPING is set to true or false. It should probably be set to true.
In essence, these two scripts read the word-level alignments from the symmetricized Fast Align results ({COGNATE_TRAIN}/fastalign/{SRC}-{TGT}.sym.align) and retrieve the corresponding word pairs.
The make_word_alignments_no_grouping.py version (the one that should probably be used) of the script simply grabs the word pair for each i-j pair in the alignment results where i is the index of a word in a source line and j is the index of a word in the target line.
The make_word_alignments.py script adds grouping logic when there are many-to-one, one-to-many, and many-to-many alignments, essentially creating phrase pairs rather than word pairs, where applicable. We should probably not use this script, for simplicity. Evaluating whether it improves performance is more complexity than I want to add right now.
These scripts write a list of source-target word pairs in the format {source word} ||| {target word}. make_word_alignments_no_grouping.py writes the results to {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}.NG.txt (note the NG), whereas make_word_alignments.py writes to {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}.txt (note the absence of NG). These paths are set in the code of section 2.3.1.
word_alignments/make_word_alignments(_no_grouping).py
- --alignments, -a: The path to the Fast Align symmetricized results. Should be {COGNATE_TRAIN}/fastalign/{SRC}-{TGT}.sym.align.
- --sent_pairs, -s: The path to the sentence pairs. Should be the same as the outputs of word_alignments/prepare_for_fastalign.py and inputs to Fast Align in 2.3.3, i.e., should be {COGNATE_TRAIN}/fastalign/{SRC}-{TGT}.txt
- --out, -o: The output path to the aligned word pairs. Should be {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).txt.
- --VERBOSE (optional): Pass this flag for verbose printouts.
- --START (int, optional): (make_word_alignments.py ONLY) If passed, this slices the list of sentence pairs from which to retrieve aligned words pairs to those starting with the provided START index (includes the START index). (Start index of sentences).
- --STOP (int, optional): If passed, this slices the list of sentence pairs from which to retrieve aligned word pairs to those up to the provided STOP index (excludes the STOP index). (Stop index of sentences).
We now will narrow down the list of aligned word pairs to a list of cognate predictions by filtering the list to those pairs within a normalized edit distance threshold (COGNATE_THRESH).
This is done with word_alignments/make_cognate_list.py, which calculates the normalized Levenshtein distance of each word pair; pairs whose distance is less than or equal to the threshold (default = 0.5) are considered cognates. An example invocation is sketched after the argument list below.
The list of cognate pairs is written in the format {word 1} ||| {word 2} to {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).cognates.{COGNATE_THRESH}.txt. Additionally, parallel files of the source and target language words are written to {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).cognates.{COGNATE_THRESH}.parallel-{SRC}.txt and {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).cognates.{COGNATE_THRESH}.parallel-{TGT}.txt. These paths are set in the code of section 2.3.1.
word_alignments/make_cognate_list.py
- --word_list, -l: The list of word pairs. This should be the output of word_alignments/make_word_alignments(_no_grouping).py, that is, it should be {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).txt.
- --theta, -t (float): The normalized Levenshtein distance threshold. Word pairs with a normalized distance less than or equal to this value will be considered cognates.
- --src: The source language code.
- --tgt: The target language code.
- --out, -o (optional): Path where the final cognate pairs will be written. If not passed, they will be written to {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).cognates.{COGNATE_THRESH}.txt. Parallel source and target cognate files will be written to files of the same path, except ending in .parallel-{SRC}.{file extension} and .parallel-{TGT}.{file extension}.
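A sketch of an invocation with the no-grouping word list; --out is omitted so the default output paths described above are used, and $COGNATE_THRESH stands in for the config value (0.5 by default). The actual call in train_SC.sh may differ.

```bash
# Illustrative only; uses the NG word list and the configured threshold.
python word_alignments/make_cognate_list.py \
    --word_list $COGNATE_TRAIN/fastalign/word_list.$SRC-$TGT.NG.txt \
    --theta $COGNATE_THRESH \
    --src $SRC \
    --tgt $TGT
```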
If REVERSE_SRC_TGT_COGNATES is true, then TGT will be passed for --src and SRC will be passed for --tgt, because we flipped the source and target sentences when running word_alignments/prepare_for_fastalign.py in 2.3.2.
If datasets for cognate prediction validation and testing are not provided in the .cfg config file with VAL_COGNATES_SRC, VAL_COGNATES_TGT, TEST_COGNATES_SRC, TEST_COGNATES_TGT, then the cognate word pairs extracted from the parallel data will be divided into training, validation, and testing sets. The train_SC.sh script checks if this needs to be done by checking if TEST_COGNATES_SRC equals "null".
If TEST_COGNATES_SRC equals "null", then the script Pipeline/split.py is run to make the train, validation, and test splits on the detected cognates. This script writes the split data to files in the pattern {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).cognates.{COGNATE_THRESH}.parallel-({SRC}|{TGT}).(train|test|val)-s={SEED}.txt. In total, there are six files: a source file and a target file for each of the train, validation, and test sets.
These six files are saved to the following variables in train_SC.sh:
- TRAIN_COGNATES_SRC
- TRAIN_COGNATES_TGT
- VAL_COGNATES_SRC - overwriting the value set in the .cfg config file, which should have been "null"
- VAL_COGNATES_TGT - overwriting the value set in the .cfg config file, which should have been "null"
- TEST_COGNATES_SRC - overwriting the value set in the .cfg config file, which should have been "null"
- TEST_COGNATES_TGT - overwriting the value set in the .cfg config file, which should have been "null"
Pipeline/split.py
- --data1: Path to the words in the source language. Should be {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).cognates.{COGNATE_THRESH}.parallel-{SRC}.txt. (WORD_LIST_SRC path set in 2.3.1)
- --data2: Path to the corresponding cognate words in the target language. Should be {COGNATE_TRAIN}/fastalign/word_list.{SRC}-{TGT}(.NG).cognates.{COGNATE_THRESH}.parallel-{TGT}.txt. (WORD_LIST_TGT path set in 2.3.1)
- --train (float): The ratio of cognate pairs to put in the training data. --train + --val + --test must equal 1.
- --val (float): The ratio of cognate pairs to put in the validation data. --train + --val + --test must equal 1.
- --test (float): The ratio of cognate pairs to put in the test data. --train + --val + --test must equal 1.
- --seed (int): The seed for random shuffling.
- --out_dir: Directory where output files are saved. Should be {COGNATE_TRAIN}/fastalign. The output file names will be the same as --data1 and --data2, but in the provided directory and with an amended extension .(train|val|test)-s={SEED}.{original file extension}.
- --UNIQUE_TEST: If this flag is passed, then will reduce the test set so that a given source word only occurs once.
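An illustrative invocation with an 80/10/10 split; the ratios are placeholders (in practice they come from COGNATE_(TRAIN|VAL|TEST)_RATIO), and --UNIQUE_TEST can be appended if the test set should contain each source word only once. The actual call in train_SC.sh may differ.

```bash
# Illustrative only; WORD_LIST_SRC / WORD_LIST_TGT are the parallel cognate files set in 2.3.1.
python Pipeline/split.py \
    --data1 $WORD_LIST_SRC \
    --data2 $WORD_LIST_TGT \
    --train 0.8 --val 0.1 --test 0.1 \
    --seed $SEED \
    --out_dir $COGNATE_TRAIN/fastalign
```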
If dataset splits do not need to be created, meaning TEST_COGNATES_SRC is not "null", then all of VAL_COGNATES_SRC, VAL_COGNATES_TGT, TEST_COGNATES_SRC, and TEST_COGNATES_TGT should be set (not "null") in the .cfg config file to files containing known cognates, such as from CogNet and/or EtymDB. In this case, these files will be used for validation and testing, and TRAIN_COGNATES_SRC and TRAIN_COGNATES_TGT will be set to files containing all of the cognate pairs detected from the parallel NMT data.
3.1.2 Include ADDITIONAL_TRAIN_COGNATES_SRC and ADDITIONAL_TRAIN_COGNATES_TGT in train set file paths
Files containing known cognate pairs, such as from CogNet and EtymDB, can also be set to ADDITIONAL_TRAIN_COGNATES_SRC and ADDITIONAL_TRAIN_COGNATES_TGT. If so, these will be appended to TRAIN_COGNATES_SRC and TRAIN_COGNATES_TGT as a comma-delimited list.
The comma-delimited lists of files in TRAIN_COGNATES_SRC, TRAIN_COGNATES_TGT, VAL_COGNATES_SRC, VAL_COGNATES_TGT, TEST_COGNATES_SRC, TEST_COGNATES_TGT are printed.
The directory structure for the CopperMT scenario is created. This structure will contain the model, training data, outputs, etc. If the parent directory of this structure already exists, it will be deleted then recreated.
The parent directory should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}_S-{SEED}.
3.2.2 Copy the RNN hyperparams set file, corresponding to RNN_HYPERPARAMS_ID, to its place in the COPPERMT directory structure
The RNN hyperparameters file corresponding to RNN_HYPERPARAMS_ID is copied to its place in the CopperMT scenario directory structure.
Pipeline/copy_rnn_hyperparams.py
- --rnn_hyperparam_id, -i: The ID of the desired RNN hyperparam set.
- --rnn_hyperparams_dir, -d: Folder containing the RNN hyperparam set files. Should be RNN_HYPERPARAMS.
- --copy_to_path, -c: The path the RNN hyperparams set file will be copied to. Should be copied to the appropriate place inside the CopperMT scenario directory structure: {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/parameters/bilingual_default/default_parameters_rnn{SRC}-{TGT}.txt.
The cognate pair data needs to be formatted for the CopperMT module. This is done with CopperMT/format_data.py, which is run three times: once each for the training, validation, and test sets. This script takes the parallel cognate files and writes the cognate pairs in the CopperMT format to files in the CopperMT scenario directory structure. Specifically, they will be written to the folder {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}. Parallel cognate files for training are called train_{SRC}_{TGT}.{SRC} and train_{SRC}_{TGT}.{TGT}; for validation, fine_tune_{SRC}_{TGT}.{SRC} and fine_tune_{SRC}_{TGT}.{TGT}; and for testing, test_{SRC}_{TGT}.{SRC} and test_{SRC}_{TGT}.{TGT}. (NOTE: the fine_tune prefix was established by the CopperMT module, but it actually refers to the validation data.) The CopperMT/format_data.py script will also shuffle each dataset and make sure it (internally) has only unique source-target cognate pairs.
CopperMT/format_data.py
- --src_data (str): comma-delimited list of parallel source cognate files. Should be the variable TRAIN/VAL/TEST_COGNATES_SRC.
- --tgt_data (str): comma-delimited list of parallel target cognate files, corresponding to those passed to --src_data. Should be the variable TRAIN/VAL/TEST_COGNATES_TGT.
- --src (str): Source language code.
- --tgt (str): Target language code.
- --out_dir (str): The directory the formatted output files will be written to. Note that the files will be written to a subdirectory of this directory corresponding to the seed (see --seed below). This should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}, and hence, the files will be written to {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}.
- --prefix (str): Must be "train", "fine_tune", or "test", depending on if it's the training, validation, or test set (use "fine_tune" for validation set).
- --seed (int): The seed to use for random shuffling of the data. Will also be the name of the subdirectory the output files will be in.
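An illustrative sketch of the three runs (train, fine_tune, test), where FORMAT_OUT_DIR is a stand-in for the --out_dir path described above; the actual calls in train_SC.sh may differ.

```bash
# Illustrative only; FORMAT_OUT_DIR stands in for the split_data/{SRC}{TGT} folder described above.
python CopperMT/format_data.py --src_data $TRAIN_COGNATES_SRC --tgt_data $TRAIN_COGNATES_TGT \
    --src $SRC --tgt $TGT --out_dir $FORMAT_OUT_DIR --prefix train --seed $SEED

python CopperMT/format_data.py --src_data $VAL_COGNATES_SRC --tgt_data $VAL_COGNATES_TGT \
    --src $SRC --tgt $TGT --out_dir $FORMAT_OUT_DIR --prefix fine_tune --seed $SEED

python CopperMT/format_data.py --src_data $TEST_COGNATES_SRC --tgt_data $TEST_COGNATES_TGT \
    --src $SRC --tgt $TGT --out_dir $FORMAT_OUT_DIR --prefix test --seed $SEED
```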
3.2.4 Assert there is no overlap of src and tgt segments (words) between the cognate prediction train / dev / test data
Here we just make sure there are no source or target words overlapping between the cognate prediction train, dev, and test datasets. More than ensuring there are no overlapping pairs, this ensures there are no overlapping source words or overlapping target words at all.
The CopperMT/assert_no_overlap_in_formatted_data.py script is run twice to do this. The first time (without the --TEST_ONLY flag), it removes any existing overlap between the train, dev, and test sets. It does this by first checking if any source words in the train set exist on the source side of either the dev or test set, and removes the corresponding pairs. It then does the same for target words, checking if any exist on the target side of the dev or test set, and removing corresponding pairs. This process is repeated for the dev set, though now only checking against the test set.
On the second run, it simply verifies, for good measure, that there are no overlapping source or target words across the train, dev, and test sets.
CopperMT/assert_no_overlap_in_formatted_data.py
- --format_out_dir: The directory the formatted data is written to. Should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}.
- --src Source language code.
- --tgt Target language code.
- --TEST_ONLY If this flag is passed, it will ONLY check that there is no overlap between the train, fine_tune (validation), and test sets. If it is not passed, then the script will remove any existing overlap.
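A sketch of the two passes, where FORMAT_SEED_DIR is a stand-in for the split_data/{SRC}{TGT}/{SEED} folder given above: the first pass removes overlap, the second only verifies.

```bash
# Pass 1: remove any overlapping source/target words between train, fine_tune (val), and test.
python CopperMT/assert_no_overlap_in_formatted_data.py \
    --format_out_dir $FORMAT_SEED_DIR --src $SRC --tgt $TGT

# Pass 2: verify that no overlap remains.
python CopperMT/assert_no_overlap_in_formatted_data.py \
    --format_out_dir $FORMAT_SEED_DIR --src $SRC --tgt $TGT --TEST_ONLY
```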
First, a log .json file is chosen, depending on whether NO_GROUPING is true or false (it should be true).
Then the sizes of the train, val, and test sets (for the corresponding language) are logged to the log file. This log file maintains a history: see the "latest" key for the latest logged sizes, and "history" for the history of size changes. A corresponding .csv file (same path as the .json log file, but with a .csv extension) is also written, which just shows the latest sizes.
Pipeline/cognate_dataset_log.py
- --formatted_data_dir, -f: The formatted data directory. Should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}.
- --lang_pair, -l: The source-target language pair, formatted "{source lang}-{target lang}". Should be {SRC}-{TGT}.
- --LOG_F, -L: the path of the .json log to be updated. Will be either cognate_dataset_log_NG=True.json or cognate_dataset_log_NG=False.json, depending on the value of NO_GROUPING (which should be true). A corresponding .csv file will also be written.
The parameters file required by the CopperMT module needs to be written, which is performed by Pipeline/write_scripts.py.
Pipeline/write_scripts.py
- --src: Source language code.
- --tgt: Target language code
- --coppermt_data_dir: Parent folder containing the training data, models, and outputs of each cognate prediction scenario. Should be COPPERMT_DATA_DIR.
- --sc_model_type: The SC model type, either "RNN" or "SMT". Should be SC_MODEL_TYPE.
- --rnn_hyperparams_id: The id corresponding to the desired RNN hyperparams set. Should be RNN_HYPERPARAMS_ID.
- --seed: Should be SEED.
- --parameters, -p: The path the CopperMT parameters file will be written to. Should be {PARAMETERS_DIR}/parameters.{SRC}-{TGT}_{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}_S-{SEED}.cfg
The SC model is now trained. This is done by calling scripts in the CopperMT module.
Training an RNN model:
If training an RNN model, {COPPERMT_DIR}/pipeline/main_nmt_bilingual_full_brendan.sh script is called, passing in the parameters file created in 3.2.6 (should be {PARAMETERS_DIR}/parameters.{SRC}-{TGT}_{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}_S-{SEED}.cfg) and the SEED.
After training, the best RNN checkpoint is selected, using Pipeline/select_checkpoint.py. This selects the best performing checkpoint, based on BLEU score calculated by CopperMT, from those in a directory that contains checkpoints and outputs. This directory is set to variable WORKSPACE_SEED_DIR, which should be COPPERMT_DATA_DIR/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/workspace/reference_models/bilingual/rnn{SRC}-{TGT}/{SEED}. Pipeline/select_checkpoint.py will save the best checkpoint to {WORKSPACE_SEED_DIR}/checkpoints/selected.pt. All other checkpoints will be deleted to conserve storage space.
Training an SMT model:
If training an SMT model, {COPPERMT_DIR}/pipeline/main_smt_full_brendan.sh is run, passing the same parameters file from 3.2.6 and SEED.
A couple of directories, if pre-existing, are deleted. (#TODO: document what these are. They are not referenced elsewhere in train_SC.sh or pred_SC.sh, and these paths no longer appear to be used for anything.)
To calculate scores, inference is first run on the test set.
Inference with an RNN model
To run inference with an RNN model, {COPPERMT_DIR}/pipeline/main_nmt_bilingual_full_brendan_PREDICT.sh is called, passing the CopperMT parameters file from 3.2.6 (PARAMETERS_F), the path to the selected RNN checkpoint from 3.2.7 (SELECTED_RNN_CHECKPOINT), SEED, an indicator "test", NBEST, and BEAM. This script will save its results to a file whose path is saved to the variable HYP_OUT_TXT. This path should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/workspace/reference_models/bilingual/rnn{SRC}-{TGT}/{SEED}/results/test_selected_checkpoint_{SRC}_{TGT}.{TGT}/generate-test.txt.
The model hypotheses need to be extracted from the HYP_OUT_TXT file, which is done with the NMT/hr_CopperMT.py script. This script has three modes: "prepare", "retrieve", and "get_test_results". Modes "prepare" and "retrieve" will be discussed later in connection with pred_SC.sh below. To extract the hypotheses from the model test results file, we use mode "get_test_results". Only the parameters relevant to this mode are shown here. This mode will write the hypotheses to a file parallel to the source file, where each line is simply the cognate hypothesis for the corresponding source word.
NMT/hr_CopperMT.py (get_test_results)
- --function, -F: The script mode. In this case, it should be "get_test_results".
- --test_src: The test source sentences. Should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}/test_{SRC}_{TGT}.{SRC} (saved to variable SRC_TEXT).
- --data: The model results, written by main_nmt_bilingual_full_brendan_PREDICT.sh. The path is saved to HYP_OUT_TXT.
- --out: The path to save the hypotheses extracted from the model results file. Should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/workspace/reference_models/bilingual/rnn{SRC}-{TGT}/{SEED}/results/test_selected_checkpoint_{SRC}_{TGT}.{TGT}/generate-test.hyp.txt (saved to TEST_OUT_F).
The path to write the scores for an RNN model (set to variable SCORES_OUT_F) is then set to {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/workspace/reference_models/bilingual/rnn{SRC}-{TGT}/{SEED}/results/test_selected_checkpoint_{SRC}_{TGT}.{TGT}/generate-test.hyp.scores.txt. This path will be used in 4.3.
Inference with an SMT model
To run inference with an SMT model, {COPPERMT_DIR}/pipeline/main_smt_full_brendan_PREDICT.sh is run, passing in the Copper MT parameters file from 3.2.6 (PARAMETERS_F), the file path of the source sentences (SRC_TEXT), a template for the outputs (HYP_OUT), and SEED. The hypotheses will be written to {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}/test_{SRC}_{TGT}.{TGT}.hyp.txt (saved to variable TEST_OUT_F).
The path to write the scores for an SMT model (set to variable SCORES_OUT_F) is then set to {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}/test_{SRC}_{TGT}.{TGT}.hyp.scores.txt. This path will be used in 4.3.
Finally, the results are evaluated using NMT/evaluate.py which will calculate a character-level BLEU score (actually just regular BLEU, but since characters in the output are separated by spaces, it amounts to character-level BLEU), and chrF.
NMT/evaluate.py
- --ref: The path to the reference translations. Should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}/test_{SRC}_{TGT}.{TGT}
- --hyp: The path to the model hypotheses, saved to TEST_OUT_F, set in 4.2.
- --out: The file path to write the scores to, which is SCORES_OUT_F, set in 4.2.
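An illustrative invocation, where REF_F is a stand-in for the reference path described under --ref, and TEST_OUT_F and SCORES_OUT_F are the variables set in 4.2:

```bash
# Illustrative only; evaluates the extracted hypotheses against the test references.
python NMT/evaluate.py \
    --ref $REF_F \
    --hyp $TEST_OUT_F \
    --out $SCORES_OUT_F
```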
This documentation is designed to walk you through the Pipeline/pred_SC.sh script. You should read this documentation and the pred_SC.sh script together. This documentation will refer to sections of the pred_SC.sh code with numbers like 2.2 and 2.3.
Pipeline/pred_SC.sh runs inference of an SC model. It is run from /Cognate/code, and takes a single positional argument, one of the .cfg config files described above, e.g.:
bash Pipeline/pred_SC.sh /home/hatch5o6/Cognate/code/Pipeline/cfg/SC/fr-mfe.cfg
For each parallel data .csv file in PARALLEL_(TRAIN|VAL|TEST) and APPLY_TO, this script looks for any source or target sentence file in the .csv for the language SRC -- that is, even a target-side file in the .csv is included if its language is SRC -- and applies the SC model to each word of each sentence, then saves the result to a new file. We then have data in the SRC language that is made more similar to the TGT language based on learned character correspondences.
It uses these parameters from the SC Config file:
- MODULE_HOME_DIR
- SRC
- TGT
- PARALLEL_TRAIN
- PARALLEL_VAL
- PARALLEL_TEST
- APPLY_TO
- SC_MODEL_TYPE
- SEED
- SC_MODEL_ID
- COPPERMT_DATA_DIR
- COPPERMT_DIR
- PARAMETERS_DIR
- RNN_HYPERPARAMS_ID
- BEAM
- NBEST
We simply append the SC_MODEL_TYPE ('SMT' or 'RNN') and the RNN_HYPERPARAMS_ID to the end of SC_MODEL_ID, just so we can track precisely which version of a model was used to apply character correspondences later on.
First the CopperMT parameters file is written. This step is just like step 3.2.6 under Pipeline/train_SC.sh.
The parameters file required by the CopperMT module needs to be written, which is performed by Pipeline/write_scripts.py.
Pipeline/write_scripts.py
- --src: Source language code.
- --tgt: Target language code
- --coppermt_data_dir: Parent folder containing the training data, models, and outputs of each cognate prediction scenario. Should be COPPERMT_DATA_DIR.
- --sc_model_type: The SC model type, either "RNN" or "SMT". Should be SC_MODEL_TYPE.
- --rnn_hyperparams_id: The id corresponding to the desired RNN hyperparams set. Should be RNN_HYPERPARAMS_ID.
- --seed: Should be SEED.
- --parameters, -p: The path the CopperMT parameters file will be written to. Should be {PARAMETERS_DIR}/parameters.{SRC}-{TGT}_{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}_S-{SEED}.cfg
If running an RNN model, we need to retrieve the path to the best model. This should be COPPERMT_DATA_DIR/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/workspace/reference_models/bilingual/rnn{SRC}-{TGT}/{SEED}/checkpoints/selected.pt, which is set to the variable SELECTED_RNN_CHECKPOINT.
This is the same as 4.1 under Pipeline/train_SC.sh.
A couple of directories, if pre-existing, are deleted. (#TODO: document what these are. They are not referenced elsewhere in train_SC.sh or pred_SC.sh, and these paths no longer appear to be used for anything.)
The SC model is then applied to text files corresponding to the language SRC, making the data look more like language TGT based on character correspondences.
To do this, words from the text files need to be prepared for the SC model. This is done with the NMT/hr_CopperMT.py script run in the "prepare" mode on each .csv file in PARALLEL_(TRAIN|VAL|TEST) and in the comma-delimited list APPLY_TO. In each .csv are the source and target parallel text files. The script grabs all text files corresponding to the language SRC, regardless of whether they are set as a source or target file in the .csv, and from them compiles a list of unique words. The arguments applicable to the "prepare" mode are described here.
NMT/hr_CopperMT.py (prepare)
- --function, -F: To run in "prepare" mode, set this to "prepare". "prepare" is also the default value, so if this parameter is not specified, it will run in "prepare" mode.
- --data: The path to a parallel data .csv file.
- --out: The directory to which the list of unique words, extracted from the text files listed in the --data .csv file, will be written.
- --hr_lang, -hr: The high-resource language. Should be SRC.
- --lr_lang, -lr: The low-resource language. Should be TGT.
- --training_data: The folder where the SC model training data was written. Should be {COPPERMT_DATA_DIR}/{SRC}{TGT}{SC_MODEL_TYPE}-{RNN_HYPERPARAMS_ID}S-{SEED}/inputs/split_data/{SRC}{TGT}/{SEED}.
- --limit_lang: The language whose text files we want to grab from the .csv. Should be SRC.
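A sketch of a "prepare" run on one parallel data .csv, where PREPARED_WORDS_DIR and SC_TRAINING_DATA_DIR are made-up stand-ins for the --out directory and the split_data/{SRC}{TGT}/{SEED} folder described above; pred_SC.sh loops over every .csv in PARALLEL_(TRAIN|VAL|TEST) and APPLY_TO in this way.

```bash
# Illustrative only; collects the unique SRC-language words from the files listed in the .csv.
python NMT/hr_CopperMT.py \
    --function prepare \
    --data $PARALLEL_TRAIN \
    --out $PREPARED_WORDS_DIR \
    --hr_lang $SRC \
    --lr_lang $TGT \
    --training_data $SC_TRAINING_DATA_DIR \
    --limit_lang $SRC
```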
Afterwards, we can run inference.
Inference with an RNN model
If inferring with an RNN model, we run {COPPERMT_DIR}/pipeline/main_nmt_bilingual_full_brendan_PREDICT.sh, passing in the CopperMT parameters file (PARAMETERS_F) from 2.1, the path to the best checkpoint (SELECTED_RNN_CHECKPOINT) from 2.2, SEED, the tag "inference", NBEST, and BEAM.
This will predict the cognates for each of the words in the list created by hr_CopperMT.py (prepare). Its output will be saved to {COPPERMT_DATA_DIR}/{SRC}{TGT}RNN-{RNN_HYPERPARAMS_ID}S-{SEED}/workspace/reference_models/bilingual/rnn{SRC}-{TGT}/{SEED}/results/inference_selected_checkpoint_{SRC}_{TGT}.{TGT}/generate-test.txt, which is saved to the variable COPPERMT_RESULTS. Then we run NMT/hr_CopperMT.py in "retrieve" mode, which, for each word in the high-resource text files, replaces it with its predicted cognate.
NMT/hr_CopperMT.py (retrieve), for RNN
- --function: Set to 'retrieve'.
- --data: The path to the parallel data .csv file, listing the text files where we want to replace each word with its predicted cognate. Should be the same file passed for --data in "prepare" mode.
- --CopperMT_results: The output from the RNN model. Should be path saved to COPPERMT_RESULTS described above.
- --hr_lang, -hr: The high-resource language. Should be SRC. For each parallel text file (in the .csv passed as --data) corresponding to this language, each word in the file will be replaced with its predicted cognate.
- --lr_lang, -lr: The low-resource language. Should be TGT.
- --MODEL_ID: Set this to SC_MODEL_ID. A copy of the SRC text files will be saved to the original path but with the string "SC_{SC_MODEL_ID}_{SRC}2{TGT}" inserted into the file name just before the file extension, indicating that it is the version of the data where words have been replaced with their predicted cognates. For example, if predicting cognates for the text file "source.txt", the results will be saved to "source.SC_{SC_MODEL_ID}_{SRC}2{TGT}.txt". This file is the final result of the cognate prediction, where each word has been replaced with a predicted cognate.
Inference with an SMT model
If inferring with an SMT model, we run {COPPERMT_DIR}/pipeline/main_smt_full_brendan_PREDICT.sh, passing in the CopperMT parameters file (PARAMETERS_F) from 2.1, the file path of the source sentences (TEXT), a template for the outputs (HYP_OUT), and SEED. This functions similarly to inference with the RNN model, predicting cognates for each word in the list created by hr_CopperMT.py (prepare). The outputs are written to {COPPERMT_DATA_DIR}/{SRC}{TGT}SMT-null_S-{SEED}/inputs/split_data/{SRC}{TGT}/inference/test_{SRC}_{TGT}.{TGT}.hyp.txt, which is saved to the variable HYP_OUT_F. We then run NMT/hr_CopperMT.py in "retrieve" mode, which, for each word in the high-resource text files, replaces it with its predicted cognate.
NMT/hr_CopperMT.py (retrieve), for SMT
The parameters passed in "retrieve" mode for SMT model results are exactly the same as those for RNN model results, EXCEPT that instead of --CopperMT_results, we use the parameter --CopperMT_SMT_results, which is set to the output of the SMT model, saved to the variable HYP_OUT_F described above.
We are then done! Hurray! :D
This is documentation for the Pipeline/train_srctgt_tokenizer.sh script, which is used to train a SentencePiece (https://github.com/google/sentencepiece) tokenizer. This script requires a Tok Config .cfg file and trains a single tokenizer for all provided source and target languages. The Tok Config file contains the following fields:
See the config files in Pipeline/cfg/tok_NO for examples; an illustrative sketch is also given after the field list below.
- SPM_TRAIN_SIZE: This is the total number of lines of data to use to train the tokenizer. Provided data will be down- / upsampled to this number.
- SRC_LANGS: A comma-delimited list of source language codes (no spaces). E.g. 'en', 'en,fr', 'en,fr,it'.
- SRC_TOK_NAME: A source name for the tokenizer. I like to use a hyphen-delimited list of the source langs. E.g. 'en-fr'. The tokenizer name will be {SRC_TOK_NAME}_{TGT_TOK_NAME}.
- TGT_LANGS: A comma-delimited list of target language codes (no spaces). E.g. 'es', 'es,pt', 'es,pt,en'.
- TGT_TOK_NAME: A target name for the tokenizer. I like to use a hyphen-delimited list of the target langs. The tokenizer name will be {SRC_TOK_NAME}_{TGT_TOK_NAME}.
- DIST: A string representing what percentage of the training data is to be assigned to each source/target language. Percentages are assigned in the format "{language code}:{percentage}" in a comma-delimited list (no spaces). For example, given "bn:25,as:25,hi:50", the bn language data will be down- / upsampled until it is 25% of the SPM_TRAIN_SIZE, as until it is 25%, and hi until it is 50%. The percentages in the string must add up to 100.
- (TRAIN|VAL|TEST)_PARALLEL: These are comma-delimited lists pointing to parallel data .csv files (the same files used for training cognate prediction models). The parallel data in these files will be used as tokenizer training data by this script. Note that there is no real functional distinction here between TRAIN_PARALLEL, VAL_PARALLEL, and TEST_PARALLEL, as all files will be used to gather training data; the distinction is just for organizational purposes.
- TOK_TRAIN_DATA_DIR: The folder where the tokenizer training data and models will be written.
- SC_MODEL_ID: If relevant (e.g., if including SC parallel data .csv files, such as from NMT/data/SC), this is the SC_MODEL_ID of the cognate prediction model that was used to alter the high-resource data. It is used to read the right versions of the parallel text files. If not relevant, set this to "null".
- VOCAB_SIZE: The vocabulary size of the model.
- SPLIT_ON_WS: If "true", then a whitespace token "_" will be added to the SentencePiece model to compel segmentation on whitespace. If "false", then no whitespace token is created.
- INCLUDE_LANG_TOKS: If "true", special language tokens for the provided SRC_LANGS and TGT_LANGS will be added to the model.
- INCLUDE_PAD_TOK: If "true", a padding token will be included in the tokenizer.
- SPECIAL_TOKS: A comma-delimited list of other special tokens you want to add to the tokenizer (no spaces between elements). Set to "null" to not pass in any additional special tokens.
- IS_ATT: If the tokenizer will be used in an experiment where sound correspondences are applied to the target language to create parallel pretraining data (such as the en2djk-djk_en or hi2bho-bho_hi scenarios), set this to "true". Otherwise, set it to "false".
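Here is a minimal sketch of what a Tok Config might look like, again assuming the KEY=value style of a bash-sourced .cfg file; all values are placeholders, so check the real files in Pipeline/cfg/tok_NO for the authoritative format.

```bash
# Hypothetical Tok Config sketch -- every value is a placeholder.
SPM_TRAIN_SIZE=1000000
SRC_LANGS=fr,mfe
SRC_TOK_NAME=fr-mfe
TGT_LANGS=en
TGT_TOK_NAME=en
DIST=fr:25,mfe:25,en:50
TRAIN_PARALLEL=/path/to/train.csv
VAL_PARALLEL=/path/to/val.csv
TEST_PARALLEL=/path/to/test.csv
TOK_TRAIN_DATA_DIR=/path/to/tok_train_data
SC_MODEL_ID=null
VOCAB_SIZE=8000
SPLIT_ON_WS=true
INCLUDE_LANG_TOKS=true
INCLUDE_PAD_TOK=true
SPECIAL_TOKS=null
IS_ATT=false
```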
The script will read all the parallel data .csv files provided and extract the data corresponding to the provided SRC_LANGS and TGT_LANGS. The distinction between TRAIN, VAL and TEST parallel data does not matter, except for organizational purposes. All of it will be gathered. The entirety of this data will be written to files corresponding to each language inside of TOK_TRAIN_DATA_DIR (e.g., en.txt, fr.txt, mfe.txt). These files will be read to create the final collection of tokenizer training data, which will contain SPM_TRAIN_SIZE lines.
A subfolder called {SRC_TOK_NAME}_{TGT_TOK_NAME} will be created inside TOK_TRAIN_DATA_DIR. Inside {SRC_TOK_NAME}_{TGT_TOK_NAME} the following files will be written:
- data_dict.json: a dictionary where the keys are the data files used to train the tokenizer (these point to the same language data files in TOK_TRAIN_DATA_DIR) and the values are the fraction of the total tokenizer training data (SPM_TRAIN_SIZE) that will come from the respective file. The data in each file will be up- or down-sampled to meet this amount.
- {SRC_TOK_NAME}_{TGT_TOK_NAME}.model: The spm model.
- {SRC_TOK_NAME}_{TGT_TOK_NAME}.vocab: The spm vocabulary file.
- training_data.s=1500.txt: The final collection of tokenizer training data extracted from the parallel data .csv files. This will contain SPM_TRAIN_SIZE sentences with the per-language distribution specified in DIST.
- training_data.s=1500div={language code}.txt: For each language in SRC_LANGS and TGT_LANGS, a file containing the subset of the final tokenizer data in training_data.s=1500.txt pertaining to that language. These files are not used for anything except as a way of logging the per-language training data; only training_data.s=1500.txt is read by the SentencePiece trainer.