Code to train classifiers for abbreviation detection and expansion in context. This repository also contains the evaluation code that complements the paper "Dealing with Abbreviations in the Slovenian Biographical Lexicon", to be presented at the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022).
Download the repository:
git clone git@github.com:angel-daza/abbreviation-detector.git

Create a new environment:
conda create -n abbr-detector python=3.9
conda activate abbr-detector

Install Requirements:
pip install -r requirements

Create the Dataset Train/Dev/Test Partitions:
python3 slovene_abbr_preprocess.py

To Reproduce the Baseline Results:
python3 naive_baselines.py

To Reproduce the BERT Abbreviation Classifier Results:
# 1) Train the Binary BERT Classifier [ABBR, NO_ABBR]
python3 bert_token_classifier.py -t data/sbl-51abbr.tok.train.json -d data/sbl-51abbr.tok.dev.json \
    --bert_model 'EMBEDDIA/sloberta' --save_model_dir saved_models/BERT_ABBR_876972 \
    --epochs 5 --batch_size 32 --info_every 10 --seed_val 876972
# 2) Make predictions using the BERT Classifier
python3 bert_token_classifier_predict.py -m saved_models/BERT_ABBR_876972 --bert_model 'EMBEDDIA/sloberta' \
    --epoch 1 --test_path data/sbl-51abbr.tok.test.json --gold_labels True

To Reproduce the BERT Abbreviation Expansion Results:
python3 bert_abbrev_expansion.py
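
The steps above can also be chained from a single script. The following is a minimal sketch, not part of the repository: the function names `build_steps` and `run_all` and the `--run` switch are hypothetical, while the script names, paths, and flags are copied from the commands in this README. It assumes the environment is already created and the requirements are installed, and that it is launched from the repository root.

```python
import shlex
import subprocess
import sys

# Values copied from the README commands.
SEED = 876972
MODEL_DIR = f"saved_models/BERT_ABBR_{SEED}"
BERT_MODEL = "EMBEDDIA/sloberta"


def build_steps(python=sys.executable):
    """Return the README commands as argument lists, in execution order.

    Hypothetical helper: it only assembles the commands, it does not run them.
    """
    return [
        # Create the train/dev/test partitions.
        [python, "slovene_abbr_preprocess.py"],
        # Baseline results.
        [python, "naive_baselines.py"],
        # 1) Train the binary BERT classifier [ABBR, NO_ABBR].
        [python, "bert_token_classifier.py",
         "-t", "data/sbl-51abbr.tok.train.json",
         "-d", "data/sbl-51abbr.tok.dev.json",
         "--bert_model", BERT_MODEL,
         "--save_model_dir", MODEL_DIR,
         "--epochs", "5", "--batch_size", "32",
         "--info_every", "10", "--seed_val", str(SEED)],
        # 2) Predict with the trained classifier.
        [python, "bert_token_classifier_predict.py",
         "-m", MODEL_DIR,
         "--bert_model", BERT_MODEL,
         "--epoch", "1",
         "--test_path", "data/sbl-51abbr.tok.test.json",
         "--gold_labels", "True"],
        # Abbreviation expansion results.
        [python, "bert_abbrev_expansion.py"],
    ]


def run_all(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_steps():
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical switch: pass --run to execute instead of only printing.
    run_all(dry_run="--run" not in sys.argv)
```

Saved under any file name in the repository root, the script prints the commands by default and runs them in order when called with `--run`. Passing each command as an argument list avoids the shell line-continuation pitfalls of the multi-line commands above.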