A post-training, listwise method for adjusting word embeddings with synonyms and antonyms, based on an undirected list encoder generated from self-attention.
Paper link: https://drive.google.com/file/d/1UO6FVOwuAenNfWnkFrylgzHQe-9WQkff/view
- Please make sure the .txt files of embeddings and lexicons that you introduce to the model are encoded in UTF-8.
- Performance may drop significantly if the .txt files are not uniformly encoded (for this paper and also for related works). This is an empirical observation.
- A simple way to produce a UTF-8-encoded .txt file on a Windows computer:
  - Open the target .txt file (the one whose encoding you are unsure of) in Windows WordPad.
  - Choose File -> Save As -> select "UTF-8" in the Encoding box -> Save.
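- Alternatively, a minimal Python sketch that re-encodes a file in place; it assumes the source file is either UTF-8 already or cp1252 (a common Windows default), so adjust the fallback encoding to your locale:

    import pathlib

    def to_utf8(path, fallback="cp1252"):
        """Rewrite a text file as UTF-8, decoding with a fallback if needed."""
        raw = pathlib.Path(path).read_bytes()
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            text = raw.decode(fallback)
        pathlib.Path(path).write_text(text, encoding="utf-8")

    to_utf8("../data/lexicons/wordnet_syn_ant.txt")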
- The loss functions are located in ./code/trainer/pretrain.py.
- The loss function proposed in the paper does not reproduce the experimental results reported in the paper; its Python code is commented out in pretrain.py at line 114.
- The loss function that actually produces the results reported in the paper is at line 117.
- Experimental results drop significantly if the loss function at line 114 is used instead of the one at line 117.
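- For orientation only, a hypothetical PyTorch sketch of the kind of objective involved: a max-margin loss pulling a word toward its synonyms and away from its antonyms. This is not the code at line 114 or line 117, and the repository's actual loss may differ:

    import torch
    import torch.nn.functional as F

    # Hypothetical illustration, not the repository's actual loss function.
    def margin_loss(word, syns, ants, margin=0.5):
        """word: (d,), syns: (n_syn, d), ants: (n_ant, d) embedding tensors."""
        syn_sim = F.cosine_similarity(word.unsqueeze(0), syns)  # (n_syn,)
        ant_sim = F.cosine_similarity(word.unsqueeze(0), ants)  # (n_ant,)
        # Require every synonym to be closer to the word than every antonym, by a margin.
        gap = margin - syn_sim.unsqueeze(1) + ant_sim.unsqueeze(0)  # (n_syn, n_ant)
        return torch.relu(gap).mean()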
- Trains the self-attention model and adjusts the pre-trained word embeddings.
Usage:
$python main.py -ep <filepath of pre-trained embeddings>
-en <filename of pre-trained embeddings>
-lp <filepath of lexicons>
-vp <filepath of vocabulary>
-op <filepath to save model>
Example:
$python main.py -ep ../data/embeddings/GloVe/glove.6B.300d.txt
-en glove.6B.300d
-lp ../data/lexicons/wordnet_syn_ant.txt
-vp ../data/embeddings/GloVe/glove.6B.300d.txt.vocab.pkl
-op ../output/model/listwise.model
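- If the .vocab.pkl file is missing, a minimal sketch that rebuilds one from the embedding file; it assumes the pickle is simply the list of vocabulary words in file order (the format the repository actually expects may differ):

    import pickle

    emb_path = "../data/embeddings/GloVe/glove.6B.300d.txt"
    with open(emb_path, encoding="utf-8") as f:
        # Assumption: the word is the first token of each embedding row.
        vocab = [line.split(" ", 1)[0] for line in f]
    with open(emb_path + ".vocab.pkl", "wb") as f:
        pickle.dump(vocab, f)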
- Evaluates the most recently trained model and writes the adjusted embeddings to a .txt file at <filepath for saving embeddings>.
- Compares the performance of the word embeddings on word similarity tasks before and after adjusting.
- Converts the raw output to GloVe format.
Usage:
$python evaluation.py -ep <filepath of the original pre-trained embeddings>
-vp <filepath of vocabulary>
-op <filepath for saving embeddings>
Example:
$python evaluation.py -ep ../data/embeddings/GloVe/glove.6B.300d.txt
-vp ../data/embeddings/GloVe/glove.6B.300d.txt.vocab.pkl
-op ../output/embeddings/Listwise_Vectors.txt
- Pre-trained word embeddings, filtered to the 50K most frequent words, in GloVe format.
Data format:
word1 -0.09611 -0.25788 ... -0.092774 0.39058
word2 -0.24837 -0.45461 ... 0.15458 -0.38053
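- A minimal sketch for reading this format into a dictionary of NumPy vectors:

    import numpy as np

    def load_glove(path):
        """Map each word to its vector; one 'word v1 ... vd' row per line."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors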
- Synonyms and antonyms retrieved from a dictionary.
Data format:
word1 syn1 ... synn \t ant1 ... antn
word2 syn1 ... synn \t ant1 ... antn
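- A minimal sketch for parsing this format, assuming one word per line with its synonyms and antonyms separated by a tab:

    def load_lexicon(path):
        """Map each word to its (synonyms, antonyms) lists."""
        lexicon = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                syn_part, _, ant_part = line.rstrip("\n").partition("\t")
                word, *syns = syn_part.split()
                lexicon[word] = (syns, ant_part.split())
        return lexicon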
- Word similarity task datasets.
Data format:
word1 word2 50.00
word1 word2 49.00
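- A minimal sketch of the usual scoring protocol for such a file: the Spearman correlation between the human scores and the cosine similarities of the embedding vectors. Pairs with out-of-vocabulary words are skipped, and vectors is a word-to-array dictionary such as the loader sketched above produces:

    import numpy as np
    from scipy.stats import spearmanr

    def evaluate(vectors, task_path):
        """Spearman correlation between human and cosine similarity scores."""
        human, model = [], []
        with open(task_path, encoding="utf-8") as f:
            for line in f:
                w1, w2, score = line.split()
                if w1 in vectors and w2 in vectors:
                    v1, v2 = vectors[w1], vectors[w2]
                    human.append(float(score))
                    model.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        return spearmanr(human, model)[0]  # the correlation coefficient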