Adjusting Word Embeddings

Introduction

A post-training, listwise method for adjusting word embeddings with synonyms and antonyms, based on an undirected list encoder generated from self-attention.

Paper link: https://drive.google.com/file/d/1UO6FVOwuAenNfWnkFrylgzHQe-9WQkff/view

Nitty-Gritty to Know Before Starting

Encoding of Data

  • Make sure the .txt files of embeddings and lexicons that you feed to the model are encoded in UTF-8.
  • Performance may drop significantly if the .txt files are not uniformly encoded (for this paper and also for related works); this is an empirical observation.
  • A simple way to save a .txt file as UTF-8 on a Windows computer (a scripted alternative is sketched after this list):
    • Open the target .txt file (whose encoding you are unsure of) in Windows WordPad.
    • Press File -> Save As -> choose "UTF-8" in the Encoding box -> Save.
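If you prefer a scripted approach, the following minimal Python sketch performs the same conversion. The helper name and the assumed source encoding ("cp1252") are illustrative only; adjust the encoding to match your files.

# Hypothetical helper: re-encode a .txt file to UTF-8.
# The source encoding ("cp1252") is an assumption -- change it to match your file.
def to_utf8(src_path, dst_path, src_encoding="cp1252"):
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)

to_utf8("glove.6B.300d.txt", "glove.6B.300d.utf8.txt")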

Loss Function

  • The loss functions are located in ./code/trainer/pretrain.py.
  • The loss function proposed in the paper does not reproduce the experimental results reported in the paper; its Python code is commented out at line 114 of pretrain.py.
    • The loss function that actually produces the results reported in the paper is at line 117.
    • Experimental results drop significantly if the loss function at line 114 is used instead of the one at line 117.

Quickstart

main.py

  • Trains the self-attention model and adjusts the pre-trained word embeddings.

Usage:

$python main.py -ep <filepath of pre-trained embeddings> 
                -en <filename of pre-trained embeddings> 
                -lp <filepath of lexicons>
                -vp <filepath of vocabulary>
                -op <filepath to save model>

Example:

$python main.py -ep ../data/embeddings/GloVe/glove.6B.300d.txt 
                -en glove.6B.300d 
                -lp ../data/lexicons/wordnet_syn_ant.txt 
                -vp ../data/embeddings/GloVe/glove.6B.300d.txt.vocab.pkl 
                -op ../output/model/listwise.model

evaluation.py

  • Evaluates the most recently trained model and writes a .txt file with the adjusted embeddings to <filepath for saving embeddings>.
  • Compares the performance of the word embeddings on word similarity tasks before and after adjusting.
  • Converts the raw output to GloVe format.

Usage:

$python evaluation.py -ep <filepath of the original pre-trained embeddings>
                      -vp <filepath of vocabulary>
                      -op <filepath for saving embeddings>

Example:

$python evaluation.py -ep ../data/embeddings/GloVe/glove.6B.300d.txt 
                      -vp ../data/embeddings/GloVe/glove.6B.300d.txt.vocab.pkl
                      -op ../output/embeddings/Listwise_Vectors.txt

Datasets

Embeddings

  • Pre-trained word embeddings, filtered to the 50K most frequent words, in GloVe format.

Data format:

word1 -0.09611 -0.25788 ... -0.092774  0.39058
word2 -0.24837 -0.45461 ...  0.15458  -0.38053
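For illustration, here is a minimal Python sketch of reading this format. The loader below is not part of the repository's code; its name and behavior are assumptions based on the format shown above.

import numpy as np

# Hypothetical loader for GloVe-format text files: each line holds a word
# followed by its whitespace-separated vector components.
def load_glove(path):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return embeddings

vectors = load_glove("../data/embeddings/GloVe/glove.6B.300d.txt")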

Lexicons

  • Synonyms and antonyms retrieved from a dictionary.

Data format:

word1 syn1 ... synN \t ant1 ... antN
word2 syn1 ... synN \t ant1 ... antN
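A sketch of parsing this format, assuming the first token on each line is the headword, the remaining tokens before the tab are its synonyms, and the tokens after the tab are its antonyms (the parser is hypothetical, not the repository's code):

# Hypothetical parser for the lexicon format shown above.
def load_lexicon(path):
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            syn_part, _, ant_part = line.rstrip("\n").partition("\t")
            tokens = syn_part.split()
            lexicon[tokens[0]] = (tokens[1:], ant_part.split())
    return lexicon

lexicon = load_lexicon("../data/lexicons/wordnet_syn_ant.txt")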

Testsets

  • Datasets for word similarity tasks.

Data format:

word1 word2 50.00
word1 word2 49.00
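Such test sets are conventionally scored by taking the cosine similarity between each word pair's vectors and correlating it with the human ratings via Spearman's rho. The sketch below assumes the embeddings dict produced by the hypothetical load_glove above; it is not the repository's evaluation.py.

import numpy as np
from scipy.stats import spearmanr

# Sketch of a standard word-similarity evaluation; pairs containing
# out-of-vocabulary words are skipped.
def evaluate_similarity(embeddings, testset_path):
    gold, predicted = [], []
    with open(testset_path, encoding="utf-8") as f:
        for line in f:
            w1, w2, score = line.split()
            if w1 in embeddings and w2 in embeddings:
                v1, v2 = embeddings[w1], embeddings[w2]
                predicted.append(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
                gold.append(float(score))
    return spearmanr(gold, predicted).correlation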

References

  1. https://github.com/mfaruqui/retrofitting
  2. https://github.com/nmrksic/counter-fitting
  3. https://github.com/HwiyeolJo/Extrofitting
  4. https://github.com/codertimo/BERT-pytorch
  5. https://nlp.seas.harvard.edu/2018/04/03/attention.html
