Software accompanying this paper from the RepEval 2017 EMNLP Workshop
Every Python script requires Python 3 with the following packages (which can be installed with pip, as shown after the list):
- NumPy
- SciPy
- PrettyTable
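For example, assuming pip is bound to your Python 3 installation, all three packages can be installed with:

    pip install numpy scipy prettytable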
If the bash scripts do not use the correct Python version on your system, you may need to make small changes to them by editing the python command they call.
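For example, if 'python' points to Python 2 on your system, the calls inside the scripts can be changed to use 'python3' explicitly (the line below is only a hypothetical illustration, not an actual line from the scripts):

    # hypothetical line from one of the bash scripts, before the change:
    python wordsim.py "$@"
    # after the change, pointing explicitly at Python 3:
    python3 wordsim.py "$@"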
The SPP datasets need to be fetched from the http://spp.montana.edu/ website. The 8 datasets are under the menus:
- Search... -> Lexical Decision Data... -> By item
- Search... -> Naming Data... -> By item
In both of these menus, download the datasets by going to the tabs "Assoc Related (1661)", "Assoc Unrel (1661)", "Other Assoc Related (1661)" and "Other Assoc Unrel (1661)".
Place these files into the 'data/raw/ldt' folder for the 4 lexical decision datasets and 'data/raw/nt' for the 4 naming datasets.
A full description of the datasets can be found in this paper.
Once the raw datasets have been downloaded, you can use the 'build_dataset.sh' script. It automatically extracts all the needed information from the raw datasets and builds two splits: a dev-test split and a train-dev-test split. The folds used to make these splits can be found in 'data/folds'.
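Once the raw files are in place, the script can be run directly from the folder that contains it, for example:

    bash build_dataset.sh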
To evaluate word embedding models on these datasets, we provide the 'wordsim.py' script. It takes a dataset (LDT or NT, 200 ms or 1200 ms) and one or more word embedding models. You can also provide a wordset to use as a filter: only words that appear in this wordset will be used from the word embedding models. This is useful for evaluating multiple models on the same pairs of words.
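As an illustration, an invocation could look like the following; the file names and the wordset option name are hypothetical, so check the script's --help output for the actual arguments:

    # hypothetical example: one dataset, two embedding models, and a wordset filter
    python3 wordsim.py data/ldt_200ms.csv glove.txt skipgram.txt --wordset common_words.txt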
Here are the scripts used in the paper:
- wordsim.py (Spearman's correlations shown in the first experiment)
- corrmatrix.py (needed to run the Steiger test)
- corrstats.py (performs the Steiger test)
- datasets_correlations.py (Spearman's correlations shown in the second experiment)
You can use the --help flag to get the usage of each of these scripts.
Additional useful scripts are available in data/tools/:
- extract_embedding_wordset.py (returns only the words that appear in all the given word embedding models)
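As an illustration, the common wordset of two models could be built with something like the following; the argument layout and the output redirection are assumptions, so check the script's --help output for its actual interface:

    # hypothetical usage: keep only the words shared by both embedding models
    python3 data/tools/extract_embedding_wordset.py glove.txt skipgram.txt > common_words.txt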
Here are the links to the off-the-shelf word embeddings used in the paper:
- GloVe
  - Dimension: 200
  - Window: 10
  - Corpus: Twitter
- SkipGram
  - Dimension: 300
  - Window: 5
  - Corpus: Google News
- Multilingual
  - Dimension: 512
  - Window: ?
  - Corpus: WMT-2011 (English, Spanish, German, French), WMT-2012 (French)
- Dependency-based
  - Dimension: 300
  - Window: Dynamic
  - Corpus: Wikipedia
- FastText
  - Dimension: 300
  - Window: 5
  - Corpus: Wikipedia