JomnTAL/spp-wordsim

Software accompanying this paper from the RepEval 2017 EMNLP Workshop

Requirements

Every Python script requires Python 3 and the following packages, all of which can be installed with pip:

  • NumPy
  • SciPy
  • PrettyTable

If the bash scripts do not call the correct Python version on your system, change the python command used in those scripts.
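As a quick sanity check before running anything, the sketch below reports which of the packages above cannot be imported by the current interpreter. It is not part of the repository. The import names `numpy`, `scipy` and `prettytable` are assumed to match the package names listed above.

```python
import importlib.util

# Import names of the packages listed above (assumed, not taken from the repository).
REQUIRED = ["numpy", "scipy", "prettytable"]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that the current interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

Run it with the same python command that the bash scripts use, so that the check covers the interpreter the scripts will actually call.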

Downloading the SPP datasets

The SPP datasets need to be fetched from the http://spp.montana.edu/ website. The 8 datasets are under the menus:

  • Search... -> Lexical Decision Data... -> By item
  • Search... -> Naming Data... -> By item

In both of these menus, download the datasets by going to the tabs "Assoc Related (1661)", "Assoc Unrel (1661)", "Other Assoc Related (1661)" and "Other Assoc Unrel (1661)".

Place these files into the 'data/raw/ldt' folder for the 4 lexical decision datasets and 'data/raw/nt' for the 4 naming datasets.
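To confirm the layout before building, a small check along these lines can help. It is a sketch, not part of the repository: it only counts the files in each folder and makes no assumption about the names or format of the downloaded files. The function name `check_raw_layout` is hypothetical.

```python
from pathlib import Path

# Expected number of raw files per folder, as described above.
EXPECTED = {"ldt": 4, "nt": 4}


def check_raw_layout(root="data/raw"):
    """Return {folder: files found} for every folder without the expected file count."""
    problems = {}
    for task, expected in EXPECTED.items():
        folder = Path(root) / task
        if folder.is_dir():
            found = sum(1 for entry in folder.iterdir() if entry.is_file())
        else:
            found = 0  # a missing folder counts as holding no files
        if found != expected:
            problems[task] = found
    return problems


if __name__ == "__main__":
    problems = check_raw_layout()
    if problems:
        print("Unexpected file counts:", problems)
    else:
        print("Found 4 files in data/raw/ldt and 4 in data/raw/nt.")
```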

A full description of the datasets can be found in this paper.

Building the dataset

Once the raw datasets have been downloaded, run the 'build_dataset.sh' script. It extracts the needed information from the raw datasets and builds two splits: a dev-test split and a train-dev-test split. The folds used to make these splits are in 'data/folds'.

Word embeddings evaluation

To evaluate word embedding models on the datasets built above, we provide the 'wordsim.py' script. It needs a dataset (LDT/NT 200/1200ms) and one or more word embedding models. You can also provide a wordset to use as a filter: only words in this wordset will be used from the word embedding models. This is useful for evaluating multiple models on the same pairs of words.
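The idea behind this kind of evaluation can be sketched as follows. This is an illustration, not the code of 'wordsim.py'. It assumes that each word pair is scored by the cosine similarity of its two vectors and that these scores are compared to the dataset values with Spearman's correlation. The function names and the in-memory data layout are hypothetical. Only the standard library is used so the sketch runs anywhere; `scipy.stats.spearmanr` computes the same statistic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs, embeddings, wordset=None):
    """Correlate cosine similarities with dataset scores.

    pairs: iterable of (word1, word2, score) tuples (hypothetical layout).
    embeddings: dict mapping word -> vector.
    wordset: optional set of allowed words, used as a filter.
    Returns (rho, number of pairs used).
    """
    sims, scores = [], []
    for w1, w2, score in pairs:
        if wordset is not None and (w1 not in wordset or w2 not in wordset):
            continue  # pair rejected by the wordset filter
        if w1 not in embeddings or w2 not in embeddings:
            continue  # pair not covered by this model
        sims.append(cosine(embeddings[w1], embeddings[w2]))
        scores.append(score)
    return spearman(sims, scores), len(sims)
```

Passing the same `wordset` for every model keeps the set of scored pairs identical across models, which is what makes their correlations comparable.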

Included scripts

Here are the different scripts that are used in the paper:

  • wordsim.py (Spearman's correlations shown in the first experiment)
  • corrmatrix.py (needed to do the Steiger test)
  • corrstats.py (does the Steiger test)
  • datasets_correlations.py (Spearman's correlations shown in the second experiment)

You can use the --help flag to get the usage of these commands.

Additional useful scripts available in data/tools/:

  • extract_embedding_wordset.py (returns only the words that appear in all the given word embedding models)
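The operation described for extract_embedding_wordset.py is a set intersection over the model vocabularies. The sketch below shows the idea only and is not the repository's code. It assumes text-format embedding files with one word per line followed by its vector components, optionally preceded by a word2vec-style header line. The function names are hypothetical.

```python
def read_vocabulary(path):
    """Collect the first whitespace-separated token of every non-empty line."""
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            fields = line.split()
            if not fields:
                continue
            # Skip a word2vec-style header line ("<vocabulary size> <dimension>").
            if line_number == 0 and len(fields) == 2 and all(f.isdigit() for f in fields):
                continue
            words.add(fields[0])
    return words


def common_wordset(paths):
    """Return the words that appear in every embedding file in `paths`."""
    vocabularies = [read_vocabulary(path) for path in paths]
    return set.intersection(*vocabularies) if vocabularies else set()
```

The resulting set can then serve as the wordset filter, so that every model is evaluated on the same word pairs.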

Word embeddings

Here are the links to the off-the-shelf word embeddings used in the paper:

  • GloVe
    • Dimension: 200
    • Window: 10
    • Corpus: Twitter
  • SkipGram
    • Dimension: 300
    • Window: 5
    • Corpus: Google News
  • Multilingual
    • Dimension: 512
    • Window: ?
    • Corpus: WMT-2011 (English, Spanish, German, French), WMT-2012 (French)
  • Dependency-based
    • Dimension: 300
    • Window: Dynamic
    • Corpus: Wikipedia
  • FastText
    • Dimension: 300
    • Window: 5
    • Corpus: Wikipedia
