# LexNET for Noun Compound Relation Classification

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for classifying relationships, specifically applied to classifying
the relationships that hold between noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
   constituents in a large corpus. For example, given a sentence like *This
   fine oil is made from first-press olives*, the dependency path is something
   like `oil <NSUBJPASS made PREP> from POBJ> olive`.
2. The distributional information provided by the individual words; i.e., the
   word embeddings of the two constituents.
3. The distributional signal provided by the compound itself; i.e., the
   embedding of the noun compound in context.

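To make the path notation concrete, here is a small sketch of how such a path
string can be decomposed into word nodes and labeled, directed dependency
edges. The function and tuple layout are illustrative assumptions, not code
from this repository:

    # Illustrative only: decompose a LexNET-style path string into its
    # word nodes and (label, direction) dependency edges.
    def parse_path(path):
      words, edges = [], []
      for token in path.split():
        if token.startswith('<'):    # arrow on the left, e.g. <NSUBJPASS
          edges.append((token[1:], '<'))
        elif token.endswith('>'):    # arrow on the right, e.g. PREP>
          edges.append((token[:-1], '>'))
        else:                        # a plain word node
          words.append(token)
      return words, edges

    # parse_path('oil <NSUBJPASS made PREP> from POBJ> olive')
    # -> (['oil', 'made', 'from', 'olive'],
    #     [('NSUBJPASS', '<'), ('PREP', '>'), ('POBJ', '>')])
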
The model comes in several variants: the *path-based model* uses (1) alone,
the *distributional model* uses (2) alone, and the *integrated model* uses (1)
and (2). The *distributional-nc* and *integrated-nc* models each add (3).

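These variant names correspond to the values accepted by the `--input` flag of
`learn_classifier.py` (see "Train classifiers" below). As a summary only, not
code from this repository, the mapping from variant to feature sources is:

    # Summary of which signals each --input variant combines (illustrative).
    MODEL_FEATURES = {
        'path':          ['paths'],                            # (1)
        'dist':          ['word embeddings'],                  # (2)
        'dist-nc':       ['word embeddings', 'nc embedding'],  # (2) + (3)
        'integrated':    ['paths', 'word embeddings'],         # (1) + (2)
        'integrated-nc': ['paths', 'word embeddings', 'nc embedding'],
    }
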
Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*. The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g., *part of* versus *composed of*
   versus *purpose*), and typically consists of tens of classes.
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional
   models use the word embeddings directly as prediction features.
3. A collection of syntactic dependency parses that connect the constituents
   of each noun compound (required by the path-based model).

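For intuition, a single training example conceptually bundles these pieces.
The record below is a purely hypothetical illustration of that bundle, not the
on-disk format this code reads:

    # Hypothetical illustration of one labeled training example; NOT the
    # file format used by this repository.
    example = {
        'modifier': 'olive',
        'head': 'oil',
        'relation': 'MADE_FROM',  # label drawn from the relation inventory
        # Dependency paths observed between the constituents, with counts:
        'paths': {'oil <NSUBJPASS made PREP> from POBJ> olive': 12},
    }
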
At the moment, this repository does not contain the tools for generating this
data, but we will provide references to existing datasets and plan to add
tools to generate the data in the future.

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a
  path-based model to predict a noun-compound relationship given labeled noun
  compounds and dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier
  based on any combination of paths, word embeddings, and noun-compound
  embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
  instructions at that site.
* [scikit-learn](http://scikit-learn.org/): you can probably just install this
  with `pip install scikit-learn`.

# Creating the Model

This section describes the steps necessary to reproduce the results in the
paper.

## Generate/Download Path Data

TBD! Our plan is to make available the aggregate path data that was used to
train the path embeddings and classifiers; however, this will be released
separately.

## Generate/Download Embedding Data

TBD! While we used the standard GloVe vectors for the relata embeddings, the
NC embeddings were generated separately. Our plan is to make that data
available, but it will be released separately.

## Create Path Embeddings

Create the path embeddings using `learn_path_embeddings.py`. This shell script
fragment will iterate through each dataset, split, and corpus to generate path
embeddings for each:

    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          python learn_path_embeddings.py \
            --dataset_dir ~/lexnet/datasets \
            --dataset "${DATASET}" \
            --corpus "${SPLIT}/${CORPUS}" \
            --embeddings_base_path ~/lexnet/embeddings \
            --logdir /tmp/learn_path_embeddings
        done
      done
    done

The path embeddings will be placed in the directory specified by
`--embeddings_base_path`.

## Train classifiers

Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script. This shell script fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers:

    LOGDIR=/tmp/learn_classifier
    mkdir -p "${LOGDIR}"  # ensure the log directory exists before redirecting
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir "${LOGDIR}" \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

The log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.

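The reported numbers are standard classification metrics; a minimal sketch of
computing them with scikit-learn (a listed dependency) from gold and predicted
relation labels. The label lists and the averaging choice are illustrative
assumptions:

    # Minimal sketch of the reported metrics using scikit-learn.
    from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

    gold = ['MADE_FROM', 'USED_FOR', 'MADE_FROM']  # true labels (placeholder)
    pred = ['MADE_FROM', 'USED_FOR', 'USED_FOR']   # predictions (placeholder)

    precision, recall, f1, _ = precision_recall_fscore_support(
        gold, pred, average='weighted')
    print(precision, recall, f1)
    print(confusion_matrix(gold, pred, labels=['MADE_FROM', 'USED_FOR']))
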
# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.