
Commit deb772d

Merge pull request #3683 from waterson/master
Add LexNet noun compounds model to models repository.
2 parents 2b77572 + 1f37153 commit deb772d

10 files changed: +1887 −4 lines changed

CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@
 /research/inception/ @shlens @vincentvanhoucke
 /research/learned_optimizer/ @olganw @nirum
 /research/learning_to_remember_rare_events/ @lukaszkaiser @ofirnachum
+/research/lexnet_nc/ @vered1986 @waterson
 /research/lfads/ @jazcollins @susillo
 /research/lm_1b/ @oriolvinyals @panyx0718
 /research/maskgan/ @a-dai

research/README.md

Lines changed: 2 additions & 0 deletions
@@ -36,6 +36,8 @@ installation](https://www.tensorflow.org/install).
 - [inception](inception): deep convolutional networks for computer vision.
 - [learning_to_remember_rare_events](learning_to_remember_rare_events): a
   large-scale life-long memory module for use in deep learning.
+- [lexnet_nc](lexnet_nc): a distributed model for noun compound relationship
+  classification.
 - [lfads](lfads): sequential variational autoencoder for analyzing
   neuroscience data.
 - [lm_1b](lm_1b): language modeling on the one billion word benchmark.

research/lexnet_nc/README.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@

# LexNET for Noun Compound Relation Classification

This is a [TensorFlow](http://www.tensorflow.org/) implementation of the LexNET
algorithm for classifying relationships, applied here to the relationships that
hold between the constituents of noun compounds:

* *olive oil* is oil that is *made from* olives
* *cooking oil* is oil that is *used for* cooking
* *motor oil* is oil that is *contained in* a motor

The model is a supervised classifier that predicts the relationship that holds
between the constituents of a two-word noun compound using:

1. A neural "paraphrase" of each syntactic dependency path that connects the
   constituents in a large corpus. For example, given a sentence like *This fine
   oil is made from first-press olives*, the dependency path is something like
   `oil <NSUBJPASS made PREP> from POBJ> olive`.
2. The distributional information provided by the individual words, i.e., the
   word embeddings of the two constituents.
3. The distributional signal provided by the compound itself, i.e., the
   embedding of the noun compound in context.

The model includes several variants: the *path-based model* uses (1) alone, the
*distributional model* uses (2) alone, and the *integrated model* uses (1) and
(2). The *distributional-nc* and *integrated-nc* models each add (3).
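
To make the path representation in (1) concrete, here is a minimal sketch
(illustrative only, not code from this repository) of how the compact path
string above decomposes into per-edge pieces. The hyperparameters used in
`get_indicative_paths.py` below suggest that each edge carries a lemma, a POS
tag, a dependency label, and a direction; the compact string omits the POS
tags.

    # Illustrative only -- not code from this repository.  The compact path
    # string alternates lemmas with dependency-label tokens; a leading '<' or
    # trailing '>' on a label marks the direction of that edge.
    def split_path(path):
      tokens = path.split()
      lemmas = tokens[0::2]        # ['oil', 'made', 'from', 'olive']
      edges = []
      for dep in tokens[1::2]:     # ['<NSUBJPASS', 'PREP>', 'POBJ>']
        direction = 'left' if dep.startswith('<') else 'right'
        edges.append((dep.strip('<>').lower(), direction))
      return lemmas, edges

    print(split_path('oil <NSUBJPASS made PREP> from POBJ> olive'))
    # (['oil', 'made', 'from', 'olive'],
    #  [('nsubjpass', 'left'), ('prep', 'right'), ('pobj', 'right')])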

Training a model requires the following:

1. A collection of noun compounds that have been labeled using a *relation
   inventory*. The inventory describes the specific relationships that you'd
   like the model to differentiate (e.g. *part of* versus *composed of* versus
   *purpose*), and generally may consist of tens of classes (a toy illustration
   follows this list).
2. A collection of word embeddings: the path-based model uses the word
   embeddings as part of the path representation, and the distributional models
   use the word embeddings directly as prediction features.
3. For the path-based model, a collection of syntactic dependency parses that
   connect the constituents for each noun compound.
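
The sketch below is that toy illustration of item (1): the relation labels are
hypothetical placeholders, not the actual classes of the Tratz inventory used
in the experiments, but they show the shape of the labeled data.

    # Purely illustrative: a relation inventory is a fixed label set, and each
    # training example pairs a two-word compound with one label.  These label
    # names are hypothetical, not the actual Tratz classes.
    RELATION_INVENTORY = ['MADE_FROM', 'USED_FOR', 'CONTAINED_IN']

    LABELED_COMPOUNDS = [
        ('olive', 'oil', 'MADE_FROM'),      # olive oil: oil made from olives
        ('cooking', 'oil', 'USED_FOR'),     # cooking oil: oil used for cooking
        ('motor', 'oil', 'CONTAINED_IN'),   # motor oil: oil contained in a motor
    ]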

At the moment, this repository does not contain the tools for generating this
data, but we will provide references to existing datasets and plan to add tools
to generate the data in the future.

# Contents

The following source code is included here:

* `learn_path_embeddings.py` is a script that trains and evaluates a path-based
  model to predict a noun-compound relationship given labeled noun compounds and
  dependency parse paths.
* `learn_classifier.py` is a script that trains and evaluates a classifier based
  on any combination of paths, word embeddings, and noun-compound embeddings.
* `get_indicative_paths.py` is a script that generates the most indicative
  syntactic dependency paths for a particular relationship.

# Dependencies

* [TensorFlow](http://www.tensorflow.org/): see detailed installation
  instructions at that site.
* [SciKit Learn](http://scikit-learn.org/): you can probably just install this
  with `pip install sklearn`.

# Creating the Model

This section describes the steps required to reproduce the results reported in
the paper.

## Generate/Download Path Data

TBD! Our plan is to make available the aggregate path data that was used to
train the path embeddings and classifiers; however, this will be released
separately.

## Generate/Download Embedding Data

TBD! While we used the standard GloVe vectors for the relata embeddings, the NC
embeddings were generated separately. Our plan is to make that data available,
but it will be released separately.

## Create Path Embeddings

Create the path embeddings using `learn_path_embeddings.py`. This shell script
fragment will iterate through each dataset, split, and corpus to generate path
embeddings for each:

    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          python learn_path_embeddings.py \
            --dataset_dir ~/lexnet/datasets \
            --dataset "${DATASET}" \
            --corpus "${SPLIT}/${CORPUS}" \
            --embeddings_base_path ~/lexnet/embeddings \
            --logdir /tmp/learn_path_embeddings
        done
      done
    done

The path embeddings will be placed in the directory specified by
`--embeddings_base_path`.
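
For orientation, the trained path embeddings for each dataset/corpus
combination are later looked up under a `path_embeddings/{dataset}/{corpus}`
subdirectory of `--embeddings_base_path`; this is the pattern that
`get_indicative_paths.py` (included in this commit) reads from. A minimal
sketch, using the example flag values from the fragment above:

    import os

    # Sketch only: reconstruct the directory that get_indicative_paths.py reads
    # path embeddings from, using the example flag values shown above.
    embeddings_base_path = os.path.expanduser('~/lexnet/embeddings')
    dataset = 'tratz/fine_grained'
    corpus = 'random/wiki_gigiawords'   # i.e. "${SPLIT}/${CORPUS}"

    path_embeddings_dir = os.path.join(
        embeddings_base_path, 'path_embeddings', dataset, corpus)
    # e.g. .../embeddings/path_embeddings/tratz/fine_grained/random/wiki_gigiawords
    print(path_embeddings_dir)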

## Train Classifiers

Train classifiers and evaluate them on the validation and test data using the
`learn_classifier.py` script. This shell script fragment will iterate through
each dataset, split, corpus, and model type to train and evaluate classifiers:

    LOGDIR=/tmp/learn_classifier
    for DATASET in tratz/fine_grained tratz/coarse_grained ; do
      for SPLIT in random lexical_head lexical_mod lexical_full ; do
        for CORPUS in wiki_gigiawords ; do
          for MODEL in dist dist-nc path integrated integrated-nc ; do
            # Filename for the log that will contain the classifier results.
            LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
            python learn_classifier.py \
              --dataset_dir ~/lexnet/datasets \
              --dataset "${DATASET}" \
              --corpus "${SPLIT}/${CORPUS}" \
              --embeddings_base_path ~/lexnet/embeddings \
              --logdir ${LOGDIR} \
              --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
          done
        done
      done
    done

Each log file will contain the final performance (precision, recall, F1) on the
train, dev, and test sets, and will include a confusion matrix for each.

# Contact

If you have any questions, issues, or suggestions, feel free to contact either
@vered1986 or @waterson.

research/lexnet_nc/get_indicative_paths.py

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@

#!/usr/bin/env python
# Copyright 2017, 2018 Google, Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Extracts paths that are indicative of each relation."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import tensorflow as tf

from . import path_model
from . import lexnet_common

tf.flags.DEFINE_string(
    'dataset_dir', 'datasets',
    'Dataset base directory')

tf.flags.DEFINE_string(
    'dataset',
    'tratz/fine_grained',
    'Subdirectory containing the corpus directories: '
    'subdirectory of dataset_dir')

tf.flags.DEFINE_string(
    'corpus', 'random/wiki',
    'Subdirectory containing the corpus and split: '
    'subdirectory of dataset_dir/dataset')

tf.flags.DEFINE_string(
    'embeddings_base_path', 'embeddings',
    'Embeddings base directory')

tf.flags.DEFINE_string(
    'logdir', 'logdir',
    'Directory of model output files')

tf.flags.DEFINE_integer(
    'top_k', 20, 'Number of top paths to extract')

tf.flags.DEFINE_float(
    'threshold', 0.8, 'Threshold above which to consider paths as indicative')

FLAGS = tf.flags.FLAGS


def main(_):
  hparams = path_model.PathBasedModel.default_hparams()

  # First things first. Load the path data.
  path_embeddings_file = 'path_embeddings/{dataset}/{corpus}'.format(
      dataset=FLAGS.dataset,
      corpus=FLAGS.corpus)

  path_dim = (hparams.lemma_dim + hparams.pos_dim +
              hparams.dep_dim + hparams.dir_dim)

  path_embeddings, path_to_index = path_model.load_path_embeddings(
      os.path.join(FLAGS.embeddings_base_path, path_embeddings_file),
      path_dim)

  # Load and count the classes so we can correctly instantiate the model.
  classes_filename = os.path.join(
      FLAGS.dataset_dir, FLAGS.dataset, 'classes.txt')

  with open(classes_filename) as f_in:
    classes = f_in.read().splitlines()

  hparams.num_classes = len(classes)

  # We need the word embeddings to instantiate the model, too.
  print('Loading word embeddings...')
  lemma_embeddings = lexnet_common.load_word_embeddings(
      FLAGS.embeddings_base_path, hparams.lemma_embeddings_file)

  # Instantiate the model.
  with tf.Graph().as_default():
    with tf.variable_scope('lexnet'):
      instance = tf.placeholder(dtype=tf.string)
      model = path_model.PathBasedModel(
          hparams, lemma_embeddings, instance)

    with tf.Session() as session:
      model_dir = '{logdir}/results/{dataset}/path/{corpus}'.format(
          logdir=FLAGS.logdir,
          dataset=FLAGS.dataset,
          corpus=FLAGS.corpus)

      saver = tf.train.Saver()
      saver.restore(session, os.path.join(model_dir, 'best.ckpt'))

      path_model.get_indicative_paths(
          model, session, path_to_index, path_embeddings, classes,
          model_dir, FLAGS.top_k, FLAGS.threshold)

if __name__ == '__main__':
  tf.app.run()

0 commit comments
