This repository was archived by the owner on Nov 8, 2022. It is now read-only.

Commit 8784c11

Merge pull request #252 from NervanaSystems/set_expansion_PR
Set Expansion Solution PR
2 parents: 0e8cd8c + 58568a6

File tree

22 files changed: +1346 -10 lines changed

.gitignore

Lines changed: 3 additions & 2 deletions

@@ -12,13 +12,13 @@
 .styleenv
 .coverage
 build
-*.gz
 generated
 *.ropeproject
 *.cubin
 *.hdf5
 *.h5
 *.html
+!solutions/set_expansion/ui/templates/*.html
 .vscode
 !server/web_service/static/*.html
 !tests/fixtures/data/server/*.gz
@@ -31,4 +31,5 @@ pylint.html
 pylint.txt
 flake8.txt
 nlp_architect/pipelines/bist-pretrained/*
-nlp_architect/api/ner-pretrained/*
+venv
+nlp_architect/api/ner-pretrained/*

Makefile

Lines changed: 2 additions & 2 deletions

@@ -14,8 +14,8 @@
 # limitations under the License.
 # ******************************************************************************

-FLAKE8_CHECK_DIRS := examples nlp_architect/* server tests
-PYLINT_CHECK_DIRS := examples nlp_architect server tests setup
+FLAKE8_CHECK_DIRS := examples nlp_architect/* server tests solutions
+PYLINT_CHECK_DIRS := examples nlp_architect server tests setup solutions
 DOC_DIR := doc
 DOC_PUB_RELEASE_PATH := $(DOC_PUB_PATH)/$(RELEASE)
Binary file (44.9 MB) not shown.
Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
Data:
==========
enwiki-20171201_subset.txt is a subset of Wikimedia English data dumps:

https://meta.wikimedia.org/wiki/Data_dumps
https://dumps.wikimedia.org/enwiki/

License:
==========
Creative Commons Attribution-Share-Alike 3.0 License
https://creativecommons.org/licenses/by-sa/3.0/

doc/source/index.rst

Lines changed: 7 additions & 0 deletions

@@ -128,6 +128,13 @@ on this project, please see the :doc:`developer guide <developer_guide>`.
    memn2n.rst
    kvmemn2n.rst

+.. toctree::
+   :hidden:
+   :maxdepth: 1
+   :caption: Solutions
+
+   term_set_expansion.rst
+
 .. toctree::
    :hidden:
    :maxdepth: 1

doc/source/term_set_expansion.rst

Lines changed: 129 additions & 0 deletions

@@ -0,0 +1,129 @@
.. ---------------------------------------------------------------------------
.. Copyright 2016-2018 Intel Corporation
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
.. ---------------------------------------------------------------------------

Set Expansion Solution
######################

Overview
========
Term set expansion is the task of expanding a given partial set of terms into
a more complete set of terms that belong to the same semantic class. This
solution demonstrates the capability of a corpus-based set expansion system
in a simple web application.

.. image:: assets/expansion_demo.png
Algorithm Overview
==================
Our approach is described in Mamou et al. (2018). It is based on representing
each term of the training corpus with word embeddings, which are used to
estimate the similarity between the seed terms and any candidate term. Noun
phrases provide a good approximation of candidate terms and are extracted in
our system using a noun phrase chunker. At expansion time, given a set of
seed terms, the most similar terms are returned.
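At its core, the expansion step is a nearest-neighbour query in the embedding
space. The following is a minimal sketch of that idea (not the solution's own
code), assuming a word2vec-format model whose vocabulary contains the marked
noun-phrase terms:

.. code:: python

    from gensim.models import KeyedVectors

    # Load a word2vec-format model (e.g. the output of np2vec training).
    model = KeyedVectors.load_word2vec_format('MODEL_PATH')

    # gensim averages the seed vectors and returns the vocabulary terms whose
    # embeddings are closest (by cosine similarity) to that centroid.
    seed = ['deep_learning_', 'machine_learning_']  # hypothetical marked terms
    for term, score in model.most_similar(positive=seed, topn=10):
        print(term, score)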
Flow
====

.. image:: assets/expansion_flow.png

Training
========
The first step in training is to prepare the data for generating a word
embedding model. We provide a subset of English Wikipedia at
datasets/wikipedia as a sample corpus, distributed under the
`Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__ (Copyright 2018 Wikimedia Foundation).
The output of this step is the marked corpus, in which noun phrases are
marked with the marking character (default: "\_"), as described in the NLP
Architect :doc:`np2vec` module documentation. The pre-processing script can
extract noun phrases either with the NLP Architect :doc:`noun phrase
extractor <spacy_np_annotator>`, which uses an LSTM :doc:`chunker` model, or
with spaCy's built-in noun phrase matcher. To run it:

.. code:: bash

    python solutions/set_expansion/prepare_data.py --corpus TRAINING_CORPUS --marked_corpus MARKED_TRAINING_CORPUS
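For intuition, marking a corpus amounts to rewriting each noun phrase as a
single vocabulary item. A rough sketch of the idea using spaCy's built-in
noun chunks (the actual prepare_data.py script handles tokenization and the
extractor options described above):

.. code:: python

    import spacy

    nlp = spacy.load('en_core_web_sm')

    def mark_noun_phrases(text, mark_char='_'):
        """Naively rewrite noun phrases as single marked tokens,
        e.g. 'computer science' -> 'computer_science_'."""
        doc = nlp(text)
        for np in doc.noun_chunks:
            text = text.replace(np.text,
                                np.text.replace(' ', mark_char) + mark_char)
        return text

    print(mark_noun_phrases('Machine learning is a subfield of computer science.'))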
The next step is to train the model using the NLP Architect :doc:`np2vec`
module. For set expansion, we recommend setting the hyperparameters to
size=100, min_count=10, window=10, and hs=0. Please refer to the np2vec
module documentation for more details about these parameters.

.. code:: bash

    python examples/np2vec/train.py --size 100 --min_count 10 --window 10 --hs 0 --corpus MARKED_TRAINING_CORPUS --np2vec_model_file MODEL_PATH --corpus_format txt
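Since :doc:`np2vec` builds on gensim's word2vec, the command above
corresponds roughly to the following gensim 3.x call (a simplified
assumption; the np2vec module adds its own handling of marked noun phrases):

.. code:: python

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Stream the marked corpus (one sentence per line) and train word2vec
    # with the hyperparameters recommended above.
    sentences = LineSentence('MARKED_TRAINING_CORPUS')
    model = Word2Vec(sentences, size=100, min_count=10, window=10, hs=0)
    model.wv.save_word2vec_format('MODEL_PATH')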
A `pretrained model <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201_pretrained_set_expansion.txt>`__,
trained on an English Wikipedia dump (enwiki-20171201-pages-articles-multistream.xml.bz2)
with the hyperparameter values recommended above, is available under the
Apache 2.0 license. The full English Wikipedia `raw corpus <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201.txt>`_ and
`marked corpus <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201_spacy_marked.txt>`_
are also available under the
`Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__.
Inference
=========

The inference step expands the given seed terms into a set of terms that
belong to the same semantic class. It can be done in two ways:

1. Running a Python script:

.. code:: bash

    python solutions/set_expansion/set_expand.py --np2vec_model_file MODEL_PATH --topn TOPN
2. Web application

A. Load the expand server with the trained model:

.. code:: bash

    python expand_server.py [--host HOST] [--port PORT] model_path

The expand server receives requests containing seed terms and expands them
based on the given word embedding model. You can use the model you trained
yourself in the previous step, or provide a pre-trained model of your own.
**Important note**: by default, the server listens on localhost:1234. If you
set a different host/port, you should also set it in the ui/settings.py file.
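The UI reads the server address from ui/settings.py, so the two must match.
A purely illustrative sketch of such a setting (the actual variable names in
settings.py may differ; check the file itself):

.. code:: python

    # ui/settings.py (illustrative only; real names may differ)
    EXPAND_HOST = '10.0.0.5'  # hypothetical host passed to expand_server.py
    EXPAND_PORT = 1234        # must match the port the expand server uses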
B. Run the UI application:

.. code:: bash

    bokeh serve --show ui

The UI is a simple web-based application for performing expansion. It
communicates with the server by sending expand requests, presents the
results in a simple table, and can export them to a CSV file. You can either
type the terms to expand directly or select terms from the model's
vocabulary list. Once you get expansion results, you can re-expand by
selecting terms from the results (hold the Ctrl key for multiple selection).
**Important note**: if you set the host/port of the expand server, you
should also set them in the ui/settings.py file. You can also serve the UI
remotely using the bokeh options --address and --port, for example:

.. code:: bash

    bokeh serve ui --address=12.13.14.15 --port=1010 --allow-websocket-origin=12.13.14.15:1010
Citation
========

`Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow
<http://arxiv.org/abs/1807.10104>`__, Jonathan Mamou, Oren Pereg, Moshe
Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin,
Peter Izsak, Daniel Korat. COLING 2018 System Demonstration paper.

nlp_architect/models/np2vec.py

Lines changed: 2 additions & 2 deletions

@@ -239,7 +239,7 @@ def save(self, np2vec_model_file='np2vec.model', binary=False):
         total_vec = 0
         vector_size = self.model.vector_size
         for word in self.model.wv.vocab.keys():
-            if self.is_marked(word):
+            if self.is_marked(word) and len(word) > 1:
                 total_vec += 1
         logger.info(
             "storing %sx%s projection weights for NP's into %s",
@@ -250,7 +250,7 @@ def save(self, np2vec_model_file='np2vec.model', binary=False):
         for word, vocab in sorted(
                 iteritems(
                     self.model.wv.vocab), key=lambda item: -item[1].count):
-            if self.is_marked(word):
+            if self.is_marked(word) and len(word) > 1:  # discard empty marked np's
                 embedding_vec = self.model.wv.syn0[vocab.index]
                 if binary:
                     fout.write(

nlp_architect/utils/generic.py

Lines changed: 4 additions & 4 deletions

@@ -134,13 +134,13 @@ def get_paddedXY_sequence(X, y, vocab_size=20000, sentence_length=100, oov=2,

 def license_prompt(model_name, model_website, dataset_dir=None):
     if dataset_dir:
-        print('{} was not found in the directory: {}'.format(model_name, dataset_dir))
+        print('\n\n***\n{} was not found in the directory: {}'.format(model_name, dataset_dir))
     else:
-        print('{} was not found on local installation'.format(model_name))
+        print('\n\n***\n\n{} was not found on local installation'.format(model_name))
     print('{} can be downloaded from {}'.format(model_name, model_website))
-    print('\nThe terms and conditions of the data set license apply. Intel does not '
+    print('The terms and conditions of the data set license apply. Intel does not '
           'grant any rights to the data files or database\n')
-    response = input('\nTo download \'{}\' from {}, please enter YES: '.
+    response = input('To download \'{}\' from {}, please enter YES: '.
                      format(model_name, model_website))
     res = response.lower().strip()
     if res == "yes" or (len(res) == 1 and res == 'y'):

0 commit comments