This repository was archived by the owner on Nov 8, 2022. It is now read-only.

Commit 8784c11

Merge pull request #252 from NervanaSystems/set_expansion_PR
Set Expansion Solution PR
2 parents: 0e8cd8c + 58568a6

File tree

22 files changed: +1346 -10 lines changed

.gitignore

Lines changed: 3 additions & 2 deletions

@@ -12,13 +12,13 @@
 .styleenv
 .coverage
 build
-*.gz
 generated
 *.ropeproject
 *.cubin
 *.hdf5
 *.h5
 *.html
+!solutions/set_expansion/ui/templates/*.html
 .vscode
 !server/web_service/static/*.html
 !tests/fixtures/data/server/*.gz
@@ -31,4 +31,5 @@ pylint.html
 pylint.txt
 flake8.txt
 nlp_architect/pipelines/bist-pretrained/*
-nlp_architect/api/ner-pretrained/*
+venv
+nlp_architect/api/ner-pretrained/*

Makefile

Lines changed: 2 additions & 2 deletions

@@ -14,8 +14,8 @@
 # limitations under the License.
 # ******************************************************************************

-FLAKE8_CHECK_DIRS := examples nlp_architect/* server tests
-PYLINT_CHECK_DIRS := examples nlp_architect server tests setup
+FLAKE8_CHECK_DIRS := examples nlp_architect/* server tests solutions
+PYLINT_CHECK_DIRS := examples nlp_architect server tests setup solutions
 DOC_DIR := doc
 DOC_PUB_RELEASE_PATH := $(DOC_PUB_PATH)/$(RELEASE)
Binary file (44.9 MB) not shown.
Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@
Data:
==========
enwiki-20171201_subset.txt is a subset of Wikimedia English data dumps:

https://meta.wikimedia.org/wiki/Data_dumps
https://dumps.wikimedia.org/enwiki/

License:
==========
Creative Commons Attribution-Share-Alike 3.0 License
https://creativecommons.org/licenses/by-sa/3.0/

doc/source/index.rst

Lines changed: 7 additions & 0 deletions

@@ -128,6 +128,13 @@ on this project, please see the :doc:`developer guide <developer_guide>`.
    memn2n.rst
    kvmemn2n.rst

+.. toctree::
+   :hidden:
+   :maxdepth: 1
+   :caption: Solutions
+
+   term_set_expansion.rst
+
 .. toctree::
    :hidden:
    :maxdepth: 1

doc/source/term_set_expansion.rst

Lines changed: 129 additions & 0 deletions

@@ -0,0 +1,129 @@
.. ---------------------------------------------------------------------------
.. Copyright 2016-2018 Intel Corporation
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
.. ---------------------------------------------------------------------------

Set Expansion Solution
######################

Overview
========
Term set expansion is the task of expanding a given partial set of terms into
a more complete set of terms that belong to the same semantic class. This
solution demonstrates the capability of a corpus-based set expansion system
in a simple web application.

.. image:: assets/expansion_demo.png
Algorithm Overview
==================
Our approach is described in Mamou et al. (2018). It is based on representing
each term of the training corpus with word embeddings, which are used to
estimate the similarity between the seed terms and any candidate term. Noun
phrases provide a good approximation of candidate terms and are extracted in
our system using a noun phrase chunker. At expansion time, given a set of
seed terms, the most similar terms are returned.
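At its core, the expansion step is a nearest-neighbour query in the embedding
space. The following is a minimal sketch of that idea (not the solution's own
code), assuming a word2vec-format model whose vocabulary contains the marked
noun-phrase terms:

.. code:: python

    from gensim.models import KeyedVectors

    # Load a word2vec-format model (e.g. the output of np2vec training).
    model = KeyedVectors.load_word2vec_format('MODEL_PATH')

    # gensim averages the seed vectors and returns the vocabulary terms whose
    # embeddings are closest (by cosine similarity) to that centroid.
    seed = ['deep_learning_', 'machine_learning_']  # hypothetical marked terms
    for term, score in model.most_similar(positive=seed, topn=10):
        print(term, score)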
Flow
====

.. image:: assets/expansion_flow.png

Training
========
The first step in training is to prepare the data for generating a word
embedding model. We provide a subset of English Wikipedia at
datasets/wikipedia as a sample corpus, distributed under the
`Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__ (Copyright 2018 Wikimedia Foundation).
The output of this step is the marked corpus, in which noun phrases are
marked with the marking character (default: "\_"), as described in the NLP
Architect :doc:`np2vec` module documentation. The pre-processing script can
extract noun phrases either with the NLP Architect :doc:`noun phrase
extractor <spacy_np_annotator>`, which uses an LSTM :doc:`chunker` model, or
with spaCy's built-in noun phrase matcher. To run it:

.. code:: bash

    python solutions/set_expansion/prepare_data.py --corpus TRAINING_CORPUS --marked_corpus MARKED_TRAINING_CORPUS
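For intuition, marking a corpus amounts to rewriting each noun phrase as a
single vocabulary item. A rough sketch of the idea using spaCy's built-in
noun chunks (the actual prepare_data.py script handles tokenization and the
extractor options described above):

.. code:: python

    import spacy

    nlp = spacy.load('en_core_web_sm')

    def mark_noun_phrases(text, mark_char='_'):
        """Naively rewrite noun phrases as single marked tokens,
        e.g. 'computer science' -> 'computer_science_'."""
        doc = nlp(text)
        for np in doc.noun_chunks:
            text = text.replace(np.text,
                                np.text.replace(' ', mark_char) + mark_char)
        return text

    print(mark_noun_phrases('Machine learning is a subfield of computer science.'))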
The next step is to train the model using the NLP Architect :doc:`np2vec`
module. For set expansion, we recommend setting the hyperparameters to
size=100, min_count=10, window=10, and hs=0. Please refer to the np2vec
module documentation for more details about these parameters.

.. code:: bash

    python examples/np2vec/train.py --size 100 --min_count 10 --window 10 --hs 0 --corpus MARKED_TRAINING_CORPUS --np2vec_model_file MODEL_PATH --corpus_format txt
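Since :doc:`np2vec` builds on gensim's word2vec, the command above
corresponds roughly to the following gensim 3.x call (a simplified
assumption; the np2vec module adds its own handling of marked noun phrases):

.. code:: python

    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # Stream the marked corpus (one sentence per line) and train word2vec
    # with the hyperparameters recommended above.
    sentences = LineSentence('MARKED_TRAINING_CORPUS')
    model = Word2Vec(sentences, size=100, min_count=10, window=10, hs=0)
    model.wv.save_word2vec_format('MODEL_PATH')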
A `pretrained model <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201_pretrained_set_expansion.txt>`__,
trained on an English Wikipedia dump (enwiki-20171201-pages-articles-multistream.xml.bz2)
with the hyperparameter values recommended above, is available under the
Apache 2.0 license. The full English Wikipedia `raw corpus <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201.txt>`_ and
`marked corpus <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201_spacy_marked.txt>`_
are also available under the
`Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__.
Inference
=========

The inference step expands the given seed terms into a set of terms that
belong to the same semantic class. It can be done in two ways:

1. Running a Python script:

.. code:: bash

    python solutions/set_expansion/set_expand.py --np2vec_model_file MODEL_PATH --topn TOPN
2. Web application

A. Load the expand server with the trained model:

.. code:: bash

    python expand_server.py [--host HOST] [--port PORT] model_path

The expand server receives requests containing seed terms and expands them
based on the given word embedding model. You can use the model you trained
yourself in the previous step, or provide a pre-trained model of your own.
**Important note**: by default, the server listens on localhost:1234. If you
set a different host/port, you should also set it in the ui/settings.py file.
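The UI reads the server address from ui/settings.py, so the two must match.
A purely illustrative sketch of such a setting (the actual variable names in
settings.py may differ; check the file itself):

.. code:: python

    # ui/settings.py (illustrative only; real names may differ)
    EXPAND_HOST = '10.0.0.5'  # hypothetical host passed to expand_server.py
    EXPAND_PORT = 1234        # must match the port the expand server uses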
B. Run the UI application:

.. code:: bash

    bokeh serve --show ui

The UI is a simple web-based application for performing expansion. It
communicates with the server by sending expand requests, presents the
results in a simple table, and can export them to a CSV file. You can either
type the terms to expand directly or select terms from the model's
vocabulary list. Once you get expansion results, you can re-expand by
selecting terms from the results (hold the Ctrl key for multiple selection).
**Important note**: if you set the host/port of the expand server, you
should also set them in the ui/settings.py file. You can also serve the UI
remotely using the bokeh options --address and --port, for example:

.. code:: bash

    bokeh serve ui --address=12.13.14.15 --port=1010 --allow-websocket-origin=12.13.14.15:1010
Citation
========

`Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow
<http://arxiv.org/abs/1807.10104>`__, Jonathan Mamou, Oren Pereg, Moshe
Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin,
Peter Izsak, Daniel Korat. COLING 2018 System Demonstration paper.

nlp_architect/models/np2vec.py

Lines changed: 2 additions & 2 deletions

@@ -239,7 +239,7 @@ def save(self, np2vec_model_file='np2vec.model', binary=False):
         total_vec = 0
         vector_size = self.model.vector_size
         for word in self.model.wv.vocab.keys():
-            if self.is_marked(word):
+            if self.is_marked(word) and len(word) > 1:
                 total_vec += 1
         logger.info(
             "storing %sx%s projection weights for NP's into %s",
@@ -250,7 +250,7 @@ def save(self, np2vec_model_file='np2vec.model', binary=False):
         for word, vocab in sorted(
                 iteritems(
                     self.model.wv.vocab), key=lambda item: -item[1].count):
-            if self.is_marked(word):
+            if self.is_marked(word) and len(word) > 1:  # discard empty marked np's
                 embedding_vec = self.model.wv.syn0[vocab.index]
                 if binary:
                     fout.write(

nlp_architect/utils/generic.py

Lines changed: 4 additions & 4 deletions

@@ -134,13 +134,13 @@ def get_paddedXY_sequence(X, y, vocab_size=20000, sentence_length=100, oov=2,

 def license_prompt(model_name, model_website, dataset_dir=None):
     if dataset_dir:
-        print('{} was not found in the directory: {}'.format(model_name, dataset_dir))
+        print('\n\n***\n{} was not found in the directory: {}'.format(model_name, dataset_dir))
     else:
-        print('{} was not found on local installation'.format(model_name))
+        print('\n\n***\n\n{} was not found on local installation'.format(model_name))
     print('{} can be downloaded from {}'.format(model_name, model_website))
-    print('\nThe terms and conditions of the data set license apply. Intel does not '
+    print('The terms and conditions of the data set license apply. Intel does not '
           'grant any rights to the data files or database\n')
-    response = input('\nTo download \'{}\' from {}, please enter YES: '.
+    response = input('To download \'{}\' from {}, please enter YES: '.
                      format(model_name, model_website))
     res = response.lower().strip()
     if res == "yes" or (len(res) == 1 and res == 'y'):

0 commit comments