.. ---------------------------------------------------------------------------
.. Copyright 2016-2018 Intel Corporation
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
.. ---------------------------------------------------------------------------

Set Expansion Solution
######################

Overview
========
Term set expansion is the task of expanding a given partial set of terms into
a more complete set of terms that belong to the same semantic class. For
example, given the seed terms *apple* and *orange*, a set expansion system
should return other fruits such as *banana* and *grape*. This solution
demonstrates the capability of a corpus-based set expansion system in a
simple web application.

.. image:: assets/expansion_demo.png

Algorithm Overview
==================
Our approach is described in Mamou et al. (2018). It represents each term of
the training corpus with a word embedding and uses these embeddings to
estimate the similarity between the seed terms and any candidate term. Noun
phrases provide a good approximation of candidate terms and are extracted in
our system by a noun phrase chunker. At expansion time, given a set of seed
terms, the most similar terms are returned.
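
The core expansion step can be sketched in a few lines: average the embeddings
of the seed terms, then rank every candidate term by the cosine similarity of
its embedding to that centroid. Below is a minimal illustration with toy
vectors; it is not the actual np2vec code.

.. code:: python

    # Toy sketch of expansion: rank candidates by cosine similarity between
    # their embedding and the centroid of the seed embeddings.
    import numpy as np

    # Toy vectors for illustration only; real vectors come from a trained model.
    embeddings = {
        "apple":  np.array([0.9, 0.1, 0.0]),
        "orange": np.array([0.8, 0.2, 0.1]),
        "banana": np.array([0.85, 0.15, 0.05]),
        "car":    np.array([0.1, 0.9, 0.3]),
    }

    def expand(seed_terms, topn=2):
        centroid = np.mean([embeddings[t] for t in seed_terms], axis=0)
        scores = {}
        for term, vec in embeddings.items():
            if term in seed_terms:
                continue
            scores[term] = vec @ centroid / (
                np.linalg.norm(vec) * np.linalg.norm(centroid))
        return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

    print(expand(["apple", "orange"]))  # "banana" ranks above "car"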

Flow
====

.. image:: assets/expansion_flow.png

Training
========

The first step in training is to prepare the data for generating a word
embedding model. We provide a subset of English Wikipedia at
datasets/wikipedia as a sample corpus under the
`Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__ (Copyright 2018 Wikimedia Foundation).
The output of this step is the marked corpus, in which noun phrases are
marked with the marking character (default: "\_") as described in the NLP
Architect :doc:`np2vec` module documentation. The pre-process script supports
either the NLP Architect :doc:`noun phrase extractor <spacy_np_annotator>`,
which uses an LSTM :doc:`chunker` model, or spaCy's built-in noun phrase
matcher. This is done by running:

.. code:: python

    python solutions/set_expansion/prepare_data.py --corpus TRAINING_CORPUS --marked_corpus MARKED_TRAINING_CORPUS

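To illustrate what marking produces, here is a rough sketch of noun-phrase
marking using spaCy's noun chunks. It assumes the ``en_core_web_sm`` model is
installed, and it follows the convention of joining the tokens of a
multi-word noun phrase with the marking character and suffixing the phrase
with the same character; the actual prepare_data.py script may differ in
details.

.. code:: python

    # Rough sketch of noun-phrase marking, not the actual prepare_data.py.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def mark_noun_phrases(text, mark_char="_"):
        doc = nlp(text)
        marked = text
        # Replace chunks from right to left so earlier offsets stay valid.
        for chunk in reversed(list(doc.noun_chunks)):
            if len(chunk) > 1:
                joined = mark_char.join(t.text for t in chunk) + mark_char
                marked = marked[:chunk.start_char] + joined + marked[chunk.end_char:]
        return marked

    print(mark_noun_phrases("The quick brown fox jumps over the lazy dog."))
    # e.g. -> The_quick_brown_fox_ jumps over the_lazy_dog_.
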
The next step is to train the model using the NLP Architect :doc:`np2vec`
module. For set expansion, we recommend the hyperparameter values size=100,
min_count=10, window=10 and hs=0. Please refer to the np2vec module
documentation for more details about these parameters.

.. code:: python

    python examples/np2vec/train.py --size 100 --min_count 10 --window 10 --hs 0 --corpus MARKED_TRAINING_CORPUS --np2vec_model_file MODEL_PATH --corpus_format txt

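The np2vec module builds on gensim's word2vec implementation. For reference,
a plain gensim run with the same hyperparameters looks roughly like the
sketch below; the actual train.py adds noun-phrase-aware handling on top of
this.

.. code:: python

    # Rough gensim equivalent of the recommended training configuration.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence("MARKED_TRAINING_CORPUS")  # one sentence per line
    model = Word2Vec(
        sentences,
        vector_size=100,  # called "size" in gensim < 4.0 and in train.py
        min_count=10,     # ignore terms with fewer than 10 occurrences
        window=10,        # context window of 10 tokens
        hs=0,             # negative sampling rather than hierarchical softmax
    )
    model.wv.save_word2vec_format("MODEL_PATH")
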
A `pretrained model <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201_pretrained_set_expansion.txt>`__
trained on an English Wikipedia dump (enwiki-20171201-pages-articles-multistream.xml.bz2)
is available under the Apache 2.0 license. It has been trained with the
hyperparameter values recommended above. The full English Wikipedia
`raw corpus <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201.txt>`_ and
`marked corpus <http://nervana-modelzoo.s3.amazonaws.com/NLP/SetExp/enwiki-20171201_spacy_marked.txt>`_
are also available under the
`Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__.


Inference
=========

The inference step expands a given set of seed terms into a set of terms that
belong to the same semantic class. It can be done in two ways:

1. Running a Python script (a sketch of the underlying expansion query
   appears below this list):

   .. code:: python

       python solutions/set_expansion/set_expand.py --np2vec_model_file MODEL_PATH --topn TOPN

2. Web application

   A. Loading the expand server with the trained model:

      .. code:: python

          python expand_server.py [--host HOST] [--port PORT] model_path

      The expand server receives requests containing seed terms and expands
      them based on the given word embedding model. You can use the model you
      trained in the previous step, or provide a pretrained model of your
      own. **Important note**: by default the server listens on
      localhost:1234. If you set a different host or port, you should also
      set it in the ui/settings.py file.


   B. Run the UI application:

      .. code:: python

          bokeh serve --show ui

      The UI is a simple web-based application for performing expansion. The
      application communicates with the server by sending expand requests,
      presents the results in a simple table, and can export them to a CSV
      file. It allows you to either type the terms to expand directly or
      select terms from the model vocabulary list. After you get expansion
      results, you can re-expand by selecting terms from the results (hold
      the Ctrl key for multiple selection). **Important note**: if you set
      the host/port of the expand server, you should also set it in the
      ui/settings.py file. You can also serve the UI application remotely
      using the bokeh options --address and --port, for example:

      .. code:: python

          bokeh serve ui --address=12.13.14.15 --port=1010 --allow-websocket-origin=12.13.14.15:1010

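As referenced in option 1 above, the core of an expansion query can be
sketched with gensim, which can read models saved in word2vec text format.
This is a minimal illustration, assuming the model was saved as in the
training step; the actual set_expand.py script may differ in details.

.. code:: python

    # Minimal sketch of an expansion query over a trained np2vec model.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("MODEL_PATH")

    # Seed terms must match the model vocabulary; multi-word noun phrases
    # use the marked form, e.g. "new_york_".
    seeds = ["apple", "orange"]  # example seeds; any in-vocabulary terms work

    # most_similar averages the seed vectors and returns the nearest terms.
    for term, score in wv.most_similar(positive=seeds, topn=10):
        print(term, round(score, 3))
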
Citation
========

`Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow
<http://arxiv.org/abs/1807.10104>`__, Jonathan Mamou,
Oren Pereg, Moshe Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin,
Peter Izsak, Daniel Korat, COLING 2018 System Demonstration paper.