This repository was archived by the owner on Nov 8, 2022. It is now read-only.

Commit 72ac75e

Author: Peter Izsak (committed)

Updated documentation

1 parent 090e580 commit 72ac75e

File tree

1 file changed (+10, -11 lines)


doc/source/term_set_expansion.rst

Lines changed: 10 additions & 11 deletions
@@ -31,8 +31,8 @@ Algorithm Overview
 Our approach is described by (Mamou et al, 2018). It is based on representing any
 term of a
 training corpus using word embeddings in order
-to estimate the similarity between the seed terms and any candidate term. Noun phrases provide
-good approximation for candidate terms and are extracted in our system using a noun phrase chunker.
+to estimate the similarity between the seed terms and any candidate term. Noun phrases provide
+good approximation for candidate terms and are extracted in our system using a noun phrase chunker.
 At expansion time, given a seed of terms, the most similar terms are returned.
 
 Flow
@@ -42,19 +42,19 @@ Flow
 
 Training
 ========
-
-The first step in training is to prepare the data for generating a word embedding model. We
-provide a subset of English Wikipedia at datasets/wikipedia as a sample corpus under the
+
+The first step in training is to prepare the data for generating a word embedding model. We
+provide a subset of English Wikipedia at datasets/wikipedia as a sample corpus under the
 `Creative Commons Attribution-Share-Alike 3.0 License <https://creativecommons.org/licenses/by-sa/3.0/>`__ (Copyright 2018 Wikimedia Foundation).
-The output of this step is the marked corpus where noun phrases are marked with the marking character (default: "\_") as described in the `NLP Architect np2vec module documentation <http://nlp_architect.nervanasys.com/np2vec.html>`__.
+The output of this step is the marked corpus where noun phrases are marked with the marking character (default: "\_") as described in the NLP Architect :doc:`np2vec` module documentation. The pre-process script supports using the NLP Architect :doc:`noun phrase extractor <spacy_np_annotator>`, which uses an LSTM :doc:`chunker` model, or using spaCy's own noun phrase matcher.
 This is done by running:
 
 .. code:: python
 
     python solutions/set_expansion/prepare_data.py --corpus TRAINING_CORPUS --marked_corpus MARKED_TRAINING_CORPUS
 
-The next step is to train the model using `NLP Architect np2vec module <http://nlp_architect.nervanasys.com/np2vec.html>`__.
-For set expansion, we recommend the following values 100, 10, 10, 0 for respectively,
+The next step is to train the model using the NLP Architect :doc:`np2vec` module.
+For set expansion, we recommend the values 100, 10, 10 and 0, respectively, for the
 size, min_count, window and hs hyperparameters. Please refer to the np2vec module documentation for more details about these parameters.
 
 .. code:: python
@@ -125,6 +125,5 @@ Citation
 
 `Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow
 <http://arxiv.org/abs/1807.10104>`__, Jonathan Mamou,
-Oren Pereg, Moshe Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin,
-Peter Izsak, Daniel Korat, COLING 2018 System Demonstration paper.
-
+Oren Pereg, Moshe Wasserblat, Ido Dagan, Yoav Goldberg, Alon Eirew, Yael Green, Shira Guskin,
+Peter Izsak, Daniel Korat, COLING 2018 System Demonstration paper.
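The corpus-marking step the diff documents (joining the tokens of each extracted noun phrase with the marking character "_") can be sketched in a few lines of pure Python. The `mark_noun_phrases` helper and the hard-coded phrase list below are hypothetical stand-ins for the chunker output and for prepare_data.py, not the actual implementation:

```python
# Minimal sketch of the corpus-marking step, assuming noun phrases have
# already been extracted by a chunker. This helper is a hypothetical
# stand-in for prepare_data.py, not its real implementation.

def mark_noun_phrases(text, noun_phrases, marker="_"):
    """Join the tokens of each noun phrase with the marking character."""
    # Replace longer phrases first so a short phrase cannot clobber
    # part of a longer one (e.g. "new york" inside "new york city").
    for phrase in sorted(noun_phrases, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(" ", marker))
    return text

marked = mark_noun_phrases(
    "the united states is a country in north america",
    ["united states", "north america"],
)
print(marked)  # the united_states is a country in north_america
```

Marking multiword noun phrases as single tokens is what lets the downstream word-embedding trainer learn one vector per phrase rather than one per word.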
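The expansion step ("given a seed of terms, the most similar terms are returned") can likewise be sketched without the trained model: rank candidate terms by cosine similarity between each candidate's embedding and the centroid of the seed-term embeddings. The toy 2-d vectors and the `expand` helper below are illustrative assumptions, not real np2vec output or the solution's actual ranking code:

```python
import math

# Sketch of set expansion: score each candidate term by the cosine
# similarity between its embedding and the centroid of the seed terms'
# embeddings, then return the top-ranked candidates.

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def expand(seed, candidates, embeddings, topn=2):
    query = centroid([embeddings[t] for t in seed])
    ranked = sorted(
        (t for t in candidates if t not in seed),
        key=lambda t: cosine(embeddings[t], query),
        reverse=True,
    )
    return ranked[:topn]

# Toy embeddings: fruits cluster in one direction, the car part in another.
embeddings = {
    "apple": [0.9, 0.1],
    "banana": [0.8, 0.2],
    "cherry": [0.85, 0.15],
    "carburetor": [0.1, 0.9],
}
print(expand(["apple", "banana"], embeddings, embeddings))
# ['cherry', 'carburetor']  -- cherry ranks first, closest to the seed centroid
```

Using the seed centroid as the query is one common way to aggregate multiple seed terms; scoring against each seed term individually and averaging the similarities is an equally reasonable variant.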
