
Commit 53bffb8

amityaccobi authored and peteriz committed

Amit/v0.1 legal fixes (#208)
* server - fixed legal issues
* Fix validate_parent_exists()
* np_semantic_segmentation - fixes:
  1. Legal issue
  2. Prompt added to nltk.download datasets
* WSD - fixes:
  1. Fixed legal requirements
  2. Removed 'requirements.txt'
  3. Added prompt for nltk.download()
* NP_Semantic_seg - small fix in nltk downloaded corpora check
* Fix server test failures:
  1. Fixed wrong header (format -> Response-Format)
  2. Added .gz files
  3. Added .gz files ignore rule to the .gitignore file (specifically for the tests/fixtures/data/server/ dir)
1 parent 585b441 commit 53bffb8

File tree

15 files changed (+72, -29 lines changed)


.gitignore
Lines changed: 1 addition & 0 deletions

@@ -20,6 +20,7 @@ generated
 *.h5
 *.html
 !server/web_service/visualizer/displacy/*.html
+!tests/fixtures/data/server/*.gz
 *.log
 .idea/
 dist

doc/source/np_segmentation.rst
Lines changed: 2 additions & 4 deletions

@@ -69,8 +69,7 @@ Dataset <https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz>`__.
 Is also available in
 `here <https://www.isi.edu/publications/licensed-sw/fanseparser/index.html>`__.
 (The terms and conditions of the data set license apply. Intel does not
-grant any rights to the data files or database. see relevant `license
-agreement <http://www.apache.org/licenses/LICENSE-2.0>`__)
+grant any rights to the data files or database.
 
 After downloading and unzipping the dataset, run
 ``preprocess_tratz2011.py`` in order to construct the labeled data and
@@ -97,8 +96,7 @@ command ``python data.py``
 - Pre-trained Google News Word2vec model can download
   `here <https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing>`__
 - The terms and conditions of the data set license apply. Intel does
-  not grant any rights to the data files or database. see relevant
-  `license agreement <http://www.apache.org/licenses/LICENSE-2.0>`__
+  not grant any rights to the data files or database.
 
 - Cosine distance between 2 words in the Noun-Phrase.
 - NLTKCollocations score (PMI score (from Manning and Schutze 5.4) and Chi-square score (Manning and Schutze 5.3.3)).

doc/source/word_sense.rst
Lines changed: 1 addition & 0 deletions

@@ -83,6 +83,7 @@ Dataset Preparation
 
 The script prepare_data.py uses the gold standard csv file as described in the requirements section above
 using pretrained Google News Word2vec model. Pretrained Google News Word2vec model can be download here_.
+The terms and conditions of the data set license apply. Intel does not grant any rights to the data files.
 
 .. code:: python
 

examples/most_common_word_sense/feature_extraction.py
Lines changed: 17 additions & 2 deletions

@@ -20,8 +20,23 @@
 from numpy import dot
 from numpy.linalg import norm
 
-nltk.download('averaged_perceptron_tagger')
-nltk.download('punkt')
+from nlp_architect.utils.generic import license_prompt
+
+try:
+    nltk.data.find('taggers/averaged_perceptron_tagger')
+except LookupError:
+    if license_prompt('Averaged Perceptron Tagger', 'http://www.nltk.org/nltk_data/') is False:
+        raise Exception("can't continue data prepare process "
+                        "without downloading averaged_perceptron_tagger")
+    nltk.download('averaged_perceptron_tagger')
+
+try:
+    nltk.data.find('tokenizers/punkt')
+except LookupError:
+    if license_prompt('Punkt model', 'http://www.nltk.org/nltk_data/') is False:
+        raise Exception("can't continue data prepare process "
+                        "without downloading punkt")
+    nltk.download('punkt')
 
 # -------------------------------------------------------------------------------------#
 
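license_prompt() itself comes from nlp_architect.utils.generic and is not shown in this commit. A minimal sketch of its assumed behavior, inferred from the call sites above (print the third-party license notice and return True only on explicit user consent; the exact wording and implementation are assumptions):

    def license_prompt(model_name, model_website):
        # Sketch only: the real implementation lives in nlp_architect.utils.generic.
        # Assumed behavior: show where the data comes from, point at the license,
        # and return True only if the user explicitly agrees to download it.
        print("'{}' will be downloaded from {}".format(model_name, model_website))
        print('The terms and conditions of the data set license apply. '
              'Intel does not grant any rights to the data files.')
        answer = input('Do you agree to the above terms and conditions? [y/N] ')
        return answer.strip().lower() == 'y'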

examples/most_common_word_sense/prepare_data.py
Lines changed: 3 additions & 2 deletions

@@ -18,7 +18,6 @@
 """
 
 import argparse
-import codecs
 import csv
 import logging
 import math
@@ -27,9 +26,11 @@
 import gensim
 import numpy as np
 from feature_extraction import extract_features_envelope
-from nlp_architect.utils.io import validate_existing_directory, validate_existing_filepath
 from sklearn.model_selection import train_test_split
 
+from nlp_architect.utils.io import validate_existing_filepath, \
+    check_size, validate_parent_exists
+
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)
 
examples/most_common_word_sense/requirements.txt

Lines changed: 0 additions & 11 deletions
This file was deleted.

examples/np_semantic_segmentation/README.md
Lines changed: 3 additions & 4 deletions

@@ -19,24 +19,23 @@ The expected dataset is a CSV file with 2 columns. the first column contains the
 
 If you wish to use an existing dataset for training the model, you can download Tratz 2011 et al. dataset [1,2] from the following link:
 [Tratz 2011 Dataset](https://vered1986.github.io/papers/Tratz2011_Dataset.tar.gz). Is also available in [here](https://www.isi.edu/publications/licensed-sw/fanseparser/index.html).
-(The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database. see relevant [license agreement](http://www.apache.org/licenses/LICENSE-2.0))
+(The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database.
 
 
 After downloading and unzipping the dataset, run `preprocess_tratz2011.py` in order to construct the labeled data and save it in a CSV file (as expected for the model).
 the scripts read 2 .tsv files ('tratz2011_coarse_grained_random/train.tsv' and 'tratz2011_coarse_grained_random/val.tsv') and outputs 2 .csv files accordingly.
 
 Parameters can be obtained by running:
 
-    python preprocess_tratz2011.py -h
-    --data path_to_Tratz_2011_dataset_folder
+    python preprocess_tratz2011.py --data path_to_Tratz_2011_dataset_folder
 
 
 ### Pre-processing the data:
 A feature vector is extracted from each Noun-Phrase string using the command `python data.py`
 
 * Word2Vec word embedding (300 size vector for each word in the Noun-Phrase) .
 * Pre-trained Google News Word2vec model can download [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing)
-* The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database. see relevant [license agreement](http://www.apache.org/licenses/LICENSE-2.0)
+* The terms and conditions of the data set license apply. Intel does not grant any rights to the data files or database.
 * Cosine distance between 2 words in the Noun-Phrase.
 * NLTKCollocations score (NPMI and UCI scores).
 * A binary features whether the Noun-Phrase has existing entity in Wikidata.
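As a small illustration of the cosine-distance feature listed above (a sketch only, using plain numpy; the project's actual feature code is in data.py and feature_extraction.py):

    import numpy as np

    def cosine_distance(vec_a, vec_b):
        # Cosine distance = 1 - cosine similarity; lower means more similar
        similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
        return 1.0 - similarity

    # Illustrative 300-dimensional embeddings for the two words of a noun phrase
    word1_vec = np.random.rand(300)
    word2_vec = np.random.rand(300)
    print(cosine_distance(word1_vec, word2_vec))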

examples/np_semantic_segmentation/feature_extraction.py
Lines changed: 16 additions & 2 deletions

@@ -24,6 +24,8 @@
 from nltk.corpus import wordnet as wn
 from nltk.stem.snowball import SnowballStemmer
 
+from nlp_architect.utils.generic import license_prompt
+
 stemmer = SnowballStemmer("english")
 headers = {"Accept": "application/json"}
 
@@ -33,7 +35,13 @@ class NLTKCollocations:
     NLTKCollocations score using NLTK framework on Brown dataset
     """
     def __init__(self):
-        nltk.download('brown')
+        try:
+            nltk.data.find('corpora/brown')
+        except LookupError:
+            if license_prompt('brown data set', 'http://www.nltk.org/nltk_data/') is False:
+                raise Exception("can't continue data prepare process "
+                                "without downloading brown dataset")
+            nltk.download('brown')
         self.bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(
             nltk.corpus.brown.words())
         self.bigram_messure = nltk.collocations.BigramAssocMeasures()
@@ -235,7 +243,13 @@ class Wordnet:
     """
 
     def __init__(self):
-        nltk.download('wordnet')
+        try:
+            nltk.data.find('corpora/wordnet')
+        except LookupError:
+            if license_prompt('WordNet data set', 'http://www.nltk.org/nltk_data/') is False:
+                raise Exception("can't continue data prepare process "
+                                "without downloading WordNet dataset")
+            nltk.download('wordnet')
         self.wordnet = wn
 
     def find_wordnet_existence(self, candidates):
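For context, a minimal sketch of how the Brown-corpus bigram finder built in __init__ can score a candidate bigram with NLTK (PMI shown; the candidate words are illustrative, and the class's actual scoring logic lives elsewhere in this file):

    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    # Assumes the brown corpus is already available (see the guarded download above)
    finder = BigramCollocationFinder.from_words(nltk.corpus.brown.words())
    measures = BigramAssocMeasures()

    # PMI score for one candidate bigram taken from a noun phrase
    print(finder.score_ngram(measures.pmi, 'united', 'states'))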

nlp_architect/utils/io.py
Lines changed: 4 additions & 1 deletion

@@ -135,7 +135,10 @@ def validate_existing_directory(arg):
 
 def validate_parent_exists(arg):
     """Validates an input argument is a path string, and its parent directory exists."""
-    return validate_existing_directory(os.path.dirname(arg))
+    arg = path.abspath(arg)
+    dir_arg = os.path.dirname(os.path.abspath(arg))
+    if not validate_existing_directory(dir_arg) is None:
+        return arg
 
 
 def sanitize_path(path):
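A quick illustration of the fixed behavior: validate_parent_exists() now returns the validated absolute path itself, rather than the result of validating its parent directory, which makes it usable as an argparse type= callable for output-path arguments. The example path is illustrative:

    from nlp_architect.utils.io import validate_parent_exists

    # If /tmp exists, the absolute path is returned unchanged;
    # otherwise validate_existing_directory() rejects the parent directory.
    output_path = validate_parent_exists('/tmp/model_output.csv')
    print(output_path)  # /tmp/model_output.csv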

server/web_service/visualizer/displacy/displacy-ent.js
Lines changed: 1 addition & 0 deletions

@@ -1,6 +1,7 @@
 //- ----------------------------------
 //- 💥 DISPLACY ENT
 //- ----------------------------------
+/* this file is taken from: "https://github.com/explosion/displacy-ent" */
 
 'use strict';
 
