
Commit 58568a6

Merge branch 'master' into set_expansion_PR

2 parents: c07fc0a + 0e8cd8c

File tree: 15 files changed, +1926 −2 lines


doc/source/api.rst

Lines changed: 3 additions & 2 deletions
@@ -49,6 +49,8 @@ to train the model weights, perform inference, and save/load the model.
 nlp_architect.models.bist_parser.BISTModel
 nlp_architect.models.memn2n_dialogue.MemN2N_Dialog
 nlp_architect.models.kvmemn2n.KVMemN2N
+nlp_architect.models.supervised_sentiment.simple_lstm
+nlp_architect.models.supervised_sentiment.one_hot_cnn


 ``nlp_architect.layers``
@@ -88,7 +90,7 @@ these will be placed into a central repository.
 nlp_architect.data.sequential_tagging.SequentialTaggingDataset
 nlp_architect.data.babi_dialog.BABI_Dialog
 nlp_architect.data.wikimovies.WIKIMOVIES
-
+nlp_architect.data.amazon_reviews.Amazon_Reviews


 ``nlp_architect.pipelines``
@@ -117,4 +119,3 @@ NLP pipelines modules using models implemented from ``nlp_architect.models``.

 server.serve
 server.service
-
doc/source/index.rst

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,7 @@ The library contains state-of-art and novel NLP and NLU models in a varity of to
 - NER and NE expansion
 - Text chunking
 - Reading comprehension
+- Supervised sentiment analysis


 Deep Learning frameworks
@@ -115,6 +116,7 @@ on this project, please see the :doc:`developer guide <developer_guide>`.
 bist_parser.rst
 word_sense.rst
 np2vec.rst
+supervised_sentiment.rst
 tcn.rst

 .. toctree::

doc/source/overview.rst

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ The library contains state-of-art and novel NLP and NLU models in a varity of to
 - NER and NE expansion
 - Text chunking
 - Reading comprehension
+- Supervised sentiment analysis

 Deep Learning frameworks
 ````````````````````````
doc/source/supervised_sentiment.rst (new file)

Lines changed: 87 additions & 0 deletions
.. ---------------------------------------------------------------------------
.. Copyright 2017-2018 Intel Corporation
..
.. Licensed under the Apache License, Version 2.0 (the "License");
.. you may not use this file except in compliance with the License.
.. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.
.. ---------------------------------------------------------------------------

Supervised Sentiment
####################

Overview
========

This is a set of models which serve as examples of supervised implementations for sentiment analysis. The larger idea behind these models is to allow ensembling with other supervised or unsupervised models.

Files
=====

- **nlp_architect/models/supervised_sentiment.py**: Sentiment analysis models, currently an LSTM and a one-hot CNN
- **nlp_architect/data/amazon_reviews.py**: Code which downloads and processes the Amazon datasets described below
- **nlp_architect/utils/ensembler.py**: Contains the ensembling algorithm(s)
- **example_ensemble.py**: An example of how the sentiment models can be trained and ensembled
- **optimize_example.py**: An example of using a hyperparameter optimizer with the simple LSTM model


Models
======
Two models are shown as classification examples. Additional models can be added as desired.

Bi-directional LSTM
-------------------
A simple bidirectional LSTM with one fully connected layer. The vocabulary size, dense output size, and document input length should be determined in the data preprocessing steps. The user can then change the size of the LSTM hidden layer and the recurrent dropout rate.
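
To make those knobs concrete, here is a minimal sketch of such a model, assuming Keras (which the accompanying example scripts also use); the layer sizes below are illustrative placeholders, not the library's defaults:

.. code:: python

    from keras.models import Sequential
    from keras.layers import Embedding, Bidirectional, LSTM, Dense

    vocab_size, doc_len = 2000, 300   # determined during data preprocessing
    embed_dim, lstm_out, dense_out = 256, 140, 3

    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, input_length=doc_len))
    # The hidden size and recurrent_dropout are the user-tunable knobs
    model.add(Bidirectional(LSTM(lstm_out, recurrent_dropout=0.2)))
    model.add(Dense(dense_out, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])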

Temporal CNN
------------
As defined in "Text Understanding from Scratch" by Zhang and LeCun (2015, https://arxiv.org/pdf/1502.01710v4.pdf), this model is a series of 1D CNNs with max-pooling and fully connected layers. The frame sizes may be either large or small.
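
A compact sketch of a temporal CNN in this spirit, again assuming Keras; the exact stack in ``one_hot_cnn`` may differ, and the filter counts and alphabet size here are illustrative:

.. code:: python

    from keras.models import Sequential
    from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout

    n_chars, doc_len, dense_out = 67, 300, 3  # one-hot alphabet, input length, classes

    model = Sequential()
    # Input: each document is a (doc_len, n_chars) one-hot character matrix
    model.add(Conv1D(256, 7, activation='relu', input_shape=(doc_len, n_chars)))
    model.add(MaxPooling1D(3))
    model.add(Conv1D(256, 3, activation='relu'))
    model.add(MaxPooling1D(3))
    model.add(Flatten())
    model.add(Dense(1024, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(dense_out, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])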

Datasets
========
The dataset in this example is the Amazon Reviews dataset, though other datasets can easily be substituted.
The Amazon review dataset(s) should be downloaded from http://jmcauley.ucsd.edu/data/amazon/. These are `*.json.gzip` files which should be unzipped. The terms and conditions of the data set license apply. Intel does not grant any rights to the data files.
For best results, a medium-sized dataset should be chosen, though the algorithms will work on larger and smaller datasets as well. For experimentation, I chose the Movie and TV reviews.
Only the "overall", "reviewText", and "summary" columns of the review dataset are retained. The "overall" column is the star rating; it is transformed into a label where 4-5 stars is a positive review, 3 stars is neutral, and 1-2 stars is a negative review.
The "summary" (title) of the review is concatenated with the review text, and the result is cleaned.

The Amazon Review Dataset was published in the following papers:

- R. He and J. McAuley. "Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering." WWW, 2016. http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf
- J. McAuley, C. Targett, J. Shi, and A. van den Hengel. "Image-based recommendations on styles and substitutes." SIGIR, 2015. http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf

Running Modalities
==================

Ensemble Train/Test
-------------------
Currently, the pipeline shows a full train/test/ensemble cycle. The main pipeline can be run with the following command:

.. code:: bash

    python example_ensemble.py --file_path ./reviews_Movies_and_TV.json/

At the conclusion of training, a final confusion matrix is displayed.

Hyperparameter Optimization
---------------------------
An example of hyperparameter optimization is given using the Python package ``hyperopt``, which uses a Tree-structured Parzen Estimator to optimize the simple bi-LSTM model. To run this example, use the following command:

.. code:: bash

    python optimize_example.py --file_path ./reviews_Movies_and_TV.json/ --new_trials 50 --output_file ./data/optimize_output.pkl

The script writes the result of each trial to the specified pickle file.
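
For reference, a minimal sketch of what a TPE search over the bi-LSTM's knobs can look like with ``hyperopt``; the objective below is a stub, and the actual search space in ``optimize_example.py`` may differ:

.. code:: python

    from hyperopt import fmin, tpe, hp, Trials

    space = {
        'lstm_out': hp.quniform('lstm_out', 64, 256, 16),
        'recurrent_dropout': hp.uniform('recurrent_dropout', 0.0, 0.5),
    }

    def objective(params):
        # Train the model with `params` and return a loss to minimize,
        # e.g. 1.0 - validation accuracy. Stubbed here for illustration.
        return 1.0

    trials = Trials()
    best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
    print(best)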
examples/supervised_sentiment/README.md (new file)

Lines changed: 49 additions & 0 deletions
# Supervised Sentiment

This is a set of models which serve as examples of supervised implementations for sentiment analysis. The larger idea behind these models is to allow ensembling with other supervised or unsupervised models.

# Models
Two models are shown as classification examples. Additional models can be added as desired.

**Bi-directional LSTM**
A simple bidirectional LSTM with one fully connected layer. The vocabulary size, dense output size, and document input length should be determined in the data preprocessing steps. The user can then change the size of the LSTM hidden layer and the recurrent dropout rate.

**Temporal CNN**
As defined in "Text Understanding from Scratch" by Zhang and LeCun (2015, https://arxiv.org/pdf/1502.01710v4.pdf), this model is a series of 1D CNNs with max-pooling and fully connected layers. The frame sizes may be either large or small.

# Datasets
The dataset in this example is the Amazon Reviews dataset, though other datasets can easily be substituted.
The Amazon review dataset(s) should be downloaded from http://jmcauley.ucsd.edu/data/amazon/. These are `*.json.gzip` files which should be unzipped. The terms and conditions of the data set license apply. Intel does not grant any rights to the data files.
For best results, a medium-sized dataset should be chosen, though the algorithms will work on larger and smaller datasets as well. For experimentation, I chose the Movie and TV reviews.
Only the "overall", "reviewText", and "summary" columns of the review dataset are retained. The "overall" column is the star rating; it is transformed into a label where 4-5 stars is a positive review, 3 stars is neutral, and 1-2 stars is a negative review.
The "summary" (title) of the review is concatenated with the review text, and the result is cleaned.

The Amazon Review Dataset was published in the following papers:

- R. He and J. McAuley. "Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering." WWW, 2016. http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf
- J. McAuley, C. Targett, J. Shi, and A. van den Hengel. "Image-based recommendations on styles and substitutes." SIGIR, 2015. http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf

# Train/Test
Currently, the pipeline shows a full train/test/ensemble cycle. The main pipeline can be run with the following command:
```
python example_ensemble.py --file_path ./reviews_Movies_and_TV.json/
```
At the conclusion of training, a final confusion matrix is displayed.
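
The ensembling step combines the two models' class-probability matrices using accuracy-derived weights. A minimal sketch of such a weighted ensemble, assuming NumPy; the actual `simple_ensembler` in `nlp_architect/utils/ensembler.py` may differ:

```python
import numpy as np

def weighted_ensemble(prediction_matrices, weights):
    # Weighted sum of per-model (samples x classes) probability matrices;
    # weights are expected to sum to 1.
    stacked = np.stack(prediction_matrices)        # (models, samples, classes)
    return np.tensordot(weights, stacked, axes=1)  # (samples, classes)

preds_lstm = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
preds_cnn = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
combined = weighted_ensemble([preds_lstm, preds_cnn], [0.55, 0.45])
final_labels = combined.argmax(axis=1)
```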

# Hyperparameter optimization
An example of hyperparameter optimization is given using the Python package hyperopt, which uses a Tree-structured Parzen Estimator to optimize the simple bi-LSTM model. To run this example, use the following command:
```
python optimize_example.py --file_path ./reviews_Movies_and_TV.json/ --new_trials 50 --output_file ./data/optimize_output.pkl
```
The script writes the result of each trial to the specified pickle file.

examples/supervised_sentiment/__init__.py

Whitespace-only changes.
examples/supervised_sentiment/example_ensemble.py (new file)

Lines changed: 126 additions & 0 deletions
# ******************************************************************************
# Copyright 2017-2018 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ******************************************************************************

"""
This example uses the Amazon reviews, though additional datasets can easily be
substituted. It only requires text and a sentiment label.

The dataset is used to train two models (again, this can be expanded), and the
labels for the test data are then predicted. The same train and test data are
used for both models.

The ensembler takes the two prediction matrices and weights (as defined by
model accuracy) and determines the final prediction matrix.

Finally, the full classification report is displayed.

A similar pipeline could be utilized to train models on one dataset, predict
on a second dataset, and acquire a list of final predictions.
"""

import argparse

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from nlp_architect.data.amazon_reviews import Amazon_Reviews
from nlp_architect.utils.generic import to_one_hot
from nlp_architect.models.supervised_sentiment import simple_lstm, one_hot_cnn
from nlp_architect.utils.ensembler import simple_ensembler
from nlp_architect.utils.io import validate_existing_filepath, check_size

max_features = 2000
max_len = 300
batch_size = 32
embed_dim = 256
lstm_out = 140


def ensemble_models(data, args):
    # Split, train, test
    data.process()
    dense_out = len(data.labels[0])
    # Split once so both models see the same train/test data
    X_train_, X_test_, Y_train, Y_test = train_test_split(data.text, data.labels,
                                                          test_size=0.20, random_state=42)

    # Prep data for the LSTM model
    tokenizer = Tokenizer(num_words=max_features, split=' ')
    tokenizer.fit_on_texts(X_train_)
    X_train = tokenizer.texts_to_sequences(X_train_)
    X_train = pad_sequences(X_train, maxlen=max_len)
    X_test = tokenizer.texts_to_sequences(X_test_)
    X_test = pad_sequences(X_test, maxlen=max_len)

    # Train the LSTM model
    lstm_model = simple_lstm(max_features, dense_out, X_train.shape[1], embed_dim, lstm_out)
    model_hist = lstm_model.fit(X_train, Y_train, epochs=args.epochs, batch_size=batch_size,
                                verbose=1, validation_data=(X_test, Y_test))
    lstm_acc = model_hist.history['acc'][-1]
    print("LSTM model accuracy: ", lstm_acc)

    # And make predictions using the LSTM model
    lstm_predictions = lstm_model.predict(X_test)

    # Now prep data for the one-hot CNN model
    X_train_cnn = np.asarray([to_one_hot(x) for x in X_train_])
    X_test_cnn = np.asarray([to_one_hot(x) for x in X_test_])

    # And train the one-hot CNN classifier
    model_cnn = one_hot_cnn(dense_out, max_len)
    model_hist_cnn = model_cnn.fit(X_train_cnn, Y_train, batch_size=batch_size, epochs=args.epochs,
                                   verbose=1, validation_data=(X_test_cnn, Y_test))
    cnn_acc = model_hist_cnn.history['acc'][-1]
    print("CNN model accuracy: ", cnn_acc)

    # And make predictions
    one_hot_cnn_predictions = model_cnn.predict(X_test_cnn)

    # Using the accuracies, create normalized ensemble weights
    accuracies = [lstm_acc, cnn_acc]
    norm_accuracies = [a / sum(accuracies) for a in accuracies]

    print("Ensembling with weights: ")
    for na in norm_accuracies:
        print(na)
    ensembled_predictions = simple_ensembler([lstm_predictions, one_hot_cnn_predictions],
                                             norm_accuracies)
    final_preds = np.argmax(ensembled_predictions, axis=1)

    # Display the final classification report
    print(classification_report(np.argmax(Y_test, axis=1), final_preds,
                                target_names=data.labels_0.columns.values))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--file_path', type=str, default='./',
                        help='file_path where the files to parse are located')
    parser.add_argument('--data_type', type=str, default='amazon',
                        choices=['amazon'],
                        help='dataset source')
    parser.add_argument('--epochs', type=int, default=10,
                        help='Number of epochs for both models', action=check_size(1, 20000))
    args_in = parser.parse_args()

    # Check file path
    if args_in.file_path:
        validate_existing_filepath(args_in.file_path)

    if args_in.data_type == 'amazon':
        data_in = Amazon_Reviews(args_in.file_path)
        ensemble_models(data_in, args_in)
