IntelLabs
diff --git a/doc/source/api.rst b/doc/source/api.rst
Lines changed: 3 additions & 2 deletions
diff --git a/doc/source/index.rst b/doc/source/index.rst
Lines changed: 2 additions & 0 deletions
diff --git a/doc/source/overview.rst b/doc/source/overview.rst
Lines changed: 1 addition & 0 deletions
diff --git a/doc/source/supervised_sentiment.rst b/doc/source/supervised_sentiment.rst
Lines changed: 87 additions & 0 deletions
diff --git a/examples/supervised_sentiment/README.md b/examples/supervised_sentiment/README.md
Lines changed: 49 additions & 0 deletions
diff --git a/examples/supervised_sentiment/__init__.py b/examples/supervised_sentiment/__init__.py
diff --git a/examples/supervised_sentiment/example_ensemble.py b/examples/supervised_sentiment/example_ensemble.py
Lines changed: 126 additions & 0 deletions
@@ -49,6 +49,8 @@ to train the model weights, perform inference, and save/load the model.
    nlp_architect.models.bist_parser.BISTModel
    nlp_architect.models.memn2n_dialogue.MemN2N_Dialog
    nlp_architect.models.kvmemn2n.KVMemN2N
+   nlp_architect.models.supervised_sentiment.simple_lstm
+   nlp_architect.models.supervised_sentiment.one_hot_cnn
 
 
 ``nlp_architect.layers``
@@ -88,7 +90,7 @@ these will be placed into a central repository.
     nlp_architect.data.sequential_tagging.SequentialTaggingDataset
     nlp_architect.data.babi_dialog.BABI_Dialog
     nlp_architect.data.wikimovies.WIKIMOVIES
-
+    nlp_architect.data.amazon_reviews.Amazon_Reviews
 
 
 ``nlp_architect.pipelines``
@@ -117,4 +119,3 @@ NLP pipelines modules using models implemented from ``nlp_architect.models``.
 
     server.serve
     server.service
-
 
@@ -57,6 +57,7 @@ The library contains state-of-art and novel NLP and NLU models in a varity of to
 - NER and NE expansion
 - Text chunking
 - Reading comprehension
+- Supervised sentiment analysis
 
 
 Deep Learning frameworks
@@ -115,6 +116,7 @@ on this project, please see the :doc:`developer guide <developer_guide>`.
    bist_parser.rst
    word_sense.rst
    np2vec.rst
+   supervised_sentiment.rst
    tcn.rst
 
 .. toctree::
 
@@ -53,6 +53,7 @@ The library contains state-of-art and novel NLP and NLU models in a varity of to
 - NER and NE expansion
 - Text chunking
 - Reading comprehension
+- Supervised sentiment analysis
 
 Deep Learning frameworks
 ````````````````````````
 
@@ -0,0 +1,87 @@
+.. ---------------------------------------------------------------------------
+.. Copyright 2017-2018 Intel Corporation
+..
+.. Licensed under the Apache License, Version 2.0 (the "License");
+.. you may not use this file except in compliance with the License.
+.. You may obtain a copy of the License at
+..
+..      http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS,
+.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+.. See the License for the specific language governing permissions and
+.. limitations under the License.
+.. ---------------------------------------------------------------------------
+
+Supervised Sentiment
+####################
+
+Overview
+========
+
+This is a set of example supervised models for sentiment analysis.
+The broader goal is to allow these models to be ensembled with other supervised or unsupervised models.
+
+Files
+=====
+
+- **nlp_architect/models/supervised_sentiment.py**: Sentiment analysis models - currently an LSTM and a one-hot CNN
+- **nlp_architect/data/amazon_reviews.py**: Code which will download and process the Amazon datasets described below
+- **nlp_architect/utils/ensembler.py**: Contains the ensembling algorithm(s)
+- **example_ensemble.py**: An example of how the sentiment models can be trained and ensembled
+- **optimize_example.py**: An example of using a hyperparameter optimizer with the simple LSTM model
+
+
+Models
+======
+Two models are shown as classification examples. Additional models can be added as desired.
+
+Bi-directional LSTM
+-------------------
+A simple bidirectional LSTM with one fully connected layer. The number of vocabulary features, the dense output size, and the document input length should be determined in the data preprocessing steps. The user can then change the size of the LSTM hidden layer and the recurrent dropout rate.
+
+Temporal CNN
+------------
+As defined in "Text Understanding from Scratch" (Zhang and LeCun, 2015, https://arxiv.org/pdf/1502.01710v4.pdf), this model is a series of 1D convolutional layers with max pooling, followed by fully connected layers. The frame sizes may be either large or small.
+
+
+Datasets
+========
+The dataset in this example is the Amazon Reviews dataset, though other datasets can be easily substituted.
+The Amazon review dataset(s) should be downloaded from http://jmcauley.ucsd.edu/data/amazon/. These are ``*.json.gzip`` files which should be unzipped. The terms and conditions of the dataset license apply. Intel does not grant any rights to the data files.
+For best results, choose a medium-sized dataset, though the algorithms will also work on larger and smaller datasets. For experimentation, the Movies and TV reviews were used.
+Only the "overall", "reviewText", and "summary" columns of the review dataset are retained. The "overall" column is the star rating; it is currently mapped to a sentiment label where 4-5 stars is a positive review, 3 stars is neutral, and 1-2 stars is a negative review.
+The "summary" (the title of the review) is concatenated with the review text, and the result is then cleaned.
+
+The Amazon Review Dataset was published in the following papers:
+
+Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
+R. He, J. McAuley
+WWW, 2016
+http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf
+
+Image-based recommendations on styles and substitutes
+J. McAuley, C. Targett, J. Shi, A. van den Hengel
+SIGIR, 2015
+http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf
+
+
+Running Modalities
+==================
+
+Ensemble Train/Test
+-------------------
+Currently, the pipeline shows a full train/test/ensemble cycle. The main pipeline can be run with the following command::
+
+    python example_ensemble.py --file_path ./reviews_Movies_and_TV.json/
+
+At the conclusion of training, a final classification report is displayed.
+
+Hyperparameter optimization
+---------------------------
+An example of hyperparameter optimization is given using the Python package hyperopt, which uses the Tree of Parzen Estimators (TPE) algorithm, to optimize the simple bi-LSTM model. To run this example, use the following command::
+
+    python optimize_example.py --file_path ./reviews_Movies_and_TV.json/ --new_trials 50 --output_file ./data/optimize_output.pkl
+
+The script writes the result of each trial to the specified pickle file.
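The label mapping described in the Datasets section (4-5 stars positive, 3 neutral, 1-2 negative, with the summary prepended to the review text) can be sketched as follows. This is a minimal illustration only: `stars_to_sentiment` and `build_example` are hypothetical helper names, not functions from `nlp_architect.data.amazon_reviews`, and the cleaning step is omitted because it is not shown in this diff.

```python
def stars_to_sentiment(stars):
    """Map a 1-5 star rating to a sentiment label (hypothetical helper)."""
    if not 1 <= stars <= 5:
        raise ValueError("expected a rating between 1 and 5, got {}".format(stars))
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return "neutral"


def build_example(review):
    """Keep only the retained fields and join the summary with the review text."""
    summary = review.get("summary", "")
    body = review.get("reviewText", "")
    text = "{} {}".format(summary, body).strip()
    return text, stars_to_sentiment(review["overall"])


if __name__ == "__main__":
    sample = {"overall": 2.0, "summary": "Disappointing",
              "reviewText": "The plot made no sense.", "asin": "ignored"}
    print(build_example(sample))
```

Any extra fields in the review record (such as `asin` above) are simply ignored, which mirrors the statement that only three columns are retained.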
@@ -0,0 +1,49 @@
+# Supervised Sentiment
+
+This is a set of example supervised models for sentiment analysis.
+The broader goal is to allow these models to be ensembled with other supervised or unsupervised models.
+
+
+# Models
+Two models are shown as classification examples. Additional models can be added as desired.
+
+<b>Bi-directional LSTM</b>
+A simple bidirectional LSTM with one fully connected layer. The number of vocabulary features, the dense output size, and the document input length should be determined in the data preprocessing steps. The user can then change the size of the LSTM hidden layer and the recurrent dropout rate.
+
+<b>Temporal CNN</b>
+As defined in "Text Understanding from Scratch" (Zhang and LeCun, 2015, https://arxiv.org/pdf/1502.01710v4.pdf), this model is a series of 1D convolutional layers with max pooling, followed by fully connected layers. The frame sizes may be either large or small.
+
+
+# Datasets
+The dataset in this example is the Amazon Reviews dataset, though other datasets can be easily substituted.
+The Amazon review dataset(s) should be downloaded from http://jmcauley.ucsd.edu/data/amazon/. These are `*.json.gzip` files which should be unzipped. The terms and conditions of the dataset license apply. Intel does not grant any rights to the data files.
+For best results, choose a medium-sized dataset, though the algorithms will also work on larger and smaller datasets. For experimentation, the Movies and TV reviews were used.
+Only the "overall", "reviewText", and "summary" columns of the review dataset are retained. The "overall" column is the star rating; it is currently mapped to a sentiment label where 4-5 stars is a positive review, 3 stars is neutral, and 1-2 stars is a negative review.
+The "summary" (the title of the review) is concatenated with the review text, and the result is then cleaned.
+
+The Amazon Review Dataset was published in the following papers:
+
+Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering
+R. He, J. McAuley
+WWW, 2016
+http://cseweb.ucsd.edu/~jmcauley/pdfs/www16a.pdf
+
+Image-based recommendations on styles and substitutes
+J. McAuley, C. Targett, J. Shi, A. van den Hengel
+SIGIR, 2015
+http://cseweb.ucsd.edu/~jmcauley/pdfs/sigir15.pdf
+
+
+# Train/Test
+Currently, the pipeline shows a full train/test/ensemble cycle. The main pipeline can be run with the following command:
+```
+ python example_ensemble.py --file_path ./reviews_Movies_and_TV.json/
+```
+At the conclusion of training, a final classification report is displayed.
+
+# Hyperparameter optimization
+An example of hyperparameter optimization is given using the Python package hyperopt, which uses the Tree of Parzen Estimators (TPE) algorithm, to optimize the simple bi-LSTM model. To run this example, use the following command:
+```
+ python optimize_example.py --file_path ./reviews_Movies_and_TV.json/ --new_trials 50 --output_file ./data/optimize_output.pkl
+```
+The script writes the result of each trial to the specified pickle file.
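The Temporal CNN described above consumes character-level one-hot input rather than word indices. The sketch below shows one way such an encoding could look. It is an assumption-laden illustration: the alphabet, the `(max_len, alphabet_size)` layout, and the name `char_one_hot` are all hypothetical and may differ from the actual `to_one_hot` in `nlp_architect.utils.generic`, which is not part of this diff.

```python
import string

import numpy as np

# Assumed alphabet: lowercase letters, digits, ASCII punctuation, and space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def char_one_hot(text, max_len=300):
    """Encode text as a (max_len, len(ALPHABET)) one-hot matrix.

    Text is lowercased and truncated to max_len characters. Characters
    outside the alphabet, and padding positions, stay as all-zero rows.
    """
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded


if __name__ == "__main__":
    matrix = char_one_hot("Great movie!", max_len=20)
    print(matrix.shape, int(matrix.sum()))
```

Fixing the length up front keeps every document the same shape, which is what allows the encoded reviews to be stacked into a single array for training.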
@@ -0,0 +1,126 @@
+# ******************************************************************************
+# Copyright 2017-2018 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ******************************************************************************
+
+"""
+This example uses the Amazon reviews, though additional datasets can easily be substituted.
+It only requires text and a sentiment label.
+It then takes the dataset and trains two models (more can be added).
+The labels for the test data are then predicted.
+The same train and test data is used for both models.
+
+The ensembler takes the two prediction matrices and weights (as defined by model accuracy)
+and determines the final prediction matrix.
+
+Finally, the full classification report is displayed.
+
+A similar pipeline could be utilized to train models on a dataset, predict on a second dataset,
+and acquire a list of final predictions.
+"""
+
+import numpy as np
+import argparse
+from keras.preprocessing.sequence import pad_sequences
+from keras.preprocessing.text import Tokenizer
+from sklearn.model_selection import train_test_split
+from sklearn.metrics import classification_report
+
+from nlp_architect.data.amazon_reviews import Amazon_Reviews
+from nlp_architect.utils.generic import to_one_hot
+from nlp_architect.models.supervised_sentiment import simple_lstm, one_hot_cnn
+from nlp_architect.utils.ensembler import simple_ensembler
+from nlp_architect.utils.io import validate_existing_filepath, check_size
+
+max_features = 2000
+max_len = 300
+batch_size = 32
+embed_dim = 256
+lstm_out = 140
+
+
+def ensemble_models(data, args):
+    # split, train, test
+    data.process()
+    dense_out = len(data.labels[0])
+    # split for all models
+    X_train_, X_test_, Y_train, Y_test = train_test_split(data.text, data.labels,
+                                                          test_size=0.20, random_state=42)
+
+    # Prep data for the LSTM model
+    tokenizer = Tokenizer(num_words=max_features, split=' ')
+    tokenizer.fit_on_texts(X_train_)
+    X_train = tokenizer.texts_to_sequences(X_train_)
+    X_train = pad_sequences(X_train, maxlen=max_len)
+    X_test = tokenizer.texts_to_sequences(X_test_)
+    X_test = pad_sequences(X_test, maxlen=max_len)
+
+    # Train the LSTM model
+    lstm_model = simple_lstm(max_features, dense_out, X_train.shape[1], embed_dim, lstm_out)
+    model_hist = lstm_model.fit(X_train, Y_train, epochs=args.epochs, batch_size=batch_size,
+                                verbose=1, validation_data=(X_test, Y_test))
+    lstm_acc = model_hist.history['acc'][-1]
+    print("LSTM model accuracy: ", lstm_acc)
+
+    # And make predictions using the LSTM model
+    lstm_predictions = lstm_model.predict(X_test)
+
+    # Now prep data for the one-hot CNN model
+    X_train_cnn = np.asarray([to_one_hot(x) for x in X_train_])
+    X_test_cnn = np.asarray([to_one_hot(x) for x in X_test_])
+
+    # And train the one-hot CNN classifier
+    model_cnn = one_hot_cnn(dense_out, max_len)
+    model_hist_cnn = model_cnn.fit(X_train_cnn, Y_train, batch_size=batch_size, epochs=args.epochs,
+                                   verbose=1, validation_data=(X_test_cnn, Y_test))
+    cnn_acc = model_hist_cnn.history['acc'][-1]
+    print("CNN model accuracy: ", cnn_acc)
+
+    # And make predictions
+    one_hot_cnn_predictions = model_cnn.predict(X_test_cnn)
+
+    # Using the accuracies create an ensemble
+    accuracies = [lstm_acc, cnn_acc]
+    norm_accuracies = [a / sum(accuracies) for a in accuracies]
+
+    print("Ensembling with weights: ")
+    for na in norm_accuracies:
+        print(na)
+    ensembled_predictions = simple_ensembler([lstm_predictions, one_hot_cnn_predictions],
+                                             norm_accuracies)
+    final_preds = np.argmax(ensembled_predictions, axis=1)
+
+    # Get the final accuracy
+    print(classification_report(np.argmax(Y_test, axis=1), final_preds,
+                                target_names=data.labels_0.columns.values))
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--file_path', type=str, default='./',
+                        help='file_path where the files to parse are located')
+    parser.add_argument('--data_type', type=str, default='amazon',
+                        choices=['amazon'],
+                        help='dataset source')
+    parser.add_argument('--epochs', type=int, default=10,
+                        help='Number of epochs for both models', action=check_size(1, 20000))
+    args_in = parser.parse_args()
+
+    # Check file path
+    if args_in.file_path:
+        validate_existing_filepath(args_in.file_path)
+
+    if args_in.data_type == 'amazon':
+        data_in = Amazon_Reviews(args_in.file_path)
+    ensemble_models(data_in, args_in)
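The accuracy-weighted ensembling step in `ensemble_models` can be illustrated with a small NumPy sketch. `weighted_average_ensemble` is a hypothetical stand-in, not the `simple_ensembler` from `nlp_architect.utils.ensembler`, whose implementation is not shown in this diff. It assumes each model returns a `(samples, classes)` probability matrix and that the ensemble is a plain weighted average of those matrices.

```python
import numpy as np


def weighted_average_ensemble(prediction_matrices, weights):
    """Combine per-model class probabilities with a weighted average.

    prediction_matrices: list of (samples, classes) arrays, one per model.
    weights: one non-negative weight per model; normalized to sum to 1 here.
    """
    preds = np.stack([np.asarray(p, dtype=float) for p in prediction_matrices])
    w = np.asarray(weights, dtype=float)
    if w.shape != (preds.shape[0],):
        raise ValueError("expected exactly one weight per prediction matrix")
    w = w / w.sum()
    # Contract the model axis: (models,) x (models, samples, classes)
    return np.tensordot(w, preds, axes=1)


if __name__ == "__main__":
    # Made-up probabilities for two samples and three classes.
    lstm_preds = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
    cnn_preds = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
    accuracies = [0.8, 0.6]  # illustrative values, not measured results
    combined = weighted_average_ensemble([lstm_preds, cnn_preds], accuracies)
    print(np.argmax(combined, axis=1))
```

Because the weights are normalized inside the function, passing raw accuracies or the pre-normalized `norm_accuracies` from the example gives the same result.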