Commit 4624a56

Author: Arvind Ramanathan
Commit message: added Pilot 3 benchmark 2 - LSTM for path report generation
1 parent 35f1bbe

File tree

3 files changed: +211 −0 lines changed

P3B2/README.md

Lines changed: 59 additions & 0 deletions
## P3B2: RNN-LSTM: A Generative Model for Clinical Path Reports

**Overview**: Given a sample corpus of biomedical text, such as clinical reports, build a deep learning network that can automatically generate synthetic text documents with valid clinical context.

**Relationship to core problem**: Labeled data is challenging to come by, particularly for patient data, since manual annotation is time consuming. A core capability we intend to build is therefore a "gold-standard" annotated dataset, generated by deep learning networks, for tuning our deep text comprehension applications.

**Expected Outcomes**: A generative RNN based on LSTMs that can effectively generate synthetic biomedical text with the desired clinical context.

### Benchmark Specs
#### Description of the Data
* Data source: Annotated pathology reports
* Input dimensions: 250,000-500,000 [characters], or 5,000-20,000 [bag of words], or 200-500 [bag of concepts]
* Output dimensions: Same as input
* Sample size: O(1,000)
* Notes on data balance and other issues: Standard NLP pre-processing is required, including (but not limited to) stemming, keyword extraction, text cleaning, and stop-word removal (see the sketch after this list). Data balance is an issue, since the number of positive examples vs. controls is skewed.
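
The pre-processing pipeline itself is not part of this commit; the snippet below is a minimal sketch of the steps the last bullet refers to, assuming NLTK is installed with its `punkt` and `stopwords` resources downloaded.

```
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(report_text):
    # lowercase and strip punctuation before tokenizing
    text = re.sub(r'[^\w\s]', ' ', report_text.lower())
    tokens = word_tokenize(text)
    # drop stop words, then stem the remaining tokens
    stop = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stop]
```
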
#### Expected Outcomes
* A generative model for pathology reports
* Output range: N/A, since the outputs are actual text documents with known case descriptions/concepts

#### Evaluation Metrics
* Accuracy or loss function: Standard information-theoretic metrics such as the log-likelihood score, the minimum description length score, and AIC/BIC, measuring how similar the generated documents are to actual ones (see the sketch after this list)
* Expected performance of a naïve method: Latent Dirichlet allocation (LDA) models
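
For reference, with maximized log-likelihood ln L, k free parameters, and n scored documents, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L (lower is better). A small illustrative helper, not part of the benchmark code:

```
import numpy as np

def aic_bic(log_likelihood, num_params, num_samples):
    # log_likelihood: total ln-likelihood of the documents under the model
    # num_params:     number of free parameters in the model
    # num_samples:    number of documents scored
    aic = 2.0 * num_params - 2.0 * log_likelihood
    bic = num_params * np.log(num_samples) - 2.0 * log_likelihood
    return aic, bic
```
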
#### Description of the Network
* Proposed network architecture: LSTM with at least 4 layers and [128, 256, 512] character windows (a stacked sketch follows this list)
* Number of layers: At least two hidden layers, with one input and one output sequence
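
The baseline implementation below uses a single LSTM layer; a stacked variant along the lines of this spec might look like the following sketch (Keras 1.x API, matching the rest of this commit; the layer width and window length are illustrative):

```
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

maxlen, n_chars = 128, 99   # illustrative window length and vocabulary size

model = Sequential()
# in a stack of LSTMs, every layer except the last must return sequences
model.add(LSTM(256, return_sequences=True, input_shape=(maxlen, n_chars)))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256))
model.add(Dense(n_chars))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```
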
#### Annotated Keras Code
Data loader, preprocessing, basic training and cross-validation, prediction and evaluation on test data

### Running the baseline implementation
The data file provided here is a compressed pickle file (`.tgz` extension). Before running the code, unpack the archive with:
```
tar -xzf data.pkl.tgz
```
Note that the training data is provided as a single pickle file. The code is documented with enough information to reproduce the files.

After uncompressing the data file, you can run:
```
python keras_p3b2_baseline.py
```

The original data from the pathology reports cannot be made available online. Hence, we have pre-processed the reports so that example training/testing sets can be generated. Contact [email protected] for more information on generating additional training and testing data. A generic data loader that generates training and testing sets will be provided in the near future.
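
For reference, `data.pkl` serializes its fields in the order the baseline script reads them; they can be inspected with a short sketch like this (Python 2, matching the baseline's use of `cPickle`):

```
import cPickle

# field order mirrors the reads in the baseline script
with open('data.pkl', 'rb') as f:
    classes = cPickle.load(f)        # class labels
    chars = cPickle.load(f)          # character vocabulary
    char_indices = cPickle.load(f)   # char -> index map
    indices_char = cPickle.load(f)   # index -> char map
    maxlen = cPickle.load(f)         # context window length
    step = cPickle.load(f)           # windowing stride (assumed)
    X_ind = cPickle.load(f)          # (samples, maxlen) char indices
    y_ind = cPickle.load(f)          # (samples,) next-char indices

print('vocab size: %d, window: %d, samples: %d'
      % (len(chars), maxlen, X_ind.shape[0]))
```
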
### Example output
#### Checkpointing and model saving
At each iteration of the training process, the model is saved both as an HDF5 (`.h5`) weights file and as a JSON architecture file. An example model (in JSON format) is shown below.
```
{"class_name": "Sequential", "keras_version": "1.1.0", "config": [{"class_name": "LSTM", "config": {"inner_activation": "hard_sigmoid", "trainable": true, "inner_init": "orthogonal", "output_dim": 256, "unroll": false, "consume_less": "cpu", "init": "glorot_uniform", "dropout_U": 0.0, "input_dtype": "float32", "batch_input_shape": [null, 20, 99], "input_length": null, "dropout_W": 0.0, "activation": "tanh", "stateful": false, "b_regularizer": null, "U_regularizer": null, "name": "lstm_1", "go_backwards": false, "input_dim": 99, "return_sequences": false, "W_regularizer": null, "forget_bias_init": "one"}}, {"class_name": "Dense", "config": {"W_constraint": null, "b_constraint": null, "name": "dense_1", "activity_regularizer": null, "trainable": true, "init": "glorot_uniform", "bias": true, "input_dim": null, "b_regularizer": null, "W_regularizer": null, "activation": "linear", "output_dim": 99}}, {"class_name": "Activation", "config": {"activation": "softmax", "trainable": true, "name": "activation_1"}}]}
```
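
A checkpoint can be restored by recombining the JSON architecture with the saved weights; a minimal sketch using the standard Keras 1.x API (the checkpoint file names here are illustrative):

```
from keras.models import model_from_json

# illustrative names; checkpoints are written as
# <rnn_size>/<maxlen>/model_<iteration>.<loss>.{json,h5}
with open('256/20/model_1.1.234567.json') as json_file:
    model = model_from_json(json_file.read())
model.load_weights('256/20/model_1.1.234567.h5')
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```
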
#### Sample text generated
The model generates text files that are stored as `example_<epoch>_<text-number>.txt` within a separate folder. An example output may look like this:
```
----- Generating with seed: "Diagnosis"
DiagnosisWZing Pathology Laboratory is certified under this report. **NAME[M. SSS dessDing Adientation of the tissue is submitted in the same container labeled with the patient's name and designated 'subcarinal lymph node is submitted in toto in cassette A1. B. Received in formalin labeled "right lower outer quadrant; A11-A10 - slice 16 with a cell block and submitted in cassette A1. B. Received fresh for
```

P3B2/data.pkl.tgz

7.43 MB
Binary file not shown.

P3B2/lstm_text_synthsis.py

Lines changed: 152 additions & 0 deletions
from __future__ import print_function

import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import os

import datetime
import cPickle


# callback to record the training loss after every batch
class LossHistory( keras.callbacks.Callback ):
    def on_train_begin( self, logs= {} ):
        self.losses = []

    def on_batch_end( self, batch, logs= {} ):
        self.losses.append( logs.get( 'loss' ) )


rnn_size = 256


# load preprocessed data from pickle (binary mode for cPickle)
f = open( 'data.pkl', 'rb' )

classes = cPickle.load( f )
chars = cPickle.load( f )
char_indices = cPickle.load( f )
indices_char = cPickle.load( f )

maxlen = cPickle.load( f )
step = cPickle.load( f )

X_ind = cPickle.load( f )
y_ind = cPickle.load( f )

f.close()

[ s1, s2 ] = X_ind.shape

# one-hot encode inputs and targets over the character vocabulary
X = np.zeros( ( s1, s2, len( chars ) ), dtype=np.bool )
y = np.zeros( ( s1, len( chars ) ), dtype=np.bool )

for i in range( s1 ):
    for t in range( s2 ):
        X[ i, t, X_ind[ i, t ] ] = 1
    y[ i, y_ind[ i ] ] = 1

# build the model: a single LSTM layer feeding a softmax over characters
print( 'Build model...' )
model = Sequential()
model.add( LSTM( rnn_size, input_shape=( maxlen, len( chars ) ) ) )
model.add( Dense( len( chars ) ) )
model.add( Activation( 'softmax' ) )

optimizer = RMSprop( lr= 0.001 )
model.compile( loss= 'categorical_crossentropy', optimizer= optimizer )


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array;
    # temperature < 1 sharpens the distribution, > 1 flattens it
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, writing generated text after each iteration;
# stop early if the loss has not improved for 5 consecutive iterations
min_loss = 1e15
loss_count = 0

for iteration in range(1, 100):
    print()
    print('-' * 50)
    print('Iteration', iteration)

    history = LossHistory()
    model.fit( X, y, batch_size= 100, nb_epoch= 1, callbacks= [ history ] )

    loss = history.losses[ -1 ]
    print( loss )

    if loss < min_loss:
        min_loss = loss
        loss_count = 0
    else:
        loss_count = loss_count + 1
        if loss_count > 4:
            break

    # checkpoints are written under <rnn_size>/<maxlen>/
    dirname = str( rnn_size ) + "/" + str( maxlen )
    if not os.path.exists( dirname ):
        os.makedirs( dirname )

    # serialize model architecture to JSON
    model_json = model.to_json()
    with open( dirname + "/model_" + str( iteration ) + "." + str( round( loss, 6 ) ) + ".json", "w" ) as json_file:
        json_file.write( model_json )
    # serialize weights to HDF5
    model.save_weights( dirname + "/model_" + str( iteration ) + "." + str( round( loss, 6 ) ) + ".h5" )
    print( "Checkpoint saved." )

    outtext = open( dirname + "/example_" + str( iteration ) + "." + str( round( loss, 6 ) ) + ".txt", "w" )

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        outtext.write('----- diversity:' + str( diversity ) + "\n" )

        generated = ''
        seedstr = "Diagnosis"
        outtext.write('----- Generating with seed: "' + seedstr + '"' + "\n" )

        # start from a blank context window
        sentence = " " * maxlen

        # class_index = 0
        generated += sentence
        outtext.write( generated )

        # slide the seed characters into the window; the sampled
        # predictions are discarded here, only the seed itself is emitted
        for c in seedstr:
            sentence = sentence[1:] + c
            x = np.zeros( ( 1, maxlen, len( chars ) ) )
            for t, char in enumerate(sentence):
                x[ 0, t, char_indices[ char ] ] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += c

            outtext.write( c )

        # generate 400 characters, feeding each sampled character back in
        for i in range( 400 ):
            x = np.zeros( ( 1, maxlen, len( chars ) ) )
            for t, char in enumerate(sentence):
                x[ 0, t, char_indices[ char ] ] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            outtext.write(next_char)

        outtext.write( "\n" )

    outtext.close()
