Commit 4624a56

Author: Arvind Ramanathan
Commit message: added Pilot 3 benchmark 2 - LSTM for path report generation
1 parent 35f1bbe

File tree

3 files changed: +211 −0 lines changed

P3B2/README.md

Lines changed: 59 additions & 0 deletions
## P3B2: RNN-LSTM: A Generative Model for Clinical Path Reports

**Overview**: Given a sample corpus of biomedical text, such as clinical reports, build a deep learning network that can automatically generate synthetic text documents with valid clinical context.

**Relationship to core problem**: Labeled data is challenging to come by, particularly for patient data, since manual annotation is time consuming. A core capability we intend to build is therefore a "gold-standard" annotated dataset, generated by deep learning networks, for tuning our deep text comprehension applications.

**Expected Outcomes**: A generative RNN based on LSTMs that can effectively generate synthetic biomedical text with the desired clinical context.

### Benchmark Specs
#### Description of the Data
* Data source: Annotated pathology reports
* Input dimensions: 250,000-500,000 [characters], or 5,000-20,000 [bag of words], or 200-500 [bag of concepts]
* Output dimensions: Same as input
* Sample size: O(1,000)
* Notes on data balance and other issues: Standard NLP pre-processing is required, including (but not limited to) stemming, keyword extraction, text cleaning, and stop-word removal (see the sketch after this list). Data balance is an issue, since the number of positive examples vs. controls is skewed.
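
The pre-processing pipeline itself is not part of this commit; the snippet below is a minimal sketch of the steps the last bullet refers to, assuming NLTK is installed with its `punkt` and `stopwords` resources downloaded.

```
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(report_text):
    # lowercase and strip punctuation before tokenizing
    text = re.sub(r'[^\w\s]', ' ', report_text.lower())
    tokens = word_tokenize(text)
    # drop stop words, then stem the remaining tokens
    stop = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stop]
```
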
#### Expected Outcomes
* A generative model for pathology reports
* Output range: N/A, since the outputs are actual text documents with known case descriptions/concepts

#### Evaluation Metrics
* Accuracy or loss function: Standard information-theoretic metrics such as the log-likelihood score, the minimum description length score, and AIC/BIC, measuring how similar the generated documents are to actual ones (see the sketch after this list)
* Expected performance of a naïve method: Latent Dirichlet allocation (LDA) models
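
For reference, with maximized log-likelihood ln L, k free parameters, and n scored documents, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L (lower is better). A small illustrative helper, not part of the benchmark code:

```
import numpy as np

def aic_bic(log_likelihood, num_params, num_samples):
    # log_likelihood: total ln-likelihood of the documents under the model
    # num_params:     number of free parameters in the model
    # num_samples:    number of documents scored
    aic = 2.0 * num_params - 2.0 * log_likelihood
    bic = num_params * np.log(num_samples) - 2.0 * log_likelihood
    return aic, bic
```
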
#### Description of the Network
* Proposed network architecture: LSTM with at least 4 layers and [128, 256, 512] character windows (a stacked sketch follows this list)
* Number of layers: At least two hidden layers, with one input and one output sequence
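
The baseline implementation below uses a single LSTM layer; a stacked variant along the lines of this spec might look like the following sketch (Keras 1.x API, matching the rest of this commit; the layer width and window length are illustrative):

```
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

maxlen, n_chars = 128, 99   # illustrative window length and vocabulary size

model = Sequential()
# in a stack of LSTMs, every layer except the last must return sequences
model.add(LSTM(256, return_sequences=True, input_shape=(maxlen, n_chars)))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256, return_sequences=True))
model.add(LSTM(256))
model.add(Dense(n_chars))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```
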
#### Annotated Keras Code
Data loader, preprocessing, basic training and cross-validation, prediction and evaluation on test data

### Running the baseline implementation
The data file provided here is a compressed pickle file (`.tgz` extension). Before running the code, unpack the archive with:
```
tar -xzf data.pkl.tgz
```
Note that the training data is provided as a single pickle file. The code is documented with enough information to reproduce the files.

After uncompressing the data file, you can run:
```
python keras_p3b2_baseline.py
```

The original data from the pathology reports cannot be made available online. Hence, we have pre-processed the reports so that example training/testing sets can be generated. Contact [email protected] for more information on generating additional training and testing data. A generic data loader that generates training and testing sets will be provided in the near future.
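
For reference, `data.pkl` serializes its fields in the order the baseline script reads them; they can be inspected with a short sketch like this (Python 2, matching the baseline's use of `cPickle`):

```
import cPickle

# field order mirrors the reads in the baseline script
with open('data.pkl', 'rb') as f:
    classes = cPickle.load(f)        # class labels
    chars = cPickle.load(f)          # character vocabulary
    char_indices = cPickle.load(f)   # char -> index map
    indices_char = cPickle.load(f)   # index -> char map
    maxlen = cPickle.load(f)         # context window length
    step = cPickle.load(f)           # windowing stride (assumed)
    X_ind = cPickle.load(f)          # (samples, maxlen) char indices
    y_ind = cPickle.load(f)          # (samples,) next-char indices

print('vocab size: %d, window: %d, samples: %d'
      % (len(chars), maxlen, X_ind.shape[0]))
```
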
### Example output
#### Checkpointing and model saving
At each iteration of the training process, the model is saved both as an HDF5 (`.h5`) weights file and as a JSON architecture file. An example model (in JSON format) is shown below.
```
{"class_name": "Sequential", "keras_version": "1.1.0", "config": [{"class_name": "LSTM", "config": {"inner_activation": "hard_sigmoid", "trainable": true, "inner_init": "orthogonal", "output_dim": 256, "unroll": false, "consume_less": "cpu", "init": "glorot_uniform", "dropout_U": 0.0, "input_dtype": "float32", "batch_input_shape": [null, 20, 99], "input_length": null, "dropout_W": 0.0, "activation": "tanh", "stateful": false, "b_regularizer": null, "U_regularizer": null, "name": "lstm_1", "go_backwards": false, "input_dim": 99, "return_sequences": false, "W_regularizer": null, "forget_bias_init": "one"}}, {"class_name": "Dense", "config": {"W_constraint": null, "b_constraint": null, "name": "dense_1", "activity_regularizer": null, "trainable": true, "init": "glorot_uniform", "bias": true, "input_dim": null, "b_regularizer": null, "W_regularizer": null, "activation": "linear", "output_dim": 99}}, {"class_name": "Activation", "config": {"activation": "softmax", "trainable": true, "name": "activation_1"}}]}
```
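
A checkpoint can be restored by recombining the JSON architecture with the saved weights; a minimal sketch using the standard Keras 1.x API (the checkpoint file names here are illustrative):

```
from keras.models import model_from_json

# illustrative names; checkpoints are written as
# <rnn_size>/<maxlen>/model_<iteration>.<loss>.{json,h5}
with open('256/20/model_1.1.234567.json') as json_file:
    model = model_from_json(json_file.read())
model.load_weights('256/20/model_1.1.234567.h5')
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```
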
#### Sample text generated
The model generates text files that are stored as `example_<epoch>_<text-number>.txt` within a separate folder. An example output may look like this:
```
----- Generating with seed: "Diagnosis"
DiagnosisWZing Pathology Laboratory is certified under this report. **NAME[M. SSS dessDing Adientation of the tissue is submitted in the same container labeled with the patient's name and designated 'subcarinal lymph node is submitted in toto in cassette A1. B. Received in formalin labeled "right lower outer quadrant; A11-A10 - slice 16 with a cell block and submitted in cassette A1. B. Received fresh for
```

P3B2/data.pkl.tgz

7.43 MB
Binary file not shown.

P3B2/lstm_text_synthsis.py

Lines changed: 152 additions & 0 deletions
from __future__ import print_function

import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import os

import datetime
import cPickle


# callback to record the training loss after every batch
class LossHistory( keras.callbacks.Callback ):
    def on_train_begin( self, logs= {} ):
        self.losses = []

    def on_batch_end( self, batch, logs= {} ):
        self.losses.append( logs.get( 'loss' ) )


rnn_size = 256


# load preprocessed data from pickle (binary mode for cPickle)
f = open( 'data.pkl', 'rb' )

classes = cPickle.load( f )
chars = cPickle.load( f )
char_indices = cPickle.load( f )
indices_char = cPickle.load( f )

maxlen = cPickle.load( f )
step = cPickle.load( f )

X_ind = cPickle.load( f )
y_ind = cPickle.load( f )

f.close()

[ s1, s2 ] = X_ind.shape

# one-hot encode inputs and targets over the character vocabulary
X = np.zeros( ( s1, s2, len( chars ) ), dtype=np.bool )
y = np.zeros( ( s1, len( chars ) ), dtype=np.bool )

for i in range( s1 ):
    for t in range( s2 ):
        X[ i, t, X_ind[ i, t ] ] = 1
    y[ i, y_ind[ i ] ] = 1

# build the model: a single LSTM layer feeding a softmax over characters
print( 'Build model...' )
model = Sequential()
model.add( LSTM( rnn_size, input_shape=( maxlen, len( chars ) ) ) )
model.add( Dense( len( chars ) ) )
model.add( Activation( 'softmax' ) )

optimizer = RMSprop( lr= 0.001 )
model.compile( loss= 'categorical_crossentropy', optimizer= optimizer )


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array;
    # temperature < 1 sharpens the distribution, > 1 flattens it
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, writing generated text after each iteration;
# stop early if the loss has not improved for 5 consecutive iterations
min_loss = 1e15
loss_count = 0

for iteration in range(1, 100):
    print()
    print('-' * 50)
    print('Iteration', iteration)

    history = LossHistory()
    model.fit( X, y, batch_size= 100, nb_epoch= 1, callbacks= [ history ] )

    loss = history.losses[ -1 ]
    print( loss )

    if loss < min_loss:
        min_loss = loss
        loss_count = 0
    else:
        loss_count = loss_count + 1
        if loss_count > 4:
            break

    # checkpoints are written under <rnn_size>/<maxlen>/
    dirname = str( rnn_size ) + "/" + str( maxlen )
    if not os.path.exists( dirname ):
        os.makedirs( dirname )

    # serialize model architecture to JSON
    model_json = model.to_json()
    with open( dirname + "/model_" + str( iteration ) + "." + str( round( loss, 6 ) ) + ".json", "w" ) as json_file:
        json_file.write( model_json )
    # serialize weights to HDF5
    model.save_weights( dirname + "/model_" + str( iteration ) + "." + str( round( loss, 6 ) ) + ".h5" )
    print( "Checkpoint saved." )

    outtext = open( dirname + "/example_" + str( iteration ) + "." + str( round( loss, 6 ) ) + ".txt", "w" )

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        outtext.write('----- diversity:' + str( diversity ) + "\n" )

        generated = ''
        seedstr = "Diagnosis"
        outtext.write('----- Generating with seed: "' + seedstr + '"' + "\n" )

        # start from a blank context window
        sentence = " " * maxlen

        # class_index = 0
        generated += sentence
        outtext.write( generated )

        # slide the seed characters into the window; the sampled
        # predictions are discarded here, only the seed itself is emitted
        for c in seedstr:
            sentence = sentence[1:] + c
            x = np.zeros( ( 1, maxlen, len( chars ) ) )
            for t, char in enumerate(sentence):
                x[ 0, t, char_indices[ char ] ] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += c

            outtext.write( c )

        # generate 400 characters, feeding each sampled character back in
        for i in range( 400 ):
            x = np.zeros( ( 1, maxlen, len( chars ) ) )
            for t, char in enumerate(sentence):
                x[ 0, t, char_indices[ char ] ] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            outtext.write(next_char)

        outtext.write( "\n" )

    outtext.close()
