
Commit 5d35aca

Initial commit of P3B2 to Release_01 branch.
1 parent 93501a5 commit 5d35aca

File tree

4 files changed, +372 -0 lines changed


Pilot3/P3B2/README.md

Lines changed: 61 additions & 0 deletions
## P3B2: RNN-LSTM: A Generative Model for Clinical Path Reports

**Overview**: Given a sample corpus of biomedical text such as clinical reports, build a deep learning network that can automatically generate synthetic text documents with valid clinical context.

**Relationship to core problem**: Labeled data is challenging to come by, particularly for patient data, since manual annotation is time consuming; hence, a core capability we intend to build is a “gold-standard” annotated dataset, generated by deep learning networks, for tuning our deep text comprehension applications.

**Expected Outcomes**: A generative RNN based on LSTMs that can effectively generate synthetic biomedical text with the desired clinical context.
### Benchmark Specs
#### Description of the Data
* Data source: Annotated pathology reports
* Input dimensions: 250,000-500,000 [characters], or 5,000-20,000 [bag of words], or 200-500 [bag of concepts]
* Output dimensions: Same as input
* Sample size: O(1,000)
* Notes on data balance and other issues: Standard NLP pre-processing is required, including (but not limited to) stemming, keyword extraction, text cleaning, and stop-word removal. Data balance is an issue, since the number of positive examples vs. controls is skewed.
#### Expected Outcomes
* A generative model for pathology reports
* Output range: N/A, since the outputs are actual text documents with known case descriptions/concepts
#### Evaluation Metrics
* Accuracy or loss function: Standard information-theoretic metrics such as the log-likelihood score, minimum description length score, and AIC/BIC, measuring how similar the generated documents are to actual ones (a minimal log-likelihood sketch is shown below)
* Expected performance of a naïve method: Latent Dirichlet allocation (LDA) models
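As a concrete illustration of such a metric, the sketch below computes the average negative log-likelihood, in bits per character, of a document under a trained baseline model. This is an illustrative sketch rather than part of the benchmark code: the function name `bits_per_char` is hypothetical, and it assumes the one-hot character encoding and blank-primed sliding window used in `keras_p3b2_baseline.py`.

```
import numpy as np

def bits_per_char(model, text, char_indices, maxlen):
    # average negative log2-likelihood per character of `text` under a
    # trained character-level model (lower is better); assumes ' ' is in
    # the vocabulary, matching the baseline's blank-primed window
    n_chars = len(char_indices)
    window = ' ' * maxlen
    total = 0.0
    for c in text:
        x = np.zeros((1, maxlen, n_chars))
        for t, ch in enumerate(window):
            x[0, t, char_indices[ch]] = 1.0
        preds = model.predict(x, verbose=0)[0]
        total += -np.log2(max(float(preds[char_indices[c]]), 1e-12))
        window = window[1:] + c
    return total / len(text)
```

Lower scores mean the model assigns higher probability to the document; comparing scores on actual vs. generated reports gives one of the similarity measures described above.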
#### Description of the Network
* Proposed network architecture: LSTM with at least one layer, with 256-character windows
* Number of layers: At least two hidden layers, with one input and one output sequence

A graphical representation of the architecture is shown here.

![CB-RNN Architecture](https://raw.githubusercontent.com/ECP-CANDLE/Benchmarks/master/Pilot3/P3B2/images/RNN1.png)
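For reference, the layer stack described above corresponds to the construction loop in `keras_p3b2_baseline.py`. The following condensed sketch (the helper name `build_char_lstm` is illustrative, with defaults taken from `p3b2_default_model.txt`) shows the topology: every LSTM layer except the last returns full sequences, and a dense softmax layer predicts the next character.

```
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

def build_char_lstm(maxlen, n_chars, rnn_size=256, n_layers=1,
                    dropout=0.0, recurrent_dropout=0.0):
    # stacked character-level LSTM; all layers except the last one
    # return full sequences so the next layer sees a sequence input
    model = Sequential()
    for k in range(n_layers):
        ret_seq = k < n_layers - 1
        if k == 0:
            model.add(LSTM(rnn_size, input_shape=(maxlen, n_chars),
                           return_sequences=ret_seq,
                           dropout=dropout, recurrent_dropout=recurrent_dropout))
        else:
            model.add(LSTM(rnn_size, return_sequences=ret_seq,
                           dropout=dropout, recurrent_dropout=recurrent_dropout))
    model.add(Dense(n_chars))         # logits over the character vocabulary
    model.add(Activation('softmax'))  # next-character distribution
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    return model
```

With the default configuration (`n_layers = 1`, `rnn_size = 256`), this reduces to a single LSTM layer followed by a softmax over the character vocabulary, matching the example JSON checkpoint shown later in this README.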
#### Annotated Keras Code
Data loader, preprocessing, basic training and cross-validation, prediction and evaluation on test data.
### Running the baseline implementation
The data file provided here is a compressed pickle file (.tgz extension). Before running the code, use:
```
cd P3B2
tar -xzf data.pkl.tgz
```
to unpack the archive. Note that the training data is provided as a single pickle file. The code is documented with enough information to reproduce these files.
After uncompressing the data file, you can run:
```
python keras_p3b2_baseline.py
```
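Assuming the standard CANDLE argument handling, the hyperparameters declared in `p3b2.py` and `p3b2_default_model.txt` can typically be overridden on the command line, for example:
```
python keras_p3b2_baseline.py --epochs 5 --rnn_size 128 --temperature 0.8
```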
The original data from the pathology reports cannot be made available online. Hence, we have pre-processed the reports so that example training/testing sets can be generated. Contact [email protected] for more information on generating additional training and testing data. A generic data loader that generates training and testing sets will be provided in the near future.

### Example output
#### Checkpointing and model saving
At each iteration of the training process, the model is saved both as an HDF5 (.h5) file and as a JSON file. An example model (in JSON format) is shown below.
```
{"class_name": "Sequential", "keras_version": "1.1.0", "config": [{"class_name": "LSTM", "config": {"inner_activation": "hard_sigmoid", "trainable": true, "inner_init": "orthogonal", "output_dim": 256, "unroll": false, "consume_less": "cpu", "init": "glorot_uniform", "dropout_U": 0.0, "input_dtype": "float32", "batch_input_shape": [null, 20, 99], "input_length": null, "dropout_W": 0.0, "activation": "tanh", "stateful": false, "b_regularizer": null, "U_regularizer": null, "name": "lstm_1", "go_backwards": false, "input_dim": 99, "return_sequences": false, "W_regularizer": null, "forget_bias_init": "one"}}, {"class_name": "Dense", "config": {"W_constraint": null, "b_constraint": null, "name": "dense_1", "activity_regularizer": null, "trainable": true, "init": "glorot_uniform", "bias": true, "input_dim": null, "b_regularizer": null, "W_regularizer": null, "activation": "linear", "output_dim": 99}}, {"class_name": "Activation", "config": {"activation": "softmax", "trainable": true, "name": "activation_1"}}]}
```
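A checkpoint can be restored by pairing the JSON topology with the matching HDF5 weights. Below is a minimal sketch; the file names are placeholders for the `model_<epoch>_<loss>.json`/`.h5` pairs the baseline writes.
```
from keras.models import model_from_json

# restore the topology from JSON, then load the matching weights
with open('model_1_2.345678.json') as f:   # placeholder file name
    model = model_from_json(f.read())
model.load_weights('model_1_2.345678.h5')  # placeholder file name

# recompile before further training or evaluation
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
```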

#### Sample text generated
The model generates text files that are stored as `example_<epoch>_<text-number>.txt` within a separate folder. An example output may look like this:
```
----- Generating with seed: "Diagnosis"
DiagnosisWZing Pathology Laboratory is certified under this report. **NAME[M. SSS dessDing Adientation of the tissue is submitted in the same container labeled with the patient's name and designated 'subcarinal lymph node is submitted in toto in cassette A1. B. Received in formalin labeled "right lower outer quadrant; A11-A10 - slice 16 with a cell block and submitted in cassette A1. B. Received fresh for
```

Pilot3/P3B2/p3b2.py

Lines changed: 53 additions & 0 deletions
from __future__ import print_function

import os
import sys
import argparse

file_path = os.path.dirname(os.path.realpath(__file__))
lib_path2 = os.path.abspath(os.path.join(file_path, '..', '..', 'common'))
sys.path.append(lib_path2)

import candle_keras as candle

additional_definitions = [
    {'name': 'rnn_size',
     'action': 'store',
     'type': int,
     'help': 'size of LSTM internal state'},
    {'name': 'n_layers',
     'action': 'store',
     'type': int,
     'help': 'number of layers in the LSTM'},
    {'name': 'do_sample',
     'type': candle.str2bool,
     'help': 'generate synthesized text'},
    {'name': 'temperature',
     'action': 'store',
     'type': float,
     'help': 'variability of text synthesis'},
    {'name': 'primetext',
     'action': 'store',
     'help': 'seed string for text synthesis'},
    {'name': 'length',
     'action': 'store',
     'type': int,
     'help': 'length of synthesized text'},
]

required = ['train_data', 'rnn_size', 'epochs', 'n_layers',
            'learning_rate', 'drop', 'recurrent_dropout',
            'temperature', 'primetext', 'length']


class BenchmarkP3B2(candle.Benchmark):

    def set_locals(self):
        """Functionality to set variables specific for the benchmark
        - required: set of required parameters for the benchmark.
        - additional_definitions: list of dictionaries describing the additional parameters for the benchmark.
        """
        if required is not None:
            self.required = set(required)
        if additional_definitions is not None:
            self.additional_definitions = additional_definitions

Pilot3/P3B2/keras_p3b2_baseline.py

Lines changed: 240 additions & 0 deletions
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import os

import datetime
import pickle

import argparse
import sys

import p3b2 as bmk
import candle_keras as candle

def initialize_parameters():

    # Build benchmark object
    p3b2Bmk = bmk.BenchmarkP3B2(bmk.file_path, 'p3b2_default_model.txt', 'keras',
        prog='p3b2_baseline', desc='Generative RNN-LSTM for clinical pathology reports - Pilot 3 Benchmark 2')

    # Initialize parameters
    gParameters = candle.initialize_parameters(p3b2Bmk)
    #bmk.logger.info('Params: {}'.format(gParameters))

    return gParameters

class LossHistory( keras.callbacks.Callback ):
    def on_train_begin( self, logs= {} ):
        self.losses = []

    def on_batch_end( self, batch, logs= {} ):
        self.losses.append( logs.get( 'loss' ) )


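# temperature rescales the log-probabilities inside sample() before they are
# renormalized: values below 1.0 concentrate probability mass on the most
# likely characters, values above 1.0 flatten the distribution toward
# uniform, and 1.0 leaves the model's distribution unchanged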
def sample( preds, temperature= 1.0 ):
    # helper function to sample an index from a probability array
    preds = np.asarray( preds ).astype( 'float64' )
    preds = np.log( preds ) / temperature
    exp_preds = np.exp( preds )
    preds = exp_preds / np.sum( exp_preds )
    probas = np.random.multinomial( 1, preds, 1 )
    return np.argmax( probas )


def run(gParameters, data_path):

    kerasDefaults = candle.keras_default_config()

    rnn_size = gParameters['rnn_size']
    n_layers = gParameters['n_layers']
    learning_rate = gParameters['learning_rate']
    dropout = gParameters['drop']
    recurrent_dropout = gParameters['recurrent_dropout']
    n_epochs = gParameters['epochs']
    data_train = data_path + '/data.pkl'
    verbose = gParameters['verbose']
    savedir = gParameters['output_dir']
    do_sample = gParameters['do_sample']
    temperature = gParameters['temperature']
    primetext = gParameters['primetext']
    length = gParameters['length']

    # load data from pickle
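    # the pickle stream holds, in order: the class labels, the character
    # vocabulary, the char->index and index->char maps, the window length
    # (maxlen), the sampling stride (step), and the integer-encoded
    # input/target arrays (X_ind, y_ind)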
    f = open( data_train, 'rb' )

    if ( sys.version_info > ( 3, 0 ) ):
        classes = pickle.load( f, encoding= 'latin1' )
        chars = pickle.load( f, encoding= 'latin1' )
        char_indices = pickle.load( f, encoding= 'latin1' )
        indices_char = pickle.load( f, encoding= 'latin1' )

        maxlen = pickle.load( f, encoding= 'latin1' )
        step = pickle.load( f, encoding= 'latin1' )

        X_ind = pickle.load( f, encoding= 'latin1' )
        y_ind = pickle.load( f, encoding= 'latin1' )
    else:
        classes = pickle.load( f )
        chars = pickle.load( f )
        char_indices = pickle.load( f )
        indices_char = pickle.load( f )

        maxlen = pickle.load( f )
        step = pickle.load( f )

        X_ind = pickle.load( f )
        y_ind = pickle.load( f )

    f.close()

    [ s1, s2 ] = X_ind.shape
    print( X_ind.shape )
    print( y_ind.shape )
    print( maxlen )
    print( len( chars ) )

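    # expand the integer-encoded sequences into one-hot tensors: X[i] is a
    # (maxlen x vocabulary-size) window of characters, and y[i] is the
    # one-hot encoding of the character that follows that window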
    X = np.zeros( ( s1, s2, len( chars ) ), dtype=np.bool_ )
    y = np.zeros( ( s1, len( chars ) ), dtype=np.bool_ )

    for i in range( s1 ):
        for t in range( s2 ):
            X[ i, t, X_ind[ i, t ] ] = 1
        y[ i, y_ind[ i ] ] = 1

    # build the model: a (possibly stacked) LSTM
    if verbose:
        print( 'Build model...' )

    model = Sequential()

    # for rnn_size in rnn_sizes:
    for k in range( n_layers ):
        if k < n_layers - 1:
            ret_seq = True
        else:
            ret_seq = False

        if k == 0:
            model.add( LSTM( rnn_size, input_shape= ( maxlen, len( chars ) ), return_sequences= ret_seq,
                dropout= dropout, recurrent_dropout= recurrent_dropout ) )
        else:
            model.add( LSTM( rnn_size, dropout= dropout, recurrent_dropout= recurrent_dropout, return_sequences= ret_seq ) )

    model.add( Dense( len( chars ) ) )
    model.add( Activation( gParameters['activation'] ) )

    optimizer = candle.build_optimizer(gParameters['optimizer'],
                                       gParameters['learning_rate'],
                                       kerasDefaults)

    model.compile( loss= gParameters['loss'], optimizer= optimizer )

    if verbose:
        model.summary()

    for iteration in range( 1, n_epochs + 1 ):
        if verbose:
            print()
            print('-' * 50)
            print('Iteration', iteration)

        history = LossHistory()
        model.fit( X, y, batch_size= 100, epochs= 1, callbacks= [ history ] )

        loss = history.losses[ -1 ]
        if verbose:
            print( loss )

        dirname = savedir
        if len( dirname ) > 0 and not dirname.endswith( '/' ):
            dirname = dirname + '/'

        if not os.path.exists( dirname ):
            os.makedirs( dirname )

        # serialize model to JSON
        model_json = model.to_json()
        with open( dirname + "/model_" + str( iteration ) + "_" + "{:f}".format( loss ) + ".json", "w" ) as json_file:
            json_file.write( model_json )

        # serialize weights to HDF5
        model.save_weights( dirname + "/model_" + str( iteration ) + "_" + "{:f}".format( loss ) + ".h5" )

        if verbose:
            print( "Checkpoint saved." )

        if do_sample:
            outtext = open( dirname + "/example_" + str( iteration ) + "_" + "{:f}".format( loss ) + ".txt", "w", encoding= 'utf-8' )

            diversity = temperature

            outtext.write('----- diversity:' + str( diversity ) + "\n" )

            generated = ''
            seedstr = primetext

            outtext.write('----- Generating with seed: "' + seedstr + '"' + "\n" )

            sentence = " " * maxlen

            # class_index = 0
            generated += sentence
            outtext.write( generated )

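            # prime the network on the seed string one character at a time;
            # predictions made during priming are discarded, and only the
            # sliding window (sentence) and the output text advance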
            for c in seedstr:
                sentence = sentence[1:] + c
                x = np.zeros( ( 1, maxlen, len( chars ) ) )
                for t, char in enumerate(sentence):
                    x[ 0, t, char_indices[ char ] ] = 1.

                preds = model.predict( x, verbose= verbose )[ 0 ]
                next_index = sample( preds, diversity )
                next_char = indices_char[ next_index ]

                generated += c

                outtext.write( c )

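            # generate `length` characters: one-hot encode the current window,
            # sample the next character at the chosen temperature, then slide
            # the window forward by one character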
            for i in range( length ):
                x = np.zeros( ( 1, maxlen, len( chars ) ) )
                for t, char in enumerate( sentence ):
                    x[ 0, t, char_indices[ char ] ] = 1.

                preds = model.predict( x, verbose= verbose )[ 0 ]
                next_index = sample( preds, diversity )
                next_char = indices_char[ next_index ]

                generated += next_char
                sentence = sentence[ 1 : ] + next_char

            if (sys.version_info > (3, 0)):
                outtext.write( generated + '\n' )
            else:
                outtext.write( generated.decode('utf-8') + '\n' )

            outtext.close()


if __name__ == "__main__":

    gParameters = initialize_parameters()

    origin = gParameters['data_url']
    train_data = gParameters['train_data']
    data_loc = candle.fetch_file(origin+train_data, untar=True, md5_hash=None, subdir='Pilot3')

    print( 'Data downloaded and stored at: ' + data_loc )
    data_path = os.path.dirname(data_loc)
    print( data_path )

    run(gParameters, data_path)

Pilot3/P3B2/p3b2_default_model.txt

Lines changed: 18 additions & 0 deletions
[Global_Params]
data_url = 'http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/P3B2/'
train_data = 'P3B2_data.tgz'
model_name = 'p3b2'
rnn_size = 256
epochs = 10
n_layers = 1
learning_rate = 0.01
drop = 0.0
recurrent_dropout = 0.0
loss = 'categorical_crossentropy'
activation = 'softmax'
optimizer = 'rmsprop'
temperature = 1.0
primetext = 'Diagnosis'
length = 1000
do_sample = True
verbose = True
