Commit 6c66672

modified README
Author: Arvind Ramanathan [v33]
1 parent 4624a56 commit 6c66672

1 file changed: +12 -11 lines changed

P3B2/README.md

Lines changed: 12 additions & 11 deletions
@@ -8,30 +8,31 @@
 ### Benchmark Specs

 #### Description of the Data
-*Data source: Annotated pathology reports
-*Input dimensions: 250,000-500,000 [characters], or 5,000-20,000 [bag of words], or 200-500 [bag of concepts]
-*Output dimensions: Same as input
-*Sample size: O(1,000)
-*Notes on data balance and other issues: standard NLP pre-processing is required, including (but not limited to) stemming words, keywords, cleaning text, stop words, etc. Data balance is an issue since the number of positive examples vs. control is skewed
+* Data source: Annotated pathology reports
+* Input dimensions: 250,000-500,000 [characters], or 5,000-20,000 [bag of words], or 200-500 [bag of concepts]
+* Output dimensions: Same as input
+* Sample size: O(1,000)
+* Notes on data balance and other issues: standard NLP pre-processing is required, including (but not limited to) stemming words, keywords, cleaning text, stop words, etc. Data balance is an issue since the number of positive examples vs. control is skewed

 #### Expected Outcomes
-*A generative model for pathology reports
-*Output range: N/A, since the outputs are actual text documents with known case descriptions/ concepts
+* A generative model for pathology reports
+* Output range: N/A, since the outputs are actual text documents with known case descriptions/ concepts

 #### Evaluation Metrics
-*Accuracy or loss function: Standard information theoretic metrics such as log-likelihood score, minimum description length score, AIC/BIC to measure how similar actual documents are compared to generated ones
-*Expected performance of a naïve method: Latent Dirichlet allocation (LDA) models
+* Accuracy or loss function: Standard information theoretic metrics such as log-likelihood score, minimum description length score, AIC/BIC to measure how similar actual documents are compared to generated ones
+* Expected performance of a naïve method: Latent Dirichlet allocation (LDA) models

 #### Description of the Network
-*Proposed network architecture: LSTM with at least 4 layers and [128, 256, 512] character windows
-*Number of layers: At least two hidden layers with one input and one output sequence
+* Proposed network architecture: LSTM with at least 4 layers and [128, 256, 512] character windows
+* Number of layers: At least two hidden layers with one input and one output sequence

 #### Annotated Keras Code
 Data loader, preprocessing, basic training and cross validation, prediction and evaluation on test data

 ### Running the baseline implementation
 The data file provided here is a compressed pickle file (.tgz extension). Before running the code, use:
 ```
+cd P3B2
 tar -xzf data.pkl.tgz
 ```
 to unpack the archive. Note that the training data is provided as a single pickle file. The code is documented to provide enough information about how to reproduce the files.
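
The unpacking step in the README can also be done from Python without extracting to disk first. The following is a minimal sketch, not the repository's own loader: the function name `load_pickle_from_tgz` is hypothetical, and it assumes the archive contains a regular file whose name ends in `.pkl`, which the README implies but does not state.

```python
import pickle
import tarfile


def load_pickle_from_tgz(archive_path, member_suffix=".pkl"):
    """Return the object stored in the first pickle member of a .tgz archive.

    Assumption: the archive holds one regular file ending in `member_suffix`.
    Only unpickle archives from a trusted source, because pickle can execute
    arbitrary code while loading.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(member_suffix):
                # extractfile returns a binary file object for regular files.
                with archive.extractfile(member) as handle:
                    return pickle.load(handle)
    raise FileNotFoundError(
        f"no member ending in {member_suffix!r} in {archive_path}"
    )
```

Calling `load_pickle_from_tgz("P3B2/data.pkl.tgz")` from the repository root would return whatever object was pickled; the README does not describe its structure, so inspect the result before relying on any particular layout.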

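The evaluation metrics listed in the README include AIC and BIC. As a reference, here is a short sketch of their standard definitions in terms of a model's log-likelihood. The function names and arguments are illustrative; the README does not specify how the log-likelihood of generated documents is computed or what counts as a parameter, so those inputs are assumed to be available.

```python
import math


def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: k ln(n) - 2 ln(L). Lower is better."""
    return num_params * math.log(num_samples) - 2 * log_likelihood
```

Both criteria penalize the fit by model size. BIC's penalty grows with the number of samples, so with the O(1,000) samples mentioned in the README it penalizes extra parameters more heavily than AIC does.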