### Benchmark Specs

#### Description of the Data
* Data source: Annotated pathology reports
* Input dimensions: 250,000-500,000 [characters], or 5,000-20,000 [bag of words], or 200-500 [bag of concepts]
* Output dimensions: Same as input
* Sample size: O(1,000)
* Notes on data balance and other issues: standard NLP pre-processing is required, including (but not limited to) stemming, keyword extraction, text cleaning, and stop-word removal (a minimal sketch follows this list). Data balance is an issue, since the ratio of positive examples to controls is skewed
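
As a concrete illustration of the pre-processing step, the sketch below lowercases, tokenizes, stems, and removes stop words from a report. The choice of NLTK and the helper name `preprocess` are illustrative assumptions, not part of the benchmark.

```python
# Minimal pre-processing sketch (assumes NLTK; run nltk.download("punkt")
# and nltk.download("stopwords") once before first use).
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STEMMER = PorterStemmer()
STOP_WORDS = set(stopwords.words("english"))

def preprocess(report_text):
    """Lowercase, strip non-alphabetic characters, drop stop words, stem."""
    text = re.sub(r"[^a-z\s]", " ", report_text.lower())
    tokens = word_tokenize(text)
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]
```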

#### Expected Outcomes
* A generative model for pathology reports
* Output range: N/A, since the outputs are actual text documents with known case descriptions/concepts

#### Evaluation Metrics
* Accuracy or loss function: Standard information-theoretic metrics, such as the log-likelihood score, minimum description length score, or AIC/BIC, to measure how similar the generated documents are to the actual ones
* Expected performance of a naïve method: Latent Dirichlet allocation (LDA) models (a baseline sketch follows this list)
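
To make the naïve baseline concrete, the sketch below fits an LDA model on bag-of-words counts and scores held-out documents by approximate log-likelihood and perplexity. The use of scikit-learn, the topic count, and the helper name `lda_baseline` are illustrative assumptions.

```python
# LDA baseline sketch (library choice and hyper-parameters are
# illustrative assumptions; any LDA implementation would do).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

def lda_baseline(documents, n_topics=20):
    # Bag-of-words counts (5,000-20,000 dimensions per the spec above).
    vectorizer = CountVectorizer(max_features=20000, stop_words="english")
    counts = vectorizer.fit_transform(documents)
    train, test = train_test_split(counts, test_size=0.2, random_state=0)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(train)

    # Approximate held-out log-likelihood and perplexity as reference scores.
    return lda.score(test), lda.perplexity(test)
```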

#### Description of the Network
* Proposed network architecture: LSTM with at least 4 layers and [128, 256, 512] character windows
* Number of layers: At least two hidden layers, with one input and one output sequence (a model sketch follows this list)
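
The sketch below shows one way to realize this architecture in Keras: stacked LSTM layers over one-hot character windows, predicting the next character. The window length, vocabulary size, hidden-unit count, and the function name `build_model` are illustrative assumptions, not values fixed by the spec.

```python
# Character-level generative model sketch (hyper-parameters are
# illustrative assumptions; only the general shape follows the spec).
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 128  # e.g. printable ASCII characters
WINDOW = 128      # one of the [128, 256, 512] character windows

def build_model(hidden_units=256):
    model = keras.Sequential([
        layers.Input(shape=(WINDOW, VOCAB_SIZE)),          # one-hot characters
        layers.LSTM(hidden_units, return_sequences=True),  # hidden layer 1
        layers.LSTM(hidden_units),                         # hidden layer 2
        layers.Dense(VOCAB_SIZE, activation="softmax"),    # next-character output
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```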

#### Annotated Keras Code
The baseline code provides a data loader, pre-processing, basic training with cross-validation, and prediction and evaluation on test data; a minimal cross-validation sketch follows.
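
As an illustration of the training-with-cross-validation step, the sketch below runs k-fold training over pre-windowed, one-hot-encoded data, reusing the `build_model` helper from the previous sketch; the fold count, epoch count, and batch size are assumptions.

```python
# K-fold training sketch (fold count, epochs, and batch size are
# illustrative assumptions, not values fixed by the benchmark).
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(x, y, n_splits=5):
    """x: (samples, WINDOW, VOCAB_SIZE) one-hot windows; y: one-hot next characters."""
    losses = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True,
                                    random_state=0).split(x):
        model = build_model()  # from the sketch above
        model.fit(x[train_idx], y[train_idx],
                  epochs=5, batch_size=64, verbose=0)
        losses.append(model.evaluate(x[val_idx], y[val_idx], verbose=0))
    return float(np.mean(losses))
```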

### Running the baseline implementation
The data file provided here is a compressed pickle file (.tgz extension). Before running the code, use:
```
cd P3B2
tar -xzf data.pkl.tgz
```
to unpack the archive. Note that the training data is provided as a single pickle file. The code is documented to provide enough information to reproduce the files.
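
Once unpacked, the data can be inspected before training; the snippet below is a minimal sketch, assuming the archive extracts to a file named `data.pkl` (the exact name and internal layout should be confirmed against the archive contents).

```python
# Quick inspection sketch (assumes the archive unpacks to "data.pkl";
# the internal layout of the pickle is not specified here, so inspect it).
import pickle

with open("data.pkl", "rb") as f:
    data = pickle.load(f)

print(type(data))  # confirm the top-level structure before training
```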