## P1B2: Sparse Classifier Disease Type Prediction from Somatic SNPs

**Overview**: Given patient somatic SNPs, build a deep learning network that can classify the cancer type.

**Relationship to core problem**: Exercise two core capabilities we need to build: (1) classification based on very sparse input data; (2) evaluation of the information content and predictive value in a molecular assay with auxiliary learning tasks.

**Expected outcome**: Build a DNN that can classify sparse data.

### Benchmark Specs Requirements

#### Description of the Data
* Data source: SNP data from GDC MAF files
* Input dimensions: 28,205 (aggregated variation impact by gene from 2.7 million unique SNPs)
* Output dimensions: 10 class probabilities (9 most abundant cancer types in GDC + 1 “others”)
* Sample size: 4,000 (3,000 training + 1,000 test)
* Notes on data balance and other issues: data balance achieved via undersampling; “others” category drawn from all remaining lower-abundance cancer types in GDC (see the sketch after this list)

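As a reference for the undersampling note above, here is a minimal sketch of per-class undersampling; `X` and `y` are hypothetical feature and label arrays, not the output of the benchmark's actual data loader.

```
import numpy as np

def undersample(X, y, seed=0):
    """Downsample every class to the size of the rarest class present."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # For each class, keep a random subset of n_min sample indices
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]
```
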
#### Expected Outcomes
* Classification
* Output range or number of classes: 10

#### Evaluation Metrics
* Accuracy or loss function: Standard approaches such as F1-score, accuracy, ROC-AUC, cross entropy, etc. (see the sketch after this list)
* Expected performance of a naïve method: linear regression or ensemble methods without feature selection

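For reference, a minimal scikit-learn sketch of computing the metrics listed above; the label and probability arrays here are synthetic placeholders, not benchmark output.

```
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score

rng = np.random.default_rng(0)
n_samples, n_classes = 1000, 10

y_true = rng.integers(0, n_classes, size=n_samples)         # true class labels
y_prob = rng.dirichlet(np.ones(n_classes), size=n_samples)  # predicted class probabilities
y_pred = y_prob.argmax(axis=1)                              # hard class predictions

labels = np.arange(n_classes)
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print("cross entropy:", log_loss(y_true, y_prob, labels=labels))
print("ROC-AUC (OvR):", roc_auc_score(y_true, y_prob, labels=labels, multi_class="ovr"))
```
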
#### Description of the Network
* Proposed network architecture: MLP with regularization (a minimal sketch follows this list)
* Number of layers: ~5 layers

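To make the architecture concrete, here is a minimal Keras sketch of a regularized MLP with the layer sizes hinted at by the saved-model name in the example output below (L1=1024, L2=512, L3=256, sigmoid activations, L2 penalty 1e-05). It uses the current Keras API rather than the Theano-era one, and the actual `p1b2_baseline.py` configuration may differ.

```
from keras.models import Sequential
from keras.layers import Dense, Input
from keras.regularizers import l2

model = Sequential([
    # 28,205 gene-level features in; three hidden layers with a small L2 penalty
    Input(shape=(28205,)),
    Dense(1024, activation="sigmoid", kernel_regularizer=l2(1e-5)),
    Dense(512, activation="sigmoid", kernel_regularizer=l2(1e-5)),
    Dense(256, activation="sigmoid", kernel_regularizer=l2(1e-5)),
    # 10 class probabilities out (9 abundant cancer types + "others")
    Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```
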
### Running the baseline implementation

```
cd P1B2
python p1b2_baseline.py
```
The training and test data files will be downloaded the first time this is run and will be cached for future runs.

#### Example output

```
Using Theano backend.
...
best_val_loss=1.31111 best_val_acc=0.59500

Best model saved to: model.A=sigmoid.B=64.D=None.E=20.L1=1024.L2=512.L3=256.P=1e-05.h5

Evaluation on test data: {'accuracy': 0.5500}
```

### Running the XGBoost classifier

```
cd P1B2
python p1b2_xgboost.py
```
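
For context, a minimal sketch of what an XGBoost classifier on data of this shape could look like; the synthetic sparse matrices and hyperparameters are illustrative assumptions, not the contents of `p1b2_xgboost.py`.

```
import numpy as np
import xgboost as xgb
from scipy import sparse

rng = np.random.default_rng(0)
n_train, n_test, n_features, n_classes = 3000, 1000, 28205, 10

# Synthetic stand-in for the sparse gene-level SNP features (~0.1% nonzero)
X_train = sparse.random(n_train, n_features, density=0.001, format="csr", random_state=0)
X_test = sparse.random(n_test, n_features, density=0.001, format="csr", random_state=1)
y_train = rng.integers(0, n_classes, size=n_train)
y_test = rng.integers(0, n_classes, size=n_test)

clf = xgb.XGBClassifier(
    objective="multi:softprob",
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
)
clf.fit(X_train, y_train)

# Labels are random here, so accuracy will hover near chance (~0.1)
print("test accuracy:", (clf.predict(X_test) == y_test).mean())
```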