
Commit be5b267

update readme

1 parent 51483ab commit be5b267

File tree: 3 files changed, +81 -8 lines changed

P1B1/README.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -1,10 +1,10 @@
 ## P1B1: Autoencoder Compressed Representation for Gene Expression
 
-**Overview**: Given a sample of gene expression data, build a sparse autoencoder that can compress the expression profile into a low-dimensional vector
+**Overview**: Given a sample of gene expression data, build a sparse autoencoder that can compress the expression profile into a low-dimensional vector.
 
-**Relationship to core problem**: Many molecular assays generate large numbers of features that can lead to time-consuming processing and over-fitting in learning tasks; hence, a core capability we intend to build is feature reduction
+**Relationship to core problem**: Many molecular assays generate large numbers of features that can lead to time-consuming processing and over-fitting in learning tasks; hence, a core capability we intend to build is feature reduction.
 
-**Expected outcome**: An autoencoder that collapse high dimensional expression profiles into low dimensional vectors without much loss of information
+**Expected outcome**: Build an autoencoder that collapses high-dimensional expression profiles into low-dimensional vectors without much loss of information.
 
 ### Benchmark Specs Requirements
 
@@ -13,7 +13,7 @@
 * Input dimensions: 60,484 floats; log(1+x) transformed FPKM-UQ values
 * Output dimensions: Same as input
 * Latent representation dimension: 1000
-* Sample size: 5,000
+* Sample size: 4,000 (3000 training + 1000 test)
 * Notes on data balance and other issues: unlabeled data drawn from a diverse set of cancer types
 
 #### Expected Outcomes
````
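For orientation, below is a minimal sketch of the kind of sparse autoencoder this spec describes, written against the Keras functional API (the example outputs in this commit show the benchmarks running Keras on a Theano backend). Only the input width (60,484) and latent size (1,000) come from the spec; the hidden widths, L1 penalty weight, and optimizer are illustrative assumptions, not the benchmark's implementation.

```python
# Hypothetical sketch of a P1B1-style sparse autoencoder; everything except
# input_dim and latent_dim is an assumption, not the benchmark code.
import numpy as np
from keras import regularizers
from keras.layers import Dense, Input
from keras.models import Model

input_dim = 60484   # log(1+x)-transformed FPKM-UQ values per sample
latent_dim = 1000   # latent representation dimension from the spec

x_in = Input(shape=(input_dim,))
h = Dense(2000, activation='relu')(x_in)                  # assumed width
z = Dense(latent_dim, activation='relu',
          activity_regularizer=regularizers.l1(1e-5))(h)  # L1 -> sparse code
h_dec = Dense(2000, activation='relu')(z)
x_out = Dense(input_dim)(h_dec)  # linear output for real-valued profiles

autoencoder = Model(inputs=x_in, outputs=x_out)
autoencoder.compile(optimizer='rmsprop', loss='mse')

# Preprocessing named in the spec: x = np.log1p(fpkm_uq)
```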

P1B2/README.md

Lines changed: 46 additions & 1 deletion
````diff
@@ -1,4 +1,41 @@
+## P1B2: Sparse Classifier Disease Type Prediction from Somatic SNPs
 
+**Overview**: Given patient somatic SNPs, build a deep learning network that can classify the cancer type.
+
+**Relationship to core problem**: Exercise two core capabilities we need to build: (1) classification based on very sparse input data; (2) evaluation of the information content and predictive value in a molecular assay with auxiliary learning tasks.
+
+**Expected outcome**: Build a DNN that can classify sparse data.
+
+### Benchmark Specs Requirements
+
+#### Description of the Data
+* Data source: SNP data from GDC MAF files
+* Input dimensions: 28,205 (aggregated variation impact by gene from 2.7 million unique SNPs)
+* Output dimensions: 10 class probabilities (9 most abundant cancer types in GDC + 1 “others”)
+* Sample size: 4,000 (3000 training + 1000 test)
+* Notes on data balance and other issues: data balance achieved via undersampling; “others” category drawn from all remaining lower-abundance cancer types in GDC
+
+#### Expected Outcomes
+* Classification
+* Output range or number of classes: 10
+
+#### Evaluation Metrics
+* Accuracy or loss function: Standard approaches such as F1-score, accuracy, ROC-AUC, cross entropy, etc.
+* Expected performance of a naïve method: linear regression or ensemble methods without feature selection
+
+#### Description of the Network
+* Proposed network architecture: MLP with regularization
+* Number of layers: ~5 layers
+
+### Running the baseline implementation
+
+```
+cd P1B2
+python p1b2_baseline.py
+```
+The training and test data files will be downloaded the first time this is run and will be cached for future runs.
+
+#### Example output
 
 ```
 Using Theano backend.
@@ -64,5 +101,13 @@ best_val_loss=1.31111 best_val_acc=0.59500
 
 Best model saved to: model.A=sigmoid.B=64.D=None.E=20.L1=1024.L2=512.L3=256.P=1e-05.h5
 
-Evaluation on test data: {'accuracy': 0.55000000000000004}
+Evaluation on test data: {'accuracy': 0.5500}
+```
+
+### Running the XGBoost classifier
+
+```
+cd P1B2
+python p1b2_xgboost.py
+
 ```
````
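The saved-model name in the example output encodes the baseline's shape (sigmoid activations, hidden layers L1=1024, L2=512, L3=256, no dropout). Here is a minimal sketch along those lines; the optimizer, loss, and metrics are assumptions rather than the benchmark's actual settings.

```python
# Hypothetical reconstruction of the P1B2 baseline MLP from the saved-model
# name above; optimizer, loss, and metrics are assumptions.
from keras.layers import Dense, Input
from keras.models import Model

input_dim = 28205   # aggregated variation impact by gene
num_classes = 10    # 9 abundant GDC cancer types + 1 "others"

x_in = Input(shape=(input_dim,))
h = Dense(1024, activation='sigmoid')(x_in)
h = Dense(512, activation='sigmoid')(h)
h = Dense(256, activation='sigmoid')(h)
probs = Dense(num_classes, activation='softmax')(h)  # 10 class probabilities

model = Model(inputs=x_in, outputs=probs)
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
```

For the XGBoost runner, p1b2_xgboost.py itself is not shown in this diff; the following is a generic sketch of a gradient-boosted classifier on the same shapes, with all hyperparameters assumed and synthetic arrays standing in for the benchmark data files.

```python
# Hypothetical XGBoost classifier on P1B2-shaped data; hyperparameters are
# assumptions, and random arrays stand in for the real datasets.
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 28205)           # stand-in SNP feature matrix
y = np.random.randint(0, 10, size=200)   # stand-in labels for 10 classes

clf = xgb.XGBClassifier(objective='multi:softprob',
                        n_estimators=50, max_depth=6)
clf.fit(X, y)
print('training accuracy:', clf.score(X, y))
```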

P1B3/README.md

Lines changed: 31 additions & 3 deletions
````diff
@@ -1,11 +1,40 @@
-## Running the baseline implementation of the P1B3 benchmark
+## P1B3: MLP Regression Drug Response Prediction
+
+**Overview**: Given drug screening results on NCI60 cell lines, build a deep learning network that can predict the growth percentage from cell line expression and drug descriptors.
+
+**Relationship to core problem**: This benchmark is a simplified form of the core drug response prediction problem, in which we need to combine multiple molecular assays and a diverse array of drug descriptors to make a prediction.
+
+**Expected outcome**: Build a DNN that can predict the growth percentage of a cell line treated with a new drug.
+
+### Benchmark Specs Requirements
+
+#### Description of the Data
+* Data source: Dose response screening results from NCI; 5-platform normalized expression data from NCI; Dragon7-generated drug descriptors based on 2D chemical structures from NCI
+* Input dimensions: ~30K; 26K normalized expression levels by gene + 4K drug descriptors [+ drug concentration]
+* Output dimensions: 1 (growth percentage)
+* Sample size: ~2.5 M screening results (combinations of cell line and drug)
+* Notes on data balance: original data imbalanced, with many drugs that have little inhibition effect.
+
+#### Expected Outcomes
+* Regression. Predict percent growth per NCI-60 cell line and per drug
+* Dimension: 1 scalar value corresponding to the percent growth for a given drug concentration. Output range: [-100, 100]
+
+#### Evaluation Metrics
+* Accuracy or loss function: mean squared error or rank order.
+* Expected performance of a naïve method: mean response, linear regression, or random forest regression.
+
+#### Description of the Network
+* Proposed network architecture: MLP
+* Number of layers: ~5 layers
+
+### Running the baseline implementation
 
 ```
 $ cd P1B3
 $ python p1b3_baseline.py
 ```
 
-### Example output
+#### Example output
 ```
 Using Theano backend.
 Using gpu device 0: Tesla K80 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5004)
@@ -74,4 +103,3 @@ Cristina's results: Using the 5 layer MLP with standard normalization and sizes
 
 ![Measured vs predicted percent growth after 141 epochs](https://raw.githubusercontent.com/ECP-CANDLE/Benchmarks/master/P1B3/images/meas_vs_pred_It140.png)
 
-
````
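The hunk context above mentions a 5-layer MLP, consistent with the spec's "~5 layers". Below is a minimal sketch of such a regression network; the input width follows the spec (~26K expression + ~4K descriptors + drug concentration), while the hidden sizes and optimizer are assumptions, not taken from p1b3_baseline.py.

```python
# Hypothetical P1B3-style regression MLP; hidden widths and optimizer are
# assumptions, not the benchmark code.
from keras.layers import Dense, Input
from keras.models import Model

input_dim = 26000 + 4000 + 1   # expression + drug descriptors + concentration

x_in = Input(shape=(input_dim,))
h = Dense(1000, activation='relu')(x_in)
h = Dense(500, activation='relu')(h)
h = Dense(100, activation='relu')(h)
h = Dense(50, activation='relu')(h)
growth = Dense(1)(h)   # linear output; percent growth in [-100, 100]

model = Model(inputs=x_in, outputs=growth)
model.compile(optimizer='sgd', loss='mse')  # MSE per the evaluation metrics
```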