Commit bba374f

Initial port of Uno to Release_01
1 parent 5a82214 commit bba374f

9 files changed: 2550 additions, 0 deletions

Pilot1/Uno/README.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
## Uno: Predicting Tumor Dose Response across Multiple Data Sources

#### Example output
Uno can be trained with a subset of dose response data sources. Here is a command-line example of training with all 6 sources: CCLE, CTRP, gCSI, GDSC, NCI60 single drug response, and ALMANAC drug pair response.

```
uno_baseline_keras2.py --train_sources all --cache cache/all --use_landmark_genes --preprocess_rnaseq source_scale --no_feature_source --no_response_source
Using TensorFlow backend.
Params: {'activation': 'relu', 'batch_size': 32, 'dense': [1000, 1000, 1000], 'dense_feature_layers': [1000, 1000, 1000], 'drop': 0, 'epochs': 10, 'learning_rate': None, 'loss':
'mse', 'optimizer': 'adam', 'residual': False, 'rng_seed': 2018, 'save': 'save/uno', 'scaling': 'std', 'feature_subsample': 0, 'validation_split': 0.2, 'solr_root': '', 'timeout'
: -1, 'train_sources': ['all'], 'test_sources': ['train'], 'cell_types': None, 'cell_features': ['rnaseq'], 'drug_features': ['descriptors', 'fingerprints'], 'cv': 1, 'max_val_lo
ss': 1.0, 'base_lr': None, 'reduce_lr': False, 'warmup_lr': False, 'batch_normalization': False, 'no_gen': False, 'config_file': '/raid/fangfang/Benchmarks/Pilot1/Uno/uno_default
_model.txt', 'verbose': False, 'logfile': None, 'train_bool': True, 'shuffle': True, 'alpha_dropout': False, 'gpus': [], 'experiment_id': 'EXP.000', 'run_id': 'RUN.000', 'by_cell
': None, 'by_drug': None, 'drug_median_response_min': -1, 'drug_median_response_max': 1, 'no_feature_source': True, 'no_response_source': True, 'use_landmark_genes': True, 'use_f
iltered_genes': False, 'preprocess_rnaseq': 'source_scale', 'cp': False, 'tb': False, 'partition_by': None, 'cache': 'cache/ALL', 'single': False, 'export_data': None, 'growth_bi
ns': 0, 'datatype': <class 'numpy.float32'>}
Cache parameter file does not exist: cache/ALL.params.json
Loading data from scratch ...
Loaded 27769716 single drug dose response measurements
Loaded 3686475 drug pair dose response measurements
Combined dose response data contains sources: ['CCLE' 'CTRP' 'gCSI' 'GDSC' 'NCI60' 'SCL' 'SCLC' 'ALMANAC.FG'
'ALMANAC.FF' 'ALMANAC.1A']
Summary of combined dose response by source:
              Growth  Sample  Drug1  Drug2  MedianDose
Source
ALMANAC.1A    208605      60    102    102    7.000000
ALMANAC.FF   2062098      60     92     71    6.698970
ALMANAC.FG   1415772      60    100     29    6.522879
CCLE           93251     504     24      0    6.602060
CTRP         6171005     887    544      0    6.585027
GDSC         1894212    1075    249      0    6.505150
NCI60       18862308      59  52671      0    6.000000
SCL           301336      65    445      0    6.908485
SCLC          389510      70    526      0    6.908485
gCSI           58094     409     16      0    7.430334
Combined raw dose response data has 3070 unique samples and 53520 unique drugs
Limiting drugs to those with response min <= 1, max >= -1, span >= 0, median_min <= -1, median_max >= 1 ...
Selected 47005 drugs from 53520
Loaded combined RNAseq data: (15198, 943)
Loaded combined dragon7 drug descriptors: (53507, 5271)
Loaded combined dragon7 drug fingerprints: (53507, 2049)
Filtering drug response data...
2375 molecular samples with feature and response data
46837 selected drugs with feature and response data
Summary of filtered dose response by source:
              Growth  Sample  Drug1  Drug2  MedianDose
Source
ALMANAC.1A    206580      60    101    101    7.000000
ALMANAC.FF   2062098      60     92     71    6.698970
ALMANAC.FG   1293465      60     98     27    6.522879
CCLE           80213     474     22      0    6.602060
CTRP         3397103     812    311      0    6.585027
GDSC         1022204     672    213      0    6.505150
NCI60       17190561      59  46272      0    6.000000
gCSI           50822     357     16      0    7.430334
Grouped response data by drug_pair: 51763 groups
Input features shapes:
dose1: (1,)
dose2: (1,)
cell.rnaseq: (942,)
drug1.descriptors: (5270,)
drug1.fingerprints: (2048,)
drug2.descriptors: (5270,)
drug2.fingerprints: (2048,)
Total input dimensions: 15580
Saved data to cache: cache/all.pkl
Combined model:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input.cell.rnaseq (InputLayer)  (None, 942)          0
__________________________________________________________________________________________________
input.drug1.descriptors (InputL (None, 5270)         0
__________________________________________________________________________________________________
input.drug1.fingerprints (Input (None, 2048)         0
__________________________________________________________________________________________________
input.drug2.descriptors (InputL (None, 5270)         0
__________________________________________________________________________________________________
input.drug2.fingerprints (Input (None, 2048)         0
__________________________________________________________________________________________________
input.dose1 (InputLayer)        (None, 1)            0
__________________________________________________________________________________________________
input.dose2 (InputLayer)        (None, 1)            0
__________________________________________________________________________________________________
cell.rnaseq (Model)             (None, 1000)         2945000     input.cell.rnaseq[0][0]
__________________________________________________________________________________________________
drug.descriptors (Model)        (None, 1000)         7273000     input.drug1.descriptors[0][0]
                                                                 input.drug2.descriptors[0][0]
__________________________________________________________________________________________________
drug.fingerprints (Model)       (None, 1000)         4051000     input.drug1.fingerprints[0][0]
                                                                 input.drug2.fingerprints[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 5002)         0           input.dose1[0][0]
                                                                 input.dose2[0][0]
                                                                 cell.rnaseq[1][0]
                                                                 drug.descriptors[1][0]
                                                                 drug.fingerprints[1][0]
                                                                 drug.descriptors[2][0]
                                                                 drug.fingerprints[2][0]
__________________________________________________________________________________________________
dense_10 (Dense)                (None, 1000)         5003000     concatenate_1[0][0]
__________________________________________________________________________________________________
dense_11 (Dense)                (None, 1000)         1001000     dense_10[0][0]
__________________________________________________________________________________________________
dense_12 (Dense)                (None, 1000)         1001000     dense_11[0][0]
__________________________________________________________________________________________________
dense_13 (Dense)                (None, 1)            1001        dense_12[0][0]
==================================================================================================
Total params: 21,275,001
Trainable params: 21,275,001
Non-trainable params: 0
__________________________________________________________________________________________________
Between random pairs in y_val:
mse: 0.6069
mae: 0.5458
r2: -0.9998
corr: 0.0001
Data points per epoch: train = 20158325, val = 5144721
Steps per epoch: train = 629948, val = 160773
Epoch 1/10
8078/629948 [..............................] - ETA: 50:20:54 - loss: 0.1955 - mae: 0.2982 - r2: 0.2964
```
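The numbers in the log above can be cross-checked by hand. The sketch below is not part of Uno. It assumes that each feature model is a stack of three biased dense layers of width 1000 (as `dense_feature_layers` in the printed params suggests), that the two drug inputs share one descriptor model and one fingerprint model (as the "Connected to" column suggests), and that steps per epoch are the dataset size divided by the batch size, rounded up. The helper names `dense_params` and `tower_params` are made up for illustration.

```python
# Input widths copied from the "Input features shapes" section of the log.
input_shapes = {
    "dose1": 1, "dose2": 1, "cell.rnaseq": 942,
    "drug1.descriptors": 5270, "drug1.fingerprints": 2048,
    "drug2.descriptors": 5270, "drug2.fingerprints": 2048,
}
total_inputs = sum(input_shapes.values())


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def tower_params(n_in, widths=(1000, 1000, 1000)):
    """Parameter count of a stack of dense layers (assumed architecture)."""
    total, prev = 0, n_in
    for width in widths:
        total += dense_params(prev, width)
        prev = width
    return total


cell = tower_params(942)           # cell.rnaseq model
descriptors = tower_params(5270)   # shared by drug1 and drug2
fingerprints = tower_params(2048)  # shared by drug1 and drug2

# Two raw dose inputs plus five 1000-wide feature embeddings.
concat_width = 2 + 5 * 1000
head = (dense_params(concat_width, 1000)
        + 2 * dense_params(1000, 1000)
        + dense_params(1000, 1))
total_params = cell + descriptors + fingerprints + head

# Ceiling division without floating point.
batch_size = 32
train_steps = -(-20158325 // batch_size)
val_steps = -(-5144721 // batch_size)

print(total_inputs, concat_width, total_params, train_steps, val_steps)
```

Under these assumptions the script reproduces the logged values: 15580 input dimensions, a 5002-wide concatenation, 21,275,001 parameters, and 629948 training and 160773 validation steps per epoch.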

Pilot1/Uno/p1infer.py

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
#! /usr/bin/env python

import argparse
import os
import pickle
import pandas as pd


OUT_DIR = 'p1save'


def get_parser(description='Run a trained machine learning model in inference mode on new data'):
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("-d", "--data",
                        help="data file to run inference on")
    parser.add_argument("-m", "--model_file",
                        help="saved trained model file")
    parser.add_argument("-k", "--keepcols", nargs='+', default=[],
                        help="columns from input data file to keep in prediction file; use 'all' to keep all original columns")
    parser.add_argument("-o", "--out_dir", default=OUT_DIR,
                        help="output directory")
    parser.add_argument("-p", "--prefix",
                        help="output prefix")
    parser.add_argument("-y", "--ycol", default=None,
                        help="0-based index or name of the column to be predicted")
    parser.add_argument("-C", "--ignore_categoricals", action='store_true',
                        help="ignore categorical feature columns")
    return parser


def main():
    parser = get_parser()
    args = parser.parse_args()

    # Output files are named <out_dir>/<prefix>.*; the prefix defaults to the data file name.
    prefix = args.prefix or os.path.basename(args.data)
    prefix = os.path.join(args.out_dir, prefix)
    if not os.path.exists(args.out_dir):
        os.makedirs(args.out_dir)

    df = pd.read_table(args.data, engine='c')
    df_x = df.copy()
    # Object-typed columns are treated as categorical: zeroed out or replaced by integer codes.
    cat_cols = df.select_dtypes(['object']).columns
    if args.ignore_categoricals:
        df_x[cat_cols] = 0
    else:
        df_x[cat_cols] = df_x[cat_cols].apply(lambda x: x.astype('category').cat.codes)

    keepcols = args.keepcols
    ycol = args.ycol
    if ycol:
        # A purely numeric --ycol is a 0-based column index; anything else is a column name.
        if ycol.isdigit():
            ycol = df_x.columns[int(ycol)]
        # Drop the target from the features but keep it in the prediction file.
        df_x = df_x.drop(ycol, axis=1)
        keepcols = [ycol] + keepcols
    if 'all' in keepcols:
        keepcols = list(df.columns)

    with open(args.model_file, 'rb') as f:
        model = pickle.load(f)

    # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same array.
    x = df_x.values
    y = model.predict(x)

    # Copy so that insert() does not modify a view of df.
    df_pred = df[keepcols].copy()
    df_pred.insert(0, 'Pred', y)

    fname = '{}.predicted.tsv'.format(prefix)
    df_pred.to_csv(fname, sep='\t', index=False, float_format='%.3g')
    print('Predictions saved in {}\n'.format(fname))


if __name__ == '__main__':
    main()
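One caveat with the categorical handling in `p1infer.py`: `astype('category').cat.codes` derives its integer codes from the values present in the file being read, so the same string can map to different codes in different files. The sketch below illustrates this with made-up toy values. `skwrapper` is not shown in this commit, so whether training encodes categoricals the same way is an assumption, and the fixed-vocabulary variant at the end is one possible remedy, not the repository's method.

```python
import pandas as pd

# Hypothetical values of one categorical column in a training file and
# in a later inference file (illustrative data only).
train = pd.Series(["lung", "breast", "lung"], dtype=object)
infer = pd.Series(["lung", "skin"], dtype=object)

# Per-file encoding, as done in p1infer.py: codes index the sorted
# unique values found in that particular file.
train_codes = train.astype("category").cat.codes.tolist()
infer_codes = infer.astype("category").cat.codes.tolist()
print(train_codes)  # "lung" is code 1 here ...
print(infer_codes)  # ... but code 0 here

# Possible remedy (an assumption, not part of the script): reuse the
# training vocabulary so that codes agree; unseen values become -1.
vocab = sorted(train.unique())
fixed_codes = pd.Categorical(infer, categories=vocab).codes.tolist()
print(fixed_codes)
```

If the pickled model was trained on integer-coded categoricals, persisting the vocabulary alongside the model would be one way to keep the two encodings aligned. Passing `-C` sidesteps the issue by zeroing these columns.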

Pilot1/Uno/p1train.py

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
#! /usr/bin/env python

import argparse
import os
import numpy as np
import pandas as pd
import random
from skwrapper import regress, classify, train, split_data


MODELS = ['LightGBM', 'XGBoost', 'RandomForest']
CV = 3
THREADS = 4
OUT_DIR = 'p1save'
BINS = 0
CUTOFFS = None
FEATURE_SUBSAMPLE = 0
SEED = 2018


def get_parser(description='Run machine learning training algorithms implemented in scikit-learn'):
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("-b", "--bins", type=int, default=BINS,
                        help="number of evenly distributed bins to make when classification mode is turned on")
    parser.add_argument("-c", "--classify", action="store_true",
                        help="convert the regression problem into classification based on category cutoffs")
    parser.add_argument("-d", "--data",
                        help="data file to train on")
    parser.add_argument("-g", "--groupcols", nargs='+',
                        help="names of columns to be used in cross validation partitioning")
    parser.add_argument("-m", "--models", nargs='+', default=MODELS,
                        help="list of regression models: XGBoost, XGB.1K, XGB.10K, RandomForest, RF.1K, RF.10K, AdaBoost, Linear, ElasticNet, Lasso, Ridge; or list of classification models: XGBoost, XGB.1K, XGB.10K, RandomForest, RF.1K, RF.10K, AdaBoost, Logistic, Gaussian, Bayes, KNN, SVM")
    parser.add_argument("-o", "--out_dir", default=OUT_DIR,
                        help="output directory")
    parser.add_argument("-p", "--prefix",
                        help="output prefix")
    parser.add_argument("-t", "--threads", type=int, default=THREADS,
                        help="number of threads per machine learning training job; -1 for using all threads")
    parser.add_argument("-y", "--ycol", default='0',
                        help="0-based index or name of the column to be predicted")
    parser.add_argument("--cutoffs", nargs='+', type=float, default=CUTOFFS,
                        help="list of cutoffs delineating prediction target categories")
    parser.add_argument("--cv", type=int, default=CV,
                        help="cross validation folds")
    parser.add_argument("--feature_subsample", type=int, default=FEATURE_SUBSAMPLE,
                        help="number of features to randomly sample from each category, 0 means using all features")
    parser.add_argument("-C", "--ignore_categoricals", action='store_true',
                        help="ignore categorical feature columns")
    parser.add_argument("--seed", type=int, default=SEED,
                        help="specify random seed")
    return parser


def set_seed(seed):
    # Setting PYTHONHASHSEED here only affects child processes; the hash seed
    # of the running interpreter is fixed at startup.
    os.environ['PYTHONHASHSEED'] = '0'
    np.random.seed(seed)
    random.seed(seed)


def main():
    parser = get_parser()
    args = parser.parse_args()
    set_seed(args.seed)

    # Output files are named <out_dir>/<prefix>.*; the prefix defaults to the data file name.
    prefix = args.prefix or os.path.basename(args.data)
    prefix = os.path.join(args.out_dir, prefix)

    df = pd.read_table(args.data, engine='c')
    x, y, splits, features = split_data(df, ycol=args.ycol, classify=args.classify, cv=args.cv,
                                        bins=args.bins, cutoffs=args.cutoffs, groupcols=args.groupcols,
                                        ignore_categoricals=args.ignore_categoricals, verbose=True)

    if args.classify and len(np.unique(y)) < 2:
        print('Not enough classes\n')
        return

    # Cross-validate each requested model and remember the best scorer.
    # np.Inf was removed in NumPy 2.0; np.inf works on all versions.
    best_score, best_model = -np.inf, None
    for model in args.models:
        if args.classify:
            score = classify(model, x, y, splits, features, threads=args.threads, prefix=prefix, seed=args.seed)
        else:
            score = regress(model, x, y, splits, features, threads=args.threads, prefix=prefix, seed=args.seed)
        if score >= best_score:
            best_score = score
            best_model = model

    print('Training the best model ({}={:.3g}) on the entire dataset...'.format(best_model, best_score))
    name = 'best.classifier' if args.classify else 'best.regressor'
    fname = train(best_model, x, y, features, classify=args.classify,
                  threads=args.threads, prefix=prefix, name=name, save=True)
    print('Model saved in {}\n'.format(fname))


if __name__ == '__main__':
    main()
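The model-selection loop in `main()` has two edge cases that are easy to miss: the `>=` comparison means a later model wins a tie, and a `NaN` score never compares as greater or equal, so it can never be selected. The standalone sketch below isolates that loop. The function name `pick_best` and all scores are hypothetical, and the real scores come from `skwrapper`, which is not shown in this commit.

```python
def pick_best(scores):
    """Mirror the selection loop in p1train.py over (model, score) pairs."""
    best_score, best_model = float("-inf"), None
    for model, score in scores:
        if score >= best_score:  # ties go to the later model
            best_score, best_model = score, model
    return best_model, best_score


# Hypothetical cross-validation scores, in the order models were requested.
tied = [("LightGBM", 0.71), ("XGBoost", 0.74), ("RandomForest", 0.74)]
print(pick_best(tied))

# A failed run that reports NaN is silently skipped.
with_nan = [("LightGBM", 0.71), ("XGBoost", float("nan"))]
print(pick_best(with_nan))

# If every score is NaN, no model is selected at all.
all_nan = [("LightGBM", float("nan"))]
print(pick_best(all_nan))
```

In the last case `best_model` stays `None`, so the subsequent call to `train(best_model, ...)` in the script would receive `None`. Guarding against that before the final training step may be worthwhile.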
