ECP-CANDLE
diff --git a/Pilot1/Uno/README.md b/Pilot1/Uno/README.md
Lines changed: 122 additions & 0 deletions
diff --git a/Pilot1/Uno/p1infer.py b/Pilot1/Uno/p1infer.py
Lines changed: 75 additions & 0 deletions
diff --git a/Pilot1/Uno/p1train.py b/Pilot1/Uno/p1train.py
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,122 @@
+## Uno: Predicting Tumor Dose Response across Multiple Data Sources
+
+#### Example output
+Uno can be trained with a subset of dose response data sources. Here is a command-line example of training with all six sources: CCLE, CTRP, gCSI, GDSC, NCI60 single-drug response, and ALMANAC drug-pair response.
+
+```
+uno_baseline_keras2.py --train_sources all --cache cache/all --use_landmark_genes --preprocess_rnaseq source_scale --no_feature_source --no_response_source
+Using TensorFlow backend.
+Params: {'activation': 'relu', 'batch_size': 32, 'dense': [1000, 1000, 1000], 'dense_feature_layers': [1000, 1000, 1000], 'drop': 0, 'epochs': 10, 'learning_rate': None, 'loss':
+'mse', 'optimizer': 'adam', 'residual': False, 'rng_seed': 2018, 'save': 'save/uno', 'scaling': 'std', 'feature_subsample': 0, 'validation_split': 0.2, 'solr_root': '', 'timeout'
+: -1, 'train_sources': ['all'], 'test_sources': ['train'], 'cell_types': None, 'cell_features': ['rnaseq'], 'drug_features': ['descriptors', 'fingerprints'], 'cv': 1, 'max_val_lo
+ss': 1.0, 'base_lr': None, 'reduce_lr': False, 'warmup_lr': False, 'batch_normalization': False, 'no_gen': False, 'config_file': '/raid/fangfang/Benchmarks/Pilot1/Uno/uno_default
+_model.txt', 'verbose': False, 'logfile': None, 'train_bool': True, 'shuffle': True, 'alpha_dropout': False, 'gpus': [], 'experiment_id': 'EXP.000', 'run_id': 'RUN.000', 'by_cell
+': None, 'by_drug': None, 'drug_median_response_min': -1, 'drug_median_response_max': 1, 'no_feature_source': True, 'no_response_source': True, 'use_landmark_genes': True, 'use_f
+iltered_genes': False, 'preprocess_rnaseq': 'source_scale', 'cp': False, 'tb': False, 'partition_by': None, 'cache': 'cache/ALL', 'single': False, 'export_data': None, 'growth_bi
+ns': 0, 'datatype': <class 'numpy.float32'>}
+Cache parameter file does not exist: cache/ALL.params.json
+Loading data from scratch ...
+Loaded 27769716 single drug dose response measurements
+Loaded 3686475 drug pair dose response measurements
+Combined dose response data contains sources: ['CCLE' 'CTRP' 'gCSI' 'GDSC' 'NCI60' 'SCL' 'SCLC' 'ALMANAC.FG'
+ 'ALMANAC.FF' 'ALMANAC.1A']
+Summary of combined dose response by source:
+              Growth  Sample  Drug1  Drug2  MedianDose
+Source
+ALMANAC.1A    208605      60    102    102    7.000000
+ALMANAC.FF   2062098      60     92     71    6.698970
+ALMANAC.FG   1415772      60    100     29    6.522879
+CCLE           93251     504     24      0    6.602060
+CTRP         6171005     887    544      0    6.585027
+GDSC         1894212    1075    249      0    6.505150
+NCI60       18862308      59  52671      0    6.000000
+SCL           301336      65    445      0    6.908485
+SCLC          389510      70    526      0    6.908485
+gCSI           58094     409     16      0    7.430334
+Combined raw dose response data has 3070 unique samples and 53520 unique drugs
+Limiting drugs to those with response min <= 1, max >= -1, span >= 0, median_min <= -1, median_max >= 1 ...
+Selected 47005 drugs from 53520
+Loaded combined RNAseq data: (15198, 943)
+Loaded combined dragon7 drug descriptors: (53507, 5271)
+Loaded combined dragon7 drug fingerprints: (53507, 2049)
+Filtering drug response data...
+  2375 molecular samples with feature and response data
+  46837 selected drugs with feature and response data
+Summary of filtered dose response by source:
+              Growth  Sample  Drug1  Drug2  MedianDose
+Source
+ALMANAC.1A    206580      60    101    101    7.000000
+ALMANAC.FF   2062098      60     92     71    6.698970
+ALMANAC.FG   1293465      60     98     27    6.522879
+CCLE           80213     474     22      0    6.602060
+CTRP         3397103     812    311      0    6.585027
+GDSC         1022204     672    213      0    6.505150
+NCI60       17190561      59  46272      0    6.000000
+gCSI           50822     357     16      0    7.430334
+Grouped response data by drug_pair: 51763 groups
+Input features shapes:
+  dose1: (1,)
+  dose2: (1,)
+  cell.rnaseq: (942,)
+  drug1.descriptors: (5270,)
+  drug1.fingerprints: (2048,)
+  drug2.descriptors: (5270,)
+  drug2.fingerprints: (2048,)
+Total input dimensions: 15580
+Saved data to cache: cache/all.pkl
+Combined model:
+__________________________________________________________________________________________________
+Layer (type)                    Output Shape         Param #     Connected to
+==================================================================================================
+input.cell.rnaseq (InputLayer)  (None, 942)          0
+__________________________________________________________________________________________________
+input.drug1.descriptors (InputL (None, 5270)         0
+__________________________________________________________________________________________________
+input.drug1.fingerprints (Input (None, 2048)         0
+__________________________________________________________________________________________________
+input.drug2.descriptors (InputL (None, 5270)         0
+__________________________________________________________________________________________________
+input.drug2.fingerprints (Input (None, 2048)         0
+__________________________________________________________________________________________________
+input.dose1 (InputLayer)        (None, 1)            0
+__________________________________________________________________________________________________
+input.dose2 (InputLayer)        (None, 1)            0
+__________________________________________________________________________________________________
+cell.rnaseq (Model)             (None, 1000)         2945000     input.cell.rnaseq[0][0]
+__________________________________________________________________________________________________
+drug.descriptors (Model)        (None, 1000)         7273000     input.drug1.descriptors[0][0]
+                                                                 input.drug2.descriptors[0][0]
+__________________________________________________________________________________________________
+drug.fingerprints (Model)       (None, 1000)         4051000     input.drug1.fingerprints[0][0]
+                                                                 input.drug2.fingerprints[0][0]
+__________________________________________________________________________________________________
+concatenate_1 (Concatenate)     (None, 5002)         0           input.dose1[0][0]
+                                                                 input.dose2[0][0]
+                                                                 cell.rnaseq[1][0]
+                                                                 drug.descriptors[1][0]
+                                                                 drug.fingerprints[1][0]
+                                                                 drug.descriptors[2][0]
+                                                                 drug.fingerprints[2][0]
+__________________________________________________________________________________________________
+dense_10 (Dense)                (None, 1000)         5003000     concatenate_1[0][0]
+__________________________________________________________________________________________________
+dense_11 (Dense)                (None, 1000)         1001000     dense_10[0][0]
+__________________________________________________________________________________________________
+dense_12 (Dense)                (None, 1000)         1001000     dense_11[0][0]
+__________________________________________________________________________________________________
+dense_13 (Dense)                (None, 1)            1001        dense_12[0][0]
+==================================================================================================
+Total params: 21,275,001
+Trainable params: 21,275,001
+Non-trainable params: 0
+__________________________________________________________________________________________________
+Between random pairs in y_val:
+  mse: 0.6069
+  mae: 0.5458
+  r2: -0.9998
+  corr: 0.0001
+Data points per epoch: train = 20158325, val = 5144721
+Steps per epoch: train = 629948, val = 160773
+Epoch 1/10
+  8078/629948 [..............................] - ETA: 50:20:54 - loss: 0.1955 - mae: 0.2982 - r2: 0.2964
+```
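The numbers in the log above can be cross-checked by hand. The following is a minimal sketch, not part of Uno itself, that recomputes the parameter counts, the total input dimension, and the steps per epoch. It assumes each feature submodel is three dense layers of 1000 units with biases (as the printed `dense_feature_layers` suggests), that the two drug inputs share one submodel per feature type (as the "Connected to" column suggests), and that steps per epoch is the ceiling of data points over `batch_size`. The helper names `dense_params` and `mlp_params` are hypothetical.

```python
import math


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return (n_in + 1) * n_out


def mlp_params(n_in, widths):
    """Total parameters of a stack of dense layers with the given widths."""
    total = 0
    for width in widths:
        total += dense_params(n_in, width)
        n_in = width
    return total


feature_layers = [1000, 1000, 1000]
cell = mlp_params(942, feature_layers)            # cell.rnaseq
descriptors = mlp_params(5270, feature_layers)    # drug.descriptors (shared by drug1 and drug2)
fingerprints = mlp_params(2048, feature_layers)   # drug.fingerprints (shared by drug1 and drug2)

# Two dose scalars, one cell encoding, two encodings for each of the two drugs.
concat_width = 2 + 1000 + 2 * (1000 + 1000)
head = mlp_params(concat_width, [1000, 1000, 1000, 1])  # dense_10 .. dense_13
total = cell + descriptors + fingerprints + head

# Raw input width before any encoding.
input_dims = 2 + 942 + 2 * (5270 + 2048)

batch_size = 32
train_steps = math.ceil(20158325 / batch_size)
val_steps = math.ceil(5144721 / batch_size)

print(cell, descriptors, fingerprints, total, input_dims, train_steps, val_steps)
```

Under these assumptions the recomputed values agree with the summary: 21,275,001 parameters, 15580 input dimensions, and 629948 / 160773 steps per epoch.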
@@ -0,0 +1,75 @@
+#! /usr/bin/env python
+
+import argparse
+import os
+import pickle
+import pandas as pd
+
+
+OUT_DIR = 'p1save'
+
+
+def get_parser(description='Run a trained machine learning model in inference mode on new data'):
+    parser = argparse.ArgumentParser(description=description)
+    parser.add_argument("-d", "--data",
+                        help="data file to run inference on")
+    parser.add_argument("-m", "--model_file",
+                        help="saved trained model file")
+    parser.add_argument("-k", "--keepcols", nargs='+', default=[],
+                        help="columns from input data file to keep in prediction file; use 'all' to keep all original columns")
+    parser.add_argument("-o", "--out_dir", default=OUT_DIR,
+                        help="output directory")
+    parser.add_argument("-p", "--prefix",
+                        help="output prefix")
+    parser.add_argument("-y", "--ycol", default=None,
+                        help="0-based index or name of the column to be predicted")
+    parser.add_argument("-C", "--ignore_categoricals", action='store_true',
+                        help="ignore categorical feature columns")
+    return parser
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+
+    prefix = args.prefix or os.path.basename(args.data)
+    prefix = os.path.join(args.out_dir, prefix)
+    if not os.path.exists(args.out_dir):
+        os.makedirs(args.out_dir)
+
+    df = pd.read_table(args.data, engine='c')
+    df_x = df.copy()
+    cat_cols = df.select_dtypes(['object']).columns
+    if args.ignore_categoricals:
+        df_x[cat_cols] = 0
+    else:
+        df_x[cat_cols] = df_x[cat_cols].apply(lambda x: x.astype('category').cat.codes)
+
+    keepcols = args.keepcols
+    ycol = args.ycol
+    if ycol:
+        if ycol.isdigit():
+            ycol = df_x.columns[int(ycol)]
+        df_x = df_x.drop(ycol, axis=1)
+        keepcols = [ycol] + keepcols
+    if 'all' in keepcols:
+        keepcols = list(df.columns)
+
+    with open(args.model_file, 'rb') as f:
+        model = pickle.load(f)
+
+    # DataFrame.as_matrix() was removed in pandas 1.0; .values works across versions
+    x = df_x.values
+    y = model.predict(x)
+
+    # copy so that insert() below does not modify a slice of the original frame
+    df_pred = df[keepcols].copy()
+    df_pred.insert(0, 'Pred', y)
+
+    fname = '{}.predicted.tsv'.format(prefix)
+    df_pred.to_csv(fname, sep='\t', index=False, float_format='%.3g')
+    print('Predictions saved in {}\n'.format(fname))
+
+
+if __name__ == '__main__':
+    main()
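One caveat with the `cat.codes` encoding used in `main()`: the integer assigned to each label depends on the set of labels present in the file being read, so the same label can receive different codes in different files. Whether this affects predictions depends on how the training side (`skwrapper.split_data`, not shown in this diff) encodes categorical columns. The sketch below uses made-up cell line labels to show the effect, and pins the categories with `pd.Categorical` as one possible remedy; that remedy is an assumption for illustration, not what the script does.

```python
import pandas as pd

# Hypothetical categorical column as seen at training and at inference time.
train_labels = pd.Series(['A549', 'HeLa', 'MCF7'])
infer_labels = pd.Series(['HeLa', 'MCF7'])

# Codes derived independently from each file, as in p1infer.py.
train_codes = train_labels.astype('category').cat.codes.tolist()
infer_codes = infer_labels.astype('category').cat.codes.tolist()

# Codes derived from the training vocabulary, so 'HeLa' keeps the same code.
vocabulary = sorted(train_labels.unique())
pinned_codes = pd.Categorical(infer_labels, categories=vocabulary).codes.tolist()

print(train_codes)   # 'HeLa' is 1 here
print(infer_codes)   # 'HeLa' is 0 here
print(pinned_codes)  # 'HeLa' is 1 again
```

Pinning the vocabulary would require saving it alongside the model at training time, which the scripts in this diff do not appear to do.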
@@ -0,0 +1,95 @@
+#! /usr/bin/env python
+
+import argparse
+import os
+import numpy as np
+import pandas as pd
+import random
+from skwrapper import regress, classify, train, split_data
+
+
+MODELS = ['LightGBM', 'XGBoost', 'RandomForest']
+CV = 3
+THREADS = 4
+OUT_DIR = 'p1save'
+BINS = 0
+CUTOFFS = None
+FEATURE_SUBSAMPLE = 0
+SEED = 2018
+
+
+def get_parser(description='Run machine learning training algorithms implemented in scikit-learn'):
+    parser = argparse.ArgumentParser(description=description)
+    parser.add_argument("-b", "--bins", type=int, default=BINS,
+                        help="number of evenly distributed bins to make when classification mode is turned on")
+    parser.add_argument("-c", "--classify",  action="store_true",
+                        help="convert the regression problem into classification based on category cutoffs")
+    parser.add_argument("-d", "--data",
+                        help="data file to train on")
+    parser.add_argument("-g", "--groupcols", nargs='+',
+                        help="names of columns to be used in cross validation partitioning")
+    parser.add_argument("-m", "--models", nargs='+', default=MODELS,
+                        help="list of regression models: XGBoost, XGB.1K, XGB.10K, RandomForest, RF.1K, RF.10K, AdaBoost, Linear, ElasticNet, Lasso, Ridge; or list of classification models: XGBoost, XGB.1K, XGB.10K, RandomForest, RF.1K, RF.10K, AdaBoost, Logistic, Gaussian, Bayes, KNN, SVM")
+    parser.add_argument("-o", "--out_dir", default=OUT_DIR,
+                        help="output directory")
+    parser.add_argument("-p", "--prefix",
+                        help="output prefix")
+    parser.add_argument("-t", "--threads", type=int, default=THREADS,
+                        help="number of threads per machine learning training job; -1 for using all threads")
+    parser.add_argument("-y", "--ycol", default='0',
+                        help="0-based index or name of the column to be predicted")
+    parser.add_argument("--cutoffs", nargs='+', type=float, default=CUTOFFS,
+                        help="list of cutoffs delineating prediction target categories")
+    parser.add_argument("--cv", type=int, default=CV,
+                        help="cross validation folds")
+    parser.add_argument("--feature_subsample", type=int, default=FEATURE_SUBSAMPLE,
+                        help="number of features to randomly sample from each category, 0 means using all features")
+    parser.add_argument("-C", "--ignore_categoricals", action='store_true',
+                        help="ignore categorical feature columns")
+    parser.add_argument("--seed", type=int, default=SEED,
+                        help="specify random seed")
+    return parser
+
+
+def set_seed(seed):
+    # PYTHONHASHSEED is read at interpreter startup, so setting it here only affects child processes
+    os.environ['PYTHONHASHSEED'] = '0'
+    np.random.seed(seed)
+    random.seed(seed)
+
+
+def main():
+    parser = get_parser()
+    args = parser.parse_args()
+    set_seed(args.seed)
+
+    prefix = args.prefix or os.path.basename(args.data)
+    prefix = os.path.join(args.out_dir, prefix)
+
+    df = pd.read_table(args.data, engine='c')
+    x, y, splits, features = split_data(df, ycol=args.ycol, classify=args.classify, cv=args.cv,
+                                        bins=args.bins, cutoffs=args.cutoffs, groupcols=args.groupcols,
+                                        ignore_categoricals=args.ignore_categoricals, verbose=True)
+
+    if args.classify and len(np.unique(y)) < 2:
+        print('Not enough classes\n')
+        return
+
+    # np.Inf was removed in NumPy 2.0; np.inf is the supported spelling
+    best_score, best_model = -np.inf, None
+    for model in args.models:
+        if args.classify:
+            score = classify(model, x, y, splits, features, threads=args.threads, prefix=prefix, seed=args.seed)
+        else:
+            score = regress(model, x, y, splits, features, threads=args.threads, prefix=prefix, seed=args.seed)
+        if score >= best_score:
+            best_score = score
+            best_model = model
+
+    print('Training the best model ({}={:.3g}) on the entire dataset...'.format(best_model, best_score))
+    name = 'best.classifier' if args.classify else 'best.regressor'
+    fname = train(best_model, x, y, features, classify=args.classify,
+                  threads=args.threads, prefix=prefix, name=name, save=True)
+    print('Model saved in {}\n'.format(fname))
+
+
+if __name__ == '__main__':
+    main()
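The `--cutoffs` option is described as a list of thresholds delineating the target categories when `--classify` is on. The actual binning happens inside `skwrapper.split_data`, which is not part of this diff, so the following is only a rough illustration of the idea using `np.digitize`; the growth values and cutoffs are invented for the example.

```python
import numpy as np

# Hypothetical continuous targets (e.g. growth values) and user-supplied cutoffs.
growth = np.array([-0.8, -0.2, 0.1, 0.6])
cutoffs = [0.0, 0.5]

# Class 0: below 0.0; class 1: in [0.0, 0.5); class 2: 0.5 and above.
classes = np.digitize(growth, cutoffs)
n_classes = len(np.unique(classes))

print(classes.tolist(), n_classes)
```

With cutoffs that leave every sample on one side, `n_classes` would be 1, which is the situation the `len(np.unique(y)) < 2` guard in `main()` reports as "Not enough classes".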