Skip to content

Commit 7379304

Browse files
committed
add README
1 parent 1163b4d commit 7379304

File tree

1 file changed

+193
-0
lines changed

1 file changed

+193
-0
lines changed

examples/M16/README.md

Lines changed: 193 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,193 @@
1+
# Feature Selection examples
2+
3+
The code is for demonstrate feature selection methods that CANDLE provides.
4+
5+
6+
```
7+
$ python M16_test.py
8+
9+
Importing candle utils for keras
10+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/small_drug_descriptor_data_unique_samples.txt
11+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/small_drug_response_data.txt
12+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/Gene_Expression_Full_Data_Unique_Samples.txt
13+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/CCLE_NCI60_Gene_Expression_Full_Data.txt
14+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cgp.v7.0.entrez.gmt
15+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cgp.v7.0.symbols.gmt
16+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.biocarta.v7.0.entrez.gmt
17+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.biocarta.v7.0.symbols.gmt
18+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.kegg.v7.0.entrez.gmt
19+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.kegg.v7.0.symbols.gmt
20+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.pid.v7.0.entrez.gmt
21+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.pid.v7.0.symbols.gmt
22+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.reactome.v7.0.entrez.gmt
23+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.reactome.v7.0.symbols.gmt
24+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.bp.v7.0.entrez.gmt
25+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.bp.v7.0.symbols.gmt
26+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.cc.v7.0.entrez.gmt
27+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.cc.v7.0.symbols.gmt
28+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.mf.v7.0.entrez.gmt
29+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.mf.v7.0.symbols.gmt
30+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c6.all.v7.0.entrez.gmt
31+
Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c6.all.v7.0.symbols.gmt
32+
Gene Set data is locally stored at /Users/hsyoo/projects/CANDLE/Benchmarks/common/../Data/examples/Gene_Sets/MSigDB.v7.0/
33+
34+
35+
Testing select_features_by_missing_values
36+
Drug descriptor dataframe includes 10 drugs (rows) and 10 drug descriptor features (columns)
37+
MW AMW Sv Se ... Mv Psi_e_1d Psi_e_1s VE3sign_X
38+
Drug_1 475.40 8.804 34.718 54.523 ... 0.643 NaN NaN NaN
39+
Drug_10 457.71 10.898 29.154 43.640 ... 0.694 NaN NaN -2.752
40+
Drug_100 561.80 6.688 49.975 83.607 ... 0.595 NaN NaN -4.335
41+
Drug_1000 362.51 6.840 32.794 52.461 ... 0.619 NaN NaN -9.968
42+
Drug_1001 628.83 7.763 51.593 81.570 ... 0.637 NaN NaN -2.166
43+
Drug_1002 377.19 10.777 26.191 36.578 ... 0.748 NaN NaN -1.526
44+
Drug_1003 371.42 8.254 30.896 45.473 ... 0.687 NaN NaN -4.983
45+
Drug_1004 453.60 8.100 37.949 55.872 ... 0.678 NaN NaN -4.100
46+
Drug_1005 277.35 7.704 23.940 35.934 ... 0.665 NaN NaN -5.234
47+
Drug_1006 409.47 8.189 34.423 50.356 ... 0.688 NaN NaN -2.513
48+
49+
[10 rows x 10 columns]
50+
Select features with missing rates smaller than 0.1
51+
Feature IDs [0 1 2 3 4 5 6]
52+
Select features with missing rates smaller than 0.3
53+
Feature IDs [0 1 2 3 4 5 6 9]
54+
55+
56+
Testing select_features_by_variation
57+
Select features with a variance larger than 100
58+
Feature IDs [0 3 5]
59+
Select the top 2 features with the largest standard deviation
60+
Feature IDs [0 5]
61+
62+
63+
Testing select_decorrelated_features
64+
Select features that are not identical to each other and are not all missing.
65+
Feature IDs [0 1 2 3 4 5 6 9]
66+
Select features whose absolute mutual Spearman correlation coefficient is smaller than 0.8
67+
Feature IDs [0 2 6 9]
68+
69+
70+
Testing generate_cross_validation_partition
71+
Generate 5-fold cross-validation partition of 10 samples twice
72+
[[[0, 5], [1, 2, 3, 4, 6, 7, 8, 9]], [[1, 6], [0, 2, 3, 4, 5, 7, 8, 9]], [[2, 7], [0, 1, 3, 4, 5, 6, 8, 9]], [[3, 8], [0, 1, 2, 4, 5, 6, 7, 9]], [[4, 9], [0, 1, 2, 3, 5, 6, 7, 8]], [[5, 8], [0, 1, 2, 3, 4, 6, 7, 9]], [[3, 9], [0, 1, 2, 4, 5, 6, 7, 8]], [[2, 4], [0, 1, 3, 5, 6, 7, 8, 9]], [[1, 7], [0, 2, 3, 4, 5, 6, 8, 9]], [[0, 6], [1, 2, 3, 4, 5, 7, 8, 9]]]
73+
Drug response data of 5 cell lines treated by various drugs.
74+
SOURCE CELL DRUG AUC EC50 EC50se R2fit HS
75+
0 CCLE CCLE.22RV1 CCLE.1 0.7153 5.660 0.6867 0.9533 0.6669
76+
1 CCLE CCLE.22RV1 CCLE.10 0.9579 7.023 0.7111 0.4332 4.0000
77+
2 CCLE CCLE.22RV1 CCLE.11 0.4130 7.551 0.0385 0.9948 1.3380
78+
3 CCLE CCLE.22RV1 CCLE.12 0.8004 5.198 11.7100 0.9944 4.0000
79+
4 CCLE CCLE.22RV1 CCLE.13 0.5071 7.149 0.3175 0.8069 1.0150
80+
.. ... ... ... ... ... ... ... ...
81+
95 CCLE CCLE.697 CCLE.12 0.7869 5.278 20.1200 0.8856 4.0000
82+
96 CCLE CCLE.697 CCLE.13 0.4433 7.474 0.0265 0.9978 3.7080
83+
97 CCLE CCLE.697 CCLE.14 0.4337 7.466 0.0106 0.9996 3.4330
84+
98 CCLE CCLE.697 CCLE.15 0.8721 3.097 29.1300 0.4884 0.2528
85+
99 CCLE CCLE.697 CCLE.16 0.7955 7.496 0.1195 0.9396 1.9560
86+
87+
[100 rows x 8 columns]
88+
Generate partition indices to divide the data into 4 sets without shared cell lines for 5 times.
89+
[[[68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91], [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67], [92, 93, 94, 95, 96, 97, 98, 99], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]], [[44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67], [92, 93, 94, 95, 96, 97, 98, 99], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]], [[92, 93, 94, 95, 96, 97, 98, 99], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43], [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]], [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43], [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91], [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 92, 93, 94, 95, 96, 97, 98, 99]], [[24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43], [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91], [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 92, 93, 94, 95, 96, 97, 98, 99]]]
90+
Using TensorFlow backend.
91+
...
92+
found 2 batches
93+
found 0 numerical covariates...
94+
found 0 categorical variables:
95+
Standardizing Data across genes.
96+
Fitting L/S model and finding priors
97+
Finding parametric adjustments
98+
99+
100+
101+
Testing quantile_normalization
102+
Gene expression data of 897 cell lines (columns) and 17741 genes (rows).
103+
CCL_61 CCL_62 CCL_63 ... CCL_1076 CCL_1077 CCL_1078
104+
entrezID geneSymbol ...
105+
1 A1BG 0.99 0.03 0.36 ... 2.56 3.55 3.04
106+
29974 A1CF 4.03 3.03 0.00 ... 0.00 0.03 0.00
107+
2 A2M 2.68 0.03 0.16 ... 0.77 0.31 1.20
108+
144568 A2ML1 0.07 0.07 0.01 ... 0.01 0.00 1.09
109+
127550 A3GALT2 0.15 0.00 0.06 ... 2.34 0.00 0.03
110+
... ... ... ... ... ... ... ...
111+
440590 ZYG11A 0.41 0.06 1.70 ... 0.75 3.44 2.44
112+
79699 ZYG11B 4.45 4.23 3.08 ... 4.25 3.61 3.68
113+
7791 ZYX 4.65 5.72 6.67 ... 7.78 4.12 5.97
114+
23140 ZZEF1 4.14 3.98 3.90 ... 4.62 3.76 3.54
115+
26009 ZZZ3 4.77 5.01 3.90 ... 4.38 3.46 3.60
116+
117+
[17741 rows x 897 columns]
118+
Before normalization
119+
Max difference of third quartile between cell lines is 1.86
120+
Max difference of median between cell lines is 2.25
121+
Max difference of first quartile between cell lines is 0.5
122+
After normalization
123+
Max difference of third quartile between cell lines is 0.01
124+
Max difference of median between cell lines is 0.02
125+
Max difference of first quartile between cell lines is 0.06
126+
127+
128+
Testing generate_gene_set_data
129+
Generate gene-set-level data of 897 cell lines and 189 oncogenic signature gene sets
130+
GLI1_UP.V1_DN GLI1_UP.V1_UP ... LEF1_UP.V1_DN LEF1_UP.V1_UP
131+
CCL_61 -0.031096 0.283946 ... 0.096461 -0.329343
132+
CCL_62 0.362855 -0.101684 ... 0.426951 -0.477634
133+
CCL_63 -0.304989 -0.165160 ... 0.036932 -0.201916
134+
CCL_64 -0.037737 -0.043124 ... 0.154256 -0.210188
135+
CCL_65 0.102477 0.438871 ... -0.166487 0.287382
136+
... ... ... ... ... ...
137+
CCL_1074 0.508978 0.137934 ... 0.148213 0.166717
138+
CCL_1075 -0.145029 0.216169 ... -0.067391 0.258455
139+
CCL_1076 -0.357758 0.337235 ... 0.008950 0.186134
140+
CCL_1077 0.086597 -0.266070 ... 0.217244 -0.276022
141+
CCL_1078 0.374237 -0.428383 ... 0.312984 -0.303721
142+
143+
[897 rows x 189 columns]
144+
Generate gene-set-level data of 897 cell lines and 186 KEGG pathways
145+
KEGG_GLYCOLYSIS_GLUCONEOGENESIS ... KEGG_VIRAL_MYOCARDITIS
146+
CCL_61 6.495365 ... -30.504868
147+
CCL_62 30.679006 ... -7.205641
148+
CCL_63 10.534238 ... -5.414998
149+
CCL_64 6.142140 ... -10.555601
150+
CCL_65 -0.303868 ... -9.784998
151+
... ... ... ...
152+
CCL_1074 -1.945281 ... 6.891960
153+
CCL_1075 -21.373730 ... 0.612092
154+
CCL_1076 -11.711818 ... -10.353794
155+
CCL_1077 -11.576702 ... -31.679962
156+
CCL_1078 -10.355489 ... -26.232325
157+
158+
[897 rows x 186 columns]
159+
160+
161+
Testing combat_batch_effect_removal
162+
Gene expression data of 60 NCI60 cell lines and 1018 CCLE cell lines with 17741 genes.
163+
NCI60.786-0|CCL_1 ... CCLE.ZR7530|CCL_1078
164+
entrezID geneSymbol ...
165+
1 A1BG 0.00 ... 3.04
166+
29974 A1CF 0.00 ... 0.00
167+
2 A2M 0.00 ... 1.20
168+
144568 A2ML1 0.00 ... 1.09
169+
127550 A3GALT2 0.00 ... 0.03
170+
... ... ... ...
171+
440590 ZYG11A 0.01 ... 2.44
172+
79699 ZYG11B 3.37 ... 3.68
173+
7791 ZYX 7.05 ... 5.97
174+
23140 ZZEF1 4.05 ... 3.54
175+
26009 ZZZ3 4.10 ... 3.60
176+
177+
[17741 rows x 1078 columns]
178+
Before removal of batch effect between NCI60 and CCLE datasets
179+
Average third quartile of NCI60 cell lines is 4.0
180+
Average median of NCI60 cell lines is 1.71
181+
Average first quartile of NCI60 cell lines is 0.01
182+
Average third quartile of CCLE cell lines is 4.88
183+
Average median of CCLE cell lines is 2.75
184+
Average first quartile of CCLE cell lines is 0.14
185+
Adjusting data
186+
After removal of batch effect between NCI60 and CCLE datasets
187+
Average third quartile of NCI60 cell lines is 4.81
188+
Average median of NCI60 cell lines is 2.65
189+
Average first quartile of NCI60 cell lines is 0.23
190+
Average third quartile of CCLE cell lines is 4.83
191+
Average median of CCLE cell lines is 2.72
192+
Average first quartile of CCLE cell lines is 0.13
193+
```

0 commit comments

Comments
 (0)