|
| 1 | +# Background |
| 2 | + |
| 3 | +Data preprocessing is an important front-end step in data analysis that prepares data for subsequent analysis. |
| 4 | +It not only enables subsequent analysis by processing and transforming data, but can also influence the quality of that analysis, sometimes significantly. |
| 5 | +Common examples of data preprocessing include data standardization and normalization to remove or suppress noise, removal of batch effects to combine datasets for larger studies, and generation of new data representations to enable new analyses. |
| 6 | +Feature selection can be viewed as a kind of data preprocessing for prediction analysis. |
| 7 | +Its goal is to select a (minimal) subset of the available features from which prediction models with good performance can be constructed. |
| 8 | +Performance can be evaluated from multiple aspects, such as prediction accuracy and the speed of constructing the prediction model. |
| 9 | + |
| 10 | +The data preprocessing methods can generate data partitions to enable flexible cross-validation analysis, normalize and remove batch effects from gene expression data of cancer cells, and generate genomic representations at the gene set level for cancer cells. |
| 11 | +The feature selection methods can filter features based on missing values and variation, and perform feature decorrelation. |
| 12 | +Features with little variation might not be useful for prediction, and highly correlated features need not all be included in the prediction model. |
| 13 | +We also implement and extend the co-expression extrapolation (COXEN) gene selection method for the Pilot 1 project [10], which can select predictive and generalizable genes for predicting drug response in precision oncology applications. |
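
The filtering ideas described above (missing values, variation, decorrelation) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the CANDLE implementation: the function names `pearson` and `filter_features`, the thresholds, and the toy feature names are all hypothetical.

```python
import math
from statistics import pvariance


def pearson(x, y):
    """Pearson correlation over positions where both values are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_features(columns, max_missing_rate=0.25, min_variance=1e-8,
                    max_abs_corr=0.9):
    """Return names of features that pass all three filters.

    columns maps a feature name to a list of values; None marks a missing value.
    All thresholds are illustrative defaults, not values taken from CANDLE.
    """
    kept = []
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        # 1. Drop features with too many missing values.
        if 1 - len(observed) / len(values) > max_missing_rate:
            continue
        # 2. Drop features with (almost) no variation.
        if len(observed) < 2 or pvariance(observed) < min_variance:
            continue
        # 3. Greedy decorrelation: skip a feature that is highly correlated
        #    with one that has already been kept.
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue
        kept.append(name)
    return kept


# Toy data with hypothetical feature names.
columns = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.1],    # nearly a multiple of g1
    "g3": [5.0, 5.0, 5.0, 5.0],    # constant
    "g4": [1.0, None, None, 2.0],  # half of the values are missing
    "g5": [4.0, 1.0, 3.0, 2.0],
}
selected = filter_features(columns)
```

The decorrelation step here is order dependent: the first feature of a correlated group is kept and later ones are dropped. Other strategies, such as keeping the feature with the highest variance, are equally reasonable.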
| 14 | + |
| 15 | +# General Data Preprocessing Functions |
| 16 | + |
| 17 | +```generate_cross_validation_partition``` |
| 18 | + |
| 19 | +To flexibly generate data partitions for cross-validation analysis, such as partitioning of grouped samples into sets that do not share groups. |
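
As a rough illustration of partitioning grouped samples into sets that do not share groups, the sketch below assigns whole groups to folds. It is not the code behind `generate_cross_validation_partition`; the function name `group_k_fold`, its parameters, and the balancing heuristic are assumptions made for this example.

```python
import random
from collections import defaultdict


def group_k_fold(groups, n_folds=3, seed=0):
    """Split sample indices into n_folds folds that never share a group.

    groups[i] is the group label of sample i (for example a cell line ID).
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    labels = sorted(by_group)
    # Shuffle so that ties between equally sized groups are broken at random
    # but reproducibly for a given seed.
    random.Random(seed).shuffle(labels)
    # Place the largest groups first, always into the currently smallest fold,
    # which keeps the fold sizes roughly balanced.
    labels.sort(key=lambda label: len(by_group[label]), reverse=True)

    folds = [[] for _ in range(n_folds)]
    for label in labels:
        min(folds, key=len).extend(by_group[label])
    return folds


# Nine samples that belong to five hypothetical groups.
groups = ["a", "a", "b", "c", "c", "c", "d", "e", "e"]
folds = group_k_fold(groups, n_folds=3)
```

Each fold can then serve once as the held-out set while the remaining folds are used for training, so that no group contributes samples to both sides of a split.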
| 20 | + |
| 21 | +# Data Preprocessing Functions Specific to Pilot 1 Applications |
| 22 | + |
| 23 | +```quantile_normalization``` |
| 24 | + |
| 25 | +To perform quantile normalization of genomic data [8] with tolerance of missing values. |
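
One possible way to make quantile normalization tolerate missing values is sketched below. It is an assumption-laden illustration, not the CANDLE function: the reference distribution is built by interpolating each sample's observed values onto a common quantile grid, missing entries (`None`) stay missing, every sample is assumed to have at least one observed value, and ties are not averaged. The names `_interp` and `quantile_normalize` are hypothetical.

```python
def _interp(sorted_vals, q):
    """Linearly interpolate a sorted list at quantile q in [0, 1]."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def quantile_normalize(samples):
    """Quantile-normalize samples (one list of gene values per sample).

    None marks a missing value and is left in place.
    """
    observed = [sorted(v for v in s if v is not None) for s in samples]
    n = max(len(o) for o in observed)
    grid = [i / (n - 1) for i in range(n)] if n > 1 else [0.0]
    # Reference distribution: mean across samples at each quantile.
    reference = [sum(_interp(o, q) for o in observed) / len(observed)
                 for q in grid]

    result = []
    for s in samples:
        # Indices of the observed values, ordered by value.
        order = sorted((i for i, v in enumerate(s) if v is not None),
                       key=lambda i: s[i])
        out = [None] * len(s)
        for rank, i in enumerate(order):
            q = rank / (len(order) - 1) if len(order) > 1 else 0.5
            out[i] = _interp(reference, q)
        result.append(out)
    return result


# Two samples measured on three genes; the second sample has a missing value.
normalized = quantile_normalize([[5.0, 2.0, 3.0], [4.0, None, 6.0]])
```

After normalization, the observed values of every sample follow the same reference distribution, while the rank order within each sample is preserved.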
| 26 | + |
| 27 | +```combat_batch_effect_removal``` |
| 28 | + |
| 29 | +To perform ComBat analysis [9] on gene expression data to remove batch effects. |
| 30 | + |
| 31 | +```generate_gene_set_data``` |
| 32 | + |
| 33 | +To calculate genomic representations at the gene set level, such as the average expression value of the genes in a pathway or the total number of SNP mutations in a genetic pathway. |
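
The two examples mentioned above, average expression per pathway and total mutation count per pathway, can be illustrated with a small aggregation helper. This is a minimal sketch rather than the `generate_gene_set_data` implementation; the function name `gene_set_scores`, the `reduce` parameter, and the gene set names are hypothetical.

```python
def gene_set_scores(gene_values, gene_sets, reduce="mean"):
    """Aggregate per-gene values of one sample to the gene set level.

    gene_values maps a gene symbol to a number (expression or mutation count).
    gene_sets maps a gene set name to its member genes.
    Genes absent from gene_values are ignored; a gene set with no measured
    gene gets None.
    """
    scores = {}
    for name, genes in gene_sets.items():
        present = [gene_values[g] for g in genes if g in gene_values]
        if not present:
            scores[name] = None
        elif reduce == "mean":
            scores[name] = sum(present) / len(present)
        elif reduce == "sum":
            scores[name] = sum(present)
        else:
            raise ValueError(f"unknown reduce option: {reduce!r}")
    return scores


# Hypothetical gene sets and toy measurements for a single sample.
gene_sets = {
    "p53_pathway": ["TP53", "MDM2", "CDKN1A"],
    "egfr_signaling": ["EGFR"],
}
expression = {"TP53": 2.0, "MDM2": 4.0, "EGFR": 6.0}
mutation_counts = {"TP53": 1, "MDM2": 0, "EGFR": 2}

mean_expression = gene_set_scores(expression, gene_sets, reduce="mean")
total_mutations = gene_set_scores(mutation_counts, gene_sets, reduce="sum")
```

Applying the helper to every sample yields a samples-by-gene-sets matrix that can replace or complement the gene-level features.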
| 34 | + |
| 35 | + |
1 | 36 | # Feature Selection examples
|
2 | 37 |
|
3 | 38 | The code demonstrates feature selection methods that CANDLE provides.
|
@@ -184,7 +219,7 @@ Using TensorFlow backend.
|
184 | 219 | ...
|
185 | 220 | found 2 batches
|
186 | 221 | found 0 numerical covariates...
|
187 |
| -found 0 categorical variables: |
| 222 | +found 0 categorical variables: |
188 | 223 | Standardizing Data across genes.
|
189 | 224 | Fitting L/S model and finding priors
|
190 | 225 | Finding parametric adjustments
|
|