
Commit ab0ca03

Add more detail
1 parent 0f80328 commit ab0ca03

1 file changed: +36 -1 lines changed


examples/M16/README.md

Lines changed: 36 additions & 1 deletion
@@ -1,3 +1,38 @@
+# Background
+
+Data preprocessing is an important front-end step in data analysis that prepares data for subsequent analysis.
+It not only enables the subsequent analysis by processing and transforming data, but can also influence the quality of that analysis, sometimes significantly.
+Common examples of data preprocessing include data standardization and normalization to remove or suppress noise, removal of batch effects to combine datasets for larger studies, and generation of new data representations to enable new analyses.
+Feature selection can be viewed as a kind of data preprocessing for prediction analysis.
+Its goal is to select a (minimum) subset of the available features from which prediction models with good performance can be constructed.
+Performance can be evaluated from multiple aspects, such as prediction accuracy and the speed of constructing the prediction model.
+
+The data preprocessing methods can generate data partitions to enable flexible cross-validation analysis, normalize gene expression data of cancer cells and remove batch effects from it, and generate genomic representations at the gene set level for cancer cells.
+The feature selection methods can filter features based on missing values and variation, and perform feature decorrelation.
+Features without much variation might not be useful for prediction, and highly correlated features need not all be included in the prediction model.
+We also implement and extend the co-expression extrapolation (COXEN) gene selection method for the Pilot 1 project [10], which can select predictive and generalizable genes for predicting drug response in precision oncology applications.
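The three filters mentioned above (missing values, variation, decorrelation) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the CANDLE implementation; the function names, thresholds, and toy data are assumptions made for the example.

```python
import math

def missing_fraction(column):
    """Fraction of entries in a feature column that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def variance(column):
    """Population variance over the non-missing entries of a column."""
    values = [v for v in column if v is not None]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def pearson(x, y):
    """Pearson correlation over positions where both columns are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    sx = math.sqrt(sum((a - mx) ** 2 for a, _ in pairs))
    sy = math.sqrt(sum((b - my) ** 2 for _, b in pairs))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def select_features(data, max_missing=0.2, min_variance=1e-3, max_corr=0.9):
    """Return names of features that pass all three filters, in input order."""
    kept = []
    for name, column in data.items():
        if missing_fraction(column) > max_missing:
            continue  # too many missing values
        if variance(column) < min_variance:
            continue  # almost constant, unlikely to help prediction
        if any(abs(pearson(column, data[k])) > max_corr for k in kept):
            continue  # redundant with a feature that is already kept
        kept.append(name)
    return kept

# Toy data, for illustration only.
data = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.9],     # nearly a multiple of g1
    "g3": [7.0, 7.0, 7.0, 7.0, 7.0],     # no variation
    "g4": [None, None, 1.0, None, 2.0],  # mostly missing
    "g5": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(select_features(data))  # ['g1', 'g5']
```

The decorrelation step here is greedy: a feature is dropped when it is strongly correlated with one that was kept earlier, so the result depends on feature order. Other strategies are possible.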
+
+# General Data Preprocessing Functions
+
+```generate_cross_validation_partition```
+
+To flexibly generate data partitions for cross-validation analysis, such as partitioning of grouped samples into sets that do not share groups.
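As a rough illustration of partitioning grouped samples into sets that do not share groups, the sketch below assigns whole groups to folds. It is not the `generate_cross_validation_partition` implementation; the helper name `grouped_cv_partition`, its parameters, and the sample groups are hypothetical.

```python
import random
from collections import defaultdict

def grouped_cv_partition(groups, n_folds, seed=0):
    """Split sample indices into n_folds folds so that no group spans two folds."""
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)
    if n_folds > len(by_group):
        raise ValueError("need at least as many groups as folds")

    # Shuffle groups reproducibly, then place larger groups first so that
    # fold sizes stay roughly balanced.
    names = sorted(by_group)
    random.Random(seed).shuffle(names)
    names.sort(key=lambda g: len(by_group[g]), reverse=True)

    folds = [[] for _ in range(n_folds)]
    for name in names:
        smallest = min(folds, key=len)  # current smallest fold
        smallest.extend(by_group[name])
    return [sorted(fold) for fold in folds]

# Hypothetical example: 8 samples drawn from 4 groups (e.g. cell lines).
groups = ["A", "A", "B", "B", "B", "C", "D", "D"]
folds = grouped_cv_partition(groups, n_folds=2)
for k, test_idx in enumerate(folds):
    train_idx = sorted(i for f in folds if f is not test_idx for i in f)
    print(k, "train:", train_idx, "test:", test_idx)
```

Keeping each group inside a single fold prevents samples from the same group from appearing in both the training and the test set of any split.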
+
+# Data Preprocessing Functions Specific to Pilot 1 Applications
+
+```quantile_normalization```
+
+To perform quantile normalization of genomic data [8] with tolerance of missing values.
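One way to make quantile normalization tolerate missing values is to build the reference distribution from each sample's observed values only, and to leave missing entries untouched. The sketch below takes that approach in plain Python. It is an assumption about how such tolerance could work, not the CANDLE function; the names `interpolate` and `quantile_normalize` are hypothetical.

```python
def interpolate(sorted_values, p):
    """Value at quantile position p in [0, 1] of an ascending list (linear interpolation)."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = p * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    frac = pos - low
    return sorted_values[low] * (1 - frac) + sorted_values[high] * frac

def quantile_normalize(samples):
    """Quantile-normalize samples (equal-length lists), leaving None in place.

    Assumes every sample has at least one observed value. Ties are ranked by
    position, which is a simplification.
    """
    n = len(samples[0])
    grid = [i / (n - 1) for i in range(n)] if n > 1 else [0.0]
    observed = [sorted(v for v in s if v is not None) for s in samples]
    # Reference distribution: mean of the per-sample quantile functions.
    reference = [
        sum(interpolate(obs, p) for obs in observed) / len(observed) for p in grid
    ]

    result = []
    for sample in samples:
        # Indices of observed values, ordered from smallest to largest value.
        order = sorted(
            (i for i, v in enumerate(sample) if v is not None),
            key=lambda i: sample[i],
        )
        normalized = [None] * n
        for rank, i in enumerate(order):
            p = rank / (len(order) - 1) if len(order) > 1 else 0.5
            normalized[i] = interpolate(reference, p)
        result.append(normalized)
    return result

# Two samples measured on four features; the second has one missing value.
samples = [[5.0, 2.0, 3.0, 4.0], [4.0, None, 6.0, 8.0]]
for row in quantile_normalize(samples):
    print(row)
```

After normalization the observed values of every sample follow the same reference distribution, while the missing entry in the second sample stays `None`.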
+
+```combat_batch_effect_removal```
+
+To perform ComBat analysis [9] on gene expression data to remove batch effects.
+
+```generate_gene_set_data```
+
+To calculate genomic representations at the gene set level, such as the average expression values of genes in a pathway and the total number of SNP mutations in a genetic pathway.
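The two gene-set-level representations named above (average expression and total mutation count per pathway) amount to a per-set aggregation. The sketch below shows the idea for a single sample. It is not the `generate_gene_set_data` implementation; the function name, its parameters, and the pathway memberships are hypothetical.

```python
def gene_set_representation(sample, gene_sets, how="mean"):
    """Aggregate one sample's gene-level values into gene-set-level values.

    sample:    dict mapping gene name -> value (expression level or mutation count)
    gene_sets: dict mapping gene set name -> list of member gene names
    how:       "mean" (e.g. average expression) or "sum" (e.g. total mutation count)
    """
    result = {}
    for set_name, genes in gene_sets.items():
        values = [sample[g] for g in genes if g in sample]  # skip unmeasured genes
        if not values:
            result[set_name] = None  # no member gene was measured
        elif how == "mean":
            result[set_name] = sum(values) / len(values)
        elif how == "sum":
            result[set_name] = sum(values)
        else:
            raise ValueError(f"unknown aggregation: {how}")
    return result

# Hypothetical pathways and values, for illustration only.
pathways = {"P1": ["TP53", "MDM2"], "P2": ["KRAS", "BRAF", "NRAS"]}
expression = {"TP53": 2.0, "MDM2": 4.0, "KRAS": 1.0, "BRAF": 3.0}
mutations = {"TP53": 1, "MDM2": 0, "KRAS": 2, "BRAF": 0, "NRAS": 1}

print(gene_set_representation(expression, pathways, how="mean"))  # {'P1': 3.0, 'P2': 2.0}
print(gene_set_representation(mutations, pathways, how="sum"))    # {'P1': 1, 'P2': 3}
```

Genes that belong to a set but are absent from the sample are skipped here, so the mean is taken over measured genes only. That is one possible convention among several.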
+
+
 # Feature Selection examples
 
 The code demonstrates feature selection methods that CANDLE provides.
@@ -184,7 +219,7 @@ Using TensorFlow backend.
 ...
 found 2 batches
 found 0 numerical covariates...
-found 0 categorical variables:
+found 0 categorical variables:
 Standardizing Data across genes.
 Fitting L/S model and finding priors
 Finding parametric adjustments
