Feature extraction and PCA
The various approaches we have considered have led to two conclusions. First, the time series are too noisy to cluster on the raw data or to extract very subtle features from them. Second, the two most biologically meaningful features are the average, which characterizes how well a mutant performs overall, and the slope, which characterizes how sensitive the mutant is to a given condition. This suggests a simple yet more robust feature extraction: the mean of each time series and the slope of its linear regression.
We start by extracting the features from the mutant-level data. The result is summarized in this table:
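As an illustration, the two features could be computed per time series roughly as follows. This is a minimal sketch, not the project's actual code: the function name `mean_and_slope`, the NaN handling, and the toy series are assumptions made for the example.

```python
import numpy as np


def mean_and_slope(t, y):
    """Return (mean, slope) of one time series, ignoring NaN samples.

    Hypothetical helper: the slope is the least-squares slope of y against t.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    ok = ~np.isnan(t) & ~np.isnan(y)          # keep only complete samples
    if ok.sum() == 0:
        return np.nan, np.nan                  # nothing to summarize
    mean = float(y[ok].mean())
    if ok.sum() < 2 or np.ptp(t[ok]) == 0:
        return mean, np.nan                    # a slope needs two distinct times
    slope, _intercept = np.polyfit(t[ok], y[ok], deg=1)
    return mean, float(slope)


# Toy series (made up): a perfectly linear signal y = 0.5 * t + 1
t = np.arange(5.0)
y = 0.5 * t + 1.0
mean, slope = mean_and_slope(t, y)
```

Applying such a function to every (mutant, condition) pair would give one `mean_*` and one `slope_*` column per condition, which is the layout of the feature table.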
| | mutant_ID | plate | well_id | mean_y2_20h_HL | mean_y2_20h_ML | mean_y2_high_10min-10min | mean_y2_high_1min-1min | mean_y2_high_2h-2h | mean_y2_high_30s-30s | mean_y2_low_10min-10min | mean_y2_low_1min-1min | mean_y2_low_2h-2h | mean_y2_low_30s-30s | slope_y2_20h_HL | slope_y2_20h_ML | slope_y2_high_10min-10min | slope_y2_high_1min-1min | slope_y2_high_2h-2h | slope_y2_high_30s-30s | slope_y2_low_10min-10min | slope_y2_low_1min-1min | slope_y2_low_2h-2h | slope_y2_low_30s-30s | mutated_genes | GO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CC-4533 (bst4 WT) | 20 | B01 | -0.009797 | 0.011986 | NaN | 0.046552 | NaN | NaN | NaN | 0.062051 | NaN | NaN | -0.000819 | -0.000448 | NaN | 0.000051 | NaN | NaN | NaN | 0.000126 | NaN | NaN | special_mutant | [] |
| 1 | CC-4533 (bst4 WT) | 20 | B21 | -0.035841 | -0.026091 | NaN | -0.009882 | NaN | NaN | NaN | 0.033030 | NaN | NaN | -0.001451 | -0.001388 | NaN | 0.000222 | NaN | NaN | NaN | -0.000405 | NaN | NaN | special_mutant | [] |
| 2 | CC-4533 (bst4 WT) | 20 | E23 | -0.041810 | -0.008615 | NaN | 0.008853 | NaN | NaN | NaN | 0.037256 | NaN | NaN | -0.001303 | -0.001061 | NaN | -0.000238 | NaN | NaN | NaN | -0.000392 | NaN | NaN | special_mutant | [] |
| 3 | CC-4533 (bst4 WT) | 22 | B06 | -0.045971 | -0.083706 | -0.011122 | -0.020310 | -0.014951 | -0.048578 | -0.010728 | 0.005304 | 0.016054 | -0.028737 | -0.000806 | -0.001183 | -0.000210 | -0.001156 | 0.000389 | -0.001257 | -0.000138 | -0.000643 | -0.000926 | -0.000997 | special_mutant | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8166 | bst4:BST4trunc | 22 | I12 | -0.040321 | 0.013241 | -0.036010 | -0.033754 | -0.037506 | -0.041622 | -0.027682 | -0.010900 | -0.013915 | -0.035759 | 0.000038 | -0.000281 | 0.000185 | 0.000089 | 0.002490 | 0.000204 | 0.000046 | -0.000031 | 0.000908 | -0.000522 | special_mutant | [] |
| 8167 | bst4:BST4trunc | 22 | P19 | -0.015856 | -0.067083 | -0.054850 | 0.024635 | 0.036056 | 0.005947 | -0.053641 | 0.036188 | 0.021294 | 0.004977 | 0.000392 | -0.000245 | 0.000316 | 0.000488 | 0.004036 | 0.000287 | 0.000108 | 0.000107 | 0.000176 | -0.000201 | special_mutant | [] |
| 8168 | empty | 1 | F01 | -0.066531 | -0.072928 | -0.052681 | -0.076680 | -0.063990 | NaN | -0.057796 | -0.063788 | -0.072342 | NaN | -0.000481 | 0.000051 | -0.001407 | -0.000401 | -0.000826 | NaN | -0.000723 | -0.000599 | 0.000289 | NaN | special_mutant | [] |
| 8169 | empty | 1 | M05 | -0.114444 | -0.128275 | -0.087071 | -0.102684 | -0.106478 | NaN | -0.087142 | -0.097065 | -0.087711 | NaN | -0.000429 | -0.000075 | -0.000023 | -0.000354 | -0.001523 | NaN | -0.000367 | -0.001088 | -0.001700 | NaN | special_mutant | [] |
8170 rows × 25 columns
We first take a look at the correlation between the different features.

We see that the means are strongly correlated with one another, as are the slopes, but the means are much less correlated with the slopes. This suggests that PCA is a good approach, as it takes those correlations into account and can reveal finer patterns.
Applying PCA to this data yields the following plot, where the green points correspond to WT data:
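The correlation check described above can be sketched with pandas. The data below is synthetic and only mimics the reported pattern (means correlated with means, slopes with slopes); the four column names are borrowed from the feature table, and everything else is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table (NOT the real measurements)
rng = np.random.default_rng(0)
n = 200
base_mean = rng.normal(size=n)    # shared signal behind the "mean" features
base_slope = rng.normal(size=n)   # independent signal behind the "slope" features
features = pd.DataFrame({
    "mean_y2_20h_HL": base_mean + 0.1 * rng.normal(size=n),
    "mean_y2_20h_ML": base_mean + 0.1 * rng.normal(size=n),
    "slope_y2_20h_HL": base_slope + 0.1 * rng.normal(size=n),
    "slope_y2_20h_ML": base_slope + 0.1 * rng.normal(size=n),
})
features.iloc[::10, 1] = np.nan   # some conditions are missing, as in the table

# Pearson correlation; pandas skips NaN values pair by pair
corr = features.corr()
```

Passing `corr` to a heatmap function is one common way to display such a matrix.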

We also overlay gene information on this plot. Mutants with the same color are associated with the same gene:

We see that mutants of the same gene don't necessarily cluster together well, but they are usually relatively close, at least along one component.
However, the first two components of the PCA seem to capture the similarity between time series relatively well. Zooming in on a small region of the plot gives:

And here are some of the corresponding mutants:

This is why we decide to apply PCA at the gene level.
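For reference, the PCA step itself, whether applied to mutant-level or gene-level features, can be sketched with plain NumPy. This is a sketch under stated assumptions rather than the pipeline actually used here: the function name `pca_scores`, the column-mean imputation of missing values, and the standardization of each feature are illustrative choices, and the input is synthetic.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project the rows of X onto the first principal components.

    Assumptions (not from the original analysis): NaNs are replaced by the
    column mean and every feature is standardized before the decomposition.
    """
    X = np.asarray(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)     # simple mean imputation
    std = X.std(axis=0)
    std[std == 0] = 1.0                         # avoid dividing by zero
    Z = (X - X.mean(axis=0)) / std              # centered, unit-variance features
    U, S, _Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)             # explained variance ratios
    return scores, explained[:n_components]


# Synthetic features: three "mean-like" and three "slope-like" columns
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = np.hstack([
    latent[:, [0]] + 0.1 * rng.normal(size=(100, 3)),
    latent[:, [1]] + 0.1 * rng.normal(size=(100, 3)),
])
X[::7, 2] = np.nan                              # a few missing conditions
scores, explained = pca_scores(X)
```

Here `scores[:, 0]` and `scores[:, 1]` are the coordinates one would put in a scatter plot of the first two components. How missing conditions are best handled depends on the data, so the imputation step in particular should be read as a placeholder.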