
Feature extraction and PCA

samsongourevitch edited this page Jun 26, 2024 · 4 revisions

Introduction

The approaches we have considered led to two conclusions. First, the time-series are too noisy to cluster on the raw data or to extract very subtle features. Second, the two most biologically meaningful features are the average, which characterizes how well the mutant performs overall, and the slope, which characterizes how sensitive the mutant is to a given condition. This suggests a simple but more robust feature extraction: the mean of the time-series and the slope of the associated linear regression.
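As a minimal sketch of this feature extraction, the snippet below computes the mean and the least-squares slope of a single time-series. The function name `extract_features`, the use of `np.polyfit`, and the choice to ignore NaN samples are assumptions made for illustration; the actual pipeline may differ.

```python
import numpy as np

def extract_features(t, y):
    """Return (mean, slope) of one time-series, ignoring NaN samples.

    Hypothetical helper: the slope comes from an ordinary
    least-squares fit of y against t (degree-1 polynomial).
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = ~np.isnan(t) & ~np.isnan(y)
    if mask.sum() < 2:
        # Not enough points to fit a line.
        return np.nan, np.nan
    mean = y[mask].mean()
    slope, _intercept = np.polyfit(t[mask], y[mask], deg=1)
    return float(mean), float(slope)

# Toy series (not real data): a perfectly linear signal.
t = np.arange(5.0)
y = 0.1 + 0.02 * t
mean, slope = extract_features(t, y)
```

Applied to every (mutant, condition) pair, this would give one `mean_...` and one `slope_...` column per condition.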

Feature extraction and PCA

We start by extracting the features from the mutant-level data. The result is summarized in the following table:

| | mutant_ID | plate | well_id | mean_y2_20h_HL | mean_y2_20h_ML | mean_y2_high_10min-10min | mean_y2_high_1min-1min | mean_y2_high_2h-2h | mean_y2_high_30s-30s | mean_y2_low_10min-10min | mean_y2_low_1min-1min | mean_y2_low_2h-2h | mean_y2_low_30s-30s | slope_y2_20h_HL | slope_y2_20h_ML | slope_y2_high_10min-10min | slope_y2_high_1min-1min | slope_y2_high_2h-2h | slope_y2_high_30s-30s | slope_y2_low_10min-10min | slope_y2_low_1min-1min | slope_y2_low_2h-2h | slope_y2_low_30s-30s | mutated_genes | GO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CC-4533 (bst4 WT) | 20 | B01 | -0.009797 | 0.011986 | NaN | 0.046552 | NaN | NaN | NaN | 0.062051 | NaN | NaN | -0.000819 | -0.000448 | NaN | 0.000051 | NaN | NaN | NaN | 0.000126 | NaN | NaN | special_mutant | [] |
| 1 | CC-4533 (bst4 WT) | 20 | B21 | -0.035841 | -0.026091 | NaN | -0.009882 | NaN | NaN | NaN | 0.033030 | NaN | NaN | -0.001451 | -0.001388 | NaN | 0.000222 | NaN | NaN | NaN | -0.000405 | NaN | NaN | special_mutant | [] |
| 2 | CC-4533 (bst4 WT) | 20 | E23 | -0.041810 | -0.008615 | NaN | 0.008853 | NaN | NaN | NaN | 0.037256 | NaN | NaN | -0.001303 | -0.001061 | NaN | -0.000238 | NaN | NaN | NaN | -0.000392 | NaN | NaN | special_mutant | [] |
| 3 | CC-4533 (bst4 WT) | 22 | B06 | -0.045971 | -0.083706 | -0.011122 | -0.020310 | -0.014951 | -0.048578 | -0.010728 | 0.005304 | 0.016054 | -0.028737 | -0.000806 | -0.001183 | -0.000210 | -0.001156 | 0.000389 | -0.001257 | -0.000138 | -0.000643 | -0.000926 | -0.000997 | special_mutant | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8166 | bst4:BST4trunc | 22 | I12 | -0.040321 | 0.013241 | -0.036010 | -0.033754 | -0.037506 | -0.041622 | -0.027682 | -0.010900 | -0.013915 | -0.035759 | 0.000038 | -0.000281 | 0.000185 | 0.000089 | 0.002490 | 0.000204 | 0.000046 | -0.000031 | 0.000908 | -0.000522 | special_mutant | [] |
| 8167 | bst4:BST4trunc | 22 | P19 | -0.015856 | -0.067083 | -0.054850 | 0.024635 | 0.036056 | 0.005947 | -0.053641 | 0.036188 | 0.021294 | 0.004977 | 0.000392 | -0.000245 | 0.000316 | 0.000488 | 0.004036 | 0.000287 | 0.000108 | 0.000107 | 0.000176 | -0.000201 | special_mutant | [] |
| 8168 | empty | 1 | F01 | -0.066531 | -0.072928 | -0.052681 | -0.076680 | -0.063990 | NaN | -0.057796 | -0.063788 | -0.072342 | NaN | -0.000481 | 0.000051 | -0.001407 | -0.000401 | -0.000826 | NaN | -0.000723 | -0.000599 | 0.000289 | NaN | special_mutant | [] |
| 8169 | empty | 1 | M05 | -0.114444 | -0.128275 | -0.087071 | -0.102684 | -0.106478 | NaN | -0.087142 | -0.097065 | -0.087711 | NaN | -0.000429 | -0.000075 | -0.000023 | -0.000354 | -0.001523 | NaN | -0.000367 | -0.001088 | -0.001700 | NaN | special_mutant | [] |

8170 rows × 25 columns

We first take a look at the correlation between the different features. image

We see that the mean features are strongly correlated with one another, as are the slope features, but the means are much less correlated with the slopes. This suggests that PCA is a good approach, as it takes those correlations into account and can reveal finer patterns.
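A correlation matrix of this kind can be sketched with pandas as follows. The data here is synthetic and only mimics the observed structure (correlated means, correlated slopes, weak cross-correlation). The four column names are borrowed from the table, and the reliance on `DataFrame.corr`, which skips NaN values pairwise, is an assumption about how missing entries might be handled.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the feature table (NOT the real data):
# two shared latent factors, one driving means, one driving slopes.
base_mean = rng.normal(size=n)
base_slope = rng.normal(size=n)
features = pd.DataFrame({
    "mean_y2_20h_HL": base_mean + 0.1 * rng.normal(size=n),
    "mean_y2_20h_ML": base_mean + 0.1 * rng.normal(size=n),
    "slope_y2_20h_HL": base_slope + 0.1 * rng.normal(size=n),
    "slope_y2_20h_ML": base_slope + 0.1 * rng.normal(size=n),
})
features.iloc[0, 1] = np.nan  # mimic a missing measurement

# Pearson correlation; NaN entries are excluded pair by pair.
corr = features.corr()
print(corr.round(2))
```

On the real table, the same call would be applied to the `mean_*` and `slope_*` columns only, leaving out identifiers such as `mutant_ID`, `plate` and `well_id`.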

Applying PCA to this data yields the following plot, where the green points correspond to WT data: image
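For reference, a two-component PCA can be sketched with NumPy alone, via an SVD of the standardized feature matrix. Two choices below are assumptions, since the page does not state them: rows containing NaN are dropped, and each feature is scaled to unit variance before the decomposition. The helper name `pca_2d` is hypothetical.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the first two principal components.

    Assumptions (not confirmed by the source): rows with any NaN
    are dropped, and each column is standardized beforehand.
    Returns (scores, explained_variance_ratio, kept_row_mask).
    """
    X = np.asarray(X, dtype=float)
    keep = ~np.isnan(X).any(axis=1)
    Xs = X[keep]
    Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
    # Rows of Vt are the principal axes, ordered by singular value.
    _U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = Xs @ Vt[:2].T
    explained = S**2 / np.sum(S**2)
    return scores, explained[:2], keep

# Toy matrix (not real data): 50 mutants, 4 features, one NaN row.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += X[:, 0]      # make two features correlated
X[3, 2] = np.nan
scores, explained, keep = pca_2d(X)
```

The `keep` mask makes it possible to map each row of `scores` back to its mutant, for example to color the WT points.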

We also overlay gene information onto this plot. Mutants shown in the same color are associated with the same gene: image

We see that mutants of the same gene do not necessarily cluster together well, but they are usually relatively close, at least along one component.
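One possible way to put a number on this observation is to compare, for each gene, the spread of its mutants along each component with the spread of all mutants. This is only an illustrative sketch on synthetic scores. The gene labels and the ratio used here are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic PCA scores (NOT real data): 3 hypothetical genes with
# 4 mutants each, tight along PC1 and loose along PC2.
genes = np.repeat(["geneA", "geneB", "geneC"], 4)
centers = {"geneA": (2.0, 0.0), "geneB": (-2.0, 1.0), "geneC": (0.0, -2.0)}
pcs = np.array([centers[g] for g in genes])
pcs = pcs + rng.normal(scale=[0.2, 1.0], size=(12, 2))

scores = pd.DataFrame(pcs, columns=["PC1", "PC2"])
scores["mutated_genes"] = genes

# Within-gene spread relative to the overall spread, per component.
within = scores.groupby("mutated_genes")[["PC1", "PC2"]].std()
overall = scores[["PC1", "PC2"]].std()
ratio = within / overall

# A gene is "close along at least one component" if its smallest
# ratio is well below 1.
tightest = ratio.min(axis=1)
print(ratio.round(2))
```

A ratio well below 1 on one component but not the other corresponds to the situation described above: the mutants of a gene line up along one axis without forming a compact cluster.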

However, the first two components of the PCA appear to capture the similarity between time-series relatively well. Zooming in on a small region of the plot gives the following:

image

Here are the time-series of some of the corresponding mutants:

image image image

This is why we decided to apply PCA at the gene level.
