Feature extraction and PCA

Introduction

The various approaches we have considered have led to two conclusions. First, the time-series are too noisy to cluster on the raw data or to extract very subtle features. Second, the two most important features in terms of biological meaning are the average, which characterizes how well the mutant performs overall, and the slope, which characterizes how sensitive the mutant is to a given condition. This suggests a simple yet more robust feature extraction: the mean of the time-series and the slope of the associated linear regression.
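As an illustration, the two features can be computed for a single time-series as in the sketch below. This is a minimal example and not necessarily the code used for the analysis: the function name `mean_and_slope`, the time axis `t`, and the choice to ignore NaN samples are all assumptions made for the example.

```python
import numpy as np

def mean_and_slope(t, y):
    """Return (mean, slope) of one time-series.

    Assumption: NaN samples are simply ignored. If fewer than two valid
    samples remain, both features are returned as NaN.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~np.isnan(y)
    if valid.sum() < 2:
        return np.nan, np.nan
    # Degree-1 least-squares fit: y ~ slope * t + intercept
    slope, _intercept = np.polyfit(t[valid], y[valid], deg=1)
    return y[valid].mean(), slope

# Toy example: y = 2 * t + 1 with one missing sample
mean_value, slope_value = mean_and_slope([0, 1, 2, 3], [1.0, 3.0, np.nan, 7.0])
```

Applying such a function to every mutant and every condition would give one mean column and one slope column per condition.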

Feature extraction and PCA

We start by extracting the features from the mutant-level data. The result is summarized in the following table:

	mutant_ID	plate	well_id	mean_y2_20h_HL	mean_y2_20h_ML	mean_y2_high_10min-10min	mean_y2_high_1min-1min	mean_y2_high_2h-2h	mean_y2_high_30s-30s	mean_y2_low_10min-10min	mean_y2_low_1min-1min	mean_y2_low_2h-2h	mean_y2_low_30s-30s	slope_y2_20h_HL	slope_y2_20h_ML	slope_y2_high_10min-10min	slope_y2_high_1min-1min	slope_y2_high_2h-2h	slope_y2_high_30s-30s	slope_y2_low_10min-10min	slope_y2_low_1min-1min	slope_y2_low_2h-2h	slope_y2_low_30s-30s	mutated_genes	GO
0	CC-4533 (bst4 WT)	20	B01	-0.009797	0.011986	NaN	0.046552	NaN	NaN	NaN	0.062051	NaN	NaN	-0.000819	-0.000448	NaN	0.000051	NaN	NaN	NaN	0.000126	NaN	NaN	special_mutant	[]
1	CC-4533 (bst4 WT)	20	B21	-0.035841	-0.026091	NaN	-0.009882	NaN	NaN	NaN	0.033030	NaN	NaN	-0.001451	-0.001388	NaN	0.000222	NaN	NaN	NaN	-0.000405	NaN	NaN	special_mutant	[]
2	CC-4533 (bst4 WT)	20	E23	-0.041810	-0.008615	NaN	0.008853	NaN	NaN	NaN	0.037256	NaN	NaN	-0.001303	-0.001061	NaN	-0.000238	NaN	NaN	NaN	-0.000392	NaN	NaN	special_mutant	[]
3	CC-4533 (bst4 WT)	22	B06	-0.045971	-0.083706	-0.011122	-0.020310	-0.014951	-0.048578	-0.010728	0.005304	0.016054	-0.028737	-0.000806	-0.001183	-0.000210	-0.001156	0.000389	-0.001257	-0.000138	-0.000643	-0.000926	-0.000997	special_mutant	[]
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8166	bst4:BST4trunc	22	I12	-0.040321	0.013241	-0.036010	-0.033754	-0.037506	-0.041622	-0.027682	-0.010900	-0.013915	-0.035759	0.000038	-0.000281	0.000185	0.000089	0.002490	0.000204	0.000046	-0.000031	0.000908	-0.000522	special_mutant	[]
8167	bst4:BST4trunc	22	P19	-0.015856	-0.067083	-0.054850	0.024635	0.036056	0.005947	-0.053641	0.036188	0.021294	0.004977	0.000392	-0.000245	0.000316	0.000488	0.004036	0.000287	0.000108	0.000107	0.000176	-0.000201	special_mutant	[]
8168	empty	1	F01	-0.066531	-0.072928	-0.052681	-0.076680	-0.063990	NaN	-0.057796	-0.063788	-0.072342	NaN	-0.000481	0.000051	-0.001407	-0.000401	-0.000826	NaN	-0.000723	-0.000599	0.000289	NaN	special_mutant	[]
8169	empty	1	M05	-0.114444	-0.128275	-0.087071	-0.102684	-0.106478	NaN	-0.087142	-0.097065	-0.087711	NaN	-0.000429	-0.000075	-0.000023	-0.000354	-0.001523	NaN	-0.000367	-0.001088	-0.001700	NaN	special_mutant	[]

8170 rows × 25 columns

We first take a look at the correlation between the different features.

We see that the means are strongly correlated with one another, as are the slopes, but the means are much less correlated with the slopes. This suggests that PCA is a good approach, as it takes those correlations into account and can reveal finer patterns.
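A correlation matrix of this kind can be obtained along the following lines. This is only a sketch on a toy table: the column names `mean_a`, `mean_b` and `slope_a` and the helper `feature_correlation` are hypothetical, and the real table has many more feature columns. It relies on the fact that pandas computes each pairwise Pearson correlation on the rows where both columns are present, which matters here because many entries are NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mutant-level table; column names are hypothetical.
df = pd.DataFrame({
    "mutant_ID": list("abcdef"),
    "mean_a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "mean_b": [1.0, 3.0, np.nan, 7.0, 9.0, 11.0],
    "slope_a": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})

def feature_correlation(table):
    """Pearson correlation between all mean_* and slope_* columns."""
    # Keep only the feature columns, dropping identifiers and annotations
    features = table.filter(regex=r"^(mean|slope)_")
    # DataFrame.corr excludes NaN values pair by pair
    return features.corr()

corr = feature_correlation(df)
```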

Applying PCA to this data yields the following plot, where the green points correspond to WT data:

We also overlay gene information onto this plot. Mutants with the same color are associated with the same gene:

We see that mutants of the same gene don't necessarily cluster together well, but they are usually relatively close, at least along one component.

However, it looks like the first two components of the PCA capture the similarity between time-series relatively well. If we look at a small region, here is what we have:

And some of the corresponding mutants:

This is why we decide to apply PCA at the gene level.
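For reference, a projection onto the first two principal components such as the one discussed in this section can be sketched with NumPy alone. Several choices below are assumptions made for the example and are not stated in the text: rows containing any NaN are dropped, each feature is standardized before the decomposition, and the function name `pca_2d` is hypothetical. The sketch also assumes no feature is constant, since a zero standard deviation would cause a division by zero.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components.

    Returns (scores, explained_ratio, keep), where `keep` is a boolean
    mask marking the rows of X that were used.
    """
    X = np.asarray(X, dtype=float)
    # Assumption: rows with any missing feature are dropped
    keep = ~np.isnan(X).any(axis=1)
    Xk = X[keep]
    # Assumption: features are centered and scaled to unit variance
    Z = (Xk - Xk.mean(axis=0)) / Xk.std(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:2].T
    explained_ratio = (S ** 2 / np.sum(S ** 2))[:2]
    return scores, explained_ratio, keep
```

The two columns of `scores` are the coordinates that would be plotted, and `keep` makes it possible to recover which mutants they belong to, for example to color the points by gene.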
