Creating a model for the data

Introduction

Up until now, we've employed simple approaches to ensure satisfactory results. However, for a more thorough and quantitative analysis, we're introducing a statistical model for the data.

Model Goals

Accurate Plate Normalization: Improve accuracy and foundation of normalization across plates.
Defining Probability Distributions: Enable defining probability distributions for time-series.

The Model

In this model, we aim to account for two sources of errors: systematic error due to batch effects and random error affecting each time-point and mutant differently.

The model is represented as follows: $Y(II)^m_{obs} = Y(II)^m_{true} + \delta_b + \epsilon^m$

Where:

$Y(II)^m_{obs}$: Observed Y(II) time-series of mutant $m$.
$Y(II)^m_{true}$: 'True' time-series.
$\delta_b$ : Batch effect.
$\epsilon^m$ : Random error, i.i.d. for all $m$.

$\delta_b$ models homogenous biological effects, while $\epsilon$ models random experimental errors.

Additive batch effect checks

One of the assumption we make is about the structure of the batch effect. Indeed, we model it here as an additive effect but this has to be verified. Another possibility for instance would be a scaling batch effect, ie $Y(II)^m_{obs} = \delta_b(Y(II)^m_{true} + \epsilon^m)$.

To check if it is additive of multiplicative, we compute the average range of Y(II) for each plate. Indeed if it was a scaling effect, this range should depend on the plate.

We see that this range seems homogenous between the different plates which guides us toward an additive model.

We also check the variance of each time point of the time-series for each plate and light condition. Indeed if the effect was multiplicative this variance would be $\delta_b^2 \cdot Var(\epsilon^m_{i})$ so we should find substantial differences, whereas with an additive effect, the variance is simply $Var(\epsilon^m_{i})$. We thus show the average of those variances for i varying between 0 and the end of the experiment :

Once again, we find no significant difference so we move forward with the additive model.

Normalization

This model allows us to define a very natural normalization process. Since we want to compare the mutants to the Wild Types and get rid of the batch effect, we simply have to consider :

$Y(II){obs}^m - Y(II){obs}^{WT} = (Y(II){true}^{m} - Y(II){true}^{WT}) + (\epsilon^M - \epsilon^WT)$

Where $Y(II)_{obs}^{WT}$ is the observed time-series of one of the Wild Type present on the plate.

Actually, in order to reduce the variance of this normalized data, and since we have three WT on each plate, we compute the mean of those three and then subtract it from the observed Y(II).

So we have $Y(II)^M_{norm} = Y(II)^M_{obs} - \frac{Y(II)^{{WT}{1}}{obs} + Y(II){obs}^{WT{2}} + Y(II)^{{WT}{3}}{obs}}{3}$ = Y(II)^M_{true} - Y(II)^{WT}{true} + \epsilon^M - \frac{\epsilon^WT{1} \epsilon^WT_{1} + \epsilon^WT_{2} + \epsilon^WT_{3}}{3}$

Normalization effects

This normalization should remove the batch effect and thus, it should remove the part of the variance that is causes, at least across plates while not changing the variance within a plate.

We confirm that this normalization has this effect :

Fitting a good model

With this framework, we don't have to make any hypothesis on the nature of $\delta$ and $\epsilon$. But in order to quantify the probability of $Y(II){true}^{m}$ being different to $Y(II){true}^{WT}$, we have to assess the distribution of $\epsilon$.

To this end, we display histograms of the distribution of the normalized $Y(II)$ for the WT for plate 99 (which is the plate that contains only WT).

Indeed, in this case the normalized $Y(II)$ should be equal (to a scaling factor) to $\epsilon$. This factor comes from the fact that to normalize, we remove the average of all the WT in the plate so we actually have : $Y(II)_{norm}^{WT} = \sqrt{1 - \frac{1}{384}}\epsilon$ where the equal symbol is the equality in law.

Here are the resulting histograms for the light condition of medium light :

We overlayed the distribution of the empirical underlaying gaussian distribution to show to goodness of fit.

We also perform a Kolmogorov-Smirnov test to compute p-values with $H_0$ being that the different time-points are drawn from gaussian distributions :

	y2_1	y2_2	y2_3	y2_4	y2_5	y2_6	y2_7	y2_8	y2_9	y2_10	...	y2_31	y2_32	y2_33	y2_34	y2_35	y2_36	y2_37	y2_38	y2_39	y2_40
statistic	0.037446	0.040386	0.049549	0.043652	0.061762	0.038972	0.050765	0.046857	0.058197	0.036255	...	0.05818	0.057305	0.054980	0.034652	0.045091	0.047349	0.048678	0.038562	0.052576	0.065752
p-value	0.640469	0.544557	0.292741	0.444627	0.102431	0.590274	0.266534	0.357047	0.142614	0.679844	...	0.14283	0.154438	0.188977	0.732215	0.403885	0.344648	0.312579	0.603700	0.230744	0.069088

2 rows × 40 columns

We see that the p-values are well above 0.05. This means that our hypothesis of a gaussian error holds. To investigate this a bit further before using this model, we compute Q-Q plots of the those distributions to make sure that they match.

We see that the Q-Q plot against a normal distribution yields very satisfying results and we can be confident that modeling $\epsilon$ by a gaussian distribution is correct.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Creating a model for the data

Introduction

Model Goals

The Model

Additive batch effect checks

Normalization

Normalization effects

Fitting a good model

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally