-
Notifications
You must be signed in to change notification settings - Fork 0
Creating a model for the data
Up until now, we've employed simple approaches to ensure satisfactory results. However, for a more thorough and quantitative analysis, we're introducing a statistical model for the data.
- Accurate Plate Normalization: Improve accuracy and foundation of normalization across plates.
- Defining Probability Distributions: Enable defining probability distributions for time-series.
In this model, we aim to account for two sources of errors: systematic error due to batch effects and random error affecting each time-point and mutant differently.
The model is represented as follows:
Where:
-
$Y(II)^m_{obs}$ : Observed Y(II) time-series of mutant$m$ . -
$Y(II)^m_{true}$ : 'True' time-series. -
$\delta_b$ : Batch effect. -
$\epsilon^m$ : Random error, i.i.d. for all$m$ .
One of the assumption we make is about the structure of the batch effect. Indeed, we model it here as an additive effect but this has to be verified. Another possibility for instance would be a scaling batch effect, ie
To check if it is additive of multiplicative, we compute the average range of Y(II) for each plate. Indeed if it was a scaling effect, this range should depend on the plate.

We see that this range seems homogenous between the different plates which guides us toward an additive model.
We also check the variance of each time point of the time-series for each plate and light condition. Indeed if the effect was multiplicative this variance would be

Once again, we find no significant difference so we move forward with the additive model.
This model allows us to define a very natural normalization process. Since we want to compare the mutants to the Wild Types and get rid of the batch effect, we simply have to consider :
$Y(II){obs}^m - Y(II){obs}^{WT} = (Y(II){true}^{m} - Y(II){true}^{WT}) + (\epsilon^M - \epsilon^WT)$
Where
Actually, in order to reduce the variance of this normalized data, and since we have three WT on each plate, we compute the mean of those three and then subtract it from the observed Y(II).
So we have $Y(II)^M_{norm} = Y(II)^M_{obs} - \frac{Y(II)^{{WT}{1}}{obs} + Y(II){obs}^{WT{2}} + Y(II)^{{WT}{3}}{obs}}{3}$ = Y(II)^M_{true} - Y(II)^{WT}{true} + \epsilon^M - \frac{\epsilon^WT{1} \epsilon^WT_{1} + \epsilon^WT_{2} + \epsilon^WT_{3}}{3}$
This normalization should remove the batch effect and thus, it should remove the part of the variance that is causes, at least across plates while not changing the variance within a plate.
We confirm that this normalization has this effect :


With this framework, we don't have to make any hypothesis on the nature of
To this end, we display histograms of the distribution of the normalized
Indeed, in this case the normalized
Here are the resulting histograms for the light condition of medium light :

We overlayed the distribution of the empirical underlaying gaussian distribution to show to goodness of fit.
We also perform a Kolmogorov-Smirnov test to compute p-values with
| y2_1 | y2_2 | y2_3 | y2_4 | y2_5 | y2_6 | y2_7 | y2_8 | y2_9 | y2_10 | ... | y2_31 | y2_32 | y2_33 | y2_34 | y2_35 | y2_36 | y2_37 | y2_38 | y2_39 | y2_40 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| statistic | 0.037446 | 0.040386 | 0.049549 | 0.043652 | 0.061762 | 0.038972 | 0.050765 | 0.046857 | 0.058197 | 0.036255 | ... | 0.05818 | 0.057305 | 0.054980 | 0.034652 | 0.045091 | 0.047349 | 0.048678 | 0.038562 | 0.052576 | 0.065752 |
| p-value | 0.640469 | 0.544557 | 0.292741 | 0.444627 | 0.102431 | 0.590274 | 0.266534 | 0.357047 | 0.142614 | 0.679844 | ... | 0.14283 | 0.154438 | 0.188977 | 0.732215 | 0.403885 | 0.344648 | 0.312579 | 0.603700 | 0.230744 | 0.069088 |
2 rows × 40 columns
We see that the p-values are well above 0.05. This means that our hypothesis of a gaussian error holds. To investigate this a bit further before using this model, we compute Q-Q plots of the those distributions to make sure that they match.

We see that the Q-Q plot against a normal distribution yields very satisfying results and we can be confident that modeling