Skip to content

Creating a model for the data

samsongourevitch edited this page May 20, 2024 · 41 revisions

Introduction

Up until now, we've employed simple approaches to ensure satisfactory results. However, for a more thorough and quantitative analysis, we're introducing a statistical model for the data.

Model Goals

  1. Accurate Plate Normalization: Improve accuracy and foundation of normalization across plates.
  2. Defining Probability Distributions: Enable defining probability distributions for time-series.

The Model

In this model, we aim to account for two sources of errors: systematic error due to batch effects and random error affecting each time-point and mutant differently.

The model is represented as follows: $Y(II)^m_{obs} = Y(II)^m_{true} + \delta_b + \epsilon^m$

Where:

  • $Y(II)^m_{obs}$: Observed Y(II) time-series of mutant $m$.
  • $Y(II)^m_{true}$: 'True' time-series.
  • $\delta_b$ : Batch effect.
  • $\epsilon^m$ : Random error, i.i.d. for all $m$.

$\delta_b$ models homogenous biological effects, while $\epsilon$ models random experimental errors.


Additive batch effect checks

One of the assumption we make is about the structure of the batch effect. Indeed, we model it here as an additive effect but this has to be verified. Another possibility for instance would be a scaling batch effect, ie $Y(II)^m_{obs} = \delta_b(Y(II)^m_{true} + \epsilon^m)$.

To check if it is additive of multiplicative, we compute the average range of Y(II) for each plate. Indeed if it was a scaling effect, this range should depend on the plate.

image

We see that this range seems homogenous between the different plates which guides us toward an additive model.

We also check the variance of each time point of the time-series for each plate and light condition. Indeed if the effect was multiplicative this variance would be $\delta_b^2 \cdot Var(\epsilon^m_{i})$ so we should find substantial differences, whereas with an additive effect, the variance is simply $Var(\epsilon^m_{i})$. We thus show the average of those variances for i varying between 0 and the end of the experiment :

image

Once again, we find no significant difference so we move forward with the additive model.

Normalization

This model allows us to define a very natural normalization process. Since we want to compare the mutants to the Wild Types and get rid of the batch effect, we simply have to consider :

$Y(II){obs}^m - Y(II){obs}^{WT} = (Y(II){true}^{m} - Y(II){true}^{WT}) + (\epsilon^M - \epsilon^WT)$

Where $Y(II)_{obs}^{WT}$ is the observed time-series of one of the Wild Type present on the plate.

Actually, in order to reduce the variance of this normalized data, and since we have three WT on each plate, we compute the mean of those three and then subtract it from the observed Y(II).

So we have $Y(II)^M_{norm} = Y(II)^M_{obs} - \frac{Y(II)^{{WT}{1}}{obs} + Y(II){obs}^{WT{2}} + Y(II)^{{WT}{3}}{obs}}{3}$ = Y(II)^M_{true} - Y(II)^{WT}{true} + \epsilon^M - \frac{\epsilon^WT{1} \epsilon^WT_{1} + \epsilon^WT_{2} + \epsilon^WT_{3}}{3}$

Normalization effects

This normalization should remove the batch effect and thus, it should remove the part of the variance that is causes, at least across plates while not changing the variance within a plate.

We confirm that this normalization has this effect :

image

image

Fitting a good model

With this framework, we don't have to make any hypothesis on the nature of $\delta$ and $\epsilon$. But in order to quantify the probability of $Y(II){true}^{m}$ being different to $Y(II){true}^{WT}$, we have to assess the distribution of $\epsilon$.

To this end, we display histograms of the distribution of the normalized $Y(II)$ for the WT for plate 99 (which is the plate that contains only WT).

Indeed, in this case the normalized $Y(II)$ should be equal (to a scaling factor) to $\epsilon$. This factor comes from the fact that to normalize, we remove the average of all the WT in the plate so we actually have : $Y(II)_{norm}^{WT} = \sqrt{1 - \frac{1}{384}}\epsilon$ where the equal symbol is the equality in law.

Here are the resulting histograms for the light condition of medium light :

image

We overlayed the distribution of the empirical underlaying gaussian distribution to show to goodness of fit.

We also perform a Kolmogorov-Smirnov test to compute p-values with $H_0$ being that the different time-points are drawn from gaussian distributions :

y2_1 y2_2 y2_3 y2_4 y2_5 y2_6 y2_7 y2_8 y2_9 y2_10 ... y2_31 y2_32 y2_33 y2_34 y2_35 y2_36 y2_37 y2_38 y2_39 y2_40
statistic 0.037446 0.040386 0.049549 0.043652 0.061762 0.038972 0.050765 0.046857 0.058197 0.036255 ... 0.05818 0.057305 0.054980 0.034652 0.045091 0.047349 0.048678 0.038562 0.052576 0.065752
p-value 0.640469 0.544557 0.292741 0.444627 0.102431 0.590274 0.266534 0.357047 0.142614 0.679844 ... 0.14283 0.154438 0.188977 0.732215 0.403885 0.344648 0.312579 0.603700 0.230744 0.069088

2 rows × 40 columns

We see that the p-values are well above 0.05. This means that our hypothesis of a gaussian error holds. To investigate this a bit further before using this model, we compute Q-Q plots of the those distributions to make sure that they match.

image

We see that the Q-Q plot against a normal distribution yields very satisfying results and we can be confident that modeling $\epsilon$ by a gaussian distribution is correct.

Clone this wiki locally