Filtering: Possible bug and a question

Hi,

(1) I think the paper Methods and the actual code disagree in calculating the `Ms` score. In the paper the text reads

**... we calculate a score, `Ms`, for each putative microexon as `Ms =1 − (1 − PsPU2)/n`, where `PU2` is the probability that the observed U2 score came from the Gaussian with the _higher_ mean and `n` is the number of matches for a given intron.**


The corresponding calculation in the code seems to be (from https://github.com/hemberg-lab/MicroExonator/blob/master/src/final_filters3.R, lines 85-95)
```
fit_U2_score <- normalmixEM(ME_matches_filter$U2_score, maxit = 10000, epsilon = 1e-05)
#ggplot_mix_comps(fit_U2_score, "Mixture model Micro-exon >=3 after coverge filter")
post.df <- as.data.frame(cbind(x = fit_U2_score$x, fit_U2_score$posterior))

#ME_final <- ME_centric_raw[ME %in% uniq_seq_filter & len_micro_exon_seq_found>=3, ]
if(fit_U2_score$mu[1]<=fit_U2_score$mu[2]){
  ME_final$ME_P_value <-  1 - (1 - approx(post.df$x, post.df$comp.1, ME_final$U2_scores)$y * ME_final$P_MEs) / ME_final$total_number_of_micro_exons_matches
} else {
  ME_final$ME_P_value <-  1 - (1 - approx(post.df$x, post.df$comp.2, ME_final$U2_scores)$y * ME_final$P_MEs)/ ME_final$total_number_of_micro_exons_matches
}
```
So if `ME_final$ME_P_value` should be identified with `Ms` the code calculates `1 − (1 − Ps(1-PU2))/n` and not `1 − (1 − PsPU2)/n` defined in the paper (because it takes the probability of the LOWER mean component, and not HIGHER mean.)

Or should `Ms` actually be `1 - ME_final$ME_P_value`? In that case it will not match the paper either, I think ...

(2) In general, I am not sure I can understand the logic behind the expression for the `Ms` score in the paper. High scores are likely to indicate true microexons, and by definition `Ms = 1 − (1 − PsPu2)/n = 1 - 1/n + PsPu2/n`. Wouldn't you want `Ms` decrease with increasing `Ps`? The formula is the opposite. Also, if I understood correctly, `n` is the actual number of microexon+splice sites exact matches in the given intron sequence. I would expect `Ms` to grow for lower `n` (n>0), i.e., decrease with `n` increasing, But the formula is the opposite again. Shouldn't it be something like `(1-Ps)Pu2/n`? Or if you take it as `1 - ME_final$ME_P_value` from the code it will be `[1-(1-Pu2)Ps]/n`, which also would make sense to me. Am I missing the point completely?

Thank you

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filtering: Possible bug and a question #37

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Filtering: Possible bug and a question #37

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions