Skip to content

Filtering: Possible bug and a question #37

@therealgenna

Description

@therealgenna

Hi,

(1) I think the paper Methods and the actual code disagree in calculating the Ms score. In the paper the text reads

... we calculate a score, Ms, for each putative microexon as Ms =1 − (1 − PsPU2)/n, where PU2 is the probability that the observed U2 score came from the Gaussian with the higher mean and n is the number of matches for a given intron.

The corresponding calculation in the code seems to be (from https://github.com/hemberg-lab/MicroExonator/blob/master/src/final_filters3.R, lines 85-95)

fit_U2_score <- normalmixEM(ME_matches_filter$U2_score, maxit = 10000, epsilon = 1e-05)
#ggplot_mix_comps(fit_U2_score, "Mixture model Micro-exon >=3 after coverge filter")
post.df <- as.data.frame(cbind(x = fit_U2_score$x, fit_U2_score$posterior))

#ME_final <- ME_centric_raw[ME %in% uniq_seq_filter & len_micro_exon_seq_found>=3, ]
if(fit_U2_score$mu[1]<=fit_U2_score$mu[2]){
  ME_final$ME_P_value <-  1 - (1 - approx(post.df$x, post.df$comp.1, ME_final$U2_scores)$y * ME_final$P_MEs) / ME_final$total_number_of_micro_exons_matches
} else {
  ME_final$ME_P_value <-  1 - (1 - approx(post.df$x, post.df$comp.2, ME_final$U2_scores)$y * ME_final$P_MEs)/ ME_final$total_number_of_micro_exons_matches
}

So if ME_final$ME_P_value should be identified with Ms the code calculates 1 − (1 − Ps(1-PU2))/n and not 1 − (1 − PsPU2)/n defined in the paper (because it takes the probability of the LOWER mean component, and not HIGHER mean.)

Or should Ms actually be 1 - ME_final$ME_P_value? In that case it will not match the paper either, I think ...

(2) In general, I am not sure I can understand the logic behind the expression for the Ms score in the paper. High scores are likely to indicate true microexons, and by definition Ms = 1 − (1 − PsPu2)/n = 1 - 1/n + PsPu2/n. Wouldn't you want Ms decrease with increasing Ps? The formula is the opposite. Also, if I understood correctly, n is the actual number of microexon+splice sites exact matches in the given intron sequence. I would expect Ms to grow for lower n (n>0), i.e., decrease with n increasing, But the formula is the opposite again. Shouldn't it be something like (1-Ps)Pu2/n? Or if you take it as 1 - ME_final$ME_P_value from the code it will be [1-(1-Pu2)Ps]/n, which also would make sense to me. Am I missing the point completely?

Thank you

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions