Skip to content

Understanding large discrepancies between salmon counts and inferential replicate medians #994

@JBrownBiostat

Description

@JBrownBiostat

Hi Rob / Salmon team,

I'm observing unexpectedly large differences between salmon's directly computed gene-level counts and the corresponding median values across inferential replicates. I'd appreciate clarification on why this occurs and how it should be interpreted.

Reproducible example (based on the tximport vignette):

library(readr)
library(tximport)
library(tximportData)
library(matrixStats)

dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))

files <- file.path(dir, "salmon_gibbs", samples$run, "quant.sf.gz")
names(files) <- paste0("sample", 1:6)
txi.inf.rep <- tximport(
    files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "scaledTPM"
)

# Calculate median inf-rep #
inf_med <- do.call(
    cbind,
    lapply(
        txi.inf.rep$infReps,
        rowMedians
    )
)

# Compare inf-rep to counts #
sub_index <- rowMins(txi.inf.rep$counts) >= 10
summary(log10(inf_med[sub_index, ] + 1) - log10(txi.inf.rep$counts[sub_index, ] + 1))

Output:

    sample1             sample2            sample3             sample4            sample5            sample6       
 Min.   :-2.459335   Min.   :-2.39361   Min.   :-2.239718   Min.   :-2.64777   Min.   :-2.98350   Min.   :-3.1225  
 1st Qu.:-0.009933   1st Qu.: 0.03157   1st Qu.:-0.007428   1st Qu.: 0.03227   1st Qu.:-0.04803   1st Qu.:-0.0097  
 Median : 0.205464   Median : 0.24663   Median : 0.207480   Median : 0.24734   Median : 0.15535   Median : 0.2034  
 Mean   : 0.174246   Mean   : 0.21739   Mean   : 0.177997   Mean   : 0.21514   Mean   : 0.12840   Mean   : 0.1741  
 3rd Qu.: 0.411219   3rd Qu.: 0.45224   3rd Qu.: 0.414717   3rd Qu.: 0.45213   3rd Qu.: 0.35163   3rd Qu.: 0.4102  
 Max.   : 1.357469   Max.   : 1.38608   Max.   : 1.351277   Max.   : 1.40385   Max.   : 1.31025   Max.   : 1.3654  

Questions:

  1. What causes these large discrepancies? Some genes show multiple-orders-of-magnitude differences between the direct count estimates and inferential replicate medians. In my own data (200 inferential replicates), the direct counts are often so far outside the inferential replicate range that they seem inconsistent with expected one-in-ten-thousand sampling variation.

  2. How should this be interpreted? I initially thought this might relate to the algorithm properties discussed in Salmon read number discrepancy across identical runs #613, but the magnitude of differences seems beyond what I'd expect from those issues alone.

  3. Is using the median/mode of inferential replicates as a point estimate advisable? While Salmon read number discrepancy across identical runs #613 suggests incorporating quantification uncertainty into count estimates, I found that using inferential replicate summaries helped identify a drug-treatment effect (validated by qPCR) that wasn't detected using direct counts in differential expression analysis.

Thanks for any insights you can provide!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions