-
Notifications
You must be signed in to change notification settings - Fork 184
Description
Hi Rob / Salmon team,
I'm observing unexpectedly large differences between salmon's directly computed gene-level counts and the corresponding median values across inferential replicates. I'd appreciate clarification on why this occurs and how it should be interpreted.
Reproducible example (based on the tximport vignette):
library(readr)
library(tximport)
library(tximportData)
library(matrixStats)
dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
files <- file.path(dir, "salmon_gibbs", samples$run, "quant.sf.gz")
names(files) <- paste0("sample", 1:6)
txi.inf.rep <- tximport(
files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "scaledTPM"
)
# Calculate median inf-rep #
inf_med <- do.call(
cbind,
lapply(
txi.inf.rep$infReps,
rowMedians
)
)
# Compare inf-rep to counts #
sub_index <- rowMins(txi.inf.rep$counts) >= 10
summary(log10(inf_med[sub_index, ] + 1) - log10(txi.inf.rep$counts[sub_index, ] + 1))
Output:
sample1 sample2 sample3 sample4 sample5 sample6
Min. :-2.459335 Min. :-2.39361 Min. :-2.239718 Min. :-2.64777 Min. :-2.98350 Min. :-3.1225
1st Qu.:-0.009933 1st Qu.: 0.03157 1st Qu.:-0.007428 1st Qu.: 0.03227 1st Qu.:-0.04803 1st Qu.:-0.0097
Median : 0.205464 Median : 0.24663 Median : 0.207480 Median : 0.24734 Median : 0.15535 Median : 0.2034
Mean : 0.174246 Mean : 0.21739 Mean : 0.177997 Mean : 0.21514 Mean : 0.12840 Mean : 0.1741
3rd Qu.: 0.411219 3rd Qu.: 0.45224 3rd Qu.: 0.414717 3rd Qu.: 0.45213 3rd Qu.: 0.35163 3rd Qu.: 0.4102
Max. : 1.357469 Max. : 1.38608 Max. : 1.351277 Max. : 1.40385 Max. : 1.31025 Max. : 1.3654
Questions:
-
What causes these large discrepancies? Some genes show multiple-orders-of-magnitude differences between the direct count estimates and inferential replicate medians. In my own data (200 inferential replicates), the direct counts are often so far outside the inferential replicate range that they seem inconsistent with expected one-in-ten-thousand sampling variation.
-
How should this be interpreted? I initially thought this might relate to the algorithm properties discussed in Salmon read number discrepancy across identical runs #613, but the magnitude of differences seems beyond what I'd expect from those issues alone.
-
Is using the median/mode of inferential replicates as a point estimate advisable? While Salmon read number discrepancy across identical runs #613 suggests incorporating quantification uncertainty into count estimates, I found that using inferential replicate summaries helped identify a drug-treatment effect (validated by qPCR) that wasn't detected using direct counts in differential expression analysis.
Thanks for any insights you can provide!