Understanding large discrepancies between salmon counts and inferential replicate medians

Hi Rob / Salmon team,

I'm observing unexpectedly large differences between salmon's directly computed gene-level counts and the corresponding median values across inferential replicates. I'd appreciate clarification on why this occurs and how it should be interpreted.

**Reproducible example** (based on the `tximport` vignette):

```
library(readr)
library(tximport)
library(tximportData)
library(matrixStats)

dir <- system.file("extdata", package = "tximportData")
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))

files <- file.path(dir, "salmon_gibbs", samples$run, "quant.sf.gz")
names(files) <- paste0("sample", 1:6)
txi.inf.rep <- tximport(
    files, type = "salmon", tx2gene = tx2gene, countsFromAbundance = "scaledTPM"
)

# Calculate median inf-rep #
inf_med <- do.call(
    cbind,
    lapply(
        txi.inf.rep$infReps,
        rowMedians
    )
)

# Compare inf-rep to counts #
sub_index <- rowMins(txi.inf.rep$counts) >= 10
summary(log10(inf_med[sub_index, ] + 1) - log10(txi.inf.rep$counts[sub_index, ] + 1))
```

**Output:**

```
    sample1             sample2            sample3             sample4            sample5            sample6       
 Min.   :-2.459335   Min.   :-2.39361   Min.   :-2.239718   Min.   :-2.64777   Min.   :-2.98350   Min.   :-3.1225  
 1st Qu.:-0.009933   1st Qu.: 0.03157   1st Qu.:-0.007428   1st Qu.: 0.03227   1st Qu.:-0.04803   1st Qu.:-0.0097  
 Median : 0.205464   Median : 0.24663   Median : 0.207480   Median : 0.24734   Median : 0.15535   Median : 0.2034  
 Mean   : 0.174246   Mean   : 0.21739   Mean   : 0.177997   Mean   : 0.21514   Mean   : 0.12840   Mean   : 0.1741  
 3rd Qu.: 0.411219   3rd Qu.: 0.45224   3rd Qu.: 0.414717   3rd Qu.: 0.45213   3rd Qu.: 0.35163   3rd Qu.: 0.4102  
 Max.   : 1.357469   Max.   : 1.38608   Max.   : 1.351277   Max.   : 1.40385   Max.   : 1.31025   Max.   : 1.3654  
```

**Questions:**

1. What causes these large discrepancies? Some genes show multiple-orders-of-magnitude differences between the direct count estimates and inferential replicate medians. In my own data (200 inferential replicates), the direct counts are often so far outside the inferential replicate range that they seem inconsistent with expected one-in-ten-thousand sampling variation.

2. How should this be interpreted? I initially thought this might relate to the algorithm properties discussed in #613, but the magnitude of differences seems beyond what I'd expect from those issues alone.

3. Is using the median/mode of inferential replicates as a point estimate advisable? While #613 suggests incorporating quantification uncertainty into count estimates, I found that using inferential replicate summaries helped identify a drug-treatment effect (validated by qPCR) that wasn't detected using direct counts in differential expression analysis.

Thanks for any insights you can provide!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Understanding large discrepancies between salmon counts and inferential replicate medians #994

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Understanding large discrepancies between salmon counts and inferential replicate medians #994

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions