excelerate-scRNAseq/session-de/session-de-methods.Rmd at master · NBISweden/excelerate-scRNAseq · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
---
title: "Differential expression methods"
bibliography: bibliography.bib
header-include:
  - \usepackage{amsmath}
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.path="session-de-files/figures/")
```

## Learning objectives
- describe main classes of methods used
- understand statistical concepts behind the key scRNA-seq DE methods
- run the key scRNA-seq DE methods

-------

## Methods examples

![Software tools for identifying DE genes using scRNAseq data (Wang et al, 2019)](session-de-files/images/methods-Wang-2019.png)


<img src="session-de-files/images/methods-Robinson-2018.pdf" alt="And even more examples in the Soneson, Charlotte, and Mark D Robinson (2018)"  width="4200" height="4200">

----------

## Main classes
#### non-parameteric tests
- generic methods
- e.g. Wilcoxon rank-sum test, Kruskal-Wallis, Kolmogorov-Smirnov test
- non-parametric tests generally convert observed expression values to ranks & test whether the distribution of ranks for one group are signficantly different from the distribution of ranks for the other group
- some non-parametric methods fail in the presence of a large number of tied values, such as the case for dropouts (zeros) in single-cell RNA-seq expression data
- if the conditions for a parametric test hold, then it will typically be more powerful than a non-parametric test

#### bulk RNA-seq based method
- developed for bulk RNA-seq
- e.g. edgeR, DE-seq2
- compare estimates of mean-expression (sample size), based on negative binomial distribution
- can be assessed by datasets where RNA-seq data has beeen validated by RT-qPCR

#### scRNA-seq specific methods
- developed for scRNA-seq
- e.g. MAST, SCDE, Monocle, Pagoda, D3E, DESeingle, SigEMD, etc.
- large number of samples, i.e. cells, for each group we are comparing in single-cell experiments. Thus we can take advantage of the whole distribution of expression values in each group to identify differences between groups
-we usually do not have a defined set of experimental conditions; instead we try to  identify the cell groups by using an unsupervised clustering approach.

#### modeling wise
- distribution free (non-parameteric)
- negative binomial
- zero inflated negative binomial
- Poisson and negative binomial
- GLM and GAM
- etc.

--------

## Statistical thinking

<figure>
<!-- <img src="session-de-files/images/methods-stats.png" width="400" height="400"> -->
<img src="session-de-files/images/methods-stats.png">
<figcaption>$$Outcome_i=Model_i+error_i$$</figcaption>
</figure>

#### Normal distribution example
```{r, dist-normal, echo=F, eval=T}
set.seed(1)
x <- rnorm(1000, 170, 1)
h <-hist(x, col=rgb(1,0,0,0.5), xlab="height [cm]", n=50, main="", xlim=c(165, 180))
xfit <-seq(min(x),max(x),length=40)
yfit <-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="black", lwd=2)

x2 <- rnorm(1000, 173, 1)
h<-hist(x2, col=rgb(0,0,1,0.5), xlab="height", n=50, add=T, main="")
xfit2<-seq(min(x2),max(x2),length=40)
yfit2<-dnorm(xfit2,mean=mean(x2),sd=sd(x2))
yfit2 <- yfit2*diff(h$mids[1:2])*length(x2)
lines(xfit2, yfit2, col="black", lwd=2)
```

$$t=\frac{m_1-m_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$

#### Generic recipe
- model e.g. gene expression with random error
- estimate model parameters
- use model for prediction and/or inference
- the better the model fits the data, the better statistics


## Common distributions

#### Negative binomial (NB)
```{r, dist-NB, echo=F, fig.height=3}

set.seed(1)
par(mfrow=c(1,3))
hist(rnbinom(1000, mu=4, size=100), col="grey", xlab="Read Counts", main=expression(paste(mu,"=4, ", alpha,"=1/100")),xlim=c(0,25),ylim=c(0,0.5),20,freq=FALSE)
hist(rnbinom(1000, mu=4, size=10), col="grey", xlab="Read Counts", main=expression(paste(mu,"=4, ", alpha,"=1/10")),xlim=c(0,25),ylim=c(0,.5),20,freq=FALSE)
hist(rnbinom(1000, mu=4, size=1), col="grey", xlab="Read Counts", main=expression(paste(mu,"=4, ", alpha,"=1")),xlim=c(0,25),ylim=c(0,.5),20,freq=FALSE)
```
$$Mean=\mu$$
$$Variance=\mu+\mu^2/\alpha$$
$\alpha$=dispersion parameter (extra variation compared to the Poisson)

NB! fits bulk RNA-seq data very well, and it is used for most statistical methods designed for such data. In addition, it has been show to fit the distribution of molecule counts obtained from data tagged by unique molecular identifiers (UMIs) quite well (Grun et al. 2014, Islam et al. 2011).

#### zero inflated NB
```{r, dist-zero-inflated-NB, echo=F, fig.height=3}

set.seed(1)
par(mfrow=c(1,3))

xlims<-c(0,10)
ylims<-c(0,0.6)

d = 0.2;
counts <- rnbinom(1000, mu=4, size=100);
counts[runif(1000) < d] = 0;
hist(counts, col="grey50", xlab="Read Counts", main=expression(paste(mu,"=4, ", alpha,"=1/100, ", d,"=0.2")),freq=FALSE,xlim=xlims,ylim=ylims,n=20);

d = 0.5;
counts <- rnbinom(1000, mu=4, size=100);
counts[runif(1000) < d] = 0;
hist(counts, col="grey50", xlab="Read Counts", main=expression(paste(mu,"=4, ", alpha,"=1/100, ", d,"=0.5")),freq=FALSE,xlim=xlims,ylim=ylims,n=20);

d = 0.5;
counts <- rnbinom(1000, mu=10, size=100);
counts[runif(1000) < d] = 0;
hist(counts, col="grey50", xlab="Read Counts", main=expression(paste(mu,"=10, ", alpha,"=1/100, ", d,"=0.2")),freq=FALSE,xlim=c(0,20),ylim=ylims,n=20);

```
$$Mean=\mu\cdot(1-d)$$
$$Variance=\mu\cdot(1-d)\cdot(1+\mu(d+\mu))$$

_d_, dropout rate; dropout rate of a gene is strongly correlated with the mean expression of the gene. Different zero-inflated negative binomial models use different relationships between _mu_ and _d_ and some may fit _mu_ and _d_ to the expression of each gene independently. Implemented in SCDE.

#### Poisson beta
```{r, dist-Poisson, echo=F, fig.height=3}

set.seed(1)
par(mfrow=c(1,2))

a <- 0.1
b <- 0.1
g <- 100
lambdas <- rbeta(1000, a, b)
counts <- sapply(g*lambdas, function(l) {rpois(1, lambda = l)})
hist(
    counts,
    col = "grey50",
    xlab = "Read Counts",
    main = "Poisson-Beta"
)


a <- 0.4
b <- 0.2
g <- 100
lambdas <- rbeta(1000, a, b)
counts <- sapply(g*lambdas, function(l) {rpois(1, lambda = l)})
hist(
    counts,
    col = "grey50",
    xlab = "Read Counts",
    main = "Poisson-Beta"
)

```
$$Mean=(g\cdot a)/(a+b)$$
$$Variance=g^2\cdot a\cdot b/((a+b+1)\cdot(a+b)^2)$$

_a_: the rate of activation of transcription;

_b_: the rate of inhibition of transcription;

_g_: the rate of transcript production while transcription is active at the locus.

Differential expression methods may test each of the parameters for differences across groups or only one (often _g_). Implemented in BPSC, D3E

May be further expanded to explicitly account for other sources of gene expression differences such as batch-effect or library depth depending on the particular DE algorithm.

----------

## Under the hood

### SCDE
- _models_ the read counts for each gene using a mixture of a NB, negative binomial, and a Poisson distribution
- _NB distribution_ models the transcripts that are amplified and detected
- _Poisson distribution_ models the unobserved or background-level signal of transcripts that are not amplified (e.g. dropout events)
- subset of robust genes is used to fit, via _EM_ algorithm, the parameters to the mixture of models
For DE, the posterior probability that the gene shows a fold expression difference between two conditions is computed using a _Bayesian approach_

#### Loading the data

Data can be downloaded from [here](https://stockholmuniversity.box.com/s/b2is7q3msfewnvnivky5w0gj9pj27pxi)
```{r, data-load, eval=F}

# we will read both counts and rpkms as different method are more adopted for different type of data
R<-read.table("data/mouse_embryo/rpkms_Deng2014_preinplantation.txt")
C<-read.table("data/mouse_embryo/counts_Deng2014_preinplantation.txt")
M <- read.table("data/mouse_embryo/Metadata_mouse_embryo.csv",sep=",",header=T)

# select only 8 and 16 stage embryos.
selection <- c(grep("8cell",M$Stage),grep("16cell",M$Stage))

# check number of cells
table(M$Stage[selection])

# select those cells only from the data frames
M<-M[selection,]
R <- R[,selection]
C <- C[,selection]

```

#### Run SCDE
```{r, scde, eval=F}
require(scde)

# make a count dataset and clean up
cd <- clean.counts(C, min.lib.size=1000, min.reads = 1, min.detected = 1)

# make factor for stage
stages <- factor(M$Stage,levels=c("X8cell","X16cell"))
names(stages) <- colnames(C)

# fit error models - takes a while to run (~20 mins).
# you can skip this step and load the precomputed file
# found [in here](https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-de/scde_error_models.Rdata?raw=true)
# Make sure to fix the file path in the next line.
savefile<-"data/mouse_embryo/DE/scde_error_models.Rdata"
if (file.exists(savefile)){
  load(savefile)
}else{
  #This works only if flexmix 2.3-13 (not 2.3-15) is installed, cf: http://hms-dbmi.github.io/scde/package.html
  #check your environment file to change this.
  o.ifm <- scde.error.models(counts = cd, groups = stages, n.cores = 1,
                             threshold.segmentation = TRUE, save.crossfit.plots = FALSE,
                             save.model.plots = FALSE, verbose = 0)
  save(o.ifm,file=savefile)
}

# estimate gene expression prior
o.prior <- scde.expression.prior(models = o.ifm, counts = cd, length.out = 400, show.plot = FALSE)

# run differential expression test
ediff <- scde.expression.difference(o.ifm, cd, o.prior, groups  =  stages,
                                    n.randomizations  =  100, n.cores  =  1, verbose  =  1)

# convert Z-score and corrected Z-score to 2-sided p-values
ediff$pvalue <- 2*pnorm(-abs(ediff$Z))
ediff$p.adjust <- 2*pnorm(-abs(ediff$cZ))

# sort and look at top results
ediff <- ediff[order(abs(ediff$Z), decreasing  =  TRUE), ]
head(ediff)

# write output to a file with top DE genes on top.
write.table(ediff,file="data/mouse_embryo/DE/scde_8cell_vs_16_cell.tab",sep="\t",quote=F)
```

#### Run SCDE with batch info
```{r, scde-batch, eval=F}

# include also batch information in the test
# in this case we will consider each embryo as a batch just for testing
# OBS! in this case there are no batch that belongs to both groups so the correction may be pointless.

batch <- factor(M$Embryo, levels <- unique(M$Embryo))
ediff.batch <- scde.expression.difference(o.ifm, cd, o.prior, groups  =  stages,
                                          n.randomizations  =  100, n.cores  =  1, batch=batch)

# now scde.expression.difference returns a list with the corrected values as well as the DE results.

de.batch <- ediff.batch$results
de.batch$pvalue <- 2*pnorm(-abs(de.batch$Z))
de.batch$p.adjust <- 2*pnorm(-abs(de.batch$cZ))

# look at top results
head(de.batch[order(abs(de.batch$Z), decreasing  =  TRUE), ])

```


### MAST
- uses _generalized linear hurdle model_
- designed to account for stochastic dropouts and bimodal expression distribution in which expression is either strongly non-zero or non-detectable
- the rate of expression _Z_, and the level of expression _Y_, are modeled for each gene _g_, indicating whether gene _g_ is expressed in cell _i_ (i.e., $$Z_{ig}=0$$ if $y_{ig}=0$ and $z_{ig}=1$ if $y_{ig}>0$)
- A *logistic regression model* for the discrete variable _Z_ and a _Gaussian linear model_ for the continuous variable (Y|Z=1):
    $logit (P_r(Z_{ig}=1))=X_i\beta_g^D$
    $P_r(Y_{ig}=Y|Z_{ig}=1)=N(X_i\beta_g^C,\sigma_g^2)$, where $X_i$ is a design matrix

- Model parameters are _fitted_ using an empirical Bayesian framework
- Allows for a joint estimate of nuisance and treatment effects
- DE is determined using _the likelihood ratio test_

#### Run MAST
```{r, mast, echo=T, eval=F}

library(MAST)

fdata <- data.frame(rownames(R))
rownames(fdata)<-rownames(R)

# Make a single cell assay object
sca <- FromMatrix(log2(as.matrix(R)+1),M,fdata)

# count number of detected genes and scale.
cdr2 <-colSums(assay(sca)>0)
colData(sca)$cngeneson <- scale(cdr2)

# Fit a hurdle model, modeling the condition and (centered) ngeneson factor, thus adjusting for
# the cellular detection rate. In this case we set the reference to be 8cell stage using relevel
# of the factor.

# takes a while to run, so save to an file so that you do not have to rerun each time
# You can load it [from here](https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-de/mast_data.Rdata?raw=true)
savefile<-"data/mouse_embryo/DE/mast_data.Rdata"
if (file.exists(savefile)){
  load(savefile)
}else{
  cond<-factor(colData(sca)$Stage)
  cond <- relevel(cond,"X8cell")
  colData(sca)$condition<-cond
  zlmCond <- zlm(~condition + cngeneson, sca)
  #Run likelihood ratio test for the condition coefficient.
  summaryCond <- summary(zlmCond, doLRT='conditionX16cell')
  save(sca,zlmCond,summaryCond,file=savefile)
}
# get the datatable with all outputs
summaryDt <- summaryCond$datatable

# Make a datatable with all the relevant output - combined Hurdle model
fcHurdle <- merge(summaryDt[contrast=='conditionX16cell' & component=='H',.(primerid, `Pr(>Chisq)`)],
	 summaryDt[contrast=='conditionX16cell' & component=='logFC',
	           .(primerid, coef, ci.hi, ci.lo)], by='primerid')

# change name of Pr(>Chisq) to fdr and sort by fdr
fcHurdle[,fdr:=p.adjust(`Pr(>Chisq)`, 'fdr')]
fcHurdle <- fcHurdle[order(fdr),]
head(fcHurdle)

# write output to a table
write.table(fcHurdle,file="data/mouse_embryo/DE/mast_8cell_vs_16_cell.tab",sep="\t",quote=F)


```


### SC3 Kruskall-Wallis test
In the SC3 package they have implemented non-parametric Kruskal-Wallis test for differential expression.

#### Run SC3 Kruskall-Wallis test

```{r, sc3, eval=F}

library(SC3)

# run their DE test using rpkm values and as labels, M$stage
de <- get_de_genes(R,M$Stage)

# output is simply a p-value with no information of directionality.
de.results <- data.frame(gene=rownames(R),p.value=de)
de.results <- de.results[order(de.results$p.value),]

head(de.results)
write.table(de.results,file="data/mouse_embryo/DE/sc3_kwtest_8cell_vs_16_cell.tab",sep="\t",quote=F)


```


### Seurat methods
Seurat has several tests for differential expression (DE) which can be set with the test.use parameter in the FindMarkers() function:

* "wilcox" : Wilcoxon rank sum test (default)
* "bimod" : Likelihood-ratio test for single cell gene expression, (McDavid et al., Bioinformatics, 2013)
* "roc" : Standard AUC classifier
* "t" : Student's t-test
* "negbinom" : Identifies differentially expressed genes between two groups of cells using a negative binomial generalized linear model. Use only for UMI-based datasets
* "poisson" : Identifies differentially expressed genes between two groups of cells using a poisson generalized linear model. Use only for UMI-based datasets
* "LR": Uses a logistic regression framework to determine differentially expressed genes. Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test.
* "MAST" : GLM-framework that treates cellular detection rate as a covariate (Finak et al, Genome Biology, 2015)
* "DESeq2" : DE based on a model using the negative binomial distribution (Love et al, Genome Biology, 2014)

#### Run Seurat methods (for Seurat v3)
```{r, seurat, eval=F}

library(Seurat)

data <- CreateSeuratObject(C)

# Normalize the data
scale.factor <- mean(colSums(C))
data <- NormalizeData(object = data, normalization.method = "LogNormalize",
                     scale.factor = scale.factor)

# Scale the data (needed for PCA etc)
data <- ScaleData(object = data)

# run all DE methods
methods <- c("wilcox","bimod","roc","t","negbinom","poisson","LR","MAST","DESeq2")
DE <- list()
for (m in methods){
 outfile <- paste("data/mouse_embryo/DE/seurat_",m,"_8cell_vs_16_cell.tab", sep='')
 if(!file.exists(outfile)){
   DE[[m]]<- FindMarkers(object = data,ident.1 = "X8cell",
                         ident.2 = "X16cell",test.use = m)
   write.table(DE[[m]],file=outfile,sep="\t",quote=F)
 }
}

```

### Monocole
- Originally designed for ordering cells by progress through differentiation stages (pseudo-time)
- The mean expression level of each gene is _modeled with a GAM_, generalized additive model, which relates one or more predictor variables to a response variable as $g(E(Y))=\beta_0+f_1(x_1)+f_2(x_2)+...+f_m(x_m)$ where Y is a specific gene expression level, $x_i$ are predictor variables, _g_ is a link function, typically log function, and $f_i$ are non-parametric functions (e.g. cubic splines)
- The observable expression level Y is then modeled using GAM,
$E(Y)=s(\varphi_t(b_x, s_i))+\epsilon$ where $\varphi_t(b_x, s_i)$ is the assigned pseudo-time of a cell and $s$ is a cubic smoothing function with three degrees of freedom. The error term $\epsilon$ is normally distributed with a mean of zero
- The DE test is performed using an _approx. $\chi^2$ likelihood ratio test_


---------

# [Jump to Schedule](../schedule.md)
# [Back to Introduction](session-de.md)
# [Next to Methods Evaluation](session-de-methods-evaluation.md)


---------


-------