Analysis_of_scRNAseq_using_R/Analysis_Zeisel2015.Rmd at master · mkrzak/Analysis_of_scRNAseq_using_R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
---
title: "Analysis of scRNAseq using R"
author: M. Krzak, A. Carissimo, D. Righelli, C. Angelini
output:
  html_document:
    toc: true
    toc_float: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
```

# Step-by-step workflow on Zeisel dataset

## Package loading

Before performing the analysis, several R-based packages should be installed on your local machine. Note that the packages may require an additional installation of other dependencies. Make sure that you have installed recommended version of the packages and your machine meets the mimimal system requirements for carrying out this analysis.

```{r message=FALSE}
library(scater) #data quality control and vizualization
library(M3Drop) #implements feature selection methods
library(monocle) #complete data analysis: preprocessing, dimension reduction, clustering and other
library(Seurat) #complete data analysis: QC, analysis and data exploration
library(mclust) #implements metrics for clustering validation
library(scran) #implements functions for data preprocessing and normalization
library(SC3) #mainly for clustering scRNAseq data
```

## Data information

**Zeisel2015** is a UMI-based scRNAseq dataset published in Zeisel et al. (2015) (Link to study: http://science.sciencemag.org/content/347/6226/1138) that contains the expression matrix of endogenous, spike-in and mitochondrial genes from 3005 cells from mouse cortex and hippocampus tissues. Additional metadata information are available, including cells origin, type or tissue. For the purpose of this tutorial we stored **Zeisel2015** count matrix with metadata information in two separate txt files that you can download from here: https://drive.google.com/drive/folders/1W7EG4ySF6T74NqjAVCXRfMEdiY9bHnqN?usp=sharing. We can read these files and create an easy-to-work *SingleCellExperiment* experiment object.

![ ](images/sce.png)

```{r message=FALSE}
all.counts <- read.table(file ="data/all.counts.txt",  sep = "\t")
metadata <- read.table(file ="data/metadata.txt", sep = "\t", header = TRUE)

sce <- SingleCellExperiment(list(counts=as.matrix(all.counts)), colData=metadata)
sce
```

We can explore the information about cells and genes stored in *SingleCellExperiment* experiment object under *colData* and *rawData* slots, respectively.

```{r}
counts(sce)[1:4,1:4]
names(colData(sce)) #available metadata information
table(colData(sce)$tissue) #tissue annotation
table(colData(sce)$level1class) #cell type annotation
names(rowData(sce))
```

As mentioned, **Zeisel2015** dataset contains also the expression of spike-ins and mitochondrial genes. Spike-in gene names usually starts from "ERCC" and mitochondrial genes contain "mt" string.

```{r}
is.spike <- grepl("^ERCC", rownames(sce))
summary(is.spike)

is.mito <- grepl("^mt-", rownames(sce))
summary(is.mito)
```

## Quality control and data preprocessing

We can use **scater** package to automatically calculate several quality metrics per cell and per gene. These metrics will be useful to i.e. detect low quality cells or lowly expressed genes that should be removed from further analysis. Remember to not define spike-ins in *feature_controls* if your dataset do not contain any ERCC genes.

```{r}
sce <- calculateQCMetrics(sce,  feature_controls=list(Spike=is.spike, Mt=is.mito))
```

### Quality metrics per cell

```{r}
colnames(colData(sce))
colData(sce)$total_counts[1:5] #Total count for each cell
colData(sce)$total_features_by_counts[1:5] #Number of genes with non zero counts per cell
if(length(is.spike==TRUE)>0){colData(sce)$pct_counts_Spike[1:5]} #Percentage of spike-in counts per cell
```

### Quality metrics per gene

```{r}
colnames(rowData(sce))
rowData(sce)$mean_counts[1:5] #Average count per gene
rowData(sce)$n_cells_by_counts[1:5] #Number of cells with non zero counts per gene
rowData(sce)$pct_dropout_by_counts[1:5] #Percentage of dropouts per gene
```

### Plot quality metrics

One can plot histograms based on quality metrics to determine possible thresholds for cell/gene filterings. We can also visualize some of the quality metrics on the PCA projected data.

```{r}
hist(sce$total_counts, breaks=50,  xlab="Library size", ylab="Number of cells")
assay(sce, "logcounts_raw") <- log2(counts(sce)+1)
plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="total_counts")

hist(sce$total_features_by_counts, xlab="Number of expressed genes", breaks=50,  ylab="Number of cells")
plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="total_features_by_counts")

if(length(is.spike==TRUE)>0)
{hist(sce$pct_counts_Spike, xlab="ERCC proportion (%)", ylab="Number of cells", breaks=50)
plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="pct_counts_Spike")}

```

### Automatic filtering of outlier cells based on quality metrics

Alternatively, isOutlier function provides an automatic detection of low quality cells based on quality metrics. In this example, we are filtering cells based on library size, number of expressed genes per cell and total count over all spike-in transcripts in each cell. Excluded cells are those with the total number of expressed genes and the total sum of counts more than 3 median absolute deviations below the median across the genes. Note that we dont filter many cells in this step, it is likely that dataset was already quality controlled by the original study.

```{r}
libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)
if(length(is.spike==TRUE)>0){spike.drop <- isOutlier(sce$pct_counts_Spike, nmads=3, type="higher")
ind <- libsize.drop | feature.drop | spike.drop}else{ind <- libsize.drop | feature.drop}

sce$removed=ind
plotPCA(sce, colour_by="removed", run_args=c(exprs_values="logcounts_raw"), shape_by="removed")

if(length(is.spike==TRUE)>0)
{sce <- sce[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), BySpike=sum(spike.drop), Remaining=ncol(sce))
}else{sce <- sce[,!(libsize.drop | feature.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), Remaining=ncol(sce))
}

```

### Explore highly expressed genes

We can also explore highly and overexpressed genes. The overexpressed genes, are those genes which expression is higher than others by several magnitudes. Overexpressed genes can bias further procedures such as clustering, therefore, they should be removed before performing downstream analysis. Note that highly expressed genes mostly include spike-ins and mitochondrial genes. One of the most commonly overexpressed endogenous gene is "Malat1".

```{r}
plotHighestExprs(sce, exprs_values = "counts")
```


### Filter overexpressed features

In this step we will filter spike-in and mitochondrial genes as well as "Malat1".

```{r}
if(length(is.spike==TRUE)>0){sce=sce[-which(grepl("^ERCC-", rownames(sce))),]}
sce=sce[-which(grepl("^mt-", rownames(sce))),]
sce <- sce[-which(rownames(sce)=="Malat1"),]
dim(sce)
```

### Filter lowly expressed genes based on the average count per gene

Now, we should filter out lowly expressed genes as they do not provide any insights into the underlying biology. In the filtering we removed lowly expressed genes that are genes with average expression count (adjusted by library size) equal to 0. Note that we dont filter many genes in this step, it is likely that dataset was already quality controlled by the original study.

```{r}
rowData(sce)$ave.counts <- calcAverage(sce, exprs_values = "counts", use_size_factors=FALSE)
to.keep <- rowData(sce)$ave.counts > 0
sce <- sce[to.keep,]
summary(to.keep)
dim(sce)
```

### Perform standard normalization by computing counts per million (CPM)

After quality control and filtering, one should normalize the dataset to remove potential technical bias. One strategy is to use CPM (Count Per Million) normalization which is a correction to remove the noise related to sequencing depth. CPM divides each count by its total sum (across all the genes) and multiplies by one million. In this way, each cell has the same total sum of the counts. We can plot PCA on the log-transformed CPM normalized counts and compare with previous PCA on the log raw counts.

```{r}
cpm(sce) <- calculateCPM(sce, use_size_factors=FALSE)
cpm(sce)[1:4,1:4]

assay(sce, "logcounts") <- log2(cpm(sce)+1)
plotPCA(sce,colour_by="level1class",run_args=c(exprs_values="logcounts"))
set.seed(100)
plotTSNE(sce, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
plotUMAP(sce, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
```

### Perform normalization using deconvolution method

Another strategy, more suitable for scRNAseq data, is deconvolution method implemented in **scran** package. The deconvolution method normalizes data by cells-pooled size factors that account for dropout biases. More details about this normalization technique can be found in study https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7. Similarly, we will plot PCA on the log **scran** normalized counts.

```{r}
set.seed(100)
clusters <- quickCluster(sce, min.size=200, min.mean=0.1, method="igraph")
sce <- computeSumFactors(sce, cluster=clusters, min.mean=0.1)
sce <- scater::normalize(sce, return_log = FALSE)
normcounts(sce)[1:4,1:4]

assay(sce, "logcounts") <- log2(normcounts(sce)+1)
plotPCA(sce,colour_by="level1class",run_args=c(exprs_values="logcounts"))
```


## Reducing dataset dimensionality

To deal with a large number of dimensions in scRNAseq dataset a feature selection or dimension reduction strategies can be applied.

### Feature selection - to retrieve most variable genes

Feature selection step is aimed at preserving biologically relevant information in the dataset and improving computational efficiency of downstream analyses. In most of the cases, it seeks for highly variable genes based on mean/variance relationship. **M3Drop** is one of the packages for automatic detection of most variable genes. It first calculates the mean and square coefficient of variation and fits the quadratic curve between the two variables. Then a chi-square test is used to find high variable genes which are significantly above the curve. We found CPM adjusted counts more appriopriate to use as input for this method rather than **scran** normalized values. Note that when using older versions of **M3Drop** package (<1.10.0) the output is an array of genes instead of gene table with p-values.

```{r}
hvg_table <- BrenneckeGetVariableGenes(cpm(sce), fdr = 0.01, minBiolDisp = 0.5)
length(hvg_table$Gene)
sce_sub=sce[as.character(hvg_table$Gene),]
```


### Dimension reduction - to project data into low-dimensional embedding

Dimension reduction can be used to visualize basic structure of the data or to reduce the number of features prior downstream analysis such as clustering. In contrast to the feature selection which extracts the most informative features, dimension reduction techniques projects the data into a new low-dimensional space that preserves the structure of the data. To reduce dataset dimensionality one can use previously mentioned PCA or more novel techniques such as tSNE or UMAP (all implemented in **scater** package). Note that projections should be applied to normalized and log-transformed count matrices. PCA is a deterministic approach while tSNE and UMAP are stochastic - for reproducibility of tSNE and UMAP projections one has to set the seed for generating random variables.

```{r}
assay(sce_sub, "logcounts") <- log2(cpm(sce_sub)+1)
plotPCA(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
set.seed(100)
plotTSNE(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))
plotUMAP(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))

```

## Clustering

### Clusterize cells using SC3 package

To cluster cells we can use **SC3** package. **SC3** is a consensus clustering method for single-cell RNA-seq data. It first calculates distances between the cells using three metrics: Euclidean, Pearson, and Spearman. On each of the obtained distance matrix two transformations are applied: PCA and Laplacian graph. Then K-Means clustering technique is used to cluster the transformed distance matrices subject to the first d eigenvectors. In result, several individual clusterings are obtained which are further combined into a single consensus clustering using Cluster-based Similarity Partitioning Algorithm (CSPA).
In this example we will use available cell annotation in the clustering inference of **SC3**. In the real life applications we usually do not know the exact number populations in the sample. Note that clustering cells with **SC3** may take a substantial amount of time. We will visualize results on the PCA projected space.

![ ](images/sc3.png)

```{r}
k_input=length(unique(sce$level1class))
rowData(sce)$feature_symbol <- rownames(sce)
sce <- sc3(sce, ks = k_input, gene_filter = FALSE, biology = FALSE)
table(colData(sce)[paste0("sc3_",k_input, "_clusters")][[1]])

assay(sce, "logcounts") <- log2(cpm(sce)+1)
plotTSNE(sce, colour_by=paste0("sc3_",k_input, "_clusters"), run_args=c(exprs_values = "logcounts"))
```

#### Evaluate clustering based on annotation (using ARI Index)

If the cell annotation is available, one can measure the effectiveness of the clustering output using Adjusted Rand Index (ARI). ARI index measures similarity between the partition obtained from clustering and the partition obtained from dataset annotation. The values of the ARI range can be negative if the agreement of the partitions is worse then the agreement expected by chance, or between 0 and 1 for clustering better then chance. ARI close to 1 indicate high accuracy of the method in detecting annotated cell populations.

```{r}
adjustedRandIndex(colData(sce)[paste0("sc3_",k_input, "_clusters")][[1]],sce$level1class)
```

### Exercise
How the clustering performance would change if you would estimate the number of cell populations in **SC3** using *sc3_estimate_k* function? Is the number of estimated clusters close to the annotated number of cell populations ?

## Additional plots

You can use *colour_by* argument to color points in any of the above projections by a selected marker gene or any information stored in the annotation. Additional violin plots allow you to i.e. visualize the expression of the selected gene across all the cells from a given cell type. For the illustrative purpose, we used *Gad1* gene taken from the literature, which is a marker for interneuron cell type in the cortex tissue.

```{r}
set.seed(100)
plotTSNE(sce, colour_by="Gad1", run_args=c(exprs_values = "logcounts"))
plotExpression(sce, "Gad1", x = "level1class", colour_by = "level1class", exprs_values = "logcounts")
plotExpression(sce, "Gad1", x = paste0("sc3_",k_input, "_clusters")[[1]], colour_by = paste0("sc3_",k_input, "_clusters")[[1]], exprs_values = "logcounts")
```

# Complete data analysis using Seurat package

**Seurat** is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. All procedures are based on *Seurat* object which can store the information about counts and dataset annotation. Note that when creating new *Seurat* object one can perform cell and gene filterings based on specified thresholds. Here we set both filterings to 0, as we will input already quality controlled and filtered counts.

Note that in **Seurat** versions <3.1.1 slot *counts* is relaced by *raw.data*, slot *min.features* is replaced by *min.genes*, and instead of *seur@assays$RNA@counts* user access the data by calling *seur@data*.

### Create Seurat object

```{r}
seur <- CreateSeuratObject(counts = counts(sce), meta.data = data.frame(colData(sce)), min.cells = 0, min.features = 0)
seur
seur@assays$RNA@counts[1:4,1:4]
```

**Seurat** also provides function to normalize data by a scaling factor followed by log-transformation.

```{r}
seur <- NormalizeData(seur, normalization.method = "LogNormalize")
seur@assays$RNA@data[1:4,1:4]
```

### Extract highly variable genes

For extracting most informative genes one can use *FindVariableGenes* function. When using this function one can change the selection method (*selection.method*). For the illustrative purpose, we will use *vst* method which fits a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess). Then standardizes the gene expression values using the observed mean and expected variance (given by the fitted line). Gene variance is then calculated on the standardized values after clipping to a maximum (*clip.max*). For more details see help of *FindVariableGenes* function.

Note that in **Seurat** versions <3.1.1 function *FindVariableFeatures* is relaced by *FindVariableGenes* and there is no function *VariableFeaturePlot*.

```{r}
seur <- FindVariableFeatures(object = seur, selection.method="vst", nfeatures = 2000)
top10 <- head(VariableFeatures(seur), 10)
plot1 <- VariableFeaturePlot(seur)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2
```

### Scale the data

For centering and scaling dataset you can use *ScaleData* function. Setting *do.center=TRUE* will center the expression for each gene by subtracting the average expression for that gene. Setting *do.scale=TRUE* will scale the expression for each gene by dividing the centered gene expression levels by their standard deviations (if *do.center=TRUE*), and by their root mean square, otherwise.

```{r}
seur <- ScaleData(object = seur, do.center = TRUE, do.scale = TRUE)
seur@assays$RNA@scale.data[1:4,1:4]
```

### Reduce dataset dimension and clusterize cells

In this step, we will perform PCA dimension reduction, build nearest neighbor graph (on the Euclidean distances between any pair of cells in the PCA space) and clusterize cells using modularity optimization technique (Louvain algorithm) that seeks densily connected modules (cell clusters) in a graph structure. Note that *FindClusters* do not require from the user to input true number of cell populations. However the clustering performance strongly depends on the *resolution* parameter that controls the size of graph communities, hence the size of the clusters (for more details see the description of *FindClusters* function).

```{r echo=T, results='hide'}
seur <- RunPCA(object = seur, pc.genes=seur@var.genes, ndims.print = 1:2, nfeatures.print = 5) #Reducing dimension with PCA
seur <- FindNeighbors(seur) #Computing nearest neighbor graph
seur <- FindClusters(object = seur, resolution = 0.1, algorithm = 1) #Clustering cells with modularity detection algorithm on graph
```
```{r}
table(Idents(seur))
```

### Compare clusterization with annotated cell types

We can use UMAP dimension reduction to compare the **Seurat** clustering and annotated cell populations.

```{r}
seur <- RunUMAP(seur, dims = 1:10)
DimPlot(seur, group.by = "ident", reduction.use = "umap")
DimPlot(seur, group.by = "level1class", reduction.use = "umap")
```

### Compute Adjusted Rand Index

Finally, we will verify the accuracy of the clustering using ARI index.

```{r}
adjustedRandIndex(Idents(seur), seur@meta.data$level1class)
```

### Exercise
How the clustering would change if you would manipulate the *resolution* parameter?

# Complete data analysis using Monocle package

### Create CellDataSet object

Here we will use **Monocle** to perform a complete analysis on the example dataset. **Monocle** uses its own data object - *CellDataSet*, which can store the information about the experiment (expression matrix and metadata). Note that when creating new *CellDataSet* object the data should not be normalized - the method will normalize matrix internally when performing downstream analysis steps. It is suggested to set *expressionFamily=negbinomial.size()* when working with UMI or raw read counts to define the underlying distribution of the dataset. For more details see: http://cole-trapnell-lab.github.io/monocle-release/docs/#installing-monocle.

```{r}
rowData(sce)$gene_short_name <- rownames(sce)
pd <- new("AnnotatedDataFrame", data = data.frame(colData(sce))) #cell info
fd <- new("AnnotatedDataFrame", data = data.frame(rowData(sce))) #gene info
cds <- newCellDataSet(cellData = counts(sce), phenoData = pd, featureData = fd, expressionFamily=VGAM::negbinomial.size())
cds
```

### Extract highly variable genes

In the first step, we will call *estimateDispersions* function to compute a smooth curve describing the relationship between the variance of the genes and the mean expression across all the cells. *dispersionTable* retrieves table of computed mean, empirical dispersion and fitted dispersion values. From this table we can subset top variable genes by thresholding on the mean and empirical dispersion. Black dots on the mean/dispersion figure marks the selected features.

```{r}
cds <- BiocGenerics::estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
disp <- dispersionTable(cds)
head(disp)
hvg <- subset(disp, mean_expression >= 0.5 & dispersion_empirical >= 0.1)
cds <- setOrderingFilter(cds, ordering_genes = hvg$gene_id)
plot_ordering_genes(cds)
cds <- cds[hvg$gene_id,]
```

### Reduce dataset dimension and clusterize cells

Before performing clustering, we will reduce the feature selected matrix down to two dimensions using tSNE. Note that *Monocle3* also provides UMAP as *reduction_method*.

```{r message=FALSE}
cds <- reduceDimension(cds, max_components = 2, reduction_method = 'tSNE', verbose = T)
```

Unsupervised clustering in **Monocle** can be done using one of two techniques, *densityPeak* or *louvain*. *densityPeak* partitions cells based on the cell local density and their nearest distance. It aims to group cells with a higher local density and distance smaller than a fixed threshold. The number of clusters in this technique is automatically estimated through the inference. On the contrary, *louvain* approach is based on a community detection algorithm that seeks densely connected modules in a graph built on scRNAseq dataset. Although it is possible to fix the number of nearest neighbors when creating the graph, the number of clusters cannot be directly controlled. For the illustrative purpose, we will clusterize cells with *densityPeak* algorithm.

```{r}
cds <- clusterCells(cds, num_clusters = NULL, method = "densityPeak")
table(cds$Cluster)
```

### Compare clusterization with annotated cell types

We can plot now the projected data colored by the clustering output and the available cell annotation. Note that plot will depend on the *reduction_method* you used previously (either tSNE or UMAP projected data will be plotted).

```{r}
p1 <- plot_cell_clusters(cds, color_by = 'Cluster')
p2 <- plot_cell_clusters(cds, color_by = 'level1class')
p1
p2
```

### Compute Adjusted Rand Index

Finally, we will verify the accuracy of the clustering using ARI index.

```{r}
adjustedRandIndex(cds$Cluster,cds$level1class)
```

### Exercise

Would you improve the performance of the method by setting *method = "louvain"* ? What is the difference between the number of estimated clusters in comparison to *densityPeak* approach?

# Package versions

```{r}
sessionInfo()
```