excelerate-scRNAseq/session-integration/Data_Integration.Rmd at master · NBISweden/excelerate-scRNAseq · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
---
title: "Data Integration"
output: github_document
---

Created by: Ahmed Mahfouz

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Overview


In this tutorial we will look at different ways of integrating multiple single cell RNA-seq datasets. We will explore two different methods to correct for batch effects across datasets. We will also look at a quantitative measure to assess the quality of the integrated data.


## Datasets

For this tutorial we will use 3 different human pancreatic islet cell datasets from four technologies, CelSeq (GSE81076) CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). The raw data matrix and associated metadata are available [here](https://www.dropbox.com/s/1zxbn92y5du9pu0/pancreas_v3_files.tar.gz?dl=1).


Load required packages:

```{r packages}
suppressMessages(require(Seurat))
suppressMessages(require(ggplot2))
suppressMessages(require(cowplot))
suppressMessages(require(scater))
suppressMessages(require(scran))
suppressMessages(require(BiocParallel))
suppressMessages(require(BiocNeighbors))
```

## Seurat (anchors and CCA)

First we will use the data integration method presented in [Comprehensive Integration of Single Cell Data](https://www.biorxiv.org/content/10.1101/460147v1).

### Data preprocessing

Load in expression matrix and metadata. The metadata file contains the technology (`tech` column) and cell type annotations (`cell type` column) for each cell in the four datasets.

```{r load}
pancreas.data <- readRDS(file = "~/scrna-seq2019/day1/3-integration/pancreas_expression_matrix.rds")
metadata <- readRDS(file = "~/scrna-seq2019/day1/3-integration/pancreas_metadata.rds")
```

Create a Seurat object with all datasets.

```{r create_Seurat}
pancreas <- CreateSeuratObject(pancreas.data, meta.data = metadata)
```

Let's first look at the datasets before applying any batch correction. We perform standard preprocessing (log-normalization), and identify variable features based on a variance stabilizing transformation (`"vst"`). Next, we scale the integrated data, run PCA, and visualize the results with UMAP. The integrated datasets cluster by cell type, instead of by technology

```{r norm_clust_raw}
# Normalize and find variable features
pancreas <- NormalizeData(pancreas, verbose = FALSE)
pancreas <- FindVariableFeatures(pancreas, selection.method = "vst", nfeatures = 2000, verbose = FALSE)

# Run the standard workflow for visualization and clustering
pancreas <- ScaleData(pancreas, verbose = FALSE)
pancreas <- RunPCA(pancreas, npcs = 30, verbose = FALSE)
pancreas <- RunUMAP(pancreas, reduction = "pca", dims = 1:30)
p1 <- DimPlot(pancreas, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
    NoLegend()
plot_grid(p1, p2)
```

We split the combined object into a list, with each dataset as an element. We perform standard preprocessing (log-normalization), and identify variable features individually for each dataset based on a variance stabilizing transformation (`"vst"`).

```{r norm_indiv}
pancreas.list <- SplitObject(pancreas, split.by = "tech")

for (i in 1:length(pancreas.list)) {
    pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
    pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst", nfeatures = 2000,
        verbose = FALSE)
}
```


### Integration of 4 pancreatic islet cell datasets

We identify anchors using the `FindIntegrationAnchors` function, which takes a list of Seurat objects as input.

```{r anchors}
reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2", "fluidigmc1")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
```

We then pass these anchors to the `IntegrateData` function, which returns a Seurat object.

```{r integrate}
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
```

After running `IntegrateData`, the `Seurat` object will contain a new `Assay` with the integrated (or ‘batch-corrected’) expression matrix. Note that the original (uncorrected values) are still stored in the object in the “RNA” assay, so you can switch back and forth.

We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the results with UMAP. The integrated datasets cluster by cell type, instead of by technology.

```{r clust_integrated}
# switch to integrated assay. The variable features of this assay are automatically set during
# IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"

# Run the standard workflow for visualization and clustering
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, repel = TRUE) +
    NoLegend()
plot_grid(p1, p2)
```


## Mutual Nearest Neighbor (MNN)

An alternative approach to integrate single cell RNA-seq data is to use the Mutual Nearest Neighbor (MNN) batch-correction method by [Haghverdi et al.](https://www.nature.com/articles/nbt.4091).

You can either create a `SingleCellExperiment` (SCE) object directly from the count matrices, or convert directly from Seurat to SCE.

```{r sce}
celseq.data <- as.SingleCellExperiment(pancreas.list$celseq)
celseq2.data <- as.SingleCellExperiment(pancreas.list$celseq2)
fluidigmc1.data <- as.SingleCellExperiment(pancreas.list$fluidigmc1)
smartseq2.data <- as.SingleCellExperiment(pancreas.list$smartseq2)
```

### Data preprocessing

Find common genes and reduce each dataset to those common genes

```{r common_genes}
keep_genes <- Reduce(intersect, list(rownames(celseq.data),rownames(celseq2.data),
                                     rownames(fluidigmc1.data),rownames(smartseq2.data)))
celseq.data <- celseq.data[match(keep_genes, rownames(celseq.data)), ]
celseq2.data <- celseq2.data[match(keep_genes, rownames(celseq2.data)), ]
fluidigmc1.data <- fluidigmc1.data[match(keep_genes, rownames(fluidigmc1.data)), ]
smartseq2.data <- smartseq2.data[match(keep_genes, rownames(smartseq2.data)), ]
```

Calculate quality control characteristics using `calculateQCMetrics()` to determine low quality cells by finding outliers with uncharacteristically low total counts or total number of features (genes) detected.

```{r sce_qc}
## celseq.data
celseq.data <- calculateQCMetrics(celseq.data)
low_lib_celseq.data <- isOutlier(celseq.data$log10_total_counts, type="lower", nmad=3)
low_genes_celseq.data <- isOutlier(celseq.data$log10_total_features_by_counts, type="lower", nmad=3)
celseq.data <- celseq.data[, !(low_lib_celseq.data | low_genes_celseq.data)]
## celseq2.data
celseq2.data <- calculateQCMetrics(celseq2.data)
low_lib_celseq2.data <- isOutlier(celseq2.data$log10_total_counts, type="lower", nmad=3)
low_genes_celseq2.data <- isOutlier(celseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
celseq2.data <- celseq2.data[, !(low_lib_celseq2.data | low_genes_celseq2.data)]
## fluidigmc1.data
fluidigmc1.data <- calculateQCMetrics(fluidigmc1.data)
low_lib_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_counts, type="lower", nmad=3)
low_genes_fluidigmc1.data <- isOutlier(fluidigmc1.data$log10_total_features_by_counts, type="lower", nmad=3)
fluidigmc1.data <- fluidigmc1.data[, !(low_lib_fluidigmc1.data | low_genes_fluidigmc1.data)]
## smartseq2.data
smartseq2.data <- calculateQCMetrics(smartseq2.data)
low_lib_smartseq2.data <- isOutlier(smartseq2.data$log10_total_counts, type="lower", nmad=3)
low_genes_smartseq2.data <- isOutlier(smartseq2.data$log10_total_features_by_counts, type="lower", nmad=3)
smartseq2.data <- smartseq2.data[, !(low_lib_smartseq2.data | low_genes_smartseq2.data)]
```

Normalize the data by clalculating size using the `computeSumFactors()` and `normalize()` functions from `scran`

```{r sce_normalize}
# Compute sizefactors
celseq.data <- computeSumFactors(celseq.data)
celseq2.data <- computeSumFactors(celseq2.data)
fluidigmc1.data <- computeSumFactors(fluidigmc1.data)
smartseq2.data <- computeSumFactors(smartseq2.data)

# Normalize
celseq.data <- normalize(celseq.data)
celseq2.data <- normalize(celseq2.data)
fluidigmc1.data <- normalize(fluidigmc1.data)
smartseq2.data <- normalize(smartseq2.data)
```

Feature selection: we use the `trendVar()` and `decomposeVar()` functions to calculate the per gene variance and separate it into technical versus biological components.

```{r sce_fs}
## celseq.data
fit_celseq.data <- trendVar(celseq.data, use.spikes=FALSE)
dec_celseq.data <- decomposeVar(celseq.data, fit_celseq.data)
dec_celseq.data$Symbol_TENx <- rowData(celseq.data)$Symbol_TENx
dec_celseq.data <- dec_celseq.data[order(dec_celseq.data$bio, decreasing = TRUE), ]

## celseq2.data
fit_celseq2.data <- trendVar(celseq2.data, use.spikes=FALSE)
dec_celseq2.data <- decomposeVar(celseq2.data, fit_celseq2.data)
dec_celseq2.data$Symbol_TENx <- rowData(celseq2.data)$Symbol_TENx
dec_celseq2.data <- dec_celseq2.data[order(dec_celseq2.data$bio, decreasing = TRUE), ]

## fluidigmc1.data
fit_fluidigmc1.data <- trendVar(fluidigmc1.data, use.spikes=FALSE)
dec_fluidigmc1.data <- decomposeVar(fluidigmc1.data, fit_fluidigmc1.data)
dec_fluidigmc1.data$Symbol_TENx <- rowData(fluidigmc1.data)$Symbol_TENx
dec_fluidigmc1.data <- dec_fluidigmc1.data[order(dec_fluidigmc1.data$bio, decreasing = TRUE), ]

## smartseq2.data
fit_smartseq2.data <- trendVar(smartseq2.data, use.spikes=FALSE)
dec_smartseq2.data <- decomposeVar(smartseq2.data, fit_smartseq2.data)
dec_smartseq2.data$Symbol_TENx <- rowData(smartseq2.data)$Symbol_TENx
dec_smartseq2.data <- dec_smartseq2.data[order(dec_smartseq2.data$bio, decreasing = TRUE), ]

# select the most informative genes that are shared across all datasets:
universe <- Reduce(intersect, list(rownames(dec_celseq.data),rownames(dec_celseq2.data),
                                   rownames(dec_fluidigmc1.data),rownames(dec_smartseq2.data)))
mean.bio <- (dec_celseq.data[universe,"bio"] + dec_celseq2.data[universe,"bio"] +
               dec_fluidigmc1.data[universe,"bio"] + dec_smartseq2.data[universe,"bio"])/4
hvg_genes <- universe[mean.bio > 0]
```

Combine the datasets into a unified SingleCellExperiment.

```{r sce_combined}
# total raw counts
counts_pancreas <- cbind(counts(celseq.data), counts(celseq2.data),
                         counts(fluidigmc1.data), counts(smartseq2.data))

# total normalized counts (with multibatch normalization)
logcounts_pancreas <- cbind(logcounts(celseq.data), logcounts(celseq2.data),
                            logcounts(fluidigmc1.data), logcounts(smartseq2.data))

# sce object of the combined data
sce <- SingleCellExperiment(
    assays = list(counts = counts_pancreas, logcounts = logcounts_pancreas),
    rowData = rowData(celseq.data), # same as rowData(pbmc4k)
    colData = rbind(colData(celseq.data), colData(celseq2.data),
                    colData(fluidigmc1.data), colData(smartseq2.data))
)

# store the hvg_genes from the prior section into the sce object’s metadata slot via:
metadata(sce)$hvg_genes <- hvg_genes
```

Let's first look at the datasets before applying MNN batch correction.

```{r sce_visualize, warning=FALSE}
sce <- runPCA(sce,
              ncomponents = 20,
              feature_set = hvg_genes,
              method = "irlba")

names(reducedDims(sce)) <- "PCA_naive"

p1 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "tech") +
    ggtitle("PCA Without batch correction")
p2 <- plotReducedDim(sce, use_dimred = "PCA_naive", colour_by = "celltype") +
    ggtitle("PCA Without batch correction")
plot_grid(p1, p2)
```

### Integrating datasets using MNN

The mutual nearest neighbors (MNN) approach within the scran package utilizes a novel approach to adjust for batch effects. The `fastMNN()` function returns a representation of the data with reduced dimensionality, which can be used in a similar fashion to other lower-dimensional representations such as PCA. In particular, this representation can be used for downstream methods such as clustering.

Prior to running `fastMNN()`, we rescale each batch to adjust for differences in sequencing depth between batches. The `multiBatchNorm()` function from the `scran` package recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment objects. The previously computed size factors only remove biases between cells within a single batch. This improves the quality of the correction step by removing one aspect of the technical differences between batches.

```{r multiBatchNorm, warning=FALSE}
rescaled <- multiBatchNorm(celseq.data, celseq2.data, fluidigmc1.data, smartseq2.data)
celseq.data_rescaled <- rescaled[[1]]
celseq2.data_rescaled <- rescaled[[2]]
fluidigmc1.data_rescaled <- rescaled[[3]]
smartseq2.data_rescaled <- rescaled[[4]]
```

We now run fastMNN. The `BNPARAM` can be used to specify the specific nearest neighbors method to use from the BiocNeighbors package. Here we make use of the `[Annoy library](https://github.com/spotify/annoy)` via the `BiocNeighbors::AnnoyParam()` argument. We save the reduced-dimension MNN representation into the `reducedDims` slot of our `sce` object.

*Optional:* We can enable parallelization via the `BPPARAM` argument, enabling multicore processing.

```{r MNN}
mnn_out <- fastMNN(celseq.data_rescaled,
                   celseq2.data_rescaled,
                   fluidigmc1.data_rescaled,
                   smartseq2.data_rescaled,
                   subset.row = metadata(sce)$hvg_genes,
                   k = 20, d = 50, approximate = TRUE,
                   # BPPARAM = BiocParallel::MulticoreParam(8),
                   BNPARAM = BiocNeighbors::AnnoyParam())

reducedDim(sce, "MNN") <- mnn_out$correct
```

**NOTE:** fastMNN() does not produce a batch-corrected expression matrix. Thus, the result from fastMNN() should solely be treated as a reduced dimensionality representation, suitable for direct plotting, TSNE/UMAP, clustering, and trajectory analysis that relies on such results.


Plot the batch-corrected data.

```{r MNN_plot, warning=FALSE}
p1 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "tech") + ggtitle("MNN Ouput Reduced Dimensions")
p2 <- plotReducedDim(sce, use_dimred = "MNN", colour_by = "celltype") + ggtitle("MNN Ouput Reduced Dimensions")
plot_grid(p1, p2)
```


### Session info

```{r}
sessionInfo()
```