
Commit 9ba86c5

Updated slide deck - removed HC and Kmeans
1 parent 6820d29 commit 9ba86c5

File tree: 4 files changed, +102 −425 lines changed

Images/ClusterwiseModularity.png (197 KB)

Images/Silhouette.png (214 KB)

Slides/08_ClusteringSlides.Rmd

Lines changed: 60 additions & 203 deletions
@@ -1,229 +1,65 @@
 ---
-title: "Introduction to single-cell RNA-seq analysis - Clustering"
-author: "Stephane Ballereau"
-date: "February 2022"
+title: "Clustering"
+author: "Ashley Sawle"
+date: 'April 2022'
 output:
   ioslides_presentation:
-    logo: ../Images/uniOfCamCrukLogos.png
-    smaller: yes
-    widescreen: yes
+    widescreen: true
+    smaller: true
+    logo: Images/uniOfCamCrukLogos.png
     css: css/stylesheet.css
-  slidy_presentation: default
-  beamer_presentation: default
 ---
-
-<!--
-logo: Images/CRUK_CC_web.jpg
--->

 ```{r include=FALSE}
 library(tidyr)
 library(dplyr)
-#source("SOME_SCRIPT.R")
 ```

-## Outline
-
-* Motivation
-
-* Initial methods
-
-* Graph-based methods
-    * walktrap
-    * louvain
-    * leiden
-
 ## Single Cell RNAseq Analysis Workflow

 ```{r, echo=FALSE, out.width='70%', fig.align='center'}
-knitr::include_graphics('../Images/workflow2_clustering.png')
+knitr::include_graphics('Images/workflow2_clustering.png')
 ```

 ## Motivation

-The data has been QCed and normalized, confounders removed, noise limited,
-dimensionality reduced.
+The data has been QC'd and normalized, and batch corrected.

 We can now ask biological questions.

-* *de novo* discovery and annotation of cell-types based on transcription profiles
-
-* unsupervised clustering:
+* unsupervised clustering: identification of groups of cells based on the
+  similarities of the transcriptomes without any prior knowledge of the labels,
+  usually using the PCA output

-    * identification of groups of cells
-    * based on the similarities of the transcriptomes
-    * without any prior knowledge of the labels
-    * usually using the PCA output
+* *de novo* discovery and annotation of cell-types based on transcription
+  profiles

 ## Single Cell RNAseq Analysis Workflow

 ```{r, echo=FALSE, out.width='100%', fig.align='center'}
 knitr::include_graphics("../Images/Andrews2017_Fig1.png", auto_pdf = TRUE)
 ```

-## Motivation
-
-We will introduce three widely used clustering methods:
-
-* hierarchical
-* k-means
-* graph-based
-
-The first two were developed first and are faster for small data sets.
-
-The third is more recent and better suited for scRNA-seq, especially large data sets.
-
-All three identify non-overlapping clusters.
-
-## Hierarchical clustering
-
-Hierarchical clustering builds:
-
-* a hierarchy of clusters
-* yielding a dendrogram (i.e. tree)
-* that groups together cells with similar expression patterns
-* across the chosen genes.
-
-There are two types of strategies:
-
-* Agglomerative (bottom-up):
-    * each observation (cell) starts in its own cluster,
-    * pairs of clusters are merged as one moves up the hierarchy.
-* Divisive (top-down):
-    * all observations (cells) start in one cluster,
-    * splits are performed recursively as one moves down the hierarchy.
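The agglomerative (bottom-up) strategy described on the removed slide can be sketched in a few lines. A minimal, dependency-free illustration on toy 1-D "expression" values with single-linkage merging; `agglomerate` is a hypothetical helper for illustration only, not the course's R/Bioconductor code:

```python
# Agglomerative clustering sketch: every cell starts in its own cluster;
# at each step the two closest clusters (single linkage = smallest pairwise
# distance between their members) are merged, until k clusters remain.

def agglomerate(points, k):
    clusters = [[p] for p in points]  # one cluster per cell (bottom-up start)
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair of clusters
    return clusters

print(agglomerate([1.0, 1.2, 1.1, 8.0, 8.3], k=2))
```

Recording the merge order instead of only the final partition would yield the dendrogram the slide mentions.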
-
-<!-- ## Hierarchical
-
-```{r clust-hierarch-raw, echo=FALSE, out.height = '30%', fig.align='center'}
-knitr::include_graphics("../Images/bioCellGenHierar1.png")
-```
--->
-
-<!-- ## Hierarchical
-
-```{r clust-hierarch-dendr, echo=FALSE, out.height = '30%', fig.align='center'}
-knitr::include_graphics("../Images/bioCellGenHierar2.png")
-```
--->
-
-## Hierarchical clustering {.columns-2 .smaller}
-
-The raw data:
-
-```{r, echo=FALSE, out.width = '40%', fig.align='center'}
-knitr::include_graphics("../Images/bioCellGenHierar1.png")
-```
-
-<p class="forceBreak"></p>
-
-The hierarchical clustering dendrogram:
-
-```{r, echo=FALSE, out.width = '60%', fig.align='center'}
-knitr::include_graphics("../Images/bioCellGenHierar2.png")
-```
-
-## Hierarchical clustering {.columns-2 .smaller}
-
-Example: the Caron data set:
-
-```{r clust-hierarch-dendr-caron, echo=FALSE, out.height = '80%', out.width = '80%', fig.align='center'}
-knitr::include_graphics("../Images/clustHierar3.png")
-```
-
-<p class="forceBreak"></p>
-
-Pros:
-
-* deterministic method
-* returns partitions at all levels along the dendrogram
-
-Cons:
-
-* computationally expensive in time and memory
-* that increase proportionally
-* to the square of the number of data points
-
-## k-means clustering {.columns-2 .smaller}
-
-Goal: partition cells into k different clusters.
-
-In an iterative manner,
-
-* cluster centers are defined
-* each cell is assigned to its nearest cluster
-
-Aim:
-
-* minimise within-cluster variation
-* maximise between-cluster variation
-
-<p class="forceBreak"></p>
-
-Pros:
-
-* fast
-
-Cons:
-
-* assumes a pre-determined number of clusters
-* sensitive to outliers
-* tends to define equally-sized clusters
-
-## k-means clustering
-
-```{r, echo=FALSE, out.width = '100%'}
-knitr::include_graphics("../Images/bioCellGenKmean.png", auto_pdf = TRUE)
-```
-
-Set of steps to repeat:
-
-* randomly select k data points to serve as initial cluster centers,
-* for each centers, 1) compute distance to centroids, 2) assign to closest cluster,
-* calculate the mean of each cluster (the ‘mean’ in ‘k-mean’) to define its centroid,
-* for each point compute the distance to these means to choose the closest,
-* repeat until the distance between centroids and data points is minimal (ie clusters do not change)
-or the maximum number of iterations is reached,
-compute the total variation within clusters
-
-<!--
-**=> assign new centroids and repeat steps above**
--->
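The assign/update loop from the removed k-means slide can be written out directly. A toy 1-D sketch (Python, hypothetical `kmeans` helper; illustrative only, not the course's R code):

```python
# k-means sketch: iterate (1) assign each point to its nearest centre,
# (2) recompute each centre as the mean of its cluster ("mean" in k-means),
# stopping when the clusters no longer change.

def kmeans(points, centres, max_iter=100):
    for _ in range(max_iter):
        labels = [min(range(len(centres)), key=lambda c: abs(p - centres[c]))
                  for p in points]                     # assignment step
        new = [sum(p for p, l in zip(points, labels) if l == c)
               / max(1, sum(1 for l in labels if l == c))
               for c in range(len(centres))]          # update step
        if new == centres:       # converged: assignments are stable
            break
        centres = new
    return labels, centres

labels, centres = kmeans([1.0, 1.2, 1.1, 8.0, 8.3], centres=[0.0, 5.0])
print(labels)  # -> [0, 0, 0, 1, 1]
```

The deliberately poor starting centres still converge here; in practice the result depends on initialisation, which is one reason the slide lists sensitivity to the chosen k and to outliers among the cons.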
-
-## Separatedness
-
-Congruence of clusters may be assessed by computing the sillhouette for each cell.
-
-The larger the value the closer the cell to cells in its cluster than to cells in other clusters.
-
-Cells closer to cells in other clusters have a negative value.
-
-Good cluster separation is indicated by clusters whose cells have large silhouette values.
-
-```{r, echo=FALSE, out.width = '100%'}
-knitr::include_graphics("../Images/clustKmeansBoth.png", auto_pdf = TRUE)
-```
-
-
 ## Graph-based clustering {.columns-2 .smaller}

 Nearest-Neighbour (NN) graph:

 * cells as nodes
 * their similarity as edges
-
-Aim: identify ‘communities’ of cells within the network

-In a NN graph two nodes (cells), say X and Y, are connected by an edge:
+In a NN graph two nodes (cells), say X and Y, are connected by an edge if:
+
+* the distance between them is amongst the **k** smallest distances from X to
+  other cells, ‘**K**NN’

-if the distance between them is amongst:
+or

-* the **k** smallest distances from X to other cells, ‘**K**NN’)
-* and from Y to other cells for **shared**-NN, **S**NN.
+* the above plus the distance between them is amongst the **k** smallest
+  distances from Y to other cells: **shared**-NN (**S**NN).

-<p class="forceBreak"></p>
+Once edges have been defined, they can be weighted by various metrics.

-Clusters are identified using metrics related to the number of neighbours (‘connections’) to find groups of highly interconnected cells.
+<p class="forceBreak"></p>
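The two edge rules can be made concrete with a small sketch (Python on toy 1-D "cells"; `knn` is a hypothetical helper, and the mutual filter stands in for the shared-NN requirement described above — this is not igraph or the course's R code):

```python
# KNN edges: X -> Y if Y is among the k nearest neighbours of X.
# The "mutual" set keeps an edge only when each cell is also among the
# other's k nearest neighbours, mimicking the slide's SNN rule.

def knn(points, k):
    edges = set()
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: abs(points[j] - p))
        for j in others[:k]:
            edges.add((i, j))    # directed edge: j is a NN of i
    return edges

points = [0.0, 0.1, 0.2, 5.0, 5.1]   # two tight groups of cells
directed = knn(points, k=2)
mutual = {(i, j) for (i, j) in directed if (j, i) in directed}
print(sorted(mutual))
```

Note how the asymmetric edges reaching across the two groups (e.g. from cell 3 to cell 2) survive in the plain KNN graph but are dropped by the mutual requirement, which is why SNN-style graphs tend to separate communities more cleanly.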

 ```{r, include=FALSE}
 require(igraph)
@@ -248,7 +84,7 @@ plot.igraph(

 Example with different numbers of neighbours:

-```{r, echo=FALSE, out.height='60%', out.width = '60%', fig.align="center"}
+```{r, echo=FALSE, out.height='100%', out.width = '100%', fig.align="center"}
 knitr::include_graphics("../Images/bioCellGenGraphDeng2.png", auto_pdf = TRUE)
 ```

@@ -257,22 +93,25 @@ knitr::include_graphics("../Images/bioCellGenGraphDeng2.png", auto_pdf = TRUE)
 Pros

 * fast and memory efficient (no distance matrix for all pairs of cells)
-* no assumptions on the shape of the clusters or the distribution of cells within each cluster
+* no assumptions on the shape of the clusters or the distribution of cells
+  within each cluster
 * no need to specify a number of clusters to identify

 Cons

-* loss of information beyond neighboring cells, which can affect community detection in regions with many cells.
+* loss of information beyond neighboring cells, which can affect community
+  detection in regions with many cells.

 ## Modularity

-Several methods to detect clusters (‘communities’) in networks rely on the ‘modulatrity’ metric.
-
-For a given partition of cells into clusters,
+Several methods to detect clusters (‘communities’) in networks rely on the
+‘modularity’ metric.

-modularity measures how separated clusters are from each other,
+Modularity measures how separated clusters are from each other.

-based on the difference between the observed and expected weight of edges between nodes.
+Modularity is a ratio between the observed weights of the edges
+within a cluster versus the expected weights if the edges were randomly
+distributed between all nodes.

 For the whole graph, the closer to 1 the better.
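The observed-versus-expected comparison can be made concrete with a small sketch (Python, hypothetical `modularity` helper; one common unweighted form, Q = Σ over clusters of [edges inside / m − (cluster degree / 2m)²], rather than the course's igraph implementation):

```python
# Modularity sketch: for each cluster, compare the observed fraction of
# edges inside it with the fraction expected if edges were rewired at
# random while preserving node degrees.

def modularity(edges, communities):
    m = len(edges)
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    q = 0.0
    for nodes in communities:
        inside = sum(1 for a, b in edges if a in nodes and b in nodes)
        total_deg = sum(deg[n] for n in nodes)
        q += inside / m - (total_deg / (2 * m)) ** 2
    return q

# Two triangles joined by a single bridge edge: well-separated communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(round(modularity(edges, [{0, 1, 2}, {3, 4, 5}]), 3))  # -> 0.357
```

Putting every node into one community gives Q = 0, illustrating why a higher score indicates better-separated clusters.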

@@ -297,7 +136,7 @@ Node similarity is measured based on these walks.
 Network example:

 ```{r, echo=FALSE, out.height='60%', out.width = '60%', fig.align="center"}
-knitr::include_graphics("../Images/clusGraphExample.png", auto_pdf = TRUE)
+knitr::include_graphics("Images/clusGraphExample.png", auto_pdf = TRUE)
 ```

 ## Louvain {.columns-2 .smaller}
@@ -353,18 +192,36 @@ knitr::include_graphics("../Images/leiden_Fig3_noLegend.png", auto_pdf = TRUE)

 ([Traag et al, From Louvain to Leiden: guaranteeing well-connected communities](https://www.nature.com/articles/s41598-019-41695-z))

-## Cluster-wise modularity to assess clusters quality
+## Separatedness - silhouette width

-Clusters that are well separated mostly comprise intra-cluster edges and harbour a high modularity score on the diagonal and low scores off that diagonal.
+Congruence of clusters may be assessed by computing the silhouette width for
+each cell.

-Two poorly separated clusters will share edges and the pair will have a high score.
+For each cell in the cluster, calculate the average distance to all other
+cells in the cluster and the average distance to all cells not in the cluster.
+The cell's silhouette width is the difference between these divided by the
+maximum of the two values.

-```{r, echo=FALSE, out.height='100%', out.width = '100%', fig.align="center"}
-knitr::include_graphics("../Images/clustLouvainBoth.png", auto_pdf = TRUE)
+Cells with a large silhouette width are strongly related to cells in the cluster;
+cells with a negative silhouette width are more closely related to other
+clusters.
+
+Good cluster separation is indicated by clusters whose cells have large
+silhouette values.
+
+## Separatedness - silhouette width
+
+```{r, echo=FALSE, out.width = '100%', fig.align="center"}
+knitr::include_graphics("Images/Silhouette.png")
 ```
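The per-cell calculation added on this slide can be written out directly. A sketch (Python, toy 1-D values, hypothetical `silhouette` helper) following the slide's simplified definition — average distance to all cells outside the cluster, rather than to the nearest other cluster as in the classic silhouette:

```python
# Silhouette width per the slide: a = mean distance to the other cells in
# the same cluster, b = mean distance to all cells outside it,
# s = (b - a) / max(a, b).

def silhouette(points, labels, i):
    a_dists = [abs(points[i] - p) for p, l in zip(points, labels)
               if l == labels[i]]
    a = sum(a_dists) / (len(a_dists) - 1)   # exclude the cell's own zero distance
    b_dists = [abs(points[i] - p) for p, l in zip(points, labels)
               if l != labels[i]]
    b = sum(b_dists) / len(b_dists)
    return (b - a) / max(a, b)

points = [0.0, 0.2, 0.4, 5.0, 5.2]
labels = [0, 0, 0, 1, 1]
print(round(silhouette(points, labels, 0), 2))  # well-clustered cell -> 0.94
```

Relabelling a cell into the distant cluster drives its silhouette width negative, matching the interpretation given above.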

-## Recap
+## Cluster-wise modularity to assess clusters quality

-* hierarchical and k-means methods are fast for small data sets
+Clusters that are well separated mostly comprise intra-cluster edges and harbour
+a high modularity score on the diagonal and low scores off that diagonal.

-* graph-based methods are better suited for large data sets and cluster detection
+Two poorly separated clusters will share edges and the pair will have a high score.
+
+```{r, echo=FALSE, out.width = '90%', fig.align="center"}
+knitr::include_graphics("Images/ClusterwiseModularity.png")
+```

0 commit comments