---
title: "Clustering"
author: "Ashley Sawle"
date: 'April 2022'
output:
  ioslides_presentation:
    widescreen: true
    smaller: true
    logo: Images/uniOfCamCrukLogos.png
    css: css/stylesheet.css
---

```{r include=FALSE}
library(tidyr)
library(dplyr)
```

## Single Cell RNAseq Analysis Workflow

```{r, echo=FALSE, out.width='70%', fig.align='center'}
knitr::include_graphics('Images/workflow2_clustering.png')
```

## Motivation

The data has been QC'd, normalized, and batch corrected.

We can now ask biological questions.

* unsupervised clustering: identification of groups of cells based on the
similarities of the transcriptomes without any prior knowledge of the labels,
usually using the PCA output

* *de novo* discovery and annotation of cell-types based on transcription
profiles

## Single Cell RNAseq Analysis Workflow

```{r, echo=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("../Images/Andrews2017_Fig1.png", auto_pdf = TRUE)
```

## Graph-based clustering {.columns-2 .smaller}

Nearest-Neighbour (NN) graph:

* cells as nodes
* their similarity as edges

In a NN graph two nodes (cells), say X and Y, are connected by an edge if:

* the distance between them is amongst the **k** smallest distances from X to
other cells, ‘**K**NN’

or

* the above, plus the distance between them is amongst the **k** smallest
distances from Y to other cells, ‘**S**NN’ (**shared**-NN).

Once edges have been defined, they can be weighted by various metrics.

<p class="forceBreak"></p>

```{r, include=FALSE}
require(igraph)
```
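As a rough sketch of the idea (using simulated coordinates standing in for the
PCA output, not the course data), a KNN graph can be built by hand with base R
and igraph:

```{r knn-graph-sketch}
library(igraph)

set.seed(42)
pca <- matrix(rnorm(200), ncol = 2)   # 100 cells, 2 toy 'PCA' dimensions
k <- 5
d <- as.matrix(dist(pca))             # all-pairs Euclidean distances

# for each cell, connect it to its k nearest neighbours (excluding itself)
edges <- do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
  cbind(i, order(d[i, ])[2:(k + 1)])
}))
knn <- simplify(graph_from_edgelist(edges, directed = FALSE))
```

In practice a dedicated function (e.g. from a single-cell toolkit) would be
used, but the principle is the same: the edge list comes straight from the k
smallest distances per cell.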

Example with different numbers of neighbours:

```{r, echo=FALSE, out.height='100%', out.width='100%', fig.align="center"}
knitr::include_graphics("../Images/bioCellGenGraphDeng2.png", auto_pdf = TRUE)
```

Pros

* fast and memory efficient (no distance matrix for all pairs of cells)
* no assumptions on the shape of the clusters or the distribution of cells
within each cluster
* no need to specify the number of clusters in advance

Cons

* loss of information beyond neighbouring cells, which can affect community
detection in regions with many cells.

## Modularity

Several methods to detect clusters (‘communities’) in networks rely on the
‘modularity’ metric.

Modularity measures how separated clusters are from each other.

Modularity is a ratio of the observed weights of the edges within a cluster
to the expected weights if the edges were randomly distributed between all
nodes.

For the whole graph, the closer to 1 the better.

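A minimal illustration of the modularity score with igraph, on a toy graph
with two planted communities (not the course data):

```{r modularity-sketch}
library(igraph)

set.seed(1)
# two communities of 20 nodes each: dense within, a single edge between
g <- sample_islands(islands.n = 2, islands.size = 20,
                    islands.pin = 0.4, n.inter = 1)
cl <- cluster_louvain(g)
modularity(cl)   # closer to 1 = better-separated communities
```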
Network example:

```{r, echo=FALSE, out.height='60%', out.width='60%', fig.align="center"}
knitr::include_graphics("Images/clusGraphExample.png", auto_pdf = TRUE)
```

## Louvain {.columns-2 .smaller}

([Traag et al, From Louvain to Leiden: guaranteeing well-connected communities](https://www.nature.com/articles/s41598-019-41695-z))

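All three community-detection methods are available in igraph; a small sketch
on a toy graph (not the course data; `cluster_leiden()` needs a reasonably
recent igraph release):

```{r community-methods-sketch}
library(igraph)

set.seed(1)
# three planted communities of 15 nodes each
g  <- sample_islands(3, 15, 0.4, 2)
wt <- cluster_walktrap(g)
lv <- cluster_louvain(g)
ld <- cluster_leiden(g, objective_function = "modularity")
# number of communities found by each method
sapply(list(walktrap = wt, louvain = lv, leiden = ld), length)
```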
## Separatedness - silhouette width

Congruence of clusters may be assessed by computing the silhouette width for
each cell.

For each cell, calculate the average distance to all other cells in its
cluster and the average distance to all cells in the nearest other cluster.
The cell's silhouette width is the difference between these, divided by the
maximum of the two values.

Cells with a large silhouette width are strongly related to cells in their own
cluster, while cells with a negative silhouette width are more closely related
to cells in other clusters.

Good cluster separation is indicated by clusters whose cells have large
silhouette values.

## Separatedness - silhouette width

```{r, echo=FALSE, out.width = '100%', fig.align="center"}
knitr::include_graphics("Images/Silhouette.png")
```
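The silhouette computation described above is implemented in the `cluster`
package; a toy sketch on simulated data (two well-separated groups, not the
course data):

```{r silhouette-sketch}
library(cluster)

set.seed(2)
# two well-separated toy groups of 20 cells each
x   <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
cl  <- kmeans(x, centers = 2)$cluster
sil <- silhouette(cl, dist(x))
summary(sil)$avg.width   # large average width = good separation
```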

## Cluster-wise modularity to assess cluster quality

Clusters that are well separated mostly comprise intra-cluster edges and
harbour a high modularity score on the diagonal and low scores off the
diagonal.

Two poorly separated clusters will share edges and the pair will have a high
off-diagonal score.

```{r, echo=FALSE, out.width = '90%', fig.align="center"}
knitr::include_graphics("Images/ClusterwiseModularity.png")
```
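A rough by-hand sketch of this observed-versus-expected ratio on a toy graph
(dedicated functions such as `bluster::pairwiseModularity()` do this properly
on real data):

```{r clusterwise-modularity-sketch}
library(igraph)

set.seed(3)
g  <- sample_islands(3, 15, 0.4, 2)
cl <- as.integer(membership(cluster_louvain(g)))

adj  <- as_adjacency_matrix(g, sparse = FALSE)
obs  <- rowsum(t(rowsum(adj, cl)), cl)   # observed edge weight between cluster pairs
K    <- rowsum(degree(g), cl)[, 1]       # total degree per cluster
expd <- outer(K, K) / (2 * ecount(g))    # expected weight if edges were random
round(obs / expd, 2)  # high diagonal, low off-diagonal = well-separated clusters
```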