You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: day2_Rnotebooks/clusterProfiler.Rmd
+62-49Lines changed: 62 additions & 49 deletions
Original file line number
Diff line number
Diff line change
@@ -36,8 +36,17 @@ head(data)
36
36
```
37
37
38
38
39
-
This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest). The R object type for `gseKEGG` needs to be a 'vector'. Unfortunately, this detail is under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG`, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times!
39
+
This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest)to make our ranked gene list for GSEA.
40
40
41
+
What class of R object does our ranked gene list need to be in?
42
+
43
+
```{r help gseKEGG}
44
+
?gseKEGG
45
+
```
46
+
47
+
We can see `geneList = order ranked geneList`. Not informative!
48
+
49
+
Unfortunately, the required type of object is detailed under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG`, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times!
41
50
42
51
43
52
```{r extract ranked list}
@@ -54,7 +63,7 @@ tail(ranked)
54
63
55
64
# 2. Check `gseKEGG` function arguments and requirements
56
65
57
-
Let's start by reviewing the function arguments. Once you run the below code chunk, the user guide for the function will open in the `Help` pane of RStudio (lower right)
66
+
Review the parameters:
58
67
59
68
```{r help gsekegg}
60
69
?gseKEGG
@@ -65,12 +74,12 @@ Most of those defaults look suitable to start.
65
74
66
75
We have human so the default `organism = "hsa"` argument is correct. If you were working with a species other than human, you first need to obtain your organism code. You can derive this from [KEGG Organisms](https://www.genome.jp/kegg/tables/br08606.html) or using the `clusterProfiler` function `search_kegg_organism`.
67
76
68
-
Pick your favourite species and search for the KEGG organism code by editing the variable 'organism' then executing the code chunk:
77
+
Pick your favourite species and search for the KEGG organism code by editing the variable 'fave' then executing the code chunk:
69
78
70
-
```{r}
79
+
```{r check kegg orgs}
71
80
fave <- "horse"
72
81
73
-
# search by common_name or scientific_name
82
+
# search by common_name or scientific_name or kegg.code
74
83
search_kegg_organism(fave, by = "common_name")
75
84
```
76
85
@@ -90,25 +99,25 @@ Since our input data does not match any of the valid namespaces, we need to conv
90
99
91
100
Check the usage for the `bitr` function:
92
101
93
-
```{r}
102
+
```{r help bitr}
94
103
?bitr
95
104
```
96
105
97
-
We need to understand what the valid `fromType` and `toType` values are, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within ``clusterProfiler` may not be possible for non-model species and you would need to seek a different method.
106
+
We need to understand what the valid `fromType` and `toType` values are, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within `clusterProfiler` may not be possible for non-model species and you would need to seek a different method.
98
107
99
108
We have already loaded the `org.Hs.eg.db` annotation library. We can use this to search `keytypes`:
100
109
101
-
```{r}
110
+
```{r check org db keytypes for huamn}
102
111
keytypes(org.Hs.eg.db)
103
112
```
104
113
105
114
Converting to ENTREZ will be compatible with `kegg`. So our `fromType` is `ENSEMBL` and `toType` is `ENTREZID`. Note that these are case sensitive!
106
115
107
116
If we were to run `enrichGO` or `gseaGO` that require a Bioconductor Org.db package, we would not need to do a conversion, as both of the gene ID types within our input data (`ENSEMBL` and `SYMBOL`) are natively supported.
108
117
109
-
The below code performs the reformatting and handles duplicates by keeping only the first occurrence of duplicate Entrez IDs within the input. Note that this is not ideal and for a real experiment you should print out a list of duplicates, carefully review these and choose which and how to retain based on your biological context.
118
+
Gene ID conversion often results in duplicates. The below code performs the reformatting and handles duplicates by keeping only the first occurrence of duplicate Entrez IDs within the input. Note that this is not ideal and for a real experiment you should print out a list of duplicates, carefully review these and choose which and how to retain based on your biological context.
110
119
111
-
First, convert map the ENSEMBL gene IDs from our 'ranked' list to ENTREZ gene IDs:
120
+
First, convert the ENSEMBL gene IDs from our 'ranked' list to ENTREZ gene IDs:
112
121
113
122
```{r convert gene IDs}
114
123
# Convert the Ensembl IDs to Entrez IDs, dropping NAs
@@ -146,6 +155,7 @@ head(ranked_entrez)
146
155
tail(ranked_entrez)
147
156
148
157
```
158
+
149
159
Looks good! Now we are ready to enrich!
150
160
151
161
# 4. Run GSEA over KEGG database
@@ -155,19 +165,19 @@ Now run GSEA over the KEGG database with `gseKEGG` function, setting a `seed` va
155
165
156
166
```{r gseKEGG}
157
167
gsea_kegg <- gseKEGG(
158
-
ranked_entrez,
159
-
organism = "hsa",
160
-
keyType = "kegg",
161
-
exponent = 1,
162
-
minGSSize = 10,
163
-
maxGSSize = 500,
164
-
eps = 1e-10,
165
-
pvalueCutoff = 0.05,
166
-
pAdjustMethod = "BH",
167
-
verbose = TRUE,
168
-
use_internal_data = FALSE,
169
-
seed = 123,
170
-
by = "fgsea"
168
+
ranked_entrez, # ranked gene list vector object
169
+
organism = "hsa", # species
170
+
keyType = "kegg", # gene ID type/namespace
171
+
exponent = 1, # weight of each step
172
+
minGSSize = 10, # minimum gene set size
173
+
maxGSSize = 500, # maximum gene set size
174
+
eps = 1e-10, # sets the boundary for calculating the p value
175
+
pvalueCutoff = 0.05, # adjusted P value cutoff for enriched terms
176
+
pAdjustMethod = "BH", # multiple testing correction with BH method
177
+
verbose = TRUE, # print output as it runs
178
+
use_internal_data = FALSE, # use latest online KEGG data not local data
179
+
seed = 123, # set seed for random number generation, for reproducibility
180
+
by = "fgsea" # GSEA algorithm
171
181
)
172
182
173
183
@@ -177,7 +187,7 @@ We have 3 warnings, one about ties in the ranked list, which we could resolve ma
177
187
178
188
The second warning may be resolved by following the suggestion to set `nPermSimple = 10000`. How frustrating that this parameter is not described within the `gseKEGG` help menu, nor the `clusterProfiler` PDF at all!
179
189
180
-
Let's rerun following both suggestions, to use permutations (`nPermSimple`) and set `eps = 0`. We will keep the rest of the arguments the same as before.
190
+
Let's rerun following both suggestions, to use permutations (`nPermSimple = 1000`) and set `eps = 0`. We will keep the rest of the arguments the same as before.
181
191
182
192
```{r gseKEGG perms}
183
193
gsea_kegg <- gseKEGG(
@@ -187,17 +197,18 @@ gsea_kegg <- gseKEGG(
187
197
exponent = 1,
188
198
minGSSize = 10,
189
199
maxGSSize = 500,
190
-
eps = 0,
200
+
eps = 0, # changed to zero
191
201
pvalueCutoff = 0.05,
192
202
pAdjustMethod = "BH",
193
203
verbose = TRUE,
194
204
use_internal_data = FALSE,
195
205
seed = 123,
196
206
by = "fgsea",
197
-
nPermSimple = 10000
207
+
nPermSimple = 10000 # added permutations
198
208
)
199
209
200
210
```
211
+
201
212
Great, those last 2 warnings have resolved and we only have the expected one about tied rankings.
202
213
203
214
@@ -232,7 +243,7 @@ You can view the table in the `editor` pane by clicking on it from the `Files` p
232
243
233
244
One of the key advantages of using R over web tools is flexibility with visualisations. Next we will produce a range of plot types from the `enrichplot` package, developed by the same authors as `clusterProfiler`.
234
245
235
-
In this package, some of the plots can be used to plot ORA or GSEA, and others only for one type or the other.
246
+
In this package, some of the plots can be used to plot ORA or GSEA results, and others only for one type or the other.
236
247
237
248
To determine which plot functions are compatible with which FEA results type, you can check the help menu. If you see `## S4 method for signature 'enrichResult'` this plot is compatible with ORA results object. If you see `## S4 method for signature 'gseaResult'` this plot is compatible with GSEA results object.
238
249
@@ -253,7 +264,7 @@ And review the help menu for `ridgeplot` which can only plot GSEA results:
253
264
254
265
Unfortunately, `enrichplot` is only compatible with the enrichment results from the packages produced by this development team, although the desire to use `enrichplot` with the output of other tools is widespread. The R package `multienrichjam` has a function `enrichDF2enrichResult` that converts ORA dataframe results from other FEA tools to the format required for `enrichplot`.
255
266
256
-
`multienrichjam` has a lot of dependencies and has not been installed on these VMs so we will not be performing this today. However, this functionality and flexibility is pretty cool, so if you wanted to install this on your own computer outside the workshop, below is the code for installing :-)
267
+
`multienrichjam` has a lot of dependencies and has not been installed on these VMs so we will not be performing this today. However, this functionality and flexibility is pretty cool, so if you wanted to install this on your own computer outside the workshop, below is the code for installing :-)
257
268
258
269
259
270
```{r install multienrichjam}
@@ -265,6 +276,8 @@ Unfortunately, `enrichplot` is only compatible with the enrichment results from
@@ -276,7 +289,7 @@ We can change the number of top terms that are plotted with the `showCategory` p
276
289
277
290
```{r GSEA KEGG dotplot }
278
291
279
-
clusterProfiler::dotplot(
292
+
enrichplot::dotplot(
280
293
gsea_kegg,
281
294
x = "geneRatio",
282
295
color = "p.adjust",
@@ -294,18 +307,16 @@ clusterProfiler::dotplot(
294
307
295
308
The upset plot shows the pattern of shared genes among the enriched terms (since a gene can belong to multiple terms or pathways).
296
309
297
-
For ORA, upsetplot will calculate the overlaps among different gene sets.
298
-
299
-
For GSEA, it will plot the fold change distributions of different categories (e.g. unique to pathway, overlaps among different pathways).
310
+
For GSEA, the plot includess fold change distributions on the Y axis.
300
311
301
312
`n` controls the number of terms plotted - here we have restricted to 10 for ease of viewing. Plot view is quite small within a notebook, but if you wanted to plot more terms, you could print to an A4 size output file for enhanced clarity.
302
313
303
314
304
315
305
-
```{r GSEA KEGG upset plot}
316
+
```{r GSEA KEGG upset plot, fig.width = 10 }
306
317
307
318
# n = the number of top enrichments to plot
308
-
enrichplot::upsetplot(gsea_kegg, n = 10)
319
+
enrichplot::upsetplot(gsea_kegg, n = 15)
309
320
310
321
```
311
322
@@ -319,21 +330,21 @@ It is required to run the function `pairwise_termsim` before producing the plot.
319
330
320
331
When running `pairwise_termsim`, we don't need to save the output to a new object, because the information is added to the object and does not change any other attributes.
321
332
333
+
Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour.
334
+
322
335
323
336
```{r GSEA KEGG emapplot}
324
337
# calculate pairwise similarities between the enriched terms
The treeplot provides the same information as the emapplot but in a different visualisation.
335
-
336
-
Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour.
347
+
The treeplot provides the same information as the emapplot but presented in a different way.
337
348
338
349
Terms that share more genes or biological functions will be closer together in the tree structure. Clades are colour coded and 'cluster tags' assigned. You can control the number of words in the tag (default is 4). The user guide describes the argument `nWords` however running that will throw an error (it says 'warning' but it is fatal so to me that's an error!):
The cnetplot is helpful to understanding which genes are involved in the enriched terms, details that are not available in the plots generated so far. It depicts the linkages of genes and terms as a network.
367
+
The cnetplot depicts the linkages of genes and terms as a network. This is helpful to understanding which genes are involved in the enriched terms, ading a level of detail not offered by the plots we have generated so far.
357
368
358
369
For GSEA, where all genes (not just DEGs) are used, only the 'core' enriched genes are used to create the network plot. These are the 'leading edge genes', those genes up to the point where the Enrichment Score (ES) gets maximised from the base zero. In other words, the subset of genes that are most strongly associated with a specific term.
359
370
360
371
There are a few parameters to play around with here to get a readable plot.
361
372
362
-
Try changing the number of terms plotted (`showCategory`) and the `node_label` which controls whether labels are put on terms, genes (`item`), or `all`.
373
+
Try changing the number of terms plotted (`showCategory`).
374
+
375
+
Try changing the `node_label` which controls whether labels are put on terms (`category`), genes (`item`), or both (`all`).
363
376
364
377
Since the information that is attempted to be plotted is complex, having a large number of terms and attempting to label everything won't look very informative! If you want to plot both gene IDs and term names, you will need to plot a small number of categories.
365
378
@@ -372,15 +385,15 @@ Since the information that is attempted to be plotted is complex, having a large
With 8 terms and no gene IDs, we don't get any more detail than from the treeplot or emapplot.
380
391
381
-
You can also plot the interaction between specific terms. This is helpful to obtain gene ID level resolution for term interactions of interest.
392
+
This plot is really useful for showing a detailed look at a small number of terms. Just plotting the top 3 terms may not look very helpful (try plotting the top 3!)
382
393
383
-
The 3 terms listed below are for 'IL-17 signaling pathway', 'Viral protein interaction with cytokine and cytokine receptor' and 'Chemokine signaling pathway' which are top 10 enrichments with a relationship of shared genes, evident from the plot above.
394
+
A useful application is plotting the interaction between specific terms. This is helpful to obtain gene ID level resolution for term interactions of interest.
395
+
396
+
The 3 terms listed below are for 'IL-17 signaling pathway', 'Viral protein interaction with cytokine and cytokine receptor' and 'Chemokine signaling pathway' which are among the top 10 enrichments with a relationship of shared genes, evident from the plot above.
384
397
385
398
Run the code below or select a handful of terms of your choosing from the results table we printed earlier. We need the KEGG ID (column 1).
386
399
@@ -410,10 +423,10 @@ Now we can clearly see the gene IDs for this network.
410
423
Like the upset plot, the heatplot shows shared genes across enriched terms. When there are a lot of genes, it can be poorly readable, especially within the notebook. This function does not include a parameter to restrict to 'core' genes like the `cnetplot` or `ridgeplot` so even when plotting a very small number of enriched terms, the X-axis is too cramped to be readable.
411
424
412
425
413
-
414
426
```{r heatplot GSEA KEGG }
415
427
enrichplot::heatplot(gsea_kegg, showCategory = 3)
416
428
```
429
+
417
430
The gseKEGG result contains a data column containing the core/leading edge genes:
The term 'hsa04062' (Chemokine signaling pathway) has a negative NE, a highyl significant adjusted P value of 1.11E-06 and a set size of 129.
506
+
The term 'hsa04062' (Chemokine signaling pathway) has a negative NE, a highly significant adjusted P value of 1.11E-06 and a set size of 129.
494
507
495
508
```{r gseaplot GSEA KEGG downreg, fig.width=10 }
496
509
@@ -536,7 +549,7 @@ The `volplot` function within `enrichplot` does not support GSEA result objects,
536
549
The volcano plot has the advantage of showing whether the genes in the enriched terms were predominantly from the top (upregulated) or bottom (downregulated) end of the list.
537
550
538
551
539
-
```{r}
552
+
```{r gsea volcano plot}
540
553
ggplot2::ggplot(gsea_kegg@result, aes(x = enrichmentScore, y = -log10(p.adjust), color = p.adjust)) +
0 commit comments