Skip to content

Commit f8bdf8b

Browse files
committed
oops this one is final CP
1 parent 8cfb217 commit f8bdf8b

File tree

1 file changed

+62
-49
lines changed

1 file changed

+62
-49
lines changed

day2_Rnotebooks/clusterProfiler.Rmd

Lines changed: 62 additions & 49 deletions
Original file line numberDiff line numberDiff line change
@@ -36,8 +36,17 @@ head(data)
3636
```
3737

3838

39-
This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest). The R object type for `gseKEGG` needs to be a 'vector'. Unfortunately, this detail is under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG`, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times!
39+
This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest) to make our ranked gene list for GSEA.
4040

41+
What class of R object does our ranked gene list need to be in?
42+
43+
```{r help gseKEGG}
44+
?gseKEGG
45+
```
46+
47+
We can see `geneList = order ranked geneList`. Not informative!
48+
49+
Unfortunately, the required type of object is detailed under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG`, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times!
4150

4251

4352
```{r extract ranked list}
@@ -54,7 +63,7 @@ tail(ranked)
5463

5564
# 2. Check `gseKEGG` function arguments and requirements
5665

57-
Let's start by reviewing the function arguments. Once you run the below code chunk, the user guide for the function will open in the `Help` pane of RStudio (lower right)
66+
Review the parameters:
5867

5968
```{r help gsekegg}
6069
?gseKEGG
@@ -65,12 +74,12 @@ Most of those defaults look suitable to start.
6574

6675
We have human so the default `organism = "hsa"` argument is correct. If you were working with a species other than human, you first need to obtain your organism code. You can derive this from [KEGG Organisms](https://www.genome.jp/kegg/tables/br08606.html) or using the `clusterProfiler` function `search_kegg_organism`.
6776

68-
Pick your favourite species and search for the KEGG organism code by editing the variable 'organism' then executing the code chunk:
77+
Pick your favourite species and search for the KEGG organism code by editing the variable 'fave' then executing the code chunk:
6978

70-
```{r}
79+
```{r check kegg orgs}
7180
fave <- "horse"
7281
73-
# search by common_name or scientific_name
82+
# search by common_name or scientific_name or kegg.code
7483
search_kegg_organism(fave, by = "common_name")
7584
```
7685

@@ -90,25 +99,25 @@ Since our input data does not match any of the valid namespaces, we need to conv
9099

91100
Check the usage for the `bitr` function:
92101

93-
```{r}
102+
```{r help bitr}
94103
?bitr
95104
```
96105

97-
We need to understand what the valid `fromType` and `toType` values are, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within ``clusterProfiler` may not be possible for non-model species and you would need to seek a different method.
106+
We need to understand what the valid `fromType` and `toType` values are, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within `clusterProfiler` may not be possible for non-model species and you would need to seek a different method.
98107

99108
We have already loaded the `org.Hs.eg.db` annotation library. We can use this to search `keytypes`:
100109

101-
```{r}
110+
```{r check org db keytypes for huamn}
102111
keytypes(org.Hs.eg.db)
103112
```
104113

105114
Converting to ENTREZ will be compatible with `kegg`. So our `fromType` is `ENSEMBL` and `toType` is `ENTREZID`. Note that these are case sensitive!
106115

107116
If we were to run `enrichGO` or `gseaGO` that require a Bioconductor Org.db package, we would not need to do a conversion, as both of the gene ID types within our input data (`ENSEMBL` and `SYMBOL`) are natively supported.
108117

109-
The below code performs the reformatting and handles duplicates by keeping only the first occurrence of duplicate Entrez IDs within the input. Note that this is not ideal and for a real experiment you should print out a list of duplicates, carefully review these and choose which and how to retain based on your biological context.
118+
Gene ID conversion often results in duplicates. The below code performs the reformatting and handles duplicates by keeping only the first occurrence of duplicate Entrez IDs within the input. Note that this is not ideal and for a real experiment you should print out a list of duplicates, carefully review these and choose which and how to retain based on your biological context.
110119

111-
First, convert map the ENSEMBL gene IDs from our 'ranked' list to ENTREZ gene IDs:
120+
First, convert the ENSEMBL gene IDs from our 'ranked' list to ENTREZ gene IDs:
112121

113122
```{r convert gene IDs}
114123
# Convert the Ensembl IDs to Entrez IDs, dropping NAs
@@ -146,6 +155,7 @@ head(ranked_entrez)
146155
tail(ranked_entrez)
147156
148157
```
158+
149159
Looks good! Now we are ready to enrich!
150160

151161
# 4. Run GSEA over KEGG database
@@ -155,19 +165,19 @@ Now run GSEA over the KEGG database with `gseKEGG` function, setting a `seed` va
155165

156166
```{r gseKEGG}
157167
gsea_kegg <- gseKEGG(
158-
ranked_entrez,
159-
organism = "hsa",
160-
keyType = "kegg",
161-
exponent = 1,
162-
minGSSize = 10,
163-
maxGSSize = 500,
164-
eps = 1e-10,
165-
pvalueCutoff = 0.05,
166-
pAdjustMethod = "BH",
167-
verbose = TRUE,
168-
use_internal_data = FALSE,
169-
seed = 123,
170-
by = "fgsea"
168+
ranked_entrez, # ranked gene list vector object
169+
organism = "hsa", # species
170+
keyType = "kegg", # gene ID type/namespace
171+
exponent = 1, # weight of each step
172+
minGSSize = 10, # minimum gene set size
173+
maxGSSize = 500, # maximum gene set size
174+
eps = 1e-10, # sets the boundary for calculating the p value
175+
pvalueCutoff = 0.05, # adjusted P value cutoff for enriched terms
176+
pAdjustMethod = "BH", # multiple testing correction with BH method
177+
verbose = TRUE, # print output as it runs
178+
use_internal_data = FALSE, # use latest online KEGG data not local data
179+
seed = 123, # set seed for random number generation, for reproducibility
180+
by = "fgsea" # GSEA algorithm
171181
)
172182
173183
@@ -177,7 +187,7 @@ We have 3 warnings, one about ties in the ranked list, which we could resolve ma
177187

178188
The second warning may be resolved by following the suggestion to set `nPermSimple = 10000`. How frustrating that this parameter is not described within the `gseKEGG` help menu, nor the `clusterProfiler` PDF at all!
179189

180-
Let's rerun following both suggestions, to use permutations (`nPermSimple`) and set `eps = 0`. We will keep the rest of the arguments the same as before.
190+
Let's rerun following both suggestions, to use permutations (`nPermSimple = 1000`) and set `eps = 0`. We will keep the rest of the arguments the same as before.
181191

182192
```{r gseKEGG perms}
183193
gsea_kegg <- gseKEGG(
@@ -187,17 +197,18 @@ gsea_kegg <- gseKEGG(
187197
exponent = 1,
188198
minGSSize = 10,
189199
maxGSSize = 500,
190-
eps = 0,
200+
eps = 0, # changed to zero
191201
pvalueCutoff = 0.05,
192202
pAdjustMethod = "BH",
193203
verbose = TRUE,
194204
use_internal_data = FALSE,
195205
seed = 123,
196206
by = "fgsea",
197-
nPermSimple = 10000
207+
nPermSimple = 10000 # added permutations
198208
)
199209
200210
```
211+
201212
Great, those last 2 warnings have resolved and we only have the expected one about tied rankings.
202213

203214

@@ -232,7 +243,7 @@ You can view the table in the `editor` pane by clicking on it from the `Files` p
232243

233244
One of the key advantages of using R over web tools is flexibility with visualisations. Next we will produce a range of plot types from the `enrichplot` package, developed by the same authors as `clusterProfiler`.
234245

235-
In this package, some of the plots can be used to plot ORA or GSEA, and others only for one type or the other.
246+
In this package, some of the plots can be used to plot ORA or GSEA results, and others only for one type or the other.
236247

237248
To determine which plot functions are compatible with which FEA results type, you can check the help menu. If you see `## S4 method for signature 'enrichResult'` this plot is compatible with ORA results object. If you see `## S4 method for signature 'gseaResult'` this plot is compatible with GSEA results object.
238249

@@ -253,7 +264,7 @@ And review the help menu for `ridgeplot` which can only plot GSEA results:
253264

254265
Unfortunately, `enrichplot` is only compatible with the enrichment results from the packages produced by this development team, although the desire to use `enrichplot` with the output of other tools is widespread. The R package `multienrichjam` has a function `enrichDF2enrichResult` that converts ORA dataframe results from other FEA tools to the format required for `enrichplot`.
255266

256-
`multienrichjam` has a lot of dependencies and has not been installed on these VMs so we will not be performing this today. However, this functionality and flexibility is pretty cool, so if you wanted to install this on your own computer outside the workshop, below is the code for installing :-)
267+
`multienrichjam` has a lot of dependencies and has not been installed on these VMs so we will not be performing this today. However, this functionality and flexibility is pretty cool, so if you wanted to install this on your own computer outside the workshop, below is the code for installing :-)
257268

258269

259270
```{r install multienrichjam}
@@ -265,6 +276,8 @@ Unfortunately, `enrichplot` is only compatible with the enrichment results from
265276
#remotes::install_github("jmw86069/multienrichjam", dependencies=TRUE)
266277
#library(multienrichjam)
267278
279+
# check function help:
280+
#?enrichDF2enrichResult
268281
```
269282

270283

@@ -276,7 +289,7 @@ We can change the number of top terms that are plotted with the `showCategory` p
276289

277290
```{r GSEA KEGG dotplot }
278291
279-
clusterProfiler::dotplot(
292+
enrichplot::dotplot(
280293
gsea_kegg,
281294
x = "geneRatio",
282295
color = "p.adjust",
@@ -294,18 +307,16 @@ clusterProfiler::dotplot(
294307

295308
The upset plot shows the pattern of shared genes among the enriched terms (since a gene can belong to multiple terms or pathways).
296309

297-
For ORA, upsetplot will calculate the overlaps among different gene sets.
298-
299-
For GSEA, it will plot the fold change distributions of different categories (e.g. unique to pathway, overlaps among different pathways).
310+
For GSEA, the plot includess fold change distributions on the Y axis.
300311

301312
`n` controls the number of terms plotted - here we have restricted to 10 for ease of viewing. Plot view is quite small within a notebook, but if you wanted to plot more terms, you could print to an A4 size output file for enhanced clarity.
302313

303314

304315

305-
```{r GSEA KEGG upset plot}
316+
```{r GSEA KEGG upset plot, fig.width = 10 }
306317
307318
# n = the number of top enrichments to plot
308-
enrichplot::upsetplot(gsea_kegg, n = 10)
319+
enrichplot::upsetplot(gsea_kegg, n = 15)
309320
310321
```
311322

@@ -319,21 +330,21 @@ It is required to run the function `pairwise_termsim` before producing the plot.
319330

320331
When running `pairwise_termsim`, we don't need to save the output to a new object, because the information is added to the object and does not change any other attributes.
321332

333+
Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour.
334+
322335

323336
```{r GSEA KEGG emapplot}
324337
# calculate pairwise similarities between the enriched terms
325338
gsea_kegg <- enrichplot::pairwise_termsim(gsea_kegg)
326339
327340
# plot
328-
enrichplot::emapplot(gsea_kegg, showCategory = 15)
341+
enrichplot::emapplot(gsea_kegg, showCategory = 25)
329342
330343
```
331344

332345
### treeplot
333346

334-
The treeplot provides the same information as the emapplot but in a different visualisation.
335-
336-
Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour.
347+
The treeplot provides the same information as the emapplot but presented in a different way.
337348

338349
Terms that share more genes or biological functions will be closer together in the tree structure. Clades are colour coded and 'cluster tags' assigned. You can control the number of words in the tag (default is 4). The user guide describes the argument `nWords` however running that will throw an error (it says 'warning' but it is fatal so to me that's an error!):
339350

@@ -353,13 +364,15 @@ enrichplot::treeplot(gsea_kegg, showCategory = 15, color = "p.adjust", cluster.p
353364

354365
### cnetplot
355366

356-
The cnetplot is helpful to understanding which genes are involved in the enriched terms, details that are not available in the plots generated so far. It depicts the linkages of genes and terms as a network.
367+
The cnetplot depicts the linkages of genes and terms as a network. This is helpful to understanding which genes are involved in the enriched terms, ading a level of detail not offered by the plots we have generated so far.
357368

358369
For GSEA, where all genes (not just DEGs) are used, only the 'core' enriched genes are used to create the network plot. These are the 'leading edge genes', those genes up to the point where the Enrichment Score (ES) gets maximised from the base zero. In other words, the subset of genes that are most strongly associated with a specific term.
359370

360371
There are a few parameters to play around with here to get a readable plot.
361372

362-
Try changing the number of terms plotted (`showCategory`) and the `node_label` which controls whether labels are put on terms, genes (`item`), or `all`.
373+
Try changing the number of terms plotted (`showCategory`).
374+
375+
Try changing the `node_label` which controls whether labels are put on terms (`category`), genes (`item`), or both (`all`).
363376

364377
Since the information that is attempted to be plotted is complex, having a large number of terms and attempting to label everything won't look very informative! If you want to plot both gene IDs and term names, you will need to plot a small number of categories.
365378

@@ -372,15 +385,15 @@ Since the information that is attempted to be plotted is complex, having a large
372385
373386
enrichplot::cnetplot(gsea_kegg, showCategory = 8, cex_label_gene = 0.5, cex_label_category = 0.8, colorEdge = TRUE, node_label = "category")
374387
375-
376-
377-
378388
```
379389

390+
With 8 terms and no gene IDs, we don't get any more detail than from the treeplot or emapplot.
380391

381-
You can also plot the interaction between specific terms. This is helpful to obtain gene ID level resolution for term interactions of interest.
392+
This plot is really useful for showing a detailed look at a small number of terms. Just plotting the top 3 terms may not look very helpful (try plotting the top 3!)
382393

383-
The 3 terms listed below are for 'IL-17 signaling pathway', 'Viral protein interaction with cytokine and cytokine receptor' and 'Chemokine signaling pathway' which are top 10 enrichments with a relationship of shared genes, evident from the plot above.
394+
A useful application is plotting the interaction between specific terms. This is helpful to obtain gene ID level resolution for term interactions of interest.
395+
396+
The 3 terms listed below are for 'IL-17 signaling pathway', 'Viral protein interaction with cytokine and cytokine receptor' and 'Chemokine signaling pathway' which are among the top 10 enrichments with a relationship of shared genes, evident from the plot above.
384397

385398
Run the code below or select a handful of terms of your choosing from the results table we printed earlier. We need the KEGG ID (column 1).
386399

@@ -410,10 +423,10 @@ Now we can clearly see the gene IDs for this network.
410423
Like the upset plot, the heatplot shows shared genes across enriched terms. When there are a lot of genes, it can be poorly readable, especially within the notebook. This function does not include a parameter to restrict to 'core' genes like the `cnetplot` or `ridgeplot` so even when plotting a very small number of enriched terms, the X-axis is too cramped to be readable.
411424

412425

413-
414426
```{r heatplot GSEA KEGG }
415427
enrichplot::heatplot(gsea_kegg, showCategory = 3)
416428
```
429+
417430
The gseKEGG result contains a data column containing the core/leading edge genes:
418431

419432
```{r show GES result columns}
@@ -470,7 +483,7 @@ enrichplot::ridgeplot(gsea_kegg,
470483

471484
Another GSEA-specific plot, which shows the contribution to the normalised enrichment score (NES) of genes across the gene list.
472485

473-
The gseaplot can plot one term at a time.
486+
The gseaplot can plot one term at a time. You would run GSEA plots for key terms of interest in your results.
474487

475488
The term 'hsa04020' (Calcium signaling pathway) has a positive NES, an adjusted P value of 0.00046 and a set size of 279.
476489

@@ -490,7 +503,7 @@ enrichplot::gseaplot( gsea_kegg,
490503
491504
```
492505

493-
The term 'hsa04062' (Chemokine signaling pathway) has a negative NE, a highyl significant adjusted P value of 1.11E-06 and a set size of 129.
506+
The term 'hsa04062' (Chemokine signaling pathway) has a negative NE, a highly significant adjusted P value of 1.11E-06 and a set size of 129.
494507

495508
```{r gseaplot GSEA KEGG downreg, fig.width=10 }
496509
@@ -536,7 +549,7 @@ The `volplot` function within `enrichplot` does not support GSEA result objects,
536549
The volcano plot has the advantage of showing whether the genes in the enriched terms were predominantly from the top (upregulated) or bottom (downregulated) end of the list.
537550

538551

539-
```{r}
552+
```{r gsea volcano plot}
540553
ggplot2::ggplot(gsea_kegg@result, aes(x = enrichmentScore, y = -log10(p.adjust), color = p.adjust)) +
541554
geom_point(alpha = 0.7, size = 2) + # Adjust point size
542555
scale_color_gradient(low = "blue", high = "red") + # Color by p.adjust values

0 commit comments

Comments
 (0)