Skip to content

Commit 00e387d

Browse files
authored
Merge pull request #4 from Sydney-Informatics-Hub/day2-edits
day 2 - minor clusterProfiler edits
2 parents c4b9a70 + 614884d commit 00e387d

File tree

1 file changed

+23
-17
lines changed

1 file changed

+23
-17
lines changed

day2_Rnotebooks/clusterProfiler.Rmd

Lines changed: 23 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ head(data)
3434
```
3535

3636

37-
This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change largest to smallest. The R object type for `gseKEGG` needs to be a 'vector'. Unfortunately, this detail is under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times!
37+
This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest). The R object type for `gseKEGG` needs to be a 'vector'. Unfortunately, this detail is under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG`, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times!
3838

3939

4040

@@ -43,6 +43,7 @@ This time, instead of filtering for DEGs, we will extract all genes, and sort th
4343
# Ranked gene list vector for GSEA
4444
ranked <- setNames(data$Log2FC[order(-data$Log2FC)], data$Gene.ID[order(-data$Log2FC)])
4545
46+
# Inspect the vector
4647
head(ranked)
4748
tail(ranked)
4849
@@ -60,26 +61,26 @@ Let's start by reviewing the function arguments. Once you run the below code chu
6061

6162
Most of those defaults look suitable to start.
6263

63-
We have human so the default "hsa" for argument `organism` is correct. If you were working with a species other than human, you first need to obtain your organism code. You can derive this from [KEGG Organisms](https://www.genome.jp/kegg/tables/br08606.html) or using the `clusterProfiler` function `search_kegg_organism`.
64+
We have human so the default `organism = "hsa"` argument is correct. If you were working with a species other than human, you first need to obtain your organism code. You can derive this from [KEGG Organisms](https://www.genome.jp/kegg/tables/br08606.html) or using the `clusterProfiler` function `search_kegg_organism`.
6465

6566

6667
Pick your favourite species and search for the KEGG organism code by editing the variable 'organism' then executing the code chunk:
6768

6869
```{r}
6970
fave <- "horse"
7071
71-
# search by commn_name or scientific_name
72+
# search by common_name or scientific_name
7273
search_kegg_organism(fave, by = "common_name")
7374
```
7475

7576

76-
Now back to the paramters - the defaults for P value correction and filtering, and gene set size limits are acceptable. We don't want to use internal data (ie, we want to search agaisntthe latest KEGG online) so `FALSE` is apt here, and we do want to use `fgsea` algorithm for analysis.
77+
Now back to the parameters - the defaults for P value correction and filtering, and gene set size limits are acceptable. We don't want to use internal data (i.e. we want to search against the latest KEGG online) so `FALSE` is apt here, and we do want to use the `fgsea` algorithm for analysis.
7778

78-
For the `seed` parameter, we should provide a value, to ensure results are the same each time the command. By setting a seed, you fix the sequence of random numbers generated within the GSEA algorithm.
79+
For the `seed` parameter, we should provide a value, to ensure results are the same each time the command is run. By setting a seed, you fix the sequence of random numbers generated within the GSEA algorithm.
7980

8081
We also need to check the `keyType` (gene namespace) against the input data that we have. From the help page, we can see the supported namespaces for the KEGG database are one of 'kegg', 'ncbi-geneid', 'ncib-proteinid' or 'uniprot'.
8182

82-
Going back to where we loaded the input data and ran `head` to view the first few rows, we can see our input has ENSEMBL gene IDs as well as official gene symbols. ENSEMBL gene IDs are generally preferable for bioinformatics analyses because they are more unique and stable compared to gene symbol.
83+
Going back to where we loaded the input data and ran `head` to view the first few rows, we can see our input has ENSEMBL gene IDs as well as official gene symbols. ENSEMBL gene IDs are generally preferable for bioinformatics analyses because they are more unique and stable compared to gene symbol.
8384

8485
Since our input data does not match any of the valid namespaces, we need to convert gene IDs! `clusterProfiler` has the `bitr` function to do this. `BiomaRt` is also a popular R package for this task.
8586

@@ -92,7 +93,7 @@ Check the usage for the `bitr` function:
9293
?bitr
9394
```
9495

95-
We need to understand what are the valid `fromType` and `toType`, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within `clusterProfiler may not be possible for non-model species and you would need to seek a different method.
96+
We need to understand what the valid `fromType` and `toType` values are, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within ``clusterProfiler` may not be possible for non-model species and you would need to seek a different method.
9697

9798
We have already loaded the `org.Hs.eg.db` annotation library. We can use this to search `keytypes`:
9899

@@ -116,11 +117,11 @@ converted_ids <- bitr(names(ranked),
116117
OrgDb = org.Hs.eg.db,
117118
drop = TRUE)
118119
```
119-
<1% failing ot map is pretty good.
120+
<1% failing to map is pretty good.
120121

121122
The 1:many mappings warning means that some of our gene IDs matched more than 1 ENTREZ ID. We need to ensure that the final list we provide to GSEA does not contain duplicates. This needs to happen at two stages: first of all, ensuring that each ENSEMBL ID is mapped to only one ENTREZ ID, and then once the final converted vector has been created, check it for duplicated ENTREZ IDs, which could occur when two different ENSEMBL IDs from our input map to the the same ENTREZ ID.
122123

123-
The below code does this by selecting the first of each set of duplicates. This is not ideal. In a real experiment, you should print out the duplicates and directly manage how to handle them by reviewing the gene IDs involved and deciding whether it is valid to select one ID over another, or at times you may choose to merge the values for duplcaie genes.
124+
The below code does this by selecting the first of each set of duplicates. This is not ideal. In a real experiment, you should print out the duplicates and directly manage how to handle them by reviewing the gene IDs involved and deciding whether it is valid to select one ID over another, or at times you may choose to merge the values for duplicate genes.
124125

125126
```{r filter converted ids }
126127
@@ -139,6 +140,7 @@ names(ranked_entrez) <- converted_ids$ENTREZID[match(names(ranked_entrez), conve
139140
# Remove duplicates from the vector by keeping only the first occurrence of each Entrez IDs (not ideal, see note above)
140141
ranked_entrez <- ranked_entrez[!duplicated(names(ranked_entrez))]
141142
143+
# Inspect the vector
142144
head(ranked_entrez)
143145
tail(ranked_entrez)
144146
@@ -170,11 +172,11 @@ gsea_kegg <- gseKEGG(
170172
171173
```
172174

173-
We have 3 warnings, one about ties in the ranked list, which we could resolve manually by reviewing the raw data, or just ignore as its only 0.01% of the list.
175+
We have 3 warnings, one about ties in the ranked list, which we could resolve manually by reviewing the raw data, or just ignore it as it is only 0.01% of the list.
174176

175-
The second warning may be resolved by following the suggestion to set `nPermSimple = 10000)`. How frustrating that this parameter is not described withint he `gseKEGG` help menu, nor the `clusterProfiler` PDF at all!
177+
The second warning may be resolved by following the suggestion to set `nPermSimple = 10000`. How frustrating that this parameter is not described within the `gseKEGG` help menu, nor the `clusterProfiler` PDF at all!
176178

177-
Let's rerun following both suggestions, to use permutations and set `eps` to zero.
179+
Let's rerun following both suggestions, to use permutations (`nPermSimple`) and set `eps = 0`. We will keep the rest of the arguments the same as before.
178180

179181
```{r gseKEGG perms}
180182
gsea_kegg <- gseKEGG(
@@ -203,14 +205,16 @@ Great, those last 2 warnings have resolved and we only have the expected one abo
203205

204206
## Tabular results
205207

206-
First let's preview the results. We can see 41 significant enrichments.
208+
First, let's preview the results.
207209

208210

209211
```{r preview gsea results}
210212
print(gsea_kegg)
211213
212214
```
213215

216+
We can see 41 significant enrichments.
217+
214218
Extract results to a TSV file. This will print the significant enrichments, sorted by adjusted P value.
215219

216220
```{r print GSEA results table }
@@ -287,10 +291,12 @@ The treeplot provides the same information as the emapplot but in a different vi
287291

288292
Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour.
289293

290-
Terms that share more genes or biological functions will be closer together in the tree structure. Clades are colour coded and 'cluster tags' assigned. You can control the number of words in the tag (default is 4). The user guide describes the argument `nWords` however running that will throw an error (it says 'warning' but its fatal so to me that's an eror!):
294+
Terms that share more genes or biological functions will be closer together in the tree structure. Clades are colour coded and 'cluster tags' assigned. You can control the number of words in the tag (default is 4). The user guide describes the argument `nWords` however running that will throw an error (it says 'warning' but its fatal so to me that's an error!):
291295

296+
```
292297
"Warning: Use 'cluster.params = list(label_words_n = your_value)' instead of 'nWords'.
293298
The nWords parameter will be removed in the next version."
299+
```
294300

295301
This plot also requires the pairwise similarity matrix calculation that emapplot does. Since we have already run it, it is hashed out in the code chunk below.
296302

@@ -307,7 +313,7 @@ enrichplot::treeplot(gsea_kegg, showCategory = 15, color = "p.adjust", cluster.p
307313

308314
The cnetplot is helpful to understanding which genes are involved in the enriched terms, details that are not available in the plots generated so far. It depicts the linkages of genes and terms as a network.
309315

310-
For GSEA, where all genes (not jsut DEGs) are used, only the 'core' enriched genes are used to create the network plot. These are the 'leading edge genes', those genes up to the point where the Enrichment Score (ES) gets maximised from the base zero. In other words, the subset of genes that are most strongly associated with a specific term.
316+
For GSEA, where all genes (not just DEGs) are used, only the 'core' enriched genes are used to create the network plot. These are the 'leading edge genes', those genes up to the point where the Enrichment Score (ES) gets maximised from the base zero. In other words, the subset of genes that are most strongly associated with a specific term.
311317

312318
There are a few parameters to play around with here to get a readable plot.
313319

@@ -334,7 +340,7 @@ You can also plot the interaction between specific terms. This is helpful to obt
334340

335341
The 3 terms listed below are for 'IL-17 signaling pathway', 'Viral protein interaction with cytokine and cytokine receptor' and 'Chemokine signaling pathway' which are top 10 enrichments with a relationship of shared genes, evident from the plot above.
336342

337-
Run the code below or select a handful of terms of your choosing from the results table we printed earlier. We need the KEGG ID (column 1). T
343+
Run the code below or select a handful of terms of your choosing from the results table we printed earlier. We need the KEGG ID (column 1).
338344

339345
```{r cnetplot custom terms}
340346
# Select terms of interest
@@ -385,7 +391,7 @@ enrichplot::heatplot(gsea_kegg, foldChange = ranked_core_genes, showCategory = 3
385391
386392
```
387393

388-
Saving to a file only improve things slightly.
394+
Saving to a file only improves things slightly.
389395

390396
```{r}
391397
png("clusterprofiler_gseKEGG_heatplot.png", width = 11.7, height = 8.3, units = "in", res = 300)

0 commit comments

Comments
 (0)