day2_Rnotebooks/clusterProfiler.Rmd (23 additions & 17 deletions)
@@ -34,7 +34,7 @@ head(data)
```
-
This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change largest to smallest. The R object type for `gseKEGG` needs to be a 'vector'. Unfortunately, this detail is under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times!
+
This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest). The R object type for `gseKEGG` needs to be a 'vector'. Unfortunately, this detail is under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG`, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times!
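As a minimal sketch of the "order ranked geneList" format, a named numeric vector sorted from largest to smallest fold change can be built in base R. The toy data frame and its column names `gene_id` and `log2FoldChange` are assumptions for illustration, not the workshop data.

```{r}
# Toy differential expression table; values and column names are hypothetical
toy <- data.frame(
  gene_id = c("geneA", "geneB", "geneC", "geneD"),
  log2FoldChange = c(-1.2, 3.4, 0.1, -2.8),
  stringsAsFactors = FALSE
)

# Named numeric vector: values are fold changes, names are gene IDs
ranked <- setNames(toy$log2FoldChange, toy$gene_id)

# Sort from largest to smallest fold change
ranked <- sort(ranked, decreasing = TRUE)

is.vector(ranked)
names(ranked)
```

Keeping the gene IDs as names means each value stays attached to its gene after sorting.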
@@ -43,6 +43,7 @@ This time, instead of filtering for DEGs, we will extract all genes, and sort th
@@ -60,26 +61,26 @@ Let's start by reviewing the function arguments. Once you run the below code chu
Most of those defaults look suitable to start.
-
We have human so the default "hsa" for argument`organism` is correct. If you were working with a species other than human, you first need to obtain your organism code. You can derive this from [KEGG Organisms](https://www.genome.jp/kegg/tables/br08606.html) or using the `clusterProfiler` function `search_kegg_organism`.
+
Our data is human, so the default `organism = "hsa"` argument is correct. If you were working with a species other than human, you would first need to obtain your organism code. You can derive this from [KEGG Organisms](https://www.genome.jp/kegg/tables/br08606.html) or by using the `clusterProfiler` function `search_kegg_organism`.
Pick your favourite species and search for the KEGG organism code by editing the variable `fave`, then executing the code chunk:
```{r}
fave <- "horse"
-
# search by commn_name or scientific_name
+
# search by common_name or scientific_name
search_kegg_organism(fave, by = "common_name")
```
-
Now back to the paramters - the defaults for P value correction and filtering, and gene set size limits are acceptable. We don't want to use internal data (ie, we want to search agaisntthe latest KEGG online) so `FALSE` is apt here, and we do want to use `fgsea` algorithm for analysis.
+
Now back to the parameters: the defaults for P value correction and filtering, and for gene set size limits, are acceptable. We don't want to use internal data (i.e. we want to search against the latest KEGG online) so the default `FALSE` for `use_internal_data` is apt here, and we do want to use the `fgsea` algorithm for analysis.
-
For the `seed` parameter, we should provide a value, to ensure results are the same each time the command. By setting a seed, you fix the sequence of random numbers generated within the GSEA algorithm.
+
For the `seed` parameter, we should provide a value, to ensure results are the same each time the command is run. By setting a seed, you fix the sequence of random numbers generated within the GSEA algorithm.
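The effect of fixing a seed can be seen with base R alone. This is only an illustration of `set.seed`; the seed value 123 is arbitrary, and nothing here reflects the internals of `gseKEGG`.

```{r}
# Same seed, same sequence of random draws
set.seed(123)
first_run <- sample(1:100, 5)

set.seed(123)
second_run <- sample(1:100, 5)

# TRUE: resetting the seed reproduces the draws exactly
identical(first_run, second_run)
```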
We also need to check the `keyType` (gene namespace) against the input data that we have. From the help page, we can see the supported namespaces for the KEGG database are one of 'kegg', 'ncbi-geneid', 'ncbi-proteinid' or 'uniprot'.
-
Going back to where we loaded the input data and ran `head` to view the first few rows, we can see our input has ENSEMBL gene IDs as well as official gene symbols. ENSEMBL gene IDs are generally preferable for bioinformatics analyses because they are more unique and stable compared to gene symbol.
+
Going back to where we loaded the input data and ran `head` to view the first few rows, we can see our input has ENSEMBL gene IDs as well as official gene symbols. ENSEMBL gene IDs are generally preferable for bioinformatics analyses because they are more stable and less ambiguous than gene symbols.
Since our input data does not match any of the valid namespaces, we need to convert gene IDs! `clusterProfiler` has the `bitr` function to do this. `biomaRt` is also a popular R package for this task.
@@ -92,7 +93,7 @@ Check the usage for the `bitr` function:
?bitr
```
-
We need to understand what are the valid `fromType` and `toType`, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within `clusterProfiler may not be possible for non-model species and you would need to seek a different method.
+
We need to understand what the valid `fromType` and `toType` values are, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within `clusterProfiler` may not be possible for non-model species, and you would need to seek a different method.
We have already loaded the `org.Hs.eg.db` annotation library. We can use this to search `keytypes`:
The 1:many mappings warning means that some of our gene IDs matched more than 1 ENTREZ ID. We need to ensure that the final list we provide to GSEA does not contain duplicates. This needs to happen at two stages: first, ensuring that each ENSEMBL ID is mapped to only one ENTREZ ID, and then, once the final converted vector has been created, checking it for duplicated ENTREZ IDs, which could occur when two different ENSEMBL IDs from our input map to the same ENTREZ ID.
-
The below code does this by selecting the first of each set of duplicates. This is not ideal. In a real experiment, you should print out the duplicates and directly manage how to handle them by reviewing the gene IDs involved and deciding whether it is valid to select one ID over another, or at times you may choose to merge the values for duplcaie genes.
+
The below code does this by selecting the first of each set of duplicates. This is not ideal. In a real experiment, you should print out the duplicates and directly manage how to handle them by reviewing the gene IDs involved and deciding whether it is valid to select one ID over another, or at times you may choose to merge the values for duplicate genes.
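A minimal base R sketch of the two-stage de-duplication, including printing the duplicates for review first. The mapping table `toy_map` and its IDs are hypothetical. As a simplification, both stages are applied to the mapping table itself rather than to the final converted vector.

```{r}
# Hypothetical ENSEMBL to ENTREZ mapping containing both kinds of duplicates
toy_map <- data.frame(
  ENSEMBL  = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  ENTREZID = c("101",    "102",    "201",    "201",    "301"),
  stringsAsFactors = FALSE
)

# Print every row involved in a duplicate so it can be reviewed
dup_ensembl <- toy_map$ENSEMBL[duplicated(toy_map$ENSEMBL)]
dup_entrez  <- toy_map$ENTREZID[duplicated(toy_map$ENTREZID)]
toy_map[toy_map$ENSEMBL %in% dup_ensembl | toy_map$ENTREZID %in% dup_entrez, ]

# Stage 1: keep the first ENTREZ ID for each ENSEMBL ID
stage1 <- toy_map[!duplicated(toy_map$ENSEMBL), ]

# Stage 2: keep the first ENSEMBL ID for each ENTREZ ID
stage2 <- stage1[!duplicated(stage1$ENTREZID), ]
stage2
```

Printing the affected rows before filtering is what lets you decide, gene by gene, whether "keep the first" is acceptable.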
-
We have 3 warnings, one about ties in the ranked list, which we could resolve manually by reviewing the raw data, or just ignore as its only 0.01% of the list.
+
We have 3 warnings. The first is about ties in the ranked list, which we could resolve manually by reviewing the raw data, or just ignore, as they affect only 0.01% of the list.
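To review ties by hand, one option is to flag every value that occurs more than once in the ranked vector. This is a base R sketch on hypothetical values; how the warning itself computes its percentage may differ.

```{r}
# Hypothetical ranked statistics; two genes share the value 1.5
toy_ranked <- c(g1 = 2.0, g2 = 1.5, g3 = 1.5, g4 = 0.3, g5 = -0.7)

# Flag every value that occurs more than once
tied <- duplicated(toy_ranked) | duplicated(toy_ranked, fromLast = TRUE)

toy_ranked[tied]    # the tied entries to review
100 * mean(tied)    # percentage of the list involved in ties
```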
-
The second warning may be resolved by following the suggestion to set `nPermSimple = 10000)`. How frustrating that this parameter is not described withint he`gseKEGG` help menu, nor the `clusterProfiler` PDF at all!
+
The second warning may be resolved by following the suggestion to set `nPermSimple = 10000`. How frustrating that this parameter is not described within the `gseKEGG` help menu, nor the `clusterProfiler` PDF at all!
-
Let's rerun following both suggestions, to use permutations and set `eps` to zero.
+
Let's rerun following both suggestions, to use permutations (`nPermSimple`) and set `eps = 0`. We will keep the rest of the arguments the same as before.
```{r gseKEGG perms}
gsea_kegg <- gseKEGG(
@@ -203,14 +205,16 @@ Great, those last 2 warnings have resolved and we only have the expected one abo
## Tabular results
-
First let's preview the results. We can see 41 significant enrichments.
+
First, let's preview the results.
```{r preview gsea results}
print(gsea_kegg)
```
+
We can see 41 significant enrichments.
+
Extract results to a TSV file. This will print the significant enrichments, sorted by adjusted P value.
```{r print GSEA results table }
@@ -287,10 +291,12 @@ The treeplot provides the same information as the emapplot but in a different vi
Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour.
-
Terms that share more genes or biological functions will be closer together in the tree structure. Clades are colour coded and 'cluster tags' assigned. You can control the number of words in the tag (default is 4). The user guide describes the argument `nWords` however running that will throw an error (it says 'warning' but its fatal so to me that's an eror!):
+
Terms that share more genes or biological functions will be closer together in the tree structure. Clades are colour coded and 'cluster tags' assigned. You can control the number of words in the tag (default is 4). The user guide describes the argument `nWords`; however, running that will throw an error (it says 'warning' but it's fatal, so to me that's an error!):
+
```
"Warning: Use 'cluster.params = list(label_words_n = your_value)' instead of 'nWords'.
The nWords parameter will be removed in the next version."
+
```
This plot also requires the pairwise similarity matrix calculation that emapplot does. Since we have already run it, it is hashed out in the code chunk below.
The cnetplot is helpful for understanding which genes are involved in the enriched terms, details that are not available in the plots generated so far. It depicts the linkages of genes and terms as a network.
-
For GSEA, where all genes (not jsut DEGs) are used, only the 'core' enriched genes are used to create the network plot. These are the 'leading edge genes', those genes up to the point where the Enrichment Score (ES) gets maximised from the base zero. In other words, the subset of genes that are most strongly associated with a specific term.
+
For GSEA, where all genes (not just DEGs) are used, only the 'core' enriched genes are used to create the network plot. These are the 'leading edge genes', those genes up to the point where the Enrichment Score (ES) gets maximised from the base zero. In other words, the subset of genes that are most strongly associated with a specific term.
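To make the 'leading edge' idea concrete, here is a simplified base R sketch of a weighted running enrichment score on toy data. The statistics, gene names and gene set are hypothetical, the weighting exponent is assumed to be 1, and this is not the `fgsea` implementation.

```{r}
# Hypothetical ranked statistics (sorted largest to smallest) and gene set
ranked <- c(g1 = 3.0, g2 = 2.5, g3 = 2.0, g4 = 1.0,
            g5 = 0.5, g6 = -0.5, g7 = -1.0, g8 = -2.0)
gene_set <- c("g1", "g3", "g6")

in_set <- names(ranked) %in% gene_set

# Step up at each gene in the set (weighted by its statistic),
# step down by a constant at every other gene
step_up   <- ifelse(in_set, abs(ranked) / sum(abs(ranked[in_set])), 0)
step_down <- ifelse(in_set, 0, 1 / sum(!in_set))
running   <- cumsum(step_up - step_down)

# Enrichment score: the largest deviation of the running sum from zero
peak <- which.max(abs(running))
es   <- running[peak]

# Leading edge: set members ranked at or before the peak (positive ES),
# or at or after it (negative ES)
position <- seq_along(ranked)
leading_edge <- if (es >= 0) {
  names(ranked)[in_set & position <= peak]
} else {
  names(ranked)[in_set & position >= peak]
}

es
leading_edge
```

In this toy case the running sum peaks at the third gene, so `g1` and `g3` form the leading edge while `g6`, which appears after the peak, is excluded.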
There are a few parameters to play around with here to get a readable plot.
@@ -334,7 +340,7 @@ You can also plot the interaction between specific terms. This is helpful to obt
The 3 terms listed below are for 'IL-17 signaling pathway', 'Viral protein interaction with cytokine and cytokine receptor' and 'Chemokine signaling pathway' which are top 10 enrichments with a relationship of shared genes, evident from the plot above.
-
Run the code below or select a handful of terms of your choosing from the results table we printed earlier. We need the KEGG ID (column 1). T
+
Run the code below or select a handful of terms of your choosing from the results table we printed earlier. We need the KEGG ID (column 1).