oops this one is final CP

calizilla · calizilla · commit f8bdf8b68e53 · 2024-11-20T21:33:30.000+11:00
diff --git a/day2_Rnotebooks/clusterProfiler.Rmd b/day2_Rnotebooks/clusterProfiler.Rmd
@@ -36,8 +36,17 @@ head(data)
 ```
 
 
-This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest). The R object type for `gseKEGG` needs to be a 'vector'. Unfortunately, this detail is under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG`, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times! 
+This time, instead of filtering for DEGs, we will extract all genes, and sort them by fold change (largest to smallest) to make our ranked gene list for GSEA. 
 
+What class of R object does our ranked gene list need to be in?
+
+```{r help gseKEGG}
+?gseKEGG
+```
+
+We can see `geneList = order ranked geneList`. Not informative!
+
+Unfortunately, the required type of object is detailed under the `enrichKEGG` function, not the `gseKEGG` function! For `enrichKEGG`, the parameter `gene` is described as requiring "a vector of entrez gene id", yet for `gseKEGG`, the description for `geneList` is "order ranked geneList". There is a little bit of sleuthing required at times! 
 
 
 ```{r extract ranked list}
@@ -54,7 +63,7 @@ tail(ranked)
 
 # 2. Check `gseKEGG` function arguments and requirements
 
-Let's start by reviewing the function arguments. Once you run the below code chunk, the user guide for the function will open in the `Help` pane of RStudio (lower right) 
+Review the parameters:
 
 ```{r help gsekegg}
 ?gseKEGG
@@ -65,12 +74,12 @@ Most of those defaults look suitable to start.
 
 We have human so the default `organism = "hsa"` argument is correct. If you were working with a species other than human, you first need to obtain your organism code. You can derive this from [KEGG Organisms](https://www.genome.jp/kegg/tables/br08606.html) or using the `clusterProfiler` function `search_kegg_organism`.
 
-Pick your favourite species and search for the KEGG organism code by editing the variable 'organism' then executing the code chunk:
+Pick your favourite species and search for the KEGG organism code by editing the variable 'fave' then executing the code chunk:
 
-```{r}
+```{r check kegg orgs}
 fave <- "horse"
 
-# search by common_name or scientific_name 
+# search by common_name or scientific_name or kegg.code 
 search_kegg_organism(fave, by = "common_name")
 ```
 
@@ -90,25 +99,25 @@ Since our input data does not match any of the valid namespaces, we need to conv
 
 Check the usage for the `bitr` function:
 
-```{r}
+```{r help bitr}
 ?bitr
 ```
 
-We need to understand what the valid `fromType` and `toType` values are, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within ``clusterProfiler` may not be possible for non-model species and you would need to seek a different method.  
+We need to understand what the valid `fromType` and `toType` values are, and it turns out we need an Org.db to use `bitr`! This is a Bioconductor annotation package, of which there are currently only 20. So while the `gseKEGG` function supports all organisms in KEGG, performing gene ID conversion within `clusterProfiler` may not be possible for non-model species and you would need to seek a different method.  
 
 We have already loaded the `org.Hs.eg.db` annotation library. We can use this to search `keytypes`:
 
-```{r}
+```{r check org db keytypes for huamn}
 keytypes(org.Hs.eg.db)
 ```
 
 Converting to ENTREZ will be compatible with `kegg`. So our `fromType` is `ENSEMBL` and `toType` is `ENTREZID`. Note that these are case sensitive! 
 
 If we were to run `enrichGO` or `gseaGO` that require a Bioconductor Org.db package, we would not need to do a conversion, as both of the gene ID types within our input data (`ENSEMBL` and `SYMBOL`) are natively supported. 
 
-The below code performs the reformatting and handles duplicates by keeping only the first occurrence of duplicate Entrez IDs within the input. Note that this is not ideal and for a real experiment you should print out a list of duplicates, carefully review these and choose which and how to retain based on your biological context.
+Gene ID conversion often results in duplicates. The below code performs the reformatting and handles duplicates by keeping only the first occurrence of duplicate Entrez IDs within the input. Note that this is not ideal and for a real experiment you should print out a list of duplicates, carefully review these and choose which and how to retain based on your biological context.
 
-First, convert map the ENSEMBL gene IDs from our 'ranked' list to ENTREZ gene IDs: 
+First, convert the ENSEMBL gene IDs from our 'ranked' list to ENTREZ gene IDs: 
 
 ```{r convert gene IDs}
 # Convert the Ensembl IDs to Entrez IDs, dropping NAs
@@ -146,6 +155,7 @@ head(ranked_entrez)
 tail(ranked_entrez)
 
 ```
+
 Looks good! Now we are ready to enrich!
 
 # 4. Run GSEA over KEGG database
@@ -155,19 +165,19 @@ Now run GSEA over the KEGG database with `gseKEGG` function, setting a `seed` va
 
 ```{r gseKEGG}
 gsea_kegg <- gseKEGG( 
-  ranked_entrez, 
-  organism = "hsa", 
-  keyType = "kegg", 
-  exponent = 1, 
-  minGSSize = 10, 
-  maxGSSize = 500, 
-  eps = 1e-10, 
-  pvalueCutoff = 0.05, 
-  pAdjustMethod = "BH",
-  verbose = TRUE, 
-  use_internal_data = FALSE, 
-  seed = 123, 
-  by = "fgsea"
+  ranked_entrez, # ranked gene list vector object
+  organism = "hsa", # species
+  keyType = "kegg", # gene ID type/namespace
+  exponent = 1, # weight of each step
+  minGSSize = 10, # minimum gene set size
+  maxGSSize = 500, # maximum gene set size
+  eps = 1e-10, # sets the boundary for calculating the p value
+  pvalueCutoff = 0.05, # adjusted P value cutoff for enriched terms
+  pAdjustMethod = "BH", # multiple testing correction with BH method 
+  verbose = TRUE, # print output as it runs
+  use_internal_data = FALSE, # use latest online KEGG data not local data
+  seed = 123, # set seed for random number generation, for reproducibility 
+  by = "fgsea" # GSEA algorithm 
 )
 
 
@@ -177,7 +187,7 @@ We have 3 warnings, one about ties in the ranked list, which we could resolve ma
 
 The second warning may be resolved by following the suggestion to set `nPermSimple = 10000`. How frustrating that this parameter is not described within the `gseKEGG` help menu, nor the `clusterProfiler` PDF at all!  
 
-Let's rerun following both suggestions, to use permutations (`nPermSimple`) and set `eps = 0`. We will keep the rest of the arguments the same as before.    
+Let's rerun following both suggestions, to use permutations (`nPermSimple = 1000`) and set `eps = 0`. We will keep the rest of the arguments the same as before.    
 
 ```{r gseKEGG perms}
 gsea_kegg <- gseKEGG( 
@@ -187,17 +197,18 @@ gsea_kegg <- gseKEGG(
   exponent = 1, 
   minGSSize = 10, 
   maxGSSize = 500, 
-  eps = 0, 
+  eps = 0, # changed to zero 
   pvalueCutoff = 0.05, 
   pAdjustMethod = "BH",
   verbose = TRUE, 
   use_internal_data = FALSE, 
   seed = 123, 
   by = "fgsea",
-  nPermSimple = 10000
+  nPermSimple = 10000 # added permutations
 )
 
 ```
+
 Great, those last 2 warnings have resolved and we only have the expected one about tied rankings. 
 
 
@@ -232,7 +243,7 @@ You can view the table in the `editor` pane by clicking on it from the `Files` p
 
 One of the key advantages of using R over web tools is flexibility with visualisations. Next we will produce a range of plot types from the `enrichplot` package, developed by the same authors as `clusterProfiler`. 
 
-In this package, some of the plots can be used to plot ORA or GSEA, and others only for one type or the other. 
+In this package, some of the plots can be used to plot ORA or GSEA results, and others only for one type or the other. 
 
 To determine which plot functions are compatible with which FEA results type, you can check the help menu. If you see `## S4 method for signature 'enrichResult'` this plot is compatible with ORA results object.  If you see `## S4 method for signature 'gseaResult'` this plot is compatible with GSEA results object.
 
@@ -253,7 +264,7 @@ And review the help menu for `ridgeplot` which can only plot GSEA results:
 
 Unfortunately, `enrichplot` is only compatible with the enrichment results from the packages produced by this development team, although the desire to use `enrichplot` with the output of other tools is widespread. The R package `multienrichjam` has a function `enrichDF2enrichResult` that converts ORA dataframe results from other FEA tools to the format required for `enrichplot`. 
  
- `multienrichjam` has a lot of dependencies and has not been installed on these VMs so we will not be performing this today. However, this functionality and flexibility is pretty cool, so if you wanted to install this on your own computer outside the workshop, below is the code for installing :-) 
+`multienrichjam` has a lot of dependencies and has not been installed on these VMs so we will not be performing this today. However, this functionality and flexibility is pretty cool, so if you wanted to install this on your own computer outside the workshop, below is the code for installing :-) 
  
  
 ```{r install multienrichjam}
@@ -265,6 +276,8 @@ Unfortunately, `enrichplot` is only compatible with the enrichment results from
 #remotes::install_github("jmw86069/multienrichjam", dependencies=TRUE)
 #library(multienrichjam)
 
+# check function help: 
+#?enrichDF2enrichResult
 ```
 
 
@@ -276,7 +289,7 @@ We can change the number of top terms that are plotted with the `showCategory` p
 
 ```{r GSEA KEGG dotplot }
 
-clusterProfiler::dotplot(
+enrichplot::dotplot(
   gsea_kegg,
   x = "geneRatio",
   color = "p.adjust",
@@ -294,18 +307,16 @@ clusterProfiler::dotplot(
 
 The upset plot shows the pattern of shared genes among the enriched terms (since a gene can belong to multiple terms or pathways).
 
-For ORA, upsetplot will calculate the overlaps among different gene sets. 
-
-For GSEA, it will plot the fold change distributions of different categories (e.g. unique to pathway, overlaps among different pathways).
+For GSEA, the plot includess fold change distributions on the Y axis. 
 
 `n` controls the number of terms plotted - here we have restricted to 10 for ease of viewing. Plot view is quite small within a notebook, but if you wanted to plot more terms, you could print to an A4 size output file for enhanced clarity. 
 
 
 
-```{r GSEA KEGG upset plot}
+```{r GSEA KEGG upset plot, fig.width = 10 }
 
 # n = the number of top enrichments to plot
-enrichplot::upsetplot(gsea_kegg, n = 10)
+enrichplot::upsetplot(gsea_kegg, n = 15)
 
 ```
 
@@ -319,21 +330,21 @@ It is required to run the function `pairwise_termsim` before producing the plot.
 
 When running `pairwise_termsim`, we don't need to save the output to a new object, because the information is added to the object and does not change any other attributes. 
 
+Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour.
+
 
 ```{r GSEA KEGG emapplot}
 # calculate pairwise similarities between the enriched terms
 gsea_kegg <- enrichplot::pairwise_termsim(gsea_kegg)
 
 # plot
-enrichplot::emapplot(gsea_kegg, showCategory = 15)
+enrichplot::emapplot(gsea_kegg, showCategory = 25)
 
 ```
 
 ### treeplot
 
-The treeplot provides the same information as the emapplot but in a different visualisation. 
-
-Each node is a term, and the number of genes associated with the term is shown by the dot size, with P values by dot colour. 
+The treeplot provides the same information as the emapplot but presented in a different way.
 
 Terms that share more genes or biological functions will be closer together in the tree structure. Clades are colour coded and 'cluster tags' assigned. You can control the number of words in the tag (default is 4). The user guide describes the argument `nWords` however running that will throw an error (it says 'warning' but it is fatal so to me that's an error!):
 
@@ -353,13 +364,15 @@ enrichplot::treeplot(gsea_kegg, showCategory = 15, color = "p.adjust", cluster.p
 
 ### cnetplot
 
-The cnetplot is helpful to understanding which genes are involved in the enriched terms, details that are not available in the plots generated so far. It depicts the linkages of genes and terms as a network. 
+The cnetplot depicts the linkages of genes and terms as a network. This is helpful to understanding which genes are involved in the enriched terms, ading a level of detail not offered by the plots we have generated so far.  
 
 For GSEA, where all genes (not just DEGs) are used, only the 'core' enriched genes are used to create the network plot. These are the 'leading edge genes', those genes up to the point where the Enrichment Score (ES) gets maximised from the base zero. In other words, the subset of genes that are most strongly associated with a specific term. 
 
 There are a few parameters to play around with here to get a readable plot. 
 
-Try changing the number of terms plotted (`showCategory`) and the `node_label` which controls whether labels are put on terms, genes (`item`), or `all`. 
+Try changing the number of terms plotted (`showCategory`). 
+
+Try changing the `node_label` which controls whether labels are put on terms (`category`), genes (`item`), or both (`all`). 
 
 Since the information that is attempted to be plotted is complex, having a large number of terms and attempting to label everything won't look very informative! If you want to plot both gene IDs and term names, you will need to plot a small number of categories. 
 
@@ -372,15 +385,15 @@ Since the information that is attempted to be plotted is complex, having a large
 
 enrichplot::cnetplot(gsea_kegg, showCategory = 8, cex_label_gene = 0.5, cex_label_category = 0.8, colorEdge = TRUE, node_label = "category")
 
-
-
-
 ```
 
+With 8 terms and no gene IDs, we don't get any more detail than from the treeplot or emapplot. 
 
-You can also plot the interaction between specific terms. This is helpful to obtain gene ID level resolution for term interactions of interest. 
+This plot is really useful for showing a detailed look at a small number of terms. Just plotting the top 3 terms may not look very helpful (try plotting the top 3!) 
 
-The 3 terms listed below are for 'IL-17 signaling pathway', 'Viral protein interaction with cytokine and cytokine receptor' and 'Chemokine signaling pathway' which are top 10 enrichments with a relationship of shared genes, evident from the plot above. 
+A useful application is plotting the interaction between specific terms. This is helpful to obtain gene ID level resolution for term interactions of interest. 
+
+The 3 terms listed below are for 'IL-17 signaling pathway', 'Viral protein interaction with cytokine and cytokine receptor' and 'Chemokine signaling pathway' which are among the top 10 enrichments with a relationship of shared genes, evident from the plot above. 
 
 Run the code below or select a handful of terms of your choosing from the results table we printed earlier. We need the KEGG ID (column 1).  
 
@@ -410,10 +423,10 @@ Now we can clearly see the gene IDs for this network.
 Like the upset plot, the heatplot shows shared genes across enriched terms. When there are a lot of genes, it can be poorly readable, especially within the notebook. This function does not include a parameter to restrict to 'core' genes like the `cnetplot` or `ridgeplot` so even when plotting a very small number of enriched terms, the X-axis is too cramped to be readable.
 
 
-
 ```{r heatplot GSEA KEGG }
 enrichplot::heatplot(gsea_kegg, showCategory = 3)
 ```
+
 The gseKEGG result contains a data column containing the core/leading edge genes: 
 
 ```{r show GES result columns}
@@ -470,7 +483,7 @@ enrichplot::ridgeplot(gsea_kegg,
 
 Another GSEA-specific plot, which shows the contribution to the normalised enrichment score (NES) of genes across the gene list.
 
-The gseaplot can plot one term at a time. 
+The gseaplot can plot one term at a time. You would run GSEA plots for key terms of interest in your results. 
 
 The term 'hsa04020' (Calcium signaling pathway) has a positive NES, an adjusted P value of 0.00046 and a set size of 279. 
 
@@ -490,7 +503,7 @@ enrichplot::gseaplot( gsea_kegg,
 
 ```
 
-The term 'hsa04062' (Chemokine signaling pathway) has a negative NE, a highyl significant adjusted P value of 1.11E-06 and a set size of 129. 
+The term 'hsa04062' (Chemokine signaling pathway) has a negative NE, a highly significant adjusted P value of 1.11E-06 and a set size of 129. 
 
 ```{r gseaplot GSEA KEGG downreg, fig.width=10 }
 
@@ -536,7 +549,7 @@ The `volplot` function within `enrichplot` does not support GSEA result objects,
 The volcano plot has the advantage of showing whether the genes in the enriched terms were predominantly from the top (upregulated) or bottom (downregulated) end of the list. 
 
 
-```{r}
+```{r gsea volcano plot}
 ggplot2::ggplot(gsea_kegg@result, aes(x = enrichmentScore, y = -log10(p.adjust), color = p.adjust)) +
   geom_point(alpha = 0.7, size = 2) +  # Adjust point size
   scale_color_gradient(low = "blue", high = "red") +  # Color by p.adjust values