MonashBioinformaticsPlatform
diff --git a/13-webgestaltr.Rmd b/13-webgestaltr.Rmd
Lines changed: 1 addition & 2 deletions
diff --git a/14-novel-species-FEA.Rmd b/14-novel-species-FEA.Rmd
Lines changed: 54 additions & 18 deletions
diff --git a/day2_Rnotebooks/WebGestaltR.Rmd b/day2_Rnotebooks/WebGestaltR.Rmd
Lines changed: 21 additions & 34 deletions
@@ -26,7 +26,7 @@ This tool (both the web and R versions) has many features and advantages:
 
 1. Explore the organisms, databases/gene sets and namespaces supported natively
 2. Run ORA over pathway databases and explore the interactive HTML output
-3. Run GSEA over the GO 'non-redundant' and full database and compare the results
+3. Run GSEA over the `WebGestalt` `GO noRedundant` and full databases and compare the results
 
 <p>&nbsp;</p>  <!-- insert blank line -->
 
@@ -51,5 +51,4 @@ You could also open the file by selecting `File` &rarr; `Open file`, or use the
 - We have reviewed the organisms and databases that are natively supported by this easy to use tool 
 - We have run both ORA and GSEA and explored the interactive HTML results summary  
 - We have touched on the redundancy filters available within this tool, for GO as well as two external algorithms applied automatically to any enrichment performed
-- We have learnt about an R package that can create compatability between `WebGestaltR` (and other FEA tools) with `enichplot` for visualisation options
 - In the next session, we will use `WebGestaltR` with novel species
@@ -30,6 +30,18 @@ Despite the lack of quality resources, there is much 'omics work conducted in ax
 
 Today we will use [public RNAseq data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5419050/#SD7) from axolotl, comparing gene expression in the blastema after proximal (at the shoulder) and distal (at the hand) limb amputation. The blastema is a collection of undifferentiated progenitor cells that give rise to the regenerated limb. Maybe our functional enrichment analysis of differentially expressed genes can help us understand processes that cause the blastema to grow into a full limb or just a hand! 
 
+## Caveat!
+
+This is not a real experiment! It was tricky to find a novel species that met all of these criteria:
+
+a) has a reference genome
+b) has a GTF
+c) is not natively supported by any FEA tool
+d) has publicly available RNAseq data
+e) has FEA described in a peer-reviewed study 
+
+This axolotl data ticked A through D but not E (RNA from various tissues was sequenced to aid a genome assembly project rather than for a valid biological experiment). Please keep this in mind when reviewing the results of the FEA we perform 😉 The goal of the session is not *really* to uncover how a blastema differentiates into a hand or an arm, but to demonstrate how you can apply this method to your own novel species. 
+
 <p>&nbsp;</p>  <!-- insert blank line -->
 
 ### Raw data sources
@@ -130,7 +142,7 @@ The `WebGestaltR` tab delimited `description` file with `.des` suffix has these
 
 **Example .des format:**
 
-![webgr-gmt](images/webgr-des-file.png)
+<img src="images/webgr-des-file.png" style="border: none; box-shadow: none; background: none; ">
 
 <p>&nbsp;</p>  <!-- insert blank line -->
 
@@ -142,6 +154,27 @@ The `.gmt` file is provided to the parameter `enrichDatabaseFile` and the `.des`
 
 <p>&nbsp;</p>  <!-- insert blank line -->
 
+
+### Mapping database terms to descriptions
+
+Annotating a proteome with a tool such as `emapper` provides a connection between your novel species gene IDs and term IDs. To add the term description, we need a database file. In this analysis, we will use the [GO 'core' ontology file](https://purl.obolibrary.org/obo/go.obo) and the [KEGG Pathways file](https://www.pathway.jp/en/academic.html). These files were downloaded to the `workshop` folder of the VMs during our setup session. 
+
+The GO `go.obo` is a text file (>600K lines) with details for all terms in GO at the time of download. The second line of the file contains the GO database version, in this case: `data-version: releases/2024-06-17`. 
+
+**go.obo term information is structured like this:**
+
+<img src="images/go-obo-example.png" style="border: none; box-shadow: none; background: none; width: 100%;">
+
+
+The KEGG Pathways file is identical in format to what is required for both the `clusterProfiler` `TERM2NAME` and `WebGestaltR` `.des` files:
+
+**KEGG Pathways file format:**
+
+<img src="images/webgr-des-file.png" style="border: none; box-shadow: none; background: none; ">
+
+<p>&nbsp;</p>  <!-- insert blank line -->
+
+
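The `go.obo` layout and the two-column description format described above can be sketched in base R. This is a minimal illustration, not the workshop's own code: the `obo_lines` vector is a hypothetical excerpt standing in for the real file (which you would load with `readLines()`), and the assumption that a `.des` file is simply term ID and description separated by a tab, with no header row, should be checked against the `WebGestaltR` documentation.

```r
# Hypothetical excerpt in OBO format; for the real file use:
# obo_lines <- readLines("go.obo")
obo_lines <- c(
  "format-version: 1.2",
  "data-version: releases/2024-06-17",
  "",
  "[Term]",
  "id: GO:0000001",
  "name: mitochondrion inheritance",
  "namespace: biological_process",
  "",
  "[Term]",
  "id: GO:0003674",
  "name: molecular_function",
  "namespace: molecular_function",
  "",
  "[Typedef]",
  "id: part_of",
  "name: part of"
)

# The database version is recorded on the second line of the file
go_version <- sub("^data-version: ", "", obo_lines[2])

# Label every line with the stanza it belongs to, then keep [Term] stanzas only
# (this skips the file header and [Typedef] stanzas, which also carry id/name)
is_header   <- grepl("^\\[.*\\]$", obo_lines)
stanza_type <- c("header", obo_lines[is_header])[cumsum(is_header) + 1]
term_lines  <- obo_lines[stanza_type == "[Term]"]

# Each [Term] stanza has one 'id:' and one 'name:' line, so the vectors align
go_des <- data.frame(
  term        = sub("^id: ", "", grep("^id: ", term_lines, value = TRUE)),
  description = sub("^name: ", "", grep("^name: ", term_lines, value = TRUE)),
  stringsAsFactors = FALSE
)

# Write a tab-delimited, header-less file (assumed .des layout)
des_file <- file.path(tempdir(), "go_terms.des")
write.table(go_des, des_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

print(go_version)
print(go_des)
```

The same two-column data frame could also serve as a `TERM2NAME` table for `clusterProfiler`, since the text notes that the formats match.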
 ## RNotebook novel species FEA
 
 &#x27A4; Return to RStudio and open the notebook `novel_species.Rmd`.
@@ -160,7 +193,7 @@ We will now take a quick look at novel species FEA online with `STRING`.
 
 <p>&nbsp;</p>  <!-- insert blank line -->
 
-## STRING novel species FEA
+## STRING (web) novel species FEA
 
 The axolotl putative proteome was previously uploaded to STRING and custom annotation performed. This was completed on the STRING servers, with a compute time of less than one day. 
 
@@ -201,31 +234,33 @@ Note that the `Organism` field is pre-filled with `STRG0A90SNX (axolotl)`.
 
 <p>&nbsp;</p>  <!-- insert blank line -->
 
-&#x27A4; In the RStudio `Files` pane, locate your saved ORA gene list from earlier - `workshop/Axolotl_DEGs.txt`. Click the file to view it in RStudio, then copy paste the list into the STRING `List of Names` search field, then select `SEARCH`. 
+&#x27A4; In the RStudio `Files` pane, locate your saved ORA gene list from earlier - `workshop/Axolotl_DEGs.txt`. Click the file to view it in RStudio, then copy and paste the list into the STRING `List of Names` search field and select `SEARCH`
+
+&#x27A4; Click `CONTINUE` at the gene ID review page
 
-- Note that there is no option at this query page (even under `Advanced Settings`) to provide a custom background gene list. This can be done after the initial search has been run. 
+Before we explore the results, note that we have performed ORA without a background gene list! 😮
 
-- Click `CONTINUE` at the gene ID review page
+There is no option at the query page (even under `Advanced Settings`) to provide a custom background gene list. This must be done *after* the initial search has been run. Hopefully this will change in future versions. 
 
-- Then scroll ALL the way to the bottom of the page to the subheading `Statistical background`
-    - If you did not have a relevant saved background gene list, you would select `Add background`
-    - In this case, there is a saved background list for this annotation called `axolotl-blastema-background` (**FRED is this showing for you?**), so you can select the background from the dropdown menu then click `UPDATE`. This will re-cmpute the enrichment. 
+**To add a custom background to STRING ORA:**
 
+To perform this part, you need a `STRING` login. If you have one, feel free to follow these instructions; otherwise, watch along. 
 
-<img src="images/string-add-bg.png" style="border: none; box-shadow: none; background: none; width: 100%;">
+1. Click on the `ANALYSIS` tab
+2. Scroll ALL the way to the bottom of the page to the subheading `Statistical background`
+    - If you do not have a relevant saved background gene list in your `STRING` saved datasets, you would select `Add background`. See the next sub-heading for details 
+    - If you do have a relevant background saved, change `Whole Genome` to select the relevant gene background list from your saved datasets
+3. Once you have selected the background, click `UPDATE` and the ORA will be re-computed using the custom statistical background 
 
 <p>&nbsp;</p>  <!-- insert blank line -->
 
 <img src="images/string-update-bg.png" style="border: none; box-shadow: none; background: none; width: 100%;">
 
-<p>&nbsp;</p>  <!-- insert blank line -->
-
-**Note**: you may need to login! If you do not have a `STRING` login, you can watch along
 
 <p>&nbsp;</p>  <!-- insert blank line -->
 
-**If you need to add a custom background:**
-- Select `ADD BACKGROUND`
+**Load axolotl background gene list to your STRING profile**
+- Under the `ANALYSIS` tab of the ORA results page, at `Statistical background`, select `ADD BACKGROUND`
 - You will be prompted to login if you are not already
 - Under `1) name your new set` provide a descriptive name for your background gene list. I chose `axolotl-blastema-background`
 - At `2) identify your organism` this will be pre-filled with our custom axolotl annotation
@@ -238,17 +273,18 @@ Note that the `Organism` field is pre-filled with `STRG0A90SNX (axolotl)`.
 
 The gene list will be mapped to the STRING database. This may take several minutes.  
 
+<p>&nbsp;</p>  <!-- insert blank line -->
+
 `STRING` saves your custom datasets under `My Data`:
 
 <img src="images/string-set-novel-bg.png" style="border: none; box-shadow: none; background: none; width: 100%;">
 
-Despite this, there is not (yet) an option to run the initial analysis with a pre-saved background gene list! Hopefully this feature will be added in the future. 
-
 
 <p>&nbsp;</p>  <!-- insert blank line -->
 
 **<span style="color: green;">Now let's explore the results!</span>**
 
+
 Some suggestions:
 
 - Select a node on the network, and then `Show this node's terms in the analysis table` to highlight the terms the gene was present in 
@@ -260,7 +296,7 @@ Some suggestions:
 
 ### How do the STRING results compare to those we generated in R? 
 
-We expect a large difference in the results because of the differing proteome annotation methods. 
+**We expect a large difference in the results because of the differing proteome annotation methods - both the annotation tool and the databases that were annotated against.**  
 
 The `eggNOG emapper` annotations we used in `R` rely on orthology-based predictions, employing extensive similarity searches to map genes to their closest evolutionary counterparts across diverse species. This results in a more comprehensive catalog of functional terms, even if they are inferred rather than directly evidenced.
 
@@ -281,6 +317,6 @@ And a clear lack of overlap in number of enriched terms and term IDs between `ST
 
 <img src="images/string-novel-ora-compare.png" style="border: none; box-shadow: none; background: none; ">
 
-These GO terms from `STRING` may be parent terms of more specific child terms prevalent in the R output. For a real world analysis, it would be optimal to compare, and deduce whether both methods could provide valuable and complimentary insights, or whether the results from one annotation approach or the other were more suited to your novel species. 
+These GO terms from `STRING` may be parent terms of more specific child terms prevalent in the `R` output. For a real-world analysis, it would be optimal to compare, and deduce whether both methods could provide valuable and complementary insights, or whether the results from one annotation approach or the other were more suited to your novel species. 
 
 Whichever you choose, strength to you! This is not an easy space to work in 💪 
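
One simple way to quantify the overlap between the two approaches is to compare the sets of significant term IDs directly. The sketch below uses base R set operations; the GO IDs are hypothetical placeholders rather than results from this workshop, and in practice the two vectors would come from the `STRING` export and the `R` enrichment tables.

```r
# Hypothetical significant GO term IDs from each approach (placeholders only)
r_terms      <- c("GO:0003677", "GO:0005515", "GO:0008270")
string_terms <- c("GO:0005515", "GO:0003824")

# Terms found by both approaches, and terms unique to each
shared      <- intersect(r_terms, string_terms)
only_r      <- setdiff(r_terms, string_terms)
only_string <- setdiff(string_terms, r_terms)

# Jaccard index: shared terms as a fraction of all distinct terms
jaccard <- length(shared) / length(union(r_terms, string_terms))

print(shared)
print(only_r)
print(only_string)
print(jaccard)
```

A low index does not by itself mean that one result is wrong: as noted above, a `STRING` parent term and a more specific child term in the `R` output would count as non-overlapping under an exact ID comparison.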
@@ -15,6 +15,8 @@ library(enrichplot)
 
 # 0. Working directory
 
+Ensure the 'workshop' directory is your current working directory:
+
 ```{r check notebook workdir}
 getwd()
 ```
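
If you would rather have the notebook check this for you than read the `getwd()` output by eye, a small helper can do it. This is a sketch only: `is_workshop_dir()` is a hypothetical function name, and it assumes the folder is literally called `workshop`.

```r
# Hypothetical helper: TRUE when the final path component matches 'expected'
is_workshop_dir <- function(path = getwd(), expected = "workshop") {
  basename(path) == expected
}

# Report, rather than stop, so the rest of the notebook can still run
if (!is_workshop_dir()) {
  message("Current directory is not 'workshop': ", getwd())
}
```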
@@ -216,7 +218,7 @@ Some things to note:
 
 - You can increase the default view of 20 rows to 'All' for the enrichment table, but this does not necessarily show all significant enrichments! Check the output file `enrichment_results_ORA_pathways.txt` and you can see 85 significant terms, yet 'All' view with default of 20 rows shows 30 something. To increase the number of rows included in the HTML report, use the parameter `reportNum` 
 
-- You can run algorithms to reduce the number of terms through clustering, in order to make the results more manageable. This is discussed in the WebGestalt 2019 update publication [Liao et al 2019](https://academic.oup.com/nar/article/47/W1/W199/5494758). The authors maintain that "important biological themes are all covered with these selected gene sets" 
+- You can run algorithms to reduce the number of terms through clustering, in order to make the results more manageable. This is discussed in the WebGestalt 2019 update publication [Liao et al 2019](https://academic.oup.com/nar/article/47/W1/W199/5494758). The authors maintain that "important biological themes are all covered with these selected gene sets". Built-in redundancy handling/term clustering is a feature of `WebGestaltR` (and the web version). To what extent this is appropriate for the database you are using is up to you to determine. For example, in the next analysis we will perform GSEA over the `noRedundant` GO MF database. Applying a double layer of redundancy filters over a database seems quite dubious to me.  
 
 - Selecting a term from the 'Enrichment Results' table updates the term under 'Select an enriched analyte set', where more detailed results are shown, including the genes from your gene list present within the gene set for that term 
 
@@ -250,6 +252,16 @@ Let's run enrichment over the full and the non-redundant version of the GO MF da
 
 Let's use GSEA since we have already tried ORA with this package. GSEA is slower and GO is large, so even with 7 threads these commands will take a few minutes (longer for redundant than non-redundant, of course). Feel free to use the compute time to ask questions on slack or explore the ORA pathways output some more!
 
+There is no `seed` parameter for `WebGestaltR` GSEA as there is for `clusterProfiler`. We can set it in R instead with `set.seed()`. 
+
+```{r set seed}
+set.seed(123)
+```
+
+A note from testing: without setting the seed in R, a slightly different number of enriched terms was returned over 3 runs. With the seed set, the same number and IDs of terms were significant among the replicate runs, BUT the NES and the FDR were slightly different! The unadjusted ES and P values were the same. 
+
+
+
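To see what `set.seed()` does in isolation, here is a minimal base R illustration. It does not call `WebGestaltR`; it only shows that resetting the seed reproduces the same random draws, which is the behaviour the note above relies on. The object names are arbitrary.

```r
# Same seed, same random permutation
set.seed(123)
perm_a <- sample(1:10)

set.seed(123)
perm_b <- sample(1:10)

# Without resetting the seed, the next draw continues the random stream
perm_c <- sample(1:10)

identical(perm_a, perm_b)  # TRUE: the seed was reset before both draws
identical(perm_a, perm_c)  # compares a fresh draw against the first one
```

This is also why the seed should be set immediately before each analysis you want to reproduce: any random draw made in between moves the stream along.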
 ```{r GSEA GO MF with redundant}
 
 outputDirectory <- "WebGestaltR_results" 
@@ -279,7 +291,6 @@ suppressWarnings({ gomf <- WebGestaltR(
 
 
 ```{r GSEA GO MF nonredundant}
-
 outputDirectory <- "WebGestaltR_results" 
 project <- "GSEA_GO-MF_non-redundant"
 database  <- "geneontology_Molecular_Function_noRedundant"
@@ -367,36 +378,11 @@ By grouping so many similar terms with the non-redundant analyses, the overall n
 For your own research, you could explore the relationships between these terms by viewing the neighborhood of GO terms on AmiGO: https://amigo.geneontology.org/amigo, or using NaviGO https://kiharalab.org/navigo/views/goset.php
 
 
+# 5. Save versions and session details
 
-# 5 Plot `WebGestaltR` results with `enichplot` 
-
- A key advantage of R is flexibility with data manipulation and visualisation. In the previous exercise, we have explored the many plot options of `enrichplot`. This package was designed to work with an `enrichResult` object from `clusterProfler` and a handful of other packages written by the same team. The desire to use `enrichplot` with the output of other tools is widespread. 
- 
-The R package `multienrichjam` has a function `enrichDF2enrichResult` that converts dataframe type results from other FEA tools to the format required for `enrichplot`!  
- 
- `multienrichjam` has a lot of dependencies and has not been installed on these VMs so we will not be performing this today. However, this functionality and flexibility is pretty cool, so if you wanted to install this on your own computer outside the workshop, below is the code for installing :-)  
-
-
-```{r install multienrichjam}
-# github remotes package required:
-#library(remotes)
+## Database query dates
 
-# install multienrichjam:
-remotes::install_github("jmw86069/multienrichjam", dependencies=TRUE)
-
-# load library: 
-#library(multienrichjam)
-
-# check function help menu: 
-#?enrichDF2enrichResult
-```
-
-
-
-
-# 6. Save versions and session details
-
-Unlike `gprofiler`, `WebGestaltR` does not have a function to list the version of the queried database. 
+Unlike `gprofiler`, `WebGestaltR` does not have a function to list the version of the queried databases. 
 
 For this reason, we will save the analysis date to our rendered notebook, so the external database version could be back-calculated from the date if required:
 
@@ -406,16 +392,17 @@ print(Sys.Date())
 
 ```
 
-
-R version and package versions:
+## R version and R package versions
 
 ```{r info }
 sessionInfo()
 ```
 
 
 
-And RStudio version. Typically, we would simply run `RStudio.Version()` to print the version details. However, when we knit this document to HTML, the `RStudio.Version()` function is not available and will cause an error. So to make sure our version details are saved to our static record of the work, we will save to a file, then print the file contents back into the notebook. 
+## RStudio version
+
+Typically, we would simply run `RStudio.Version()` to print the version details. However, when we knit this document to HTML, the `RStudio.Version()` function is not available and will cause an error. So to make sure our version details are saved to our static record of the work, we will save to a file, then print the file contents back into the notebook. 
 
 
 ```{r rstudio version - not run during knit, eval=FALSE}
@@ -450,7 +437,7 @@ rstudio_version_text
 
 
 
-# 7.  Knit workbook to HTML
+# 6.  Knit workbook to HTML
 
 The last task is to knit the notebook. Our notebook is editable, and can be changed. Deleting code deletes the output, so we could lose valuable details. If we knit the notebook to HTML, we have a permanent static copy of the work.