bioinformatics-core-shared-training
diff --git a/Markdowns/12_Gene_set_testing.Rmd b/Markdowns/12_Gene_set_testing.Rmd
Lines changed: 65 additions & 109 deletions
@@ -7,47 +7,34 @@ output:
     toc: yes
   html_document:
     toc: yes
-always_allow_html: true
 bibliography: ref.bib
+always_allow_html: true
 ---
 
 ```{r setup, include=FALSE, cache=FALSE}
-options(bitmapType='cairo')
-knitr::opts_chunk$set(dev = c("png"))
-
 library(tidyverse)
 ```
 
 The list of differentially expressed genes is sometimes so long that its 
 interpretation becomes cumbersome and time consuming. It may also be very
 short while some genes have low p-value yet higher than the given threshold.
 
-A common downstream procedure to combine information across genes is gene set testing.
-It aims at finding pathways or gene networks the differentially expressed genes play a role in.
+A common downstream procedure to combine information across genes is gene set
+testing. It aims at finding pathways or gene networks the differentially
+expressed genes play a role in.
 
 Various ways exist to test for enrichment of biological pathways. We will look
 into over representation and gene set enrichment analyses.
 
-A gene set comprises genes that share a biological function, chromosomal location, or any other
-relevant criterion.
-
-<!--
-- Define gene set
-- gene set == "pathway"
-- over-represented == enriched
-ie pathway A is enriched in our diff exp gene list
--->
+A gene set comprises genes that share a biological function, chromosomal
+location, or any other relevant criterion.
 
 To save time and effort there are a number of packages that make applying these
 tests to a large number of gene sets simpler, and which will import gene lists 
 for testing from various sources.
 
-Today we will use [`clusterProfiler`](https://yulab-smu.github.io/clusterProfiler-book/index.html).
-
-<!--
-https://yulab-smu.github.io/clusterProfiler-book/index.html
-https://yulab-smu.github.io/clusterProfiler-book/chapter2.html#over-representation-analysis
--->
+Today we will use 
+[`clusterProfiler`](https://yulab-smu.github.io/clusterProfiler-book/index.html).
 
 # Over-representation
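
As a rough illustration of the idea behind over-representation analysis, the sketch below builds a 2x2 contingency table and applies the Fisher exact test in base R. All counts are made up for the example and do not come from the course data set.

```r
# Hypothetical counts, chosen only for illustration
n_universe  <- 20000  # all genes tested
n_in_set    <- 200    # genes annotated to the pathway
n_de        <- 1000   # differentially expressed (DE) genes
n_de_in_set <- 30     # DE genes that are also in the pathway

# Rows: DE or not; columns: in the pathway or not
contingency <- matrix(
  c(n_de_in_set,            n_de - n_de_in_set,
    n_in_set - n_de_in_set, n_universe - n_de - (n_in_set - n_de_in_set)),
  nrow = 2, byrow = TRUE,
  dimnames = list(DE = c("yes", "no"), inPathway = c("yes", "no"))
)

# One-sided test: are DE genes found in the pathway more often than expected?
fisher_p <- fisher.test(contingency, alternative = "greater")$p.value

# The same tail probability from the hypergeometric distribution
hyper_p <- phyper(n_de_in_set - 1, n_in_set, n_universe - n_in_set, n_de,
                  lower.tail = FALSE)
```

With these counts about 10 pathway genes would be expected among the DE genes by chance, so observing 30 gives a small p-value.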
 
@@ -72,15 +59,15 @@ And test for independence of the two variables with the Fisher exact test.
 ## `clusterProfiler`
 
 `clusterprofiler` [@Yu2012] supports direct online access of the current KEGG
-database (KEGG: Kyoto Encyclopedia of Genes and Genomes),
-rather than relying on R annotation packages.
+database (KEGG: Kyoto Encyclopedia of Genes and Genomes), rather than relying on
+R annotation packages.
 It also provides some nice visualisation options.
 
 We first search the resource for mouse data:
 
 ```{r loadClusterProfiler, message=FALSE}
-library(clusterProfiler)
 library(tidyverse)
+library(clusterProfiler)
 
 search_kegg_organism('mouse', by='common_name')
 ```
@@ -115,8 +102,8 @@ sigGenes <- shrink.d11 %>%
     filter(FDR < 0.05 & abs(logFC) > 1) %>% 
     pull(Entrez)
 
-kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
-head(kk, n=10) %>%  as_tibble()
+keggRes <- enrichKEGG(gene = sigGenes, organism = 'mmu')
+as_tibble(keggRes)
 ```
 
 ### Visualise a pathway in a browser
@@ -127,7 +114,7 @@ highlighting the genes we selected as differentially expressed.
 We will show one of the top hits: pathway 'mmu04612' for 'Antigen processing and presentation'.
 
 ```{r browseKegg}
-browseKEGG(kk, 'mmu04612')
+browseKEGG(keggRes, 'mmu04612')
 ```
 
 ### Visualise a pathway as a file
@@ -143,10 +130,6 @@ colour by any numeric vector, e.g. p-value).
 The package plots the KEGG pathway to a `png` file in the working directory.
 
 ```{r pathview, message=F}
-# check working directory
-#getwd()
-
-# run pathview
 library(pathview)
 logFC <- shrink.d11$logFC
 names(logFC) <- shrink.d11$Entrez
@@ -162,21 +145,17 @@ pathview(gene.data = logFC,
 
 > ### Exercise 1 {.challenge}
 >
-> 1. Use `pathview` to export a figure for "mmu04659" or "mmu04658", but this time only
-> use genes that are statistically significant at FDR < 0.01
+> 1. Use `pathview` to export a figure for "mmu04659" or "mmu04658", but this 
+> time only use genes that are statistically significant at FDR < 0.01
 >
-
-```{r solution1, eval=F}
-
-```
-
 > ### Exercise 2 - GO term enrichment analysis
 >
 > `clusterProfiler` can also perform over-representation analysis on GO terms 
 using the command `enrichGO`. Check:
 >
-> * the help page for the command `enrichGO` (type `?enrichGO` at the console prompt)
-> * and the instructions in the
+> * the help page for the command `enrichGO` (type `?enrichGO` at the console
+> prompt) 
+> * the instructions in the
 > [clusterProfiler book](http://yulab-smu.top/clusterProfiler-book/chapter5.html#go-over-representation-test).
 > 
 > 1. Run the over-representation analysis for GO terms 
@@ -193,11 +172,6 @@ using the command `enrichGO`. Check:
 > 2. Use the `dotplot` function to visualise the results.
 
 
-```{r solution2, eval=F}
-# may need devtools::install_github("YuLab-SMU/enrichplot")
-# to avoid a 'wrong orderBy parameter' warning.
-```
-
 # GSEA analysis
 
 Gene Set Enrichment Analysis (GSEA) identifies gene sets that are related to the
@@ -226,27 +200,29 @@ library(msigdbr)
 The analysis is performed by:
 
 1. ranking all genes in the data set  
-2. identifying in the ranked data set the rank positions of all members of the gene set 
-3. calculating an enrichment score (ES) that represents the difference 
-between the observed rankings and that which would be expected assuming a random 
-rank distribution.
+2. identifying in the ranked data set the rank positions of all members of the 
+gene set 
+3. calculating an enrichment score (ES) that represents the difference between
+the observed rankings and that which would be expected assuming a random rank
+distribution.
 
 The article describing the original software is available 
 [here](http://www.pnas.org/content/102/43/15545.long),
-while this [commentary on GSEA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266131/) provides a shorter description.
+while this 
+[commentary on GSEA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266131/) 
+provides a shorter description.
 
 ![](../images/gseaArticleFig1.png)
 
 We will use `clusterProfiler`'s [`GSEA`](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis) 
 package [@Yu2012] that implements the same algorithm in R. 
 
-<!-- We will use `GSEA` in `clusterProfiler`. -->
-
 ## Rank genes
 
-We need to provide `GSEA` with a vector containing values for a given gene mtric, e.g. log(fold change), sorted in decreasing order.
+We need to provide `GSEA` with a vector containing values for a given gene
+metric, e.g. log(fold change), sorted in decreasing order.
 
-To start with we will simply use a rank based on their fold change.
+To start with we will simply rank the genes based on their fold change.
 
 We must exclude genes with no Entrez ID.
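
To make the ranking and the enrichment score more concrete, here is a simplified base R sketch of the classic weighted running-sum statistic. It is only an illustration under stated assumptions: the fold changes, the Entrez IDs and the gene set are simulated, and it leaves out the normalisation and significance estimation that `GSEA` performs.

```r
set.seed(42)

# Hypothetical log fold changes for 100 genes with made-up Entrez IDs
lfc <- rnorm(100)
names(lfc) <- as.character(1001:1100)

# Step 1: rank all genes, largest value first
ranked <- sort(lfc, decreasing = TRUE)

# Hypothetical gene set whose members sit near the top of the ranking
geneSet <- names(ranked)[c(1, 2, 3, 5, 8, 13)]

# Step 2: locate the gene set members in the ranked list
inSet <- names(ranked) %in% geneSet

# Step 3: walk down the list, stepping up at each member (weighted by its
# metric) and down at each non-member, so the walk ends at zero
hitWeight    <- abs(ranked) * inSet
stepUp       <- hitWeight / sum(hitWeight)
stepDown     <- (!inSet) / sum(!inSet)
runningScore <- cumsum(stepUp - stepDown)

# The enrichment score is the largest deviation of the walk from zero
ES <- runningScore[which.max(abs(runningScore))]
```

Plotting `runningScore` against rank position gives a curve of the same kind as the one in the figure above.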
 
@@ -255,18 +231,23 @@ Also, we should use the shrunk LFC values.
 ```{r preparedata}
 rankedGenes <- shrink.d11 %>%
   drop_na(Entrez) %>%
-  mutate(rank = logFC) %>%
-  arrange(-rank) %>%
+  arrange(desc(logFC)) %>%
-  pull(rank,Entrez)
+  pull(logFC, Entrez)
 ```
 
 ## Load pathways
 
-We will load the MSigDB Hallmark gene set with `msigdbr`, setting the `category` parameter to 'H' for **H**allmark gene set. The object created is a `tibble` with information on each {gene set; gene} pair (one per row). We will only keep the the gene set name, gene Entrez ID and symbol, in mouse.
+We will load the MSigDB Hallmark gene set with `msigdbr`, setting the `category`
+parameter to 'H' for **H**allmark gene set. The object created is a `tibble`
+with information on each {gene set; gene} pair (one per row). We will only keep
+the gene set name and the gene Entrez ID.
 
 ```{r loadPathways_msigdbr}
-m_H_t2g <- msigdbr(species = "Mus musculus", category = "H") %>% 
-  dplyr::select(gs_name, entrez_gene, gene_symbol)
+term2gene <- msigdbr(species = "Mus musculus", category = "H") %>% 
+  select(gs_name, entrez_gene)
+term2name <- msigdbr(species = "Mus musculus", category = "H") %>% 
+  select(gs_name, gs_description) %>% 
+  distinct()
 ```
 
 ## Conduct analysis
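
The gene set definitions are handed to `GSEA` as plain tables with one row per {gene set; gene} pair. The toy example below, with made-up set names, descriptions and Entrez IDs, sketches the shape of the two tables and how the gene set sizes that `minGSSize` and `maxGSSize` refer to can be checked.

```r
# Made-up annotation, for illustration only
annotation <- data.frame(
  gs_name        = c("SET_A", "SET_A", "SET_A", "SET_B", "SET_B"),
  gs_description = c(rep("Hypothetical set A", 3), rep("Hypothetical set B", 2)),
  entrez_gene    = c(11, 22, 33, 22, 44)
)

# Two columns: gene set name, then gene ID (one row per pair)
term2geneToy <- annotation[, c("gs_name", "entrez_gene")]

# Two columns: gene set name, then description (one row per gene set)
term2nameToy <- unique(annotation[, c("gs_name", "gs_description")])

# Number of genes in each set
setSizes <- table(term2geneToy$gs_name)
```

Note that a gene can belong to several sets, as gene `22` does here.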
@@ -280,35 +261,31 @@ Arguments passed to `GSEA` include:
 
 ```{r runGsea}
 gseaRes <- GSEA(rankedGenes,
-                TERM2GENE = m_H_t2g[,1:2],
-                #pvalueCutoff = 0.05,
-                pvalueCutoff = 1.00, # to retrieve whole output
+                TERM2GENE = term2gene,
+                TERM2NAME = term2name,
+                pvalueCutoff = 1.00, 
                 minGSSize = 15,
                 maxGSSize = 500)
 ```
 
 Let's look at the top 10 results.
 
-```{r top10GseaPrint, echo=!FALSE}
-# have function to format in scientific notation
-format.e1 <- function(x) (sprintf("%.1e", x))
-# show table
-gseaRes %>%
-  arrange(desc(abs(NES))) %>%
-  top_n(10, -p.adjust) %>%
-  dplyr::select(-core_enrichment) %>%
-  dplyr::select(-Description) %>%
-  data.frame() %>%
-  remove_rownames() %>%
-  # format
-  mutate(ES=formatC(enrichmentScore, digits = 3)) %>%
-  mutate(NES=formatC(NES, digits = 3)) %>%
-  # format p-values
-  modify_at(
-    c("pvalue", "p.adjust", "qvalues"),
-    format.e1
-  ) %>%
-  DT::datatable(options = list(dom = 't'))
+```{r top10GseaPrint, eval=FALSE}
+as_tibble(gseaRes) %>% 
+  arrange(desc(abs(NES))) %>% 
+  top_n(10, wt=-p.adjust) %>% 
+  select(-core_enrichment) %>%
+  mutate(across(c("enrichmentScore", "NES"), round, digits=3)) %>% 
+  mutate(across(c("pvalue", "p.adjust", "qvalues"), scales::scientific))
+```
+```{r top10GseaPrintactual, echo=FALSE}
+as_tibble(gseaRes) %>% 
+  arrange(desc(abs(NES))) %>% 
+  top_n(10, wt=-p.adjust) %>% 
+  select(-core_enrichment) %>%
+  mutate(across(c("enrichmentScore", "NES"), round, digits=3)) %>% 
+  mutate(across(c("pvalue", "p.adjust", "qvalues"), scales::scientific)) %>% 
+  DT::datatable(options = list(dom = 't'))
 ```
 
 ## Enrichment score plot
@@ -320,27 +297,10 @@ pathway (no tick for genes not in the pathway)
 * the enrichment score: the green curve shows the difference between the observed
 rankings and that which would be expected assuming a random rank distribution.
 
-```{r }
-# HALLMARK_INFLAMMATORY_RESPONSE is 4th
-topx <- match("HALLMARK_INFLAMMATORY_RESPONSE", data.frame(gseaRes)$ID)
-```
-
-Gene log(fold change):
-
-```{r gseaEnrichmentPlot_preranked}
-gseaplot(gseaRes, geneSetID = topx, by = "preranked", title = gseaRes$Description[topx])
-```
-
-Running score:
-
-```{r gseaEnrichmentPlot_runningScore}
-gseaplot(gseaRes, geneSetID = topx, by = "runningScore", title = gseaRes$Description[topx])
-```
-
-Both the log(fold change) and running score:
-
 ```{r gseaEnrichmentPlot_both}
-gseaplot(gseaRes, geneSetID = topx, title = gseaRes$Description[topx])
+gseaplot(gseaRes, 
+         geneSetID = "HALLMARK_INFLAMMATORY_RESPONSE", 
+         title = "HALLMARK_INFLAMMATORY_RESPONSE")
 ```
 
 Remember to check the [GSEA 
@@ -356,14 +316,10 @@ explanation.
 > 1. Rank the genes by statistical significance - you will need to create
 > a new ranking value using `-log10({p value}) * sign({Fold Change})`.
 > 2. Run `fgsea` using the new ranked genes and the H pathways.
-> 3. Conduct the same analysis for the d33 vs control contrast.
+> 3. Conduct the same analysis for the day 33 Infected vs Uninfected contrast.
 > Extended: Do results differ between ranking scheme?  
-> Extended: Do results differ between d11 and d33, with the significance-based ranking scheme?  
-
-```{r solution3}
-
-```
-
+> Extended: Do results differ between day 11 and day 33, with the 
+> significance-based ranking scheme?  
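
As a hint for the significance-based ranking, the sketch below applies the formula `-log10({p value}) * sign({Fold Change})` to a toy table. The column names and values are hypothetical and will differ from those in the course results.

```r
# Toy results table with made-up values
res <- data.frame(
  Entrez = c("101", "102", "103", "104"),
  logFC  = c(2.0, -1.5, 0.3, -0.1),
  pvalue = c(1e-8, 1e-4, 0.2, 0.9)
)

# Signed significance: large and positive for strongly up-regulated genes,
# large and negative for strongly down-regulated genes
res$rankStat <- -log10(res$pvalue) * sign(res$logFC)

# Named vector sorted in decreasing order, as expected for a ranked gene list
rankedToy <- sort(setNames(res$rankStat, res$Entrez), decreasing = TRUE)
```

Genes with weak evidence end up in the middle of the ranking whatever the size of their fold change.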
 
 ---------------------------------------------------------------