@@ -7,47 +7,34 @@ output:
77 toc : yes
88 html_document :
99 toc : yes
10- always_allow_html : true
1110bibliography : ref.bib
11+ always_allow_html : true
1212---
1313
1414``` {r setup, include=FALSE, cache=FALSE}
15- options(bitmapType='cairo')
16- knitr::opts_chunk$set(dev = c("png"))
17-
1815library(tidyverse)
1916```
2017
2118The list of differentially expressed genes is sometimes so long that its
2219interpretation becomes cumbersome and time consuming. It may also be very
2320short while some genes have low p-value yet higher than the given threshold.
2421
25- A common downstream procedure to combine information across genes is gene set testing.
26- It aims at finding pathways or gene networks the differentially expressed genes play a role in.
22+ A common downstream procedure to combine information across genes is gene set
23+ testing. It aims at finding pathways or gene networks the differentially
24+ expressed genes play a role in.
2725
2826Various ways exist to test for enrichment of biological pathways. We will look
2927into over representation and gene set enrichment analyses.
3028
31- A gene set comprises genes that share a biological function, chromosomal location, or any other
32- relevant criterion.
33-
34- <!--
35- - Define gene set
36- - gene set == "pathway"
37- - over-represented == enriched
38- ie pathway A is enriched in our diff exp gene list
39- -->
29+ A gene set comprises genes that share a biological function, chromosomal
30+ location, or any other relevant criterion.
4031
4132To save time and effort there are a number of packages that make applying these
4233tests to a large number of gene sets simpler, and which will import gene lists
4334for testing from various sources.
4435
45- Today we will use [ ` clusterProfiler ` ] ( https://yulab-smu.github.io/clusterProfiler-book/index.html ) .
46-
47- <!--
48- https://yulab-smu.github.io/clusterProfiler-book/index.html
49- https://yulab-smu.github.io/clusterProfiler-book/chapter2.html#over-representation-analysis
50- -->
36+ Today we will use
37+ [ ` clusterProfiler ` ] ( https://yulab-smu.github.io/clusterProfiler-book/index.html ) .
5138
5239# Over-representation
5340
@@ -72,15 +59,15 @@ And test for independence of the two variables with the Fisher exact test.
7259## ` clusterProfiler `
7360
7461` clusterprofiler ` [ @Yu2012 ] supports direct online access of the current KEGG
75- database (KEGG: Kyoto Encyclopedia of Genes and Genomes),
76- rather than relying on R annotation packages.
62+ database (KEGG: Kyoto Encyclopedia of Genes and Genomes), rather than relying on
63+ R annotation packages.
7764It also provides some nice visualisation options.
7865
7966We first search the resource for mouse data:
8067
8168``` {r loadClusterProfiler, message=FALSE}
82- library(clusterProfiler)
8369library(tidyverse)
70+ library(clusterProfiler)
8471
8572search_kegg_organism('mouse', by='common_name')
8673```
@@ -115,8 +102,8 @@ sigGenes <- shrink.d11 %>%
115102 filter(FDR < 0.05 & abs(logFC) > 1) %>%
116103 pull(Entrez)
117104
118- kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
119- head(kk, n=10) %>% as_tibble()
105+ keggRes <- enrichKEGG(gene = sigGenes, organism = 'mmu')
106+ as_tibble(keggRes )
120107```
121108
122109### Visualise a pathway in a browser
@@ -127,7 +114,7 @@ highlighting the genes we selected as differentially expressed.
127114We will show one of the top hits: pathway 'mmu04612' for 'Antigen processing and presentation'.
128115
129116``` {r browseKegg}
130- browseKEGG(kk , 'mmu04612')
117+ browseKEGG(keggRes , 'mmu04612')
131118```
132119
133120### Visualise a pathway as a file
@@ -143,10 +130,6 @@ colour by any numeric vector, e.g. p-value).
143130The package plots the KEGG pathway to a ` png ` file in the working directory.
144131
145132``` {r pathview, message=F}
146- # check working directory
147- #getwd()
148-
149- # run pathview
150133library(pathview)
151134logFC <- shrink.d11$logFC
152135names(logFC) <- shrink.d11$Entrez
@@ -162,21 +145,17 @@ pathview(gene.data = logFC,
162145
163146> ### Exercise 1 {.challenge}
164147>
165- > 1 . Use ` pathview ` to export a figure for "mmu04659" or "mmu04658", but this time only
166- > use genes that are statistically significant at FDR < 0.01
148+ > 1 . Use ` pathview ` to export a figure for "mmu04659" or "mmu04658", but this
149+ > time only use genes that are statistically significant at FDR < 0.01
167150>
168-
169- ``` {r solution1, eval=F}
170-
171- ```
172-
173151> ### Exercise 2 - GO term enrichment analysis
174152>
175153> ` clusterProfiler ` can also perform over-representation analysis on GO terms
176154 using the command ` enrichGO ` . Check:
177155>
178- > * the help page for the command ` enrichGO ` (type ` ?enrichGO ` at the console prompt)
179- > * and the instructions in the
156+ > * the help page for the command ` enrichGO ` (type ` ?enrichGO ` at the console
157+ > prompt)
158+ > * the instructions in the
180159> [ clusterProfiler book] ( http://yulab-smu.top/clusterProfiler-book/chapter5.html#go-over-representation-test ) .
181160>
182161> 1 . Run the over-representation analysis for GO terms
@@ -193,11 +172,6 @@ using the command `enrichGO`. Check:
193172> 2 . Use the ` dotplot ` function to visualise the results.
194173
195174
196- ``` {r solution2, eval=F}
197- # may need devtools::install_github("YuLab-SMU/enrichplot")
198- # to avoid a 'wrong orderBy parameter' warning.
199- ```
200-
201175# GSEA analysis
202176
203177Gene Set Enrichment Analysis (GSEA) identifies gene sets that are related to the
@@ -226,27 +200,29 @@ library(msigdbr)
226200The analysis is performed by:
227201
2282021 . ranking all genes in the data set
229- 2 . identifying in the ranked data set the rank positions of all members of the gene set
230- 3 . calculating an enrichment score (ES) that represents the difference
231- between the observed rankings and that which would be expected assuming a random
232- rank distribution.
203+ 2 . identifying in the ranked data set the rank positions of all members of the
204+ gene set
205+ 3 . calculating an enrichment score (ES) that represents the difference between
206+ the observed rankings and that which would be expected assuming a random rank
207+ distribution.
233208
234209The article describing the original software is available
235210[ here] ( http://www.pnas.org/content/102/43/15545.long ) ,
236- while this [ commentary on GSEA] ( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266131/ ) provides a shorter description.
211+ while this
212+ [ commentary on GSEA] ( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266131/ )
213+ provides a shorter description.
237214
238215![ ] ( ../images/gseaArticleFig1.png )
239216
240217We will use ` clusterProfiler ` 's [ ` GSEA ` ] ( http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis )
241218package [ @Yu2012 ] that implements the same algorithm in R.
242219
243- <!-- We will use `GSEA` in `clusterProfiler`. -->
244-
245220## Rank genes
246221
247- We need to provide ` GSEA ` with a vector containing values for a given gene mtric, e.g. log(fold change), sorted in decreasing order.
222+ We need to provide ` GSEA ` with a vector containing values for a given gene
223+ mtric, e.g. log(fold change), sorted in decreasing order.
248224
249- To start with we will simply use a rank based on their fold change.
225+ To start with we will simply use a rank the genes based on their fold change.
250226
251227We must exclude genes with no Entrez ID.
252228
@@ -255,18 +231,23 @@ Also, we should use the shrunk LFC values.
255231``` {r preparedata}
256232rankedGenes <- shrink.d11 %>%
257233 drop_na(Entrez) %>%
258- mutate(rank = logFC) %>%
259- arrange(-rank) %>%
234+ arrange(desc(logFC)) %>%
260235 pull(rank,Entrez)
261236```
262237
263238## Load pathways
264239
265- We will load the MSigDB Hallmark gene set with ` msigdbr ` , setting the ` category ` parameter to 'H' for ** H** allmark gene set. The object created is a ` tibble ` with information on each {gene set; gene} pair (one per row). We will only keep the the gene set name, gene Entrez ID and symbol, in mouse.
240+ We will load the MSigDB Hallmark gene set with ` msigdbr ` , setting the ` category `
241+ parameter to 'H' for ** H** allmark gene set. The object created is a ` tibble `
242+ with information on each {gene set; gene} pair (one per row). We will only keep
243+ the the gene set name, gene Entrez ID.
266244
267245``` {r loadPathways_msigdbr}
268- m_H_t2g <- msigdbr(species = "Mus musculus", category = "H") %>%
269- dplyr::select(gs_name, entrez_gene, gene_symbol)
246+ term2gene <- msigdbr(species = "Mus musculus", category = "H") %>%
247+ select(gs_name, entrez_gene)
248+ term2name <- msigdbr(species = "Mus musculus", category = "H") %>%
249+ select(gs_name, gs_description) %>%
250+ distinct()
270251```
271252
272253## Conduct analysis
@@ -280,35 +261,31 @@ Arguments passed to `GSEA` include:
280261
281262``` {r runGsea}
282263gseaRes <- GSEA(rankedGenes,
283- TERM2GENE = m_H_t2g[,1:2] ,
284- #pvalueCutoff = 0.05 ,
285- pvalueCutoff = 1.00, # to retrieve whole output
264+ TERM2GENE = term2gene ,
265+ TERM2NAME = term2name ,
266+ pvalueCutoff = 1.00,
286267 minGSSize = 15,
287268 maxGSSize = 500)
288269```
289270
290271Let's look at the top 10 results.
291272
292- ``` {r top10GseaPrint, echo=!FALSE}
293- # have function to format in scientific notation
294- format.e1 <- function(x) (sprintf("%.1e", x))
295- # show table
296- gseaRes %>%
297- arrange(desc(abs(NES))) %>%
298- top_n(10, -p.adjust) %>%
299- dplyr::select(-core_enrichment) %>%
300- dplyr::select(-Description) %>%
301- data.frame() %>%
302- remove_rownames() %>%
303- # format
304- mutate(ES=formatC(enrichmentScore, digits = 3)) %>%
305- mutate(NES=formatC(NES, digits = 3)) %>%
306- # format p-values
307- modify_at(
308- c("pvalue", "p.adjust", "qvalues"),
309- format.e1
310- ) %>%
311- DT::datatable(options = list(dom = 't'))
273+ ``` {r top10GseaPrint, eval=FALSE}
274+ as_tibble(gseaRes) %>%
275+ arrange(desc(abs(NES))) %>%
276+ top_n(10, wt=-p.adjust) %>%
277+ select(-core_enrichment) %>%
278+ mutate(across(c("enrichmentScore", "NES"), round, digits=3)) %>%
279+ mutate(across(c("pvalue", "p.adjust", "qvalues"), scales::scientific))
280+ ```
281+ ``` {r top10GseaPrintactual, echo=FALSE}
282+ as_tibble(gseaRes) %>%
283+ arrange(desc(abs(NES))) %>%
284+ top_n(10, wt=-p.adjust) %>%
285+ select(-core_enrichment) %>%
286+ mutate(across(c("enrichmentScore", "NES"), round, digits=3)) %>%
287+ mutate(across(c("pvalue", "p.adjust", "qvalues"), scales::scientific)) %>%
288+ DT::datatable(option=list(dom='t'))
312289```
313290
314291## Enrichment score plot
@@ -320,27 +297,10 @@ pathway (no tick for genes not in the pathway)
320297* the enrichment score: the green curve shows the difference between the observed
321298rankings and that which would be expected assuming a random rank distribution.
322299
323- ``` {r }
324- # HALLMARK_INFLAMMATORY_RESPONSE is 4th
325- topx <- match("HALLMARK_INFLAMMATORY_RESPONSE", data.frame(gseaRes)$ID)
326- ```
327-
328- Gene log(fold change):
329-
330- ``` {r gseaEnrichmentPlot_preranked}
331- gseaplot(gseaRes, geneSetID = topx, by = "preranked", title = gseaRes$Description[topx])
332- ```
333-
334- Running score:
335-
336- ``` {r gseaEnrichmentPlot_runningScore}
337- gseaplot(gseaRes, geneSetID = topx, by = "runningScore", title = gseaRes$Description[topx])
338- ```
339-
340- Both the log(fold change) and running score:
341-
342300``` {r gseaEnrichmentPlot_both}
343- gseaplot(gseaRes, geneSetID = topx, title = gseaRes$Description[topx])
301+ gseaplot(gseaRes,
302+ geneSetID = "HALLMARK_INFLAMMATORY_RESPONSE",
303+ title = "HALLMARK_INFLAMMATORY_RESPONSE")
344304```
345305
346306Remember to check the [ GSEA
@@ -356,14 +316,10 @@ explanation.
356316> 1 . Rank the genes by statistical significance - you will need to create
357317> a new ranking value using ` -log10({p value}) * sign({Fold Change}) ` .
358318> 2 . Run ` fgsea ` using the new ranked genes and the H pathways.
359- > 3 . Conduct the same analysis for the d33 vs control contrast.
319+ > 3 . Conduct the same analysis for the day 33 Infected vs Uninfected contrast.
360320> Extended: Do results differ between ranking scheme?
361- > Extended: Do results differ between d11 and d33, with the significance-based ranking scheme?
362-
363- ``` {r solution3}
364-
365- ```
366-
321+ > Extended: Do results differ between day 11 and day 33, with the
322+ > significance-basedranking scheme?
367323
368324---------------------------------------------------------------
369325
0 commit comments