Skip to content

Commit c5a782a

Browse files
committed
Update GSE section
1 parent ed99085 commit c5a782a

File tree

6 files changed

+357
-591
lines changed

6 files changed

+357
-591
lines changed

Markdowns/12_Gene_set_testing.Rmd

Lines changed: 65 additions & 109 deletions
Original file line numberDiff line numberDiff line change
@@ -7,47 +7,34 @@ output:
77
toc: yes
88
html_document:
99
toc: yes
10-
always_allow_html: true
1110
bibliography: ref.bib
11+
always_allow_html: true
1212
---
1313

1414
```{r setup, include=FALSE, cache=FALSE}
15-
options(bitmapType='cairo')
16-
knitr::opts_chunk$set(dev = c("png"))
17-
1815
library(tidyverse)
1916
```
2017

2118
The list of differentially expressed genes is sometimes so long that its
2219
interpretation becomes cumbersome and time consuming. It may also be very
2320
short while some genes have low p-value yet higher than the given threshold.
2421

25-
A common downstream procedure to combine information across genes is gene set testing.
26-
It aims at finding pathways or gene networks the differentially expressed genes play a role in.
22+
A common downstream procedure to combine information across genes is gene set
23+
testing. It aims at finding pathways or gene networks the differentially
24+
expressed genes play a role in.
2725

2826
Various ways exist to test for enrichment of biological pathways. We will look
2927
into over representation and gene set enrichment analyses.
3028

31-
A gene set comprises genes that share a biological function, chromosomal location, or any other
32-
relevant criterion.
33-
34-
<!--
35-
- Define gene set
36-
- gene set == "pathway"
37-
- over-represented == enriched
38-
ie pathway A is enriched in our diff exp gene list
39-
-->
29+
A gene set comprises genes that share a biological function, chromosomal
30+
location, or any other relevant criterion.
4031

4132
To save time and effort there are a number of packages that make applying these
4233
tests to a large number of gene sets simpler, and which will import gene lists
4334
for testing from various sources.
4435

45-
Today we will use [`clusterProfiler`](https://yulab-smu.github.io/clusterProfiler-book/index.html).
46-
47-
<!--
48-
https://yulab-smu.github.io/clusterProfiler-book/index.html
49-
https://yulab-smu.github.io/clusterProfiler-book/chapter2.html#over-representation-analysis
50-
-->
36+
Today we will use
37+
[`clusterProfiler`](https://yulab-smu.github.io/clusterProfiler-book/index.html).
5138

5239
# Over-representation
5340

@@ -72,15 +59,15 @@ And test for independence of the two variables with the Fisher exact test.
7259
## `clusterProfiler`
7360

7461
`clusterprofiler` [@Yu2012] supports direct online access of the current KEGG
75-
database (KEGG: Kyoto Encyclopedia of Genes and Genomes),
76-
rather than relying on R annotation packages.
62+
database (KEGG: Kyoto Encyclopedia of Genes and Genomes), rather than relying on
63+
R annotation packages.
7764
It also provides some nice visualisation options.
7865

7966
We first search the resource for mouse data:
8067

8168
```{r loadClusterProfiler, message=FALSE}
82-
library(clusterProfiler)
8369
library(tidyverse)
70+
library(clusterProfiler)
8471
8572
search_kegg_organism('mouse', by='common_name')
8673
```
@@ -115,8 +102,8 @@ sigGenes <- shrink.d11 %>%
115102
filter(FDR < 0.05 & abs(logFC) > 1) %>%
116103
pull(Entrez)
117104
118-
kk <- enrichKEGG(gene = sigGenes, organism = 'mmu')
119-
head(kk, n=10) %>% as_tibble()
105+
keggRes <- enrichKEGG(gene = sigGenes, organism = 'mmu')
106+
as_tibble(keggRes)
120107
```
121108

122109
### Visualise a pathway in a browser
@@ -127,7 +114,7 @@ highlighting the genes we selected as differentially expressed.
127114
We will show one of the top hits: pathway 'mmu04612' for 'Antigen processing and presentation'.
128115

129116
```{r browseKegg}
130-
browseKEGG(kk, 'mmu04612')
117+
browseKEGG(keggRes, 'mmu04612')
131118
```
132119

133120
### Visualise a pathway as a file
@@ -143,10 +130,6 @@ colour by any numeric vector, e.g. p-value).
143130
The package plots the KEGG pathway to a `png` file in the working directory.
144131

145132
```{r pathview, message=F}
146-
# check working directory
147-
#getwd()
148-
149-
# run pathview
150133
library(pathview)
151134
logFC <- shrink.d11$logFC
152135
names(logFC) <- shrink.d11$Entrez
@@ -162,21 +145,17 @@ pathview(gene.data = logFC,
162145

163146
> ### Exercise 1 {.challenge}
164147
>
165-
> 1. Use `pathview` to export a figure for "mmu04659" or "mmu04658", but this time only
166-
> use genes that are statistically significant at FDR < 0.01
148+
> 1. Use `pathview` to export a figure for "mmu04659" or "mmu04658", but this
149+
> time only use genes that are statistically significant at FDR < 0.01
167150
>
168-
169-
```{r solution1, eval=F}
170-
171-
```
172-
173151
> ### Exercise 2 - GO term enrichment analysis
174152
>
175153
> `clusterProfiler` can also perform over-representation analysis on GO terms
176154
using the command `enrichGO`. Check:
177155
>
178-
> * the help page for the command `enrichGO` (type `?enrichGO` at the console prompt)
179-
> * and the instructions in the
156+
> * the help page for the command `enrichGO` (type `?enrichGO` at the console
157+
> prompt)
158+
> * the instructions in the
180159
> [clusterProfiler book](http://yulab-smu.top/clusterProfiler-book/chapter5.html#go-over-representation-test).
181160
>
182161
> 1. Run the over-representation analysis for GO terms
@@ -193,11 +172,6 @@ using the command `enrichGO`. Check:
193172
> 2. Use the `dotplot` function to visualise the results.
194173
195174

196-
```{r solution2, eval=F}
197-
# may need devtools::install_github("YuLab-SMU/enrichplot")
198-
# to avoid a 'wrong orderBy parameter' warning.
199-
```
200-
201175
# GSEA analysis
202176

203177
Gene Set Enrichment Analysis (GSEA) identifies gene sets that are related to the
@@ -226,27 +200,29 @@ library(msigdbr)
226200
The analysis is performed by:
227201

228202
1. ranking all genes in the data set
229-
2. identifying in the ranked data set the rank positions of all members of the gene set
230-
3. calculating an enrichment score (ES) that represents the difference
231-
between the observed rankings and that which would be expected assuming a random
232-
rank distribution.
203+
2. identifying in the ranked data set the rank positions of all members of the
204+
gene set
205+
3. calculating an enrichment score (ES) that represents the difference between
206+
the observed rankings and that which would be expected assuming a random rank
207+
distribution.
233208

234209
The article describing the original software is available
235210
[here](http://www.pnas.org/content/102/43/15545.long),
236-
while this [commentary on GSEA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266131/) provides a shorter description.
211+
while this
212+
[commentary on GSEA](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1266131/)
213+
provides a shorter description.
237214

238215
![](../images/gseaArticleFig1.png)
239216

240217
We will use `clusterProfiler`'s [`GSEA`](http://yulab-smu.top/clusterProfiler-book/chapter2.html#gene-set-enrichment-analysis)
241218
package [@Yu2012] that implements the same algorithm in R.
242219

243-
<!-- We will use `GSEA` in `clusterProfiler`. -->
244-
245220
## Rank genes
246221

247-
We need to provide `GSEA` with a vector containing values for a given gene mtric, e.g. log(fold change), sorted in decreasing order.
222+
We need to provide `GSEA` with a vector containing values for a given gene
223+
mtric, e.g. log(fold change), sorted in decreasing order.
248224

249-
To start with we will simply use a rank based on their fold change.
225+
To start with we will simply use a rank the genes based on their fold change.
250226

251227
We must exclude genes with no Entrez ID.
252228

@@ -255,18 +231,23 @@ Also, we should use the shrunk LFC values.
255231
```{r preparedata}
256232
rankedGenes <- shrink.d11 %>%
257233
drop_na(Entrez) %>%
258-
mutate(rank = logFC) %>%
259-
arrange(-rank) %>%
234+
arrange(desc(logFC)) %>%
260235
pull(rank,Entrez)
261236
```
262237

263238
## Load pathways
264239

265-
We will load the MSigDB Hallmark gene set with `msigdbr`, setting the `category` parameter to 'H' for **H**allmark gene set. The object created is a `tibble` with information on each {gene set; gene} pair (one per row). We will only keep the the gene set name, gene Entrez ID and symbol, in mouse.
240+
We will load the MSigDB Hallmark gene set with `msigdbr`, setting the `category`
241+
parameter to 'H' for **H**allmark gene set. The object created is a `tibble`
242+
with information on each {gene set; gene} pair (one per row). We will only keep
243+
the the gene set name, gene Entrez ID.
266244

267245
```{r loadPathways_msigdbr}
268-
m_H_t2g <- msigdbr(species = "Mus musculus", category = "H") %>%
269-
dplyr::select(gs_name, entrez_gene, gene_symbol)
246+
term2gene <- msigdbr(species = "Mus musculus", category = "H") %>%
247+
select(gs_name, entrez_gene)
248+
term2name <- msigdbr(species = "Mus musculus", category = "H") %>%
249+
select(gs_name, gs_description) %>%
250+
distinct()
270251
```
271252

272253
## Conduct analysis
@@ -280,35 +261,31 @@ Arguments passed to `GSEA` include:
280261

281262
```{r runGsea}
282263
gseaRes <- GSEA(rankedGenes,
283-
TERM2GENE = m_H_t2g[,1:2],
284-
#pvalueCutoff = 0.05,
285-
pvalueCutoff = 1.00, # to retrieve whole output
264+
TERM2GENE = term2gene,
265+
TERM2NAME = term2name,
266+
pvalueCutoff = 1.00,
286267
minGSSize = 15,
287268
maxGSSize = 500)
288269
```
289270

290271
Let's look at the top 10 results.
291272

292-
```{r top10GseaPrint, echo=!FALSE}
293-
# have function to format in scientific notation
294-
format.e1 <- function(x) (sprintf("%.1e", x))
295-
# show table
296-
gseaRes %>%
297-
arrange(desc(abs(NES))) %>%
298-
top_n(10, -p.adjust) %>%
299-
dplyr::select(-core_enrichment) %>%
300-
dplyr::select(-Description) %>%
301-
data.frame() %>%
302-
remove_rownames() %>%
303-
# format
304-
mutate(ES=formatC(enrichmentScore, digits = 3)) %>%
305-
mutate(NES=formatC(NES, digits = 3)) %>%
306-
# format p-values
307-
modify_at(
308-
c("pvalue", "p.adjust", "qvalues"),
309-
format.e1
310-
) %>%
311-
DT::datatable(options = list(dom = 't'))
273+
```{r top10GseaPrint, eval=FALSE}
274+
as_tibble(gseaRes) %>%
275+
arrange(desc(abs(NES))) %>%
276+
top_n(10, wt=-p.adjust) %>%
277+
select(-core_enrichment) %>%
278+
mutate(across(c("enrichmentScore", "NES"), round, digits=3)) %>%
279+
mutate(across(c("pvalue", "p.adjust", "qvalues"), scales::scientific))
280+
```
281+
```{r top10GseaPrintactual, echo=FALSE}
282+
as_tibble(gseaRes) %>%
283+
arrange(desc(abs(NES))) %>%
284+
top_n(10, wt=-p.adjust) %>%
285+
select(-core_enrichment) %>%
286+
mutate(across(c("enrichmentScore", "NES"), round, digits=3)) %>%
287+
mutate(across(c("pvalue", "p.adjust", "qvalues"), scales::scientific)) %>%
288+
DT::datatable(option=list(dom='t'))
312289
```
313290

314291
## Enrichment score plot
@@ -320,27 +297,10 @@ pathway (no tick for genes not in the pathway)
320297
* the enrichment score: the green curve shows the difference between the observed
321298
rankings and that which would be expected assuming a random rank distribution.
322299

323-
```{r }
324-
# HALLMARK_INFLAMMATORY_RESPONSE is 4th
325-
topx <- match("HALLMARK_INFLAMMATORY_RESPONSE", data.frame(gseaRes)$ID)
326-
```
327-
328-
Gene log(fold change):
329-
330-
```{r gseaEnrichmentPlot_preranked}
331-
gseaplot(gseaRes, geneSetID = topx, by = "preranked", title = gseaRes$Description[topx])
332-
```
333-
334-
Running score:
335-
336-
```{r gseaEnrichmentPlot_runningScore}
337-
gseaplot(gseaRes, geneSetID = topx, by = "runningScore", title = gseaRes$Description[topx])
338-
```
339-
340-
Both the log(fold change) and running score:
341-
342300
```{r gseaEnrichmentPlot_both}
343-
gseaplot(gseaRes, geneSetID = topx, title = gseaRes$Description[topx])
301+
gseaplot(gseaRes,
302+
geneSetID = "HALLMARK_INFLAMMATORY_RESPONSE",
303+
title = "HALLMARK_INFLAMMATORY_RESPONSE")
344304
```
345305

346306
Remember to check the [GSEA
@@ -356,14 +316,10 @@ explanation.
356316
> 1. Rank the genes by statistical significance - you will need to create
357317
> a new ranking value using `-log10({p value}) * sign({Fold Change})`.
358318
> 2. Run `fgsea` using the new ranked genes and the H pathways.
359-
> 3. Conduct the same analysis for the d33 vs control contrast.
319+
> 3. Conduct the same analysis for the day 33 Infected vs Uninfected contrast.
360320
> Extended: Do results differ between ranking scheme?
361-
> Extended: Do results differ between d11 and d33, with the significance-based ranking scheme?
362-
363-
```{r solution3}
364-
365-
```
366-
321+
> Extended: Do results differ between day 11 and day 33, with the
322+
> significance-basedranking scheme?
367323
368324
---------------------------------------------------------------
369325

0 commit comments

Comments
 (0)