Skip to content

Commit 61b7e77

Browse files
committed
last tweak to webgr
1 parent 8faffef commit 61b7e77

File tree

1 file changed

+34
-23
lines changed

1 file changed

+34
-23
lines changed

day2_Rnotebooks/WebGestaltR.Rmd

Lines changed: 34 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,6 @@ output:
1010
library(WebGestaltR)
1111
library(readr)
1212
library(dplyr)
13-
library(enrichplot)
1413
```
1514

1615
# 0. Working directory
@@ -33,14 +32,14 @@ dir.create("WebGestaltR_results")
3332

3433
WebGestaltR supports 12 species directly, however, you can import your own database files to perform ORA and GSEA for novel species :-) We will do that in the next session.
3534

36-
The next two commands have a default setting of `organism = "hsapiens"` (hint: check with `?listOrganism` and `?listIdType`).
37-
38-
Let's view the list of supported organisms. We don't have to specify any arguments to the function as the defaults do what we need. We do however need the empty brackets. Without them will print the function source code.
35+
Let's view the list of supported organisms. We don't have to specify any arguments, but we do need the empty brackets. Without them will print the function source code.
3936

4037
```{r list organisms }
4138
listOrganism()
4239
```
4340

41+
The next two commands have a default setting of `organism = "hsapiens"`, so running without any argument will show the genesets (databases) and ID types (namespaces) that are supported for human.
42+
4443
View databases for human. Use the black arrow on the right of the table to view the other 2 columns, and use the numbers below the table (or 'next') to view the next 10 rows.
4544

4645
```{r list databases}
@@ -71,10 +70,6 @@ listGeneSet(organism = "cfamiliaris")
7170
```
7271

7372

74-
75-
76-
77-
7873
# 2. Load input data and extract gene lists
7974

8075
We will use the same Pezzini RNAseq dataset as earlier. Since we have previously saved our ranked list, DEGs and background genes to the `workshop` folder, we could import those. However, clarity of how the gene list inputs were made is retained within the notebook, and this enhances reproducibility. Gene lists are quick and simple to extract from the input data. If the process was slow and compute-intensive, we would instead document the source and methods behind the gene lists in the notebook comments instead of re-creating them.
@@ -100,30 +95,32 @@ Bring up the help menu for the `WebGestaltR` function and spend a few minutes re
10095

10196
There are quite a few! For many of them (eg gene set size filters, multiple testing correction method, P value cutoff) the default settings are suitable.
10297

103-
In particular, look for the arguments that control:
98+
In particular, look for the parameters that control:
10499

105100
- whether ORA, GSEA or NTA is performed
106101

107102
- which database/s to run enrichment on
108103

109-
- what is the namespace for the gene list query
104+
- what is the namespace/gee ID type for the gene list query
110105

111106
- how to specify the input gene list/s
112107

113-
Hopefully you've discovered that the `WebGestaltR` function can intake either gene lists from files (as long as the right column format and file suffix is provided) or R objects.
108+
Hopefully you've discovered that the `WebGestaltR` function can intake EITHER gene lists from files (as long as the right column format and file suffix is provided) or R objects.
109+
110+
Since we have decided to extract the gene lists from the DE matrix to R objects, we need to provide the gene list object to `interestGene` parameter (and `referenceGene` for ORA background).
114111

115-
Since we have decided to extract the gene lists from the DE matrix to R objects, we need to provide the gene list object to `interestGene` parameter (and `referenceGene` for ORA background). For ORA, the gene lists need to be vectors, and for GSEA, a 2-column dataframe (unlike `clusterProfiler`, which requires a GSEA vector).
112+
For ORA, the gene lists need to be vectors, and for GSEA, a 2-column dataframe (unlike `clusterProfiler`, which requires a GSEA vector).
116113

117114
Our input matrix contains ENSEMBL IDs as well as official gene symbols, so we could use "ensembl_gene_id" or "genesymbol" for the parameter `interestGeneType`. Let's extract the ENSEMBL IDs since they are more specific than symbol.
118115

119116

120117
```{r extract ora gene list vectors}
121-
# Filter genes with adjusted p-value < 0.01 and absolute log2 fold change > 2
118+
# Filter genes with adjusted p-value < 0.01 and absolute log2 fold change > 2 and saved as 'DEGs' vector
122119
DEGs <- data %>%
123120
filter(FDR < 0.01, abs(Log2FC) > 2) %>%
124121
pull(Gene.ID)
125122
126-
# Extract all gene IDs as the background
123+
# Extract all gene IDs as the 'background' vector
127124
background <- data %>%
128125
pull(Gene.ID)
129126
@@ -136,7 +133,7 @@ cat("Fist 6 background genes:", head(background), "\n")
136133

137134

138135
```{r extract ranked gene list dataframe}
139-
# extract ranked dataframe
136+
# extract ranked dataframe, saved as 'ranked' object
140137
ranked <- data %>%
141138
arrange(desc(Log2FC)) %>%
142139
dplyr::select(Gene.ID, Log2FC)
@@ -154,13 +151,13 @@ tail(ranked)
154151
For this task, let's focus on the pathway gene sets. From skimming the output of `listGeneSet()` there were a few. We could manually locate these and copy them in to our list, or take advantage of the fact that the `WebGestaltR` developers have been systematic in the gene set naming, ensuring all database names are prefixed with their type, ie `geneontology_`, `pathway_`, `network_`, plus a few others.
155152

156153
```{r select pathway databases}
157-
# Save the list of databases for human
154+
# Save the databases for human
158155
databases <- listGeneSet()
159156
160157
# Extract the the pathways from the 'name' column that start with 'pathway'
161158
pathway_dbs <- subset(databases, grepl("^pathway", name))
162159
163-
# Save the pathway names to a list
160+
# Save the pathway 'names' column to a list
164161
pathway_names <- pathway_dbs$name
165162
166163
# Check the list
@@ -194,20 +191,26 @@ WebGestaltR(
194191
referenceGene = background, # Background gene list
195192
referenceGeneType = "ensembl_gene_id", # Gene ID type for background
196193
enrichDatabase = pathway_names, # Database name or list of databases to enrich over
197-
isOutput = TRUE, # Set to FALSE if you don't want files saved to disk
194+
isOutput = TRUE, # yes save report files saved to disk
198195
fdrMethod = "BH", # Multiple testing correction method (BH = Benjamini-Hochberg)
199196
sigMethod = "fdr", # Significance method ('fdr' = false discover rate)
200197
fdrThr = 0.05, # FDR significance threshold
201198
minNum = 10, # Minimum number of genes in a gene set to include
202199
maxNum = 500, # Maximum number of genes in a gene set to include
203-
outputDirectory = outdir,
200+
outputDirectory = outdir,
204201
projectName = project,
205-
nThreads = 6
202+
nThreads = 6 # use 6 threads for faster run
206203
)
207204
```
208205

209206
The results are saved within a new folder inside our new folder `WebGestaltR_results/Project_ORA_pathways`. There are a number of results files, the one we will focus on is the interactive HTML summary file.
210207

208+
209+
210+
STOP: to save time for GSEA compute, skip ahead, run the code chunk labelled `GSEA GO MF with redundant` (it takes several minutes) then return here where we will explore the ORA HTML while the GSEA runs!!!
211+
212+
213+
211214
In the `Files` pane, open the folder `WebGestaltR_results/Project_ORA_pathways` then click on the `Report_ORA_pathways.html` file. Select `View in Web Browser`.
212215

213216
Some things to note:
@@ -323,7 +326,7 @@ Notice in the GSEA code chunks above, the R function `supressWarnings` has been
323326

324327
Now that we have both results saved in R objects, we can compare the enriched terms.
325328

326-
Visualise the number of unique and shared terms:
329+
How many significant terms from each DB?
327330

328331
```{r Count signif terms}
329332
@@ -360,24 +363,31 @@ shared_terms <- intersect(nr_terms, r_terms)
360363
361364
```
362365

366+
Print shared terms from both DBs:
367+
363368
```{r shared signif terms}
364369
print(shared_terms)
365370
```
371+
372+
Print terms only in non-redundant:
373+
366374
```{r terms signif only in nonredundant}
367375
print(unique_nr_terms)
368376
```
369377

378+
Print terms only in redundant:
379+
370380
```{r terms signif only in with-redundant}
371381
print(unique_r_terms)
372382
```
373383

374-
Scanning the list of terms only within the full GO MF (including redundant terms) we see many terms to do with DNA molecular functions.
384+
Scanning the list of terms only within the full GO MF (including redundant terms) we see many terms to do with DNA activity and binding.
375385

376386
Significant in the 'non-redundant' analysis, we can see just 2 DNA activity functions: "DNA secondary structure binding" (significant in both) and "single-stranded DNA binding" (unique to GO MF NR).
377387

378388
By grouping so many similar terms with the non-redundant analyses, the overall number of enrichments is lower and more targeted, providing a more concise overview of the biology from your results.
379389

380-
For your own research, you could explore the relationships between these terms by viewing the neighborhood of GO terms on AmiGO: https://amigo.geneontology.org/amigo, or using NaviGO https://kiharalab.org/navigo/views/goset.php
390+
For your own research, you could explore the relationships between these terms by viewing the neighborhood of GO terms on AmiGO: https://amigo.geneontology.org/amigo, or using NaviGO https://kiharalab.org/navigo/views/goset.php (enter multiple GO IDs to see their relationships).
381391

382392

383393
# 5. Save versions and session details
@@ -438,6 +448,7 @@ rstudio_version_text
438448
```
439449

440450

451+
STOP: while this is knitting, we will commence the novel species activity in the online workshop materials.
441452

442453
# 6. Knit workbook to HTML
443454

0 commit comments

Comments
 (0)