You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: 14-novel-species-FEA.Rmd
+1-1Lines changed: 1 addition & 1 deletion
Original file line number
Diff line number
Diff line change
@@ -60,7 +60,7 @@ This axolotl data ticked A through D yet not E (RNA from various tissues were se
60
60
61
61
The [reference genome](https://www.axolotl-omics.org/dl/AmexG_v6.0-DD.fa.gz) and [GTF gene prediction file](https://www.axolotl-omics.org/dl/AmexT_v47-AmexG_v6.0-DD.gtf.gz) were downloaded from www.axolotl-omics.org. A proteome was created by extracting the predicted peptide sequences from the GTF then retaining the longest isoform per gene with `AGAT` v 1.4.0. The predicated proteome was then annotated with `eggNOG emapper` v 2.1.12.
62
62
63
-
The annotation output file has been provided to you, and we will import this into R and use the `dplyr` package v 1.1.4 to extract GO and KEGG IDs into the required format for R-based FEA with `clsuterProfiler` and `WebGestaltR`.
63
+
The annotation output file has been provided to you, and we will import this into R and use the `dplyr` package v 1.1.4 to extract GO and KEGG IDs into the required format for R-based FEA with `clusterProfiler` and `WebGestaltR`.
64
64
65
65
The predicted proteome was also annotated with [STRING](https://string-db.org/) v 12.0. As of version 12, STRING includes a feature `Add any organism to STRING / Annotate proteome` ([Szklarczyk etl al 2023](https://pmc.ncbi.nlm.nih.gov/articles/PMC9825434/)). The axolotl proteome was uploaded to STRING and annotation performed on STRING servers in less than 1 day. The [resulting annotation](https://version-12-0.string-db.org/organism/STRG0A90SNX) is persistent and shareable and can be used for all of STRING's search functions including ORA (`Multiple proteins`) and GSEA (`Proteins with Values/Ranks`).
Copy file name to clipboardExpand all lines: day2_Rnotebooks/novel_species.Rmd
+44-14Lines changed: 44 additions & 14 deletions
Original file line number
Diff line number
Diff line change
@@ -146,7 +146,9 @@ tail(ranked_df)
146
146
147
147
## 2.3 Create gene lists for ORA
148
148
149
-
For ORA, both tools require vector class gene lists. We will filter for adjusted P value < 0.01 and absolute log2 fold change greater than 1.5. The matrix has already filtered out genes with very low counts so we take all genes present as the background.
149
+
For ORA, both tools require vector class gene lists. We will filter for adjusted P value < 0.01 and absolute log2 fold change greater than 1.5.
150
+
151
+
The matrix has already filtered out genes with very low counts so we take all genes present as the background.
150
152
151
153
```{r create ORA vector gene lists}
152
154
@@ -169,6 +171,9 @@ head(background)
169
171
170
172
```
171
173
174
+
Note the large drop in gene numbers: 100K in GTF, 48K in predicted proteome, 24K expressed in the blastema! By reducing the number of background genes to what are expressed in the studied tissue, we can reduce falsely inflated P values and false positives within our list of enriched terms.
175
+
176
+
172
177
## 2.4 Save gene lists
173
178
174
179
Saving any outputs generated from R code is vital to reproducibility! You should include all analysed gene lists within the supplementary materials of your manuscript.
# 3. Reformat annotation files for `clusterProfiler` GO and KEGG analysis
188
193
194
+
Now we have annotation files and gene lists, we will bring those together to create the custom database files required for R FEA!
195
+
189
196
## 3.1 Create TERM2GENE files
190
197
191
198
These are 2 column text files with the term ID (one per line) alongside the ID of the gene that maps to the term. A gene can map to many terms and thus be present on multiple lines. A term can be mapped to more than one gene and thus be present on many lines.
@@ -200,7 +207,9 @@ We need `GOs` and `KEGG_Pathway` columns.
200
207
201
208
### 3.1.1 GO TERM2GENE
202
209
203
-
Next, we will extract the GO IDs from the `emapper` annotation file, and wrangle into the correct format for `clusterProfiler``TERM2GENE`. There are several steps to this - comments have been included to outline what each step is doing.
210
+
Next, we will extract the GO IDs from the `emapper` annotation file, and wrangle into the correct format for `clusterProfiler``TERM2GENE`.
211
+
212
+
There are several steps to this - comments have been included to outline what each step is doing.
204
213
205
214
```{r GO TERM2GENE}
206
215
go_term2gene <- eggnog_anno %>%
@@ -265,16 +274,15 @@ head(kegg_term2gene)
265
274
266
275
### 3.2.1 GO
267
276
268
-
The GO Core Ontology file (we have saved in our `workshop` folder and labelled the filename as R object `go_core`) will enable us to assign term descriptions to term IDs and create our `TERM2NAME` files.
269
-
277
+
Now we will assign term descriptions to term IDs and create our `TERM2NAME` files.
270
278
271
-
This may take a few moments to run. It will use the `ontology` object we created earlier.
279
+
This may take a few moments to run. It will use the `ontology` object we created earlier from the `go.obo` file.
272
280
273
281
```{r GO TERM2NAME}
274
282
275
283
# Create term to name table, removing duplicates, missing values and obsolete terms
276
-
go_term2name <- go_term2gene %>%
277
-
mutate(name = ontology$name[term]) %>%
284
+
go_term2name <- go_term2gene %>% # only keep terms that are in our term2gene object (ie, mapped to axolotl)
285
+
mutate(name = ontology$name[term]) %>%
278
286
dplyr::select(term, name) %>%
279
287
distinct() %>%
280
288
drop_na() %>%
@@ -318,7 +326,9 @@ head(kegg_term2name)
318
326
319
327
## 3.3 Count annotations
320
328
321
-
How much of our proteome was annotated? What about our DEGs and background? Genes that do not have any annotation are excluded from enrichment analysis, so having an understanding of the extent of annotation for your novel species is very important when interpreting results!
329
+
How much of our proteome was annotated? What about our DEGs and background?
330
+
331
+
Genes that do not have any annotation are excluded from enrichment analysis, so having an understanding of the extent of annotation for your novel species is very important when interpreting results!
322
332
323
333
324
334
Count the number of GO terms found within the genome, and the number of genes with GO annotations:
@@ -345,7 +355,7 @@ print(paste("Number of unique genes with 1 or more annotation terms:", kegg_uniq
345
355
346
356
```
347
357
348
-
47,196 putative axolotl proteins were annotated. That's around 1/4 of our genes mapped to KEGG Pathways, and less than half of our genes mapped to GO! Ouch. As much as we expect this with uncurated novel species genomes, it's still unpleasant to face :-)
358
+
47,196 putative axolotl proteins were annotated. That's around 1/4 of our predicted proteins mapped to KEGG Pathways, and less than half of our genes mapped to GO! Ouch. As much as we expect this with uncurated novel species genomes, it's still unpleasant to face :-)
349
359
350
360
351
361
What of the genes in our gene list specifically? We have an uncurated proteome, yet the genes in our input matrix were expressed at a meaningful level within axolotl, so these may actually have a higher annotation percentage than all genes in the proteome.
@@ -367,6 +377,7 @@ cat("Number of input genes with GO annotations:", unique_genes_with_go, "(",perc
367
377
368
378
```
369
379
380
+
370
381
```{r report gene list KEGG annotation percentages}
371
382
# Filter the term2gene table to only include genes in the background gene list
There are parameters for both adjusted P value and q value. Terms must pass all thresholds (unadjusted P, adjusted P, and q value) so the important filter will be the most stringent test applied. Let's go with BH and 0.05 which we have used regularly within this workshop and are fairly common choices in the field.
407
418
419
+
we need to provide term2gene and term2name, and don't specify an organism.
420
+
408
421
409
422
```{r CP GO ORA }
410
423
@@ -428,7 +441,11 @@ cp_go_ora
428
441
429
442
91 significantly enriched terms at P.adj < 0.05.
430
443
431
-
Note that we used the gene list object `degs` which had 247 genes, but the tool has applied the input size as 145 - this is because it is automatically discarding any that do not have annotations. Results would be the same if we instead used `annotated_degs` object. Likewise, the background size is being reported as 15072 (the number annotated) not 24,419 (the total in background list).
444
+
Look at the geneRatio column: our gene list object `degs` has 247 genes, but the tool has applied the input size as 145 - this is because it is automatically discarding any that do not have annotations.
445
+
446
+
Results would be the same if we instead used `annotated_degs` object.
447
+
448
+
Likewise, the background size is being reported as 15072 (the number annotated) not 24,419 (the total in background list).
432
449
433
450
434
451
Save the results to a text file:
@@ -574,7 +591,7 @@ Note that in the below code, the first command is identical the one that created
574
591
```{r GO GMT}
575
592
576
593
# Step 1: Extract relevant columns (GO terms and gene IDs) from eggnog_anno
577
-
go_data <- eggnog_anno %>%
594
+
go_data <- eggnog_anno %>% # use the emapper annotations for axolotl
578
595
dplyr::select(GOs, `#query`) %>% # Select the GO terms and the gene IDs
579
596
dplyr::filter(GOs != "-") %>% # Filter out rows where the GO ID is missing ("-")
580
597
separate_rows(GOs, sep = ",") %>% # Split comma-delimited list of GO terms into separate rows
minNum = 10, # Minimum number of genes per category
747
775
maxNum = 500, # Maximum number of genes per category
748
-
boxplot = TRUE,
749
776
outputDirectory = outputDirectory,
750
777
projectName = project
751
778
)
@@ -757,7 +784,9 @@ Open the HTML report `WebGestaltR_results/Project_Axolotl_ORA_GO/Report_Axolotl_
757
784
758
785
Note some term similarity to what we have seen with the past 2 analyses (that's reassuring!)
759
786
760
-
Change the 'Enrichment Results' view from table to 'Bar chart', then try the 'Affinity propagation' and 'Weighted set cover' term clustering algorithms. Both methods help focus the results.
787
+
We no longer have GO Slim, as this needs to call the actual GO database, which we haven't used.
788
+
789
+
Change the 'Enrichment Results' view from table to 'Bar chart', then try the 'Affinity propagation' and 'Weighted set cover' term clustering algorithms. 'All' has more terms with higher specificty, and the term redundancy has performed clustering to give fewwer terms and provide a more concise overview. It's up to you as the researcher to decided which approach is best suited to your dataset!
761
790
762
791
Confirm that our GMT file correctly included the term link by selecting a term and clicking the hyperlink at `Analyte set`. Pretty neat huh :-)
763
792
@@ -771,6 +800,7 @@ There is no `seed` parameter for `WebGestaltR` GSEA as there is for `clusterProf
771
800
set.seed(123)
772
801
```
773
802
803
+
Again we are specifying `organism = "others"` and providing our GMT and description file:
0 commit comments