Skip to content

Commit 49bd63b

Browse files
committed
final changes to novel
1 parent 6325e38 commit 49bd63b

File tree

2 files changed

+45
-15
lines changed

2 files changed

+45
-15
lines changed

14-novel-species-FEA.Rmd

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,7 @@ This axolotl data ticked A through D yet not E (RNA from various tissues were se
6060

6161
The [reference genome](https://www.axolotl-omics.org/dl/AmexG_v6.0-DD.fa.gz) and [GTF gene prediction file](https://www.axolotl-omics.org/dl/AmexT_v47-AmexG_v6.0-DD.gtf.gz) were downloaded from www.axolotl-omics.org. A proteome was created by extracting the predicted peptide sequences from the GTF then retaining the longest isoform per gene with `AGAT` v 1.4.0. The predicated proteome was then annotated with `eggNOG emapper` v 2.1.12.
6262

63-
The annotation output file has been provided to you, and we will import this into R and use the `dplyr` package v 1.1.4 to extract GO and KEGG IDs into the required format for R-based FEA with `clsuterProfiler` and `WebGestaltR`.
63+
The annotation output file has been provided to you, and we will import this into R and use the `dplyr` package v 1.1.4 to extract GO and KEGG IDs into the required format for R-based FEA with `clusterProfiler` and `WebGestaltR`.
6464

6565
The predicted proteome was also annotated with [STRING](https://string-db.org/) v 12.0. As of version 12, STRING includes a feature `Add any organism to STRING / Annotate proteome` ([Szklarczyk etl al 2023](https://pmc.ncbi.nlm.nih.gov/articles/PMC9825434/)). The axolotl proteome was uploaded to STRING and annotation performed on STRING servers in less than 1 day. The [resulting annotation](https://version-12-0.string-db.org/organism/STRG0A90SNX) is persistent and shareable and can be used for all of STRING's search functions including ORA (`Multiple proteins`) and GSEA (`Proteins with Values/Ranks`).
6666

day2_Rnotebooks/novel_species.Rmd

Lines changed: 44 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -146,7 +146,9 @@ tail(ranked_df)
146146

147147
## 2.3 Create gene lists for ORA
148148

149-
For ORA, both tools require vector class gene lists. We will filter for adjusted P value < 0.01 and absolute log2 fold change greater than 1.5. The matrix has already filtered out genes with very low counts so we take all genes present as the background.
149+
For ORA, both tools require vector class gene lists. We will filter for adjusted P value < 0.01 and absolute log2 fold change greater than 1.5.
150+
151+
The matrix has already filtered out genes with very low counts so we take all genes present as the background.
150152

151153
```{r create ORA vector gene lists}
152154
@@ -169,6 +171,9 @@ head(background)
169171
170172
```
171173

174+
Note the large drop in gene numbers: 100K in GTF, 48K in predicted proteome, 24K expressed in the blastema! By reducing the number of background genes to what are expressed in the studied tissue, we can reduce falsely inflated P values and false positives within our list of enriched terms.
175+
176+
172177
## 2.4 Save gene lists
173178

174179
Saving any outputs generated from R code is vital to reproducibility! You should include all analysed gene lists within the supplementary materials of your manuscript.
@@ -186,6 +191,8 @@ write.table(ranked_df, file = "Axolotl_rankedFC.txt", sep = "\t", quote = FALSE,
186191

187192
# 3. Reformat annotation files for `clusterProfiler` GO and KEGG analysis
188193

194+
Now we have annotation files and gene lists, we will bring those together to create the custom database files required for R FEA!
195+
189196
## 3.1 Create TERM2GENE files
190197

191198
These are 2 column text files with the term ID (one per line) alongside the ID of the gene that maps to the term. A gene can map to many terms and thus be present on multiple lines. A term can be mapped to more than one gene and thus be present on many lines.
@@ -200,7 +207,9 @@ We need `GOs` and `KEGG_Pathway` columns.
200207

201208
### 3.1.1 GO TERM2GENE
202209

203-
Next, we will extract the GO IDs from the `emapper` annotation file, and wrangle into the correct format for `clusterProfiler` `TERM2GENE`. There are several steps to this - comments have been included to outline what each step is doing.
210+
Next, we will extract the GO IDs from the `emapper` annotation file, and wrangle into the correct format for `clusterProfiler` `TERM2GENE`.
211+
212+
There are several steps to this - comments have been included to outline what each step is doing.
204213

205214
```{r GO TERM2GENE}
206215
go_term2gene <- eggnog_anno %>%
@@ -265,16 +274,15 @@ head(kegg_term2gene)
265274

266275
### 3.2.1 GO
267276

268-
The GO Core Ontology file (we have saved in our `workshop` folder and labelled the filename as R object `go_core`) will enable us to assign term descriptions to term IDs and create our `TERM2NAME` files.
269-
277+
Now we will assign term descriptions to term IDs and create our `TERM2NAME` files.
270278

271-
This may take a few moments to run. It will use the `ontology` object we created earlier.
279+
This may take a few moments to run. It will use the `ontology` object we created earlier from the `go.obo` file.
272280

273281
```{r GO TERM2NAME}
274282
275283
# Create term to name table, removing duplicates, missing values and obsolete terms
276-
go_term2name <- go_term2gene %>%
277-
mutate(name = ontology$name[term]) %>%
284+
go_term2name <- go_term2gene %>% # only keep terms that are in our term2gene object (ie, mapped to axolotl)
285+
mutate(name = ontology$name[term]) %>%
278286
dplyr::select(term, name) %>%
279287
distinct() %>%
280288
drop_na() %>%
@@ -318,7 +326,9 @@ head(kegg_term2name)
318326

319327
## 3.3 Count annotations
320328

321-
How much of our proteome was annotated? What about our DEGs and background? Genes that do not have any annotation are excluded from enrichment analysis, so having an understanding of the extent of annotation for your novel species is very important when interpreting results!
329+
How much of our proteome was annotated? What about our DEGs and background?
330+
331+
Genes that do not have any annotation are excluded from enrichment analysis, so having an understanding of the extent of annotation for your novel species is very important when interpreting results!
322332

323333

324334
Count the number of GO terms found within the genome, and the number of genes with GO annotations:
@@ -345,7 +355,7 @@ print(paste("Number of unique genes with 1 or more annotation terms:", kegg_uniq
345355
346356
```
347357

348-
47,196 putative axolotl proteins were annotated. That's around 1/4 of our genes mapped to KEGG Pathways, and less than half of our genes mapped to GO! Ouch. As much as we expect this with uncurated novel species genomes, it's still unpleasant to face :-)
358+
47,196 putative axolotl proteins were annotated. That's around 1/4 of our predicted proteins mapped to KEGG Pathways, and less than half of our genes mapped to GO! Ouch. As much as we expect this with uncurated novel species genomes, it's still unpleasant to face :-)
349359

350360

351361
What of the genes in our gene list specifically? We have an uncurated proteome, yet the genes in our input matrix were expressed at a meaningful level within axolotl, so these may actually have a higher annotation percentage than all genes in the proteome.
@@ -367,6 +377,7 @@ cat("Number of input genes with GO annotations:", unique_genes_with_go, "(",perc
367377
368378
```
369379

380+
370381
```{r report gene list KEGG annotation percentages}
371382
# Filter the term2gene table to only include genes in the background gene list
372383
kegg_filtered_term2gene <- kegg_term2gene %>% filter(gene %in% background)
@@ -405,6 +416,8 @@ Let's review the help page:
405416

406417
There are parameters for both adjusted P value and q value. Terms must pass all thresholds (unadjusted P, adjusted P, and q value) so the important filter will be the most stringent test applied. Let's go with BH and 0.05 which we have used regularly within this workshop and are fairly common choices in the field.
407418

419+
we need to provide term2gene and term2name, and don't specify an organism.
420+
408421

409422
```{r CP GO ORA }
410423
@@ -428,7 +441,11 @@ cp_go_ora
428441

429442
91 significantly enriched terms at P.adj < 0.05.
430443

431-
Note that we used the gene list object `degs` which had 247 genes, but the tool has applied the input size as 145 - this is because it is automatically discarding any that do not have annotations. Results would be the same if we instead used `annotated_degs` object. Likewise, the background size is being reported as 15072 (the number annotated) not 24,419 (the total in background list).
444+
Look at the geneRatio column: our gene list object `degs` has 247 genes, but the tool has applied the input size as 145 - this is because it is automatically discarding any that do not have annotations.
445+
446+
Results would be the same if we instead used `annotated_degs` object.
447+
448+
Likewise, the background size is being reported as 15072 (the number annotated) not 24,419 (the total in background list).
432449

433450

434451
Save the results to a text file:
@@ -574,7 +591,7 @@ Note that in the below code, the first command is identical the one that created
574591
```{r GO GMT}
575592
576593
# Step 1: Extract relevant columns (GO terms and gene IDs) from eggnog_anno
577-
go_data <- eggnog_anno %>%
594+
go_data <- eggnog_anno %>% # use the emapper annotations for axolotl
578595
dplyr::select(GOs, `#query`) %>% # Select the GO terms and the gene IDs
579596
dplyr::filter(GOs != "-") %>% # Filter out rows where the GO ID is missing ("-")
580597
separate_rows(GOs, sep = ",") %>% # Split comma-delimited list of GO terms into separate rows
@@ -589,7 +606,7 @@ colnames(go_data) <- c("term", "gene")
589606
go_data <- go_data %>%
590607
dplyr::mutate(external_link = paste0("https://www.ebi.ac.uk/QuickGO/term/", term))
591608
592-
# Step 3: Group genes by GO term and concatenate gene list by tab
609+
# Step 3: Group genes by GO term and concatenate gene list by tab so all genes per term are on the same row
593610
go_term_grouped <- go_data %>%
594611
dplyr::group_by(term) %>%
595612
dplyr::summarize(genes = paste(gene, collapse = "\t"), .groups = "drop")
@@ -606,6 +623,9 @@ go_gmt <- go_term_grouped %>%
606623
# Save to file
607624
write.table(go_gmt, file = "Axolotl_GO.gmt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
608625
626+
# check only the first line (the lines can be long re all genes per term!:
627+
cat(go_gmt$gmt_entry[1:1], sep = "\n")
628+
609629
```
610630

611631

@@ -652,6 +672,9 @@ kegg_gmt <- kegg_term_grouped %>%
652672
# Save to file
653673
write.table(kegg_gmt, file = "Axolotl_KEGG-pathways.gmt", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
654674
675+
# check only the first line (the lines can be long re all genes per term!:
676+
cat(kegg_gmt$gmt_entry[1:1], sep = "\n")
677+
655678
```
656679

657680
## 5.2 Create description objects
@@ -726,6 +749,11 @@ We will use the same approach as with `clusterProfiler` and run ORA with GO and
726749

727750
## 6.1 WebGestaltR ORA of GO terms
728751

752+
Parameters to note for novel species:
753+
754+
- `organism = others`
755+
- `enrichDatabaseFile = "Axolotl_GO.gmt"`
756+
- `enrichDatabaseDescriptionFile = "Axolotl_GO.des"`
729757

730758
```{r WGR ORA GO }
731759
@@ -745,7 +773,6 @@ WebGestaltR(
745773
fdrThr = 0.05, # FDR significance threshold
746774
minNum = 10, # Minimum number of genes per category
747775
maxNum = 500, # Maximum number of genes per category
748-
boxplot = TRUE,
749776
outputDirectory = outputDirectory,
750777
projectName = project
751778
)
@@ -757,7 +784,9 @@ Open the HTML report `WebGestaltR_results/Project_Axolotl_ORA_GO/Report_Axolotl_
757784

758785
Note some term similarity to what we have seen with the past 2 analyses (that's reassuring!)
759786

760-
Change the 'Enrichment Results' view from table to 'Bar chart', then try the 'Affinity propagation' and 'Weighted set cover' term clustering algorithms. Both methods help focus the results.
787+
We no longer have GO Slim, as this needs to call the actual GO database, which we haven't used.
788+
789+
Change the 'Enrichment Results' view from table to 'Bar chart', then try the 'Affinity propagation' and 'Weighted set cover' term clustering algorithms. 'All' has more terms with higher specificty, and the term redundancy has performed clustering to give fewwer terms and provide a more concise overview. It's up to you as the researcher to decided which approach is best suited to your dataset!
761790

762791
Confirm that our GMT file correctly included the term link by selecting a term and clicking the hyperlink at `Analyte set`. Pretty neat huh :-)
763792

@@ -771,6 +800,7 @@ There is no `seed` parameter for `WebGestaltR` GSEA as there is for `clusterProf
771800
set.seed(123)
772801
```
773802

803+
Again we are specifying `organism = "others"` and providing our GMT and description file:
774804

775805
```{r WGR GSEA KEGG }
776806

0 commit comments

Comments
 (0)