acropora_digitifera_wgs/09.annotate_genes.Rmd at main · WWz33/acropora_digitifera_wgs · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
---
title: "Gene Annotations"
output:
  github_document:
bibliography: bibliography.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.retina = 2)
library(tidyverse)
```

As a precursor to a range of gene functional analyses, we generated an annotated gene set for *A. digitifera*.  Our annotations are based on gene models provided by Dr. Chuya Shinzato for the version 2 genome, and available via the [OIST marine genomics unit website](https://marinegenomics.oist.jp/adig/viewer/info?project_id=87). Since these are provided in the coordinates of their original reference we performed a [liftOver process](x40.liftover.md) to translate them into coordinates for the RefSeq assembly, `GCA_014634065.1_Adig_2.0_genomic.fna`.  These annotations (in RefSeq coordinates) are available as the file `data/genome/adig-v2-ncbi.gff`.


To generate functional annotations for these genes we extracted protein coding sequences and nucleotide sequences for the longest isoform per gene model. These sequences were used for BLAST and InterProScan analyses.

```bash
cgat gff2gff --filter=longest-gene -I adig-v2-ncbi.gff -S adig-v2-ncbi_longest_gene.gff

gffread -g GCA_014634065.1_Adig_2.0_genomic.fna -y protein.fa adig-v2-ncbi_longest_gene.gff
gffread -g GCA_014634065.1_Adig_2.0_genomic.fna -x CDS.fa adig-v2-ncbi_longest_gene.gff
```

137 of these predicted proteins obtained by this process have a `.` character due to the presence of small gaps (N's) in the genome. These characters are not accepted by Interproscan so we remove them prior to running further analyses.  This will result in gaps in alignment but should not otherwise interfere with detection of conserved domains.

```bash
cat protein.fa | awk -f cleanprot.awk > protein.fasta
```

## Scan for conserved domains

Next we used [InterProScan5](https://www.ebi.ac.uk/interpro/search/sequence/) version `5.53-87` to identify conserved domains in protein translations of all genes.  Prior to running this scan any non-standard amino acid characters (ambiguities denoted by ".") in protein sequences were removed. This analysis was primarily used to provide a set of GO term assignments based on conserved domains rather than specific genes.  We call these GO terms `ipr_go` to distinguish them from those obtained from uniprot, which are terms assigned to a specific homologous gene. The `ipr_go` terms will tend to be less specific but are likely to be more reliable than those provided by uniprot.

Interproscan was invoked in batches of 1000 sequences as follows.

```bash
interproscan.sh -i $seqs --disable-precalc -goterms
```

The `tsv` files produced by this process were concatenated to produce a single file which we include here as `data/hpc/annotation/all_ipr.tsv`

```{r}
# We extract GO term assignments from Interproscan results.
ipr_colnames <- c("adi_id","md5","seqlen","analysis","signature_acc","signature_desc","start","stop","score","status","date","ipr_acc","ipr_ann","ipr_go", "pathways")

interpro2 <- read_tsv("data/hpc/annotation/all_ipr.tsv", guess_max = 10000, col_names = ipr_colnames) %>%
  dplyr::select(adi_id,ipr_go) %>%
  filter(ipr_go!="-") %>%
  na.omit()

interpro_go <- interpro2 %>%
  separate_rows(ipr_go,sep="\\|") %>%
  mutate(geneid=str_remove(adi_id,"\\.t\\d")) %>%
  unique %>%
  group_by(geneid) %>%
  dplyr::summarise(go=paste(ipr_go,collapse = "; "))
```

## Homology search with BLAST

To identify homologs of *A. digitifera* with high quality functional annotations we used BLAST[xp] to search all genes against the swissprot uniprot database. After filtering blast results to include only those with evalue <1e-5, we then selected the best available hit based on evalue.  For all these best hits we then looked up putative gene names, GO terms, and Kegg information from [Uniprot ID mapping](https://www.uniprot.org/uploadlists/) from UniprotKB AC/ID to UniprotKB.

```{r}
bl6_cols <- c("qaccver","saccver","pident","length","mismatch","gapopen","qstart","qend","sstart","send","evalue","bitscore")

blastp <- read_tsv("data/hpc/annotation/Adigitifera_uniprot_blastp.outfmt6", col_names = bl6_cols) %>%
  dplyr::select(adi_id=qaccver, uniprot_id=saccver,evalue) %>%
  add_column(method="blastp") %>%
  mutate(geneid=str_remove(adi_id,'\\.t\\d')) %>%
  group_by(geneid) %>%
  slice_min(evalue) %>%
  ungroup()

blastx <- read_tsv("data/hpc/annotation/Adigitifera_uniprot_blastx.outfmt6", col_names = bl6_cols) %>%
  dplyr::select(adi_id=qaccver, uniprot_id=saccver,evalue) %>%
  add_column(method="blastx") %>%
  mutate(geneid=str_remove(adi_id,'\\.t\\d'))%>%
  group_by(geneid) %>%
  slice_min(evalue) %>%
  ungroup()
```


```{r, eval=FALSE}
# This file is used for the uniprot lookup
write_tsv(distinct(rbind(blastp,blastx) %>% select(uniprot_id)),path = "data/hpc/annotation/blast_genes.txt")
```


```{r}
# This file was obtained using the uniprot mapping service
uniprotkb_tab <- readxl::read_excel("data/hpc/annotation/uniprot-yourlist.xlsx") %>%
  dplyr::select(-Status)

uniprot_gene_annot <- rbind(blastp,blastx) %>%
  group_by(geneid) %>%
  slice_min(order_by = evalue, n = 1, with_ties = FALSE) %>%
  left_join(uniprotkb_tab,by=c("uniprot_id"="Entry name")) %>%
  ungroup %>%
  dplyr::select(geneid,uniprot_id,
          genename="Gene names",go="Gene ontology IDs",kegg="Cross-reference (KEGG)",protein="Protein names") %>%
    mutate(genename=ifelse(is.na(genename),uniprot_id,genename)) %>%
    left_join((interpro_go %>% dplyr::rename(go_ipr=go)),by=c("geneid"))
```


```{r}
# Write out results to a table.  This can easily be used in other analyses for GO enrichment etc.
write_tsv(uniprot_gene_annot, "data/hpc/annotation/uniprot_gene_annot.tsv")
```


```{r annotation-res-stats-table}
all <- read_tsv("data/hpc/annotation/allgenes_ids_extracted_from_gff.txt",col_names = "geneid")
all_annot<- all %>% left_join(uniprot_gene_annot)  %>%
  dplyr::rename("InterProScan GO"=go_ipr,"Uniprot GO (blast)"=go,"Uniprot ID (blast)"=genename)

all_summ <- map_df(all_annot %>% dplyr::select(`InterProScan GO`,`Uniprot GO (blast)`,`Uniprot ID (blast)`), ~sum(!is.na(.))/nrow(all_annot))
```

## Summary

A final table of annotated genes is provided as the file `data/hpc/annotation/uniprot_gene_annot.tsv` available as part of the data package for this repository. Overall we found that `r all_summ[,1]` percent of genes could be annotated with a GO term by Interproscan while `r all_summ[,2]` could be annotated with a blast hit to Swissprot.