As a precursor to a range of gene functional analyses, we generated an
annotated gene set for A. digitifera. Our annotations are based on
gene models provided by Dr. Chuya Shinzato for the version 2 genome, and
available via the OIST marine genomics unit
website.
Because these gene models are provided in the coordinates of their
original reference, we performed a liftOver process to translate them
into coordinates for the RefSeq assembly,
GCA_014634065.1_Adig_2.0_genomic.fna. These annotations (in RefSeq
coordinates) are available as the file data/genome/adig-v2-ncbi.gff.
To generate functional annotations for these genes, we extracted protein translations and nucleotide coding sequences (CDS) for the longest isoform per gene model. These sequences were used for BLAST and InterProScan analyses.
cgat gff2gff --filter=longest-gene -I adig-v2-ncbi.gff -S adig-v2-ncbi_longest_gene.gff
gffread -g GCA_014634065.1_Adig_2.0_genomic.fna -y protein.fa adig-v2-ncbi_longest_gene.gff
gffread -g GCA_014634065.1_Adig_2.0_genomic.fna -x CDS.fa adig-v2-ncbi_longest_gene.gff
Of the predicted proteins obtained by this process, 137 contain a “.”
character due to the presence of small gaps (N’s) in the genome. These
characters are not accepted by InterProScan, so we removed them prior to
running further analyses. This will result in gaps in alignments but
should not otherwise interfere with detection of conserved domains.
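The cleanprot.awk script used in the command that follows is not reproduced in this section. As a rough illustration only, a filter of this kind might look like the sketch below. The function name `clean_protein_fasta` and the toy input are hypothetical, and the assumption is that only sequence lines need cleaning while header lines are left untouched.

```shell
# Hypothetical stand-in for cleanprot.awk (an assumption, not the real script):
# strip "." from sequence lines and pass FASTA headers through unchanged.
clean_protein_fasta() {
  awk '
    /^>/ { print; next }          # header line: print as-is
    { gsub(/\./, ""); print }     # sequence line: delete every "."
  ' "$@"
}

# Toy input: gene1 carries a "." from a gap; the header of gene2 contains a
# dot that must be preserved.
printf '>gene1\nMKT.LLA\n>gene2.t1\nMAAA\n' | clean_protein_fasta
```

With real data this could be used as `clean_protein_fasta protein.fa > protein.fasta`.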
cat protein.fa | awk -f cleanprot.awk > protein.fasta
Next we used InterProScan5 version
5.53-87 to identify conserved domains in protein translations of all
genes. Prior to running this scan, any non-standard amino acid characters
(ambiguities denoted by “.”) in protein sequences were removed. This
analysis was primarily used to provide a set of GO term assignments
based on conserved domains rather than specific genes. We call these GO
terms ipr_go to distinguish them from those obtained from UniProt,
which are terms assigned to a specific homologous gene. The ipr_go
terms will tend to be less specific but are likely to be more reliable
than those provided by UniProt.
InterProScan was invoked in batches of 1000 sequences as follows.
interproscan.sh -i $seqs --disable-precalc -goterms
The tsv files produced by this process were concatenated to produce a
single file, which we include here as data/hpc/annotation/all_ipr.tsv.
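The batching and concatenation steps are described above but not shown. The sketch below is one possible way to do them, not necessarily the procedure that was actually used. The function names, the `batch` prefix and the toy input are hypothetical. The `.tsv` file names assume InterProScan's default behaviour of naming output after the input file path.

```shell
# Split a FASTA file into numbered batches with a fixed number of sequences.
# $1 = input FASTA, $2 = sequences per batch, $3 = output prefix
split_fasta() {
  awk -v size="$2" -v prefix="$3" '
    /^>/ {
      if (n % size == 0) {                     # time to start a new batch
        if (out != "") close(out)
        out = sprintf("%s_%04d.fa", prefix, n / size + 1)
      }
      n++
    }
    out != "" { print > out }                  # write line to current batch
  ' "$1"
}

# Run InterProScan on every batch, then merge the per-batch tables.
# Assumption: each run writes <input>.tsv next to its input file.
# Defined here for illustration only; it is not called in this sketch.
run_batches() {
  for seqs in "$1"_*.fa; do
    interproscan.sh -i "$seqs" --disable-precalc -goterms
  done
  cat "$1"_*.fa.tsv > all_ipr.tsv
}

# Toy demonstration: five sequences in batches of two give three files.
tmp=$(mktemp -d)
printf '>a\nMK\n>b\nMA\n>c\nML\n>d\nMV\n>e\nMG\n' > "$tmp/toy.fa"
split_fasta "$tmp/toy.fa" 2 "$tmp/batch"
ls "$tmp"
```

For the real data the equivalent calls would be `split_fasta protein.fasta 1000 batch` followed by `run_batches batch`.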
To identify homologs of A. digitifera genes with high-quality functional annotations, we used BLAST (blastx and blastp) to search all genes against the Swiss-Prot (UniProt) database. After filtering BLAST results to include only those with e-value < 1e-5, we selected the best available hit for each gene based on e-value. For these best hits we then looked up putative gene names, GO terms, and KEGG information using UniProt ID mapping from UniProtKB AC/ID to UniProtKB.
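The filtering and best-hit selection can be sketched as follows. This is an illustration under stated assumptions, not the exact commands used. It assumes BLAST results in the standard 12-column tabular format (`-outfmt 6`), where column 1 is the query ID and column 11 is the e-value. The function names and the toy rows are hypothetical.

```shell
# Keep hits with e-value below 1e-5, then retain the lowest e-value hit per
# query. Ties keep the first hit encountered.
best_hits() {
  awk -F '\t' -v max="1e-5" '
    ($11 + 0) < (max + 0) {
      if (!($1 in best) || ($11 + 0) < best[$1]) {
        best[$1] = $11 + 0        # best e-value seen so far for this query
        line[$1] = $0             # full row for that hit
      }
    }
    END { for (q in line) print line[q] }
  ' "$@" | LC_ALL=C sort -k1,1
}

# Helper that prints one made-up tabular row: query, subject, e-value.
hit() {
  printf '%s\t%s\t90\t100\t0\t0\t1\t100\t1\t100\t%s\t200\n' "$1" "$2" "$3"
}

# Toy demonstration: g1 has two passing hits, g2 fails the cutoff.
{ hit g1 hitA 1e-20; hit g1 hitB 1e-30; hit g2 hitC 1e-3; hit g3 hitD 0.0; } \
  | best_hits | cut -f1,2,11
```

In this toy run g1 keeps hitB (the lower e-value), g2 is dropped by the cutoff, and g3 keeps hitD.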
A final table of annotated genes is provided as the file
data/hpc/annotation/uniprot_gene_annot.tsv, available as part of the
data package for this repository. Overall, we found that 53.3% of genes
(a proportion of 0.532956) could be annotated with a GO term by
InterProScan, while 93.0% (0.9302989) could be annotated with a BLAST
hit to Swiss-Prot.
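A proportion like the InterProScan figure above could be recomputed along the lines of the sketch below. It is illustrative only and is not the calculation used to produce the reported values. It assumes the InterProScan TSV layout in which column 1 is the protein ID and column 14 holds GO terms when -goterms is enabled, with an empty field or "-" meaning none. The function names and the toy rows are hypothetical.

```shell
# Report how many distinct sequence IDs carry at least one GO term.
# $1 = InterProScan TSV ("-" for stdin), $2 = total number of genes searched
go_fraction() {
  awk -F '\t' -v total="$2" '
    NF >= 14 && $14 != "" && $14 != "-" { has_go[$1] = 1 }
    END {
      n = 0
      for (id in has_go) n++      # count distinct annotated IDs
      printf "%d of %d genes (%.1f%%)\n", n, total, 100 * n / total
    }
  ' "$1"
}

# Helper that prints one made-up 14-column row: sequence ID and GO field.
row() {
  printf '%s\tmd5\t100\tToyDB\tSIG1\tdesc\t1\t50\t0\tT\tdate\tIPRx\tdesc\t%s\n' "$1" "$2"
}

# Toy demonstration: g1 has GO terms on two rows, g2 has none.
{ row g1 'GO:xxxxxxx'; row g1 'GO:yyyyyyy'; row g2 '-'; } | go_fraction - 4
# prints: 1 of 4 genes (25.0%)
```

With the real files the total could be taken from the cleaned FASTA, for example `go_fraction all_ipr.tsv "$(grep -c '^>' protein.fasta)"`.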