Skip to content

Latest commit

 

History

History
77 lines (62 loc) · 3.38 KB

File metadata and controls

77 lines (62 loc) · 3.38 KB

Gene Annotations

As a precursor to a range of gene functional analyses, we generated an annotated gene set for A. digitifera. Our annotations are based on gene models provided by Dr. Chuya Shinzato for the version 2 genome, and available via the OIST marine genomics unit website. Since these are provided in the coordinates of their original reference we performed a liftOver process to translate them into coordinates for the RefSeq assembly, GCA_014634065.1_Adig_2.0_genomic.fna. These annotations (in RefSeq coordinates) are available as the file data/genome/adig-v2-ncbi.gff.

To generate functional annotations for these genes we extracted protein coding sequences and nucleotide sequences for the longest isoform per gene model. These sequences were used for BLAST and InterProScan analyses.

cgat gff2gff --filter=longest-gene -I adig-v2-ncbi.gff -S adig-v2-ncbi_longest_gene.gff

gffread -g GCA_014634065.1_Adig_2.0_genomic.fna -y protein.fa adig-v2-ncbi_longest_gene.gff
gffread -g GCA_014634065.1_Adig_2.0_genomic.fna -x CDS.fa adig-v2-ncbi_longest_gene.gff

137 of these predicted proteins obtained by this process have a . character due to the presence of small gaps (N’s) in the genome. These characters are not accepted by Interproscan so we remove them prior to running further analyses. This will result in gaps in alignment but should not otherwise interfere with detection of conserved domains.

cat protein.fa | awk -f cleanprot.awk > protein.fasta

Scan for conserved domains

Next we used InterProScan5 version 5.53-87 to identify conserved domains in protein translations of all genes. Prior to running this scan any non-standard amino acid characters (ambiguities denoted by “.”) in protein sequences were removed. This analysis was primarily used to provide a set of GO term assignments based on conserved domains rather than specific genes. We call these GO terms ipr_go to distinguish them from those obtained from uniprot, which are terms assigned to a specific homologous gene. The ipr_go terms will tend to be less specific but are likely to be more reliable than those provided by uniprot.

Interproscan was invoked in batches of 1000 sequences as follows.

interproscan.sh -i $seqs --disable-precalc -goterms

The tsv files produced by this process were concatenated to produce a single file which we include here as data/hpc/annotation/all_ipr.tsv

Homology search with BLAST

To identify homologs of A. digitifera with high quality functional annotations we used BLAST[xp] to search all genes against the swissprot uniprot database. After filtering blast results to include only those with evalue <1e-5, we then selected the best available hit based on evalue. For all these best hits we then looked up putative gene names, GO terms, and Kegg information from Uniprot ID mapping from UniprotKB AC/ID to UniprotKB.

Summary

A final table of annotated genes is provided as the file data/hpc/annotation/uniprot_gene_annot.tsv available as part of the data package for this repository. Overall we found that 0.532956 percent of genes could be annotated with a GO term by Interproscan while 0.9302989 could be annotated with a blast hit to Swissprot.