AnnoOddities

Gemy George Kaithakottil, David Swarbreck

AnnoOddities is a Python utility for detecting, identifying and characterising oddities in genome annotations. It parses and integrates statistics from multiple tools, including AGAT, Mikado, and GFFread, to generate a harmonised set of extended metrics that help assess annotation quality and highlight potential anomalies within genome annotations.

In addition to summarising standard outputs, AnnoOddities computes a range of additional “oddity” measures, for example, counts of unusually large or small introns, exons, and untranslated regions; detection of transcripts lacking start or stop codons; identification of in-frame stop codons; estimation of canonical intron proportions; and more. These metrics help flag potential structural inconsistencies or systematic issues in gene models. The consolidated outputs include both summary statistics and detailed per-transcript results, exported in widely used machine- and human-readable formats (TSV, CSV, JSON, YAML, TOML) for flexible downstream analysis and visualisation.

Currently, AnnoOddities runs AGAT, Mikado, and GFFread to generate and combine statistics into unified reports. Future developments will focus on extending the workflow to incorporate additional quality assessment tools such as BUSCO and OMArk, enabling integrated comparisons of annotation and genome completeness metrics.

This tool was developed as part of BioHackathon 2025, Project 23: Streamlining FAIR Metadata for Biodiversity Genome Annotations. Project details: https://github.com/elixir-europe/biohackathon-projects-2025/blob/main/23.md

Installation

All installation methods below will install AnnoOddities along with its dependencies.

Docker Installation

AnnoOddities can be run with Docker. If you don't have Docker, please install it first. The command below pulls the Docker image with AnnoOddities installed (if it is not already available locally) and prints the help message:

VERSION=0.1.0
docker run gemygk/annooddities:v${VERSION} annooddities -h

Singularity Installation

AnnoOddities can be run with Singularity. If you don't have Singularity, please install it first. You can then use the Singularity image with AnnoOddities installed.

AnnoOddities can be run directly from the image hosted on DockerHub:

VERSION=0.1.0
singularity exec docker://gemygk/annooddities:v${VERSION} annooddities -h

Or, we can build and run a Singularity image, following the steps below:

# Create a Singularity definition file, like below:

$ cat annooddities-0.1.0.def
bootstrap: docker
from: gemygk/annooddities:v0.1.0

# Build the Singularity image
$ sudo singularity build annooddities-0.1.0.sif annooddities-0.1.0.def

# Execute AnnoOddities from the Singularity image
$ singularity exec annooddities-0.1.0.sif annooddities -h

Manual Installation

Install dependencies

Get AnnoOddities

First, obtain the source code:

git clone https://github.com/EI-CoreBioinformatics/annooddities.git
cd annooddities

Build and install using UV

version=0.1.0 \
     && uv build \
     && pip install --prefix=/path/to/software/annooddities/${version}/x86_64 -U dist/*whl

Also, make sure that both the PATH and PYTHONPATH environment variables are updated (the example below is for Python 3.10):

export PATH=/path/to/software/annooddities/${version}/x86_64/bin:$PATH
export PYTHONPATH=/path/to/software/annooddities/${version}/x86_64/lib/python3.10/site-packages

Usage

$ annooddities -h
usage: annooddities.py [-h] --genome_fasta GENOME_FASTA --gff3_file GFF3_FILE [--five_prime_utr_length FIVE_PRIME_UTR_LENGTH] [--three_prime_utr_length THREE_PRIME_UTR_LENGTH] [--five_utr_num FIVE_UTR_NUM] [--three_utr_num THREE_UTR_NUM]
                        [--min_intron_length MIN_INTRON_LENGTH] [--max_intron_length MAX_INTRON_LENGTH] [--min_exon_length MIN_EXON_LENGTH] [--max_exon_length MAX_EXON_LENGTH] [--selected_cds_fraction SELECTED_CDS_FRACTION]
                        [--canonical_intron_motifs CANONICAL_INTRON_MOTIFS] [--output_prefix OUTPUT_PREFIX] [--force] [--verbosity {debug,info,warning,error,critical}]

Find annotation oddities from GFF3 and genome FASTA files

optional arguments:
  -h, --help            show this help message and exit
  --genome_fasta GENOME_FASTA
                        Provide Genome FASTA file
  --gff3_file GFF3_FILE
                        Provide GFF3 file with transcript annotations
  --output_prefix OUTPUT_PREFIX
                        Provide sample prefix for the output table [default:output]
  --force               Force rerun even if output files exist
  --verbosity {debug,info,warning,error,critical}
                        Set logging verbosity level [default:info]

Oddity Thresholds:
  --five_prime_utr_length FIVE_PRIME_UTR_LENGTH
                        Threshold for 5' UTR length oddity [default:10000]
  --three_prime_utr_length THREE_PRIME_UTR_LENGTH
                        Threshold for 3' UTR length oddity [default:10000]
  --five_utr_num FIVE_UTR_NUM
                        Threshold for 5' UTR exon number oddity [default:5]
  --three_utr_num THREE_UTR_NUM
                        Threshold for 3' UTR exon number oddity [default:4]
  --min_intron_length MIN_INTRON_LENGTH
                        Threshold for minimum intron length oddity [default:5]
  --max_intron_length MAX_INTRON_LENGTH
                        Threshold for maximum intron length oddity.
                        Below are some guidelines:
                        - For fungi species, consider setting this to 1000 bp.
                        - For plant species, consider setting this to 10000 bp.
                        - For invertebrates species, consider setting this to 60000 bp.
                        - For vertebrates species, consider setting this to 120000 bp.
                        [default:120000]
  --min_exon_length MIN_EXON_LENGTH
                        Threshold for minimum exon length oddity [default:5]
  --max_exon_length MAX_EXON_LENGTH
                        Threshold for maximum exon length oddity [default:10000]
  --selected_cds_fraction SELECTED_CDS_FRACTION
                        Threshold for selected CDS fraction oddity.
                        This is the proportion of coding sequence to that of the transcript.
                        Values range from 0.0 to 1.0 [default:0.3]
  --canonical_intron_motifs CANONICAL_INTRON_MOTIFS
                        Comma-separated list of canonical intron motifs. [default:'GT-AG,GC-AG,AT-AC']
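
When AnnoOddities is driven from a script, the intron-length guidelines in the help text can be captured in a small lookup. The following is a minimal Python sketch and not part of AnnoOddities: the `CLADE_MAX_INTRON` table and the `build_command` helper are hypothetical names, and only the command-line flags are taken from the help output above.

```python
# Suggested --max_intron_length values, copied from the help text above.
CLADE_MAX_INTRON = {
    "fungi": 1000,
    "plant": 10000,
    "invertebrate": 60000,
    "vertebrate": 120000,
}


def build_command(genome_fasta, gff3_file, clade, output_prefix="output"):
    """Return an argv list for annooddities (hypothetical helper)."""
    if clade not in CLADE_MAX_INTRON:
        raise ValueError(f"unknown clade: {clade!r}")
    return [
        "annooddities",
        "--genome_fasta", genome_fasta,
        "--gff3_file", gff3_file,
        "--max_intron_length", str(CLADE_MAX_INTRON[clade]),
        "--output_prefix", output_prefix,
    ]


cmd = build_command("input_genome.fna", "input_genome.gff", "plant")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once AnnoOddities is available on `PATH`.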

Running AnnoOddities

Example Command

To run AnnoOddities, use the command line interface with the required arguments for the genome FASTA file and GFF3 annotation file. For example:

annooddities \
    --genome_fasta input_genome.fna \
    --gff3_file input_genome.gff

You can also specify optional parameters to customise the oddity detection thresholds. For example, to set custom thresholds for minimum exon length and maximum intron length (for plants), you can run:

annooddities \
    --genome_fasta input_genome.fna \
    --gff3_file input_genome.gff \
    --min_exon_length 3 \
    --max_intron_length 10000

Or, if you want to restrict canonical intron motifs to only 'GT-AG', you can run:

annooddities \
    --genome_fasta input_genome.fna \
    --gff3_file input_genome.gff \
    --canonical_intron_motifs 'GT-AG'
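
To illustrate how the choice of motifs affects the canonical intron proportion, the sketch below classifies introns by their first and last two bases. It is an illustrative approximation, not the AnnoOddities implementation: the function names are hypothetical, the intron sequences are invented, and the sequences are assumed to be given 5' to 3' on the transcript strand.

```python
def parse_motifs(spec):
    """Turn 'GT-AG,GC-AG,AT-AC' into {('GT', 'AG'), ...}."""
    motifs = set()
    for item in spec.split(","):
        donor, acceptor = item.strip().upper().split("-")
        motifs.add((donor, acceptor))
    return motifs


def canonical_proportion(intron_seqs, motifs):
    """Fraction of introns whose first and last two bases match a motif.

    Returns None when there are no introns (e.g. single-exon transcripts).
    """
    if not intron_seqs:
        return None
    hits = sum(
        1 for seq in intron_seqs
        if (seq[:2].upper(), seq[-2:].upper()) in motifs
    )
    return hits / len(intron_seqs)


# Invented intron sequences: GT..AG, GC..AG and CT..AC.
introns = ["GTAAGTCCTTTTCAG", "GCAAGTTTTCTTAG", "CTAAGTTTTCTTAC"]
default = parse_motifs("GT-AG,GC-AG,AT-AC")
strict = parse_motifs("GT-AG")
print(canonical_proportion(introns, default))
print(canonical_proportion(introns, strict))
```

With the default motifs two of the three example introns count as canonical, while restricting the list to 'GT-AG' leaves only one.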

Output

An example output directory structure:

CMD: annooddities --genome_fasta input_genome.fna --gff3_file input_genome.gff

DIRECTORY: output_directory
├── input_genome.fna
├── input_genome.gff
├── input_genome.agat.log
├── output.agat_standardised.gff
├── output.agat_standardised.log
├── output.agat_standardised.agat.log
├── output.agat_sp_statistics.yaml
├── output.agat_sp_statistics.txt
├── output.agat_sp_statistics.log
├── output.agat_sp_statistics.json
├── output.agat_sp_statistics.toml
├── input_genome.fna.fai
├── output.gffread_table.log
├── output.gffread_table.tbl
├── output.mikado_tab_stats.log
├── output.mikado_tab_stats.tsv
├── output.mikado_stats.tsv
├── output.mikado_stats.yaml
├── output.mikado_summary_stats.tsv
├── output.mikado_stats.json
├── output.mikado_stats.toml
├── oddity_files
│   ├── output.AnnoOddities.has_inframe_stop.gff
│   ├── output.AnnoOddities.max_exon_length_gt_10000.gff
│   ├── output.AnnoOddities.min_exon_length_lte_5.gff
│   ├── output.AnnoOddities.exon_num_eq_1.gff
│   ├── output.AnnoOddities.exon_num_gt_1.gff
│   ├── output.AnnoOddities.is_fragment.gff
│   ├── output.AnnoOddities.not_has_start_codon.gff
│   ├── output.AnnoOddities.not_has_stop_codon.gff
│   ├── output.AnnoOddities.not_is_complete.gff
│   └── output.AnnoOddities.selected_cds_fraction_lte_0.3.gff
├── output.AnnoOddities.combined_statistics.yaml
├── output.AnnoOddities.combined_statistics.json
├── output.AnnoOddities.combined_statistics.toml
├── output.AnnoOddities.all_stats.tsv
├── output.AnnoOddities.gff
└── output.AnnoOddities.oddity_summary.txt

AnnoOddities Summary

AnnoOddities provides both high-level summaries and more granular diagnostic reports to support manual review or automated pipelines.

Below is an example of the AnnoOddities.oddity_summary.txt output file:

FILE: output.AnnoOddities.oddity_summary.txt
AnnoOddities                      output
exon_num == 1                     4520
exon_num > 1                      12900
five_utr_length > 10000           0
five_utr_num > 5                  0
three_utr_length > 10000          0
three_utr_num > 4                 0
not is_complete                   428
not has_start_codon               174
not has_stop_codon                280
is_fragment                       26
has_inframe_stop                  1
max_exon_length > 10000           8
max_intron_length > 120000        0
min_exon_length <= 5              308
0 < min_intron_length <= 5        0
selected_cds_fraction <= 0.3      158
canonical_intron_proportion != 1  0
only_non_canonical_splicing       0
suspicious_splicing               0
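
For automated pipelines, the summary can be read into a dictionary. The sketch below assumes the layout shown above (a header line followed by one label and count per line, with labels that may contain spaces) and that the `FILE:` line is not part of the file itself. `parse_oddity_summary` is a hypothetical helper, not an AnnoOddities function.

```python
def parse_oddity_summary(text):
    """Map each oddity label to its count.

    Assumes a header line, then one 'label   count' line per oddity.
    """
    counts = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the 'AnnoOddities  <prefix>' header
        # Split on the last run of whitespace so labels keep their spaces.
        label, value = line.rsplit(None, 1)
        counts[label.strip()] = int(value)
    return counts


# A few lines taken from the example summary above.
sample = """\
AnnoOddities                      output
exon_num == 1                     4520
exon_num > 1                      12900
not has_start_codon               174
0 < min_intron_length <= 5        0
"""
counts = parse_oddity_summary(sample)
nonzero = {label: n for label, n in counts.items() if n > 0}
print(nonzero)
```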

AnnoOddities All Stats

Detailed per-transcript statistics computed by AnnoOddities.

Below are the columns included in the AnnoOddities.all_stats.tsv output file:

FILE: output.AnnoOddities.all_stats.tsv
1 transcript_id                             - transcript identifier
2 gene_id                                   - gene identifier
3 chromosome                                - chromosome name or number
4 start                                     - start position
5 end                                       - end position
6 region                                    - genomic region in the format chr:start-end
7 strand                                    - strand information
8 exon_num                                  - number of exons
9 exons                                     - list of exon positions
10 exon_lengths                             - list of exon lengths
11 max_exon_length                          - maximum exon length
12 min_exon_length                          - minimum exon length
13 total_exon_length                        - total length of all exons
14 cds_exon_num                             - number of CDS exons
15 cds_exons                                - list of CDS exon positions
16 cds_exon_lengths                         - list of CDS exon lengths
17 max_cds_length                           - maximum CDS exon length
18 min_cds_length                           - minimum CDS exon length
19 total_cds_length                         - total length of all CDS exons
20 cds_cdna_ratio                           - ratio of CDS length to cDNA length
21 intron_num                               - number of introns
22 introns                                  - list of intron positions
23 intron_lengths                           - list of intron lengths
24 max_intron_length                        - maximum intron length
25 min_intron_length                        - minimum intron length
26 total_intron_length                      - total length of all introns
27 five_utr_num                             - number of 5' UTRs
28 five_utr_length                          - total length of 5' UTRs
29 three_utr_num                            - number of 3' UTRs
30 three_utr_length                         - total length of 3' UTRs
31 has_start_codon                          - presence of start codon
32 has_stop_codon                           - presence of stop codon
33 is_complete                              - completeness of the transcript
34 is_fragment                              - whether the transcript is a fragment
35 has_inframe_stop                         - presence of in-frame stop codons
36 known_strand_junctions_dict              - dictionary of known strand junctions
37 known_strand_junctions_str               - string representation of known strand junctions
38 unknown_strand_junctions_dict            - dictionary of unknown strand junctions
39 unknown_strand_junctions_str             - string representation of unknown strand junctions
40 canonical_intron_count                   - count of canonical introns
41 non_canonical_intron_count               - count of non-canonical introns
42 suspicious_intron_count                  - count of suspicious introns
43 unknown_canonical_intron_count           - count of unknown canonical introns
44 unknown_non_canonical_intron_count       - count of unknown non-canonical introns
45 unknown_suspicious_intron_count          - count of unknown suspicious introns
46 unknown_predicted_strand                 - predicted strand for unknown introns
47 canonical_intron_proportion              - proportion of canonical introns
48 only_non_canonical_splicing              - whether only non-canonical splicing is present
49 suspicious_splicing                      - presence of suspicious splicing
50 matched_oddities                         - matched oddities
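
The per-transcript table can be filtered with the Python standard library. The sketch below re-applies the default exon-length thresholds to a small invented excerpt. It assumes the file is tab-separated with a header row carrying the column names listed above and that the length columns hold integers; `flag_exon_oddities` is a hypothetical helper.

```python
import csv
import io

# Invented three-column excerpt; the real file has the 50 columns listed above.
sample_tsv = (
    "transcript_id\tmin_exon_length\tmax_exon_length\n"
    "tx1\t3\t850\n"
    "tx2\t120\t12500\n"
    "tx3\t90\t2100\n"
)


def flag_exon_oddities(handle, min_len=5, max_len=10000):
    """Yield (transcript_id, labels) for rows outside the exon-length thresholds."""
    for row in csv.DictReader(handle, delimiter="\t"):
        labels = []
        if int(row["min_exon_length"]) <= min_len:
            labels.append(f"min_exon_length <= {min_len}")
        if int(row["max_exon_length"]) > max_len:
            labels.append(f"max_exon_length > {max_len}")
        if labels:
            yield row["transcript_id"], labels


flagged = dict(flag_exon_oddities(io.StringIO(sample_tsv)))
print(flagged)
```

For a real run, replace the `io.StringIO` object with `open("output.AnnoOddities.all_stats.tsv", newline="")`.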

AnnoOddities Combined Statistics

Consolidated statistics from AGAT, Mikado, GFFread, and AnnoOddities.

FILE: output.AnnoOddities.combined_statistics.yaml
FILE: output.AnnoOddities.combined_statistics.json
FILE: output.AnnoOddities.combined_statistics.toml
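
The JSON variant can be loaded with the Python standard library. Because the key layout of the combined statistics is not described here, the sketch below flattens whatever nested structure it finds into dotted names. The sample content and the `flatten` helper are hypothetical.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} for easy tabulation."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# Invented content; the real key names may differ.
sample_json = (
    '{"tool_a": {"transcripts": 17420},'
    ' "tool_b": {"exons": {"mean_length": 250.5}}}'
)
stats = flatten(json.loads(sample_json))
for name, value in sorted(stats.items()):
    print(name, value, sep="\t")
```

For a real run, replace `json.loads(sample_json)` with `json.load(handle)` on the opened `output.AnnoOddities.combined_statistics.json` file.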

AnnoOddities Individual Oddity GFFs

Separate GFF3 files for each detected oddity, facilitating targeted review and analysis.

DIRECTORY: oddity_files
FILE: output.AnnoOddities.has_inframe_stop.gff
FILE: output.AnnoOddities.max_exon_length_gt_10000.gff
FILE: output.AnnoOddities.min_exon_length_lte_5.gff
FILE: output.AnnoOddities.exon_num_eq_1.gff
FILE: output.AnnoOddities.exon_num_gt_1.gff
FILE: output.AnnoOddities.is_fragment.gff
FILE: output.AnnoOddities.not_has_start_codon.gff
FILE: output.AnnoOddities.not_has_stop_codon.gff
FILE: output.AnnoOddities.not_is_complete.gff
FILE: output.AnnoOddities.selected_cds_fraction_lte_0.3.gff

AnnoOddities GFF File

A GFF3 file containing all transcripts annotated with their respective oddities. For easy identification, the oddities are included in the attributes column, appended to the 'Note' field of the respective transcript entries.

FILE: output.AnnoOddities.gff
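
The 'Note' values can be pulled out of the attributes column with a few lines of Python. The sketch below follows the general GFF3 conventions (`tag=value` pairs separated by semicolons, multiple values separated by commas, URL-style escaping). The example record is invented, and the exact text AnnoOddities writes to 'Note' may differ.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Parse a GFF3 attributes column into {tag: [values]}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        tag, _, value = field.partition("=")
        attrs[unquote(tag)] = [unquote(v) for v in value.split(",")]
    return attrs


# Invented record for illustration only.
line = (
    "chr1\tAnnoOddities\tmRNA\t100\t900\t.\t+\t.\t"
    "ID=tx1;Parent=gene1;Note=not has_stop_codon,is_fragment"
)
fields = line.split("\t")
notes = parse_gff3_attributes(fields[8]).get("Note", [])
print(notes)
```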

License

MIT
