Skip to content

1. more info on prepTG

Rauf Salamzade edited this page Nov 1, 2025 · 23 revisions

prepTG creates a database directory of genomes to search for homologous instances of reference/query gene-clusters in using fai.

The input is simply a directory of either FASTA or GenBank formatted files - with CDS features for the latter - representing bacterial genomes or metagenomes.

For eukaryotic genomes full GenBank format with CDS features are expected; however, FASTA formatted assemblies may instead be provided if a "reference proteome" is provided.

Check out example commands for prepTG on the 4. basic usage examples wiki page.

Using linclust to remove redundancy in the DIAMOND database

In v1.6.1, we made DIAMOND linclust redundancy dereplication of proteins at 90% identity and 90% mutual coverage the default. This can be skipped using the -sd flag however and the parameters for clustering can also be adjusted. This aids fai to run faster downstream. For larger databases, prepTG with linclust often takes more time and is aided by increased memory. Currently (v.1.6.1), to store the relationships between co-clustered proteins needed for processing of DIAMOND blastp results in fai, increased memory is also required - however we plan to resolve this in the next release. In addition, DIAMOND blastp results in the fai spreadsheet for some instances will be a proxy based on the cluster protein representative. The extracted sequences however can be extracted and investigated via zol and should not be affected.

In v1.6.12, we introduced -rc/--recluster as an option to refine clustering as heuristics used in DIAMOND clustering algorithms, which make it practically efficient, can lead to inexact solutions. This invokes a secondary reclustering using the DIAMOND recluster workflow, which is described on their Wiki documentation.

For a database of 6,191 Streptomyces genomes, using prepTG (v1.6.12) with default criteria for DIAMOND linclust (v2.1.11) collapsing proteins (--approx-id 90.0 --mutual-cover 90.0 - the default) takes around 3.55 hours to complete using 20 threads and allowing for 256 GB of memory. This results in 17,795,217 proteins selected as representatives from a total set of 47,611,157 proteins. A random subset (n=10,000) of the non-representative were extracted and aligned to the database of representative proteins using DIAMOND blastp (in --fast mode) to assess how well the clustering reflected the requested criteria:

  • 98.28% of non-representative proteins aligned to a representative protein at both the requested identity and coverage cutoffs.
  • 0.95% of non-representative proteins aligned to a representative protein at the requested coverage cutoffs but not identity.
  • 0.76% of non-representative proteins aligned to a representative protein at the requested identity cutoff but not the coverage cutoffs.
  • 0.01% of non-representative proteins met neither criteria.

Applying -rc/--recluster took over 75 hours and produced only slightly more representative proteins, resulting in the following marginally improved stats:

  • 98.31% of non-representative proteins aligned to a representative protein at both the requested 90% identity and 90% bi-directional coverage cutoffs.
  • 0.93% of non-representative proteins aligned to a representative protein at the requested coverage cutoffs but not identity.
  • 0.75% of non-representative proteins aligned to a representative protein at the requested identity cutoff but not the coverage cutoffs.
  • 0.01% of non-representative proteins met neither criteria.

However, if we instead just run using slightly more stringent parameters, e.g. "--approx-id 95.0 --mutual-cover 95.0", then we get much improved results for meeting the initial >=90% identity and >=90% bi-directional coverage criteria for the same set of 10,000 proteins (some of which might even be representatives themselves now) at a much more reasonable efficiency - 3.82 hours using 20 threads:

  • 99.58% of non-representative proteins aligned to a representative protein at both the 90% identity and 90% bi-directional coverage cutoffs.
  • 0.06% of non-representative proteins aligned to a representative protein at the requested coverage cutoffs but not identity.
  • 0.36% of non-representative proteins aligned to a representative protein at the requested identity cutoff but not the coverage cutoffs.
  • 0.00% of non-representative proteins met neither criteria.

This approaches results in 34,888,076 representative proteins - a 26.7% reduction, much smaller than the 62.6% reduction that resulted from using the default parameters clustering parameters. This will result in slower searches for gene clusters downstream in fai. Thus, if you are concerned about losing sensitivity in gene cluster detection more than search efficiency, you can opt for more stringent criteria for clustering using the -cp/--collapsing-params or skipping clustering entirely using the -sd/--skip-diamond-linclust option.

Note

If a gene cluster is detected by fai, it will be extracted with intermediate genes included (regardless of whether they make it into the final representative/clustered/non-redundant database in prepTG used for DIAMOND blastp searching). Thus, "not well-represented" proteins should still make it into the downstream analyses (e.g. investigation using zol or visualization using tools like clinker) if the gene cluster as a whole is able to be detected in a target genome.

Gene-calling bacterial genomes using prodigal and pyrodigal

For bacterial genomes or bacterial metagenomes, users are able to use pyrodigal (default) or prodigal to perform de novo gene calling. More recently, we also have the availability of prodigal-gv as an option for gene-calling when phages are the gene clusters of interest.

Gene-mapping in eukaryotic genomes using miniprot

For eukaryotic genomes, users are able to map a high-quality gene-calling prediction for some reference genome to the remainder of the genomes. This approach is generally recommended only for single-species investigations and has only been tested with microbial eukaryotic genomes of a modest size (e.g. fungii, not gigantic genomes such as those of plants).

prepTG usage

Usage: prepTG [-h] [-i INPUT_DIR] [-g GTDB_TAXON] [-gr GTDB_RELEASE]
              [-d DOWNLOAD_PREMADE] -o OUTPUT_DIR [-r] [-enl] [-ent]
              [-l LOCUS_TAG_LENGTH] [-gcm GENE_CALLING_METHOD] [-m]
              [-rp REFERENCE_PROTEOME] [-cst] [-ma] [-cp COLLAPSING_PARAMS]
              [-sd] [-rc] [-c THREADS] [-mm MAX_MEMORY] [-v]

Program: prepTG
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

Prepares a directory of target genomes for being searched for query gene clusters using fai.

Premade databases of representative genomes are available for the following genera:

Acinetobacter (n = 1, 643), Bacillales (n = 3, 150), Corynebacterium (n = 726),
Enterobacter (n = 878), Enterococcus (n = 937), Escherichia (n = 2, 436),
Klebsiella (n = 1, 022), Listeria (n = 353), Mycobacterium (n = 744),
Pseudomonas (n = 2, 666), Salmonella (n = 308), Staphylococcus (n = 496),
Streptomyces (n = 1, 555), Streptococcus (n = 2, 452), Cutibacterium (n = 27),
Neisseria (n = 414), Lactobacillus (n = 541), and Micromonospora (n = 211).

In addition, users can simply request all genomes belonging to a specific species/genus
in GTDB to be downloaded.

---------------------------------------------------------------------------------------
> Example commands:

1. Setup a prepTG database which includes some local genomes in FASTA format

    $ prepTG -i User_Genomes_Directory/

2. Setup a prepTG database which includes some local genomes and all Cutibacterium
   granulosum genomes in GTDB:

   $ prepTG -i User_Genomes_Directory/ -g "Cutibacterium granulosum" -o prepTG_Database/

3. Setup local prepTG database by downloading a premade one of representative
   Cutibacterium genomes:

   $ prepTG -d Cutibacterium -o prepTG_Database/

---------------------------------------------------------------------------------------
> Considerations
If FASTA format is provided, assumption is that genomes are prokaryotic and
pyrodigal/prodigal will be used to perform gene-calling. Eukaryotic genomes can
be provided as FASTA format but the --reference-proteome file should be used in
such case to map proteins from a reference proteome (from the same species ideally)
on to the target genomes. This will prevent detection of new genes in gene-clusters
detected by fai but synchronize gene-calling and allow for better similarity
assessment between orthologous genes.

If you are interested in inferring horizontal gene transfer using salt downstream
and are working with bacterial genomes - consider issuing the "--mge-annotation"
flag to annotate phage, plasmid and IS element associated proteins.

If GenBank files are provided, CDS features are expected and further each CDS
feature should contain a "translation" qualifier which features the protein sequence
and optionally a "locus_tag" qualifier. Options to consider when providing GenBank
files as input include --rename-locus-tags, --error-no-lt, and --error-no-translation.
Note, by default, if a CDS does not feature a locus tag, it will be given an arbitrary
one. Also, if a CDS does not feature a translation, the CDS feature will be skipped.
If > 10% of CDS features are skipped for lacking a translation, then the entire genome
will be skipped.

Options:
  -h, --help            show this help message and exit
  -i, --input-dir INPUT_DIR
                        Directory with target genomes (either featuring GenBanks or FASTAs).
  -g, --gtdb-taxon GTDB_TAXON
                        Name of a GTDB valid genus or species to incorporate genomes from.
                        Should be surrounded by
                        quotes (e.g. 'Escherichia coli').
  -gr, --gtdb-release GTDB_RELEASE
                        GTDB release to use. [Current default is R226].
  -d, --download-premade DOWNLOAD_PREMADE
                        Download and setup pre-made databases of representative genomes
                        for specific taxon/genus. Provide name of the taxon,
                        e.g. "Escherichia"
  -o, --output-dir OUTPUT_DIR
                        Output directory, which can then be provided as input for the
                        "-tg" argument in fai.
  -r, --rename-locus-tags
                        Whether to rename all locus tags for CDS features if provided
                        GenBanks as input.
  -enl, --error-no-lt   Do not rename locus tags if not found/other issue -
                        will result in skipping inclusion of entire genome.
  -ent, --error-no-translation
                        Do not skip CDS without translation -
                        will result in skipping inclusion of entire genome.
  -l, --locus-tag-length LOCUS_TAG_LENGTH
                        Length of locus tags to set. [Default is 3,
                        allows for < ~18k genomes].
  -gcm, --gene-calling-method GENE_CALLING_METHOD
                        Method to use for gene calling. Options are: pyrodigal, prodigal,
                        or prodigal-gv. [Default is pyrodigal].
  -m, --meta-mode       Flag to use meta mode instead of single for pyrodigal/prodigal.
  -rp, --reference-proteome REFERENCE_PROTEOME
                        Provide path to a reference proteome to use for protein/
                        gene-calling in target genomes - which should be in FASTA
                        format.
  -cst, --create-species-tree
                        Use skani to infer a neighbor-joing based species
                        tree for the genomes.
  -ma, --mge-annotation
                        Perform MGE annotation of proteins - for
                        bacterial genomes only.
  -cp, --collapsing-params COLLAPSING_PARAMS
                        DIAMOND linclust parameters used for collapsing proteins for
                        creating DIAMOND database. [Default is "--approx-id 90
                        --mutual-cover 90"].
  -sd, --skip-diamond-linclust
                        Skip DIAMOND linclust clustering of proteins.
  -rc, --recluster      Perform DIAMOND recluster to improve protein clustering.
  -c, --threads THREADS
                        The number of threads to use. [Default is 1].
  -mm, --max-memory MAX_MEMORY
                        Uses resource module to set soft memory limit. Provide
                        in Giga-bytes. Configured in the shell environment. For
                        DIAMOND linclust if this parameter is not set it will default
                        to using a memory of 16GB. [Default is None].
  -v, --version         Get version and exit.

Detailed Usage on Major Options:

Option (Short) Option (Long) Description
-i --input-dir Directory with target genomes (either featuring GenBanks or FASTAs). If GenBank files are provided - they must feature CDS features with gene predictions. They should preferably also include a "locus_tag" for each CDS feature, but this can be overcome using the --rename-locus-tags or --rename-problem-gbk-lts options.
-d --download-premade Download and setup pre-made databases of representative genomes for specific taxon/genus selected by skDER. For details on available databases, see this wiki page. Provide the name of the taxon, e.g. "Escherichia".
-g --gtdb-taxon Name of a GTDB valid genus or species to incorporate genomes from. Should be surrounded by quotes (e.g. "Escherichia coli").
-gr --gtdb-release GTDB release to use. [Current default is R220].
-o --output-dir Output directory, which can then be provided as input for the -tg argument in fai.
-l --locus-tag-length Length of locus tags to set. Default is 3, allows for <~18k genomes.
-r --rename-locus-tags Whether to rename locus tags for CDS features if GenBank files are provided as input.
-ro --rename-problem-gbk-lts Whether to rename locus tags of only problem GenBank files which have CDS but no locus_tag qualifier. By default such GenBank files are skipped.
-gcm --gene-calling-method Method to use for gene calling. Options are: pyrodigal, prodigal, or prodigal-gv. [Default is pyrodigal].
-m --meta-mode Flag to use metagenomic mode for pyrodigal/prodigal.
-rp --reference-proteome Provide path to a reference proteome to use for miniprot-based protein mapping in target genomes. The input proteome should be in FASTA format.
-cst --create-species-tree Use skani ANI predictions to infer a neighbor-joining based species tree for the genomes. Not recommended if working with diverse genome sets spanning multiple species.
-ma --mge-annotation Perform MGE annotation of proteins - for bacterial genomes only. This is only used for the salt program downstream, be careful when using for large datasets as it increases prepTG runtime significantly.

Clone this wiki locally