
Commit 3d7e936

Merge pull request #438 from Ecogenomics/staging
2.2.0
2 parents 08ef9cd + 99c2ee2 commit 3d7e936


62 files changed

+3138
-361
lines changed

README.md

Lines changed: 8 additions & 8 deletions
@@ -7,8 +7,6 @@
77
[![Docker Image Version (latest by date)](https://img.shields.io/docker/v/ecogenomic/gtdbtk?sort=date&color=299bec&label=docker)](https://hub.docker.com/r/ecogenomic/gtdbtk)
88
[![Docker Pulls](https://img.shields.io/docker/pulls/ecogenomic/gtdbtk?color=299bec&label=pulls)](https://hub.docker.com/r/ecogenomic/gtdbtk)
99

10-
<b>GTDB-Tk v2.1.0+ requires an updated reference package ([R207_v2](https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz)), [read more](https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data).</b>
11-
1210
GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based
1311
on the Genome Database Taxonomy ([GTDB](https://gtdb.ecogenomic.org/)). It is designed to work with recent advances that
1412
allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples.
@@ -39,13 +37,15 @@ Documentation for GTDB-Tk can be found [here](https://ecogenomics.github.io/GTDB
3937

4038
## ✨ New Features
4139

42-
GTDB-Tk v2.1.0 includes the following new features:
43-
- GTDB-TK now uses a **divide-and-conquer** approach where the bacterial reference tree is split into multiple **class**-level subtrees. This reduces the memory requirements of GTDB-Tk from **320 GB** of RAM when using the full GTDB R07-RS207 reference tree to approximately **55 GB**. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree use the `--full-tree` flag.
44-
This is the main change from v2.0.0. The split tree approach has been modified from order-level trees to class-level trees to resolve specific classification issues (See [#383](https://github.com/Ecogenomics/GTDBTk/issues/383)).
45-
- Genomes that cannot be assigned to a domain (e.g. genomes with no bacterial or archaeal markers or genomes with no genes called by Prodigal) are now reported in the `gtdbtk.bac120.summary.tsv` as 'Unclassified'
46-
- Genomes filtered out during the alignment step are now reported in the `gtdbtk.bac120.summary.tsv` or `gtdbtk.ar53.summary.tsv` as 'Unclassified Bacteria/Archaea'
47-
- `--write_single_copy_genes` flag in now available in the `classify_wf` and `de_novo_wf` workflows.
40+
GTDB-Tk v2.2.0+ includes the following new features:
41+
- GTDB-Tk `classify` and `classify_wf` have changed in version 2.2.0+. There is now an ANI classification stage (`ANI screen`) that precedes classification by placement in a reference tree.
42+
- **This is now the default behavior for `classify` and `classify_wf`.**
43+
- In `classify`, user genomes are first compared against a Mash database of all GTDB representative genomes; genome pairs of sufficient similarity are then processed by FastANI. User genomes classified to a GTDB representative based on FastANI results are not run through pplacer.
44+
- In the `classify_wf` workflow, genomes are classified using Mash and FastANI before executing the identify step. User genomes classified with FastANI are not run through the remainder of the pipeline (identify, align, classify).
45+
- To classify genomes without the additional `ani_screen` step, use the `--skip_ani_screen` flag.
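The screening logic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not GTDB-Tk's implementation: the `Hit` record, the `ani_screen` function and the 95% ANI cutoff are assumptions made for the example, and any alignment-fraction check is left out. The Mash cutoff of 0.1 mirrors the `mash_max_dist` value shown in the `gtdbtk.json` example.

```python
from dataclasses import dataclass

# Assumed thresholds for illustration only (not read from the GTDB-Tk source).
MASH_MAX_DIST = 0.1   # Mash distance cutoff for keeping a candidate pair
ANI_THRESHOLD = 95.0  # percent ANI treated here as a species-level match


@dataclass
class Hit:
    """One user-genome / representative-genome comparison (hypothetical record)."""
    query: str        # user genome ID
    reference: str    # GTDB representative genome ID
    mash_dist: float  # Mash distance for the pair
    ani: float        # FastANI value for the pair, in percent


def ani_screen(hits, all_queries):
    """Split user genomes into those classified by ANI and those left for pplacer."""
    best = {}
    for hit in hits:
        # Pairs that fail the Mash prefilter are never considered.
        if hit.mash_dist > MASH_MAX_DIST:
            continue
        # Pairs below the ANI cutoff do not classify the genome.
        if hit.ani < ANI_THRESHOLD:
            continue
        current = best.get(hit.query)
        if current is None or hit.ani > current.ani:
            best[hit.query] = hit
    classified = {query: hit.reference for query, hit in best.items()}
    # Everything else continues to placement in the reference tree.
    remaining = [query for query in all_queries if query not in classified]
    return classified, remaining
```

Genomes returned in `classified` correspond to those that skip pplacer; genomes in `remaining` would continue through tree placement.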
4846

47+
## 📈 Performance
48+
Using the ANI screen can reduce computation by more than 50%, although the benefit depends on the set of input genomes. A set of input genomes consisting primarily of new species will not benefit from the ANI screen as much as a set of genomes that are largely assigned to GTDB species clusters. In the latter case, the ANI screen reduces the number of genomes that need to be classified by pplacer, which reduces computation time substantially (between 25% and 60% in our testing).
4949

5050
## 📚 References
5151

docs/assets/gtdbtk_logo.ai

Lines changed: 1529 additions & 0 deletions
Large diffs are not rendered by default.

docs/src/announcements.rst

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,14 @@
11
Announcements
22
=============
33

4+
GTDB-Tk 2.2.0 available
5+
-----------------------
6+
7+
*February 14, 2023*
8+
9+
* GTDB-Tk version ``2.2.0`` is now available.
10+
* This version of GTDB-Tk **does not** require a new version of the GTDB-Tk reference package.
11+
412

513
GTDB-Tk 2.1.0 available
614
-----------------------

docs/src/changelog.rst

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,38 @@
22
Change log
33
==========
44

5+
2.2.0
6+
-----
7+
8+
Minor changes:
9+
10+
* (`#433 <https://github.com/Ecogenomics/GTDBTk/issues/433>`_) Added additional checks to ensure that the `--outgroup_taxon` cannot be set to a domain (`root`, `de_novo_wf`).
11+
* (`#459 <https://github.com/Ecogenomics/GTDBTk/issues/459>`_ / `#462 <https://github.com/Ecogenomics/GTDBTk/issues/462>`_ ) Fix deprecated np.bool in prodigal_biolib.py. Special thanks to @neoformit for his contribution.
12+
* (`#466 <http://github.com/Ecogenomics/GTDBTk/issues/466>`_) RED values are now rounded to 5 decimal places.
13+
* (`#451 <http://github.com/Ecogenomics/GTDBTk/issues/451>`_) Extra checks have been added when Prodigal fails.
14+
* (`#448 <http://github.com/Ecogenomics/GTDBTk/issues/448>`_) Warning has been added when all the genomes are filtered out and not classified.
15+
16+
Bug Fixes:
17+
18+
* (`#420 <https://github.com/Ecogenomics/GTDBTk/issues/420>`_) Fixed an issue where GTDB-Tk might hang when classifying TIGRFAM markers (`identify`, `classify_wf`, `de_novo_wf`). Special thanks to @lfenske-93 and @sjaenick for their contribution.
19+
* (`#428 <https://github.com/Ecogenomics/GTDBTk/issues/428>`_) Fixed an issue where the `--gtdbtk_classification_file` would raise an error trying to read the `classify` summary (`root`, `de_novo_wf`).
20+
* (`#439 <https://github.com/Ecogenomics/GTDBTk/issues/439>`_) Fixed the pipeline when using protein files instead of nucleotide files. The symlink now uses an absolute path.
21+
22+
23+
24+
25+
2.1.1
26+
-----
27+
28+
Documentation:
29+
30+
* (`#410 <https://github.com/Ecogenomics/GTDBTk/issues/410>`_) Add documentation for `convert_to_itol`
31+
32+
Bug Fixes:
33+
34+
* (`#399 <https://github.com/Ecogenomics/GTDBTk/issues/399>`_) Fix `--genes` option attempting to create a directory.
35+
* (`#400 <https://github.com/Ecogenomics/GTDBTk/issues/400>`_) Updated contig.py to fix inconsistent pplacer paths causing the program to crash.
36+
537

638
2.1.0
739
-----

docs/src/commands/align.rst

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -22,12 +22,14 @@ Files output
2222

2323

2424
* :ref:`[prefix].log <files/gtdbtk.log>`
25+
* :ref:`[prefix].json <files/gtdbtk.json>`
2526
* :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`
26-
* :ref:`align/[prefix].[domain].msa.fasta.gz <files/msa.fasta>`
27-
* :ref:`align/[prefix].[domain].user_msa.fasta.gz <files/user_msa.fasta>`
28-
* :ref:`align/[prefix].[domain].filtered.tsv <files/filtered.tsv>`
29-
* :ref:`align/intermediate_results/[prefix].[domain].marker_info.tsv <files/marker_info.tsv>`
30-
27+
* align
28+
* :ref:`[prefix].[domain].msa.fasta.gz <files/msa.fasta>`
29+
* :ref:`[prefix].[domain].user_msa.fasta.gz <files/user_msa.fasta>`
30+
* :ref:`[prefix].[domain].filtered.tsv <files/filtered.tsv>`
31+
* intermediate_results
32+
* :ref:`[prefix].[domain].marker_info.tsv <files/marker_info.tsv>`
3133

3234
Example
3335
-------

docs/src/commands/classify.rst

Lines changed: 44 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -20,17 +20,33 @@ Files output
2020
------------
2121

2222
* classify
23+
* :ref:`[prefix].[domain].summary.tsv <files/summary.tsv>`
24+
* :ref:`[prefix].backbone.[domain].classify.tree <files/classify.tree>`
25+
* :ref:`[prefix].[domain].tree.mapping.tsv <files/tree.mapping.tsv>`
26+
* :ref:`[prefix].[domain].classify.tree.[index].tree <files/classify.tree>`
2327
* intermediate_results
24-
* :ref:`[prefix].[domain].classification_pplacer.tsv <files/classification_pplacer.tsv>`
25-
* :ref:`[prefix].[domain].classify.tree <files/classify.tree>`
28+
* :ref:`[prefix].[domain].backbone.classification_pplacer.tsv <files/classification_pplacer.tsv>`
29+
* :ref:`[prefix].[domain].class_level.classification_pplacer_tree_[index].tsv <files/classification_pplacer.tsv>`
30+
* :ref:`[prefix].[domain].prescreened.msa.fasta <files/msa.fasta>`
31+
* :ref:`[prefix].[domain].red_dictionary.tsv <files/red_dictionary.tsv>`
2632
* pplacer
27-
* :ref:`pplacer.[domain].json <files/pplacer.domain.json>`
28-
* :ref:`pplacer.[domain].out <files/pplacer.domain.out>`
29-
* :ref:`[prefix].[domain].red_dictionary.tsv <files/red_dictionary.tsv>`
33+
* :ref:`pplacer.backbone.[domain].json <files/pplacer.domain.json>`
34+
* :ref:`pplacer.backbone.[domain].out <files/pplacer.domain.out>`
35+
* tree_[index]
36+
* :ref:`[prefix].[domain].user_msa.fasta <files/user_msa.fasta>`
37+
* :ref:`pplacer.class_level.[domain].out <files/pplacer.domain.out>`
38+
* :ref:`pplacer.class_level.[domain].json <files/pplacer.domain.json>`
39+
* ani_screen
40+
* intermediate_results
41+
* mash
42+
* :ref:`[prefix].mash_distances.tsv <files/mash_distances.msh>`
43+
* :ref:`[prefix].user_query_sketch.msh <files/user_query_sketch.msh>`
3044
* :ref:`[prefix].[domain].summary.tsv <files/summary.tsv>`
3145
* :ref:`[prefix].log <files/gtdbtk.log>`
46+
* :ref:`[prefix].json <files/gtdbtk.json>`
3247
* :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`
3348

49+
3450
Example
3551
-------
3652

@@ -51,16 +67,26 @@ Output
5167

5268
.. code-block:: text
5369
54-
[2022-04-11 12:02:06] INFO: GTDB-Tk v2.0.0
55-
[2022-04-11 12:02:06] INFO: gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2
56-
[2022-04-11 12:02:06] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
57-
[2022-04-11 12:02:07] TASK: Placing 2 archaeal genomes into reference tree with pplacer using 2 CPUs (be patient).
58-
[2022-04-11 12:02:07] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
59-
[2022-04-11 12:07:06] INFO: Calculating RED values based on reference tree.
60-
[2022-04-11 12:07:06] TASK: Traversing tree to determine classification method.
61-
[2022-04-11 12:07:06] INFO: Completed 2 genomes in 0.00 seconds (18,558.87 genomes/second).
62-
[2022-04-11 12:07:06] TASK: Calculating average nucleotide identity using FastANI (v1.32).
63-
[2022-04-11 12:07:08] INFO: Completed 4 comparisons in 1.61 seconds (2.49 comparisons/second).
64-
[2022-04-11 12:07:08] INFO: 2 genome(s) have been classified using FastANI and pplacer.
65-
[2022-04-11 12:07:08] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
66-
[2022-04-11 12:07:08] INFO: Done.
70+
[2023-02-08 12:53:42] INFO: GTDB-Tk v2.2.0
71+
[2023-02-08 12:53:42] INFO: gtdbtk classify --align_dir align_3lines/ --batchfile 3lines_batchfile.tsv --out_dir 3classify_ani --mash_db mash_db_dir/ --cpus 20
72+
[2023-02-08 12:53:42] INFO: Using GTDB-Tk reference data version r207: /path/to/gtdbtk/database/release207_v2/
73+
[2023-02-08 12:53:43] INFO: Loading reference genomes.
74+
[2023-02-08 12:53:43] INFO: Using Mash version 2.2.2
75+
[2023-02-08 12:53:43] INFO: Loading data from existing Mash sketch file: 3classify_ani/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
76+
[2023-02-08 12:53:43] INFO: Loading data from existing Mash sketch file: mash_db_dir/gtdb_ref_sketch.msh
77+
[2023-02-08 12:53:46] INFO: Calculating Mash distances.
78+
[2023-02-08 12:53:49] INFO: Calculating ANI with FastANI v1.3.
79+
[2023-02-08 12:53:49] INFO: Completed 12 comparisons in 0.44 seconds (27.54 comparisons/second).
80+
[2023-02-08 12:53:49] INFO: 2 genome(s) have been classified using the ANI pre-screening step.
81+
[2023-02-08 12:53:49] TASK: Placing 1 bacterial genomes into backbone reference tree with pplacer using 20 CPUs (be patient).
82+
[2023-02-08 12:53:49] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
83+
[2023-02-08 12:55:02] INFO: Calculating RED values based on reference tree.
84+
[2023-02-08 12:55:03] INFO: 1 out of 1 have an class assignments. Those genomes will be reclassified.
85+
[2023-02-08 12:55:03] TASK: Placing 1 bacterial genomes into class-level reference tree 5 (1/1) with pplacer using 20 CPUs (be patient).
86+
[2023-02-08 12:57:38] INFO: Calculating RED values based on reference tree.
87+
[2023-02-08 12:57:40] TASK: Traversing tree to determine classification method.
88+
[2023-02-08 12:57:40] INFO: Completed 1 genome in 0.04 seconds (23.86 genomes/second).
89+
[2023-02-08 12:57:40] INFO: 0 genome(s) have been classified using FastANI and pplacer.
90+
[2023-02-08 12:57:40] WARNING: 1 of 3 genome has a warning (see summary file).
91+
[2023-02-08 12:57:40] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
92+
[2023-02-08 12:57:40] INFO: Done.

docs/src/commands/classify_wf.rst

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,11 @@ For arguments and output files, see each of the individual steps:
1212
* :ref:`commands/align`
1313
* :ref:`commands/classify`
1414

15-
The classify workflow consists of three steps: ``identify``, ``align``, and ``classify``.
15+
The classify workflow consists of four steps: ``ani_screen``, ``identify``, ``align``, and ``classify``.
16+
17+
The ``ani_screen`` step compares user genomes against a `Mash <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x>`_ database composed of all GTDB representative genomes,
18+
then verifies the best Mash hits using `FastANI <https://www.nature.com/articles/s41467-018-07641-9>`_. User genomes classified with FastANI are not run through the rest of the pipeline (``identify``, ``align``, ``classify``)
19+
and are reported in the summary file.
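As a rough illustration of how the workflow might be launched from a script, the sketch below assembles a ``classify_wf`` command line. The helper name ``build_classify_wf_cmd`` is hypothetical; the flags (``--batchfile``, ``--out_dir``, ``--cpus``, ``--mash_db``, ``--skip_ani_screen``) are the ones that appear elsewhere in this commit, and how they interact in a given GTDB-Tk version should be checked against ``gtdbtk classify_wf --help``.

```python
def build_classify_wf_cmd(batchfile, out_dir, mash_db=None, cpus=1, skip_ani_screen=False):
    """Build an argument list for `gtdbtk classify_wf` (hypothetical helper)."""
    cmd = [
        "gtdbtk", "classify_wf",
        "--batchfile", batchfile,
        "--out_dir", out_dir,
        "--cpus", str(cpus),
    ]
    if skip_ani_screen:
        # Skip the ANI screen and run identify, align and classify only.
        cmd.append("--skip_ani_screen")
    elif mash_db is not None:
        # Location where the Mash sketch of the reference genomes is stored.
        cmd += ["--mash_db", mash_db]
    return cmd
```

The returned list can be passed to ``subprocess.run(cmd, check=True)`` on a machine where GTDB-Tk and its reference data are installed.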
1620

1721
The ``identify`` step calls genes using `Prodigal <http://compbio.ornl.gov/prodigal/>`_,
1822
and uses HMM models and the `HMMER <http://hmmer.org/>`_ package to identify the

docs/src/commands/identify.rst

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -19,11 +19,13 @@ Arguments
1919
## Files output
2020

2121
* :ref:`[prefix].log <files/gtdbtk.log>`
22+
* :ref:`[prefix].json <files/gtdbtk.json>`
2223
* :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`
23-
* identify/
24+
* identify
2425
* :ref:`[prefix].[domain].markers_summary.tsv <files/markers_summary.tsv>`
2526
* :ref:`[prefix].translation_table_summary.tsv <files/translation_table_summary.tsv>`
26-
* identify/intermediate_results/marker_genes/[genome_id]/
27+
* :ref:`[prefix].failed_genomes.tsv <files/failed_genomes.tsv>`
28+
* intermediate_results/marker_genes/[genome_id]/
2729
* :ref:`[genome_id]_pfam_tophit.tsv <files/pfam_tophit.tsv>`
2830
* :ref:`[genome_id]_pfam.tsv <files/pfam.tsv>`
2931
* :ref:`[genome_id]_protein.faa <files/protein.faa>`
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
.. _files/failed_genomes.tsv:
2+
3+
failed_genomes.tsv
4+
===================
5+
6+
File reporting failed genomes which have been excluded from analysis due to Prodigal failing to call any genes.
7+
8+
Produced by
9+
-----------
10+
* :ref:`commands/identify`
11+
* :ref:`commands/classify_wf`
12+
13+
Example
14+
-------
15+
16+
.. code-block:: text
17+
18+
GCA_000002165.1,No genes were called by Prodigal
19+
GCA_000002175.1,No genes were called by Prodigal
20+
GCA_000002185.1,No genes were called by Prodigal
21+
GCA_000002195.1,No genes were called by Prodigal
22+
GCA_000002205.1,No genes were called by Prodigal
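A report in this shape can be read with the standard ``csv`` module. The sketch below is only an assumption about how one might consume the file: the function name ``read_failed_genomes`` is hypothetical, and the delimiter defaults to a comma because that is what the example above shows, so pass ``delimiter="\t"`` if the file on disk is tab-separated.

```python
import csv


def read_failed_genomes(handle, delimiter=","):
    """Return {genome_id: reason} from a failed-genomes report (hypothetical helper)."""
    failed = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        failed[row[0].strip()] = row[1].strip()
    return failed
```

Call it with an open file handle, for example ``with open(path, newline="") as fh: failed = read_failed_genomes(fh)``.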

docs/src/files/gtdbtk.json.rst

Lines changed: 46 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,46 @@
1+
.. _files/gtdbtk.json:
2+
3+
gtdbtk.json
4+
===========
5+
6+
The console output of GTDB-Tk saved to disk in a JSON format.
7+
8+
Produced by
9+
-----------
10+
11+
* :ref:`commands/align`
13+
* :ref:`commands/classify`
14+
* :ref:`commands/classify_wf`
15+
* :ref:`commands/de_novo_wf`
16+
* :ref:`commands/identify`
17+
* :ref:`commands/infer`
18+
19+
Example
20+
-------
21+
22+
.. code-block:: text
23+
24+
{
25+
"version": "2.1.1",
26+
"command_line": "gtdbtk classify_wf --batchfile /srv/projects/gtdbtk/test_new_features/gems_benchmark/3lines_batchfile.tsv --out_dir /srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/ --keep_intermediates --cpus 20 --mash_db /srv/projects/gtdbtk/test_new_features/gems_benchmark/mash_sketch/cli/",
27+
"database_version": "r207",
28+
"database_path": "/srv/projects/gtdbtk/test_new_features/release207_v2/",
29+
"steps": [
30+
{
31+
"name": "ANI screen",
32+
"input": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/3lines_batchfile.tsv",
33+
"output_dir": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/",
34+
"output_files": {
35+
"bac120": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv"
36+
},
37+
"starts_at": "2023-02-01T08:02:17.814231",
38+
"ends_at": "2023-02-01T08:02:27.782442",
39+
"duration": "0:00:09",
40+
"status": "completed",
41+
"mash_k": 16,
42+
"mash_s": 5000,
43+
"mash_v": 1.0,
44+
"mash_max_dist": 0.1,
45+
"mash_db": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/mash_sketch/cli/"
46+
},
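Because the log is plain JSON, the per-step records can be summarized with the standard library. The sketch below relies only on keys visible in the example above (``steps``, ``name``, ``status``, ``duration``); the function names ``load_log`` and ``summarize_steps`` are hypothetical, and other keys or step types may exist in a real file.

```python
import json


def load_log(path):
    """Parse a gtdbtk.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_steps(log):
    """Return (name, status, duration) for each step recorded in the parsed log."""
    return [
        (step.get("name"), step.get("status"), step.get("duration"))
        for step in log.get("steps", [])
    ]
```

Missing keys come back as ``None`` rather than raising, which keeps the summary usable for logs written by an interrupted run.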
