Ecogenomics
diff --git a/README.md b/README.md
Lines changed: 8 additions & 8 deletions
diff --git a/docs/assets/gtdbtk_logo.ai b/docs/assets/gtdbtk_logo.ai
Lines changed: 1529 additions & 0 deletions
diff --git a/docs/src/announcements.rst b/docs/src/announcements.rst
Lines changed: 8 additions & 0 deletions
diff --git a/docs/src/changelog.rst b/docs/src/changelog.rst
Lines changed: 32 additions & 0 deletions
diff --git a/docs/src/commands/align.rst b/docs/src/commands/align.rst
Lines changed: 7 additions & 5 deletions
diff --git a/docs/src/commands/classify.rst b/docs/src/commands/classify.rst
Lines changed: 44 additions & 18 deletions
diff --git a/docs/src/commands/classify_wf.rst b/docs/src/commands/classify_wf.rst
Lines changed: 5 additions & 1 deletion
diff --git a/docs/src/commands/identify.rst b/docs/src/commands/identify.rst
Lines changed: 4 additions & 2 deletions
diff --git a/docs/src/files/failed_genomes.tsv.rst b/docs/src/files/failed_genomes.tsv.rst
Lines changed: 22 additions & 0 deletions
diff --git a/docs/src/files/gtdbtk.json.rst b/docs/src/files/gtdbtk.json.rst
Lines changed: 46 additions & 0 deletions
@@ -7,8 +7,6 @@
 [![Docker Image Version (latest by date)](https://img.shields.io/docker/v/ecogenomic/gtdbtk?sort=date&color=299bec&label=docker)](https://hub.docker.com/r/ecogenomic/gtdbtk)
 [![Docker Pulls](https://img.shields.io/docker/pulls/ecogenomic/gtdbtk?color=299bec&label=pulls)](https://hub.docker.com/r/ecogenomic/gtdbtk)
 
-<b>GTDB-Tk v2.1.0+ requires an updated reference package ([R207_v2](https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_v2_data.tar.gz)), [read more](https://ecogenomics.github.io/GTDBTk/installing/index.html#gtdb-tk-reference-data).</b>
-
 GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based 
 on the Genome Database Taxonomy ([GTDB](https://gtdb.ecogenomic.org/)). It is designed to work with recent advances that 
 allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. 
@@ -39,13 +37,15 @@ Documentation for GTDB-Tk can be found [here](https://ecogenomics.github.io/GTDB
 
 ## ✨ New Features
 
-GTDB-Tk v2.1.0 includes the following new features:
-- GTDB-TK now uses a **divide-and-conquer** approach where the bacterial reference tree is split into multiple **class**-level subtrees. This reduces the memory requirements of GTDB-Tk from **320 GB** of RAM when using the full GTDB R07-RS207 reference tree to approximately **55 GB**. A manuscript describing this approach is in preparation. If you wish to continue using the full GTDB reference tree use the `--full-tree` flag.  
-This is the main change from v2.0.0. The split tree approach has been modified from order-level trees to class-level trees to resolve specific classification issues (See [#383](https://github.com/Ecogenomics/GTDBTk/issues/383)). 
-- Genomes that cannot be assigned to a domain (e.g. genomes with no bacterial or archaeal markers or genomes with no genes called by Prodigal) are now reported in the `gtdbtk.bac120.summary.tsv` as 'Unclassified'
-- Genomes filtered out during the alignment step are now reported in the `gtdbtk.bac120.summary.tsv` or `gtdbtk.ar53.summary.tsv` as 'Unclassified Bacteria/Archaea'
-- `--write_single_copy_genes` flag in now available in the `classify_wf` and `de_novo_wf` workflows.
+GTDB-Tk v2.2.0+ includes the following new features:
+- GTDB-Tk `classify` and `classify_wf` have changed in version 2.2.0+. There is now an ANI classification stage (`ANI screen`) that precedes classification by placement in a reference tree.
+  - **This is now the default behavior for `classify` and `classify_wf`.**
+  - In `classify`, user genomes are first compared against a Mash database composed of all GTDB representative genomes, and genome pairs of sufficient similarity are then processed by FastANI. User genomes classified to a GTDB representative based on the FastANI results are not run through pplacer.
+  - In the `classify_wf` workflow, genomes are classified using Mash and FastANI before executing the identify step. User genomes classified with FastANI are not run through the remainder of the pipeline (identify, align, classify).
+  - To classify genomes without the additional `ani_screen` step, use the `--skip_ani_screen` flag.
 
+## 📈 Performance
+Using the ANI screen can reduce computation by more than 50%, although the gain depends on the set of input genomes. A set of input genomes consisting primarily of new species will not benefit from the ANI screen as much as a set of genomes that are largely assigned to GTDB species clusters. In the latter case, the ANI screen reduces the number of genomes that need to be classified by pplacer, which reduces computation time substantially (between 25% and 60% in our testing).
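
Reviewer sketch: the ANI screen described in this README hunk can be pictured as a two-stage filter, with a cheap Mash distance pass to shortlist reference genomes followed by an ANI check on the shortlist. The Python below is a minimal illustration of that idea only, not GTDB-Tk's implementation. The function `ani_screen`, its arguments, and the `ANI_CUTOFF` value are hypothetical; `MASH_MAX_DIST` mirrors the `mash_max_dist` value shown in the `gtdbtk.json` example later in this diff. Distances and ANI values are passed in as precomputed data rather than obtained by calling Mash or FastANI.

```python
from typing import Callable, Dict, List, Optional, Tuple

MASH_MAX_DIST = 0.1  # mirrors "mash_max_dist" in the gtdbtk.json example
ANI_CUTOFF = 95.0    # assumed value, for illustration only


def ani_screen(
    mash_dist: Dict[Tuple[str, str], float],
    ani_lookup: Callable[[str, str], Optional[float]],
    user_genomes: List[str],
) -> Tuple[Dict[str, str], List[str]]:
    """Split user genomes into (classified by ANI, still needing placement).

    mash_dist maps (user_genome, reference_genome) to a Mash distance.
    ani_lookup returns an ANI percentage for a pair, or None if unavailable.
    """
    classified: Dict[str, str] = {}
    remaining: List[str] = []
    for genome in user_genomes:
        # Stage 1: keep only references within the Mash distance threshold.
        candidates = [
            ref
            for (qry, ref), dist in mash_dist.items()
            if qry == genome and dist <= MASH_MAX_DIST
        ]
        # Stage 2: verify the shortlisted pairs with ANI and keep the best hit.
        best_ref, best_ani = None, 0.0
        for ref in candidates:
            ani = ani_lookup(genome, ref)
            if ani is not None and ani > best_ani:
                best_ref, best_ani = ref, ani
        if best_ref is not None and best_ani >= ANI_CUTOFF:
            classified[genome] = best_ref
        else:
            # These genomes would continue to tree placement.
            remaining.append(genome)
    return classified, remaining
```

Only the genomes returned in the second list would go on to pplacer, which is why an input set dominated by known species benefits the most.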
 
 ## 📚 References
 
 
@@ -1,6 +1,14 @@
 Announcements
 =============
 
+GTDB-Tk 2.2.0 available
+-----------------------
+
+*February 14, 2023*
+
+* GTDB-Tk version ``2.2.0`` is now available.
+* This version of GTDB-Tk **does not** require a new version of the GTDB-Tk reference package.
+
 
 GTDB-Tk 2.1.0 available
 -----------------------
 
@@ -2,6 +2,38 @@
 Change log
 ==========
 
+2.2.0
+-----
+
+Minor changes:
+
+* (`#433 <https://github.com/Ecogenomics/GTDBTk/issues/433>`_) Added additional checks to ensure that the `--outgroup_taxon` cannot be set to a domain (`root`, `de_novo_wf`).
+* (`#459 <https://github.com/Ecogenomics/GTDBTk/issues/459>`_ / `#462 <https://github.com/Ecogenomics/GTDBTk/issues/462>`_ ) Fix deprecated np.bool in prodigal_biolib.py. Special thanks to @neoformit for his contribution.
+* (`#466 <http://github.com/Ecogenomics/GTDBTk/issues/466>`_) RED values are now rounded to 5 decimal places.
+* (`#451 <http://github.com/Ecogenomics/GTDBTk/issues/451>`_) Extra checks have been added for cases where Prodigal fails.
+* (`#448 <http://github.com/Ecogenomics/GTDBTk/issues/448>`_) A warning is now raised when all genomes are filtered out and none are classified.
+
+Bug Fixes:
+
+* (`#420 <https://github.com/Ecogenomics/GTDBTk/issues/420>`_) Fixed an issue where GTDB-Tk might hang when classifying TIGRFAM markers (`identify`, `classify_wf`, `de_novo_wf`). Special thanks to @lfenske-93 and @sjaenick for their contribution.
+* (`#428 <https://github.com/Ecogenomics/GTDBTk/issues/428>`_) Fixed an issue where the `--gtdbtk_classification_file` would raise an error trying to read the `classify` summary (`root`, `de_novo_wf`).
+* (`#439 <https://github.com/Ecogenomics/GTDBTk/issues/439>`_) Fixed the pipeline when using protein files instead of nucleotide files; the symlink now uses an absolute path.
+
+
+
+
+2.1.1
+-----
+
+Documentation:
+
+* (`#410 <https://github.com/Ecogenomics/GTDBTk/issues/410>`_) Add documentation for `convert_to_itol`
+
+Bug Fixes:
+
+* (`#399 <https://github.com/Ecogenomics/GTDBTk/issues/399>`_) Fix `--genes` option attempting to create a directory.
+* (`#400 <https://github.com/Ecogenomics/GTDBTk/issues/400>`_) Updated contig.py to fix inconsistent pplacer paths causing the program to crash.
+
 
 2.1.0
 -----
 
@@ -22,12 +22,14 @@ Files output
 
 
 * :ref:`[prefix].log <files/gtdbtk.log>`
+* :ref:`[prefix].json <files/gtdbtk.json>`
 * :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`
-* :ref:`align/[prefix].[domain].msa.fasta.gz <files/msa.fasta>`
-* :ref:`align/[prefix].[domain].user_msa.fasta.gz <files/user_msa.fasta>`
-* :ref:`align/[prefix].[domain].filtered.tsv <files/filtered.tsv>`
-* :ref:`align/intermediate_results/[prefix].[domain].marker_info.tsv <files/marker_info.tsv>`
-
+* align
+    * :ref:`[prefix].[domain].msa.fasta.gz <files/msa.fasta>`
+    * :ref:`[prefix].[domain].user_msa.fasta.gz <files/user_msa.fasta>`
+    * :ref:`[prefix].[domain].filtered.tsv <files/filtered.tsv>`
+    * intermediate_results
+        * :ref:`[prefix].[domain].marker_info.tsv <files/marker_info.tsv>`
 
 Example
 -------
 
@@ -20,17 +20,33 @@ Files output
 ------------
 
 * classify
+    * :ref:`[prefix].[domain].summary.tsv <files/summary.tsv>`
+    * :ref:`[prefix].backbone.[domain].classify.tree <files/classify.tree>`
+    * :ref:`[prefix].[domain].tree.mapping.tsv <files/tree.mapping.tsv>`
+    * :ref:`[prefix].[domain].classify.tree.[index].tree <files/classify.tree>`
     * intermediate_results
-        * :ref:`[prefix].[domain].classification_pplacer.tsv <files/classification_pplacer.tsv>`
-        * :ref:`[prefix].[domain].classify.tree <files/classify.tree>`
+        * :ref:`[prefix].[domain].backbone.classification_pplacer.tsv <files/classification_pplacer.tsv>`
+        * :ref:`[prefix].[domain].class_level.classification_pplacer_tree_[index].tsv <files/classification_pplacer.tsv>`
+        * :ref:`[prefix].[domain].prescreened.msa.fasta <files/msa.fasta>`
+        * :ref:`[prefix].[domain].red_dictionary.tsv <files/red_dictionary.tsv>`
         * pplacer
-            * :ref:`pplacer.[domain].json <files/pplacer.domain.json>`
-            * :ref:`pplacer.[domain].out <files/pplacer.domain.out>`
-            * :ref:`[prefix].[domain].red_dictionary.tsv <files/red_dictionary.tsv>`
+            * :ref:`pplacer.backbone.[domain].json <files/pplacer.domain.json>`
+            * :ref:`pplacer.backbone.[domain].out <files/pplacer.domain.out>`
+            * tree_[index]
+                * :ref:`[prefix].[domain].user_msa.fasta <files/user_msa.fasta>`
+                * :ref:`pplacer.class_level.[domain].out <files/pplacer.domain.out>`
+                * :ref:`pplacer.class_level.[domain].json <files/pplacer.domain.json>`
+* ani_screen
+    * intermediate_results
+        * mash
+            * :ref:`[prefix].mash_distances.tsv <files/mash_distances.msh>`
+            * :ref:`[prefix].user_query_sketch.msh <files/user_query_sketch.msh>`
 * :ref:`[prefix].[domain].summary.tsv <files/summary.tsv>`
 * :ref:`[prefix].log <files/gtdbtk.log>`
+* :ref:`[prefix].json <files/gtdbtk.json>`
 * :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`
 
+
 Example
 -------
 
@@ -51,16 +67,26 @@ Output
 
 .. code-block:: text
 
-    [2022-04-11 12:02:06] INFO: GTDB-Tk v2.0.0
-    [2022-04-11 12:02:06] INFO: gtdbtk classify --genome_dir /tmp/gtdbtk/genomes --align_dir /tmp/gtdbtk/align --out_dir /tmp/gtdbtk/classify -x gz --cpus 2
-    [2022-04-11 12:02:06] INFO: Using GTDB-Tk reference data version r207: /srv/db/gtdbtk/official/release207
-    [2022-04-11 12:02:07] TASK: Placing 2 archaeal genomes into reference tree with pplacer using 2 CPUs (be patient).
-    [2022-04-11 12:02:07] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
-    [2022-04-11 12:07:06] INFO: Calculating RED values based on reference tree.
-    [2022-04-11 12:07:06] TASK: Traversing tree to determine classification method.
-    [2022-04-11 12:07:06] INFO: Completed 2 genomes in 0.00 seconds (18,558.87 genomes/second).
-    [2022-04-11 12:07:06] TASK: Calculating average nucleotide identity using FastANI (v1.32).
-    [2022-04-11 12:07:08] INFO: Completed 4 comparisons in 1.61 seconds (2.49 comparisons/second).
-    [2022-04-11 12:07:08] INFO: 2 genome(s) have been classified using FastANI and pplacer.
-    [2022-04-11 12:07:08] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
-    [2022-04-11 12:07:08] INFO: Done.
+    [2023-02-08 12:53:42] INFO: GTDB-Tk v2.2.0
+    [2023-02-08 12:53:42] INFO: gtdbtk classify --align_dir align_3lines/ --batchfile 3lines_batchfile.tsv --out_dir 3classify_ani --mash_db mash_db_dir/ --cpus 20
+    [2023-02-08 12:53:42] INFO: Using GTDB-Tk reference data version r207: /path/to/gtdbtk/database/release207_v2/
+    [2023-02-08 12:53:43] INFO: Loading reference genomes.
+    [2023-02-08 12:53:43] INFO: Using Mash version 2.2.2
+    [2023-02-08 12:53:43] INFO: Loading data from existing Mash sketch file: 3classify_ani/classify/ani_screen/intermediate_results/mash/gtdbtk.user_query_sketch.msh
+    [2023-02-08 12:53:43] INFO: Loading data from existing Mash sketch file: mash_db_dir/gtdb_ref_sketch.msh
+    [2023-02-08 12:53:46] INFO: Calculating Mash distances.
+    [2023-02-08 12:53:49] INFO: Calculating ANI with FastANI v1.3.
+    [2023-02-08 12:53:49] INFO: Completed 12 comparisons in 0.44 seconds (27.54 comparisons/second).
+    [2023-02-08 12:53:49] INFO: 2 genome(s) have been classified using the ANI pre-screening step.
+    [2023-02-08 12:53:49] TASK: Placing 1 bacterial genomes into backbone reference tree with pplacer using 20 CPUs (be patient).
+    [2023-02-08 12:53:49] INFO: pplacer version: v1.1.alpha19-0-g807f6f3
+    [2023-02-08 12:55:02] INFO: Calculating RED values based on reference tree.
+    [2023-02-08 12:55:03] INFO: 1 out of 1 have an class assignments. Those genomes will be reclassified.
+    [2023-02-08 12:55:03] TASK: Placing 1 bacterial genomes into class-level reference tree 5 (1/1) with pplacer using 20 CPUs (be patient).
+    [2023-02-08 12:57:38] INFO: Calculating RED values based on reference tree.
+    [2023-02-08 12:57:40] TASK: Traversing tree to determine classification method.
+    [2023-02-08 12:57:40] INFO: Completed 1 genome in 0.04 seconds (23.86 genomes/second).
+    [2023-02-08 12:57:40] INFO: 0 genome(s) have been classified using FastANI and pplacer.
+    [2023-02-08 12:57:40] WARNING: 1 of 3 genome has a warning (see summary file).
+    [2023-02-08 12:57:40] INFO: Note that Tk classification mode is insufficient for publication of new taxonomic designations. New designations should be based on one or more de novo trees, an example of which can be produced by Tk in de novo mode.
+    [2023-02-08 12:57:40] INFO: Done.
@@ -12,7 +12,11 @@ For arguments and output files, see each of the individual steps:
 * :ref:`commands/align`
 * :ref:`commands/classify`
 
-The classify workflow consists of three steps: ``identify``, ``align``, and ``classify``.
+The classify workflow consists of four steps: ``ani_screen``, ``identify``, ``align``, and ``classify``.
+
+The ``ani_screen`` step compares user genomes against a `Mash <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x>`_ database composed of all GTDB representative genomes,
+then verifies the best Mash hits using `FastANI <https://www.nature.com/articles/s41467-018-07641-9>`_. User genomes classified with FastANI are not run through the rest of the pipeline (``identify``, ``align``, ``classify``)
+and are reported in the summary file.
 
 The ``identify`` step calls genes using `Prodigal <http://compbio.ornl.gov/prodigal/>`_,
 and uses HMM models and the `HMMER <http://hmmer.org/>`_ package to identify the
 
@@ -19,11 +19,13 @@ Arguments
 ## Files output
 
 * :ref:`[prefix].log <files/gtdbtk.log>`
+* :ref:`[prefix].json <files/gtdbtk.json>`
 * :ref:`[prefix].warnings.log <files/gtdbtk.warnings.log>`
-* identify/
+* identify
     * :ref:`[prefix].[domain].markers_summary.tsv <files/markers_summary.tsv>`
     * :ref:`[prefix].translation_table_summary.tsv <files/translation_table_summary.tsv>`
-* identify/intermediate_results/marker_genes/[genome_id]/
+    * :ref:`[prefix].failed_genomes.tsv <files/failed_genomes.tsv>`
+    * intermediate_results/marker_genes/[genome_id]/
     * :ref:`[genome_id]_pfam_tophit.tsv <files/pfam_tophit.tsv>`
     * :ref:`[genome_id]_pfam.tsv <files/pfam.tsv>`
     * :ref:`[genome_id]_protein.faa <files/protein.faa>`
 
@@ -0,0 +1,22 @@
+.. _files/failed_genomes.tsv:
+
+failed_genomes.tsv
+==================
+
+Lists the genomes that were excluded from the analysis because Prodigal failed to call any genes.
+
+Produced by
+-----------
+ * :ref:`commands/identify`
+ * :ref:`commands/classify_wf`
+
+Example
+-------
+
+.. code-block:: text
+
+    GCA_000002165.1,No genes were called by Prodigal
+    GCA_000002175.1,No genes were called by Prodigal
+    GCA_000002185.1,No genes were called by Prodigal
+    GCA_000002195.1,No genes were called by Prodigal
+    GCA_000002205.1,No genes were called by Prodigal
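
Reviewer sketch: a downstream script might read this file to find out which genomes were dropped and why. The Python below is a minimal, hedged example. The helper name `parse_failed_genomes` is hypothetical, and because the example above shows a comma between the two fields while the extension is `.tsv`, the sketch accepts either a tab or a comma as the first delimiter; that tolerance is an assumption, not documented behaviour.

```python
import re
from typing import Dict, Iterable


def parse_failed_genomes(lines: Iterable[str]) -> Dict[str, str]:
    """Map each failed genome ID to the reported failure reason."""
    failed: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split once, on the first tab or comma (delimiter is an assumption).
        parts = re.split(r"[\t,]", line, maxsplit=1)
        genome_id = parts[0].strip()
        reason = parts[1].strip() if len(parts) > 1 else ""
        failed[genome_id] = reason
    return failed
```

Passing an open file handle works as well as a list of strings, since the function only iterates over lines.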
@@ -0,0 +1,46 @@
+.. _files/gtdbtk.json:
+
+gtdbtk.json
+===========
+
+The console output of GTDB-Tk saved to disk in a JSON format.
+
+Produced by
+-----------
+
+* :ref:`commands/align`
+* :ref:`commands/classify`
+* :ref:`commands/classify_wf`
+* :ref:`commands/de_novo_wf`
+* :ref:`commands/identify`
+* :ref:`commands/infer`
+
+Example
+-------
+
+.. code-block:: text
+
+    {
+        "version": "2.1.1",
+        "command_line": "gtdbtk classify_wf --batchfile /srv/projects/gtdbtk/test_new_features/gems_benchmark/3lines_batchfile.tsv --out_dir /srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/ --keep_intermediates --cpus 20 --mash_db /srv/projects/gtdbtk/test_new_features/gems_benchmark/mash_sketch/cli/",
+        "database_version": "r207",
+        "database_path": "/srv/projects/gtdbtk/test_new_features/release207_v2/",
+        "steps": [
+            {
+                "name": "ANI screen",
+                "input": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/3lines_batchfile.tsv",
+                "output_dir": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/",
+                "output_files": {
+                    "bac120": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/classify_wf_outdir_prescreen_3lines/classify/ani_screen/gtdbtk.bac120.ani_summary.tsv"
+                },
+                "starts_at": "2023-02-01T08:02:17.814231",
+                "ends_at": "2023-02-01T08:02:27.782442",
+                "duration": "0:00:09",
+                "status": "completed",
+                "mash_k": 16,
+                "mash_s": 5000,
+                "mash_v": 1.0,
+                "mash_max_dist": 0.1,
+                "mash_db": "/srv/projects/gtdbtk/test_new_features/gems_benchmark/mash_sketch/cli/"
+            },