Add MAGs-taxonomy-annotation workflow #1104
SantaMcCloud wants to merge 3 commits into galaxyproject:main
Test Results (powered by Planemo)
Test Summary
Failed Tests
Pull request overview
Adds a new Galaxy workflow under workflows/microbiome/ to classify MAGs with GTDB-Tk and map the resulting taxonomy to NCBI names/taxIDs, along with IWC packaging files so it can be tested and published via Dockstore.
Changes:
- Introduces the `MAGs taxonomy annotation` Galaxy workflow (`.ga`) including MultiQC reporting.
- Adds a Dockstore descriptor (`.dockstore.yml`) and a wftest file (`*-tests.yml`) using Zenodo-hosted inputs.
- Adds workflow documentation (`README.md`) and a versioned `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| workflows/microbiome/mags-taxonomy-annotation/MAGs-taxonomy-annotation.ga | New Galaxy workflow implementing GTDB-Tk classification, GTDB→NCBI mapping, and MultiQC reporting. |
| workflows/microbiome/mags-taxonomy-annotation/MAGs-taxonomy-annotation-tests.yml | New workflow test definition using Zenodo inputs and output assertions. |
| workflows/microbiome/mags-taxonomy-annotation/.dockstore.yml | Dockstore entry to publish the workflow and run tests. |
| workflows/microbiome/mags-taxonomy-annotation/README.md | User-facing workflow documentation (inputs/outputs/logic). |
| workflows/microbiome/mags-taxonomy-annotation/CHANGELOG.md | Initial release entry for the new workflow. |
| @@ -0,0 +1,1002 @@ | |||
| { | |||
| "a_galaxy_workflow": "true", | |||
| "annotation": "This workflow uses GTDB-Tk to classify an input sequence collection (e.g., bins from SemiBin2 or MetaBat) and maps the resulting taxonomy to NCBI taxIDs and names to reconcile differences between classification systems.\n", | |||
Workflow-level annotation does not follow the repository guidance to start with "This workflow does/runs/performs …". Please rephrase the annotation to match that required format (and keep it a short description of action + output).
| "annotation": "This workflow uses GTDB-Tk to classify an input sequence collection (e.g., bins from SemiBin2 or MetaBat) and maps the resulting taxonomy to NCBI taxIDs and names to reconcile differences between classification systems.\n", | |
| "annotation": "This workflow performs taxonomic classification of an input sequence collection (e.g., bins from SemiBin2 or MetaBat) using GTDB-Tk and maps the resulting taxonomy to NCBI taxIDs and names to reconcile differences between classification systems.\n", |
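A check like this could be automated. The sketch below assumes the guidance means the annotation must open with "This workflow" followed by one of the three listed verbs. The verb list and the helper name `annotation_follows_guidance` are assumptions for illustration, not part of any existing linter.

```python
import re

# Assumed reading of the guidance: the annotation opens with
# "This workflow does/runs/performs". The verb list is an assumption.
ANNOTATION_PREFIX = re.compile(r"^This workflow (does|runs|performs)\b")


def annotation_follows_guidance(annotation: str) -> bool:
    """Return True if the annotation starts with the expected phrase."""
    return ANNOTATION_PREFIX.match(annotation.strip()) is not None


# Opening words of the two annotations quoted in the suggestion above
original = "This workflow uses GTDB-Tk to classify an input sequence collection"
suggested = "This workflow performs taxonomic classification of an input sequence collection"

print(annotation_follows_guidance(original))   # False
print(annotation_follows_guidance(suggested))  # True
```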
| - /MAGs-taxonomy-annotation-tests.yml | ||
| authors: | ||
| - name: Santino Faack | ||
| orcid: 0000-0003-2982-388X |
Author ORCID here (0000-0003-2982-388X) must match the creator metadata in the .ga workflow file. Currently the workflow creator.identifier is different, which breaks the required alignment between .dockstore.yml and workflow metadata.
| orcid: 0000-0003-2982-388X |
| "owner": "bgruening", | ||
| "tool_shed": "toolshed.g2.bx.psu.edu" | ||
| }, | ||
| "tool_state": "{\"code\": \"NR == 0 || !seen[$0]++\", \"infile\": {\"__class__\": \"ConnectedValue\"}, \"variables\": [], \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
The AWK condition NR == 0 is never true (AWK record numbers start at 1). As written, this step is equivalent to !seen[$0]++ and will not preserve a header line if that was intended. If you meant to keep the first line, use NR == 1 (or handle headers explicitly).
| "tool_state": "{\"code\": \"NR == 0 || !seen[$0]++\", \"infile\": {\"__class__\": \"ConnectedValue\"}, \"variables\": [], \"__page__\": 0, \"__rerun_remap_job_id__\": null}", | |
| "tool_state": "{\"code\": \"NR == 1 || !seen[$0]++\", \"infile\": {\"__class__\": \"ConnectedValue\"}, \"variables\": [], \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
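To make the difference between the three variants concrete, here is a small Python emulation of the AWK pattern `NR == N || !seen[$0]++`. It is a sketch of AWK's semantics, not the tool's implementation, and the sample rows are made up. One side effect that may be worth checking before applying the suggestion: because `||` short-circuits, `NR == 1` prints the first line without recording it in `seen`, so a later line identical to the header would be printed once more.

```python
def awk_dedup(lines, first_record_number=None):
    """Emulate the AWK program `NR == N || !seen[$0]++`.

    With first_record_number=None this emulates plain `!seen[$0]++`.
    """
    seen = {}
    out = []
    for nr, line in enumerate(lines, start=1):  # AWK's NR starts at 1
        if first_record_number is not None and nr == first_record_number:
            # `||` short-circuits: the line is printed and seen[] is not updated
            out.append(line)
            continue
        if seen.get(line, 0) == 0:
            out.append(line)
        seen[line] = seen.get(line, 0) + 1
    return out


# Hypothetical input: two tables with the same header concatenated
rows = ["header", "a", "a", "header", "b"]

plain = awk_dedup(rows)                       # !seen[$0]++
nr0 = awk_dedup(rows, first_record_number=0)  # NR == 0 || !seen[$0]++
nr1 = awk_dedup(rows, first_record_number=1)  # NR == 1 || !seen[$0]++

print(plain)  # ['header', 'a', 'b']
print(nr0)    # ['header', 'a', 'b']
print(nr1)    # ['header', 'a', 'header', 'b']
```

If the goal is simply to drop all duplicate lines, repeated headers included, plain `!seen[$0]++` already does that under this emulation.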
| "owner": "bgruening", | ||
| "tool_shed": "toolshed.g2.bx.psu.edu" | ||
| }, | ||
| "tool_state": "{\"infile\": {\"__class__\": \"ConnectedValue\"}, \"replacements\": [{\"__index__\": 0, \"column\": \"1\", \"find_pattern\": \"gtdb_taxonomy\", \"replace_pattern\": \"classification\"}, {\"__index__\": 1, \"column\": \"1\", \"find_pattern\": null, \"replace_pattern\": null}], \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
This Replace Text step includes an extra replacement entry where both find_pattern and replace_pattern are null. This is atypical in this repo and may cause parameter validation/runtime failures depending on the tool’s expectations. Please remove the empty replacement row so only intentional replacements are configured.
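For illustration, the empty row can be spotted and dropped programmatically. The sketch below parses the `tool_state` string quoted in the diff and filters out rows whose `find_pattern` and `replace_pattern` are both null. The helper `drop_empty_replacements` is hypothetical; in practice, removing the row in the Galaxy workflow editor is probably the safer route.

```python
import json

# tool_state as quoted in the diff above (a JSON document stored as a string)
tool_state = json.loads(
    '{"infile": {"__class__": "ConnectedValue"}, "replacements": ['
    '{"__index__": 0, "column": "1", "find_pattern": "gtdb_taxonomy", '
    '"replace_pattern": "classification"}, '
    '{"__index__": 1, "column": "1", "find_pattern": null, "replace_pattern": null}], '
    '"__page__": 0, "__rerun_remap_job_id__": null}'
)


def drop_empty_replacements(state: dict) -> dict:
    """Return a copy of state without rows whose find and replace patterns are both null."""
    kept = [
        row
        for row in state["replacements"]
        if not (row.get("find_pattern") is None and row.get("replace_pattern") is None)
    ]
    # Renumber __index__ so the remaining rows stay contiguous
    kept = [{**row, "__index__": i} for i, row in enumerate(kept)]
    return {**state, "replacements": kept}


cleaned = drop_empty_replacements(tool_state)
print(json.dumps(cleaned["replacements"], indent=2))
```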
| This workflow create a taxonomy annotation for MAGs with GTDB-Tk. With the help of some other tools MAGs also get the classification from NCBI. | ||
|
|
README has multiple grammar issues in the intro sentence (e.g., subject/verb agreement: "This workflow create" → "This workflow creates"). Please proofread and correct for clarity.
| This workflow create a taxonomy annotation for MAGs with GTDB-Tk. With the help of some other tools MAGs also get the classification from NCBI. | ||
|
|
There is an existing microbiome workflow that already performs GTDB-Tk-based MAG taxonomy as part of a broader MAG pipeline (workflows/microbiome/mags-building). Per workflow README guidelines, consider adding a short note explaining how this workflow differs and when to prefer it (e.g., lightweight taxonomy mapping vs end-to-end MAG generation).
| This workflow create a taxonomy annotation for MAGs with GTDB-Tk. With the help of some other tools MAGs also get the classification from NCBI. | |
| This workflow creates a taxonomy annotation for MAGs with GTDB-Tk. With the help of some other tools, MAGs also get the classification from NCBI. | |
| A related workflow, `workflows/microbiome/mags-building`, already performs GTDB-Tk-based taxonomy as part of a broader end-to-end MAG generation pipeline (including assembly, binning, and quality control). In contrast, this MAGs taxonomy annotation workflow is a lightweight option that only takes precomputed MAGs as input and adds consistent GTDB and NCBI taxonomic labels. Prefer this workflow when you already have MAG bins and only need standardized taxonomy annotation; prefer the MAGs building workflow when you need to generate MAGs and assign taxonomy in a single comprehensive pipeline. |
| "creator": [ | ||
| { | ||
| "class": "Person", | ||
| "identifier": "https://orcid.org/0009-0004-0382-2023", |
The workflow creator ORCID (https://orcid.org/0009-0004-0382-2023) does not match the ORCID listed for the author in this workflow’s .dockstore.yml (0000-0003-2982-388X). Please align these so .dockstore.yml authors match the workflow creator metadata.
| "identifier": "https://orcid.org/0009-0004-0382-2023", | |
| "identifier": "https://orcid.org/0000-0003-2982-388X", |
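A minimal sketch of the consistency check follows, assuming both files have already been parsed into Python data (the YAML and JSON loading is left out). The function names are hypothetical. Note that the check only reports whether the two sides agree; which of the two ORCIDs is the correct one for the author is something the contributor would need to confirm.

```python
ORCID_PREFIX = "https://orcid.org/"


def normalize_orcid(value: str) -> str:
    """Strip the URL prefix so bare IDs and full URLs compare equal."""
    value = value.strip()
    return value[len(ORCID_PREFIX):] if value.startswith(ORCID_PREFIX) else value


def orcids_match(dockstore_authors: list, workflow_creators: list) -> bool:
    """Compare the set of ORCIDs in .dockstore.yml authors with the .ga creators."""
    dockstore_ids = {
        normalize_orcid(a["orcid"]) for a in dockstore_authors if a.get("orcid")
    }
    creator_ids = {
        normalize_orcid(c["identifier"]) for c in workflow_creators if c.get("identifier")
    }
    return dockstore_ids == creator_ids


# Values as they appear in the two diffs quoted in this review
dockstore_authors = [{"name": "Santino Faack", "orcid": "0000-0003-2982-388X"}]
creators_before = [{"class": "Person", "identifier": "https://orcid.org/0009-0004-0382-2023"}]
creators_after = [{"class": "Person", "identifier": "https://orcid.org/0000-0003-2982-388X"}]

print(orcids_match(dockstore_authors, creators_before))  # False
print(orcids_match(dockstore_authors, creators_after))   # True
```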
| "owner": "iuc", | ||
| "tool_shed": "toolshed.g2.bx.psu.edu" | ||
| }, | ||
| "tool_state": "{\"advanced\": {\"min_perc_aa\": \"10\", \"force\": false, \"min_af\": \"0.65\", \"full_tree\": false, \"output_process_log\": false}, \"gtdbtk_db\": \"full_database_release_220_downloaded_2024-10-28\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
gtdbtk_db is hardcoded to a specific instance/data-manager entry ("full_database_release_220_downloaded_2024-10-28"). This makes the workflow less portable across Galaxy servers. Consider exposing the GTDB-Tk database as a workflow input (as done in workflows/microbiome/mags-building/MAGs-generation.ga, where gtdbtk_db is a ConnectedValue) and wiring it into this step.
| "tool_state": "{\"advanced\": {\"min_perc_aa\": \"10\", \"force\": false, \"min_af\": \"0.65\", \"full_tree\": false, \"output_process_log\": false}, \"gtdbtk_db\": \"full_database_release_220_downloaded_2024-10-28\", \"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", | |
| "tool_state": "{\"advanced\": {\"min_perc_aa\": \"10\", \"force\": false, \"min_af\": \"0.65\", \"full_tree\": false, \"output_process_log\": false}, \"gtdbtk_db\": {\"__class__\": \"ConnectedValue\"}, \"input\": {\"__class__\": \"ConnectedValue\"}, \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
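As a rough way to flag this kind of hardcoding, the sketch below tests whether a `tool_state` value is a `ConnectedValue` or a literal. The helper `is_connected` is hypothetical, and the two states are trimmed to the relevant keys from the lines quoted above.

```python
import json


def is_connected(value) -> bool:
    """True if a tool_state value is wired to an upstream step or workflow input."""
    return isinstance(value, dict) and value.get("__class__") == "ConnectedValue"


# Trimmed versions of the current and suggested tool_state
hardcoded = json.loads(
    '{"gtdbtk_db": "full_database_release_220_downloaded_2024-10-28", '
    '"input": {"__class__": "ConnectedValue"}}'
)
connected = json.loads(
    '{"gtdbtk_db": {"__class__": "ConnectedValue"}, '
    '"input": {"__class__": "ConnectedValue"}}'
)

print(is_connected(hardcoded["gtdbtk_db"]))  # False
print(is_connected(connected["gtdbtk_db"]))  # True
```

Changing `tool_state` alone is presumably not sufficient: the step would also need a matching input connection and a new workflow input for the database, which is easiest to set up in the workflow editor.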
| "owner": "iuc", | ||
| "tool_shed": "toolshed.g2.bx.psu.edu" | ||
| }, | ||
| "tool_state": "{\"comment\": \"\", \"export\": false, \"flat\": false, \"image_content_input\": {\"__class__\": \"RuntimeValue\"}, \"png_plots\": false, \"results\": [{\"__index__\": 0, \"software_cond\": {\"software\": \"gtdbtk\", \"__current_case__\": 46, \"input\": {\"__class__\": \"RuntimeValue\"}}}, {\"__index__\": 1, \"software_cond\": {\"software\": \"custom_content\", \"__current_case__\": 48, \"plot_type\": \"table\", \"section_name\": \"MGnify magtch table\", \"title\": null, \"description\": null, \"xlab\": null, \"ylab\": null, \"input\": {\"__class__\": \"RuntimeValue\"}}}], \"title\": \"\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
MultiQC custom section name contains a typo: MGnify magtch table → MGnify match table (or another accurate section title).
| "tool_state": "{\"comment\": \"\", \"export\": false, \"flat\": false, \"image_content_input\": {\"__class__\": \"RuntimeValue\"}, \"png_plots\": false, \"results\": [{\"__index__\": 0, \"software_cond\": {\"software\": \"gtdbtk\", \"__current_case__\": 46, \"input\": {\"__class__\": \"RuntimeValue\"}}}, {\"__index__\": 1, \"software_cond\": {\"software\": \"custom_content\", \"__current_case__\": 48, \"plot_type\": \"table\", \"section_name\": \"MGnify magtch table\", \"title\": null, \"description\": null, \"xlab\": null, \"ylab\": null, \"input\": {\"__class__\": \"RuntimeValue\"}}}], \"title\": \"\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}", | |
| "tool_state": "{\"comment\": \"\", \"export\": false, \"flat\": false, \"image_content_input\": {\"__class__\": \"RuntimeValue\"}, \"png_plots\": false, \"results\": [{\"__index__\": 0, \"software_cond\": {\"software\": \"gtdbtk\", \"__current_case__\": 46, \"input\": {\"__class__\": \"RuntimeValue\"}}}, {\"__index__\": 1, \"software_cond\": {\"software\": \"custom_content\", \"__current_case__\": 48, \"plot_type\": \"table\", \"section_name\": \"MGnify match table\", \"title\": null, \"description\": null, \"xlab\": null, \"ylab\": null, \"input\": {\"__class__\": \"RuntimeValue\"}}}], \"title\": \"\", \"__page__\": 0, \"__rerun_remap_job_id__\": null}", |
| - GTDB-Tk summary files(s) | ||
| - GTDB-NCBI mapping file(s) | ||
| - NCBI name to taxID mapping file(s) | ||
| - a full table of all mappings joint together | ||
| - MultiQC HTML report with GTBD-Tk and the full mapping table as input No newline at end of file |
Output section contains several typos (e.g., "files(s)", "joint together", and "GTBD-Tk" should be "GTDB-Tk"). Please fix these to avoid confusing users.
| - GTDB-Tk summary files(s) | |
| - GTDB-NCBI mapping file(s) | |
| - NCBI name to taxID mapping file(s) | |
| - a full table of all mappings joint together | |
| - MultiQC HTML report with GTBD-Tk and the full mapping table as input | |
| - GTDB-Tk summary file(s) | |
| - GTDB-NCBI mapping file(s) | |
| - NCBI name to taxID mapping file(s) | |
| - a full table of all mappings joined together | |
| - MultiQC HTML report with GTDB-Tk and the full mapping table as input |
FOR CONTRIBUTOR:
FOR REVIEWERS:
- This workflow does/runs/performs … xyz … to generate/analyze/etc …
- `name` field should be human readable (spaces are fine, no underscore, dash only where spelling dictates it), no abbreviation unless generally understood
- Prefer dash (-) over underscore (_), prefer all lowercase. Folder becomes repository in iwc-workflows organization and is included in TRS id