8 changes: 8 additions & 0 deletions CONTRIBUTORS.yaml
@@ -3083,6 +3083,14 @@ VerenaMoo:
name: Verena Moosmann
joined: 2024-12

vinisalazar:
name: Vini Salazar
joined: 2025-10
orcid: 0000-0002-8362-3195
affiliations:
- unimelb
- melbournebioinformatics

vivekbhr:
name: Vivek Bhardwaj
joined: 2017-09
@@ -0,0 +1,35 @@
## Bin contigs using COMEbin


**COMEbin** ({% cite Wang2024COMEBin %}) is a relatively new binner that has shown remarkably strong performance in recent benchmarking studies.

However, the tool has several notable limitations:
- **Dataset Size Constraints**: Its implementation is not optimized for small test datasets, making it unsuitable for inclusion in this tutorial.
- **Resource Intensity**: It demands significant computational resources and extended runtimes, which can be prohibitive.
- **Technical Instability**: The tool is prone to technical issues that may result in failed runs.


These problems cannot be resolved on the Galaxy side, and the tool is currently only lightly maintained upstream.

Nevertheless, because COMEbin can produce some of the best-performing bins when it runs successfully, we still mention it here. It may yield excellent results on real biological datasets and is available in Galaxy.

> <warning-title>Do not run COMEBin</warning-title>
>
> As explained above, due to its implementation, it cannot operate reliably on small test datasets, and therefore, we cannot include it in this tutorial. Do not run it on the tutorial dataset — it will fail.

>
{: .warning}


> <hands-on-title> Individual binning of short reads with COMEbin </hands-on-title>
>
> 1. {% tool [COMEBin](toolshed.g2.bx.psu.edu/repos/iuc/comebin/comebin/1.0.4+galaxy1) %} with the following parameters:
> - {% icon param-collection %} *"Metagenomic assembly file"*: `Contigs` (Input dataset collection)
> - {% icon param-file %} *"Input bam file(s)"*: `Reads` (output of **Samtools sort** {% icon tool %})
>
> > <comment-title> Parameters </comment-title>
> >
> > The batch size should be less than the number of contigs. But if this is the case for a batch size of 1014, your input data is likely too small to run with this tool!

> {: .comment}
>
{: .hands_on}
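The batch-size constraint mentioned in the comment above can be sanity-checked before launching a run. A minimal sketch in Python with a hypothetical helper that counts FASTA header lines; `count_fasta_records` and the example records are illustrations, not part of COMEBin:

```python
def count_fasta_records(lines):
    """Count sequences in FASTA text by counting header lines."""
    return sum(1 for line in lines if line.startswith(">"))

# The batch size must stay below the number of contigs; a tiny test
# assembly like this one is therefore far too small for COMEBin.
BATCH_SIZE = 1014  # the value mentioned in the comment above
fasta = [">contig_1", "ACGT", ">contig_2", "GGCC"]
print(count_fasta_records(fasta) < BATCH_SIZE)  # → True
```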


126 changes: 126 additions & 0 deletions topics/microbiome/tutorials/metagenomics-binning/concoct_version.md
@@ -0,0 +1,126 @@
## Bin contigs using CONCOCT


**CONCOCT** ({% cite Alneberg2014 %}) is an *unsupervised metagenomic binner* that groups contigs using both **sequence characteristics** and **differential coverage across multiple samples**. In contrast to SemiBin, it does **not** rely on pretrained models or marker-gene constraints; instead, it clusters contig fragments purely based on statistical similarities.


> CONCOCT jointly models contig abundance profiles from multiple samples using a Gaussian mixture model. By taking advantage of differences in coverage across samples, it can separate genomes with similar sequence composition but distinct abundance patterns. CONCOCT also introduced the now-standard technique of splitting contigs into fixed-length fragments, allowing more consistent and accurate clustering.
> {: .quote author="Alneberg et al., 2014" }

CONCOCT is widely used in metagenomic binning due to:

* **Unsupervised probabilistic clustering**: No marker genes, labels, or pretrained models are required.

* **Strong performance with multiple samples**: Differential coverage helps disentangle closely related genomes.

* **Reproducible, transparent workflow**: Its stepwise pipeline—fragmentation, coverage estimation, clustering—yields interpretable results.

* **Complementarity to other binners**: Frequently used alongside SemiBin, MetaBAT2, or MaxBin2 in ensemble pipelines (e.g., MetaWRAP, nf-core/mag).


### Why preprocessing steps (such as cutting contigs) are required

Before initiating the binning process with **CONCOCT**, the input data must be preprocessed to ensure compatibility with its Gaussian mixture model. This model treats each contig fragment as an individual data point, necessitating a critical preprocessing step: dividing contigs into **equal-sized fragments**, usually around 10 kb in length.

Fragmentation serves several essential purposes:

- **Balancing Influence**: It mitigates bias between long and short contigs, ensuring each contributes equally to the analysis.
- **Uniform Data Points**: It creates consistent data points, which are crucial for accurate statistical modeling.
- **Detecting Local Variations**: It helps identify potential misassemblies or variations within long contigs.
- **Enhanced Resolution**: It improves the detection of abundance differences across genomes, leading to more precise binning results.

This fragmentation step is mandatory for CONCOCT to operate effectively and deliver reliable results.
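The cutting behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not CONCOCT's actual `cut_up_fasta.py` (which also handles overlap and BED output): the sequence is split into fixed-size chunks and the short remainder is folded into the last chunk rather than emitted as an undersized fragment.

```python
def cut_up_contig(seq, chunk_size=10_000):
    """Split a sequence into fixed-size chunks; the trailing remainder
    is concatenated onto the last chunk instead of forming a short one."""
    if len(seq) <= chunk_size:
        return [seq]
    n_full = len(seq) // chunk_size
    chunks = [seq[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]
    # Fold the remainder into the final chunk
    chunks[-1] += seq[n_full * chunk_size:]
    return chunks

# A 25 kb contig becomes one 10 kb chunk plus one 15 kb chunk
print([len(c) for c in cut_up_contig("A" * 25_000)])  # → [10000, 15000]
```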



After fragmentation, CONCOCT calculates the **coverage of each fragment** across all samples. This coverage serves as a measure of abundance which, together with basic sequence features, is fed into CONCOCT's **Gaussian mixture model clustering**; the model groups fragments into bins based on their statistical similarities. Once fragments are clustered, the results are mapped back to the **original contigs**, allowing each contig to be assigned to its corresponding bin.
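Conceptually, the coverage table holds one row per contig fragment and one column per sample. A toy sketch of that idea, using a hypothetical helper that averages per-base depths (as produced e.g. by `samtools depth`) over each fragment's interval; this illustrates the concept, not the Galaxy tool's implementation:

```python
def fragment_coverage(per_base_depth, fragments):
    """Mean read depth per fragment.

    per_base_depth: dict mapping contig -> list of depths, one per base
    fragments:      list of (fragment_id, contig, start, end) intervals
    """
    table = {}
    for frag_id, contig, start, end in fragments:
        window = per_base_depth[contig][start:end]
        table[frag_id] = sum(window) / len(window)
    return table

# Two 5 bp fragments of one contig with uneven coverage
depths = {"contig_1": [2, 2, 2, 2, 2, 8, 8, 8, 8, 8]}
frags = [("contig_1.0", "contig_1", 0, 5), ("contig_1.1", "contig_1", 5, 10)]
print(fragment_coverage(depths, frags))  # → {'contig_1.0': 2.0, 'contig_1.1': 8.0}
```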


While **CONCOCT** generates a table mapping contigs to their respective bins, it does not automatically produce **FASTA files** for each bin. To obtain these sequences for further analysis, users must employ the **`CONCOCT: Extract a FASTA file`** utility. This tool combines the original contig FASTA file with CONCOCT’s clustering results, extracts contigs assigned to a specific bin, and outputs a **FASTA file representing a single metagenome-assembled genome (MAG)**. This step is crucial for enabling downstream genomic analyses.


### Bin contigs using CONCOCT

> <hands-on-title> Cut up contigs </hands-on-title>
>
> In this step we fragment the assembled contigs into fixed-length pieces, which CONCOCT requires for stable and consistent clustering.
>
> 1. {% tool [CONCOCT: Cut up contigs](toolshed.g2.bx.psu.edu/repos/iuc/concoct_cut_up_fasta/concoct_cut_up_fasta/1.1.0+galaxy2) %} with the following parameters:
>
> * {% icon param-collection %} *"Fasta contigs file"*: `Contigs` (Input dataset collection)
>
> * *"Concatenate final part to last contig?"*: `Yes`
>
> * *"Output bed file with exact regions of the original contigs corresponding to the newly created contigs?"*: `Yes`
>
> > <comment-title> Why this step? </comment-title>
> >
> > CONCOCT requires contigs to be split into equal-sized fragments. This prevents long contigs from dominating the clustering and increases resolution by allowing variation inside long contigs to be captured.
> {: .comment}
{: .hands_on}

> <hands-on-title> Generate coverage table </hands-on-title>
>
> This step computes coverage values for each contig fragment across all samples. CONCOCT uses these differential coverage profiles as one of the main signals for clustering.
>
> 1. {% tool [CONCOCT: Generate the input coverage table](toolshed.g2.bx.psu.edu/repos/iuc/concoct_coverage_table/concoct_coverage_table/1.1.0+galaxy2) %} with the following parameters:
>
> * {% icon param-file %} *"Contigs BEDFile"*: `output_bed` (output of **CONCOCT: Cut up contigs** {% icon tool %})
> * *"Type of assembly used to generate the contigs"*: `Individual assembly: 1 run per BAM file`
>
> * {% icon param-file %} *"Sorted BAM file"*: `output1` (output of **Samtools sort** {% icon tool %})
>
> > <comment-title> Why this step? </comment-title>
> >
> > CONCOCT relies on variation in abundance across samples. The coverage table generated here provides this information and is essential for identifying contigs that co-vary in abundance.
> {: .comment}
{: .hands_on}

> <hands-on-title> Run CONCOCT </hands-on-title>
>
> Here we perform the actual CONCOCT clustering. Using both coverage and sequence information, CONCOCT assigns contig fragments to genome bins.
>
> 1. {% tool [CONCOCT](toolshed.g2.bx.psu.edu/repos/iuc/concoct/concoct/1.1.0+galaxy2) %} with the following parameters:
>
> * {% icon param-file %} *"Coverage file"*: `output` (output of **CONCOCT: Generate the input coverage table** {% icon tool %})
> * {% icon param-file %} *"Composition file with sequences"*: `output_fasta` (output of **CONCOCT: Cut up contigs** {% icon tool %})
> * In *"Advanced options"*:
>
> * *"Read length for coverage"*: `{'id': 1, 'output_name': 'output'}`
>
> > <comment-title> Why this step? </comment-title>
> >
> > This is the core of the CONCOCT workflow. The Gaussian mixture model groups contig fragments into clusters representing draft genomes (bins).
> {: .comment}
{: .hands_on}


> <hands-on-title> Merge fragment clusters </hands-on-title>
>
> Since CONCOCT clusters the **fragments**, we must merge them back to produce cluster assignments for the original contigs.
>
> 1. {% tool [CONCOCT: Merge cut clusters](toolshed.g2.bx.psu.edu/repos/iuc/concoct_merge_cut_up_clustering/concoct_merge_cut_up_clustering/1.1.0+galaxy2) %} with the following parameters:
>
> * {% icon param-file %} *"Clusters generated by CONCOCT"*: `output_clustering` (output of **CONCOCT** {% icon tool %})
>
> > <comment-title> Why this step? </comment-title>
> >
> > This step translates fragment-level cluster assignments into contig-level bin assignments—necessary for producing actual MAGs.
> {: .comment}
{: .hands_on}
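Merging works as a majority vote: each original contig inherits the most common cluster label among its fragments. A simplified sketch of the idea behind CONCOCT's merge script, assuming fragment IDs of the form `<contig>.<part>` (CONCOCT's real fragment naming differs):

```python
from collections import Counter

def merge_fragment_clusters(fragment_clusters):
    """Assign each contig the most common cluster among its fragments.

    fragment_clusters: dict mapping fragment ID '<contig>.<part>' -> cluster ID
    """
    votes = {}
    for frag_id, cluster in fragment_clusters.items():
        contig = frag_id.rsplit(".", 1)[0]
        votes.setdefault(contig, Counter())[cluster] += 1
    return {contig: counts.most_common(1)[0][0]
            for contig, counts in votes.items()}

# c1 has two fragments in cluster 3 and one in cluster 7 -> c1 goes to 3
frags = {"c1.0": 3, "c1.1": 3, "c1.2": 7, "c2.0": 7}
print(merge_fragment_clusters(frags))  # → {'c1': 3, 'c2': 7}
```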



> <hands-on-title> Extract MAG FASTA files </hands-on-title>
>
> In this final step we extract the contigs belonging to each bin and create FASTA files representing the reconstructed genomes (MAGs).
>
> 1. {% tool [CONCOCT: Extract a fasta file](toolshed.g2.bx.psu.edu/repos/iuc/concoct_extract_fasta_bins/concoct_extract_fasta_bins/1.1.0+galaxy2) %} with the following parameters:
>
> * {% icon param-collection %} *"Original contig file"*: `output` (Input dataset collection)
>
> * {% icon param-file %} *"CONCOCT clusters"*: `output` (output of **CONCOCT: Merge cut clusters** {% icon tool %})
>
> > <comment-title> Why this step? </comment-title>
> >
> > This tool extracts the contigs belonging to each CONCOCT cluster and outputs them as FASTA files. These represent your preliminary MAGs and can now be evaluated and refined.
> {: .comment}

{: .hands_on}

> <question-title>Binning metrics</question-title>
>
> 1. How many bins were produced by CONCOCT for our sample?
> 2. How many contigs are in the bin with the most contigs?
> > <solution-title></solution-title>
> >
> > 1. There are 10 bins for this sample.
> > 2. 50, while all other bins contain only one contig each!
> >
> {: .solution}
>
{: .question}
@@ -0,0 +1,50 @@
## Bin contigs using MaxBin2


**MaxBin2** ({% cite maxbin2015 %}) is an automated metagenomic binning tool that uses an Expectation-Maximization algorithm to group contigs into genome bins based on abundance, tetranucleotide frequency, and single-copy marker genes.



The first step when using tools like MetaBAT or MaxBin2 is to compute contig depths from the raw alignment data. Both tools require per-contig depth tables as input, as their binning algorithms rely on summarized coverage statistics at the contig level. However, standard BAM files store read-level alignment information, which must first be processed to generate the necessary contig-level coverage data. This preprocessing step ensures compatibility with the input requirements of these binning tools.
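The summarization can be pictured as averaging aligned bases over each contig. A toy sketch with a hypothetical helper (the actual depth tool also reports per-sample depth variance and applies alignment filters):

```python
def contig_mean_depth(contig_lengths, alignments):
    """Mean coverage per contig from read alignment intervals.

    contig_lengths: dict mapping contig -> length in bp
    alignments:     list of (contig, start, end) read placements
                    (0-based, end-exclusive)
    """
    aligned = {c: 0 for c in contig_lengths}
    for contig, start, end in alignments:
        aligned[contig] += end - start          # total aligned bases
    return {c: aligned[c] / contig_lengths[c] for c in contig_lengths}

# Two 50 bp reads tile c1 completely; one 100 bp read half-covers c2
reads = [("c1", 0, 50), ("c1", 50, 100), ("c2", 0, 100)]
print(contig_mean_depth({"c1": 100, "c2": 200}, reads))  # → {'c1': 1.0, 'c2': 0.5}
```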

> <hands-on-title> Calculate contig depths </hands-on-title>
>
> 1. {% tool [Calculate contig depths](toolshed.g2.bx.psu.edu/repos/iuc/metabat2_jgi_summarize_bam_contig_depths/metabat2_jgi_summarize_bam_contig_depths/2.17+galaxy0) %} with the following parameters:
> - *"Mode to process BAM files"*: `One by one`
> - {% icon param-file %} *"Sorted bam files"*: output of **Samtools sort** {% icon tool %}
> - *"Select a reference genome?"*: `No`
>
> > <comment-title> Why not use bam directly </comment-title>
> >
> > MetaBAT and MaxBin2 only accept per-contig depth tables because that is the specific input format their binning algorithm requires.
> > BAM files contain read-level alignment data.
> > These binners need summarized, contig-level coverage statistics.
> {: .comment}

>
{: .hands_on}
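
To see what the binners actually consume, here is a small sketch of reading a depth table like the one **Calculate contig depths** produces. The column names and values below are illustrative mock data; check the header of your own output file for the exact layout.

```python
import csv
import io

# A mock depth table in the layout produced by jgi_summarize_bam_contig_depths:
# contig name, length, total average depth, then one depth (and variance)
# column per BAM file. Values here are invented for illustration.
depth_txt = """contigName\tcontigLen\ttotalAvgDepth\tsample1.bam\tsample1.bam-var
contig_1\t15000\t12.4\t12.4\t3.1
contig_2\t8200\t41.7\t41.7\t9.8
contig_3\t4700\t2.3\t2.3\t0.4
"""

rows = list(csv.DictReader(io.StringIO(depth_txt), delimiter="\t"))

# Per-contig coverage is what the binners cluster on; e.g. we can flag
# low-coverage contigs, which often end up unbinned:
low_cov = [r["contigName"] for r in rows if float(r["totalAvgDepth"]) < 5]
print(low_cov)  # ['contig_3']
```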

We can now launch the proper binning with MaxBin2.

> <hands-on-title> Individual binning of short-reads with MaxBin2 </hands-on-title>
>
> 1. {% tool [MaxBin2](toolshed.g2.bx.psu.edu/repos/mbernt/maxbin2/maxbin2/2.2.7+galaxy6) %} with the following parameters:
> - {% icon param-collection %} *"Contig file"*: `Contigs` (Input dataset collection)
> - *"Assembly type used to generate contig(s)"*: `Assembly of sample(s) one by one (individual assembly)`
> - *"Input type"*: `Abundances`
> - {% icon param-file %} *"Abundance file"*: `outputDepth` (output of **Calculate contig depths** {% icon tool %})
> - In *"Outputs"*:
> - *"Generate visualization of the marker gene presence numbers"*: `Yes`
> - *"Output marker gene presence for bins table"*: `Yes`
> - *"Output marker genes for each bin as fasta"*: `Yes`
> - *"Output log"*: `Yes`
>
>
{: .hands_on}

> <question-title>Binning metrics</question-title>
>
> 1. How many bins were produced by MaxBin2 for our sample?
> 2. How many contigs are in the bin with most contigs?
> > <solution-title></solution-title>
> >
> > 1. There are two bins for this sample.
> > 2. 35 contigs in the largest bin, and 24 in the other.
> >
> {: .solution}
>
{: .question}
## Bin contigs using MetaBAT 2

**MetaBAT** ({% cite Kang2019 %}) stands for "Metagenome Binning based on Abundance and Tetranucleotide frequency". It is:

> Grouping large fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Here we developed automated metagenome binning software, called MetaBAT, which integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency. On synthetic datasets MetaBAT on average achieves 98% precision and 90% recall at the strain level with 281 near complete unique genomes. Applying MetaBAT to a human gut microbiome data set we recovered 176 genome bins with 92% precision and 80% recall. Further analyses suggest MetaBAT is able to recover genome fragments missed in reference genomes up to 19%, while 53 genome bins are novel. In summary, we believe MetaBAT is a powerful tool to facilitate comprehensive understanding of complex microbial communities.
{: .quote author="Kang et al, 2019" }

MetaBAT is a popular software tool for metagenomics binning, and there are several reasons why it is often used:
- *High accuracy*: MetaBAT uses a combination of tetranucleotide frequency, coverage depth, and read linkage information to bin contigs, which has been shown to be highly accurate and efficient.
- *Easy to use*: MetaBAT has a user-friendly interface and can be run on a standard desktop computer, making it accessible to a wide range of researchers with varying levels of computational expertise.
- *Flexibility*: MetaBAT can be used with a variety of sequencing technologies, including Illumina, PacBio, and Nanopore, and can be applied to both microbial and viral metagenomes.
- *Scalability*: MetaBAT can handle large-scale datasets, and its performance has been shown to improve with increasing sequencing depth.
- *Compatibility*: MetaBAT outputs MAGs in standard formats that can be easily integrated into downstream analyses and tools, such as taxonomic annotation and functional prediction.
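
The "Tetranucleotide frequency" half of this signal can be sketched in a few lines: each contig gets a frequency vector over all 256 tetramers, and contigs from the same genome tend to have similar vectors. This simplified sketch skips the reverse-complement canonicalisation that real binners perform.

```python
from collections import Counter
from itertools import product

def tetranucleotide_freqs(seq):
    """Frequency vector over all 256 tetramers (simplified: no
    reverse-complement merging, unlike real binners)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Ignore windows containing ambiguous bases (anything outside ACGT).
    total = sum(v for k, v in counts.items() if set(k) <= set("ACGT")) or 1
    return {"".join(t): counts["".join(t)] / total
            for t in product("ACGT", repeat=4)}

# Two toy contigs with very different composition give different profiles:
a = tetranucleotide_freqs("ATATATATATATATATATAT")
b = tetranucleotide_freqs("GCGCGCGCGCGCGCGCGCGC")
print(a["ATAT"], b["GCGC"])  # high in their own contig, zero in the other
```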

The first step when using tools like MetaBAT or MaxBin2 is to compute contig depths from the raw alignment data. Both tools require per-contig depth tables as input, as their binning algorithms rely on summarized coverage statistics at the contig level. However, standard BAM files store read-level alignment information, which must first be processed to generate the necessary contig-level coverage data. This preprocessing step ensures compatibility with the input requirements of these binning tools.

> <hands-on-title> Calculate contig depths </hands-on-title>
>
> 1. {% tool [Calculate contig depths](toolshed.g2.bx.psu.edu/repos/iuc/metabat2_jgi_summarize_bam_contig_depths/metabat2_jgi_summarize_bam_contig_depths/2.17+galaxy0) %} with the following parameters:
> - *"Mode to process BAM files"*: `One by one`
> - {% icon param-file %} *"Sorted bam files"*: output of **Samtools sort** {% icon tool %}
> - *"Select a reference genome?"*: `No`
>
> > <comment-title> Why not use BAM files directly? </comment-title>
> >
> > MetaBAT only accepts per-contig depth tables because that is the specific input format its binning algorithm requires.
> > BAM files contain read-level alignment data.
> > MetaBAT needs summarized, contig-level coverage statistics. This is also the case for MaxBin2.
> {: .comment}
>
{: .hands_on}

We can now launch the proper binning with MetaBAT 2.

> <hands-on-title>Individual binning of short-reads with MetaBAT 2</hands-on-title>
> 1. {% tool [MetaBAT 2](toolshed.g2.bx.psu.edu/repos/iuc/metabat2/metabat2/2.17+galaxy0) %} with parameters:
> - *"Fasta file containing contigs"*: `Contigs`
> - In **Advanced options**, keep all as **default**.
> - In **Output options:**
> - *"Save cluster memberships as a matrix format?"*: `Yes`
>
{: .hands_on}

MetaBAT 2 generates several output files during its execution, some of which are optional and only produced when explicitly requested by the user. These files include:

1. The final set of genome bins in FASTA format (`.fa`)
2. A summary file with information on each genome bin, including its length, completeness, contamination, and taxonomy classification (`.txt`)
3. A file with the mapping results showing how each contig was assigned to a genome bin (`.bam`)
4. A file containing the abundance estimation of each genome bin (`.txt`)
5. A file with the coverage profile of each genome bin (`.txt`)
6. A file containing the nucleotide composition of each genome bin (`.txt`)
7. A file with the predicted gene sequences of each genome bin (`.faa`)

These output files can be further analyzed and used for downstream applications such as functional annotation, comparative genomics, and phylogenetic analysis.
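
For example, answering "how many contigs are in each bin" only requires counting FASTA headers in the bin files. The snippet below uses an inline mock FASTA for illustration; with real MetaBAT 2 output you would read the `.fa` bin files instead.

```python
# Count contigs in a bin by counting FASTA header lines (">").
# The sequence content below is invented mock data.
def count_contigs(fasta_text):
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))

bin1 = """>contig_1
ATGCATGC
>contig_2
GGGTTTAA
"""
print(count_contigs(bin1))  # 2
```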

> <question-title>Binning metrics</question-title>
>
> 1. How many bins were produced by MetaBAT 2 for our sample?
> 2. How many contigs are in the bin with most contigs?
> > <solution-title></solution-title>
> >
> > 1. There is only one bin for this sample.
> > 2. 52 (these numbers may differ slightly depending on the version of MetaBAT 2). So not all contigs were binned into this bin!
> >
> {: .solution}
>
{: .question}