topics/microbiome/tutorials/metagenomics-assembly/tutorial.md
+45 -36 (45 additions & 36 deletions)
@@ -11,21 +11,18 @@ questions:
11
11
- "How tools based on De Bruijn graph work?"
12
12
- "How to assess the quality of metagenomic data assembly?"
13
13
objectives:
14
-
- "Describe what an assembly is"
15
-
- "Describe what de-replication is"
16
-
- "Explain the difference between co-assembly and individual assembly"
17
-
- "Explain the difference between reads, contigs and scaffolds"
18
-
- "Explain how tools based on De Bruijn graph work"
19
-
- "Apply appropriate tools for analyzing the quality of metagenomic data"
20
-
- "Construct and apply simple assembly pipelines on short read data"
21
-
- "Apply appropriate tools for analyzing the quality of metagenomic assembly"
22
-
- "Evaluate the Quality of the Assembly with Quast, Bowtie2, and CoverM-Contig"
14
+
- "Describe what an assembly is."
15
+
- "Explain the difference between co-assembly and individual assembly."
16
+
- "Explain the difference between reads, contigs and scaffolds."
17
+
- "Explain how tools based on de Bruijn graph work."
18
+
- "Evaluate the Quality of the Assembly with QUAST, Bowtie2, and CoverM-Contig."
19
+
- "Construct and apply simple assembly pipelines on short read data."
23
20
time_estimation: "2H"
24
21
key_points:
25
-
- "Assembly groups reads into contigs and scafolds."
26
-
- "De Brujin Graphs use k-mers to assembly reads"
27
-
- "MetaSPAdes and MEGAHIT are assemblers"
28
-
- "Quast is the tool to assess the assembly quality"
22
+
- "Assembly groups reads into contigs and scaffolds."
23
+
- "De Bruijn graphs use k-mers to assemble reads."
24
+
- "MetaSPAdes and MEGAHIT are short-read assemblers."
25
+
- "MetaQUAST is a tool to assess metagenomic assembly quality."
29
26
edam_ontology:
30
27
- topic_3174 # Metagenomics
31
28
- topic_0196 # Sequence assembly
@@ -159,11 +156,11 @@ In case of a not very large dataset it's more convenient to upload data directly
159
156
160
157
As explained before, there are many challenges to metagenomics assembly, including:
161
158
162
-
1. differences in coverage between samples, resulting from differences in abundance,
163
-
2. the fact that different species often share conserved regions ({%cite kececioglu2001%}), and
164
-
3. the presence of multiple strains of a single species ({%cite miller2010%}).
159
+
1. Differences in coverage between samples, resulting from differences in abundance;
160
+
2. The fact that different species often share conserved regions ({%cite kececioglu2001%}), and
161
+
3. The presence of multiple strains of a single species ({%cite miller2010%}).
165
162
166
-
To reduce the differences in coverage between samples, we can use a **co-assembly** approach, where reads from all samples are aligned together.:
163
+
To reduce the differences in coverage between samples, we can use a **co-assembly** approach, where reads from all samples are aligned together:
167
164
168
165
{:width="60%"}
169
166
@@ -185,27 +182,37 @@ In these cases, co-assembly is reasonable if:
185
182
- Longitudinal sampling of the same site
186
183
- Related samples
187
184
188
-
If it is not the case, **individual assembly** should be prefered. In this case, an extra step of **de-replication** should be used:
185
+
Examples where co-assembly would be reasonable:
186
+
- Repeated sampling of the **same patient** over a period of time.
187
+
- Multiple samples taken from the **same site** under **similar environmental conditions**, e.g. a patch of soil during the same sampling season.
188
+
189
+
Examples where co-assembly would NOT be recommended:
190
+
- Samples from different patients.
191
+
- Samples from the same site, but over different seasons or under different environmental conditions, e.g. a patch of soil before and after a bushfire event, or a marine site under upwelling vs. under normal conditions.
192
+
193
+
If samples differ as described above, **individual assembly** is preferred. In the case of individual assembly, if **contigs are binned** afterwards, an extra step of **de-replication** should be used:
189
194
190
195
{:width="80%"}
191
196
192
-
Co-assembly is more commonly used than individual assembly and then de-replication after binning. But in this tutorial, to show all steps, we will run an **individual assembly**.
197
+
For more information on de-replication, check out the [metagenomic binning tutorial](../metagenomics-binning/tutorial.md).
193
198
194
-
> <comment-title></comment-title>
195
-
> Sometimes it is important to run assembly tools both on individual samples and on all pooled samples, and use both outputs to get the better outputs for the certain dataset.
199
+
In this tutorial, to show all steps, we will run an **individual assembly**.
200
+
201
+
> <comment-title>Why not both?</comment-title>
202
+
> Sometimes it is important to run both individual assembly and co-assembly, and use both outputs to get better results for that dataset.
196
203
{: .comment}
197
204
198
205
As mentioned in the introduction, several tools are available for metagenomic assembly. But 2 are the most used ones:
199
206
200
-
- **MetaSPAdes** ({%cite nurk2017%}): an short-read assembler designed specifically for large and complex metagenomics datasets
207
+
- **MetaSPAdes** ({%cite nurk2017%}): a short-read assembler designed specifically for large and complex metagenomics datasets.
201
208
202
209
MetaSPAdes is part of the SPAdes toolkit, which has several assembly pipelines. Since SPAdes handles non-uniform coverage, it is useful for assembling simple communities, but metaSPAdes also handles other problems, allowing it to assemble complex communities' metagenomes.
203
210
204
211
As input for metaSPAdes it can accept short reads. However, there is an option to use additionally long reads besides short reads to produce hybrid input.
205
212
206
213
- **MEGAHIT** ({% cite li2015 %}): a single node assembler for large and complex metagenomics NGS reads, such as soil
207
214
208
-
It makes use of succinct de Bruijn graph (SdBG) to achieve low memory assembly.
215
+
It makes use of the Succinct de Bruijn Graph (SdBG) approach to achieve low memory assembly.
209
216
210
217
Both tools are available in Galaxy. But currently, only MEGAHIT can be used in individual mode for several samples.
211
218
@@ -227,9 +234,11 @@ Both tools are available in Galaxy. But currently, only MEGAHIT can be used in i
227
234
>
228
235
{: .hands_on}
229
236
230
-
**MEGAHIT** produced a collection of output assemblies - one per sample - that can be proceeded further in binning step and then de-replication. The output contains **contigs**, contiguous lengths of genomic sequences in which bases are known to a high degree of certainty.
237
+
**MEGAHIT** produced a collection of output assemblies - one per sample - that can be used for the subsequent step of **metagenomic binning**. The output contains **contigs**, contiguous lengths of genomic sequences in which bases are known to a high degree of certainty.
231
238
232
-
Contrary to **MetaSPAdes**, **MEGAHIT** does not output **scaffolds**, i.e. segments of genome sequence reconstructed fron contigs and gaps. The gaps occur when reads from the two sequenced ends of at least one fragment overlap with other reads from two different contigs (as long as the arrangement is otherwise consistent with the contigs being adjacent). It is possible to estimate the number of bases between contigs based on fragment lengths.
239
+
> <comment-title>Scaffolds</comment-title>
240
+
> Contrary to **MetaSPAdes**, **MEGAHIT** does not output **scaffolds**. **Scaffolds** are segments of genome sequence reconstructed from contigs and gaps. The gaps occur when reads from the two sequenced ends of at least one fragment overlap with other reads from two different contigs (as long as the arrangement is otherwise consistent with the contigs being adjacent). It is possible to estimate the number of bases between contigs based on fragment lengths.
241
+
{: .comment}
233
242
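The gap-size estimate mentioned above can be illustrated with a minimal Python sketch. It is not how MetaSPAdes or any specific scaffolder implements this step; the function names, the coordinate convention (0-based positions, read 1 on contig A pointing towards contig B) and the use of a single mean fragment length are all assumptions made for illustration.

```python
import statistics


def estimate_gap(fragment_length, read1_start_in_a, contig_a_length, read2_end_in_b):
    """Estimate the number of bases between contig A and contig B from one read pair.

    Hypothetical convention: read 1 starts at `read1_start_in_a` on contig A,
    and read 2 ends at `read2_end_in_b` on contig B (0-based, end exclusive).
    """
    # Part of the fragment that lies on contig A (from read 1 start to the end of A).
    bases_in_a = contig_a_length - read1_start_in_a
    # Part of the fragment that lies on contig B (from the start of B to read 2 end).
    bases_in_b = read2_end_in_b
    # Whatever is left of the fragment must span the gap.
    return fragment_length - bases_in_a - bases_in_b


def estimate_gap_from_pairs(pairs, mean_fragment_length, contig_a_length):
    """Combine per-pair estimates with the median to damp fragment-length variation."""
    estimates = [
        estimate_gap(mean_fragment_length, start_a, contig_a_length, end_b)
        for start_a, end_b in pairs
    ]
    return statistics.median(estimates)


# Illustrative values only: 500 bp fragments, contig A of 1000 bp.
single = estimate_gap(500, 800, 1000, 150)
combined = estimate_gap_from_pairs([(800, 150), (850, 120), (780, 160)], 500, 1000)
```

The median is used here only as one simple way to combine several read pairs; a negative estimate would suggest that the two contigs overlap rather than being separated by a gap.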
234
243
> <comment-title></comment-title>
235
244
>
@@ -249,7 +258,7 @@ Contrary to **MetaSPAdes**, **MEGAHIT** does not output **scaffolds**, i.e. segm
249
258
> > ```
250
259
> >
251
260
> >
252
-
> > 2. Create a collection named `MEGAHIT Contig`, rename your pairs with the sample name
261
+
> > 2. Create a collection named `MEGAHIT Contigs`, rename your pairs with the sample name
253
262
> >
254
263
> {: .hands_on}
255
264
{: .comment}
@@ -290,7 +299,7 @@ Assemblies can be evaluated with **metaQUAST** ({%cite mikheenko2016%}), the met
290
299
291
300
> <hands-on-title>Evaluation assembly quality with metaQUAST</hands-on-title>
292
301
>
293
-
> 1. {% tool [Quast](toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.2.0+galaxy1) %} with parameters:
302
+
> 1. {% tool [QUAST](toolshed.g2.bx.psu.edu/repos/iuc/quast/quast/5.2.0+galaxy1) %} with parameters:
@@ -327,7 +336,7 @@ Assemblies can be evaluated with **metaQUAST** ({%cite mikheenko2016%}), the met
327
336
> {: .hands_on}
328
337
{: .comment}
329
338
330
-
Quast main output are HTML reports which aggregate different metrics.
339
+
QUAST's main outputs are HTML reports that aggregate different metrics.
331
340
332
341
## Assembly statistics
333
342
@@ -339,7 +348,7 @@ On the top of each report is a table with in rows statistics for contigs larger
339
348
340
349
A base in the reference genome is counted as aligned if at least one contig has at least one alignment to this base.
341
350
342
-
We did not provide any reference there, but metaQuast try to identify genome content of the metagenome by aligning contigs to [SILVA](https://www.arb-silva.de/) 16S rRNA database. For each assembly, 50 reference genomes with top scores are chosen. The full reference genomes of the identified organisms are afterwards downloaded from NCBI to map the assemblies on them and compute the genome fractions.
351
+
We did not provide any reference there, but metaQUAST tries to identify the genome content of the metagenome by aligning contigs to the [SILVA](https://www.arb-silva.de/) 16S rRNA database. For each assembly, the 50 reference genomes with the top scores are chosen. The full reference genomes of the identified organisms are afterwards downloaded from NCBI so that the assemblies can be mapped to them and the genome fractions computed.
343
352
344
353
For each identified genomes, the genome fraction is given when clicking on **Genome fraction (%)**
345
354
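To make the genome fraction definition concrete, here is a minimal Python sketch. It only illustrates the counting rule stated above (a reference base counts as aligned if at least one alignment covers it) and is not metaQUAST's actual implementation; the function name and the 0-based, end-exclusive interval format are assumptions.

```python
def genome_fraction(reference_length, alignments):
    """Return the percentage of reference bases covered by at least one alignment.

    `alignments` is an iterable of (start, end) pairs on the reference,
    0-based with `end` exclusive (a hypothetical format for this sketch).
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    covered = 0
    current_end = 0  # rightmost reference position already counted
    for start, end in sorted(alignments):
        # Skip the part of this alignment that was already counted,
        # and clip the alignment to the reference boundaries.
        start = max(start, current_end, 0)
        end = min(end, reference_length)
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / reference_length


# Illustrative values only: three contig alignments on a 100 bp reference,
# two of which overlap, so overlapping bases are counted once.
fraction = genome_fraction(100, [(0, 30), (20, 50), (70, 80)])
```

Sorting the alignments and tracking the rightmost counted position ensures that bases covered by several contigs contribute only once to the fraction.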
@@ -456,7 +465,7 @@ On the top of each report is a table with in rows statistics for contigs larger
456
465
457
466
3. **Misassemblies**: joining sequences that should not be adjacent.
458
467
459
-
Quast identifies missassemblies by mapping the contigs to the reference genomes of the identified organisms. 3 types of misassemblies can be identified:
468
+
QUAST identifies misassemblies by mapping the contigs to the reference genomes of the identified organisms. Three types of misassemblies can be identified:
460
469
461
470
{:width="60%"}
462
471
@@ -781,8 +790,8 @@ Metagenomic data can be assembled to, ideally, obtain the genomes of the species
781
790
- **different tools** like MetaSPAdes and MEGAHIT
782
791
783
792
Once the choices made, metagenomic assembly can start:
784
-
1. Input data are assembled to obtain contigs and sometimes scaffolds
785
-
2. Assembly quality is evaluated with various metrics
793
+
1. Input data are assembled to obtain contigs and sometimes scaffolds.
794
+
2. Assembly quality is evaluated with various metrics.
786
795
3. The assembly graph can be visualized.
787
796
788
-
Once all these steps done, we can move to the next phase to build Metagenomics Assembled Genomes (MAGs): binning
797
+
Once all these steps are done, we can move to the next phase to build Metagenome-Assembled Genomes (MAGs): [metagenomic binning](../metagenomics-binning/tutorial.md).