Commit c8f18cf

Merge branch 'swarm3'
2 parents 02ad79a + 9aa56c5 commit c8f18cf

39 files changed, with 5168 additions and 4006 deletions

.travis.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+language: c++
+
+os: linux
+
+dist: bionic
+
+compiler: gcc
+
+before_install:
+- sudo apt-get install -y valgrind
+
+script:
+- make
+- export PATH=$PWD/bin:$PATH
+- git clone https://github.com/frederic-mahe/swarm-tests.git && cd swarm-tests && bash ./run_all_tests.sh | tee tests.log && ! grep -q FAIL tests.log
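
Review note: the last `script` step succeeds only when `tests.log` contains no `FAIL`. The same check can be sketched in Python, assuming the log is plain text and that a failed test is marked by the substring `FAIL`; the function name `log_has_failures` is hypothetical and not part of swarm or swarm-tests.

```python
def log_has_failures(log_text, marker="FAIL"):
    """Return True if any line of the log contains the failure marker.

    Mirrors `grep -q FAIL tests.log`: a plain substring match on each
    line (the marker value is taken from the CI script above).
    """
    return any(marker in line for line in log_text.splitlines())


if __name__ == "__main__":
    sample_log = "PASS: test 1\nFAIL: test 2\n"
    # A CI step would exit non-zero when a failure is found.
    print("failures found" if log_has_failures(sample_log) else "all passed")
```

Without `pipefail`, the exit status of a shell pipeline is that of its last command (here `tee`), which is presumably why the CI step greps the log rather than relying on the status of `run_all_tests.sh`.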

README.md

Lines changed: 31 additions & 27 deletions
@@ -1,3 +1,5 @@
+[![Build Status](https://travis-ci.org/torognes/swarm.svg?branch=swarm3)](https://travis-ci.org/torognes/swarm)
+
 # swarm
 
 A robust and fast clustering method for amplicon-based studies.
@@ -16,21 +18,32 @@ To help users, we describe
 starting from raw fastq files, clustering with **swarm** and producing
 a filtered OTU table.
 
-swarm 2.0 introduces several novelties and improvements over swarm
+swarm 3.0 introduces:
+* a much faster default algorithm,
+* a reduced memory footprint,
+* binaries for Windows x86-64, GNU/Linux ARM 64, and GNU/Linux POWER8,
+* an updated, hardened, and thoroughly tested code.
+
+Please note that:
+* strict dereplication of input sequences is now mandatory,
+* \-\-seeds option (\-w) now outputs results sorted by decreasing
+abundance, and then by alphabetical order of sequence labels.
+
+swarm 2.0 introduced several novelties and improvements over swarm
 1.0:
 * built-in breaking phase now performed automatically,
 * possibility to output OTU representatives in fasta format (option
 `-w`),
 * fast algorithm now used by default for *d* = 1 (linear time
 complexity),
 * a new option called *fastidious* that refines *d* = 1 results and
-reduces the number of small OTUs,
+reduces the number of small OTUs.
 
 ## Common misconceptions
 
 **swarm** is a single-linkage clustering method, with some superficial
-similarities with other clustering methods (e.g.,
-[Huse et al, 2010](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909393/)). **swarm**'s
+similarities with other clustering methods (e.g., [Huse et al,
+2010](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909393/)). **swarm**'s
 novelty is its iterative growth process and the use of sequence
 abundance values to delineate OTUs. **swarm** properly delineates
 large OTUs (high recall), and can distinguish OTUs with as little as
@@ -76,8 +89,8 @@ cgtcgtcgtcgtcgt
 
 where sequence identifiers are unique and end with a value indicating
 the number of occurrences of the sequence (e.g., `_1000`). Alternative
-format is possible with the option `-z`, please see the
-[user manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf). Swarm
+format is possible with the option `-z`, please see the [user
+manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf). Swarm
 **requires** each fasta entry to present a number of occurrences to
 work properly. That crucial information can be produced during the
 [dereplication](#dereplication-mandatory) step.
@@ -87,7 +100,7 @@ Use `swarm -h` to get a short help, or see the
 for a complete description of input/output formats and command line
 options.
 
-The memory footprint of **swarm** is roughly 1.6 times the size of the
+The memory footprint of **swarm** is roughly 0.6 times the size of the
 input fasta file. When using the fastidious option, memory footprint
 can increase significantly. See options `-c` and `-y` to control and
 cap swarm's memory consumption.
@@ -210,15 +223,10 @@ from two different sets have the same hash code, it means that the
 sequences they represent are identical.
 
 If for some reason your fasta entries don't have abundance values, and
-you still want to run swarm, you can easily add fake abundance values:
-
-```sh
-sed '/^>/ s/$/_1/' amplicons.fasta > amplicons_with_abundances.fasta
-```
-
-Alternatively, you may specify a default abundance value with
-**swarm**'s `--append-abundance` (`-a`) option to be used when
-abundance information is missing from a sequence.
+you still want to run swarm (not recommended), you can specify a
+default abundance value with **swarm**'s `--append-abundance` (`-a`)
+option to be used when abundance information is missing from a
+sequence.
 
 
 ### Launch swarm ###
@@ -305,15 +313,6 @@ rm "${AMPLICONS}"
 ```
 
 
-## Troubleshooting ##
-
-If **swarm** exits with an error message saying `This program
-requires a processor with SSE2`, your computer is too old to run
-**swarm** (or based on a non x86-64 architecture). **swarm** only runs
-on CPUs with the SSE2 instructions, i.e. most Intel and AMD CPUs
-released since 2004.
-
-
 ## Citation ##
 
 To cite **swarm**, please refer to:
@@ -333,7 +332,7 @@ You are welcome to:
 
 * submit suggestions and bug-reports at: https://github.com/torognes/swarm/issues
 * send a pull request on: https://github.com/torognes/swarm/
-* compose a friendly e-mail to: Frédéric Mahé <mahe@rhrk.uni-kl.de> and Torbjørn Rognes <torognes@ifi.uio.no>
+* compose a friendly e-mail to: Frédéric Mahé <frederic.mahe@cirad.fr> and Torbjørn Rognes <torognes@ifi.uio.no>
 
 
 ## Third-party pipelines ##
@@ -356,7 +355,7 @@ You are welcome to:
 If you want to try alternative free and open-source clustering
 methods, here are some links:
 
-* [VSEARCH](https://github.com/torognes/vsearch)
+* [vsearch](https://github.com/torognes/vsearch)
 * [Oligotyping](http://merenlab.org/projects/oligotyping/)
 * [DNAclust](http://dnaclust.sourceforge.net/)
 * [Sumaclust](http://metabarcoding.org/sumatra)
@@ -365,6 +364,11 @@ methods, here are some links:
 
 ## Version history ##
 
+### version 3.0 ###
+
+**swarm** 3.0 is much faster when _d_ = 1, and consumes less memory.
+Strict dereplication is now mandatory.
+
 ### version 2.2.2 ###
 
 **swarm** 2.2.2 fixes a bug causing Swarm to wait forever in very rare
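
Review note: the README changes above make strict dereplication mandatory, meaning every input sequence must be unique and carry its number of occurrences in the header. What that step produces can be sketched as below. This is a minimal illustration and not swarm's own code: the function names are hypothetical, and using a SHA-1 digest of the sequence as its label is an assumption made for the example.

```python
import hashlib
from collections import Counter


def dereplicate(sequences):
    """Merge identical sequences (case-insensitive) and count them.

    Returns (label, abundance, sequence) tuples sorted by decreasing
    abundance, then by label. Labels are SHA-1 digests of the sequence
    (an assumption made for this sketch).
    """
    counts = Counter(seq.upper() for seq in sequences)
    entries = [
        (hashlib.sha1(seq.encode("ascii")).hexdigest(), abundance, seq)
        for seq, abundance in counts.items()
    ]
    return sorted(entries, key=lambda entry: (-entry[1], entry[0]))


def to_fasta(entries):
    """Format entries with '>label_abundance' headers (e.g. '_1000')."""
    return "".join(
        f">{label}_{abundance}\n{seq}\n" for label, abundance, seq in entries
    )


if __name__ == "__main__":
    reads = ["ACGT", "acgt", "TTTT"]
    print(to_fasta(dereplicate(reads)), end="")
```

Each unique sequence appears once in the result, and the `_N` suffix records how many input reads collapsed into it.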

man/swarm.1

Lines changed: 41 additions & 17 deletions
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH swarm 1 "December 12, 2017" "version 2.2.2" "USER COMMANDS"
+.TH swarm 1 "October 24, 2019" "version 3.0.0" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 swarm \(em find clusters of nearly-identical nucleotide amplicons
@@ -110,8 +110,9 @@ results obtained during the clustering process allows \fBswarm\fR to
 avoid most of the amplicon comparisons needed in a naïve approach. To
 speed up the remaining amplicon comparisons, \fBswarm\fR implements an
 extremely fast Needleman-Wunsch algorithm making use of the Streaming
-SIMD Extensions (SSE2) of modern x86-64 CPUs. If SSE2 instructions are
-not available, \fBswarm\fR exits with an error message.
+SIMD Extensions (SSE2) of modern x86-64 CPUs, or NEON instructions of
+ARM-64 CPUs. If SSE2 instructions are not available, \fBswarm\fR exits
+with an error message.
 .PP
 \fBswarm\fR can read nucleotide amplicons in fasta format from a
 normal file or from the standard input (using a pipe or a
@@ -138,7 +139,19 @@ defined as a string of [ACGT] or [ACGU] symbols (case insensitive, 'U'
 is replaced with 'T' internally), starting after the end of the header
 line and ending before the next header line or the file end;
 \fBswarm\fR silently removes newline symbols ('\\n' or '\\r') and
-exits with an error message if any other symbol is present.
+exits with an error message if any other symbol is present. Lastly, if
+sequences are not all unique, i.e. were not properly dereplicated,
+swarm will exit with an error message.
+.PP
+Clusters are written to output files (specified with \-i, \-o, \-s and
+\-u) by decreasing abundance of their seed sequences, and then by
+alphabetical order of seed sequence labels. An exception to that is
+the \-w (\-\-seeds) output, which is sorted by decreasing \fIcluster
+abundance\fR (sum of abundances of all sequences in the cluster), and
+then by alphabetical order of seed sequence labels. This is
+particularly useful for post-clustering steps, such as \fIde novo\fR
+chimera detection, that require clusters to be sorted by decreasing
+abundances.
 .\" ----------------------------------------------------------------------------
 .SS General options
 .TP 9
@@ -286,7 +299,7 @@ in situations where writing to \fIstandard error\fR is problematic
 output clustering results to \fIfilename\fR. Results consist of a list
 of OTUs, one OTU per line. An OTU is a list of amplicon headers
 separated by spaces. That output format can be modified by the option
-\-\-mothur (\-r). Default is to write to standard output.
+\-\-mothur (\-r). Default is to write to \fIstandard output\fR.
 .TP
 .B \-r\fP,\fB\ \-\-mothur
 output clustering results in a format compatible with Mothur. That
@@ -305,7 +318,7 @@ total abundance of amplicons in the OTU,
 .IP \n+[step].
 label of the initial seed (header without abundance annotations),
 .IP \n+[step].
-initial seed abundance,
+abundance of the initial seed,
 .IP \n+[step].
 number of amplicons with an abundance of 1 in the OTU,
 .IP \n+[step].
@@ -363,13 +376,15 @@ output OTU representative sequences to \fIfilename\fR in fasta
 format. The abundance value of each OTU representative is the sum of
 the abundances of all the amplicons in the OTU. Fasta headers are
 formated as follows: '>label_\fIinteger\fR',
-or '>label;size=\fIinteger\fR;' if the \-z option is used.
+or '>label;size=\fIinteger\fR;' if the \-z option is used, and
+sequences are uppercased. Sequences are sorted by decreasing
+abundance, and then by alphabetical order of sequence labels.
 .TP
 .B \-z\fP,\fB\ \-\-usearch\-abundance
 accept amplicon abundance values in usearch/vsearch's style
 (>label;size=\fIinteger\fR[;]). That option influences the abundance
-annotation style used in swarm's standard output (\-o), as well as the
-ouput of options \-r, \-u and \-w.
+annotation style used in swarm's \fIstandard output\fR (\-o), as well
+as the output of options \-r, \-u and \-w.
 .LP
 .\" ----------------------------------------------------------------------------
 .SS Pairwise alignment advanced options
@@ -410,7 +425,7 @@ zcat myfile.fasta.gz | \\
 \-t 4 \\
 \-f \\
 \-w myfile.representatives.fasta \\
-\-o myfile.swarms
+\-o /dev/null
 .RE
 .EE
 .\" ============================================================================
@@ -475,7 +490,7 @@ License along with this program. If not, see
 .\" ============================================================================
 .SH SEE ALSO
 \fBswipe\fR, an extremely fast Smith-Waterman database search tool by
-Torbjørn Rognes (available from
+Torbjørn Rognes (available at
 .UR https://github.com/torognes/swipe
 .UE ).
 .PP
@@ -492,8 +507,17 @@ New features and important modifications of \fBswarm\fR (short lived
 or minor bug releases are not mentioned):
 .RS
 .TP
+.BR v3.0.0\~ "released October 24, 2019"
+Version 3.0.0 introduces a faster algorithm for \fId\fR = 1, and a
+reduced memory footprint. Swarm has been ported to Windows x86-64,
+GNU/Linux ARM 64, and GNU/Linux POWER8. Internal code has been
+modernized, hardened, and thoroughly tested. Strict dereplication of
+input sequences is now mandatory. The \-\-seeds option (\-w) now
+outputs results sorted by decreasing abundance, and then by
+alphabetical order of sequence labels.
+.TP
 .BR v2.2.2\~ "released December 12, 2017"
-Version 2.2.2 fixes a bug that would cause Swarm to wait forever in
+Version 2.2.2 fixes a bug that would cause swarm to wait forever in
 very rare cases when multiple threads were used.
 .TP
 .BR v2.2.1\~ "released October 27, 2017"
@@ -527,7 +551,7 @@ bug only applies when \fId\fR > 1.
 .BR v2.1.10\~ "released December 22, 2016"
 Version 2.1.10 fixes two bugs related to gap penalties of alignments.
 The first bug may lead to wrong aligments and similarity percentages
-reported in UCLUST (.uc) files. The second bug makes Swarm use a
+reported in UCLUST (.uc) files. The second bug makes swarm use a
 slightly higher gap extension penalty than specified. The default gap
 extension penalty used have actually been 4.5 instead of 4.
 .TP
@@ -679,10 +703,10 @@ not. Only basic SSE2 instructions are now required to run \fBswarm\fR.
 .TP
 .BR v1.2.4\~ "released January 30, 2014"
 Version 1.2.4 introduces an option \-\-break\-swarms to output all
-pairs of amplicons with \fId\fR differences to standard error. That
-option is used by the companion script `swarm_breaker.py` to refine
-\fBswarm\fR results. The syntax of the inline assembly code is changed
-for compatibility with more compilers.
+pairs of amplicons with \fId\fR differences to \fIstandard
+error\fR. That option is used by the companion script
+`swarm_breaker.py` to refine \fBswarm\fR results. The syntax of the
+inline assembly code is changed for compatibility with more compilers.
 .TP
 .BR v1.2\~ "released May 16, 2013"
 Version 1.2 greatly improves speed by using alignment-free comparisons
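
Review note: the man page now documents two output orderings. The `-i`, `-o`, `-s` and `-u` outputs are sorted by decreasing seed abundance, and the `-w` output by decreasing cluster abundance; both break ties by seed label. The sketch below illustrates those two sort keys. It is not swarm's implementation: the `Cluster` type and function names are hypothetical, and Python's default string comparison stands in for "alphabetical order".

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    """Hypothetical summary of one cluster, for illustration only."""
    seed_label: str
    seed_abundance: int
    total_abundance: int  # sum of abundances of all sequences in the cluster


def order_for_seeds_output(clusters):
    """-w ordering: decreasing cluster abundance, then seed label."""
    return sorted(clusters, key=lambda c: (-c.total_abundance, c.seed_label))


def order_for_other_outputs(clusters):
    """-i/-o/-s/-u ordering: decreasing seed abundance, then seed label."""
    return sorted(clusters, key=lambda c: (-c.seed_abundance, c.seed_label))


if __name__ == "__main__":
    clusters = [
        Cluster("b", 10, 12),
        Cluster("a", 8, 20),
        Cluster("c", 10, 12),
    ]
    print([c.seed_label for c in order_for_seeds_output(clusters)])
    print([c.seed_label for c in order_for_other_outputs(clusters)])
```

Negating the abundance in the key gives a descending numeric sort while keeping labels ascending, so a single `sorted` call covers both criteria.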

man/swarm_manual.pdf

2.41 KB
Binary file not shown.

scripts/amplicon_contingency_table.py

Lines changed: 7 additions & 9 deletions
@@ -1,15 +1,13 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
 Read all fasta files and build a sorted amplicon contingency
-table. Usage: python amplicon_contingency_table.py samples_*.fas
+table. Usage: python3 amplicon_contingency_table.py samples_*.fas
 """
 
-from __future__ import print_function
-
-__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.fr>"
-__date__ = "2016/03/12"
-__version__ = "$Revision: 2.1"
+__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
+__date__ = "2019/09/24"
+__version__ = "$Revision: 3.0"
 
 import os
 import sys
@@ -35,7 +33,7 @@ def fasta_parse():
         sample = os.path.basename(fasta_file)
         sample = os.path.splitext(sample)[0]
         samples[sample] = samples.get(sample, 0) + 1
-        with open(fasta_file, "rU") as fasta_file:
+        with open(fasta_file, "r") as fasta_file:
             for line in fasta_file:
                 if line.startswith(">"):
                     amplicon, abundance = line.strip(">;\n").split(separator)
@@ -65,7 +63,7 @@ def main():
     all_amplicons, amplicons2samples, samples = fasta_parse()
 
     # Sort amplicons by decreasing abundance (and by amplicon name)
-    sorted_all_amplicons = sorted(all_amplicons.iteritems(),
+    sorted_all_amplicons = sorted(iter(all_amplicons.items()),
                                   key=operator.itemgetter(1, 0))
     sorted_all_amplicons.reverse()
 
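
Review note: in Python 3, `dict.items()` can be passed to `sorted()` directly, so the `iter(...)` wrapper in the hunk above is harmless but unnecessary. The sort-then-reverse sequence can also be written as one call. The sketch below assumes, as the surrounding code suggests, that `all_amplicons` maps amplicon names to integer abundances; the function name `sort_amplicons` is hypothetical.

```python
import operator


def sort_amplicons(all_amplicons):
    """Sort (name, abundance) pairs by decreasing abundance.

    Ties on abundance are broken by name in reverse order, which is
    what sorting on (abundance, name) and then reversing produces.
    """
    return sorted(all_amplicons.items(),
                  key=operator.itemgetter(1, 0),
                  reverse=True)


if __name__ == "__main__":
    print(sort_amplicons({"a": 5, "b": 5, "c": 9}))
```

Because dictionary keys are unique, no two `(abundance, name)` keys compare equal, so `reverse=True` gives the same order as an ascending sort followed by `.reverse()`.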
scripts/graph_plot.py

Lines changed: 12 additions & 17 deletions
@@ -1,29 +1,24 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
 Visualize the internal structure of a swarm (color vertices by
-abundance). Requires the module igraph and python 2.7+.
-
-Limitations: amplicons grafted with the fastidious option will be
-discarded and will not be visualized.
+abundance). Requires the module igraph and python 3.
 """
 
-from __future__ import print_function
-
-__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.fr>"
-__date__ = "2016/11/09"
-__version__ = "$Revision: 3.1"
+__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
+__date__ = "2019/09/24"
+__version__ = "$Revision: 4.0"
 
 import sys
 import os.path
 from igraph import Graph, plot
 from optparse import OptionParser
 
-#*****************************************************************************#
+# *************************************************************************** #
 # #
 # Functions #
 # #
-#*****************************************************************************#
+# *************************************************************************** #
 
 
 def option_parse():
@@ -76,7 +71,7 @@ def parse_files(swarms, internal_structure, OTU, drop):
     """
     # List amplicon ids and abundances
     amplicons = list()
-    with open(swarms, "rU") as swarms:
+    with open(swarms, "r") as swarms:
         for i, swarm in enumerate(swarms):
             if i == OTU - 1:
                 # Deal with ";size=" in a rather clumsy way... but it works
@@ -100,7 +95,7 @@ def parse_files(swarms, internal_structure, OTU, drop):
 
     # List pairwise relations
     relations = list()
-    with open(internal_structure, "rU") as internal_structure:
+    with open(internal_structure, "r") as internal_structure:
         print("Parsing amplicon relationships", file=sys.stdout)
         for line in internal_structure:
             # Get the first four elements of the line
@@ -138,7 +133,7 @@ def build_graph(amplicons, relations):
 
     amplicon_ids = [amplicon[0] for amplicon in amplicons]
     abundances = [int(amplicon[1]) for amplicon in amplicons]
-    minimum, maximum = min(abundances), max(abundances)
+    maximum = max(abundances)
 
     # Determine canvas size
     if len(abundances) < 500:
@@ -214,11 +209,11 @@ def main():
     return
 
 
-#*****************************************************************************#
+# *************************************************************************** #
 # #
 # Body #
 # #
-#*****************************************************************************#
+# *************************************************************************** #
 
 if __name__ == '__main__':
 
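
Review note: both scripts replace the `"rU"` mode with `"r"`. In Python 3, text mode already applies universal newlines by default (`newline=None`), so `\r\n` and `\r` line endings are translated to `\n` when reading. The sketch below demonstrates this; the helper name `read_lines` and the temporary file are only for illustration.

```python
import os
import tempfile


def read_lines(path):
    """Read a text file with Python 3's default newline handling."""
    with open(path, "r") as handle:
        return list(handle)


if __name__ == "__main__":
    # Write a file mixing Windows, old Mac and Unix line endings.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b">s1_2\r\nACGT\r>s2_1\nTTTT\n")
        path = tmp.name
    try:
        # Every line comes back terminated by a plain "\n".
        print(read_lines(path))
    finally:
        os.remove(path)
```

This is why dropping the `U` flag does not change how the scripts parse files produced on other platforms.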