Releases: MikkelSchubert/adapterremoval
AdapterRemoval v3.0.0-alpha3
This is the third alpha release of AdapterRemoval v3. As with the previous alpha
releases, changes that affect how AdapterRemoval is used (e.g. by removing
options) or that result in different output compared to previous versions are
marked with the label "[BREAKING]".
AdapterRemoval now uses meson for its build process, and meson is therefore
a build-time requirement. A Makefile is still provided to simplify setting up
and running the build. See the installation instructions in the documentation
for more information.
Major changes include support for hardware accelerated alignments using NEON on
modern Apple hardware, support for samples being identified by multiple
barcodes, support for handling barcodes that may ligate in different
orientations, improved support for SAM/BAM output, and an (optional) duplication
plot in the HTML report.
For more information, see
https://github.com/MikkelSchubert/adapterremoval/blob/v3.0.0-alpha3/README-v3.md
Added
- Multiple barcodes/barcode pairs may now be used to identify the same sample,
  via the --multiple-barcodes flag. The number of hits per barcode/barcode
  pair is reported in the HTML/JSON reports.
- Added support for handling barcodes that may ligate in different orientations
  (via --barcode-orientation) and for normalizing the orientation of merged
  reads (via --normalize-orientation).
- The --use-colors parameter may now be used to control color output.
  Options are auto (default; enabled when run interactively), always, or never.
- The title of the HTML report can now be set via --report-title.
- Input files are now checked for duplicate filenames, in order to help
  prevent accidental data duplication.
- Alignments are now accelerated on Apple hardware using NEON instructions, for
  a roughly 3-fold increase in throughput.
- A duplication plot is now included in the HTML report if this is enabled,
  instead of only being reported in the JSON file.
Changed
- [BREAKING] Changed CO tags for read-groups in SAM/BAM files to DS
  (description) tags, in order to match the specification.
- [BREAKING] A number of changes have been made to the JSON report layout,
  including the moving, removal, and addition of sections. The layout is
  described in schema.json.
- [BREAKING] The minimum allowed/default value for --min-adapter-overlap
  was set to 1. In practice this has no effect, since length 0 alignments were
  never considered, but it may break scripts running AdapterRemoval.
- [BREAKING] Dropped support for raw error-rates to --trim-mott-rate, which
  was renamed to --trim-mott-quality to match other trimming options.
- [BREAKING] SAM/BAM output is now combined into a single file by default,
  including discarded reads. This can be overridden by setting the individual
  --out-* options.
- [BREAKING] Dropped PG tag from read-groups/records in SAM/BAM output.
- [BREAKING] Dropped (minimal) read-groups for SAM/BAM output. If desired,
  read-group information can be added with --read-group.
- [BREAKING] The --report-duplication option now supports k/m/g suffixes,
  and defaults to 100k if used without an explicit value.
- [BREAKING] The --read-group option no longer attempts to unescape
  special characters. Instead, tags must be separated using embedded tabs
  (--read-group $'ID:A\tSM:B') or provided as individual arguments
  (--read-group 'ID:A' 'SM:B').
- Improved checks for conflicting command-line options.
- Barcodes are now recorded in FASTQ headers when demultiplexing without
  trimming.
- The $schema URL is now included in the JSON report.
- Makefile features are now enabled/disabled with true/false instead of
  yes/no.
- Vega-lite is now loaded in the background when opening the HTML reports,
  making the report readable before Vega-lite has loaded.
- Optimized alignments involving multiple possible adapter sequences, by only
  once performing the alignments that involve no adapter sequences.
- Optimized alignments involving multiple possible adapter sequences, by
  sorting the list of adapter sequences by hits. This increases the odds that
  a good alignment is found early so that worse alignments can be skipped.
- The old Makefile was replaced with the Meson build system, but a wrapper
  Makefile is still provided/used as a convenience for setting the recommended
  build options.
- A number of small improvements were made to the --help text.
- Improved error messages when mismatching (paired) read names are detected.
- Singleton reads are now included in the overall summary statistics in
  JSON/HTML reports.
- Hardening flags are now enabled by default during compilation. This comes with
  a small performance cost, but most distros are also expected to enable similar
  flags by default.
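As a rough illustration of the k/m/g suffixes accepted by --report-duplication, the sketch below parses values such as 100k into plain integers. The function name parse_count is hypothetical and this is not AdapterRemoval's own parser. It also assumes decimal multipliers, since these notes do not say whether k means 1000 or 1024.

```python
def parse_count(value: str) -> int:
    """Parse a count such as '100k' or '2m' into an integer.

    Hypothetical helper for illustration only.
    """
    # Assumption: decimal multipliers; the release notes do not state
    # whether 'k' stands for 1000 or 1024.
    multipliers = {"k": 10**3, "m": 10**6, "g": 10**9}

    text = value.strip().lower()
    factor = 1
    if text and text[-1] in multipliers:
        factor = multipliers[text[-1]]
        text = text[:-1]

    # Reject empty strings, signs, decimals, and anything else non-numeric
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"invalid count: {value!r}")

    return int(text) * factor
```

Under these assumptions, parse_count("100k") gives 100000, which would correspond to the default used when the option is given without an explicit value.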
Fixed
- NA values were being written with '%' or 'bp' suffixes in HTML report.
- Some plots were omitted for merged reads in HTML report.
- Mate 2 adapters were reverse-complemented in JSON report when demultiplexing.
- SAM/BAM headers were not being written in demultiplexing mode.
- Mate 1/2 statistics were sampled independently, and thus potentially not
  derived from the same read pairs.
- The JSON/HTML reports would give different time-stamps for the run, since one
  gave the start time and the other the end time. Now start time is always used.
- Fixed failure when reading paired FASTQ files where read lengths differed
  between the two files.
- Fixed report files and unidentified reads getting additional suffixes when
  filenames were specified manually during demultiplexing.
- Fixed /dev/null being listed as the path for some files when demultiplexing,
  and these outputs were disabled.
- Reverted the removal of support for '.' as equivalent to 'N' in FASTQ reads.
  This is found in some older data-sets (#112).
- Fixed misleading IO error messages, that would include descriptions of
  unrelated errors in some cases.
AdapterRemoval v2.3.4
This release adds a couple of new command-line options for handling non-ACGTN
bases in FASTQ data and back-ports a few minor fixes from the development
branch.
Added
- Added support for converting Uracils (U) in input data to Thymine (T) via the
  --convert-uracils flag.
- Added support for replacing IUPAC-encoded degenerate bases with Ns via the
  --mask-degenerate-bases flag.
- Added DESTDIR support to make install.
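The effect of the two new flags can be sketched as simple character translations. This is only an illustration of the described behavior, not AdapterRemoval's implementation. The function names are hypothetical, and it assumes that every IUPAC nucleotide code other than A, C, G, T, and N counts as degenerate.

```python
# Assumption: all IUPAC codes except A, C, G, T and N are treated as degenerate.
DEGENERATE_BASES = "RYSWKMBDHV"

_URACIL_TO_THYMINE = str.maketrans("Uu", "Tt")
_DEGENERATE_TO_N = str.maketrans(
    DEGENERATE_BASES + DEGENERATE_BASES.lower(),
    "N" * (2 * len(DEGENERATE_BASES)),
)


def convert_uracils(sequence: str) -> str:
    """Replace uracils (U) with thymines (T), as --convert-uracils describes."""
    return sequence.translate(_URACIL_TO_THYMINE)


def mask_degenerate_bases(sequence: str) -> str:
    """Replace degenerate bases with N, as --mask-degenerate-bases describes."""
    return sequence.translate(_DEGENERATE_TO_N)
```

For example, convert_uracils("ACGU") yields "ACGT" and mask_degenerate_bases("ACRYN") yields "ACNNN".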
Fixed
- Improved progress timer accuracy, so updates occur closer to every 1M reads.
Changed
- Minor improvements to --help text and documentation.
AdapterRemoval v3.0.0-alpha2
This is the second alpha release of AdapterRemoval v3. It is the intention that
a third alpha release, or the final 3.0 release, will follow within the next
couple of months.
As with alpha 1, changes that affect how AdapterRemoval is used (e.g. by
removing options) or that result in different output compared to AdapterRemoval
v2 are marked with the label "[BREAKING]".
In addition to changes listed below, this release includes increased throughput
thanks to improved parallelization of various steps in the internal pipeline,
support for AVX512 and general improvements to the SIMD alignment algorithms,
loop unrolling of non-SIMD alignments to significantly increase throughput when
SIMD is not available, and a significant decrease in the number of allocations
to decrease overhead.
This release requires a compiler with support for C++17, and libdeflate is now a
mandatory dependency.
Draft documentation is available here and a pre-compiled binary for x86-64
Linux systems is attached below.
Added
- Added support for converting (U)racils in input data to T(hymine) via the
  --convert-uracils flag.
- Added support for replacing IUPAC-encoded degenerate bases with Ns via the
  --mask-degenerate-bases flag.
- Added support for writing output in SAM/BAM formats, with optional
  user-supplied read-group information.
- Added support for alignments using AVX512 instructions. AVX512 support is only
  available when AdapterRemoval is compiled with GCC v11+ or Clang v8+.
- Added support for selecting output file formats via the file extension and via
  the --out-format option. A corresponding option, --stdout-format, was
  added to select the format for data written to STDOUT.
- Added support for reading from STDIN or writing to STDOUT when '-' is used as
  the filename, as an alternative to using /dev/stdin or /dev/stdout.
- Added dedicated threads solely for writing output data. This allows compute
  threads to work at full capacity, as long as the destination can consume
  written data fast enough. This may result in CPU utilization exceeding
  --threads by a couple of percent.
- Added support for setting DESTDIR when running make install.
- Added --licenses flag for displaying licenses of 3rd party code used by /
  incorporated into AdapterRemoval.
- Added --simd option allowing the user to select the specific SIMD
  instruction set they wish to use.
- Added Containerfile for building static binaries using alpine/musl.
Changed
- [BREAKING] Changed the default --mm/--mismatch-rate from 1/3 to 1/6,
  in order to decrease the false positive rate, in particular for read merging.
- [BREAKING] Default to writing gzip-compressed FASTQ files; output written
  to STDOUT is uncompressed by default.
- [BREAKING] Discarded reads are no longer saved by default.
- [BREAKING] Output files for discarded reads and singleton (orphan)
  paired-end reads are only created if filtering is enabled.
- [BREAKING] The --basename/--out-prefix no longer defaults to
  your_output. Instead the user is required to set at least one --out-*
  option.
- [BREAKING] Merged --identify-adapters and --report-only commands. The
  adapter sequence is presently only reported in the HTML report, but will be
  added to the JSON report following some planned changes.
- [BREAKING] Reverted --min-complexity being enabled by default.
- Increased the default --threads value to 2.
- A number of command-line options were renamed for consistency; use of the old
  names is still supported, but will trigger a warning message.
- Re-organized compression: level 1 is streamed using isa-l, while levels 2-13
  correspond to libdeflate levels 1 to 12.
- Changed the default compression level to 5 on the new scale (libdeflate level
  4); this results in a ~40% increase in throughput at the cost of roughly 3%
  larger output files.
- Setting an --out-* option in demultiplexing mode overrides the basename /
  prefix for that specific output type.
- Added smoothing to GC values calculated for the GC content curve, to account
  for the fact that possible GC% values are unevenly distributed depending on
  the read length.
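The re-organized compression scale can be sketched as a small lookup. The function name backend_for_level is hypothetical. The isa-l level used for level 1 is left unspecified because these notes do not state it, and values outside 1-13 are rejected here only because the notes do not describe them.

```python
from typing import Optional, Tuple


def backend_for_level(level: int) -> Tuple[str, Optional[int]]:
    """Map a compression level on the new scale to (library, library level).

    Hypothetical helper; a sketch of the mapping described in the notes.
    """
    if level == 1:
        # Level 1 is streamed using isa-l; the isa-l level itself is not
        # given in the release notes, hence None.
        return ("isa-l", None)
    if 2 <= level <= 13:
        # Levels 2-13 correspond to libdeflate levels 1 to 12
        return ("libdeflate", level - 1)
    # Assumption: other values are treated as invalid in this sketch
    raise ValueError(f"unsupported compression level: {level}")
```

With this mapping, the new default level 5 resolves to libdeflate level 4, matching the description in the list.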
Removed
The following changes are all [BREAKING] as described above:
- Support for the original merging algorithm has been removed. The
  --merge-strategy additive method produces very similar, but slightly more
  conservative scores.
- Removed the ability to randomly sample a base if no best base could be
  selected in case of mismatches. Such bases are now changed to N, while both
  methods assign a Phred score of 0 (!).
AdapterRemoval v3.0.0-alpha1
This is the first alpha release of AdapterRemoval v3. This is a major revision
of AdapterRemoval, with the goals of simplifying usage by picking a sensible set
of default settings, adding new features to handle a wider range of data,
providing human/machine readable reports, and improving overall throughput.
This release features a number of breaking changes compared to AdapterRemoval v2
and it is therefore recommended that you carefully read the list of changes
below. Changes that affect how AdapterRemoval is used (e.g. by removing options)
or that result in different output compared to AdapterRemoval v2 are marked with
the label "[BREAKING]".
This is an alpha release; not all planned features are complete (more QC reports
are planned among other things), additional optimizations will be attempted, and
documentation still needs to be expanded further before the final release.
Feedback is very welcome in the meantime.
Draft documentation is available here and a pre-compiled binary for x86-64 Linux systems is attached below.
Added
- Reports are now available in JSON format for easy parsing and in HTML format
  for human consumption. These replace the old --settings file.
- AVX2 enabled alignment algorithm for a significant performance boost (YMMV).
- Added support for detecting supported CPU extensions (SSE/AVX) at runtime.
- Support for combining output by simply specifying the same filename for
  multiple output types, e.g. --output1 file.fq --output2 file.fq will
  produce interleaved output.
- Added handling for /dev/null as a "magic" output filename. Read-types
  writing to this exact path will be discarded early in the pipeline, saving
  time previously spent processing, compressing, and writing FASTQ reads.
- Added read complexity filter inspired by fastp.
- Added the ability to only process the first N reads/read pairs via the
  newly added --head N command-line option.
- Added estimation of duplication rates based on the FastQC algorithm.
- Automatic detection of mate separators based on the first chunk of reads
  processed. The --mate-separator option is therefore only required in cases
  where the results are ambiguous.
- Automatic gzip compression of output files with a .gz extension. This makes
  it possible to compress only a subset of files and removes the need for the
  --gzip option when manually specifying output files.
- Added options --prefix-read1, --prefix-read2, and --prefix-merged for
  adding custom prefixes to the names of FASTQ reads.
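Interleaved output, as produced when the same filename is given for both mates, simply alternates mate 1 and mate 2 records. A minimal sketch of that ordering follows. The function name interleave is hypothetical, and records are represented as plain strings for illustration.

```python
from itertools import chain


def interleave(mate1_records, mate2_records):
    """Yield records alternately: read 1, read 2, read 1, read 2, ...

    Assumes both inputs contain the same number of records in the same
    order; zip() silently stops at the shorter input otherwise.
    """
    return chain.from_iterable(zip(mate1_records, mate2_records))
```

For example, interleaving ["r1/1", "r2/1"] with ["r1/2", "r2/2"] gives the order r1/1, r1/2, r2/1, r2/2.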
Changed
- [BREAKING] Default adapters have been changed to the recommended Illumina
  sequences, equivalent to the first 33 bp of the adapter sequences used by
  AdapterRemoval v2. This makes the default settings more generally applicable.
- [BREAKING] The trimming options --trimwindows, --trimns,
  --trimqualities, and --minquality have been deprecated in favor of the new
  modified Mott's algorithm, which is enabled by default. The trimming
  algorithm used may be changed using the new --trim-strategy option.
- [BREAKING] Merging now defaults to using the conservative algorithm,
  meaning that matching quality scores are assigned Q_match = max(Q_a, Q_b)
  instead of Q_match ~= Q_a + Q_b, and that same-quality mismatches are
  assigned 'N' instead of one being picked at random. Motivated in part by
  doi:10.1186/s12859-018-2579-2. This can be changed using --merge-strategy.
- The --merge option no longer has any effect when processing SE data;
  previously this option would treat reads with at least --minalignmentlength
  adapter as pseudo-merged reads.
- [BREAKING] Merged reads are no longer given a M_ name prefix and merged
  reads that have been trimmed after merging are no longer given an MT_ name
  prefix. Instead, see the new option --prefix-merged.
- [BREAKING] Default filenames have all been revised and now include proper
  extensions to indicate the format.
- [BREAKING] The executable is now named adapterremoval3. This was done to
  allow v3 to coexist with AdapterRemoval v2 and to prevent accidental use of
  the wrong version.
- [BREAKING] Changed the default --maxns value from 1000 to "infinite".
- --gzip now defaults to compressing independent blocks of 64kb data using
  libdeflate. This significantly improves throughput in both single- and
  (especially) multi-threaded mode, but may be incompatible with a few programs.
  Compression levels of 3 and below use isa-l for compression and provide a
  more universally compatible output.
- The term "merging" is now used consistently instead of "collapsing", including
  for default output filenames. Options have been renamed, but old option names
  continue to work (except for --outputcollapsedtruncated).
- Improvements to alignment algorithm in order to terminate early if possible.
- Logging is now done more consistently and exposes options to increase or
  decrease the number of messages printed (debug, info, warning, errors).
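The conservative merging rule described in the list can be sketched for a single pair of overlapping bases. The function name merge_base is hypothetical and this is not AdapterRemoval's implementation. The handling of mismatches with differing qualities (keep the higher-quality base, score it with the difference) is an assumption, as these notes only cover matches and same-quality mismatches.

```python
def merge_base(base_a: str, q_a: int, base_b: str, q_b: int):
    """Merge one overlapping position from two mates (conservative sketch).

    Returns a (base, Phred score) tuple.
    """
    if base_a == base_b:
        # Matching bases: Q_match = max(Q_a, Q_b)
        return base_a, max(q_a, q_b)

    if q_a == q_b:
        # Same-quality mismatch: no best base can be chosen, so emit 'N'
        # with a Phred score of 0 instead of picking a base at random
        return "N", 0

    # Assumption, not stated in the release notes: keep the higher-quality
    # base and use the difference in quality as its new score
    if q_a > q_b:
        return base_a, q_a - q_b
    return base_b, q_b - q_a
```

For example, two matching A bases with qualities 30 and 20 merge to A with quality 30, whereas an A/C mismatch at equal quality merges to N with quality 0.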
Removed
The following changes are all [BREAKING] as described above:
- The --outputcollapsedtruncated option has been removed and all merged reads
  (whether quality trimmed or not) are simply written to --outputmerged.
- The --qualitybase-output option has been removed. Output is now always
  Phred+33.
- The --combined-output option has been removed in favor of allowing arbitrary
  merging of output files (see above).
- The --settings option has been replaced by --out-json and --out-html for
  machine and human readable reports, respectively.
- Removed support for guessing the intended command-line argument based on
  prefixes. I.e. --th will no longer be accepted for --threads. Due to the
  number of options added, removed, and renamed, this is no longer reliable.
- The deprecated --pcr1 and --pcr2 options have been removed.
- Dropped undocumented support for '.' as equivalent to 'N' in FASTQ reads.
- Support for reading and writing of bzip2 files has been removed.
AdapterRemoval v2.3.3
- Updated Catch2 to fix compilation with glibc 2.34, courtesy of loganrosen.
AdapterRemoval v2.3.2
- Improved error messages when AdapterRemoval failed to open or write FASTQ
  files (issue #42).
- Fixed build on some architectures. Patch courtesy of Andreas Tille/the Debian
  build team.
- Fixed display of max Phred scores in FASTQ validation error messages.
- Removed benchmarking scripts which were included in the repo for the sake of
  making Schubert et al. 2016 reproducible. This is no longer relevant.
- Use 'install' in the Makefile; patch courtesy of Eric DEVEAUD.
- Added --collapse-deterministic to .settings file.
- Fixed --minadapteroverlap being misapplied in PE mode.
- Added --collapse-conservatively merge algorithm based on FASTQ-join. See
  the man-page for more information.
AdapterRemoval v2.3.1
- Added --preserve5p option. This option prevents AdapterRemoval from trimming
  the 5p of reads when the --trimqualities, --trimns, and --trimwindows options
  are used. Neither end of collapsed reads are trimmed when this option is used.
- Fixed Ns being miscounted as As when constructing consensus adapter sequences
  using --identify-adapters.
AdapterRemoval v2.3.0
- Fixed --collapse producing slightly different result on 32 bit and 64 bit
  architectures. Courtesy of Andreas Tille.
- Added support for output files without a basename; to create such output
  files, use an empty basename (--basename "") or a basename ending with a
  slash (--basename path/).
- Added support for managing file handles to allow AdapterRemoval to run
  when the number of output files exceeds the number of file handles, e.g.
  when demultiplexing large numbers of samples.
- Reworked demultiplexing to improve performance for many paired barcodes.
AdapterRemoval v2.2.4
- Fixed bug in --trim5p N which would cause AdapterRemoval to abort if N was
  greater than the pre-trimmed read length.
- Fixed --identify-adapters not respecting the --mate-separator option.
AdapterRemoval v2.2.3
- Added support for trimming reads by a fixed amount: --trim5p N --trim3p N.
  Different values may be given for each mate: --trim5p N1 N2. Trimming is
  carried out after adapters have been removed and reads have been collapsed,
  if enabled, but before quality trimming (Ns and low qualities).
- Added option for deterministic read merging (--collapse-deterministic). In
  this mode AdapterRemoval will set a merged base to 'N' with quality 0 if
  the corresponding bases on the two mates differ, and if both have the same
  quality score. The default behavior is to select one of the two bases at
  random.
- Fixed reporting of line numbers in error messages.
- Added conda installation instructions, courtesy of Maxime Borry (maxibor).
- Fixed reading mate 2 adapters specified via --adapter-list. Adapters would
  be used in the reverse orientation compared to --adapter2. Courtesy of
  Karolis (KarolisM).
- Fixed various typos and improved help/error messages.
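Fixed-length trimming as described for --trim5p and --trim3p can be sketched as slicing the sequence and quality strings together. The function name trim_fixed is hypothetical and this is not AdapterRemoval's code. Clamping the trim lengths to the read length is an assumption about how a trim length exceeding the read length (the case fixed in v2.2.4) might be handled.

```python
def trim_fixed(sequence: str, qualities: str, trim5p: int = 0, trim3p: int = 0):
    """Remove a fixed number of bases from the 5' and 3' ends of a read.

    Returns the trimmed (sequence, qualities) pair.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities differ in length")
    if trim5p < 0 or trim3p < 0:
        raise ValueError("trim lengths must be non-negative")

    # Assumption: trim lengths longer than the read yield an empty read
    # rather than an error
    start = min(trim5p, len(sequence))
    end = max(start, len(sequence) - trim3p)

    return sequence[start:end], qualities[start:end]
```

For example, trimming 2 bases from the 5' end and 3 from the 3' end of ACGTACGT leaves GTA, while a 5' trim of 10 on a 3 bp read leaves an empty read.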