BURST is an optimal, high-speed pairwise sequence aligner specialized in aligning many NGS short reads against large reference databases. It is designed to provide mathematically optimal alignments while maintaining exceptional speed, making it particularly useful for metagenomic studies and large-scale sequence analysis projects.
This documentation covers BURST version 1.0+.
- Optimal end-to-end alignment of variable-length short reads (up to a few thousand bases) against arbitrary reference sequences
- Gapped alignment support
- Multiple alignment modes:
  - BEST: Report first best match by hybrid BLAST id
  - ALLPATHS: Report all ties with the same error profile
  - CAPITALIST: Minimize set of references AND interpolate taxonomy (default)
  - FORAGE: Report all matches above specified threshold
  - ANY: Report any valid hit above specified threshold
- Optional optimal LCA taxonomy assignment with customizable confidence cutoff
- Full IUPAC ambiguous base support in queries and references
- Accelerator mode for faster alignment using k-mer hashing
- Fingerprinting for additional filtering of potential matches
- Support for reverse complement alignment
- Database creation and management tools
- Multithreading support for improved performance
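The IUPAC and reverse-complement features in the list above can be illustrated with a minimal Python sketch. It is not BURST's implementation: the names `IUPAC`, `reverse_complement`, and `bases_compatible` are hypothetical, and treating two codes as a match whenever their base sets overlap is an assumption about how ambiguity matching might work.

```python
# IUPAC nucleotide codes mapped to the bases each code can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement of every code, including the ambiguous ones (e.g. R=A/G -> Y=C/T).
_COMPLEMENT = str.maketrans("ACGTURYSWKMBDHVN", "TGCAAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a sequence written in IUPAC codes."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def bases_compatible(a, b):
    """Assumed matching rule: two codes match if they share at least one base."""
    return bool(set(IUPAC[a.upper()]) & set(IUPAC[b.upper()]))
```

Under this rule `bases_compatible("R", "A")` is true because R stands for A or G, while `bases_compatible("R", "C")` is false.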
burst {options}
- --references (-r) <name>: FASTA/edx DB of reference sequences [required]
- --accelerator (-a) <name>: Creates/uses a helper DB (acc/acx) [optional]
- --queries (-q) <name>: FASTA file of queries to search [required if aligning]
- --output (-o) <name>: Blast6/edb file for output alignments/database [required]

- --forwardreverse (-fr): Also search the reverse complement of queries
- --whitespace (-w): Write full query names in output (include whitespace)
- --xalphabet (-x): Allow any alphabet and disable ambiguity matching
- --nwildcard (-y): Allow N,X to match anything (in query and reference)
- --taxonomy (-b) <name>: Taxonomy map (to interpolate, use -m CAPITALIST)
- --mode (-m) <name>: Pick an alignment reporting mode (BEST, ALLPATHS, CAPITALIST, FORAGE, ANY)

- --dbpartition (-dp) <int>: Split DB making into chunks (lossy)
- --taxacut (-bc) <num>: Allow 1/ rank discord OR % conf
- --taxa_ncbi (-bn): Assume NCBI header format '>xxx|accsn...' for taxonomy
- --skipambig (-sa): Do not consider highly ambiguous queries (5+ ambigs)
- --taxasuppress (-bs) [STRICT]: Suppress taxonomic specificity by %ID
- --id (-i) <decimal>: Target minimum similarity (range 0-1)
- --threads (-t) <int>: How many logical processors to use
- --shear (-s) [len]: Shear references longer than [len] bases
- --fingerprint (-f): Use sketch fingerprints to precheck matches (or cluster db)
- --prepass (-p) [speed]: Use ultra-heuristic pre-matching
- --heuristic (-hr): Allow relaxed comparison of low-id matches
- --noprogress: Suppress progress indicator
- --qbunch (-qb) <int>: Pack QBUNCH with queries divergent
- --qbunch_max (-qm) <int>: Max size of QBUNCH
- --quickforage (-qf): Output FORAGE'd results inline
- --cache (-c) <int>: Performance tweaking parameter
- --latency (-l) <int>: Performance tweaking parameter
- BEST: Reports the first best match based on hybrid BLAST id.
- ALLPATHS: Reports all ties with the same error profile.
- CAPITALIST: Minimizes the set of references and interpolates taxonomy (default mode).
- FORAGE: Reports all matches above the specified threshold.
- ANY: Reports any valid hit above the specified threshold.
BURST can create custom databases for faster alignment:
burst -r input_references.fasta -d [DNA|RNA|QUICK] [max_query_length] -o output_database.edx
Options:
- DNA/RNA: Creates a full database
- QUICK: Creates a faster, but potentially less sensitive database
- max_query_length: Optional parameter to specify the maximum expected query length
Note: BURST's accelerator formats are hard-coded for a prefix size of either 12 or 15. The size your build uses is displayed in the BURST help string. The size-12 prefix uses less memory but is slower, which makes it a reasonable fit for marker gene analysis.
To create an accelerator file for even faster alignments:
burst -r input_references.fasta -d [options] -a output_accelerator.acx -o output_database.edx
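To give a rough feel for why the prefix size noted above affects memory, the sketch below packs a k-base prefix into an integer at 2 bits per base and counts how many distinct prefixes exist for each size. This is a generic k-mer hashing illustration, not BURST's accelerator format: `encode_prefix` and `table_slots` are hypothetical names, and the idea of one table slot per possible prefix is an assumption made only for the arithmetic.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_prefix(seq, k):
    """Pack the first k bases into an integer, 2 bits per base.

    Returns None if the sequence is shorter than k or the prefix
    contains anything other than A, C, G, T.
    """
    if len(seq) < k:
        return None
    value = 0
    for base in seq[:k].upper():
        code = _CODE.get(base)
        if code is None:
            return None
        value = (value << 2) | code
    return value


def table_slots(k):
    """Number of distinct unambiguous DNA prefixes of length k."""
    return 4 ** k


for k in (12, 15):
    print(k, table_slots(k))  # 12 -> 16777216, 15 -> 1073741824
```

The 64-fold gap between the two counts is the kind of difference that would make a size-15 index far more memory-hungry than a size-12 one.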
Reference sequences:
- FASTA format
- Can be provided as raw FASTA or as a pre-built BURST database (.edx)
Query sequences:
- FASTA or FASTQ format
- Gzipped input supported
- Maximum sequence length: 100MB (configurable)
Taxonomy map:
- Tab-delimited text file
- Columns: sequence name, taxonomy string
- Taxonomy strings are semicolon-delimited
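The tab-delimited, semicolon-separated taxonomy layout described above is simple to read programmatically. The following is a minimal sketch, assuming one record per line and no header row; `read_taxonomy_map` and `lowest_common_ancestor` are hypothetical helpers, and the strict shared-prefix rule shown here ignores BURST's confidence cutoff, so it only approximates what LCA assignment means.

```python
import io


def read_taxonomy_map(handle):
    """Map each sequence name to its lineage as a list of ranks."""
    taxonomy = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        name, _, lineage = line.partition("\t")
        taxonomy[name] = [rank.strip() for rank in lineage.split(";") if rank.strip()]
    return taxonomy


def lowest_common_ancestor(lineages):
    """Return the leading ranks shared by every lineage (strict LCA)."""
    lineages = list(lineages)
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


# Made-up records purely for illustration.
example = io.StringIO(
    "ref1\tk__Bacteria;p__Firmicutes;c__Bacilli\n"
    "ref2\tk__Bacteria;p__Firmicutes;c__Clostridia\n"
)
tax = read_taxonomy_map(example)
print(";".join(lowest_common_ancestor([tax["ref1"], tax["ref2"]])))
```

The two example lineages agree down to the phylum, so the printed result stops there.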
BURST outputs alignments in a modified BLAST-6 column format:
- Query sequence name
- Reference sequence name
- Percent identity
- Alignment length
- Number of mismatches
- Number of gap openings
- Query start position
- Query end position
- Subject start position
- Subject end position
- E-value (set to -1 in BURST)
- Bit score (used for other purposes in BURST)
- Taxonomy (LCA-based, if provided and using CAPITALIST mode)
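The column list above maps naturally onto a small record type. Below is a sketch of a parser, assuming the output is tab-separated as in standard BLAST-6 and that the taxonomy column is simply absent when no taxonomy was requested. `Hit` and `parse_hit` are hypothetical names; because BURST repurposes the E-value and bit-score columns, they are kept as raw strings rather than interpreted.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    query: str
    reference: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    evalue: str        # set to -1 by BURST
    score_field: str   # bit-score column, reused by BURST for other purposes
    taxonomy: Optional[str]


def parse_hit(line: str) -> Hit:
    """Parse one line of BURST's modified BLAST-6 output."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 tab-separated columns, got {len(fields)}")
    return Hit(
        query=fields[0],
        reference=fields[1],
        percent_identity=float(fields[2]),
        alignment_length=int(fields[3]),
        mismatches=int(fields[4]),
        gap_openings=int(fields[5]),
        query_start=int(fields[6]),
        query_end=int(fields[7]),
        subject_start=int(fields[8]),
        subject_end=int(fields[9]),
        evalue=fields[10],
        score_field=fields[11],
        taxonomy=fields[12] if len(fields) > 12 else None,
    )
```

A typical use would be `hits = [parse_hit(line) for line in open("alignments.b6")]`, followed by filtering on `percent_identity`.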
- Use the accelerator (-a) option for faster alignments on large databases
- Increase the number of threads (-t) to utilize multiple CPU cores
- Adjust the cache (-c) and latency (-l) parameters for fine-tuning performance
- Use the fingerprint (-f) option for additional filtering of potential matches
- Consider using the prepass (-p) option for ultra-fast, heuristic pre-matching
- Local alignment is not supported (only end-to-end alignment)
- Custom scoring matrices are not implemented
- Paired-end unstitched alignments are not directly supported
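To make the end-to-end restriction concrete: the whole query must be aligned, while the reference may extend freely on either side. The sketch below is the textbook dynamic-programming recurrence for that setting with unit costs for mismatches and gaps. It is offered only as an illustration of the concept; BURST's actual algorithm and its identity scoring are not described in this document, and `end_to_end_edit_distance` is a hypothetical name.

```python
def end_to_end_edit_distance(query, reference):
    """Minimum edits needed to align the full query somewhere in the reference.

    Row 0 is all zeros, so the alignment may start at any reference
    position; taking the minimum of the last row lets it end anywhere.
    """
    previous = [0] * (len(reference) + 1)
    for i, q in enumerate(query, start=1):
        current = [i]  # query[:i] against an empty reference costs i gaps
        for j, r in enumerate(reference, start=1):
            substitution = previous[j - 1] + (0 if q == r else 1)
            gap_in_reference = previous[j] + 1
            gap_in_query = current[j - 1] + 1
            current.append(min(substitution, gap_in_reference, gap_in_query))
        previous = current
    return min(previous)


print(end_to_end_edit_distance("ACGT", "TTACGTTT"))  # 0: exact occurrence
print(end_to_end_edit_distance("ACGT", "TTACTTT"))   # 1: one edit needed
```

A local aligner, by contrast, would be allowed to drop the ends of the query, which is exactly what the first limitation rules out.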
BURST includes basic error checking for input file formats and command-line arguments. It will print error messages and exit if it encounters issues like malformed input files or invalid options.
Basic alignment:
burst -r references.fasta -q queries.fasta -o alignments.b6 -i 0.97
Create a database and accelerator:
burst -r references.fasta -d DNA 320 -a references.acx -o references.edx
Align using a pre-built database with taxonomy:
burst -r references.edx -a references.acx -q queries.fasta -b taxonomy.txt -o alignments.b6 -m CAPITALIST
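When alignments are launched from a pipeline, it can help to assemble the command line in code rather than by string concatenation. The following sketch builds an argument list using only flags documented above; `burst_align_command` is a hypothetical helper, not part of BURST, and it assumes the `burst` executable is on `PATH` when the command is eventually run.

```python
import shlex

MODES = {"BEST", "ALLPATHS", "CAPITALIST", "FORAGE", "ANY"}


def burst_align_command(references, queries, output, *, accelerator=None,
                        taxonomy=None, mode=None, identity=None,
                        threads=None, both_strands=False):
    """Return an argv list for a BURST alignment run (hypothetical helper)."""
    if mode is not None and mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if identity is not None and not 0 <= identity <= 1:
        raise ValueError("identity must be in the range 0-1")
    cmd = ["burst", "-r", references, "-q", queries, "-o", output]
    if accelerator is not None:
        cmd += ["-a", accelerator]
    if taxonomy is not None:
        cmd += ["-b", taxonomy]
    if mode is not None:
        cmd += ["-m", mode]
    if identity is not None:
        cmd += ["-i", str(identity)]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if both_strands:
        cmd.append("-fr")
    return cmd


cmd = burst_align_command("references.fasta", "queries.fasta",
                          "alignments.b6", identity=0.97)
print(shlex.join(cmd))
```

The printed line reproduces the basic alignment example; the list itself can be handed to `subprocess.run(cmd, check=True)` to execute it.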
BURST provides a powerful and flexible tool for optimal sequence alignment, particularly suited for metagenomic studies and large-scale sequence analysis projects. Its various output options and optimization features make it suitable for a wide range of applications, from simple best-hit reporting to complex taxonomic assignment tasks.