MMseqs2 Release 11-e1a1c
At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster. The new databases module helps to download and set up databases. We now have chat support at chat.mmseqs.com.
Known Issues
- rbh crashes due to invalid sorting mode (#290)
- Homebrew's macOS version does not use multiple cores (#289)
- prefilter results can be unstable between different runs for extremely redundant databases (#277)
- linclust/cluster can crash for very small input sets (#274)
Breaking Changes
- kmermatcher --skip-n-repeat-kmer parameter was replaced with --ignore-multi-kmer
  It does not discard whole sequences anymore if a k-mer occurred too often; instead it skips the specific k-mers.
  Either mode is only used in Plass and not in Linclust
- --lca-ranks from (easy-)taxonomy and lca has to be delimited with semicolons (;) instead of colons (:)
- --dont-shuffle flag was renamed to --shuffle true/false
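Scripts that still pass a colon-delimited --lca-ranks value need to be updated for the new delimiter. A minimal sketch of such a conversion follows; the helper name convert_lca_ranks and the example rank list are illustrative assumptions, not part of MMseqs2.

```shell
# Hypothetical helper (not part of MMseqs2): rewrite an old colon-delimited
# --lca-ranks value into the semicolon-delimited form required by this release.
convert_lca_ranks() {
    printf '%s\n' "$1" | tr ':' ';'
}

# Example rank list, chosen only for illustration.
old_ranks='genus:family:order:superkingdom'
new_ranks=$(convert_lca_ranks "$old_ranks")
echo "$new_ranks"   # prints genus;family;order;superkingdom
```

Because the semicolon is a command separator in most shells, the converted value has to stay quoted when it is passed on, e.g. --lca-ranks "$new_ranks".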
Features
- new databases workflow to list and download common databases.
Supported databases:
  Name                  Type        Taxonomy  Url
- UniRef100             Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniRef90              Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniRef50              Aminoacid   yes       https://www.uniprot.org/help/uniref
- UniProtKB             Aminoacid   yes       https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL      Aminoacid   yes       https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot  Aminoacid   yes       https://uniprot.org
- NR                    Aminoacid   -         https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                    Nucleotide  -         https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- PDB                   Aminoacid   -         https://www.rcsb.org
- PDB70                 Profile     -         https://github.com/soedinglab/hh-suite
- Pfam-A.full           Profile     -         https://pfam.xfam.org
- Pfam-A.seed           Profile     -         https://pfam.xfam.org
- eggNOG                Profile     -         http://eggnog5.embl.de
- Resfinder             Nucleotide  -         https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari              Nucleotide  yes       https://github.com/lskatz/Kalamari
- (easy-)search --slice-search is now usable. Slice search finds all hits that fulfill the alignment criteria while using only as much disk space as defined by --disk-space-limit
- createdb and the various easy- workflows learned to read query input from STDIN
- taxonomyreport learned to display the summarized taxonomy result with Krona
- new filtertaxseqdb module for filtering sequence DBs with taxonomy information according to provided taxa
- --taxon-list parameter understands expressions. E.g. get all bacterial and human sequences: --taxon-list "2||9606"
- easy-search and convertalis can now output taxonomic information using --format-output
  taxid       Taxonomic identifier
  taxname     Taxon Name
  taxlineage  Taxonomic lineage
- speed up in (easy-)cluster/linclust by improving k-mer extraction
- MMseqs2 consistently creates .source and .lookup files to record which input file each sequence came from
  E.g.: after mmseqs createdb input1.fa input2.fa seqDB, each sequence in seqDB can tell if it came from input1.fa or input2.fa
- createdb learned to index an existing (single-line-seq per entry) FASTA file without copying the FASTA content to a new database
- align and rescorediagonal learned to align circular sequences
- align exposes the z-drop parameter of its banded nucleotide alignment algorithm
- reverseseq learned to reverse profiles
- filterdb can filter rows with value within given percentage of first row
- new aggregatetax module to assign a taxonomic label to contigs according to the fragments matched on the contig
- Adjusting --max-seq-len is not required anymore; MMseqs2 automatically increases the length now.
- MMseqs2 on Cygwin/Windows uses nedmalloc as its memory allocator now and does not massively slow down due to lock contention
- new tar2db module to efficiently transform content of tar archives to MMseqs2 databases
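The .source and .lookup files mentioned in the list above can be joined to report the input file of every sequence. The sketch below is not an MMseqs2 command. It assumes, without confirmation from these notes, that both files are tab-separated, with .source holding a file number and a file name, and .lookup holding an internal key, a sequence identifier and a file number. The helper name seq_sources is hypothetical.

```shell
# Hypothetical helper (not part of MMseqs2). Assumed tab-separated layouts:
#   DB.source : <file number> <input file name>
#   DB.lookup : <internal key> <sequence identifier> <file number>
# Prints "<sequence identifier><TAB><input file name>" for every lookup entry.
seq_sources() {
    awk -F '\t' '
        # First file (.source): remember file number -> file name
        FILENAME == ARGV[1] { src[$1] = $2; next }
        # Second file (.lookup): print identifier with the matching file name
        { print $2 "\t" src[$3] }
    ' "$1.source" "$1.lookup"
}

# Usage (assumes seqDB was created from input1.fa and input2.fa):
#   seq_sources seqDB
```

If the actual column layout differs, only the field numbers in the awk program would need to change.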
Bug fixes
- createindex would create corrupted indices for profile target databases
- rbh workflow would create its result DB at an unexpected (wrong) location
- (easy)-taxonomy --lca-mode 3 (Approx. LCA) was aligning invalid sequences in the second iteration and producing bad results
- lca (and (easy)-taxonomy) add empty columns for unclassified sequences to be valid TSVs
- kmermatcher uses xxhash for hashing now (faster)
- kmermatcher avoids a crash when the machine does not have enough memory to process the data at once (affects linclust/cluster)
- kmermatcher correctly deals with sequences longer than MAX_SHRT now
- kmermatcher fixed various edge cases (e.g. alignment of 1-char sequences)
- kmermatcher hash-shift would be ignored
- offsetalignment could produce wrong results in the minus-strand
- clust now correctly and consistently handles alignment DB input
- clusthash better deals with nucleotide input now and several multi-threaded inefficiencies were resolved
- (easy-)cluster --single-step-clustering could cluster unrelated sequences due to hash collisions
- prefilter --diag-score 0 respects --min-ungapped-score
- createseqfiledb could print empty sequence lines
- taxonomyreport could crash if no sequence was unclassified
- result2flat could crash with long sequence input
- result2msa, result2profile, msa2profile backport filtering fix from HHblits
- align could produce bad alignments if all sequence lengths in the query DB were a lot shorter than in the target DB
- splitsequence fix issues with splitsequence if combined with compressed
- result2profile fix Filter2 bug of HH-suite in MMseqs2
- apply would crash due to reading wrong entry lengths
- filterdb --filter-expression was not thread safe and could corrupt results
- filterdb --extract-lines and --trim-to-one-column are compatible with each other
Developers
- Internal representation of sequences changed from 4-byte per character to 1-byte per character
- Compilation under AppleClang + libomp works now (see util/build_osx.sh)
- Tools inheriting from MMseqs2 can now add their own citations
- MMseqs2 on macOS compiles with the macOS 10.9 SDK (removed symlinkat call; relevant for bioconda)