Implements two VLMC (variable length Markov chain) construction algorithms, based on the ideas by Schulz et al.. In short, the VLMC of a sequence S is constructed by building a probabilistic suffix tree (PST) in two phases: support pruning and similarity pruning. This PST has a direct correspondence to the VLMC.
In support pruning, we select only those branches which are at most d deep (k-mer is at most d long), and occur at least c times in the sequence. This is implemented in practice using a Lazy Suffix Tree (using the WOTD algorithm described by Giegerich et al.). The suffix tree is extended with implicit nodes and suffix links. See search/lazy_suffix_tree.hpp and search/lazy_suffix_tree/ for the implementation details.
The similarity pruning proceeds to compute the forward probabilities (probability of a character a occurring after a context c), and then prunes the tree bottom-up. The pruning uses the Kullback-Leibler estimator for calculation of which contexts should be kept in the tree.
The output is as a .tree file which for, each node, contains the forward and reverse counts of each child as well as
the index of the (PST) children of that node.
The algorithms are highly parallel, and at time of implementation, much faster than other current VLMC construction algorithms, depending on parameter selection.
Published in BMC bioinformatics.
Pre-complied binaries for linux are available on GitHub under the releases.
The CLI can be used as follows. To train a VLMC on the file CP007136.1.fa, and compute the length-normalised negative log-likelihood:
./pst-classifier CP007136.1.fa --min-count 100 --max-depth 15 --threshold 3.9075 --multi-core --parallel-depth 2 > CP007136.1.tree
./pst-score-sequences -p CP007136.1.tree -s CP007136.1.faWhere the first argument is a path to a fasta file. For details on the arguments, run e.g. ./pst-classifier --help.
Equivalently, if CP007136.1.fa was a multi-fasta file,
./pst-batch-training CP007136.1.fa --min-count 100 --max-depth 15 --threshold 3.9075 --multi-core --parallel-depth 2 -o trees.h5
./pst-score-sequences -p trees.h5 -s CP007136.1.faTo use the VLMC in a c++ project, include the corresponding header files (in include/), make sure SeqAn3 is installed. Example usage:
#include "pst/probabilistic_suffix_tree.hpp"
#include "pst/distances/negative_log_likelihood.hpp"
#include <seqan3/alphabet/nucleotide/dna5.hpp>
int main() {
using seqan3::operator""_dna5;
seqan3::dna5_vector sequence = "GATTACA"_dna5;
// Parameters: name, sequence, min_count, max_depth, Kullback-Leibler threshold, parallel, parallel-depth
pst::KullbackLieblerTreeMap<seqan3::dna5> vlmc{"Test", sequence, 2, 3, 3.9075, true, 2};
std::cout << vlmc.to_tree() << std::endl;
auto score = pst::distances::negative_log_likelihood(vlmc, sequence);
std::cout << score << std::endl;
return 0;
}This will train a PST on sequence with a depth of 3, and include k-mers that occur at least 2 times, and then compute the length-normalised negative log-likelihood.
Since we rely heavily on the SeqAn3 library, the main building dependencies are similar to theirs. Specifically, a modern c++ compiler (full c++20 support, or GCC >= 7) is necessary. See the SeqAn3 quick-setup for the SeqAn3 specific dependencies and build instructions.
We also provide singularity and docker files, for easy
reproducibility and usage. The build_artifacts.sh script uses docker to build the binaries on debian 18.04 and outputs them in the artifacts folder.
Dependencies (debian packages in parentheses):
- GCC >= 7 (g++) (I've recently only been able to compile with GCC12)
- CMake (
cmake) - HDF5 (
libhdf5-dev) - Eigen (
libeigen3-dev) - tbb (
libtbb-dev)
Fetch the source code via git, or download the latest release.
Update/download the git submodules containing the SeqAn3, eigen3, robin-hood-hashing, HighFive libraries
(you can skip --init if you're only updating):
git submodule update --init --recursiveCreate and go to a build directory:
mkdir build
cd buildConfigure the cmake-project and build.
cmake -DCMAKE_BUILD_TYPE=Release ..
make pst-classifier pst-batch-training pst-score-sequencesThis should yield three files in the build/src directory:
pst-classifierwhich provides a command line interface to the PST/VLMC training.pst-batch-trainingwhich trains a VLMC for every sequence in a multi-fasta file, and saves the output to a hdf5 file.pst-score-sequenceswhich takes the signatures from the output ofpst-batch-trainingorpst-classifierand computes the negative log-likelihood of as set of sequences, and saves the result to a hdf5 file.
Or for tests, build all everything:
cmake -DCMAKE_BUILD_TYPE=Release ..
make
./tests/probabilistic_suffix_tree_map_testsDependencies (homebrew packages in parentheses):
- GCC >= 7 (gcc) (I've recently only been able to compile with GCC12)
- CMake (cmake)
- HDF5 (hdf5)
- Eigen (eigen)
- tbb (tbb)
Other than the different packages, the instructions for Linux should work, but you may need to provide the paths to the installed gcc and g++ when you configure the cmake project, e.g.:
CC=/opt/homebrew/bin/gcc-12 CXX=/opt/homebrew/bin/g++-12 cmake -DCMAKE_BUILD_TYPE=Release ..Windows is not supported by SeqAn3, but through the Windows Subsystem for Linux, the instructions for Linux should work.
We include a wrapper for the VLMC training and scoring in a package called libvlmc. It can be built from the
python-package directory, and requires numpy, cython, and scikit-build.
After ensuring the c++ package builds, the python package can be installed through:
python -m pip install -r dev-requirements.txt
python setup.py bdist_wheel
pip install --user dist/libvlmc-0.2-{Python_Version}-{Python_Version}-linux_x86_64.whlRunning the shell script python-cross-compile.sh will use docker to build a version of the python package that runs on debian.