Unicorn computes alignment-based statistics from BAM/SAM files for metagenomic analysis.
Unicorn depends on htslib. Make sure htslib is installed and available:
git clone --recursive https://github.com/GeoGenetics/unicorn.git
cd unicorn
make
If htslib is in a non-standard location, set HTSSRC:
export HTSSRC=/path/to/htslib/
make
/path/to/htslib must contain the lib and include directories, and the library must be findable by the linker at runtime.
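The layout requirement above can be checked before running make. The following is a minimal sketch; check_htssrc is a hypothetical helper name, not part of unicorn, and it only tests that the two subdirectories exist, not that the library inside them is usable.

```shell
# Hypothetical helper (not shipped with unicorn): verify that a directory
# has the lib/ and include/ subdirectories expected of an htslib prefix.
check_htssrc() {
    dir="${1%/}"            # tolerate a trailing slash, as in /path/to/htslib/
    status=0
    for sub in lib include; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage sketch:
#   export HTSSRC=/path/to/htslib/
#   check_htssrc "$HTSSRC" && make
```

On Linux, one common way to make a library in a non-standard location findable at runtime is to add its lib directory to LD_LIBRARY_PATH; whether that is needed here depends on how the binary was linked.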
For conda environments, install with conda:
conda install -c conda-forge -c bioconda enhjoerning
Or, if you installed htslib from conda and want to compile unicorn yourself:
export HTSSRC=$CONDA_PREFIX
make
Unicorn provides four commands for different types of alignment statistics or filtering:
$ ./unicorn
unicorn 2.2.0 31750bf
Aug 22 2025 12:04:35
./unicorn command [options] -b <in.bam>|<in.sam>
Commands:
refstats Compute per reference statistics.
bamstats Compute per bam statistics.
tidstats Compute per taxid statistics.
reassign Filter alignments via EM algorithm.
Compute statistics for each reference sequence.
$ ./unicorn refstats
unicorn 2.2.0 31750bf
Aug 22 2025 12:04:35
./unicorn refstats [options] -b <in.bam>|<in.sam>
Options:
-b <str> Input bam|sam [Required]
-t <int>, --threads <int> Number of threads [4]
--outbam <str> Output BAM file with filtered alignments.
--outstat <str> Output statistics file
--[FILTER] <PARAM> Apply filter "FILTER" with parameter "PARAM"
For example "--minreads 100" to filter out references with
less than 100 reads.
Available filters:
- minrefl <int> Minimum reference length to consider [0]
- minreads <int> Minimum number of reads to consider [1]
--withtid Report taxid of reference sequence. Requires --acc2tax, --names and --nodes options.
--names <str> Taxonomy nodeid to name mapping file.
--nodes <str> Taxonomy nodeid to parent nodeid mapping file.
--acc2tax <str> Accession to taxid mapping file or .khash file.
--verbose Print libunicorn's messages.
-h print this help message
Basic usage:
./unicorn refstats -b input.bam > refstats.txt
Example with filtering:
./unicorn refstats -b input.bam --minreads 10 --minrefl 1000 --outstat refstats.txt
Example with filtering and filtered bam output:
./unicorn refstats -b input.bam --minreads 10 --minrefl 1000 --outbam filtered.bam > refstats.txt
Output format (28 columns):
- Id - Reference name
- Length - Reference length
- n_alns - Number of alignments to the reference
- n_reads - Number of reads to the reference (always ≤ n_alns)
- m_readl - Mean read length
- std_readl - Standard deviation of read length
- md_readl - Median of read length
- mo_readl - Mode of read length
- readl_min - Smallest read length
- readl_max - Largest read length
- m_alnnm - Mean alignment edit distance
- m_alnani - Mean alignment ANI (Average Nucleotide Identity)
- std_alnani - Standard deviation of alignment ANI
- md_alnani - Median of alignment ANI
- n_covbases - Number of covered bases
- m_cov - Mean coverage depth
- breath_cov - Breadth of coverage
- exp_breath - Expected breadth
- breath_ratio - Breadth ratio
- m_covcovered - Mean coverage of covered positions
- std_covcovered - Standard deviation of coverage of covered positions
- evenness_cov - Evenness of coverage
- site_density - Site density
- entropy - Coverage entropy
- gini - Coverage gini coefficient
- n_entropy - Normalized entropy
- n_gini - Normalized Gini coefficient
- tad80 - Truncated Average Depth at 80% of coverage mass
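The column list above lends itself to post-filtering by column name rather than by position. The sketch below assumes the statistics file is tab-separated and begins with a header line that uses these column names; neither is confirmed here, so check your own output first. filter_by_column is a hypothetical helper, not a unicorn command.

```shell
# Hypothetical helper: keep the header plus rows whose value in the named
# column is at least a given minimum.
# Assumption: tab-separated input with a header line of column names.
filter_by_column() {
    # $1 = stats file, $2 = column name, $3 = minimum value (inclusive)
    awk -F'\t' -v name="$2" -v min="$3" '
        NR == 1 {
            # Locate the requested column in the header line.
            for (i = 1; i <= NF; i++) if ($i == name) c = i
            if (!c) { print "column not found: " name > "/dev/stderr"; exit 2 }
            print
            next
        }
        # Force numeric comparison on both sides.
        ($c + 0) >= (min + 0)
    ' "$1"
}

# Usage sketch:
#   filter_by_column refstats.txt n_reads 50 > refstats.min50.txt
```

Looking columns up by name keeps the filter working if the column order changes between versions.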
Compute overall statistics for BAM/SAM files.
$ ./unicorn bamstats
unicorn 2.2.0 31750bf
Aug 22 2025 12:04:35
./unicorn bamstats [options] -b <in.bam>|<in.sam>
Options:
-b <str> Input bam|sam
--outstat <str> Output statistics file
--filelist <str> File containing input file paths. One per line.
--printdists Print distributions of read lengths, alignment lengths, etc.
This will create a files <inputname>.dists.txt
Example:
./unicorn bamstats -b input.bam > bam_summary.txt
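For several inputs, the --filelist option takes a plain text file with one path per line. A minimal sketch for building such a file follows; make_filelist is a hypothetical helper, and the commented invocation assumes --filelist can be used in place of -b, which the help text does not state explicitly.

```shell
# Hypothetical helper: write every .bam/.sam path under a directory to a
# list file, one path per line, in sorted order.
make_filelist() {
    # $1 = directory to search, $2 = output list file
    find "$1" -type f \( -name '*.bam' -o -name '*.sam' \) | sort > "$2"
}

# Usage sketch (assumes --filelist replaces -b):
#   make_filelist alignments/ inputs.txt
#   ./unicorn bamstats --filelist inputs.txt > bam_summary.txt
```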
Compute statistics grouped by taxonomic ID.
$ ./unicorn tidstats
unicorn 2.2.0 31750bf
Aug 22 2025 12:04:35
./unicorn tidstats [options] -b <in.bam>|<in.sam>
Options:
-b <str> Input bam|sam
-o <str> | --outstat <str> Output statistics file [/dev/stdout]
-a <str> | --acc2tax <str> Accession to taxid mapping file or .khash file.
Providing a .khash file is much faster.
-n <str> | --names <str> Taxonomy names file.
-d <str> | --nodes <str> Taxonomy nodes file
--[FILTER] <PARAM> Apply filter "FILTER" with parameter "PARAM"
For example "--minreads 100" to filter out taxids with
less than 100 reads.
Available filters:
- minrefl <int> Minimum reference length. [0]
- minreads <int> Minimum number of reads per taxid. [1]
- minmani <float> Minimum mean ANI per taxid. [0]
--filelist <str> File containing input file paths. One per line.
--rank <str> Taxonomic rank to summarize by. [species]
--verbose Prints libunicorn's messages.
-h Print this help message
Example:
./unicorn tidstats -b input.bam -a acc2tax.txt -n names.dmp -d nodes.dmp --rank genus > genus_stats.txt
Filter alignments using an Expectation-Maximization algorithm to reassign reads with multiple alignments.
unicorn reassign is a reimplementation of bamfilter.
Input BAM files must be query-grouped, that is, collated or sorted by query name.
$ ./unicorn reassign
unicorn 2.2.0 31750bf
Aug 22 2025 12:04:35
./unicorn reassign [options] -b <in.bam>|<in.sam>
Options:
-b <str> Input bam|sam
-o <str> | --outbam <str> Output BAM file [stdout]
-t <int> | --threads <int> Number of threads to use [4]
--alpha <float> Score retention scaling factor (0.0, 1.0] [0.80]
--niter <int> Max number of EM algorithm iterations [5]
--scale-type <str> Scaling type subject weights [LENGTH]
Available types:
NONE - No subject weight scaling
LENGTH - Scale by subject length
SQRTLEN - Scale by square root of subject length
--verbose Prints libunicorn's messages.
-h Print this help message
Example:
./unicorn reassign -b input.bam --alpha 0.9 --niter 10 -o reassigned.bam
- Compute reference statistics:
./unicorn refstats -b aligned.bam --minreads 5 > reference_stats.txt
- Get taxonomic summary:
./unicorn tidstats -b aligned.bam -a acc2tax.khash -n names.dmp -d nodes.dmp --rank species > species_summary.txt
- Filter ambiguous alignments:
./unicorn reassign -b aligned.bam --alpha 0.8 -o filtered.bam
- Generate BAM-level summary:
./unicorn bamstats -b filtered.bam --printdists --outstat final_summary.txt
- acc2tax: Tab-separated file mapping accession IDs to taxonomy IDs
- names.dmp: NCBI taxonomy names file
- nodes.dmp: NCBI taxonomy nodes file
- .khash files: Binary format for faster acc2tax lookups.
- BAM/SAM files must be query-grouped (sorted/collated by read name)
- Use samtools sort -n input.bam -o query_grouped.bam if needed
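Whether a file already declares itself query-grouped can be read from its header before re-sorting. The sketch below inspects the @HD line for SO:queryname or GO:query, the sort-order and grouping tags defined by the SAM specification. header_says_query_grouped is a hypothetical helper; the header is only a declaration, so a file may be grouped without saying so, and this check does not inspect the records themselves.

```shell
# Hypothetical helper: read SAM header text on stdin and succeed if the
# @HD line declares SO:queryname or GO:query.
header_says_query_grouped() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i == "SO:queryname" || $i == "GO:query") found = 1
        }
        END { exit !found }   # exit status 0 only when a matching tag was seen
    '
}

# Usage sketch, assuming samtools is installed:
#   samtools view -H input.bam | header_says_query_grouped \
#       || samtools sort -n input.bam -o query_grouped.bam
```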
Run the test suite:
make test
For development information, see src/README.md
MIT License - see LICENSE file for details.