Unicorn

Unicorn computes alignment-based statistics from BAM/SAM files for metagenomic analysis.

Dependencies

Unicorn depends on:

htslib for BAM/SAM file handling
klib (included as submodule)

Installation

Standard installation

Make sure htslib is installed and available, then clone the repository with its submodules and build:

git clone --recursive https://github.com/GeoGenetics/unicorn.git
cd unicorn
make

Alternative build configurations

If htslib is in a non-standard location, set HTSSRC:

export HTSSRC=/path/to/htslib/
make

/path/to/htslib must contain lib and include directories, and its lib directory must be searchable by the linker at runtime.
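Checking the directory layout before running make can save a failed build. The sketch below is not part of unicorn: check_htssrc is a hypothetical helper name, and it only tests that the two directories exist, not that htslib itself is present in them.

```shell
# Hypothetical helper (not shipped with unicorn): verify that a directory
# has the include/ and lib/ subdirectories expected of an htslib prefix.
# Uses the first argument, or $HTSSRC when no argument is given.
check_htssrc() {
    dir="${1:-${HTSSRC:-}}"
    if [ -z "$dir" ]; then
        echo "HTSSRC is not set" >&2
        return 1
    fi
    for sub in include lib; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing directory: $dir/$sub" >&2
            return 1
        fi
    done
    echo "ok: $dir"
}
```

On Linux, one common way to make a non-standard lib directory searchable at runtime is to add it to LD_LIBRARY_PATH, for example export LD_LIBRARY_PATH="$HTSSRC/lib:$LD_LIBRARY_PATH". Whether this is needed depends on how htslib was installed.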

In a conda environment, install the package directly:

conda install -c conda-forge -c bioconda enhjoerning

Or, if you installed htslib from conda and want to compile unicorn yourself:

export HTSSRC=$CONDA_PREFIX
make

Usage

Unicorn provides four commands for different types of alignment statistics or filtering:

$ ./unicorn
unicorn 2.2.0 31750bf
        Aug 22 2025 12:04:35
./unicorn command [options] -b <in.bam>|<in.sam>
Commands:
  refstats    Compute per reference statistics.
  bamstats    Compute per bam statistics.
  tidstats    Compute per taxid statistics.
  reassign    Filter alignments via EM algorithm.

Commands in Detail

1. refstats - Per-reference statistics

Compute statistics for each reference sequence.

$ ./unicorn refstats
unicorn 2.2.0 31750bf
        Aug 22 2025 12:04:35
./unicorn refstats [options] -b <in.bam>|<in.sam>
Options:
  -b <str>   Input bam|sam [Required]
  -t <int>, --threads <int> Number of threads [4]
  --outbam  <str> Output BAM file with filtered alignments.
  --outstat <str> Output statistics file
  --[FILTER] <PARAM>  Apply filter "FILTER" with parameter "PARAM"
      For example "--minreads 100" to filter out references with
      less than 100 reads.
      Available filters:
       - minrefl  <int>  Minimum reference length to consider [0]
       - minreads <int>  Minimum number of reads to consider  [1]
  --withtid  Report taxid of reference sequence. Requires --acc2tax, --names and --nodes options.
  --names   <str> Taxonomy nodeid to name mapping file.
  --nodes   <str> Taxonomy nodeid to parent nodeid mapping file.
  --acc2tax <str> Accession to taxid mapping file or .khash file.
  --verbose     Print libunicorn's messages.
  -h         print this help message

Basic usage:

./unicorn refstats -b input.bam > refstats.txt

Example with filtering:

./unicorn refstats -b input.bam --minreads 10 --minrefl 1000 --outstat refstats.txt

Example with filtering and filtered bam output:

./unicorn refstats -b input.bam --minreads 10 --minrefl 1000 --outbam filtered.bam > refstats.txt

Output format (28 columns):

Id - Reference name
Length - Reference length
n_alns - Number of alignments to the reference
n_reads - Number of reads to the reference (always ≤ n_alns)
m_readl - Mean read length
std_readl - Standard deviation of read length
md_readl - Median read length
mo_readl - Mode of read length
readl_min - Smallest read length
readl_max - Largest read length
m_alnnm - Mean alignment edit distance
m_alnani - Mean alignment ANI (Average Nucleotide Identity)
std_alnani - Standard deviation of alignment ANI
md_alnani - Median of alignment ANI
n_covbases - Number of covered bases
m_cov - Mean coverage depth
breath_cov - Breadth of coverage
exp_breath - Expected breadth
breath_ratio - Breadth ratio
m_covcovered - Mean coverage of covered positions
std_covcovered - Standard deviation of coverage of covered positions
evenness_cov - Evenness of coverage
site_density - Site density
entropy - Coverage entropy
gini - Coverage Gini coefficient
n_entropy - Normalized entropy
n_gini - Normalized Gini coefficient
tad80 - Truncated Average Depth at 80% of coverage mass
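With 28 columns, it is often easier to pull out a few by name than to count positions. The sketch below assumes that the statistics file is tab-separated and that its first line is a header using the column names listed above; neither is stated here, so check a real output file first. select_columns is a hypothetical helper, not a unicorn command.

```shell
# select_columns FILE COL [COL...]
# Print the named columns of a tab-separated table whose first line is a
# header. ASSUMPTION: refstats output has this shape.
select_columns() {
    file="$1"; shift
    awk -F'\t' -v OFS='\t' -v want="$*" '
        BEGIN { n = split(want, names, " ") }
        NR == 1 {
            # Map each header name to its column position.
            for (i = 1; i <= NF; i++) idx[$i] = i
            for (j = 1; j <= n; j++)
                if (!(names[j] in idx)) {
                    print "unknown column: " names[j] > "/dev/stderr"
                    exit 1
                }
        }
        {
            # Emit the requested columns in the requested order.
            line = $(idx[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(idx[names[j]])
            print line
        }
    ' "$file"
}

# Example (file name is a placeholder):
#   select_columns refstats.txt Id n_reads breath_cov tad80
```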

2. bamstats - Per-BAM statistics

Compute overall statistics for BAM/SAM files.

$ ./unicorn bamstats
unicorn 2.2.0 31750bf
        Aug 22 2025 12:04:35
./unicorn bamstats [options] -b <in.bam>|<in.sam>
Options:
  -b <str>         Input bam|sam
  --outstat <str>  Output statistics file
  --filelist <str> File containing input file paths. One per line.
  --printdists     Print distributions of read lengths, alignment lengths, etc.
                   This will create a files <inputname>.dists.txt

Example:

./unicorn bamstats -b input.bam > bam_summary.txt
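The --filelist option takes a plain text file with one input path per line. A minimal sketch for building such a file from a directory follows; make_filelist is a hypothetical helper, and the paths in the comments are placeholders.

```shell
# make_filelist DIR
# Print every .bam and .sam file under DIR, one path per line, in a
# stable byte-wise order.
make_filelist() {
    find "$1" -type f \( -name '*.bam' -o -name '*.sam' \) | LC_ALL=C sort
}

# Example (it is an assumption that --filelist can be given instead of -b):
#   make_filelist /data/alignments > inputs.txt
#   ./unicorn bamstats --filelist inputs.txt --outstat bam_summary.txt
```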

3. tidstats - Per-taxid statistics

Compute statistics grouped by taxonomic ID.

$ ./unicorn tidstats
unicorn 2.2.0 31750bf
        Aug 22 2025 12:04:35
./unicorn tidstats [options] -b <in.bam>|<in.sam>
Options:
  -b <str>                     Input bam|sam
  -o <str> | --outstat <str>   Output statistics file [/dev/stdout]
  -a <str> | --acc2tax <str>   Accession to taxid mapping file or .khash file.
                               Providing a .khash file is much faster.
  -n <str> | --names <str>     Taxonomy names file.
  -d <str> | --nodes <str>     Taxonomy nodes file
  --[FILTER] <PARAM>  Apply filter "FILTER" with parameter "PARAM"
      For example "--minreads 100" to filter out taxids with
      less than 100 reads.
      Available filters:
       - minrefl  <int>   Minimum reference length. [0]
       - minreads <int>   Minimum number of reads per taxid. [1]
       - minmani  <float> Minimum mean ANI per taxid. [0]
  --filelist <str>             File containing input file paths. One per line.
  --rank <str>                 Taxonomic rank to summarize by. [species]
  --verbose                    Prints libunicorn's messages.
  -h                           Print this help message

Example:

./unicorn tidstats -b input.bam -a acc2tax.txt -n names.dmp -d nodes.dmp --rank genus > genus_stats.txt

4. reassign - EM algorithm filtering

Filter alignments using an Expectation-Maximization (EM) algorithm to reassign reads with multiple alignments. unicorn reassign is a reimplementation of bamfilter.

BAM files must be query-grouped, that is, collated or sorted by query name.

$ ./unicorn reassign
unicorn 2.2.0 31750bf
        Aug 22 2025 12:04:35
./unicorn reassign [options] -b <in.bam>|<in.sam>
Options:
  -b <str>                     Input bam|sam
  -o <str> | --outbam  <str>   Output BAM file [stdout]
  -t <int> | --threads <int>   Number of threads to use [4]
  --alpha <float>              Score retention scaling factor (0.0, 1.0] [0.80]
  --niter <int>                Max number of EM algorithm iterations [5]
  --scale-type <str>           Scaling type subject weights [LENGTH]
                               Available types:
                                NONE    - No subject weight scaling
                                LENGTH  - Scale by subject length
                                SQRTLEN - Scale by square root of subject length
  --verbose                    Prints libunicorn's messages.
  -h                           Print this help message

Example:

./unicorn reassign -b input.bam --alpha 0.9 --niter 10 -o reassigned.bam

Examples

Basic workflow for metagenomic analysis:

Compute reference statistics:

./unicorn refstats -b aligned.bam --minreads 5 > reference_stats.txt

Get taxonomic summary:

./unicorn tidstats -b aligned.bam -a acc2tax.khash -n names.dmp -d nodes.dmp --rank species > species_summary.txt

Filter ambiguous alignments:

./unicorn reassign -b aligned.bam --alpha 0.8 -o filtered.bam

Generate BAM-level summary:

./unicorn bamstats -b filtered.bam --printdists --outstat final_summary.txt

File Formats

Taxonomy files

acc2tax: Tab-separated file mapping accession IDs to taxonomy IDs
names.dmp: NCBI taxonomy names file
nodes.dmp: NCBI taxonomy nodes file
.khash files: Binary format for faster acc2tax lookups.

Input requirements

BAM/SAM files must be query-grouped (sorted/collated by read name)
Use samtools sort -n input.bam -o query_grouped.bam if needed
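Whether a file declares itself query-grouped can be read from its header. In the SAM format, the @HD line may carry SO:queryname (sorted by query name) or GO:query (grouped by query). The sketch below checks for either tag; is_query_grouped is a hypothetical helper name. The tag is only a declaration: a file without it may still be grouped, for example raw aligner output, so a negative result means "not declared", not "not grouped".

```shell
# is_query_grouped
# Read SAM header text on stdin. Succeed if the @HD line declares
# SO:queryname or GO:query, fail otherwise.
is_query_grouped() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i == "SO:queryname" || $i == "GO:query") found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Example (file names are placeholders): sort only when the header does
# not already declare query grouping.
#   samtools view -H input.bam | is_query_grouped \
#       || samtools sort -n input.bam -o query_grouped.bam
```

samtools collate -o query_grouped.bam input.bam is an alternative to a full name sort when grouping alone is enough.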

Testing

Run the test suite:

make test

Developers

For development information, see src/README.md

License

MIT License - see LICENSE file for details.
