Skip to content

jzhang-dev/hapcut2-mec-solver

Repository files navigation

HapCUT2 MEC solver

Solve general minimum error correction (MEC) problems using HapCUT2.

Usage

Input and output

The input for HapCUT2 MEC solver is a allele matrix, which is a two-dimensional matrix, with each row representing a fragment, and each column representing a locus.

For an allele matrix with n rows and m columns, the output includes:

  • a haplotype matrix with 2 rows and m columns;
  • a partitioning vector of length n;
  • and the minimum cost.

The input allele matrix can be supplied using either the JSON format, or as a .npz file containing a sparse CSR matrix. In both cases, the output will be a specified using the JSON format.

For JSON input, the allowed values are 0 (Allele 0), 1 (Allele 1) and -1 (unknown allele), following the convention of MEC problems. For sparse matrix input, the allowed values are 1 (Allele 0), 2 (Allele 1) and 0 (unknown allele), following the convention of sparse matrices. The output always uses {-1, 0, 1} values for alleles.

For example, the following is a valid JSON input file with three loci and four fragments:

[
    [ 0,  1,  0], 
    [ 1,  0,  1], 
    [ 1, -1,  1],
    [ 1,  1, -1]
]

The output for the input matrix above will be:

{"haplotypes": [[0, 1, 0], [1, 0, 1]], "partition": [0, 1, 1, 0], "cost": 1}

Docker

Pull image from Docker Hub:

docker pull jzhang0246/hapcut2-mec-solver:latest

Example usage:

# The input allele matrix can be supplied as a JSON file
docker run \
  --rm \
  -v "$(pwd)"/tests:/data \
  jzhang0246/hapcut2-mec-solver:latest \
    python -m hapcut2_mec_solver \
      /data/test.json

# ... or as a npz file containing a sparse matrix 
docker run \
  --rm \
  -v "$(pwd)"/tests:/data \
  jzhang0246/hapcut2-mec-solver:latest \
    python -m hapcut2_mec_solver \
      /data/test.npz


# Alternatively, supply the matrix as a JSON string surrounded by single or double quotes. 
docker run \
  --rm 
  jzhang0246/hapcut2-mec-solver:latest \
    python -m hapcut2_mec_solver \
      '[[1, 0, 1], [0, 1, 0]]'

Run tests:

docker run --rm jzhang0246/hapcut2-mec-solver:latest pytest

Python integration

If HapCUT2 has already been installed and is available under $PATH as hapcut2, then HapCUT2 MEC solver can be used as a Python library.

Install using pip:

pip install git+https://github.com/jzhang-dev/hapcut2-mec-solver

Example usage:

>>> from hapcut2_mec_solver import MECSolver, solve_MEC

>>> fragments = [[0, 0, 1, 1], [1, 1, 0, 0]]
>>> haplotypes = solve_MEC(fragments)
>>> set(haplotypes) == {(0, 0, 1, 1), (1, 1, 0, 0)}
True

>>> fragments = [[0, 1, 0], [1, 0, 1], [1, -1, 1], [1, 1, -1]]
>>> result = MECSolver.from_fragments(fragments).solve()
>>> result.cost == 1
True
>>> result.partition in [(0, 1, 1, 0), (1, 0, 0, 1), (0, 1, 1, 1), (1, 0, 0, 0)]
True

Run tests:

git clone https://github.com/jzhang-dev/hapcut2-mec-solver
cd hapcut2-mec-solver
pip install -v .
pip install pytest
pytest

Notes

  • HapCUT2 assumes all loci are heterozygous by default even if only one allele is seen in the fragments. For example, fragments [[1, 0, 1], [1, 0, 1], [0, 1, 1]] will be solved as haplotypes [[1, 0, 1], [0, 1, 0]]. Although a --call_homozygous option is availble in HapCUT2, the prior probablity of a locus being homozygous is hard-coded to be extremely small, preventing a locus to be called homozygous in most cases. See this issue for further discussions.
  • Fragments that contain only -1 will be assigned -1 in the output partitions. Loci that contain only -1 will show as -1 in the output haplotypes.
  • HapCUT2 harnesses a heuristic algorithm to appoach the $\mathcal{NP}$-hard MEC problem. HapCUT2 is not guaranteed to find the theoretically optimal solution. In contrast, WhatsHap always finds the theoretically optimal solution with the lowest cost, but has $\mathit{O}(\exp(n))$ time complexity, where $n$ is the maximum coverage of the loci. See also WhatsHap wMEC solver.
  • See tests for more examples about the behavior of the HapCUT2 algorithm.

References

Edge, P., Bafna, V., & Bansal, V. (2017). HapCUT2: Robust and accurate haplotype assembly for diverse sequencing technologies. Genome Research, 27(5), 801–812. https://doi.org/10.1101/gr.213462.116

About

Solve general minimum error correction (MEC) problems using HapCUT2.

Topics

Resources

License

Stars

Watchers

Forks