Commit eb7ab75

Update README.md
1 parent d38e538 commit eb7ab75

1 file changed

+17
-9
lines changed

README.md

Lines changed: 17 additions & 9 deletions
@@ -1,29 +1,37 @@
11
Winnowmap
22
========================================================================
33

4-
Winnowmap is a new long-read mapping algorithm, and a result of our exploration into superior minimizer sampling techniques. Minimizer sampling was originally introduced by [Roberts et al.](http://www.cs.toronto.edu/~wayne/research/papers/minimizers.pdf). This technique yields reduced representation of reference genome, enabling fast mapping. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they can be guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based long read mappers (e.g., [minimap2](https://github.com/lh3/minimap2/), [mashmap](https://github.com/marbl/MashMap)) opt to discard the most frequently occurring minimizers from the genome in order to avoid excessive false positive seed hits. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions (e.g., long tandem repeats).
4+
Winnowmap is a long-read mapping algorithm and a result of our exploration into superior minimizer sampling techniques. Minimizer sampling was originally introduced by [Roberts et al.](http://www.cs.toronto.edu/~wayne/research/papers/minimizers.pdf). This technique yields a reduced representation of the reference genome, enabling fast mapping analyses. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they are guaranteed to have a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based long-read mappers (e.g., [minimap2](https://github.com/lh3/minimap2/), [mashmap](https://github.com/marbl/MashMap)) opt to discard the most frequently occurring minimizers from the genome in order to avoid excessive false-positive seed hits. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions (e.g., long tandem repeats).
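
To make the matching-minimizer property concrete, here is a minimal Python sketch of classic (w, k)-minimizer sampling. It is an illustration only, not Winnowmap's C++ implementation: the function names are hypothetical, and plain lexicographic order stands in for the hash-based ordering that real mappers typically use. With these definitions, two sequences that share a substring of length at least w + k - 1 select the same k-mer from the window covering that substring.

```python
def minimizers(seq, k, w):
    """Return the set of (position, k-mer) minimizers of seq.

    Each window of w consecutive k-mers contributes its smallest k-mer.
    Lexicographic order is an assumption made for readability; ties are
    broken by taking the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        pos = min(window, key=lambda i: (kmers[i], i))
        selected.add((pos, kmers[pos]))
    return selected


def shared_minimizer_kmers(a, b, k, w):
    """Return the k-mers chosen as minimizers in both sequences."""
    kmers_a = {kmer for _, kmer in minimizers(a, k, w)}
    kmers_b = {kmer for _, kmer in minimizers(b, k, w)}
    return kmers_a & kmers_b


if __name__ == "__main__":
    common = "GATTACAG"  # length 8 >= w + k - 1 = 6
    read = "TTTT" + common + "GGGG"
    ref = "CCAA" + common + "ATAT"
    print(shared_minimizer_kmers(read, ref, k=3, w=4))
```

Discarding a frequent k-mer from the index removes it from the reference-side set, which is how masking can leave a read window with no matching minimizer.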
55

6-
To address the above problem, Winnowmap implements a novel **weighted minimizer** sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while taking into account a weight for each k-mer; i.e, the higher the weight of a k-mer, the more likely it is to be selected. Rather than masking k-mers, Winnowmap opts to down-weight frequently occurring k-mers, thus reducing their chance of getting selected as minimizers. Winnowmap implements the new minimizer sampling and indexing algorithm, and borrows [minimap2’s](https://github.com/lh3/minimap2/) highly efficient anchor chaining and gapped alignment routines. The user-interface of Winnowmap is maintained similar to minimap2.
7-
8-
Comparing Winnowmap to minimap2, we observe a reduction in the mapping error-rate from 0.14% to 0.06% in the recently finished [human X chromosome](https://github.com/nanopore-wgs-consortium/CHM13), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.
6+
To address the above problem, Winnowmap implements a novel **weighted minimizer** sampling algorithm. A unique feature of Winnowmap is that it performs minimizer sampling while taking into account a weight for each k-mer; i.e., the higher the weight of a k-mer, the more likely it is to be selected. Rather than masking k-mers, Winnowmap down-weights frequently occurring k-mers, thus reducing their chance of being selected as minimizers. Winnowmap implements the new minimizer sampling and indexing algorithm, and borrows [minimap2’s](https://github.com/lh3/minimap2/) highly efficient anchor chaining and gapped alignment routines. The user interface of Winnowmap is kept similar to that of minimap2.
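
The idea of down-weighting instead of masking can be sketched in Python as follows. This is a hedged illustration, not Winnowmap's actual ordering function: it uses a generic exponential-race key (`-ln(u) / weight`, smallest key wins), under which a k-mer with a lower weight tends to receive a larger key and is therefore less likely to be the window minimum. The names `unit_hash`, `weighted_order`, and `weighted_minimizers`, and the `down_weight=0.125` default, are assumptions introduced for this example.

```python
import hashlib
import math


def unit_hash(kmer):
    """Map a k-mer to a deterministic pseudo-random float in (0, 1]."""
    digest = hashlib.sha256(kmer.encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / (2**64 + 1)


def weighted_order(kmer, frequent, down_weight=0.125):
    """Exponential-race key: lower weights give larger keys on average.

    `frequent` is a set of k-mers to down-weight; `down_weight` is a
    hypothetical value chosen for illustration.
    """
    weight = down_weight if kmer in frequent else 1.0
    return -math.log(unit_hash(kmer)) / weight


def weighted_minimizers(seq, k, w, frequent, down_weight=0.125):
    """Select one (position, k-mer) per window using weighted keys."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    keys = [weighted_order(kmer, frequent, down_weight) for kmer in kmers]
    selected = set()
    for start in range(len(kmers) - w + 1):
        pos = min(range(start, start + w), key=lambda i: (keys[i], i))
        selected.add((pos, kmers[pos]))
    return selected
```

Because no k-mer is removed, every window still contributes a minimizer; a frequent k-mer is chosen only when nothing with a smaller key is available in its window.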
97

108
## Compile
119

12-
Winnowmap requires c++11 to build, which is available in GCC >= 4.8. To compile Winnowmap, run the `make` command. Expect two executables `computeHighFreqKmers` and `winnowmap`.
10+
Winnowmap requires C++11 to build, which is available in GCC >= 4.8. To compile Winnowmap, run `make`. This produces two executables: `computeHighFreqKmers` and `winnowmap`.
1311

1412
## Usage
1513

1614
* Step 1: Compute a set of highly frequent k-mers:
1715
```sh
18-
computeHighFreqKmers 19 1 1024 ref.fa bad_kmers.txt
16+
computeHighFreqKmers 19 1 1024 ref.fa bad_Hk19_mers.txt  # (OR)
17+
computeHighFreqKmers 15 0 1024 ref.fa bad_k15_mers.txt
1918
```
20-
The above executable `computeHighFreqKmers` expects five arguments in the following order: k-mer length, binary integer 0/1 indicating whether to enable homopolymer compression, minimum k-mer frequency, reference genome and output file.
19+
The `computeHighFreqKmers` executable expects five arguments in the following order: k-mer length, a binary flag (0 or 1) indicating whether to enable homopolymer compression, minimum k-mer frequency, reference genome, and output file.
2120

2221
* Step 2: Map long reads to reference:
2322
```sh
24-
winnowmap -W bad_kmers.txt -cx map-pb ref.fa pacbio.fq.gz > output.paf
23+
winnowmap -W bad_Hk19_mers.txt -cx map-pb ref.fa pacbio.fq.gz > output.paf  # (OR)
24+
winnowmap -W bad_k15_mers.txt -cx map-ont ref.fa ont.fq.gz > output.paf
2525
```
2626
Except for the `-W` parameter required by Winnowmap, the options are consistent with [minimap2 usage](https://github.com/lh3/minimap2/blob/master/README.md).
2727

28-
Users should keep k-mer length and homopolymer compression parameters consistent in Steps 1 and 2. For example, `map-pb` preset in Step 2 uses k-mer length 19 with homopolymer compression enabled. The other popular preset `map-ont` uses k-mer length 15 without homopolymer compression. Later, the two steps will be merged into one in a subsequent release of Winnowmap for user convenience.
28+
Users should keep the k-mer length and homopolymer compression parameters consistent between Steps 1 and 2. For example, the `map-pb` preset in Step 2 uses k-mer length 19 with homopolymer compression enabled. The other popular preset, `map-ont`, uses k-mer length 15 without homopolymer compression. In the near future, these two steps will be merged into one for user convenience.
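
One way to avoid mismatched parameters is to derive both command lines from the preset name. The Python sketch below is a convenience wrapper that is not part of Winnowmap; `PRESET_PARAMS` and `build_commands` are hypothetical names, and the preset values and the minimum frequency of 1024 are taken from the examples above.

```python
import shlex

# k-mer length and homopolymer-compression flag implied by each preset,
# as described in the usage notes above.
PRESET_PARAMS = {
    "map-pb": {"k": 19, "hpc": 1},
    "map-ont": {"k": 15, "hpc": 0},
}


def build_commands(preset, ref, reads, kmer_file, out_paf, min_freq=1024):
    """Return the Step 1 and Step 2 shell commands for one preset."""
    params = PRESET_PARAMS[preset]
    step1 = [
        "computeHighFreqKmers",
        str(params["k"]),
        str(params["hpc"]),
        str(min_freq),
        ref,
        kmer_file,
    ]
    step2 = ["winnowmap", "-W", kmer_file, "-cx", preset, ref, reads]
    # The output redirection is appended as text because it is shell
    # syntax rather than a program argument.
    return shlex.join(step1), shlex.join(step2) + " > " + shlex.quote(out_paf)


if __name__ == "__main__":
    for command in build_commands(
        "map-ont", "ref.fa", "ont.fq.gz", "bad_k15_mers.txt", "output.paf"
    ):
        print(command)
```

The script only prints the commands; running them is left to the user.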
29+
30+
## Benchmarking
31+
32+
Comparing Winnowmap to minimap2, we observed a reduction in the mapping error rate from 0.14% to 0.06% in the recently finished [human X chromosome](https://github.com/nanopore-wgs-consortium/CHM13), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes. Because it avoids masking, Winnowmap maintains a uniform minimizer density.
2933

34+
<p align="center">
35+
<img src="https://1aaaa1f6-a-62cb3a1a-s-sites.googlegroups.com/site/chirgjain/readme-winnowmap-density.jpg?attachauth=ANoY7cost_TsHo3yjf_COK13C-JBDQIio-GCb_hNSAdMQ92aRqISg21pJsg5dMKD5yMalcAugwI5vkqf9Cdu3sVk-xBz-SkRMkuyWAk3vK06_LEF2ay1pNSzCxU6nUNywhTYb5li8moC-YzRMmJZt7r3KFvcI34IbD7rktjXAPn_5Jba86E19uXq2o6zjAEDmsfjrKxqAdbsnPL3bU8L4wHwsH9gyv6170wD7WFJ_8pfFjeWam0v2uY%3D&attredirects=0" width="400px"> <br>
36+
Minimizer sampling density using a human X chromosome as the reference, with the centromere positioned between 58 Mbp and 61 Mbp. The ‘Standard’ method refers to the classic minimizer sampling algorithm from <a href="http://www.cs.toronto.edu/~wayne/research/papers/minimizers.pdf">Roberts et al.</a>, without any masking or modification.
37+
</p>
