Commit 9b03503

Merge pull request #213 from jeromekelleher/vcfparition-final-docs
Vcfparition final docs
2 parents 9eb9d30 + 6e05bd6 commit 9b03503

4 files changed

+86
-3
lines changed

bio2zarr/cli.py

Lines changed: 1 addition & 1 deletion
@@ -517,7 +517,7 @@ def plink2zarr():
 
 @click.command
 @version
-@click.argument("vcf_path", type=click.Path())
+@click.argument("vcf_path", type=click.Path(exists=True, dir_okay=False))
 @verbose
 @click.option(
 "-n",
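
As a rough illustration of what the stricter argument type changes, the sketch below declares the same `click.Path(exists=True, dir_okay=False)` argument on a small stand-in command. The names `show_path` and `exit_code_for` are hypothetical and are not part of bio2zarr; the sketch only assumes that `click` is installed.

```python
import click
from click.testing import CliRunner


@click.command()
@click.argument("vcf_path", type=click.Path(exists=True, dir_okay=False))
def show_path(vcf_path):
    """Hypothetical stand-in command: echo the path once click has validated it."""
    click.echo(vcf_path)


def exit_code_for(path):
    """Invoke the command in-process and return its exit code."""
    return CliRunner().invoke(show_path, [path]).exit_code
```

With this type, click rejects a missing path or a directory during argument parsing and exits with a usage error, so the command body never runs on an unusable input.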

docs/_config.yml

Lines changed: 2 additions & 0 deletions
@@ -29,3 +29,5 @@ sphinx:
 # This is needed to make sure that text is output in single block from
 # bash cells.
 nb_merge_streams: true
+myst_enable_extensions:
+- colon_fence

docs/vcfpartition/cli_ref.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 
 
 ```{eval-rst}
+.. _cmd-vcfpartition:
 .. click:: bio2zarr.cli:vcfpartition
 :prog: vcfpartition
 :nested: full

docs/vcfpartition/overview.md

Lines changed: 82 additions & 2 deletions
@@ -14,20 +14,100 @@ kernelspec:
 ```{code-cell}
 :tags: [remove-cell]
 cp ../../tests/data/vcf/CEUTrio.20.21.gatk3.4.g.bcf* ./
+cp ../../tests/data/vcf/NA12878.prod.chr20snippet.g.vcf.gz* ./
 ```
 
 ## Overview
 
-Partition a given VCF file into (approximately) a give number of regions:
+The {ref}`cmd-vcfpartition` utility outputs a set of region strings
+that partition an indexed VCF/BCF into either an approximate number of
+parts, or into parts of approximately a given size. This is useful
+for parallel processing of large VCF files.
 
+:::{admonition} Why is this in bio2zarr?
+The ``vcfpartition`` program is packaged with bio2zarr because the underlying
+functionality was developed for {ref}`sec-vcf2zarr`, and there is currently
+no easy way to split processing of large VCFs up.
+:::
 
+### Partitioning into a number of parts
+
+Here, we partition a BCF file into three parts using the ``--num-parts/-n``
+argument:
 ```{code-cell}
 vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3
 ```
 
+The output is a tab-delimited stream of region strings and the file path.
+
+:::{tip}
+The file path is included in the output to make it easy to work with
+multiple files at once, and also to simplify shell scripting tasks.
+:::
+
+We can use this, for example, in a shell loop to count the
+number of variants in each partition:
+
+```{code-cell}
+vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 | while read split;
+do
+bcftools view -Hr $split | wc -l
+done
+```
+
+:::{note}
+Note that the number of variants in each partition is quite uneven, which
+is generally true across files of all scales.
+:::
+
+
+Another important point is that there is a granularity limit to the
+partitions:
+```{code-cell}
+vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 30
+```
+
+Here, we asked for 30 partitions, but the underlying indexes provide
+a maximum of 3.
+
+:::{warning}
+Do not assume that the number of partitions you ask for is what you get!
+:::
+
+### Partitioning into a fixed size
+
+It is also possible to partition a VCF file into chunks of approximately
+a given size.
+
+
+```{code-cell}
+ls -lh NA12878.prod.chr20snippet.g.vcf.gz
+```
+
+In this example, we have a 3.8M file, and would like
+to process this in chunks of approximately 500K at a time:
+
+```{code-cell}
+vcfpartition NA12878.prod.chr20snippet.g.vcf.gz -s 500K
+```
+
+:::{tip}
+Suffixes like M, MiB, G, GB, or raw numbers in bytes are all supported.
+:::
+
+We get 8 partitions in this example. Note again that these target sizes
+are quite approximate.
+
+### Parallel example
+
+Here we illustrate using `vcfpartition` to count the variants in each
+partition in parallel using xargs. In this case we use 3 partitions with 3
+processes, but because the number of variants per partition can be quite
+uneven, it is a good idea to partition up work into (say) four times the number
+of cores available for processing.
 
 ```{code-cell}
 vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 \
-| xargs -P 3 -I {} sh -c "bcftools view -Hr {} CEUTrio.20.21.gatk3.4.g.bcf | wc -l"
+| xargs -P 3 -I {} sh -c "bcftools view -Hr {} | wc -l"
 ```
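
For readers who would rather drive the same workflow from Python than from xargs, the following is a minimal sketch under stated assumptions. It assumes each output line has the form `region<TAB>path`, as the documentation in this commit describes, and that `bcftools` is on the `PATH`. The helper names (`parse_partitions`, `count_with_bcftools`, `count_in_parallel`) are hypothetical and are not part of bio2zarr.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_partitions(text):
    """Split each non-empty `region<TAB>path` line into a (region, path) tuple."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            region, path = line.split("\t", 1)
            pairs.append((region, path))
    return pairs


def count_with_bcftools(region, path):
    """Count records in one region, like `bcftools view -Hr REGION PATH | wc -l`."""
    result = subprocess.run(
        ["bcftools", "view", "-H", "-r", region, path],
        check=True,
        capture_output=True,
        text=True,
    )
    return len(result.stdout.splitlines())


def count_in_parallel(partitions, workers=3, count_fn=count_with_bcftools):
    """Apply count_fn to every partition using a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda pair: count_fn(*pair), partitions))
```

Threads are sufficient here because the actual work happens in the `bcftools` subprocesses. The `count_fn` parameter exists only so the sketch can be exercised without `bcftools`, for example by passing a function that returns a fixed number.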
