@@ -14,20 +14,100 @@ kernelspec:
``` {code-cell}
:tags: [remove-cell]
cp ../../tests/data/vcf/CEUTrio.20.21.gatk3.4.g.bcf* ./
+ cp ../../tests/data/vcf/NA12878.prod.chr20snippet.g.vcf.gz* ./
```

## Overview

- Partition a given VCF file into (approximately) a give number of regions:
+ The {ref}`cmd-vcfpartition` utility outputs a set of region strings
+ that partition an indexed VCF/BCF into either an approximate number of
+ parts, or into parts of approximately a given size. This is useful
+ for parallel processing of large VCF files.

+ :::{admonition} Why is this in bio2zarr?
+ The `vcfpartition` program is packaged with bio2zarr because the underlying
+ functionality was developed for {ref}`sec-vcf2zarr`, and there is currently
+ no easy way to split up the processing of large VCFs.
+ :::

+ ### Partitioning into a number of parts
+
+ Here, we partition a BCF file into three parts using the `--num-parts/-n`
+ argument:
``` {code-cell}
vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3
```

+ The output is a tab-delimited stream of region strings and the file path.
+
+ :::{tip}
+ The file path is included in the output to make it easy to work with
+ multiple files at once, and also to simplify shell scripting tasks.
+ :::
+
+ We can use this, for example, in a shell loop to count the
+ number of variants in each partition:
+
+ ``` {code-cell}
+ vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 | while read split;
+ do
+ bcftools view -Hr $split | wc -l
+ done
+ ```
+
+ :::{note}
+ The number of variants in each partition is quite uneven, which
+ is generally the case for files of all sizes.
+ :::
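
The loop above relies on the shell splitting each output line on whitespace. As a minimal sketch of a more explicit alternative, the two tab-separated fields can be read into separate variables. The `show_fields` helper and the sample lines below are hypothetical stand-ins so that the snippet runs without `vcfpartition` installed; in practice the output of `vcfpartition` would be piped in instead.

```shell
# Hypothetical helper: read "region<TAB>path" lines from stdin and
# print each field separately.
show_fields() {
    tab=$(printf '\t')
    while IFS="$tab" read -r region path; do
        printf 'region=%s path=%s\n' "$region" "$path"
    done
}

# Made-up sample lines standing in for real vcfpartition output.
printf '20:1-1000\texample.bcf\n20:1001-\texample.bcf\n' | show_fields
```

Inside the loop, a call such as `bcftools view -Hr "$region" "$path"` could then take the place of the `printf` line. Splitting only on tabs also keeps file paths that contain spaces intact.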
+
+
+ Another important point is that there is a granularity limit to the
+ partitions:
+ ``` {code-cell}
+ vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 30
+ ```
+
+ Here, we asked for 30 partitions, but the underlying indexes provide
+ a maximum of 3.
+
+ :::{warning}
+ Do not assume that the number of partitions you ask for is what you get!
+ :::
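
One way to guard against this in a script is to count the lines that actually come back and compare that with the number requested. The following is a minimal sketch: `count_partitions` is a hypothetical helper, and the three sample lines are made up to stand in for the output of `vcfpartition` with `-n 30`.

```shell
# Hypothetical helper: count the partition lines arriving on stdin.
count_partitions() {
    wc -l | tr -d '[:space:]'
}

requested=30
# Made-up sample of three partition lines standing in for the output of
# vcfpartition run with -n "$requested".
actual=$(printf '20:1-100\tx.bcf\n20:101-200\tx.bcf\n20:201-\tx.bcf\n' | count_partitions)

if [ "$actual" -ne "$requested" ]; then
    echo "requested $requested partitions, got $actual"
fi
```

Downstream steps that size job arrays or worker pools should use the counted value rather than the requested one.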
+
+ ### Partitioning into a fixed size
+
+ It is also possible to partition a VCF file into chunks of approximately
+ a given size.
+
+
+ ``` {code-cell}
+ ls -lh NA12878.prod.chr20snippet.g.vcf.gz
+ ```
+
+ In this example, we have a 3.8M file, and would like
+ to process it in chunks of approximately 500K at a time:
+
+ ``` {code-cell}
+ vcfpartition NA12878.prod.chr20snippet.g.vcf.gz -s 500K
+ ```
+
+ :::{tip}
+ Suffixes like M, MiB, G, GB, or raw numbers in bytes are all supported.
+ :::
+
+ We get 8 partitions in this example. Note again that these target sizes
+ are quite approximate.
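
As a rough sanity check on the count of 8, dividing the file size by the target size and rounding up gives the same figure. This is only a back-of-the-envelope sketch, not a description of how `vcfpartition` chooses its boundaries. The byte values are assumptions (the exact file size is not shown, and "500K" is taken here as 500 × 1024 bytes), and `estimate_parts` is a hypothetical helper.

```shell
# Hypothetical helper: ceiling division of file size by target size.
estimate_parts() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Assumed byte values: "3.8M" taken as roughly 3.8 * 1024 * 1024 bytes
# and "500K" as 500 * 1024 bytes.
estimate_parts 3984589 512000
```

Under these assumptions the estimate is 8. Treating the suffixes as powers of 1000 instead (3800000 and 500000 bytes) also rounds up to 8.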
+
+ ### Parallel example
+
+ Here we illustrate using `vcfpartition` to count the variants in each
+ partition in parallel with `xargs`. In this case we use 3 partitions with 3
+ processes, but because the number of variants per partition can be quite
+ uneven, it is a good idea to split the work into (say) four times as many
+ partitions as there are cores available for processing.

``` {code-cell}
vcfpartition CEUTrio.20.21.gatk3.4.g.bcf -n 3 \
- | xargs -P 3 -I {} sh -c "bcftools view -Hr {} CEUTrio.20.21.gatk3.4.g.bcf | wc -l"
+ | xargs -P 3 -I {} sh -c "bcftools view -Hr {} | wc -l"
```
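
The "four times the number of cores" rule of thumb can be computed rather than hard-coded. The sketch below assumes `getconf _NPROCESSORS_ONLN` is available to report the core count (it is on Linux and macOS) and falls back to 1 otherwise; `suggested_parts` is a hypothetical helper, and the factor of four is simply the suggestion from the text.

```shell
# Hypothetical helper: suggest a partition count for a given core count.
suggested_parts() {
    echo $(( $1 * 4 ))
}

# getconf reports the number of online processors on Linux and macOS;
# fall back to 1 if it is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
num_parts=$(suggested_parts "$cores")
echo "cores=$cores num_parts=$num_parts"
```

The two values could then be passed on as `-n "$num_parts"` to `vcfpartition` and `-P "$cores"` to `xargs`, bearing in mind that the number of partitions returned may be smaller than requested.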