Commit 64b0ca7

Some notes on how to improve encoding performance
Also notes on required validation updates
1 parent 4865be0 commit 64b0ca7

2 files changed (+17 -0 lines changed)

bio2zarr/vcf.py

Lines changed: 13 additions & 0 deletions
@@ -1172,6 +1172,19 @@ def create_array(self, variable):
         a.attrs["_ARRAY_DIMENSIONS"] = variable.dimensions

     def encode_column(self, pcvcf, column, encoder_threads=4):
+        # TODO we're doing this the wrong way at the moment, overcomplicating
+        # things by having the ThreadedZarrEncoder. It would be simpler if
+        # we split the columns into vertical chunks, and just pushed a bunch
+        # of futures for encoding start:end slices of each column. The
+        # complicating factor here is that we need to get these slices
+        # out of the pcvcf, which takes a little bit of doing (but fine,
+        # because we know the number of records in each partition).
+        # An annoying factor then is how to update the progress meter
+        # because the "bytes read" approach becomes problematic
+        # when we might access the same chunk several times.
+        # Would perhaps be better to call sys.getsizeof() on the stored
+        # value each time.
+
         source_col = pcvcf.columns[column.vcf_field]
         array = self.root[column.name]
         ba = core.BufferedArray(array)
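
The approach outlined in the TODO above could look roughly like the sketch below. It is an illustration only, not bio2zarr's implementation: plain Python lists stand in for the pcvcf partitions and for the Zarr array, and the names `partition_slices`, `read_slice`, `encode_slice` and `encode_column_chunked` are hypothetical. It shows the two ideas from the comment: mapping a global start:end range onto partitions using the known record counts, and measuring progress with sys.getsizeof() on the stored values rather than on bytes read.

```python
import bisect
import concurrent.futures as cf
import itertools
import sys


def partition_slices(partition_sizes, start, end):
    """Map a global start:end record range to (partition, local_start, local_end)."""
    # offsets[p] is the global index of the first record in partition p.
    offsets = [0] + list(itertools.accumulate(partition_sizes))
    first = bisect.bisect_right(offsets, start) - 1
    pieces = []
    for p in range(first, len(partition_sizes)):
        if offsets[p] >= end:
            break
        lo = max(start, offsets[p])
        hi = min(end, offsets[p + 1])
        if lo < hi:  # empty partitions contribute nothing
            pieces.append((p, lo - offsets[p], hi - offsets[p]))
    return pieces


def read_slice(partitions, start, end):
    """Collect records start:end from a list of partitions (stand-in for the pcvcf)."""
    sizes = [len(part) for part in partitions]
    values = []
    for p, lo, hi in partition_slices(sizes, start, end):
        values.extend(partitions[p][lo:hi])
    return values


def encode_slice(partitions, dest, start, end):
    """Store records start:end in dest and return the approximate bytes stored."""
    values = read_slice(partitions, start, end)
    dest[start:end] = values
    # Progress is based on what was stored, so reading the same partition
    # chunk from several slices does not inflate the count.
    return sum(sys.getsizeof(v) for v in values)


def encode_column_chunked(partitions, dest, chunk_size, max_workers=4):
    """Encode one column by submitting one future per start:end slice."""
    num_records = sum(len(part) for part in partitions)
    total_bytes = 0
    with cf.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(
                encode_slice, partitions, dest,
                start, min(start + chunk_size, num_records),
            )
            for start in range(0, num_records, chunk_size)
        ]
        for future in cf.as_completed(futures):
            total_bytes += future.result()  # a progress meter would be updated here
    return total_bytes
```

Calling `encode_column_chunked(partitions, dest, chunk_size=4)` with `dest` preallocated to the total number of records fills `dest` in slice order regardless of which future finishes first, because each future writes only its own start:end range.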

validation.py

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@

 from bio2zarr import vcf

+# TODO add support here for split vcfs. Perhaps simplest to take a
+# directory provided as input as indicating this, and then have
+# the original unsplit vs split files in there following some
+# naming conventions.


 @click.command
 @click.argument("vcfs", nargs=-1)
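
One way the directory-based idea in the validation TODO might be handled is sketched below. The commit does not define a naming convention, so the layout used here is purely an assumption for illustration: exactly one unsplit `*.vcf.gz` at the top level of the directory, with the split files in a `split/` subdirectory. The function name `resolve_vcf_inputs` and its parameters are hypothetical.

```python
import pathlib


def resolve_vcf_inputs(path, split_dirname="split", pattern="*.vcf.gz"):
    """Return (unsplit, parts) for a path that is a single VCF or a directory.

    Assumed (hypothetical) layout for a directory:
        <dir>/<name>.vcf.gz        the original unsplit file
        <dir>/split/*.vcf.gz       the split files
    """
    path = pathlib.Path(path)
    if not path.is_dir():
        # A plain file is treated as an unsplit VCF with itself as the only part.
        return path, [path]
    unsplit = sorted(p for p in path.glob(pattern) if p.is_file())
    if len(unsplit) != 1:
        raise ValueError(
            f"expected exactly one unsplit VCF in {path}, found {len(unsplit)}"
        )
    parts = sorted((path / split_dirname).glob(pattern))
    if not parts:
        raise ValueError(f"no split VCFs found in {path / split_dirname}")
    return unsplit[0], parts
```

Sorting the parts gives a deterministic order, which matters if the split files are expected to be concatenated in sequence; a real convention would need to guarantee that lexical order matches genomic order.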
