-
Notifications
You must be signed in to change notification settings - Fork 18
Open
Description
I wrote a function that reads the GT field of a vcf file into a numeric matrix where the heavy lifting is done by GeneticVariation, then uses the matrix for analyses. Upon testing, I noticed my import routine is ~4x slower than this other software, which kind of killed the performance gains in the analyses portion.
After some experiments, I came across the following benchmark on this test file
target.chr18.typedOnly.maf0.1.masked.vcf.gz:
using CodecZlib, GeneticVariation
# reads file line by line with function in Base
function read_linebyline(vcffile)
io = GzipDecompressorStream(open(vcffile))
s = 0
for line in eachline(io)
s += 1
end
close(io)
s
end
# uses "for record in reader"
function loop_vcf(vcffile)
reader = VCF.Reader(GzipDecompressorStream(open(vcffile, "r")))
s = 0
for record in reader
s += 1 # here I have routines to process the record into numeric data
end
close(reader)
return s
end
# uses read! and eof
function loop_vcf2(reffile)
reader = VCF.Reader(GzipDecompressorStream(open(reffile, "r")))
record = VCF.Record()
s = 0
while !eof(reader)
read!(reader, record)
s += 1 # here I have routines to process the record into numeric data
end
close(reader)
return s
end
# timings after compilation:
tgtfile = "target.chr18.typedOnly.maf0.1.masked.vcf.gz"
@time read_linebyline(tgtfile) # 0.365078 seconds (524.71 k allocations: 176.156 MiB, 7.50% gc time)
@time loop_vcf(tgtfile) # 4.711456 seconds (54.36 M allocations: 5.589 GiB, 10.05% gc time)
@time loop_vcf2(tgtfile) # 2.028391 seconds (34.67 M allocations: 2.583 GiB, 13.78% gc time)Using the loop_vcf2 approach, the read! function still takes ~80% of the time in my data import code.
My question is:
- Is this expected performance?
- Is there a way to modify
loop_vcforloop_vcf2so that I can read through them faster?
Metadata
Metadata
Assignees
Labels
No labels