Skip to content

Looping over vcf file 10x slower than reading line by line #27

@biona001

Description

@biona001

I wrote a function that reads the GT field of a vcf file into a numeric matrix where the heavy lifting is done by GeneticVariation, then uses the matrix for analyses. Upon testing, I noticed my import routine is ~4x slower than this other software, which kind of killed the performance gains in the analyses portion.

After some experiments, I came across the following benchmark on this test file
target.chr18.typedOnly.maf0.1.masked.vcf.gz:

using CodecZlib, GeneticVariation

# reads file line by line with function in Base
function read_linebyline(vcffile)
    io = GzipDecompressorStream(open(vcffile))
    s = 0
    for line in eachline(io)
        s += 1
    end
    close(io)
    s
end

# uses "for record in reader"
function loop_vcf(vcffile)
    reader = VCF.Reader(GzipDecompressorStream(open(vcffile, "r")))
    s = 0
    for record in reader
        s += 1  # here I have routines to process the record into numeric data
    end
    close(reader)
    return s
end

# uses read! and eof
function loop_vcf2(reffile)
    reader = VCF.Reader(GzipDecompressorStream(open(reffile, "r")))
    record = VCF.Record()
    s = 0
    while !eof(reader)
        read!(reader, record)
        s += 1  # here I have routines to process the record into numeric data
    end
    close(reader)
    return s
end

# timings after compilation:
tgtfile = "target.chr18.typedOnly.maf0.1.masked.vcf.gz"
@time read_linebyline(tgtfile) # 0.365078 seconds (524.71 k allocations: 176.156 MiB, 7.50% gc time)
@time loop_vcf(tgtfile)        # 4.711456 seconds (54.36 M allocations: 5.589 GiB, 10.05% gc time)
@time loop_vcf2(tgtfile)       # 2.028391 seconds (34.67 M allocations: 2.583 GiB, 13.78% gc time)

Using the loop_vcf2 approach, the read! function still takes ~80% of the time in my data import code.

My question is:

  1. Is this expected performance?
  2. Is there a way to modify loop_vcf or loop_vcf2 so that I can read through them faster?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions