Looping over a VCF file seems to incur huge memory #26

@biona001

I'm writing a routine to import a VCF file as a numeric matrix, but the memory usage is much larger than I expected.

As a minimal working example, consider the code below, which loops over a VCF file:

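using GeneticVariation

# Visit every genotype entry of every record in target.vcf and count them.
function loop_vcf()
    reader = VCF.Reader(open("target.vcf", "r"))
    s = 0
    # Nested iteration: outer loop over records, inner loop over record.genotype
    for record in reader, geno in record.genotype
        s += 1
    end
    close(reader)
    return s
end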

On a test dataset (target.vcf.gz, which must be decompressed first) with 3000 records and 100 samples, I get the following benchmark:

using BenchmarkTools
@benchmark loop_vcf()
BenchmarkTools.Trial:
  memory estimate:  98.64 MiB
  allocs estimate:  941005
  --------------
  minimum time:     62.249 ms (5.75% GC)
  median time:      63.186 ms (5.99% GC)
  mean time:        63.835 ms (6.75% GC)
  maximum time:     79.381 ms (5.22% GC)
  --------------
  samples:          79
  evals/sample:     1

Why am I getting such a large memory requirement? My data file target.vcf is only 1.3 MB on disk, so this memory usage seems highly suspicious.
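
To put these numbers in perspective, the benchmark totals can be broken down per record and per genotype entry. The sketch below is plain arithmetic on the figures reported in the benchmark output. It assumes that every record holds exactly 100 genotype entries (one per sample), and it relies on the fact that the BenchmarkTools memory estimate counts bytes allocated during one evaluation, not peak resident memory. The variable names are illustrative only.

```julia
# Figures copied from the benchmark output and the dataset description.
total_bytes  = 98.64 * 1024^2   # "memory estimate: 98.64 MiB", converted to bytes
total_allocs = 941_005          # "allocs estimate: 941005"
n_records    = 3000
n_samples    = 100              # assumption: one genotype entry per sample per record

# Number of inner-loop iterations under the assumption above.
n_genotypes = n_records * n_samples

allocs_per_record   = total_allocs / n_records
allocs_per_genotype = total_allocs / n_genotypes
bytes_per_record    = total_bytes / n_records
bytes_per_genotype  = total_bytes / n_genotypes

println("allocations per record:   ", round(allocs_per_record, digits = 1))
println("allocations per genotype: ", round(allocs_per_genotype, digits = 2))
println("bytes per record:         ", round(bytes_per_record, digits = 1))
println("bytes per genotype:       ", round(bytes_per_genotype, digits = 1))
```

This works out to roughly 3 allocations per genotype entry, which may indicate that the total grows with the number of parsed fields rather than with the file size. Because the estimate is cumulative, it does not by itself show that 98.64 MiB is held in memory at any one time.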
