Closed
Labels
bug (Something isn't working)
Description
After running the conversion on 1000 Genomes chr2 data, I'm not seeing any reduction in size for LPL compared with PL. Here's the ICF:
$ vcf2zarr inspect 1kg_chr2_lpl.icf/ | grep PL
FORMAT/PL Integer 6906 430.58 GiB 60.6 GiB 28 0 3.2e+05
FORMAT/LPL Integer 6906 430.58 GiB 60.57 GiB 28 -1 3.2e+05
$ vcf2zarr inspect 1kg_chr2_lpl.icf/ | grep 'LAA'
FORMAT/LAA Integer 4534 282.07 GiB 315.18 MiB 6 -2 6
And on the VCZ (for a small number of variants, using --max-variant-chunks):
$ vcf2zarr inspect 1kg_chr2_lpl.vcz/ | grep PL
/call_LPL int32 8.27 MiB 342.01 MiB 41 40 8.55 MiB 211.78 KiB (1000, 3202, 28) (100, 1000, 28) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
/call_PL int32 8.27 MiB 342.01 MiB 41 40 8.55 MiB 211.78 KiB (1000, 3202, 28) (100, 1000, 28) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
$ vcf2zarr inspect 1kg_chr2_lpl.vcz/ | grep LAA
/call_LAA int8 102.69 KiB 18.32 MiB 180 40 469.04 KiB 2.57 KiB (1000, 3202, 6) (100, 1000, 6) Blosc(cname='zstd', clevel=7, shuffle=NOSHUFFLE, blocksize=0) None
So, there's no reduction in the maximum dimension (28 for both PL and LPL), and the storage sizes are essentially identical.
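As a quick sanity check on the dimensions above, here is a minimal sketch of the arithmetic. It assumes diploid samples and that LAA lists only the local ALT alleles (with REF implicit), as in the VCF 4.5 local-alleles convention; the function name is hypothetical. Under those assumptions, a maximum LAA dimension of 6 would by itself give a maximum LPL dimension of 28, so the unchanged maximum dimension may not be a bug on its own. It would not explain the near-identical storage sizes.

```python
def diploid_genotype_count(num_alleles: int) -> int:
    """Number of unphased diploid genotypes for a given allele count.

    This is the length of a PL (or LPL) vector: n * (n + 1) / 2.
    """
    return num_alleles * (num_alleles + 1) // 2


# Assumption: LAA holds up to 6 local ALT alleles (the max dimension
# reported by `vcf2zarr inspect` above), and REF is always local.
max_local_alts = 6
max_local_alleles = max_local_alts + 1

print(diploid_genotype_count(max_local_alleles))  # 28
```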
Do you have any idea what might be going on here, @Will-Tyler?
I think it would be really worthwhile to get some truth data for LPL that we could compare against. It does seem that getting Hail running is the only way to do this, so it's probably worth the effort.
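Until Hail output is available, a small pure-Python reference could be used to spot-check individual calls. This is only a sketch, not the vcf2zarr or Hail implementation. It assumes diploid samples, the standard VCF genotype ordering for PL, and that LAA contains strictly increasing 1-based ALT indices with no fill or missing values. The names `genotype_index` and `local_pl` are hypothetical.

```python
def genotype_index(j: int, k: int) -> int:
    """Position of unphased diploid genotype j/k in the VCF PL ordering."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def local_pl(pl: list[int], laa: list[int]) -> list[int]:
    """Subset a full PL vector to the genotypes over the local alleles.

    Assumption: REF (allele 0) is always local, and `laa` lists the
    local ALT alleles as 1-based indices in increasing order.
    """
    local = [0] + list(laa)
    out = []
    # Same ordering scheme as PL, applied to local allele positions.
    for b in range(len(local)):
        for a in range(b + 1):
            out.append(pl[genotype_index(local[a], local[b])])
    return out


# Three alleles (REF, ALT1, ALT2): PL order is 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
pl = [10, 20, 30, 0, 40, 50]
print(local_pl(pl, [2]))     # [10, 0, 50]
print(local_pl(pl, [1, 2]))  # [10, 20, 30, 0, 40, 50]
```

When every ALT allele is local, the result equals the input PL, so identical PL and LPL arrays are only expected if every sample carries every allele.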