-
Notifications
You must be signed in to change notification settings - Fork 21
Description
It looks like genmod models is silently dropping variants that share CHROM, POS, REF and ALT.
For example when I run models om the following SV VCF:
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=CN,Number=1,Type=Float,Description="Copy number">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=A,Type=Integer,Description="Length of base pair identical micro-homology at breakpoints">
##INFO=<ID=HOMSEQ,Number=A,Type=String,Description="Sequence of base pair identical micro-homology at breakpoints">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=Annotation,Number=.,Type=String,Description="Annotates what feature(s) this variant belongs to.">>
##contig=<ID=chr16,length=90338345>
##ALT=<ID=CNV,Description="Copy number variable region">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DUP:TANDEM,Description="Tandem Duplication">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=INV,Description="Inversion">
##source=MergeVCF
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT proband father mother
chr16 1 . G <DEL> 487 PASS IMPRECISE;SVTYPE=DEL;END=1001;SVLEN=1000;Annotation=PDXDC1 GT:CN:CNQ 0/1:1:487 ./.:.:. ./.:.:.
chr16 1 . G <DEL> 69 PASS IMPRECISE;SVTYPE=DEL;END=101;SVLEN=100;Annotation=PDXDC1 GT:CN:CNQ ./.:.:. ./.:.:. 0/1:1:69
I'll get a VCF containing only one of the two variants:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT proband father mother
chr16 1 . G <DEL> 69 PASS IMPRECISE;SVTYPE=DEL;END=101;SVLEN=100;Annotation=PDXDC1;GeneticModels=foo:AR_hom|AR_hom_dn GT:CN:CNQ ./.:.:. ./.:.:. 0/1:1:69
I've narrowed it down to the variant id built here:
genmod/genmod/vcf_tools/parse_variant.py
Lines 47 to 67 in 8ecff5b
| def get_variant_id(variant_dict): | |
| """Build a variant id | |
| The variant id is a string made of CHROM_POS_REF_ALT | |
| The alt field for svs needs some massage to work downstream. | |
| Args: | |
| variant_dict (dict): A variant dictionary | |
| Returns: | |
| variant_id (str) | |
| """ | |
| chrom = variant_dict["CHROM"] | |
| pos = variant_dict["POS"] | |
| ref = variant_dict["REF"] | |
| # There are several symbols in structural variant calls that make | |
| # things hard. We will strip those symbols | |
| bad_chars = "<>[]:" | |
| alt = "".join(c for c in variant_dict["ALT"] if c not in bad_chars) | |
| return "_".join([chrom, pos, ref, alt]) |
Later being used as a key when adding parsed variants to a batch dict here:
genmod/genmod/utils/get_batches.py
Line 156 in 8ecff5b
| batch[variant_id] = variant |
Since the two calls share the same variant id, the second variant will overwrite the first in that dict.
Also, the VCF header gets written a second time to the end of the file. Not sure what that is about.
This was detected in genmod 3.10.1