Skip to content

genmod models silently drops SVs that share the same CHROM, POS, REF, ALT #190

@alkc

Description

@alkc

It looks like genmod models is silently dropping variants that share CHROM, POS, REF and ALT.

For example when I run models om the following SV VCF:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=CN,Number=1,Type=Float,Description="Copy number">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=A,Type=Integer,Description="Length of base pair identical micro-homology at breakpoints">
##INFO=<ID=HOMSEQ,Number=A,Type=String,Description="Sequence of base pair identical micro-homology at breakpoints">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVLEN,Number=A,Type=Integer,Description="Length of structural variant">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=Annotation,Number=.,Type=String,Description="Annotates what feature(s) this variant belongs to.">>
##contig=<ID=chr16,length=90338345>
##ALT=<ID=CNV,Description="Copy number variable region">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DUP:TANDEM,Description="Tandem Duplication">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=INV,Description="Inversion">
##source=MergeVCF
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  proband father  mother
chr16   1       .       G       <DEL>   487     PASS    IMPRECISE;SVTYPE=DEL;END=1001;SVLEN=1000;Annotation=PDXDC1      GT:CN:CNQ       0/1:1:487       ./.:.:. ./.:.:.
chr16   1       .       G       <DEL>   69      PASS    IMPRECISE;SVTYPE=DEL;END=101;SVLEN=100;Annotation=PDXDC1        GT:CN:CNQ       ./.:.:. ./.:.:. 0/1:1:69

I'll get a VCF containing only one of the two variants:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  proband father  mother
chr16   1       .       G       <DEL>   69      PASS    IMPRECISE;SVTYPE=DEL;END=101;SVLEN=100;Annotation=PDXDC1;GeneticModels=foo:AR_hom|AR_hom_dn     GT:CN:CNQ       ./.:.:. ./.:.:. 0/1:1:69

I've narrowed it down to the variant id built here:

def get_variant_id(variant_dict):
"""Build a variant id
The variant id is a string made of CHROM_POS_REF_ALT
The alt field for svs needs some massage to work downstream.
Args:
variant_dict (dict): A variant dictionary
Returns:
variant_id (str)
"""
chrom = variant_dict["CHROM"]
pos = variant_dict["POS"]
ref = variant_dict["REF"]
# There are several symbols in structural variant calls that make
# things hard. We will strip those symbols
bad_chars = "<>[]:"
alt = "".join(c for c in variant_dict["ALT"] if c not in bad_chars)
return "_".join([chrom, pos, ref, alt])

Later being used as a key when adding parsed variants to a batch dict here:

batch[variant_id] = variant

Since the two calls share the same variant id, the second variant will overwrite the first in that dict.

Also, the VCF header gets written a second time to the end of the file. Not sure what that is about.

This was detected in genmod 3.10.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions