norm -m + --atomize inconsistent representation of complex variants #2482

@mi3112

I am working with a WES tumor-only experiment.
Variant calling was performed using three different tools:

  • Mpileup
  • Mutect2
  • Freebayes

(in the pictures I kept the same order)

Image

All three callers detect the same complex variant, but each represents it differently in the original VCF.

To normalize the variants, I used the following command:

bcftools norm --atomize -f ref.fasta -o output.vcf input.vcf

Before normalization, the representations were:

Freebayes:

chr17	7675081	.	GGGGCAGC	GGA

Mutect2:

chr17	7675082	.	GGGC	G
chr17	7675086	.	AGC	A

Mpileup:

chr17	7675081	.	GGGGCAG	GG	
chr17	7675088	.	C	A	

After normalization, the result was:

Freebayes:

chr17	7675083	.	GGCAGC	A

Mutect2:

chr17	7675082	.	GGGC	G
chr17	7675086	.	AGC	A

Mpileup:

chr17	7675081	.	GGGGCA	G	
chr17	7675088	.	C	A
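
The differences between the "before" and "after" records are consistent with shared trailing and leading bases being trimmed from each record independently. The following is a simplified Python sketch of that trimming step only, not bcftools' actual implementation: the function name `trim_alleles` is hypothetical, and no left-alignment against the reference is performed.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    Simplified illustration: shared suffix bases are removed first, then
    shared prefix bases (advancing POS). No reference lookup is done.
    """
    # Remove identical trailing bases.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Remove identical leading bases and shift the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Records as reported by each caller before normalization.
before = {
    "Freebayes": [(7675081, "GGGGCAGC", "GGA")],
    "Mutect2": [(7675082, "GGGC", "G"), (7675086, "AGC", "A")],
    "Mpileup": [(7675081, "GGGGCAG", "GG"), (7675088, "C", "A")],
}

trimmed = {
    caller: [trim_alleles(*record) for record in records]
    for caller, records in before.items()
}

for caller, records in trimmed.items():
    print(caller, records)
```

Each record is trimmed on its own, so the breakpoints chosen by each caller are carried through: nothing in this step merges or re-splits records across callers.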

Even after applying bcftools norm --atomize, the same biological variant is still represented differently across callers:

  • Different POS
  • Different decomposition boundaries
  • Different REF/ALT lengths

I was expecting --atomize to produce a canonical, consistent representation across callers (same coordinates and minimal atomic variants), but this did not happen.

Is this behavior expected?
Does --atomize intentionally preserve caller-specific breakpoints or representations?

Is there a recommended way to obtain an identical representation for complex variants across different callers, so that intersections/overlaps between VCFs can be computed reliably?
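
To show that the three call sets describe the same change, one possible check is to apply each caller's records to the reference window and compare the resulting sequences. The Python sketch below does this under stated assumptions: the reference window `GGGGCAGC` at chr17:7675081-7675088 is inferred from the REF alleles listed above rather than read from ref.fasta, all records of a caller are assumed to lie on the same haplotype, and the helper `apply_variants` is hypothetical.

```python
WINDOW_START = 7675081   # first position covered by any record above
REF_WINDOW = "GGGGCAGC"  # assumed: inferred from the REF alleles, not from ref.fasta


def apply_variants(ref_window, window_start, variants):
    """Return the sequence obtained by applying non-overlapping variants."""
    pieces = []
    cursor = 0  # index into ref_window of the next unconsumed base
    for pos, ref, alt in sorted(variants):
        offset = pos - window_start
        if offset < cursor:
            raise ValueError(f"overlapping variant at {pos}")
        if ref_window[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF mismatch at {pos}")
        pieces.append(ref_window[cursor:offset])  # unchanged reference bases
        pieces.append(alt)                        # substituted allele
        cursor = offset + len(ref)
    pieces.append(ref_window[cursor:])            # remaining reference bases
    return "".join(pieces)


# Records after normalization, as listed above.
normalized = {
    "Freebayes": [(7675083, "GGCAGC", "A")],
    "Mutect2": [(7675082, "GGGC", "G"), (7675086, "AGC", "A")],
    "Mpileup": [(7675081, "GGGGCA", "G"), (7675088, "C", "A")],
}

haplotypes = {
    caller: apply_variants(REF_WINDOW, WINDOW_START, records)
    for caller, records in normalized.items()
}

for caller, hap in haplotypes.items():
    print(caller, hap)
```

Under these assumptions all three callers yield the same alternate sequence, `GGA`, even though their POS, REF and ALT fields differ. A record-by-record intersection would therefore miss the match, while a comparison at the haplotype level would find it.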
