Reduced recall due to low MAPQ after surjecting long-read graph alignments to GRCh38 #998

@matthewglasenapp

Hello,

I'm working with PacBio target capture data from the human MHC (6p21). DeepVariant performs very well when these reads are mapped to the linear GRCh38 with minimap2 (SNP F1 0.9988).

However, I'd prefer to map to the HPRC graph with vg giraffe and surject the alignments to GRCh38, roughly following your "Using Graph Genomes" case study. When I tested this approach, both SNP and INDEL recall (measured with GIAB samples) dropped to about 60% due to the very low MAPQ of the vg-mapped reads. I tried running DeepVariant with the following parameters suggested in the tutorial, but it did not help:
min_mapping_quality=0,keep_legacy_allele_counter_behavior=true,normalize_reads=true

The MAPQ differences between minimap2 and vg giraffe are quite striking. For example, nearly all the minimap2-mapped reads in HLA-A exon 2 had a MAPQ of 60, while the average MAPQ of the vg giraffe-mapped reads was 22. The graph-mapped reads look very clean in IGV, and the coverage depth is very high. DeepVariant detects the ALT allele at all the expected positions, but often reports the site as a REFCALL because of the low MAPQ.
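
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of computing the mean MAPQ of primary alignments overlapping a region. It assumes the BAM has been converted to SAM text (for example with `samtools view -h`) and read in as lines; the function names and the example coordinates are hypothetical and are not the actual HLA-A exon 2 interval.

```python
import re
from statistics import mean

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_end(pos, cigar):
    """Return the 1-based inclusive end coordinate of an alignment."""
    span = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos + span - 1


def mean_mapq(sam_lines, chrom, start, end):
    """Mean MAPQ of primary alignments overlapping chrom:start-end (1-based, inclusive)."""
    values = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        mapq, cigar = int(fields[4]), fields[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x904:
            continue
        if rname != chrom or cigar == "*":
            continue
        if pos <= end and reference_end(pos, cigar) >= start:
            values.append(mapq)
    return mean(values) if values else None
```

Running this once on the minimap2 SAM and once on the surjected vg giraffe SAM with the same interval gives the two averages to compare.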

As a quick workaround, I replaced the MAPQ of each vg-mapped read with the MAPQ that the same read received when mapped to the linear GRCh38 with minimap2. This restored nearly all of the lost recall, but the result still underperformed relative to using the minimap2-mapped reads directly as input to DeepVariant.
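
The workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact script I ran: it assumes both alignments are available as SAM text lines, that read names are unique per read (single-end long reads), and that the donor MAPQ comes from the primary minimap2 alignment. The function names are hypothetical.

```python
def primary_mapq_by_read(sam_lines):
    """Map read name -> MAPQ of its primary, mapped alignment."""
    table = {}
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x904:
            continue
        table[fields[0]] = int(fields[4])
    return table


def transfer_mapq(graph_sam_lines, linear_sam_lines):
    """Yield graph SAM lines with MAPQ (column 5) taken from the linear alignment."""
    donor = primary_mapq_by_read(linear_sam_lines)
    for line in graph_sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            yield line  # headers pass through unchanged
            continue
        fields = line.split("\t")
        # Reads without a primary linear alignment keep their original MAPQ.
        if fields[0] in donor:
            fields[4] = str(donor[fields[0]])
        yield "\t".join(fields)
```

Note that in this sketch every record of a read in the graph SAM, including any supplementary ones, receives the primary minimap2 MAPQ. The output can be written back to a file and converted to BAM before running DeepVariant.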

I'm hoping to get your recommendations on how to proceed with vg giraffe + DeepVariant for long reads. I'd also love to see a pangenome-aware DeepVariant model trained on PacBio/ONT data. I'm happy to provide fastq/BAM/vcf if that would be helpful. Thanks!
