Hello,
I'm working with PacBio target capture data from the human MHC (6p21). DeepVariant performs very well when these reads are mapped to the linear GRCh38 with minimap2 (SNP F1 0.9988).
However, I'd prefer to map to the HPRC graph with vg giraffe and surject the alignments to GRCh38, roughly following your Using Graph Genomes case study. When I tested this approach, both SNP and INDEL recall (measured with GIAB samples) were very low (~60%) due to the very low MAPQ of the vg-mapped reads. I tried running DeepVariant with the following parameters suggested in the tutorial, but it did not help.
min_mapping_quality=0,keep_legacy_allele_counter_behavior=true,normalize_reads=true
The MAPQ differences between minimap2 and vg giraffe are quite striking. For example, nearly all the minimap2-mapped reads in HLA-A exon 2 had a MAPQ of 60, while the average MAPQ of vg giraffe-mapped reads was 22. The graph-mapped reads look very clean in IGV, and the coverage depth is very high. DeepVariant detects the ALT allele at all the expected positions, but often produces a REFCALL due to the low MAPQ.
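For anyone who wants to reproduce the MAPQ comparison, here is a minimal sketch of how the per-region average could be computed from SAM text (for example, the output of `samtools view -h`). The function names `ref_span` and `mean_mapq` are hypothetical helpers written for this illustration, not part of any tool mentioned above. Skipping unmapped, secondary, and supplementary records is an assumption about what should count.

```python
import re
from statistics import mean

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def ref_span(pos, cigar):
    """Return the 0-based, half-open reference interval covered by an alignment."""
    start = pos - 1  # SAM POS is 1-based
    length = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return start, start + length


def mean_mapq(sam_lines, chrom, start, end):
    """Mean MAPQ of mapped primary records overlapping chrom:[start, end).

    Assumption: unmapped (0x4), secondary (0x100) and supplementary (0x800)
    records are ignored. Returns None if no record overlaps the region.
    """
    mapqs = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, rname, pos, mapq, cigar = int(f[1]), f[2], int(f[3]), int(f[4]), f[5]
        if flag & 0x904 or rname != chrom or cigar == "*":
            continue
        a_start, a_end = ref_span(pos, cigar)
        if a_start < end and a_end > start:
            mapqs.append(mapq)
    return mean(mapqs) if mapqs else None
```

Running this once on the minimap2 alignments and once on the surjected vg giraffe alignments, with the same 0-based region, gives the two averages to compare.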
As a quick workaround, I replaced the MAPQ of each vg-mapped read with the MAPQ that the same read received when mapped to the linear GRCh38 with minimap2. This restored nearly all of the lost recall, but the results still underperformed relative to using the minimap2-mapped reads directly as input to DeepVariant.
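A rough sketch of this kind of MAPQ transfer, operating on SAM text, is shown here. It illustrates the idea and is not the exact script I ran. The helper names `primary_mapqs` and `transfer_mapq` are hypothetical. It assumes single-end long reads with unique read names, and it only touches primary records, leaving unmapped, secondary, and supplementary records unchanged.

```python
def primary_mapqs(sam_lines):
    """Map read name -> MAPQ of its primary alignment (donor, e.g. minimap2)."""
    out = {}
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag = int(f[1])
        if flag & 0x904:  # skip unmapped, secondary, supplementary
            continue
        out[f[0]] = int(f[4])
    return out


def transfer_mapq(target_lines, donor_mapqs):
    """Yield target SAM lines (e.g. surjected vg giraffe) with donor MAPQs.

    Only primary records whose read name is present in donor_mapqs are
    rewritten; headers and every other record pass through unchanged.
    """
    for line in target_lines:
        if line.startswith("@"):
            yield line
            continue
        f = line.rstrip("\n").split("\t")
        flag = int(f[1])
        if not flag & 0x904 and f[0] in donor_mapqs:
            f[4] = str(donor_mapqs[f[0]])  # column 5 is MAPQ
        yield "\t".join(f) + ("\n" if line.endswith("\n") else "")
```

In practice the input and output would be streamed through `samtools view -h` and `samtools view -b`, so that the rewritten records end up in a BAM that DeepVariant can read.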
I'm hoping to get your recommendations on how to proceed with vg giraffe + DeepVariant for long reads. I'd also love to see a pangenome-aware DeepVariant model trained on PacBio/ONT data. I'm happy to provide fastq/BAM/vcf if that would be helpful. Thanks!