Scope and Principles for Long-Read QC Standard Metrics - Indels #116
justinjj24 started this conversation in Long-read QC Standards
1. Context and Motivation
Purpose
This discussion focused on defining the appropriate scope for Roadmap V2 quality control (QC) metrics in the context of long-read whole-genome sequencing, specifically addressing how insertions and deletions (indels) should be handled and whether structural variation (SV) should be explicitly included. The key question was whether V2 should:
Expand indel metrics to include larger events enabled by long reads, or
Explicitly introduce structural variation (SV) metrics into the QC framework.
The question arose because long-read technologies blur the traditional boundary between short indels and structural variants, especially for insertions and duplications.
2. Differences Between Short-Read and Long-Read Variant Calling
o Historically, QC standards and variant definitions originated from short-read sequencing, where:
Indels are typically short (shorter than read length).
SV-scale insertions and deletions are reported by dedicated SV callers.
o In long-read sequencing, alignments span much larger regions, causing:
Large indels to appear naturally in CIGAR strings.
Insertions, duplications, and other rearrangements to be represented differently depending on aligner and caller behavior.
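To illustrate how large indels surface directly in long-read alignments, here is a minimal sketch that walks a CIGAR string and reports insertion and deletion operations above a size threshold. The function name `large_indels_from_cigar` and the 50-base threshold are assumptions made for illustration; the discussion did not fix a size cutoff.

```python
import re

# One CIGAR operation: a length followed by an op code (SAM op set).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def large_indels_from_cigar(cigar, ref_start, min_len=50):
    """Yield (kind, ref_pos, length) for each I/D op of at least min_len bases.

    ref_start is the 0-based reference position of the first aligned base.
    min_len=50 is an illustrative threshold, not a value agreed in the discussion.
    """
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I" and length >= min_len:
            yield ("INS", ref_pos, length)
        elif op == "D" and length >= min_len:
            yield ("DEL", ref_pos, length)
        # Only ops that consume reference bases advance the reference cursor.
        if op in "MDN=X":
            ref_pos += length


# A hypothetical long-read alignment with one large insertion,
# one small deletion (ignored), and one large deletion.
events = list(large_indels_from_cigar("1000M120I500M3D200M75D300M", ref_start=10_000))
print(events)
```

With a short read, neither the 120-base insertion nor the 75-base deletion would normally fit inside a single alignment, which is why such events historically came from dedicated SV callers.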
Daniel highlighted that:
o Deletions are generally straightforward.
o Insertions are problematic due to long-standing ambiguity in the VCF specification:
The same biological event may be called as an insertion by long-read callers.
The same event may be called as a duplication by short-read or differently configured long-read pipelines.
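The insertion/duplication ambiguity can be made concrete with a small sketch: an insertion whose sequence exactly matches the reference bases next to the insertion point yields the same haplotype as a tandem duplication of those bases. The helper `could_be_tandem_dup` is hypothetical, and it only tests exact, unshifted copies; real callers must also handle shifted or imperfect copies.

```python
def could_be_tandem_dup(reference, pos, inserted):
    """Return True if `inserted` exactly copies the reference bases adjacent to pos.

    pos is a 0-based index; the insertion lands between reference[pos - 1]
    and reference[pos]. This is a simplification for illustration only.
    """
    if not inserted:
        return False
    n = len(inserted)
    left = reference[max(0, pos - n):pos]   # bases just before the insertion point
    right = reference[pos:pos + n]          # bases just after the insertion point
    return inserted == left or inserted == right


ref = "ACGTTGCAGGATCC"

# Described as an insertion: "TGCA" inserted at position 8.
as_insertion = ref[:8] + "TGCA" + ref[8:]
# Described as a tandem duplication: ref[4:8] repeated in place.
as_duplication = ref[:8] + ref[4:8] + ref[8:]

print(as_insertion == as_duplication)          # same haplotype, two descriptions
print(could_be_tandem_dup(ref, 8, "TGCA"))     # adjacent copy exists
print(could_be_tandem_dup(ref, 8, "AAAA"))     # novel sequence, no adjacent copy
```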
3. Ambiguity in Definitions (Insertion vs Duplication)
o There is no universally clean or unambiguous definition separating:
Insertions
Tandem duplications
o This ambiguity is intrinsic to:
Alignment strategies
Variant caller design
File formats (VCF vs newer representations)
Nicolas emphasized that:
o Absolute clarity is unlikely to be achieved.
o The goal of QC standards should be to:
Clearly define terms in plain English
Explicitly state what is included and excluded
Reduce ambiguity through scope and definition, rather than attempting to solve ontology debates.
4. Key Scoping Decision for Roadmap V2
A central decision emerged from the discussion:
Roadmap V2 will NOT explicitly include structural variation (SV) QC metrics
Instead:
o V2 will focus on VCF-based indels only, even if they are larger in size due to long reads.
o These indels are treated as:
“Large indels enabled by long-read sequencing”
Not as fully defined structural variants
This avoids:
o Introducing SV ontology, breakpoint definitions, and cross-caller harmonization.
o The need to also address copy number variation (CNV), which is conceptually inseparable from SVs.
Daniel stressed that:
o Once SVs are introduced, QC must also account for:
CNVs
Breakpoint-based rearrangements
Different caller ecosystems (SV, CNV, STR, VNTR)
o This would significantly expand scope and complexity beyond what is appropriate for V2.
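One way to read the scoping decision above is sketched below: count only sequence-resolved indels from VCF REF/ALT fields, whatever their size, and skip records that belong to the SV ecosystem (symbolic alleles such as `<DEL>` or `<DUP>`, and breakends). The function names, the category labels, and the 50-base boundary between "small" and "large" are assumptions for illustration, not definitions agreed by the group.

```python
from collections import Counter


def indel_length(ref, alt):
    """Return the signed length change of a sequence-resolved indel, else None.

    Positive means insertion, negative means deletion. Symbolic alleles,
    breakends, '*' and '.' alleles, and same-length alleles return None.
    """
    if any(ch in alt for ch in "<>[].*"):
        return None
    delta = len(alt) - len(ref)
    return delta if delta != 0 else None


def summarize_indels(vcf_lines, large_min=50):
    """Count small/large insertions and deletions in VCF data lines.

    large_min=50 is an illustrative boundary, not a value from the discussion.
    """
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            delta = indel_length(ref, alt)
            if delta is None:
                continue
            kind = "ins" if delta > 0 else "del"
            size = "large" if abs(delta) >= large_min else "small"
            counts[f"{size}_{kind}"] += 1
    return counts


# Hypothetical records: small insertion, large deletion, symbolic DUP (skipped),
# SNV (skipped), and a multi-allelic site with one large and one small insertion.
example_lines = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tAT\t.\tPASS\t.",
    "chr1\t200\t.\tA" + "C" * 60 + "\tA\t.\tPASS\t.",
    "chr1\t300\t.\tG\t<DUP>\t.\tPASS\tSVTYPE=DUP",
    "chr1\t400\t.\tA\tG\t.\tPASS\t.",
    "chr1\t500\t.\tT\tT" + "G" * 80 + ",TG\t.\tPASS\t.",
]
counts = summarize_indels(example_lines)
print(dict(counts))
```

Keeping the symbolic and breakend records out of the tally is what lets the metric stay an extension of indel QC rather than an SV metric.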
5. File Format and Ecosystem Constraints
o The QC framework is fundamentally downstream of BAM/CRAM and VCF.
o While newer formats (e.g., VRS/JSON-based representations) address many shortcomings in SV representation:
They are impractical for whole-genome QC at scale.
VCF remains the lowest common denominator across all variant callers.
o Therefore, V2 QC metrics remain VCF-based.
6. Importance of Context-Specific Metrics
Daniel highlighted that:
o Different variant types demand different QC perspectives:
SNP/indel callers
SV callers
STR/VNTR callers
CNV callers
o For example:
STR QC focuses on net expansion/contraction, not individual indel events.
Counting indels is meaningless for STR pathogenicity assessment.
o Mixing these contexts in V2 would dilute clarity and purpose.
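The STR example can be sketched numerically: for a repeat locus, the quantity of interest is the net length change, which a raw indel count does not capture. The helper `str_net_change` is hypothetical and assumes all the listed indels sit on the same haplotype within one STR locus.

```python
def str_net_change(indel_deltas, motif_len):
    """Summarize indels inside one STR locus on one haplotype.

    indel_deltas: signed length changes in bases (+ insertion, - deletion).
    motif_len: length of the repeat unit in bases (e.g. 3 for CAG).
    """
    net_bases = sum(indel_deltas)
    return {
        "indel_count": len(indel_deltas),
        "net_bases": net_bases,
        "net_units": net_bases / motif_len,
    }


# Three indel calls in a hypothetical CAG repeat: +3, -3, and +6 bases.
summary = str_net_change([3, -3, 6], motif_len=3)
print(summary)
```

Here three indel events collapse to a net expansion of two repeat units, and it is the net figure, not the count, that matters for STR assessment.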
7. Final Consensus and Outcome
The group converged on the following principles for Roadmap V2:
o ✅ Focus on VCF-based indels only, including the larger indels that long-read data makes detectable
o ❌ Do not introduce explicit SV QC metrics in V2
o 📌 Treat large long-read indels as:
An extension of indel QC
Not as full structural variants
o 🧭 Defer SV, CNV, STR, and complex rearrangements to:
Future roadmap phases
Separate, more specialized QC frameworks
This approach ensures:
Manageable scope for V2
Compatibility with existing pipelines
Clear expectations for users
A solid foundation for future expansion
For Roadmap V2, long-read sequencing QC will extend indel metrics within the VCF framework, while intentionally deferring structural variation to future phases. This balances practical implementation with scientific rigor and ensures clarity for both developers and users.
8. Forward Outlook
Structural variation QC will require a separate, dedicated framework with explicit definitions and broader ecosystem alignment.
Future roadmap versions may address SVs once consensus on representation, ontology, and metric definitions has matured.
Roadmap V2 establishes a stable foundation for long-read QC without overextending scope.