Scope and Principles for Long-Read QC Standard Metrics - Indels #116

justinjj24 · 2026-01-16T02:13:47Z

justinjj24
Jan 16, 2026
Collaborator

1. Context and Motivation

Purpose
This discussion focused on defining the appropriate scope for Roadmap V2 quality control (QC) metrics in the context of long-read whole-genome sequencing, specifically addressing how insertions and deletions (indels) should be handled and whether structural variation (SV) should be explicitly included. The key question was whether V2 should:

Expand indel metrics to include larger events enabled by long reads, or
Explicitly introduce structural variation (SV) metrics into the QC framework.

The question arose because long-read technologies blur the traditional boundary between short indels and structural variants, especially for insertions and duplications.

2. Differences Between Short-Read and Long-Read Variant Calling

o Historically, QC standards and variant definitions originated from short-read sequencing, where:

Indels are typically short (shorter than read length).
SV insertions/deletions are outputs of dedicated SV callers.

o In long-read sequencing, alignments span much larger regions, causing:

Large indels to appear naturally in CIGAR strings.
Insertions, duplications, and other rearrangements to be represented differently depending on aligner and caller behavior.
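
To make the CIGAR point concrete, here is a minimal sketch of how large indels can be read directly off an alignment. It is not part of any agreed tooling. The function name cigar_indels and the 50 bp cutoff in the usage line are illustrative assumptions; the discussion did not fix a size threshold.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_indels(cigar, ref_start, min_len=1):
    """Return (op, ref_pos, length) for each I or D operation of at least min_len bases.

    ref_start is the 0-based leftmost aligned reference position.
    cigar_indels and min_len are hypothetical names used for illustration only.
    """
    ref_pos = ref_start
    events = []
    for length_str, op in CIGAR_OP.findall(cigar):
        length = int(length_str)
        if op in "ID" and length >= min_len:
            events.append((op, ref_pos, length))
        # Only M, D, N, = and X consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return events

# A long read carrying a 250 bp insertion and a 60 bp deletion in one alignment.
print(cigar_indels("100M250I50M1D20M60D10M", 1000, min_len=50))
# → [('I', 1100, 250), ('D', 1171, 60)]
```

With a short read the same 250 bp insertion could not sit inside a single alignment, which is why it would traditionally be reported by a dedicated SV caller instead.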

Daniel highlighted that:

o Deletions are generally straightforward.

o Insertions are problematic due to long-standing ambiguity in the VCF specification:

The same biological event may be called as an insertion by long-read callers.
The same event may be called as a duplication by short-read or differently configured long-read pipelines.

3. Ambiguity in Definitions (Insertion vs Duplication)

o There is no universally clean or unambiguous definition separating:

Insertions
Tandem duplications

o This ambiguity is intrinsic to:

Alignment strategies
Variant caller design
File formats (VCF vs newer representations)
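
The overlap between the two terms can be illustrated with a small sketch: an inserted sequence that exactly copies the reference bases next to the insertion point can equally be described as a tandem duplication. The function name and the exact-match rule are assumptions made for illustration; real callers tolerate mismatches and shifted representations, which is part of why the labels diverge.

```python
def looks_like_tandem_dup(ref, pos, inserted):
    """Return True if `inserted` exactly copies the reference flanking the insertion point.

    pos is the 0-based index in `ref` before which `inserted` is placed.
    Hypothetical helper: exact matching only, no tolerance for mismatches.
    """
    n = len(inserted)
    upstream = ref[max(0, pos - n):pos]
    downstream = ref[pos:pos + n]
    return inserted == upstream or inserted == downstream

ref = "ACGTTGCAAGGC"
# "TGCA" inserted after ref[4:8] == "TGCA": an insertion or a tandem duplication?
print(looks_like_tandem_dup(ref, 8, "TGCA"))  # → True
# A novel sequence at the same site can only be described as an insertion.
print(looks_like_tandem_dup(ref, 8, "GGGG"))  # → False
```

Both descriptions of the first event are defensible, so a QC standard has to state which representation it counts rather than rely on the label a caller emits.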

Nicolas emphasized that:

o Absolute clarity is unlikely to be achieved.

o The goal of QC standards should be to:

Clearly define terms in plain English
Explicitly state what is included and excluded
Reduce ambiguity through scope and definition, rather than attempting to solve ontology debates.

4. Key Scoping Decision for Roadmap V2

A central decision emerged from the discussion:

Roadmap V2 will NOT explicitly include structural variation (SV) QC metrics

Instead:

o V2 will focus on VCF-based indels only, even if they are larger in size due to long reads.

o These indels are treated as:

“Large indels enabled by long-read sequencing”
Not as fully defined structural variants
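
One way to read "VCF-based indels only" is sketched below: count sequence-resolved REF/ALT indels of any size and skip symbolic or breakend alleles, which belong to SV representation. This is a sketch under stated assumptions, not an agreed V2 definition. The size bins, function names, and skip rules are illustrative, and records whose REF and ALT differ in length for other reasons (complex substitutions) would also be counted here.

```python
from collections import Counter

# Illustrative size bins in bp (assumption: the discussion did not fix thresholds).
BINS = [(1, 49, "1-49 bp"), (50, 999, "50-999 bp"), (1000, None, ">=1 kb")]

def signed_indel_length(ref, alt):
    """Return len(alt) - len(ref) for a sequence-resolved indel, else None."""
    if alt.startswith("<") or "[" in alt or "]" in alt or alt in (".", "*"):
        return None  # symbolic allele, breakend, missing value, or spanning deletion
    if len(ref) == len(alt):
        return None  # SNV or MNV, not an indel
    return len(alt) - len(ref)

def tally_indels(vcf_lines):
    """Count insertions and deletions per size bin from VCF text lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        for alt in alts:
            delta = signed_indel_length(ref, alt)
            if delta is None:
                continue
            kind = "INS" if delta > 0 else "DEL"
            size = abs(delta)
            for low, high, label in BINS:
                if size >= low and (high is None or size <= high):
                    counts[(kind, label)] += 1
                    break
    return counts

records = [
    "chr1\t100\t.\tA\tAT\t.\tPASS\t.",
    "chr1\t200\t.\t" + "A" + "C" * 60 + "\tA\t.\tPASS\t.",
    "chr1\t300\t.\tA\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=900",
]
print(tally_indels(records))
```

In this toy input the 1 bp insertion and the 60 bp deletion are counted, while the symbolic <DEL> record is left out, mirroring the decision to keep SV-style calls out of scope.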

This avoids:

o Introducing SV ontology, breakpoint definitions, and cross-caller harmonization.

o The need to also address copy number variation (CNV), which is inseparable from SVs at a conceptual level.

Daniel stressed that:

o Once SVs are introduced, QC must also account for:

CNVs
Breakpoint-based rearrangements
Different caller ecosystems (SV, CNV, STR, VNTR)

o This would significantly expand scope and complexity beyond what is appropriate for V2.

5. File Format and Ecosystem Constraints

o The QC framework is fundamentally downstream of BAM/CRAM and VCF.

o While newer formats (e.g., VRS/JSON-based representations) address many SV shortcomings:

They are impractical for whole-genome QC at scale.
VCF remains the lowest common denominator across all variant callers.

o Therefore:

QC metrics must align with VCF outputs, despite known limitations for SV representation.

6. Importance of Context-Specific Metrics

Daniel highlighted that:

o Different variant types, and the callers that report them, demand different QC perspectives:

SNP/indel callers
SV callers
STR/VNTR callers
CNV callers

o For example:

STR QC focuses on net expansion/contraction, not individual indel events.
Counting indels is meaningless for STR pathogenicity assessment.

o Mixing these contexts in V2 would dilute clarity and purpose.
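
The STR example can be sketched as follows: within a repeat locus, the quantity of interest is the net length change, not how many separate indel records the caller happened to emit. The function name, the tuple layout, and the coordinates are hypothetical and chosen only for illustration.

```python
def net_length_change(indels, locus_start, locus_end):
    """Return (indel_count, net_bp_change) for indels inside [locus_start, locus_end).

    indels is an iterable of (pos, signed_len) pairs, where signed_len is
    positive for insertions and negative for deletions. Hypothetical helper.
    """
    inside = [signed_len for pos, signed_len in indels
              if locus_start <= pos < locus_end]
    return len(inside), sum(inside)

# Three indel records fall in the locus, but two of them cancel out.
calls = [(105, +3), (110, -3), (120, +6), (300, -2)]
print(net_length_change(calls, 100, 200))  # → (3, 6)
```

The count of three says little about the allele; the net change of +6 bp, two extra units of a 3 bp motif, is what an STR-specific metric would report.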

7. Final Consensus and Outcome

The group converged on the following principles for Roadmap V2:

o ✅ Focus on VCF-based indels only, including the larger indels that long-read data makes detectable

o ❌ Do not introduce explicit SV QC metrics in V2

o 📌 Treat large long-read indels as:

An extension of indel QC
Not as full structural variants

o 🧭 Defer SV, CNV, STR, and complex rearrangements to:

Future roadmap phases
Separate, more specialized QC frameworks

This approach ensures:

Manageable scope for V2
Compatibility with existing pipelines
Clear expectations for users
A solid foundation for future expansion

For Roadmap V2, long-read sequencing QC will extend indel metrics within the VCF framework, while intentionally deferring structural variation to future phases. This balances practical implementation with scientific rigor and ensures clarity for both developers and users.

8. Forward Outlook

Structural variation QC will require a separate, dedicated framework with explicit definitions and broader ecosystem alignment.
Future roadmap versions may address SVs once consensus on representation, ontology, and metric definitions has matured.
Roadmap V2 establishes a stable foundation for long-read QC without overextending scope.
