Fix/819 Fix incorrect g->c mapping when CIGAR alignment gaps overlap variant interval #823

Open
andreasprlic wants to merge 6 commits into biocommons:main from andreasprlic:fix/819-cigar-internal-gap-delins

@andreasprlic

Closes #819

Problem

The g_to_c (and g_to_n) mapping in VariantMapper produced incorrect results for two related cases involving CIGAR alignment gaps:

  1. I-segment strictly inside the variant interval: when a genomic delins spans a position where the CIGAR has an I-segment (a genomic base with no transcript equivalent), both variant endpoints fall in ordinary = (match) segments, so pos_c.uncertain is False. The previous code path strand-flipped the alt allele without accounting for the missing transcript base, producing a transcript edit of the wrong length.
  2. I-segment immediately adjacent to the variant boundary: when a variant endpoint lands exactly at the edge of an I-segment, the previous code silently ignored the gap, again producing a wrong edit.

Both cases are manifestations of issue #819 (the "double gap" problem).
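
To make the length mismatch in case 1 concrete, here is a minimal sketch on a toy alignment. It is not the hgvs implementation: the CIGAR string, the 0-based coordinates, and both function names are hypothetical, and it follows this description's convention that an I-segment consumes genomic bases only. D-segments are left out for brevity.

```python
import re


def expand_cigar(cigar):
    """Expand a toy CIGAR such as '3=1I3=' into one op character per genomic base.

    Only '=' and 'I' are handled; both consume one genomic base per column
    under the convention assumed here.
    """
    return "".join(op * int(n) for n, op in re.findall(r"(\d+)([=I])", cigar))


def tx_bases_in_genomic_span(cigar, g_start, g_end):
    """Count transcript bases aligned to a 0-based, inclusive genomic span."""
    ops = expand_cigar(cigar)
    # 'I' columns have no transcript equivalent, so only '=' columns count.
    return ops[g_start:g_end + 1].count("=")


toy_cigar = "3=1I3="          # hypothetical alignment, not NM_001164277.1
g_start, g_end = 2, 4         # span covers '=', 'I', '='
genomic_len = g_end - g_start + 1
tx_len = tx_bases_in_genomic_span(toy_cigar, g_start, g_end)
print(genomic_len, tx_len)
```

Both endpoints of the span sit in = columns, yet the transcript interval is one base shorter than the genomic one, which is the situation a plain strand-flip of the edit does not account for.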

Fix

Added helper methods to VariantMapper:

  • _gap_segments_within_pos_g(mapper, pos_g) — returns all I/D CIGAR segments strictly inside a genomic interval (replaces the earlier _i_segment_offsets_in_pos_g).
  • _variant_has_internal_gap(mapper, var_g) — returns True when a gap segment lies strictly inside the variant interval (both endpoints in = regions).
  • _expand_pos_g_for_adjacent_gap(mapper, var_g) — extends the genomic position to include any I-segment immediately adjacent to the variant boundary.
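
The two interval checks described above can be sketched on a toy CIGAR as follows. This is an illustration under stated assumptions, not the code in this PR: the function names, the tuple layout, and the 0-based inclusive coordinates are all hypothetical, and only I-segments are handled (the real helpers also cover D-segments and work on the mapper's alignment data).

```python
import re
from typing import List, Tuple

# (op, first genomic index, last genomic index), 0-based inclusive; hypothetical layout
Segment = Tuple[str, int, int]


def genomic_segments(cigar: str) -> List[Segment]:
    """Lay toy CIGAR segments ('=' and 'I' only) out along the genomic axis."""
    segments = []
    g = 0
    for n, op in re.findall(r"(\d+)([=I])", cigar):
        length = int(n)
        segments.append((op, g, g + length - 1))
        g += length
    return segments


def i_segments_strictly_inside(cigar: str, start: int, end: int) -> List[Segment]:
    """I-segments lying strictly between the interval endpoints."""
    return [
        seg for seg in genomic_segments(cigar)
        if seg[0] == "I" and start < seg[1] and seg[2] < end
    ]


def expand_for_adjacent_i(cigar: str, start: int, end: int) -> Tuple[int, int]:
    """Widen the interval to take in an I-segment touching either boundary."""
    new_start, new_end = start, end
    for op, seg_start, seg_end in genomic_segments(cigar):
        if op != "I":
            continue
        if seg_end == start - 1:      # I-segment ends just before the interval
            new_start = seg_start
        elif seg_start == end + 1:    # I-segment starts just after the interval
            new_end = seg_end
    return new_start, new_end
```

With the toy CIGAR `3=1I3=`, the interval (2, 4) contains the I-segment strictly inside it, while the interval (0, 2) has none inside but is widened to (0, 3) by the adjacency check.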

The mapping logic in g_to_c and g_to_n now checks for these cases before the strand-flip path and routes them through _get_altered_tx_sequence(), which correctly filters out genomic-only bases when reconstructing the transcript edit.
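
As a rough illustration of what "filtering out genomic-only bases" means for an internal gap, the sketch below drops I-column bases from the genomic reference before applying the strand flip. It assumes a minus-strand transcript and toy sequences; the function names are hypothetical and the real _get_altered_tx_sequence() may work quite differently.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tx_ref_and_alt(g_ref, ops, g_alt, strand):
    """Derive transcript-side ref and alt for a genomic delins.

    g_ref  : genomic reference bases covered by the variant
    ops    : one CIGAR op per genomic base ('=' or 'I'), same length as g_ref
    g_alt  : genomic alt allele
    strand : 1 or -1
    """
    if len(g_ref) != len(ops):
        raise ValueError("g_ref and ops must have the same length")
    # Drop genomic-only bases: they have no transcript equivalent.
    tx_ref = "".join(base for base, op in zip(g_ref, ops) if op != "I")
    tx_alt = g_alt
    if strand == -1:
        tx_ref = reverse_complement(tx_ref)
        tx_alt = reverse_complement(tx_alt)
    return tx_ref, tx_alt


# Toy data: three genomic bases, the middle one in an I-segment.
print(tx_ref_and_alt("ACG", "=I=", "TT", -1))
```

The transcript reference comes out two bases long rather than three, so the resulting delins covers a two-base transcript interval, in line with the shape of the internal-gap test case listed under Tests.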

Tests

Three new test cases in test_hgvs_assemblymapper.py:

  • NC_000011.10:g.119027721_119027726delinsTCACA → NM_001164277.1:c.532G>A — I-segment adjacent to (but outside) the variant interval
  • NC_000011.10:g.119027726_119027728delinsTT → NM_001164277.1:c.526_527delinsAA — I-segment strictly inside the variant interval
  • NC_000002.12:g.73385901_73385903del → NM_015120.4:c.34_36del — variant spanning a D-segment boundary
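
For readers who want to try the three cases outside the test suite, one possible arrangement is a small table-driven check. This is a sketch, not the contents of test_hgvs_assemblymapper.py: check_cases and make_hgvs_g_to_c are hypothetical helpers, the GRCh38/splign settings are an assumption, and the real mapper needs the hgvs package plus access to a UTA database and sequence source.

```python
CASES = [
    # (genomic variant, transcript accession, expected c. variant)
    ("NC_000011.10:g.119027721_119027726delinsTCACA", "NM_001164277.1",
     "NM_001164277.1:c.532G>A"),
    ("NC_000011.10:g.119027726_119027728delinsTT", "NM_001164277.1",
     "NM_001164277.1:c.526_527delinsAA"),
    ("NC_000002.12:g.73385901_73385903del", "NM_015120.4",
     "NM_015120.4:c.34_36del"),
]


def check_cases(g_to_c, cases=CASES):
    """Run each case through g_to_c(hgvs_g, tx_ac) -> str and collect mismatches."""
    failures = []
    for hgvs_g, tx_ac, expected in cases:
        actual = g_to_c(hgvs_g, tx_ac)
        if actual != expected:
            failures.append((hgvs_g, expected, actual))
    return failures


def make_hgvs_g_to_c():
    """Build a g_to_c callable backed by hgvs (assumed setup; needs data access)."""
    import hgvs.assemblymapper
    import hgvs.dataproviders.uta
    import hgvs.parser

    parser = hgvs.parser.Parser()
    hdp = hgvs.dataproviders.uta.connect()
    mapper = hgvs.assemblymapper.AssemblyMapper(
        hdp, assembly_name="GRCh38", alt_aln_method="splign"
    )

    def g_to_c(hgvs_g, tx_ac):
        var_g = parser.parse_hgvs_variant(hgvs_g)
        return str(mapper.g_to_c(var_g, tx_ac))

    return g_to_c
```

Calling `check_cases(make_hgvs_g_to_c())` on a build that includes this fix should return an empty list if the environment matches the one the tests were written against.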

andreasprlic and others added 6 commits March 14, 2026 12:41
When a genomic delins spans a position where the CIGAR alignment has an I-segment
(genomic bases absent from the transcript), both variant endpoints can land in normal
= segments so pos_c.uncertain remains False. The previous code path would strand-flip
the edit without adjusting for the missing transcript base, producing an incorrect
transcript edit.

Fix: add _variant_has_internal_gap() and _gap_segments_within_pos_g() helpers that
detect I/D segments strictly inside the variant interval, and route these cases
through _get_altered_tx_sequence() which correctly filters genomic-only bases before
building the transcript edit. Also refactors _i_segment_offsets_in_pos_g into the
new unified _gap_segments_within_pos_g.

Tests: update expected value for the adjacent-I-segment case (c.527_532delinsTGTGA,
which is the correct result since the I-segment at g.119027727 is outside the variant
g.119027721-119027726), and add a new test for the true internal-gap case
(g.119027726_119027728delinsTT -> c.526_527delinsAA).

Closes: hgvs-cl9, hgvs-hqa, hgvs-83z, hgvs-wma, hgvs-o26

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@andreasprlic andreasprlic requested a review from a team as a code owner March 26, 2026 15:38

Development

Successfully merging this pull request may close these issues.

delins adjacent to tx-ref disagreement incorrectly produces a double-gap instead of collapsing to a SNV
