Validate liftover results against ClinVar mappings #50

@jsstevenson

Feature description

It'd be nice to see how our own liftover tool agrees or disagrees with what's submitted as HGVS expressions to ClinVar. @theferrit32 posted a big parquet of ClinVar HGVS data to Slack. With some parsing, we could probably extract a large set of ClinVar-asserted liftover mappings and check whether the agct-provided liftover results are equivalent.

Use case

This library is at the mercy of UCSC-produced chainfile mappings and the corresponding format, but this is far from the only way to do a "liftover". It would be nice to get a sense of concordance against other libraries and methods.

If I understand correctly, ClinVar is not necessarily a gold-standard source of truth in this respect (are the mappings provided by the submitter or performed internally?), but it could still be a good experiment to run.

Acceptance Criteria

A script that ingests the ClinVar HGVS parquet, extracts every variation with multiple HGVS NC expressions, parses the chromosomes and positions out of them, feeds those through agct, aggregates counts of agreement and disagreement, and logs the discordant cases.

I don't think this necessarily should be a standard test case (although I bet the stripped-down expressions file might be small enough to check in?) but it'd be nice to have under analysis/ as something we could run occasionally, maybe with a notebook write-up.
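The "extract the chromosomes/positions" step could look roughly like the sketch below. It is only an illustration: the `GenomicStart` type and `parse_nc_hgvs` function are hypothetical names, the regex only captures the start position of a `g.` expression, and it assumes human chromosome accessions of the form `NC_0000NN` (with 23 and 24 standing for X and Y). It also assumes that, for a given chromosome, the higher accession version belongs to the newer assembly, which is worth double-checking against the data.

```python
import re
from typing import NamedTuple, Optional

# Matches e.g. "NC_000007.13:g.140453136A>T" and captures the chromosome
# number, the accession version, and the (start) genomic position.
HGVS_G = re.compile(r"^NC_0000(\d{2})\.(\d+):g\.(\d+)")


class GenomicStart(NamedTuple):
    """Hypothetical container for the parsed pieces of an NC_ expression."""

    chrom: str               # UCSC-style name, e.g. "chr7"
    accession_version: int   # e.g. 13 for NC_000007.13
    position: int            # position exactly as written in the HGVS string


def parse_nc_hgvs(expr: str) -> Optional[GenomicStart]:
    """Return chromosome, accession version, and start position, or None."""
    match = HGVS_G.match(expr.strip())
    if match is None:
        return None
    number = int(match.group(1))
    if not 1 <= number <= 24:
        return None
    # Assumption: NC_000023 is X and NC_000024 is Y.
    chrom = {23: "X", 24: "Y"}.get(number, str(number))
    return GenomicStart(f"chr{chrom}", int(match.group(2)), int(match.group(3)))
```

HGVS positions are 1-based, so whether an offset is needed before handing them to agct depends on the coordinate convention agct expects.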

Proposed solution

Use a query like the following to get every variation that has more than one NC-prefixed expression:

    import duckdb

    # Variations with more than one genomic (NC-prefixed) HGVS expression.
    # Rows containing uncertain ('?'), mitochondrial ('m.'), or parenthesized
    # expressions are skipped.
    multi_nc = duckdb.sql("""
        SELECT
            variation_id,
            STRING_AGG(hgvs_source, ', ') AS hgvs_source_csv
        FROM read_parquet('variation_hgvs.2025_06_14.parquet')
        WHERE hgvs_source LIKE 'NC%'
        GROUP BY variation_id
        HAVING
            COUNT(hgvs_source) > 1
            AND hgvs_source_csv NOT LIKE '%?%'
            AND hgvs_source_csv NOT LIKE '%m.%'
            AND hgvs_source_csv NOT LIKE '%(%'
    """)

Alternatives considered

No response

Implementation details

No response

Potential Impact

No response

Additional context

No response

Contribution

Yes, I can create a PR for this feature.

Labels

enhancement (New feature or request)
