Description
Feature description
It'd be nice to see how our own liftover tool agrees or disagrees with what's submitted as HGVS expressions to ClinVar. @theferrit32 posted a big parquet of ClinVar HGVS data to Slack. With some parsing, we could probably get a big set of ClinVar-asserted liftover mappings and check whether agct-provided liftover results are equivalent.
Use case
This library is at the mercy of UCSC-produced chainfile mappings and the corresponding format, but this is far from the only way to do a "liftover". It would be nice to get a sense of concordance against other libraries and methods.
If I understand correctly, ClinVar is not necessarily a gold-standard source of truth in this respect (are the mappings provided by the submitter or performed internally?), but it could still be a good experiment to run.
Acceptance Criteria
A script to ingest the ClinVar HGVS parquet, extract all cases of multiple HGVS NC expressions, extract the chromosomes/positions from them, feed those through agct, aggregate counts of agreement/disagreement, and log cases of concordance.
I don't think this necessarily should be a standard test case (although I bet the stripped-down expressions file might be small enough to check in?) but it'd be nice to have under analysis/ as something we could run occasionally, maybe with a notebook write-up.
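As a rough sketch of the parse-and-tally steps described above (not a finished implementation), something like the following could work. Everything here is an assumption for illustration: the `parse_nc_hgvs` and `tally_concordance` names are hypothetical, the liftover step is passed in as a plain callable so it isn't tied to any particular agct signature, and the lower accession version is simply treated as the source assembly. A real script would need a proper accession-version-to-assembly table and would have to check whether agct expects 0-based or 1-based positions.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional

# Accession number, accession version, and first genomic position of a g. expression.
_NC_PATTERN = re.compile(r"^NC_0*(\d+)\.(\d+):g\.(\d+)")


def parse_nc_hgvs(expr: str) -> Optional[tuple[str, int, int]]:
    """Return (chromosome, accession version, position), or None if unsupported."""
    match = _NC_PATTERN.match(expr.strip())
    if match is None:
        return None
    number, version, position = (int(group) for group in match.groups())
    if 1 <= number <= 22:
        chrom = f"chr{number}"
    elif number == 23:
        chrom = "chrX"
    elif number == 24:
        chrom = "chrY"
    else:
        # Mitochondrial and other non-chromosomal accessions are out of scope here.
        return None
    return chrom, version, position


def tally_concordance(
    rows: Iterable[list[str]],
    liftover: Callable[[str, int], Optional[int]],
) -> Counter:
    """Count agreement between ClinVar-asserted pairs and a liftover callable.

    Assumption: each row holds the NC expressions for one variation, and the
    expression with the lower accession version is the source assembly.
    """
    counts: Counter = Counter()
    for expressions in rows:
        parsed = [p for p in map(parse_nc_hgvs, expressions) if p is not None]
        parsed.sort(key=lambda p: p[1])  # order by accession version
        if (
            len(parsed) != 2
            or parsed[0][0] != parsed[1][0]  # different chromosomes
            or parsed[0][1] == parsed[1][1]  # same accession version
        ):
            counts["skipped"] += 1
            continue
        (chrom, _, source_pos), (_, _, expected_pos) = parsed
        lifted = liftover(chrom, source_pos)
        if lifted is None:
            counts["unmapped"] += 1
        elif lifted == expected_pos:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
    return counts
```

The `liftover` argument would be a thin wrapper around an agct converter that returns a single lifted position (or `None` when nothing maps). Keeping it injectable also makes it easy to run the same tally against other liftover libraries for comparison.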
Proposed solution
Use a query like the following to get all instances of multi-NC prefix expressions:
import duckdb

# Collect, per ClinVar variation, every NC-prefixed HGVS expression. Keep only
# variations with more than one, skipping any whose expressions contain "?",
# "m." (mitochondrial), or "(".
multi_nc = duckdb.sql("""
    SELECT
        variation_id,
        STRING_AGG(hgvs_source, ', ') AS hgvs_source_csv
    FROM read_parquet('variation_hgvs.2025_06_14.parquet')
    WHERE hgvs_source LIKE 'NC%'
    GROUP BY variation_id
    HAVING
        COUNT(hgvs_source) > 1
        AND hgvs_source_csv NOT LIKE '%?%'
        AND hgvs_source_csv NOT LIKE '%m.%'
        AND hgvs_source_csv NOT LIKE '%(%';
""")
Alternatives considered
No response
Implementation details
No response
Potential Impact
No response
Additional context
No response
Contribution
Yes, I can create a PR for this feature.