-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Feature description
We wrote this to replace pyliftover, but we are at the whims of an external library (the St Jude chainfile crate). It'd be good to be sure that we're producing accurate results. One way to produce some kind of sanity check would be to run agct and pyliftover over the whole genome to see if they agree with each other.
In particular, it seems like the result from pyliftover on negative strand input differs from agct by 1 position. I don't remember if this is intentional or not but might be something that we need to look at again.
Use case
We want accurate mappings.
Acceptance Criteria
A script in the analysis/ folder that runs a genome-wide pos-by-pos liftover and compares results between the two libraries + some way to log results and produce summary statistics. I don't think this should run in CI, per se, but it'd be good to have it on hand and run occasionally if there are major code changes.
Proposed solution
I scraped this together very quickly but it would need some polish
"""Compare results from pyliftover and agct"""
from pyliftover import LiftOver
from agct import Converter, Genome
pylo_conv = LiftOver("hg38", "hg19")
agct_conv = Converter(Genome.HG38, Genome.HG19)
chrom_names = [
"chr1",
"chr2",
"chr3",
"chr4",
"chr5",
"chr6",
"chr7",
"chr8",
"chr9",
"chr10",
"chr11",
"chr12",
"chr13",
"chr14",
"chr15",
"chr16",
"chr17",
"chr18",
"chr19",
"chr20",
"chr21",
"chr22",
"chrX",
"chrY"
]
def results_are_equal(py_result: list[list], agct_result: list[tuple]) -> bool:
if len(py_result) != len(agct_result):
return False
for py, agct in zip(py_result, agct_result):
if py[0] != agct[0]:
return False
if py[1] != agct[1]:
return False
if py[2] != agct[2].value:
return False
# ignore chainfile score value in pyliftover result
return True
# arbitrarily long value that should exceed the length of all chromosomes on all reference sequences
# it'd be nice to write in some expected lengths per chromosome
MAX_POS = 300000000
for chrom in chrom_names:
for pos in range(0, MAX_POS):
py_result = pylo_conv.convert_coordinate(chrom, pos)
agct_result = agct_conv.convert_coordinate(chrom, pos, "+")
if py_result and py_result[0][2] == "-":
continue # these all seem to be off by 1
if not results_are_equal(py_result, agct_result):
print((chrom, pos, py_result, agct_result))Alternatives considered
No response
Implementation details
No response
Potential Impact
No response
Additional context
No response
Contribution
None