Skip to content

Validate liftover results against pyliftover #51

@jsstevenson

Description

@jsstevenson

Feature description

We wrote this to replace pyliftover, but we are at the whims of an external library (the St Jude chainfile crate). It'd be good to be sure that we're producing accurate results. One way to produce some kind of sanity check would be to run agct and pyliftover over the whole genome to see if they agree with each other.

In particular, it seems like the result from pyliftover on negative strand input differs from agct by 1 position. I don't remember if this is intentional or not but might be something that we need to look at again.

Use case

We want accurate mappings.

Acceptance Criteria

A script in the analysis/ folder that runs a genome-wide pos-by-pos liftover and compares results between the two libraries + some way to log results and produce summary statistics. I don't think this should run in CI, per se, but it'd be good to have it on hand and run occasionally if there are major code changes.

Proposed solution

I scraped this together very quickly but it would need some polish

"""Compare results from pyliftover and agct"""
from pyliftover import LiftOver

from agct import Converter, Genome

pylo_conv = LiftOver("hg38", "hg19")
agct_conv = Converter(Genome.HG38, Genome.HG19)

chrom_names = [
    "chr1",
    "chr2",
    "chr3",
    "chr4",
    "chr5",
    "chr6",
    "chr7",
    "chr8",
    "chr9",
    "chr10",
    "chr11",
    "chr12",
    "chr13",
    "chr14",
    "chr15",
    "chr16",
    "chr17",
    "chr18",
    "chr19",
    "chr20",
    "chr21",
    "chr22",
    "chrX",
    "chrY"
]

def results_are_equal(py_result: list[list], agct_result: list[tuple]) -> bool:
    if len(py_result) != len(agct_result):
        return False
    for py, agct in zip(py_result, agct_result):
        if py[0] != agct[0]:
            return False
        if py[1] != agct[1]:
            return False
        if py[2] != agct[2].value:
            return False
        # ignore chainfile score value in pyliftover result
    return True

# arbitrarily long value that should exceed the length of all chromosomes on all reference sequences 
# it'd be nice to write in some expected lengths per chromosome
MAX_POS = 300000000

for chrom in chrom_names:
    for pos in range(0, MAX_POS): 
        py_result = pylo_conv.convert_coordinate(chrom, pos)
        agct_result = agct_conv.convert_coordinate(chrom, pos, "+")
        if py_result and py_result[0][2] == "-":
            continue  # these all seem to be off by 1
        if not results_are_equal(py_result, agct_result):
            print((chrom, pos, py_result, agct_result))

Alternatives considered

No response

Implementation details

No response

Potential Impact

No response

Additional context

No response

Contribution

None

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions