Genomic Coordinate Liftover with ML Confidence Prediction

Honest Implementation: This tool provides accurate coordinate liftover with ML-based confidence prediction. It does NOT claim to be a general-purpose AI genomics platform.

What This Tool Actually Does

Core Features

Accurate Coordinate Liftover
- Uses UCSC LiftOver chain files (pyliftover)
- Supports hg19 (GRCh37) ↔ hg38 (GRCh38)
- Per-variant tracking and validation against authoritative coordinates
ML Confidence Prediction
- Gradient boosting classifier with calibrated probability outputs
- Interpretable feature contributions (SHAP-compatible export)
- Expanded feature set including chain quality, repeats, SVs, GC content, and historical region success
VCF File Conversion & Streaming
- Complete VCF 4.x parsing and validation
- Preserves sample/genotype information
- Streaming-friendly batch processing for large VCFs
- Per-variant conversion status and quality metrics
Scalable Batch Processing (Only Partially Functional - Repeated Separate Liftover Recommended)
- Parallelized batch liftover worker (configurable concurrency)
- CLI support for batched jobs and pipeline integration
- Optional chunked processing to limit memory footprint
Deployment & Reproducibility
- Docker image for demo and reproducible environments
- GitHub Actions CI for tests and linting
- Pre-built demo on Render (where available)
Expanded Validation
- Broader RefSeq-derived training/validation set (expanded towards ~20K genes)
- Exportable validation reports with per-chromosome and calibration analysis
Optional Integrations (Pluggable)
- Ensembl API connector (experimental) for cross-references
- RepeatMasker, DGV/gnomAD-SV integration to flag problematic regions

What This Tool Does NOT Do

Be honest:

This is NOT clinical-grade variant interpretation (not FDA/CLIA).
It is not a general biomedical language model — semantic reconciliation is lightweight.
Multi-species liftover is experimental/partial; human assemblies are the primary supported target.
Real-time production-scale RL or deep learning components are not part of the stable core.

Quick Start

Install (recommended: virtualenv) or use Docker

Using Python virtual environment:

git clone https://github.com/Arnav261/genomic-annotation-version-controllers.git
cd genomic-annotation-version-controllers

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

Download UCSC chain files (one-time):

mkdir -p app/data/chains
cd app/data/chains
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
cd ../../..

Run the API server:

uvicorn app.main:app --reload

Using Docker (recommended for demo or reproducible runs):

# Build
docker build -t gavo-liftover:latest .

# Run (exposes API on 8000)
docker run --rm -p 8000:8000 \
  -v "$(pwd)/app/data/chains:/app/data/chains:ro" \
  gavo-liftover:latest

# Or run the published image (if available)
docker run --rm -p 8000:8000 arnav261/gavo-liftover:latest

CLI example (batch liftover)

# liftover-cli is available in the repo to run local batch jobs
liftover-cli --input variants.vcf --output variants.lifted.vcf \
  --from hg19 --to hg38 --workers 4 --include-ml

Basic Python usage (single coordinate)

import requests

response = requests.post(
    "http://localhost:8000/liftover/single",
    params={
        "chrom": "chr17",
        "pos": 41196312,
        "from_build": "hg19",
        "to_build": "hg38",
        "include_ml": True
    }
)

result = response.json()
print(f"Lifted position: {result.get('lifted_pos')}")
print(f"ML confidence: {result.get('ml_analysis',{}).get('confidence_score')}")
print(f"Recommendation: {result.get('ml_analysis',{}).get('interpretation',{}).get('recommendation')}")

Technical Details

ML Confidence Model

Algorithm: Gradient boosting classifier (scikit-learn) with probability calibration. Optional LightGBM backend supported for faster training/inference.
Features (examples):
- Chain alignment score and agreement
- Chain gap and local chain context
- Local GC content (±1kb)
- RepeatMasker overlap density and low-complexity flag
- Structural variant overlap (DGV/gnomAD-SV)
- Segmental duplication overlap
- Distance to assembly gaps
- Historical region liftover success
Explainability: SHAP export enables per-variant explanation vectors that can be included in reports.

Training and validation:

Training dataset expanded using RefSeq-derived coordinates; ongoing efforts to grow the validated set to ~20K genes for robust calibration.
Internal cross-validation shows improved calibration and stability compared to the initial prototype. See exported validation reports for detailed metrics.

Validation & Reports

The /validation-report endpoint and exported artifacts include:

Per-chromosome accuracy
Confidence score calibration plots and metrics
Error distribution statistics and edge-case summaries
Comparison methodology and reproducible scripts used for benchmarking

API Endpoints (high level)

POST /liftover/single — single coordinate liftover (with optional ML output)
POST /liftover/batch — JSON array of coordinates; supports streaming/async mode
POST /liftover/stream — upload VCF for streaming conversion (recommended for large files)
GET /validation-report — returns the latest validation artifacts
GET /health — service and ML model health with honest status report

Example cURL (single):

curl -X POST "http://localhost:8000/liftover/single?chrom=chr17&pos=41196312&from_build=hg19&to_build=hg38&include_ml=true"

Batch (file upload / streaming is recommended for large sets):

curl -X POST "http://localhost:8000/liftover/batch" \
  -H "Content-Type: application/json" \
  -d '[{"chrom": "chr17", "pos": 41196312}, {"chrom": "chr7", "pos": 55086725}]'

Development Roadmap

Phase 1: Previous State

Core liftover using UCSC chain files
ML confidence prediction (calibrated gradient boosting)
VCF conversion and streaming support
FastAPI with OpenAPI docs and health checks
Docker image and basic CI

Phase 2: Enhanced ML (In Progress - Near Completion)

Expanded RefSeq-derived training and validation dataset
SHAP-compatible explainability export
Parallel batch processing and CLI
Further calibration with additional validated variants (Partially achieved with semantic reconciliation)

Phase 3: Advanced Features (Planned)

Biomedical NLP (PubMedBERT) for semantic reconciliation (experimental)
Multi-species liftover support (beta)
RepeatMasker and DGV/gnomAD-SV full integration
Real-time Ensembl API backlinking and cross-references
Benchmarking against CrossMap and UCSC liftOver at scale

Phase 4: Publication (Goal)

Comprehensive benchmark and methods paper
Performance profiling and memory/speed optimizations
Community-driven validations and user-facing tutorials

Data Sources

Required

UCSC Chain Files: coordinate mappings
- Source: http://hgdownload.soe.ucsc.edu/goldenPath/
- License: open access for research
NCBI RefSeq: gene coordinates for training/validation
- Source: ftp://ftp.ncbi.nlm.nih.gov/refseq/
- License: public domain

Optional (improves accuracy)

RepeatMasker annotations
DGV / gnomAD-SV for structural variant context
Ensembl cross-references (experimental)
ClinVar variant coordinates for clinical validation

Citation

If you use this tool in your research, please cite:

@software{asher2025genomic,
  author = {Asher, Arnav},
  title = {Genomic Coordinate Liftover with ML Confidence Prediction},
  year = {2025},
  doi = {10.5281/zenodo.16966073},
  url = {https://github.com/Arnav261/genomic-annotation-version-controllers}
}

Contributing

This is a learning & research project — contributions and feedback are welcome.

Helpful contributions

Testing on additional gene sets and real VCFs
Integration and benchmarking with other liftover tools (CrossMap, UCSC liftOver)
Performance optimization and memory profiling
Improvements to ML model, calibration, and feature engineering
Better docs and example pipelines

How to contribute:

Fork the repository
Create a feature branch (git checkout -b feature/improvement)
Add tests where applicable
Submit a pull request with a clear description

Known Issues & Limitations

Current Limitations

Assembly support: Primary support remains hg19 ↔ hg38 (human). Multi-species is experimental.
ML model: While training data has been expanded, more high-quality validated variants are needed for production-grade calibration.
Semantic features: Still lightweight; PubMedBERT integration is planned but not production-ready.
Resource usage: Large VCF streaming requires adequate memory and I/O; tune worker count accordingly.

Edge Cases

Centromeres, telomeres, and PAR regions remain problematic.
Large structural variants, complex rearrangements, or regions adjacent to assembly gaps may fail liftover or be assigned low confidence.
X/Y pseudoautosomal edge cases require cautious interpretation.

Support

Issues: GitHub Issues
Email: arnavasher007@gmail.com
Demo: genomic-annotation-version-controller.onrender.com (subject to availability)

Acknowledgments

UCSC Genome Browser: chain files and liftover resources
NCBI: RefSeq coordinates
pyliftover, scikit-learn, FastAPI
Community contributors and early testers

Honest Self-Assessment

What works well:

Accurate and validated liftover for many genic regions
ML confidence gives interpretable uncertainty that is useful for filtering and triage
Docker and CLI make reproducible demos and batch processing straightforward

What needs improvement:

Larger, high-quality training/validation datasets for production-grade calibration
Full integration with large annotation sources (RepeatMasker, gnomAD-SV)
Multi-species and advanced NLP features remain experimental

Rating: 7/10 for research use; with additional validation, documentation, and benchmarks this can rise for production research pipelines.

Last Updated: 2026-01-30
Version: 5.0.0
Status: Active Development

Name		Name	Last commit message	Last commit date
Latest commit History 153 Commits
.github/workflows		.github/workflows
.vscode		.vscode
app		app
data		data
docker		docker
notebooks		notebooks
pytorch		pytorch
requirements		requirements
scripts		scripts
.DS_Store		.DS_Store
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
pytest.ini		pytest.ini
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Genomic Coordinate Liftover with ML Confidence Prediction

What This Tool Actually Does

Core Features

What This Tool Does NOT Do

Quick Start

Install (recommended: virtualenv) or use Docker

CLI example (batch liftover)

Basic Python usage (single coordinate)

Technical Details

ML Confidence Model

Validation & Reports

API Endpoints (high level)

Development Roadmap

Phase 1: Previous State

Phase 2: Enhanced ML (In Progress - Near Completion)

Phase 3: Advanced Features (Planned)

Phase 4: Publication (Goal)

Data Sources

Required

Optional (improves accuracy)

Citation

Contributing

Helpful contributions

Known Issues & Limitations

Current Limitations

Edge Cases

Support

Acknowledgments

Honest Self-Assessment

About

Uh oh!

Releases 4

Packages

Languages

Arnav261/genomic-annotation-version-controllers

Folders and files

Latest commit

History

Repository files navigation

Genomic Coordinate Liftover with ML Confidence Prediction

What This Tool Actually Does

Core Features

What This Tool Does NOT Do

Quick Start

Install (recommended: virtualenv) or use Docker

CLI example (batch liftover)

Basic Python usage (single coordinate)

Technical Details

ML Confidence Model

Validation & Reports

API Endpoints (high level)

Development Roadmap

Phase 1: Previous State

Phase 2: Enhanced ML (In Progress - Near Completion)

Phase 3: Advanced Features (Planned)

Phase 4: Publication (Goal)

Data Sources

Required

Optional (improves accuracy)

Citation

Contributing

Helpful contributions

Known Issues & Limitations

Current Limitations

Edge Cases

Support

Acknowledgments

Honest Self-Assessment

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Languages

Packages