Genomic Coordinate Liftover with ML Confidence Prediction
Version: 5.0.1
Release date: 2026-01-30
Status: Active development
Demo available at https://genomic-annotation-version-controller.onrender.com
Highlights
- Parallelized batch liftover and streaming VCF support for improved throughput on large files.
- Official CLI (liftover-cli) for reproducible local batch jobs and pipeline integration.
- Docker image and recommended run patterns for reproducible demos and deployments.
- Expanded ML training/validation pipeline and SHAP-compatible explainability export for per-variant interpretation.
- Optional LightGBM backend supported for faster training/inference.
- Basic CI and smoke tests added to improve reliability of core functionality.
New features
- Parallel batch processing
- Configurable worker count for batch liftover jobs.
- Chunked processing mode to control memory footprint for large VCFs.
- Streaming VCF liftover endpoint
- POST /liftover/stream accepts a VCF upload and streams transformed variants back, reducing temporary disk use.
- ML / explainability
- SHAP-compatible per-variant explainability export (JSON) added to inference path.
- Option to use LightGBM backend (faster training/inference; drop-in flag in training/inference configs).
- Expanded validation dataset
- Scripts and manifest for assembling a larger RefSeq-derived training/validation set included (see docs/validation/).
- NOTE: public dataset downloads are scripted; some sources require direct download and are NOT bundled in the repo.
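The parallel batch options listed above (configurable worker count, chunked processing) can be illustrated with a minimal sketch. This is not the project's implementation: `iter_chunks`, `lift_in_chunks`, and the `lift_one` callable are hypothetical names introduced here, and a thread pool stands in for whatever worker pool the tool actually uses.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def iter_chunks(records: Iterable[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield lists of at most chunk_size records, reading the input lazily."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def lift_in_chunks(
    records: Iterable[T],
    lift_one: Callable[[T], R],  # hypothetical per-variant liftover function
    workers: int = 4,
    chunk_size: int = 1000,
) -> Iterator[R]:
    """Lift records chunk by chunk; only one chunk is held in memory at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in iter_chunks(records, chunk_size):
            # pool.map returns results in input order, so output order is stable.
            yield from pool.map(lift_one, chunk)
```

Because `lift_in_chunks` is a generator, results can be written out as they arrive instead of being collected first; peak memory then scales with `chunk_size` rather than with file size.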
Improvements / Enhancements
- Improved API
- Health endpoint expanded to include ML model readiness, recent validation artifact timestamp, and worker pool status.
- Batch endpoint supports asynchronous job submission with job status polling.
- Performance and robustness
- I/O and memory improvements for streaming and chunked VCF processing.
- Worker pool is resilient to individual variant failures; failures are recorded per-variant and do not abort the entire job.
- Documentation
- Reworked README sections for Quick Start (virtualenv and Docker), CLI examples, and upgrade notes.
- New reproducible training runbook (small dataset) and example notebooks added under docs/examples/.
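The per-variant failure handling noted under Performance and robustness can be pictured roughly as follows. This is a sketch under assumptions, not the tool's worker code: `VariantResult`, `lift_all`, and `lift_one` are hypothetical names, and the real job records may carry different fields.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class VariantResult:
    """Outcome for one input variant (hypothetical record shape)."""
    variant: str
    lifted: Optional[str] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def lift_all(
    variants: Iterable[str],
    lift_one: Callable[[str], str],  # hypothetical per-variant liftover function
) -> List[VariantResult]:
    """Lift every variant, recording failures instead of aborting the job."""
    results: List[VariantResult] = []
    for variant in variants:
        try:
            results.append(VariantResult(variant=variant, lifted=lift_one(variant)))
        except Exception as exc:
            # Record the failure against this variant and keep going.
            message = f"{type(exc).__name__}: {exc}"
            results.append(VariantResult(variant=variant, error=message))
    return results
```

A caller can then split the output into lifted and failed variants with `[r for r in results if r.ok]` and its complement, which keeps the failure report aligned with the input.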
Bug fixes
- Fix: liftover chain parsing bug that produced incorrect chain agreement counts in rare chain overlap cases.
- Fix: off-by-one coordinate handling in VCF streaming that affected indel normalization in some edge cases.
- Fix: API error responses standardized (consistent JSON schema with error, code, details).
- Fix: model calibration step now respects seed control for reproducible calibration artifacts.
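For readers unfamiliar with why off-by-one errors arise in this area: VCF positions are 1-based, while UCSC chain files use 0-based, half-open intervals. The sketch below shows that general conversion only. It is not the fix applied in this release, and the function names are hypothetical.

```python
from typing import Tuple


def vcf_to_half_open(pos: int, ref: str) -> Tuple[int, int]:
    """Convert a 1-based VCF POS and REF allele to a 0-based half-open interval."""
    if pos < 1:
        # Telomeric records with POS 0 are not handled in this sketch.
        raise ValueError("POS must be >= 1")
    if not ref:
        raise ValueError("REF must be a non-empty allele string")
    start = pos - 1
    return start, start + len(ref)


def half_open_start_to_vcf_pos(start: int) -> int:
    """Convert a 0-based interval start back to a 1-based VCF POS."""
    return start + 1


# SNV at POS 100 with REF "A" covers [99, 100).
print(vcf_to_half_open(100, "A"))    # (99, 100)
# Deletion at POS 100 with REF "ACG" covers [99, 102).
print(vcf_to_half_open(100, "ACG"))  # (99, 102)
```

For indels the interval length comes from REF rather than ALT, which is why a one-base slip in either endpoint changes which reference bases are considered during normalization.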
Breaking changes / Migration notes
- Config change: the default chain-files path remains app/data/chains/, but the recommended container mount point is now explicit in the Docker examples. If you use a different layout, update your scripts or set the LIFTOVER_CHAIN_DIR environment variable to point to your chain directory.
- SHAP output: explainability export format is JSON with a new schema (explainability.version: "1.0"). If you parse previous explainability tables, update your parsers to accept the new JSON schema.
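Downstream scripts affected by the two migration notes above might guard themselves along these lines. This is a sketch with assumptions: it supposes that LIFTOVER_CHAIN_DIR overrides the default path when set, and that the version field is nested as an "explainability" object with a "version" key. Neither detail is confirmed here, and the helper names are hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, Mapping, Optional

DEFAULT_CHAIN_DIR = "app/data/chains/"
SUPPORTED_EXPLAINABILITY_VERSIONS = {"1.0"}


def resolve_chain_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the chain directory, preferring LIFTOVER_CHAIN_DIR if it is set."""
    env = os.environ if env is None else env
    # Assumption: an unset or empty variable falls back to the default path.
    return Path(env.get("LIFTOVER_CHAIN_DIR") or DEFAULT_CHAIN_DIR)


def load_explainability(raw: str) -> Dict[str, Any]:
    """Parse an explainability export and reject unknown schema versions."""
    doc = json.loads(raw)
    # Assumption: the version lives at doc["explainability"]["version"].
    section = doc.get("explainability") if isinstance(doc, dict) else None
    version = section.get("version") if isinstance(section, dict) else None
    if version not in SUPPORTED_EXPLAINABILITY_VERSIONS:
        raise ValueError(f"unsupported explainability version: {version!r}")
    return doc
```

Failing fast on an unexpected version is usually preferable to silently reading an older table-shaped export with a JSON parser.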
Known issues
- ML training reproducibility: while training scripts are included, full-size RefSeq downloads are large and not fully included; exact AUC and calibration numbers depend on the final curated dataset and random seeds.
- Multi-species liftover: experimental and partial. Human hg19↔hg38 is the primary supported path.
- Large structural variants and complex rearrangements may still fail liftover or receive low-confidence scores — these are flagged but not automatically resolved.
- Security / privacy: the tool is not hardened for protected human data. Users must ensure compliance with local privacy policies before processing real human genomic data.
Documentation & examples
- Quick Start, CLI examples, and Docker usage: see README.md.
- Reproducible small training run and example notebooks: docs/examples/.
- Validation scripts and manifests: docs/validation/.
- API reference: available at runtime via FastAPI OpenAPI docs (default: http://localhost:8000/docs).