Privacy-preserving population classification using recombination-defined genomic hashes and federated learning.
This project provides an end-to-end pipeline for:
- Haploblock Clustering - Generate recombination-defined genomic hashes from phased VCF data
- SNP Deconvolution - GPU-accelerated machine learning for population classification
- Federated Learning - NVFlare integration for multi-site privacy-preserving training
graph TD
subgraph Data_Pipeline [Data & Feature Layer]
RAW[Biobank VCF / HB Hashes] --> DP[Data Preprocessing]
DP --> FS[Feature Set: Sparse Matrix/Tensors]
end
subgraph Model_Abstraction [Unified Model Abstraction Layer]
FS --> M_INT{{"BaseSNPModel (Interface)"}}
subgraph Implementations [Internal Implementations]
M_INT --> XGB[XGBoost Strategy]
M_INT --> ADL[Attention DL Strategy]
end
XGB & ADL --> M_OUT["Unified Output: Logits / Feature Importance"]
end
subgraph Federated_Layer [NVFlare Federated Infrastructure]
M_INT -.-> |Strategy Injection| EXEC[SNPDeconvExecutor]
EXEC --> |Comm| SERVER[NVFlare Aggregator]
SERVER --> |Global Update| EXEC
end
subgraph Evaluation [Evaluation Framework]
M_OUT --> Metrics[AUC / PRC / SNP Ranking]
Metrics --> Val[Biological Validation via ClinVar]
end
style M_INT fill:#f96,stroke:#333,stroke-width:4px
style Federated_Layer fill:#e1f5fe,stroke:#01579b
style Implementations fill:#fff3e0,stroke:#ff6f00,stroke-dasharray: 5 5
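The diagram above describes a strategy pattern: the executor depends only on the `BaseSNPModel` interface, and the XGBoost or attention model is injected behind it. A minimal sketch of that idea follows. Only the name `BaseSNPModel` comes from this README; the method names, `MajorityClassModel`, and `run_local_round` are hypothetical and are not the project's actual API.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class BaseSNPModel(ABC):
    """Assumed shape of the shared interface (method names are hypothetical)."""

    @abstractmethod
    def fit(self, features: Sequence[Sequence[int]], labels: Sequence[int]) -> None:
        """Train on a feature matrix and integer class labels."""

    @abstractmethod
    def predict_logits(self, features: Sequence[Sequence[int]]) -> List[List[float]]:
        """Return one score per class for every row."""

    @abstractmethod
    def feature_importance(self) -> Dict[int, float]:
        """Return an importance score per feature index."""


class MajorityClassModel(BaseSNPModel):
    """Toy stand-in for a real strategy; it ignores the features entirely."""

    def __init__(self, num_classes: int) -> None:
        self.num_classes = num_classes
        self.counts = [0] * num_classes
        self.num_features = 0

    def fit(self, features, labels):
        self.counts = [0] * self.num_classes
        for label in labels:
            self.counts[label] += 1
        self.num_features = len(features[0]) if features else 0

    def predict_logits(self, features):
        # Add-one smoothed log class frequencies, identical for every row.
        total = sum(self.counts) + self.num_classes
        row = [math.log((count + 1) / total) for count in self.counts]
        return [list(row) for _ in features]

    def feature_importance(self):
        return {index: 0.0 for index in range(self.num_features)}


def run_local_round(model: BaseSNPModel, features, labels) -> List[List[float]]:
    """Executor-style code that only touches the interface, never a concrete class."""
    model.fit(features, labels)
    return model.predict_logits(features)
```

Because `run_local_round` never names a concrete class, swapping one strategy for another would not require changes to the federated layer.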
Haploblock_Clusters_ElixirBH25/
│
├── haploblock_pipeline/ # Phase 1: Haploblock Clustering
│ ├── step1_haploblocks.py # Define haploblock boundaries
│ ├── step2_phased_sequences.py # Extract phased sequences
│ ├── step3_merge_fasta.py # Merge sequences
│ ├── step4_clusters.py # MMSeqs2 clustering
│ └── step5_variant_hashes.py # Generate genomic hashes
│
├── snp_deconvolution/ # Phase 2: ML/DL Classification
│ ├── data_integration/ # Data loading
│ │ ├── cluster_feature_loader.py # Cluster IDs → Embedding
│ │ └── sparse_genotype_matrix.py # VCF → Sparse matrix
│ ├── xgboost/ # XGBoost GPU
│ │ ├── xgb_trainer.py # GPU histogram training
│ │ └── feature_selector.py # Iterative SNP selection
│ ├── attention_dl/ # Deep Learning
│ │ ├── lightning_trainer.py # PyTorch Lightning
│ │ └── nvflare_lightning.py # NVFlare integration
│ └── nvflare_base/ # Federated Learning
│ ├── base_executor.py # Abstract executor
│ └── *_nvflare_wrapper.py # Model wrappers
│
├── dl_models/ # Model Architectures
│ ├── haploblock_embedding_model.py # Embedding + Transformer
│ └── snp_interpretable_models.py # CNN/Transformer
│
└── data/ # Input Data
├── *.vcf.gz # 1000 Genomes VCF
└── igsr-*.tsv # Population labels
The pipeline generates unique genomic identifiers that encode:
individual_hash = strand(4) + chromosome(10) + haploblock(20) + cluster(20) + [variants]
| Component | Bits | Description |
|---|---|---|
| Strand | 4 | Haplotype strand (0 or 1) |
| Chromosome | 10 | Chromosome number |
| Haploblock | 20 | Haploblock position index |
| Cluster | 20 | MMSeqs2 cluster membership |
| Variants | N | Optional SNP-specific encoding |
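To make the layout in the table concrete, here is a small sketch that packs and unpacks the four fixed-width fields. It assumes each field is an unsigned integer written as a zero-padded binary string and concatenated in the order shown; the pipeline's real encoding (in `step5_variant_hashes.py`) may differ, and `encode_hash` and `decode_hash` are hypothetical names.

```python
# Field widths in bits, in the order given by the table above.
FIELD_WIDTHS = {"strand": 4, "chromosome": 10, "haploblock": 20, "cluster": 20}


def encode_hash(strand: int, chromosome: int, haploblock: int, cluster: int,
                variants: str = "") -> str:
    """Concatenate zero-padded binary fields (assumed encoding, not confirmed)."""
    values = {
        "strand": strand,
        "chromosome": chromosome,
        "haploblock": haploblock,
        "cluster": cluster,
    }
    parts = []
    for name, width in FIELD_WIDTHS.items():
        value = values[name]
        if not 0 <= value < 2 ** width:
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        parts.append(format(value, f"0{width}b"))
    # The optional variant suffix is passed through unchanged.
    return "".join(parts) + variants


def decode_hash(bits: str) -> dict:
    """Split a hash back into its fields; any trailing bits are the variant part."""
    fields = {}
    offset = 0
    for name, width in FIELD_WIDTHS.items():
        fields[name] = int(bits[offset:offset + width], 2)
        offset += width
    fields["variants"] = bits[offset:]
    return fields
```

With these widths the fixed part of the hash is 54 bits long, so the cluster field can be sliced out without parsing the variant suffix.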
Key insight: of these fields, the cluster ID is the one used as a categorical feature for ML/DL.
Both models now support the privacy-preserving Cluster ID mode (recommended):
graph TD
subgraph Input ["Unified Input (Privacy-Preserving)"]
PIPE["Pipeline Output<br/>(clusters/*.tsv)"]
end
subgraph Feature ["Feature Extraction"]
PIPE --> CID["Cluster ID Matrix<br/>(samples × haploblocks)"]
end
subgraph Models ["Models"]
CID --> XGB["XGBoost<br/>(Categorical Features)"]
CID --> EMB["Embedding Layer"]
EMB --> DL["Deep Learning<br/>(Transformer)"]
end
XGB --> OUT["Population<br/>Classification"]
DL --> OUT
style CID fill:#ff9800,stroke:#ef6c00,stroke-width:2px
style XGB fill:#4caf50,stroke:#2e7d32,stroke-width:2px,color:#fff
style DL fill:#2196f3,stroke:#1565c0,stroke-width:2px,color:#fff
| Mode | Privacy | Input | Use Case |
|---|---|---|---|
| Cluster (default) | High | Cluster ID matrix | Privacy-preserving federated learning |
| SNP (baseline) | Low | Sparse SNP matrix | Baseline comparison |
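As an illustration of the cluster mode input, the sketch below builds a samples × haploblocks matrix of integer cluster codes from flat assignment records. The record layout `(sample, haploblock, cluster label)`, the `-1` missing value, and the function name are assumptions made for this example; the actual `clusters/*.tsv` format is not described here.

```python
from typing import Dict, Iterable, List, Tuple


def build_cluster_matrix(
    assignments: Iterable[Tuple[str, str, str]], missing: int = -1
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Turn (sample, haploblock, cluster label) records into an integer matrix."""
    records = list(assignments)
    samples = sorted({sample for sample, _, _ in records})
    blocks = sorted({block for _, block, _ in records})
    row_of = {sample: i for i, sample in enumerate(samples)}
    col_of = {block: j for j, block in enumerate(blocks)}

    # One vocabulary per haploblock: cluster labels are only comparable
    # within the same haploblock, so codes restart at 0 in every column.
    vocab: Dict[str, Dict[str, int]] = {block: {} for block in blocks}
    matrix = [[missing] * len(blocks) for _ in samples]

    for sample, block, cluster in records:
        codes = vocab[block]
        code = codes.setdefault(cluster, len(codes))
        matrix[row_of[sample]][col_of[block]] = code
    return samples, blocks, matrix
```

Each column can then be treated as a categorical feature by a tree model or fed to an embedding layer, without exposing individual genotypes.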
| Aspect | XGBoost GPU | Deep Learning |
|---|---|---|
| Input | Cluster ID (categorical) | Cluster ID → Embedding |
| Features | XGBoost auto-splits | Learned representations |
| Model | Gradient boosted trees | CNN + Transformer |
| Interpretability | Haploblock importance | Attention weights |
| Speed | Fast | Slower |
| Long-range patterns | Limited | Captures via Transformer |
| Best for | Quick baseline | Complex interactions |
graph TD
subgraph Raw ["Raw Data"]
VCF["1000 Genomes VCF<br/>(chr6, 2548 samples)"]
REC["Recombination Map<br/>(Halldorsson 2019)"]
POP["Population Files<br/>(CHB, GBR, PUR)"]
end
subgraph Pipeline ["Haploblock Pipeline"]
REC --> BOUNDS["2,288 Haploblocks<br/>(chromosome 6)"]
VCF --> PHASE["Phased Sequences"]
BOUNDS --> PHASE
PHASE --> MMSEQ["MMSeqs2 Clustering"]
MMSEQ --> CLUSTERS["Cluster Assignments<br/>(per haploblock)"]
end
subgraph Features ["Feature Extraction"]
CLUSTERS --> |"For DL"| CID["Cluster ID Matrix<br/>(2548 × 2288)"]
VCF --> |"For XGBoost"| SPARSE["Sparse Genotype Matrix"]
POP --> LABELS["Population Labels<br/>(0/1/2)"]
end
subgraph Training ["Model Training"]
CID --> DL["Embedding → Transformer"]
SPARSE --> XGB["XGBoost GPU"]
LABELS --> DL
LABELS --> XGB
end
style CLUSTERS fill:#ff9800,stroke:#ef6c00,stroke-width:2px
style CID fill:#ff9800,stroke:#ef6c00,stroke-width:2px
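The label step in the diagram above (population files to labels 0/1/2) can be sketched as follows. It assumes each population file is a tab-separated table with a `Sample name` column and that the label is the file's position in the list; both the column name and the `load_labels` helper are assumptions for illustration, and the sample IDs in the demo are placeholders.

```python
import csv
import io
from typing import Dict, Iterable, TextIO


def load_labels(
    population_tables: Iterable[TextIO], sample_column: str = "Sample name"
) -> Dict[str, int]:
    """Map each sample ID to the index of the population file it appears in."""
    labels: Dict[str, int] = {}
    for label, handle in enumerate(population_tables):
        for row in csv.DictReader(handle, delimiter="\t"):
            labels[row[sample_column]] = label
    return labels


if __name__ == "__main__":
    # In-memory stand-ins for two population files (illustrative content only).
    first = io.StringIO("Sample name\tPopulation code\nSAMPLE_A\tCHB\n")
    second = io.StringIO("Sample name\tPopulation code\nSAMPLE_B\tGBR\n")
    print(load_labels([first, second]))
```

Deriving labels from file order keeps them consistent with the order of `population_files` in the configuration.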
# snp_deconvolution/config/deconv_config.yaml
data:
pipeline_output_dir: "out_dir/TNFa"
population_files:
- "data/igsr-chb.tsv.tsv" # CHB (label: 0)
- "data/igsr-gbr.tsv.tsv" # GBR (label: 1)
- "data/igsr-pur.tsv.tsv" # PUR (label: 2)
xgboost:
n_estimators: 2000
max_depth: 6
tree_method: "gpu_hist" # XGBoost >= 2.0 deprecates "gpu_hist" in favor of tree_method "hist" with device "cuda"
deep_learning:
architecture: "cnn_transformer"
lightning:
precision: "bf16-mixed"
model:
embedding_dim: 32
transformer_dim: 128
num_heads: 8
nvflare:
aggregation_strategy: "fedavg"
num_rounds: 50
Pipeline tested on:
- Chromosome 6: 2,288 haploblocks
- Populations: CHB (Han Chinese), GBR (British), PUR (Puerto Rican)
- Samples: 2,548 individuals from 1000 Genomes Phase 3
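For readers unfamiliar with the `fedavg` aggregation strategy named in the configuration, the sketch below shows the core arithmetic of federated averaging: each site's parameters are weighted by its local sample count. This is a plain-Python illustration with a hypothetical function name; it does not use or mirror the NVFlare API.

```python
from typing import List, Sequence


def federated_average(
    site_weights: Sequence[Sequence[float]], site_sizes: Sequence[int]
) -> List[float]:
    """Sample-size-weighted mean of per-site parameter vectors."""
    if len(site_weights) != len(site_sizes) or not site_weights:
        raise ValueError("need one sample count per site and at least one site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    length = len(site_weights[0])
    if any(len(weights) != length for weights in site_weights):
        raise ValueError("all sites must send vectors of the same length")

    averaged = [0.0] * length
    for weights, size in zip(site_weights, site_sizes):
        share = size / total  # this site's fraction of all training samples
        for index, value in enumerate(weights):
            averaged[index] += value * share
    return averaged
```

Only parameter vectors and sample counts cross site boundaries in this scheme; the underlying genotype or cluster data stays local.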
- Halldorsson et al. (2019). Characterizing mutagenic effects of recombination through a sequence-level genetic map. Science, 363(6425).
- Palsson et al. (2025). Complete human recombination maps. Nature, 639, 700-707.
- NVFlare Documentation: https://nvflare.readthedocs.io/
This work was supported by ELIXIR, the research infrastructure for life science data, and conducted at the ELIXIR BioHackathon Europe 2025.
MIT License