End-to-end, standards-first pipeline for EEG/iEEG × multi-omics biomarker discovery. Automates ingestion, robust EEG cleaning, rich derivatives, omics QC/batch correction, feature fusion, and model training with MLOps (configs, tracking, versioning). HPC-ready, reproducible, optionally privacy-preserving; supports local dev and clusters.
Standards & metadata: BIDS/BIDS-Derivatives (EEG/iEEG), ISA-JSON (omics), sidecar provenance.
EEG/iEEG features: PSD, specparam/FOOOF, microstates, connectivity, complexity, PAC.
Omics: genomics, transcriptomics, proteomics, metabolomics; normalization, imputation, ComBat/RUV batch correction, ID mapping (HGNC/Ensembl/UniProt/HMDB).
Integration & modeling: DIABLO (mixOmics), MOFA+, SNF, baselines (XGBoost/RF/SVM), optional early-fusion NN, stacking/late fusion with strict CV.
MLOps: Hydra configs, DVC pipelines, MLflow tracking, Snakemake/Nextflow orchestration, containers (Docker/Singularity).
Environments: local (subset) and HPC (SLURM arrays), per-subject/session parallelism, optional GPU for deep models, deterministic seeds, idempotent stages with caching.
neurofusion-eeg-omics/
├── README.md
├── LICENSE
├── pyproject.toml
├── requirements/
│ ├── base.txt
│ ├── eeg.txt
│ ├── omics.txt
│ └── dev.txt
├── environment/
│ ├── conda-lock.yml
│ └── renv.lock
├── containers/
│ ├── docker/
│ │ ├── base.Dockerfile
│ │ ├── eeg-omics.Dockerfile
│ │ └── r-mixomics.Dockerfile
│ └── singularity/
│ ├── eeg-omics.def
│ └── r-mixomics.def
├── conf/
│ ├── config.yaml
│ ├── paths.yaml
│ ├── data/
│ │ ├── bids.yaml
│ │ ├── omics.yaml
│ │ ├── isa.yaml
│ │ └── privacy.yaml
│ ├── eeg/
│ │ ├── preprocessing.yaml
│ │ ├── derivatives.yaml
│ │ ├── qc.yaml
│ │ └── reference_spaces.yaml
│ ├── omics/
│ │ ├── genomics.yaml
│ │ ├── transcriptomics.yaml
│ │ ├── proteomics.yaml
│ │ ├── metabolomics.yaml
│ │ ├── batch.yaml
│ │ └── id-map.yaml
│ ├── features/
│ │ ├── merge.yaml
│ │ ├── age-norm.yaml
│ │ ├── block-balance.yaml
│ │ └── export.yaml
│ ├── models/
│ │ ├── diablo.yaml
│ │ ├── mofa.yaml
│ │ ├── snf.yaml
│ │ ├── xgb.yaml
│ │ ├── rf.yaml
│ │ ├── svm.yaml
│ │ └── nn.yaml
│ ├── interpretation/
│ │ ├── shap.yaml
│ │ ├── haufe.yaml
│ │ └── pathway.yaml
│ ├── cv.yaml
│ ├── mlflow.yaml
│ ├── observability.yaml
│ ├── data-protection.yaml
│ ├── hydra/
│ │ ├── local-dev.yaml
│ │ ├── hpc-full.yaml
│ │ ├── cloud.yaml
│ │ └── smoke-test.yaml
│ └── hpc/
│ ├── slurm.yaml
│ ├── lsf.yaml
│ └── resources.yaml
├── src/
│ ├── __init__.py
│ ├── common/
│ │ ├── __init__.py
│ │ ├── io_utils.py
│ │ ├── schema_validation.py
│ │ ├── logging_utils.py
│ │ ├── mlflow_utils.py
│ │ ├── dvc_utils.py
│ │ ├── cache_utils.py
│ │ └── privacy_utils.py
│ ├── ingest/
│ │ ├── ingest_metadata.py
│ │ ├── validate_bids.py
│ │ ├── validate_isa.py
│ │ └── build_sample_map.py
│ ├── eeg/
│ │ ├── preprocess_eeg.py
│ │ ├── compute_derivatives.py
│ │ ├── eeg_qc_report.py
│ │ └── eeg_privacy_check.py
│ ├── omics/
│ │ ├── qc_genomics.py
│ │ ├── qc_transcriptomics.R
│ │ ├── qc_proteomics.py
│ │ ├── qc_metabolomics.py
│ │ ├── omics_batch_correct.R
│ │ ├── omics_id_map.py
│ │ └── omics_qc_report.py
│ ├── features/
│ │ ├── merge_features.py
│ │ ├── age_norm.py
│ │ ├── block_balance.py
│ │ ├── features_dictionary.py
│ │ └── export_feature_sets.py
│ ├── models/
│ │ ├── cv_splits.py
│ │ ├── train_models.py
│ │ ├── run_diablo.R
│ │ ├── run_mofa.py
│ │ ├── run_snf.py
│ │ ├── train_baselines.py
│ │ ├── stack_predictions.py
│ │ └── evaluate_metrics.py
│ └── interpretation/
│ ├── interpret_model.py
│ ├── shap_explain.py
│ ├── haufe_maps.py
│ └── pathway_summarize.py
├── workflows/
│ ├── Snakefile
│ ├── profiles/
│ │ ├── local
│ │ ├── slurm
│ │ └── hpc-gpu
│ ├── main.nf
│ ├── nextflow.config
│ └── slurm/
│ ├── template.sbatch
│ ├── array-preproc.sbatch
│ ├── array-omics.sbatch
│ └── gpu-models.sbatch
├── notebooks/
│ ├── exploratory/
│ ├── qa/
│ └── reports/
├── tests/
│ ├── unit/
│ ├── integration/
│ ├── regression/
│ ├── data/
│ │ ├── synthetic_eeg/
│ │ ├── golden_omics/
│ │ └── configs/
│ └── schemas/
├── reports/
│ ├── qc/
│ ├── modeling/
│ ├── interpretation/
│ └── governance/
├── docs/
│ ├── isa/
│ ├── bids/
│ ├── api/
│ └── operations/
├── scripts/
│ ├── bootstrap-data.sh
│ └── sync-mlflow.sh
├── data/ (gitignored)
└── mlruns/ (gitignored or remote)
Each script below is documented with: purpose, inputs → outputs (relative paths), CLI (local + SLURM), key Hydra config, algorithms/libraries, QC & observability, compute footprint, caching/idempotence, and validation.
Goal: Validate BIDS, validate ISA-JSON, and harmonize subject/sample IDs into a master sample map.
- Purpose: Merge raw registries with BIDS `participants.tsv` and ISA templates to produce harmonized participant tables and an ISA skeleton.
- Inputs → Outputs: `inputs/raw_metadata/*.csv`, `bids/participants.tsv`, `docs/isa/templates/*.json`, `conf/paths.yaml` → `interim/metadata/harmonized_participants.tsv`, `docs/isa/investigation.json`, `docs/isa/study.json`, `docs/isa/assay.json`, `reports/qc/metadata_diff.html`
- CLI:
  - Local → `python -m src.ingest.ingest_metadata --config-name=config metadata_csvs='[inputs/raw_metadata/subjects.csv,inputs/raw_metadata/sessions.csv]'`
  - SLURM → `sbatch workflows/slurm/template.sbatch --wrap="python -m src.ingest.ingest_metadata hydra/launcher=submitit_slurm"`
- Key Hydra: `data.metadata.raw_paths`, `ingest.participant_id_regex`, `ingest.merge.strategy`, `ingest.iso8601_columns`, `paths.docs.isa_root`
- Algorithms/libs: pandas, pandera, jsonschema, ISA-API for ISA-Tab/ISA-JSON read/write
- QC & logs: MLflow params (row counts, completeness, duplicates), diff HTML artifact
- Compute: CPU ≤2 cores, RAM ≤4 GB
- Idempotence & validation: DVC tracks inputs/config; schema checks; deterministic merges
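The harmonization step above boils down to normalizing heterogeneous subject IDs with a regex (`ingest.participant_id_regex`) and merging registries onto `participants.tsv`. A hedged sketch — `normalize_id`, `harmonize`, and the column names `subject`/`site` are illustrative, not the script's actual API:

```python
import re
import pandas as pd

def normalize_id(raw: str, pattern: str = r"(\d+)") -> str:
    """Extract the numeric core of a raw subject ID and format it BIDS-style."""
    match = re.search(pattern, str(raw))
    if match is None:
        raise ValueError(f"unparseable subject ID: {raw!r}")
    return f"sub-{int(match.group(1)):03d}"

def harmonize(registry: pd.DataFrame, participants: pd.DataFrame) -> pd.DataFrame:
    """Outer-merge a raw registry against BIDS participants.tsv on normalized IDs.

    indicator=True adds a `_merge` column flagging rows present only in the
    registry or only in participants.tsv — the basis of the diff report.
    """
    registry = registry.assign(participant_id=registry["subject"].map(normalize_id))
    return registry.merge(participants, on="participant_id", how="outer", indicator=True)
```

The `_merge` column ("left_only"/"right_only"/"both") feeds directly into a completeness check like the `metadata_diff.html` artifact.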
- Purpose: Run the BIDS Validator plus PyBIDS sanity checks on the dataset.
- Inputs → Outputs: `bids/` → `reports/qc/bids_validator.json`, `reports/qc/bids_summary.html`
- CLI:
  - Local → `python -m src.ingest.validate_bids bids_root=bids/ strict=true`
  - SLURM via submitit multirun
- Key Hydra: `data.bids.root`, `ingest.bids.strict`, `ingest.bids.max_issues`
- Algorithms/libs: BIDS Validator, PyBIDS
- QC & logs: error/warning counts, validator version
- Compute: CPU 2, RAM 4 GB; parallel by subject
- Idempotence & validation: output hashed by validator JSON; hard-fail unless `allow_errors=true`
- Purpose: Validate ISA-JSON/ISA-Tab and ontology references.
- Inputs → Outputs: `docs/isa/*.json` → `reports/qc/isa_validation.json`, `reports/qc/isa_report.html`
- CLI: `python -m src.ingest.validate_isa isa_root=docs/isa strict=true`
- Key Hydra: `data.isa.root`, `ingest.isa.required_ontologies`, `ingest.isa.schema`
- Algorithms/libs: ISA-API validators
- QC/Compute/Idempotence: lightweight; cached by JSON hash
- Purpose: Construct a subject–session–modality map aligning EEG/iEEG (BIDS) with all omics IDs; track consent flags.
- Inputs → Outputs: `interim/metadata/harmonized_participants.tsv`, `inputs/omics/*manifest.tsv`, `bids/` → `interim/mappings/sample_map.tsv`, `interim/mappings/subject_key.json`, `reports/qc/sample_map.html`
- CLI: `python -m src.ingest.build_sample_map map_strategy=inner include_controls=false` (arrayable per modality)
- Key Hydra: `ingest.mapping.strategy`, `ingest.mapping.key_columns`, `privacy.exclude_opt_out`
- Algorithms/libs: pandas, PyBIDS for BIDS lookups
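The inner-join mapping strategy with a consent filter (`privacy.exclude_opt_out`) can be sketched as follows — `build_sample_map` and the `consent`/`sample_id` column names are assumptions for illustration:

```python
import pandas as pd

def build_sample_map(participants: pd.DataFrame,
                     manifests: dict[str, pd.DataFrame],
                     exclude_opt_out: bool = True) -> pd.DataFrame:
    """Inner-join each omics manifest onto the participant table by participant_id.

    Opt-out participants are dropped before joining, so their IDs never
    appear in any downstream feature matrix.
    """
    sample_map = participants.copy()
    if exclude_opt_out and "consent" in sample_map.columns:
        sample_map = sample_map[sample_map["consent"] != "opt_out"]
    for modality, manifest in manifests.items():
        cols = manifest.rename(columns={"sample_id": f"{modality}_sample_id"})
        sample_map = sample_map.merge(cols, on="participant_id", how="inner")
    return sample_map.reset_index(drop=True)
```

With `how="inner"` only participants present in every manifest survive; an outer strategy would instead keep partial-modality participants with NaN sample IDs.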
Goal: Clean EEG/iEEG with MNE, then compute derivatives: PSD, spectral parameterization (specparam/FOOOF), microstates, connectivity, complexity, and PAC.
- Purpose: Band-filter, resample, re-reference, ICA/ASR, interpolation, segmentation → BIDS-Derivatives FIF + JSON provenance.
- Inputs → Outputs: BIDS EEG, `interim/mappings/sample_map.tsv`, montages, `conf/eeg/preprocessing.yaml` → `derivatives/eeg-preproc/*_desc-preproc_eeg.fif` + `_preproc.json`, `reports/qc/eeg/*_preproc_report.html`
- CLI:
  - Local (per subject) → `python -m src.eeg.preprocess_eeg subject=sub-001`
  - SLURM array via `workflows/slurm/array-preproc.sbatch`
- Key Hydra: `eeg.preprocessing.filter.{l_freq,h_freq,notch}`, `resample`, `reference`, `ica.method`, `artifact.removal`
- Algorithms/libs: MNE-Python, MNE-BIDS, optional PyPREP/autoreject
- QC & logs: bad-channel %, raw vs cleaned PSD, ICA components; MLflow run with figures
- Compute: 4–8 cores; 8–16 GB RAM
- Idempotence & validation: channel montage checks; DVC per subject
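For orientation, the Hydra keys above imply a `conf/eeg/preprocessing.yaml` shaped roughly like the fragment below. All values are illustrative, not project defaults:

```yaml
# conf/eeg/preprocessing.yaml — illustrative values only
filter:
  l_freq: 1.0        # high-pass cutoff (Hz)
  h_freq: 100.0      # low-pass cutoff (Hz)
  notch: [50, 100]   # line noise and first harmonic
resample:
  sfreq: 250         # target sampling rate (Hz)
reference: average
ica:
  method: fastica
artifact:
  removal: autoreject
```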
- Purpose: Unified CLI to compute PSD, specparam/FOOOF features, microstates, connectivity, complexity, and PAC.
- Inputs → Outputs: preprocessed FIF + configs → BIDS-compliant TSV/HDF5 + JSON sidecars; `reports/qc/eeg/*_derivatives.html`
- CLI:
  - Local → `python -m src.eeg.compute_derivatives subject=sub-001 eeg.derivatives='[psd,fooof,microstates,connectivity,complexity,pac]' fooof.max_n_peaks=6 microstates.n_states=4 connectivity.metrics='[coh,pli,pli2_unbiased]' pac.freq_pairs='[[6,10],[80,150]]'`
  - SLURM arrays across subjects/sessions
- Key Hydra: `fooof.*`, `microstates.n_states`, `connectivity.metrics`, `pac.method`, `complexity.metrics`
- Algorithms/libs: MNE, specparam/FOOOF, Pycrostates, mne-connectivity, Tensorpac
- QC & logs: FOOOF fit R², microstate GEV, connectivity sparsity, PAC comodulograms
- Compute: 8–16 cores; 16 GB RAM; optional GPU for clustering
- Idempotence & validation: `--only-missing`, NaN thresholds; DVC per derivative; connectivity computed via the dedicated mne-connectivity package
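The pipeline computes PSD features via MNE, but the core operation — Welch PSD followed by band-power integration — can be sketched with SciPy alone. `band_powers` and the band definitions are illustrative, not the script's API:

```python
import numpy as np
from scipy.signal import welch

# Canonical frequency bands (Hz); an assumption for this sketch.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(signal: np.ndarray, sfreq: float) -> dict[str, float]:
    """Welch PSD, then integrate the spectrum over each band."""
    # 2-second windows give 0.5 Hz frequency resolution.
    freqs, psd = welch(signal, fs=sfreq, nperseg=int(2 * sfreq))
    df = freqs[1] - freqs[0]
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = float(psd[mask].sum() * df)  # rectangle-rule integral
    return powers
```

A 10 Hz oscillation should concentrate power in the alpha band; these band powers are what specparam/FOOOF then decomposes into periodic and aperiodic components.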
- Purpose: Aggregate subject/cohort EEG QC into HTML/PDF dashboards with thresholds and flags.
- Inputs → Outputs: derivative metrics + MLflow logs → `reports/qc/eeg/index.html`, `reports/qc/eeg/flags.csv`
- Algorithms/libs: pandas, plotly, `mne-report`
- Compute/Idempotence: lightweight; incremental merges
Goal: Per-modality normalization/QC, imputation, and batch correction (ComBat/RUV), with standardized identifiers.
- Purpose: SNP QC, ancestry inference, kinship → filtered PLINK files + dosage matrix.
- Inputs → Outputs: `inputs/omics/genomics/*.vcf{.gz}` + sample map → `derivatives/omics/genomics/qc/subset_plink.*`, `dosage.parquet`, `genomics_qc.html`
- CLI options: call-rate, HWE p-value, MAF, and LD-pruning thresholds; arrayable by chromosome
- Algorithms/libs: plink2, scikit-allel (optional Hail)
- QC/Compute/Idempotence: MLflow call rate/heterozygosity/PCs; DVC per chromosome
- Purpose: RNA-seq normalization & QC with edgeR/DESeq2/limma; batch-effect preview.
- Inputs → Outputs: raw counts + metadata → normalized counts (`.tsv`/`.rds`), QC plots
- CLI: choose normalization (TMM/DESeq2/voom), CPM filters, batch variables
- Algorithms/libs: edgeR, DESeq2, limma; sva for batch methods; BiocParallel
- QC/Compute/Idempotence: % features kept, mean–variance trends; parallel; DVC caches; biomaRt ID checks
- Purpose: Normalize proteomics (label-free/TMT), adaptive imputation, replicate concordance.
- Inputs → Outputs: intensities + metadata → normalized matrix, QC metrics
- CLI: transforms (log2/vsn), imputation (knn/bpca/min-det), batch variables
- Algorithms/libs: pandas, pyComBat, scikit-learn KNNImputer
- QC/Compute/Idempotence: CV%, replicate correlation; DVC; UniProt mapping coverage
- Purpose: Normalize targeted/untargeted metabolomics; pooled-QC & drift correction.
- Inputs → Outputs: vendor CSVs + metadata → normalized matrix, drift curves, QC metrics
- CLI: imputation (knn/rf/bpca), batch (ComBat/RUV), scale (pareto/auto/range), pooled-CV threshold
- Algorithms/libs: missForest/pyComBat; HMDB mapping
- QC/Compute/Idempotence: pooled CVs; DVC; mapping coverage
- Purpose: Apply ComBat (empirical Bayes) or RUV; emit diagnostics.
- Inputs → Outputs: normalized matrices + covariates → `batch_corrected.tsv`, variance components + plots
- CLI: method, covariates, reference batch
- Algorithms/libs: sva::ComBat, RUVSeq, limma
- QC/Compute/Idempotence: MLflow variance ratios pre/post; DVC; covariate presence checks
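The pipeline delegates to `sva::ComBat` in R; to convey the idea, here is a deliberately simplified location/scale adjustment in NumPy. It aligns each batch's per-feature mean and variance to the pooled values, but omits ComBat's empirical-Bayes shrinkage and covariate preservation — a sketch, not the method itself:

```python
import numpy as np

def batch_adjust(X: np.ndarray, batches: np.ndarray) -> np.ndarray:
    """Per-feature location/scale batch adjustment (samples × features).

    Each batch is standardized, then rescaled to the pooled mean/std.
    ComBat additionally shrinks the batch estimates via empirical Bayes.
    """
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0)
    out = np.empty_like(X)
    for b in np.unique(batches):
        rows = batches == b
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0)
        sd[sd == 0] = 1.0  # guard against constant features within a batch
        out[rows] = (X[rows] - mu) / sd * grand_std + grand_mean
    return out
```

The pre/post "variance ratio" diagnostic logged to MLflow compares how much feature variance the batch label explains before and after this adjustment.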
- Purpose: Standardize feature IDs (HGNC/Ensembl/UniProt/HMDB); build crosswalk dictionaries.
- Outputs: standardized matrices, `feature_id_map.tsv`, `ambiguous_ids.tsv`
- QC/Idempotence: coverage %, ambiguous counts; uniqueness enforced
- Purpose: Compile per-modality QC dashboards → HTML/PDF + governance summary.
- Outputs: `reports/qc/omics/omics_qc.html`/`.pdf`, `reports/governance/omics_summary.xlsx`
Goal: Join EEG derivatives with each omics block by participant_id to create a multi-block master matrix; add age-adjusted features and block balancing.
- Purpose: Align EEG + omics into `interim/features/master_matrix.parquet` with `block_metadata.json`.
- Config: alignment keys, block weights, missingness policy; Dask for scale
- Algorithms/libs: pandas, polars, Dask
- QC: feature counts, missingness, cross-block heatmaps
- Purpose: Compute age-adjusted z-scores or residuals per block (linear/GAM/LOESS)
- Algorithms/libs: statsmodels, pyGAM
- QC: distribution diagnostics per block
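The linear variant of age adjustment is just OLS residualization followed by z-scoring; GAM/LOESS variants swap in a smoother. A minimal sketch (`age_residualize` is a hypothetical helper):

```python
import numpy as np

def age_residualize(values: np.ndarray, age: np.ndarray) -> np.ndarray:
    """Regress a feature on age (intercept + linear term), return z-scored residuals."""
    design = np.column_stack([np.ones_like(age, dtype=float), age.astype(float)])
    beta, *_ = np.linalg.lstsq(design, values.astype(float), rcond=None)
    resid = values - design @ beta
    return (resid - resid.mean()) / resid.std()
```

By construction the result is uncorrelated with age, so downstream models cannot rediscover age as a trivial confound through this feature.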
- Purpose: Scale/weight blocks (variance / quantile / energy) to avoid dominance by high-dimensional omics
- Algorithms/libs: scikit-learn, mixOmics utilities
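One simple variance-based balancing scheme: scale each block so its total variance is 1, so a 20,000-feature transcriptomics block carries no more aggregate weight than a 200-feature EEG block. `balance_blocks` is an illustrative helper under that assumption:

```python
import numpy as np

def balance_blocks(blocks: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Scale each (samples × features) block to unit total variance."""
    balanced = {}
    for name, X in blocks.items():
        total_var = np.asarray(X, dtype=float).var(axis=0).sum()
        # Dividing by sqrt(total variance) scales variances by 1/total_var.
        balanced[name] = X / np.sqrt(total_var) if total_var > 0 else X
    return balanced
```

Quantile- or energy-based weighting follows the same pattern with a different per-block normalizer.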
- Purpose: Feature dictionary with provenance and ontology tags
- Outputs: `docs/features/*.{tsv,json,yaml}`
- Purpose: Export curated subsets (EEG-only, omics-only, task-specific) with privacy filters
- Outputs: `exports/features/{subset}.parquet`/`.json`
Goal: Compare supervised (DIABLO) and unsupervised (MOFA+) integration, graph fusion (SNFpy), plus baselines (XGBoost/RF/SVM) and optional early-fusion NN; strict CV; MLflow logging.
- Purpose: Leakage-safe (stratified/grouped) CV; nested repeats; seeds + fold manifests
- Outputs: `interim/cv/splits.pkl`, `leakage_checks.json`
- Purpose: Hydra dispatcher that runs DIABLO (R), MOFA+ (Python), SNF-based clustering, and baselines; per-fold MLflow logging + DVC checkpoints; SLURM fold-parallel.
- Inputs → Outputs: master matrix + config → `models/{model}/run-*/`, OOF predictions, `reports/modeling/{model}_fold_metrics.csv`
- Algorithms/libs: mixOmics/DIABLO, mofapy2 (MOFA+), snfpy, scikit-learn, XGBoost, PyTorch Lightning (optional)
- CLI:
  - Local → `python -m src.models.train_models model=diablo cv.repeats=5`
  - SLURM → `snakemake --profile workflows/profiles/slurm train_models model=xgb`
- QC & logs: per-fold PR-AUC/AUC/F1, confusion matrices; MLflow params/artifacts
- Compute: CPU/GPU as needed (GPU for NN); fold-parallel on cluster
- Idempotence & validation: split manifests frozen; preprocessing fit inside CV; DVC stage
- Purpose: Run DIABLO (multiblock sPLS-DA) per fold; export loadings, design matrix, performance.
- Outputs: `models/diablo/fold-*/loadings.tsv`, `plots/*.html`
- Purpose: Train MOFA+ factors; save factors/loadings + ELBO/variance explained.
- Outputs: `models/mofa/factors.parquet`, `models/mofa/loadings.parquet`, `reports/modeling/mofa_variance.html`
- Purpose: Build per-block affinities and run Similarity Network Fusion; cluster and export labels.
- Outputs: `models/snf/affinities/*.npy`, `models/snf/fused.npy`, `models/snf/labels.csv`
- Purpose: XGBoost/RF/SVM baselines with unified API; SHAP-ready artifacts.
- Outputs: `models/baselines/*`, OOF predictions, feature importance
- Purpose: Late-fusion stacking on OOF predictions (logistic/elastic-net meta-learner); calibration and uncertainty
- Purpose: Aggregate metrics across folds, compute CIs (e.g., bootstrap or DeLong), export cohort report
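The bootstrap CI option can be sketched as a percentile bootstrap over the AUC — resample subjects with replacement, recompute, take the outer quantiles. `bootstrap_auc_ci` is an illustrative helper, not the script's API:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the ROC AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # a valid resample must contain both classes
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

DeLong gives an analytic alternative for AUC specifically; the bootstrap generalizes to PR-AUC and F1 with the same resampling loop.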
Goal: Explain models and render EEG maps; summarize biology.
- Purpose: Unified entry: DIABLO loadings, MOFA factors, SHAP summaries, Haufe maps
- Outputs: `reports/interpretation/*` JSON + HTML
- Purpose: Global & local explanations using SHAP (tree/NN); interaction plots; background caching
- Algorithms/libs: SHAP, xgboost/torch
- Purpose: Compute the Haufe transform for linear EEG subspaces to obtain sensor-space, physiologically interpretable maps (per fold).
- Outputs: FIF derivatives + HTML topographies
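The Haufe transform itself is compact: a linear decoder's weights `w` become an activation pattern `a = Cov(X) w / Var(Xw)` (Haufe et al., 2014), which is the quantity that can be plotted as a topography. `haufe_pattern` is a minimal NumPy sketch:

```python
import numpy as np

def haufe_pattern(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Activation pattern for a linear decoder: a = Cov(X) w / Var(X w).

    Unlike the raw weights, the pattern is interpretable in sensor space:
    a large entry means the channel actually carries the decoded signal,
    not that it was merely used for noise suppression.
    """
    Xc = X - X.mean(axis=0)                 # center (samples × channels)
    cov = Xc.T @ Xc / (len(X) - 1)          # channel covariance
    y = Xc @ w                              # decoder output
    return cov @ w / y.var(ddof=1)
```

In the pipeline this would run per fold on the EEG block, with the resulting patterns written out as FIF derivatives and rendered as HTML topographies.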
- Purpose: Map feature importances to pathways (Reactome/KEGG/MSigDB); FDR-controlled enrichment; export ranked tables