End-to-end, standards-first pipeline for EEG/iEEG × multi-omics biomarker discovery. Automates ingestion, robust EEG cleaning, rich derivatives, omics QC/batch correction, feature fusion, and model training with MLOps (configs, tracking, versioning). HPC-ready, reproducible, optionally privacy-preserving; supports local dev and clusters.
Standards & metadata: BIDS/BIDS-Derivatives (EEG/iEEG), ISA-JSON (omics), sidecar provenance.
EEG/iEEG features: PSD, specparam/FOOOF, microstates, connectivity, complexity, PAC.
Omics: genomics, transcriptomics, proteomics, metabolomics; normalization, imputation, ComBat/RUV batch correction, ID mapping (HGNC/Ensembl/UniProt/HMDB).
Integration & modeling: DIABLO (mixOmics), MOFA+, SNF, baselines (XGBoost/RF/SVM), optional early-fusion NN, stacking/late fusion with strict CV.
MLOps: Hydra configs, DVC pipelines, MLflow tracking, Snakemake/Nextflow orchestration, containers (Docker/Singularity).
Environments: local (subset) and HPC (SLURM arrays), per-subject/session parallelism, optional GPU for deep models, deterministic seeds, idempotent stages with caching.
neurofusion-eeg-omics/
├── README.md
├── LICENSE
├── pyproject.toml
├── requirements/
│ ├── base.txt
│ ├── eeg.txt
│ ├── omics.txt
│ └── dev.txt
├── environment/
│ ├── conda-lock.yml
│ └── renv.lock
├── containers/
│ ├── docker/
│ │ ├── base.Dockerfile
│ │ ├── eeg-omics.Dockerfile
│ │ └── r-mixomics.Dockerfile
│ └── singularity/
│ ├── eeg-omics.def
│ └── r-mixomics.def
├── conf/
│ ├── config.yaml
│ ├── paths.yaml
│ ├── data/
│ │ ├── bids.yaml
│ │ ├── omics.yaml
│ │ ├── isa.yaml
│ │ └── privacy.yaml
│ ├── eeg/
│ │ ├── preprocessing.yaml
│ │ ├── derivatives.yaml
│ │ ├── qc.yaml
│ │ └── reference_spaces.yaml
│ ├── omics/
│ │ ├── genomics.yaml
│ │ ├── transcriptomics.yaml
│ │ ├── proteomics.yaml
│ │ ├── metabolomics.yaml
│ │ ├── batch.yaml
│ │ └── id-map.yaml
│ ├── features/
│ │ ├── merge.yaml
│ │ ├── age-norm.yaml
│ │ ├── block-balance.yaml
│ │ └── export.yaml
│ ├── models/
│ │ ├── diablo.yaml
│ │ ├── mofa.yaml
│ │ ├── snf.yaml
│ │ ├── xgb.yaml
│ │ ├── rf.yaml
│ │ ├── svm.yaml
│ │ └── nn.yaml
│ ├── interpretation/
│ │ ├── shap.yaml
│ │ ├── haufe.yaml
│ │ └── pathway.yaml
│ ├── cv.yaml
│ ├── mlflow.yaml
│ ├── observability.yaml
│ ├── data-protection.yaml
│ ├── hydra/
│ │ ├── local-dev.yaml
│ │ ├── hpc-full.yaml
│ │ ├── cloud.yaml
│ │ └── smoke-test.yaml
│ └── hpc/
│ ├── slurm.yaml
│ ├── lsf.yaml
│ └── resources.yaml
├── src/
│ ├── __init__.py
│ ├── common/
│ │ ├── __init__.py
│ │ ├── io_utils.py
│ │ ├── schema_validation.py
│ │ ├── logging_utils.py
│ │ ├── mlflow_utils.py
│ │ ├── dvc_utils.py
│ │ ├── cache_utils.py
│ │ └── privacy_utils.py
│ ├── ingest/
│ │ ├── ingest_metadata.py
│ │ ├── validate_bids.py
│ │ ├── validate_isa.py
│ │ └── build_sample_map.py
│ ├── eeg/
│ │ ├── preprocess_eeg.py
│ │ ├── compute_derivatives.py
│ │ ├── eeg_qc_report.py
│ │ └── eeg_privacy_check.py
│ ├── omics/
│ │ ├── qc_genomics.py
│ │ ├── qc_transcriptomics.R
│ │ ├── qc_proteomics.py
│ │ ├── qc_metabolomics.py
│ │ ├── omics_batch_correct.R
│ │ ├── omics_id_map.py
│ │ └── omics_qc_report.py
│ ├── features/
│ │ ├── merge_features.py
│ │ ├── age_norm.py
│ │ ├── block_balance.py
│ │ ├── features_dictionary.py
│ │ └── export_feature_sets.py
│ ├── models/
│ │ ├── cv_splits.py
│ │ ├── train_models.py
│ │ ├── run_diablo.R
│ │ ├── run_mofa.py
│ │ ├── run_snf.py
│ │ ├── train_baselines.py
│ │ ├── stack_predictions.py
│ │ └── evaluate_metrics.py
│ └── interpretation/
│ ├── interpret_model.py
│ ├── shap_explain.py
│ ├── haufe_maps.py
│ └── pathway_summarize.py
├── workflows/
│ ├── Snakefile
│ ├── profiles/
│ │ ├── local
│ │ ├── slurm
│ │ └── hpc-gpu
│ ├── main.nf
│ ├── nextflow.config
│ └── slurm/
│ ├── template.sbatch
│ ├── array-preproc.sbatch
│ ├── array-omics.sbatch
│ └── gpu-models.sbatch
├── notebooks/
│ ├── exploratory/
│ ├── qa/
│ └── reports/
├── tests/
│ ├── unit/
│ ├── integration/
│ ├── regression/
│ ├── data/
│ │ ├── synthetic_eeg/
│ │ ├── golden_omics/
│ │ └── configs/
│ └── schemas/
├── reports/
│ ├── qc/
│ ├── modeling/
│ ├── interpretation/
│ └── governance/
├── docs/
│ ├── isa/
│ ├── bids/
│ ├── api/
│ └── operations/
├── scripts/
│ ├── bootstrap-data.sh
│ └── sync-mlflow.sh
├── data/ (gitignored)
└── mlruns/ (gitignored or remote)
Each script below is documented with: purpose, inputs → outputs (relative paths), CLI (local + SLURM), key Hydra config, algorithms/libraries, QC & observability, compute footprint, caching/idempotence, and validation.
Goal: Validate BIDS, validate ISA-JSON, and harmonize subject/sample IDs into a master sample map.
- Purpose: Merge raw registries with BIDS `participants.tsv` and ISA templates to produce harmonized participant tables and an ISA skeleton.
- Inputs → Outputs: `inputs/raw_metadata/*.csv`, `bids/participants.tsv`, `docs/isa/templates/*.json`, `conf/paths.yaml` → `interim/metadata/harmonized_participants.tsv`, `docs/isa/investigation.json`, `docs/isa/study.json`, `docs/isa/assay.json`, `reports/qc/metadata_diff.html`
- CLI:
  - Local → `python -m src.ingest.ingest_metadata --config-name=config metadata_csvs='[inputs/raw_metadata/subjects.csv,inputs/raw_metadata/sessions.csv]'`
  - SLURM → `sbatch workflows/slurm/template.sbatch --wrap="python -m src.ingest.ingest_metadata hydra/launcher=submitit_slurm"`
- Key Hydra: `data.metadata.raw_paths`, `ingest.participant_id_regex`, `ingest.merge.strategy`, `ingest.iso8601_columns`, `paths.docs.isa_root`
- Algorithms/libs: pandas, pandera, jsonschema, ISA-API for ISA-Tab/ISA-JSON read/write
- QC & logs: MLflow params (row counts, completeness, duplicates), diff HTML artifact
- Compute: CPU ≤2 cores, RAM ≤4 GB
- Idempotence & validation: DVC tracks inputs/config; schema checks; deterministic merges
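The harmonization step above boils down to normalizing heterogeneous subject IDs with a regex (`ingest.participant_id_regex`) and merging registries onto `participants.tsv`. A hedged sketch — `normalize_id`, `harmonize`, and the column names `subject`/`site` are illustrative, not the script's actual API:

```python
import re
import pandas as pd

def normalize_id(raw: str, pattern: str = r"(\d+)") -> str:
    """Extract the numeric core of a raw subject ID and format it BIDS-style."""
    match = re.search(pattern, str(raw))
    if match is None:
        raise ValueError(f"unparseable subject ID: {raw!r}")
    return f"sub-{int(match.group(1)):03d}"

def harmonize(registry: pd.DataFrame, participants: pd.DataFrame) -> pd.DataFrame:
    """Outer-merge a raw registry against BIDS participants.tsv on normalized IDs.

    indicator=True adds a `_merge` column flagging rows present only in the
    registry or only in participants.tsv — the basis of the diff report.
    """
    registry = registry.assign(participant_id=registry["subject"].map(normalize_id))
    return registry.merge(participants, on="participant_id", how="outer", indicator=True)
```

The `_merge` column ("left_only"/"right_only"/"both") feeds directly into a completeness check like the `metadata_diff.html` artifact.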
- Purpose: Run the BIDS Validator plus PyBIDS sanity checks on the dataset.
- Inputs → Outputs: `bids/` → `reports/qc/bids_validator.json`, `reports/qc/bids_summary.html`
- CLI:
  - Local → `python -m src.ingest.validate_bids bids_root=bids/ strict=true`
  - SLURM via submitit multirun
- Key Hydra: `data.bids.root`, `ingest.bids.strict`, `ingest.bids.max_issues`
- Algorithms/libs: BIDS Validator, PyBIDS
- QC & logs: error/warning counts, validator version
- Compute: CPU 2, RAM 4 GB; parallel by subject
- Idempotence & validation: output hashed by validator JSON; hard-fail unless `allow_errors=true`
- Purpose: Validate ISA-JSON/ISA-Tab and ontology references.
- Inputs → Outputs: `docs/isa/*.json` → `reports/qc/isa_validation.json`, `reports/qc/isa_report.html`
- CLI: `python -m src.ingest.validate_isa isa_root=docs/isa strict=true`
- Key Hydra: `data.isa.root`, `ingest.isa.required_ontologies`, `ingest.isa.schema`
- Algorithms/libs: ISA-API validators
- QC/Compute/Idempotence: lightweight; cached by JSON hash
- Purpose: Construct a subject–session–modality map aligning EEG/iEEG (BIDS) with all omics IDs; track consent flags.
- Inputs → Outputs: `interim/metadata/harmonized_participants.tsv`, `inputs/omics/*manifest.tsv`, `bids/` → `interim/mappings/sample_map.tsv`, `interim/mappings/subject_key.json`, `reports/qc/sample_map.html`
- CLI: `python -m src.ingest.build_sample_map map_strategy=inner include_controls=false` (arrayable per modality)
- Key Hydra: `ingest.mapping.strategy`, `ingest.mapping.key_columns`, `privacy.exclude_opt_out`
- Algorithms/libs: pandas, PyBIDS for BIDS lookups
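The inner-join mapping strategy with a consent filter (`privacy.exclude_opt_out`) can be sketched as follows — `build_sample_map` and the `consent`/`sample_id` column names are assumptions for illustration:

```python
import pandas as pd

def build_sample_map(participants: pd.DataFrame,
                     manifests: dict[str, pd.DataFrame],
                     exclude_opt_out: bool = True) -> pd.DataFrame:
    """Inner-join each omics manifest onto the participant table by participant_id.

    Opt-out participants are dropped before joining, so their IDs never
    appear in any downstream feature matrix.
    """
    sample_map = participants.copy()
    if exclude_opt_out and "consent" in sample_map.columns:
        sample_map = sample_map[sample_map["consent"] != "opt_out"]
    for modality, manifest in manifests.items():
        cols = manifest.rename(columns={"sample_id": f"{modality}_sample_id"})
        sample_map = sample_map.merge(cols, on="participant_id", how="inner")
    return sample_map.reset_index(drop=True)
```

With `how="inner"` only participants present in every manifest survive; an outer strategy would instead keep partial-modality participants with NaN sample IDs.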
Goal: Clean EEG/iEEG with MNE, then compute derivatives: PSD, spectral parameterization (specparam/FOOOF), microstates, connectivity, complexity, and PAC.
- Purpose: Band-filter, resample, re-reference, ICA/ASR, interpolation, segmentation → BIDS-Derivatives FIF + JSON provenance.
- Inputs → Outputs: BIDS EEG, `interim/mappings/sample_map.tsv`, montages, `conf/eeg/preprocessing.yaml` → `derivatives/eeg-preproc/*_desc-preproc_eeg.fif` + `_preproc.json`, `reports/qc/eeg/*_preproc_report.html`
- CLI:
  - Local (per subject) → `python -m src.eeg.preprocess_eeg subject=sub-001`
  - SLURM array via `workflows/slurm/array-preproc.sbatch`
- Key Hydra: `eeg.preprocessing.filter.{l_freq,h_freq,notch}`, `resample`, `reference`, `ica.method`, `artifact.removal`
- Algorithms/libs: MNE-Python, MNE-BIDS, optional PyPREP/autoreject
- QC & logs: bad-channel %, raw vs cleaned PSD, ICA components; MLflow run with figures
- Compute: 4–8 cores; 8–16 GB RAM
- Idempotence & validation: channel montage checks; DVC per subject
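For orientation, the Hydra keys above imply a `conf/eeg/preprocessing.yaml` shaped roughly like the fragment below. All values are illustrative, not project defaults:

```yaml
# conf/eeg/preprocessing.yaml — illustrative values only
filter:
  l_freq: 1.0        # high-pass cutoff (Hz)
  h_freq: 100.0      # low-pass cutoff (Hz)
  notch: [50, 100]   # line noise and first harmonic
resample:
  sfreq: 250         # target sampling rate (Hz)
reference: average
ica:
  method: fastica
artifact:
  removal: autoreject
```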
- Purpose: Unified CLI to compute PSD, specparam/FOOOF features, microstates, connectivity, complexity, and PAC.
- Inputs → Outputs: preprocessed FIF + configs → BIDS-compliant TSV/HDF5 + JSON sidecars; `reports/qc/eeg/*_derivatives.html`
- CLI:
  - Local → `python -m src.eeg.compute_derivatives subject=sub-001 eeg.derivatives='[psd,fooof,microstates,connectivity,complexity,pac]' fooof.max_n_peaks=6 microstates.n_states=4 connectivity.metrics='[coh,pli,pli2_unbiased]' pac.freq_pairs='[[6,10],[80,150]]'`
  - SLURM arrays across subjects/sessions
- Key Hydra: `fooof.*`, `microstates.n_states`, `connectivity.metrics`, `pac.method`, `complexity.metrics`
- Algorithms/libs: MNE, specparam/FOOOF, Pycrostates, mne-connectivity, Tensorpac
- QC & logs: FOOOF fit R², microstate GEV, connectivity sparsity, PAC comodulograms
- Compute: 8–16 cores; 16 GB RAM; optional GPU for clustering
- Idempotence & validation: `--only-missing`, NaN thresholds; DVC per derivative; connectivity computed via the dedicated mne-connectivity package
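The pipeline computes PSD features via MNE, but the core operation — Welch PSD followed by band-power integration — can be sketched with SciPy alone. `band_powers` and the band definitions are illustrative, not the script's API:

```python
import numpy as np
from scipy.signal import welch

# Canonical frequency bands (Hz); an assumption for this sketch.
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(signal: np.ndarray, sfreq: float) -> dict[str, float]:
    """Welch PSD, then integrate the spectrum over each band."""
    # 2-second windows give 0.5 Hz frequency resolution.
    freqs, psd = welch(signal, fs=sfreq, nperseg=int(2 * sfreq))
    df = freqs[1] - freqs[0]
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = float(psd[mask].sum() * df)  # rectangle-rule integral
    return powers
```

A 10 Hz oscillation should concentrate power in the alpha band; these band powers are what specparam/FOOOF then decomposes into periodic and aperiodic components.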
- Purpose: Aggregate subject/cohort EEG QC into HTML/PDF dashboards with thresholds and flags.
- Inputs → Outputs: derivative metrics + MLflow logs → `reports/qc/eeg/index.html`, `reports/qc/eeg/flags.csv`
- Algorithms/libs: pandas, plotly, `mne-report`
- Compute/Idempotence: lightweight; incremental merges
Goal: Per-modality normalization/QC, imputation, and batch correction (ComBat/RUV), with standardized identifiers.
- Purpose: SNP QC, ancestry inference, kinship → filtered PLINK files + dosage matrix.
- Inputs → Outputs: `inputs/omics/genomics/*.vcf{.gz}` + sample map → `derivatives/omics/genomics/qc/subset_plink.*`, `dosage.parquet`, `genomics_qc.html`
- CLI options: call-rate, HWE p-value, MAF, and LD-pruning thresholds; arrayable by chromosome
- Algorithms/libs: plink2, scikit-allel (optional Hail)
- QC/Compute/Idempotence: MLflow call rate/heterozygosity/PCs; DVC per chromosome
- Purpose: RNA-seq normalization & QC with edgeR/DESeq2/limma; batch-effect preview.
- Inputs → Outputs: raw counts + metadata → normalized counts (`.tsv`/`.rds`), QC plots
- CLI: choose normalization (TMM/DESeq2/voom), CPM filters, batch variables
- Algorithms/libs: edgeR, DESeq2, limma; sva for batch methods; BiocParallel
- QC/Compute/Idempotence: % features kept, mean–variance trends; parallel; DVC caches; biomaRt ID checks
- Purpose: Normalize proteomics (label-free/TMT), adaptive imputation, replicate concordance.
- Inputs → Outputs: intensities + metadata → normalized matrix, QC metrics
- CLI: transforms (log2/vsn), imputation (knn/bpca/min-det), batch variables
- Algorithms/libs: pandas, pyComBat, scikit-learn KNNImputer
- QC/Compute/Idempotence: CV%, replicate correlation; DVC; UniProt mapping coverage
- Purpose: Normalize targeted/untargeted metabolomics; pooled-QC & drift correction.
- Inputs → Outputs: vendor CSVs + metadata → normalized matrix, drift curves, QC metrics
- CLI: imputation (knn/rf/bpca), batch (ComBat/RUV), scale (pareto/auto/range), pooled-CV threshold
- Algorithms/libs: missForest/pyComBat; HMDB mapping
- QC/Compute/Idempotence: pooled CVs; DVC; mapping coverage
- Purpose: Apply ComBat (empirical Bayes) or RUV; emit diagnostics.
- Inputs → Outputs: normalized matrices + covariates → `batch_corrected.tsv`, variance components + plots
- CLI: method, covariates, reference batch
- Algorithms/libs: sva::ComBat, RUVSeq, limma
- QC/Compute/Idempotence: MLflow variance ratios pre/post; DVC; covariate presence checks
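The pipeline delegates to `sva::ComBat` in R; to convey the idea, here is a deliberately simplified location/scale adjustment in NumPy. It aligns each batch's per-feature mean and variance to the pooled values, but omits ComBat's empirical-Bayes shrinkage and covariate preservation — a sketch, not the method itself:

```python
import numpy as np

def batch_adjust(X: np.ndarray, batches: np.ndarray) -> np.ndarray:
    """Per-feature location/scale batch adjustment (samples × features).

    Each batch is standardized, then rescaled to the pooled mean/std.
    ComBat additionally shrinks the batch estimates via empirical Bayes.
    """
    X = np.asarray(X, dtype=float)
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0)
    out = np.empty_like(X)
    for b in np.unique(batches):
        rows = batches == b
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0)
        sd[sd == 0] = 1.0  # guard against constant features within a batch
        out[rows] = (X[rows] - mu) / sd * grand_std + grand_mean
    return out
```

The pre/post "variance ratio" diagnostic logged to MLflow compares how much feature variance the batch label explains before and after this adjustment.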
- Purpose: Standardize feature IDs (HGNC/Ensembl/UniProt/HMDB); build crosswalk dictionaries.
- Outputs: standardized matrices, `feature_id_map.tsv`, `ambiguous_ids.tsv`
- QC/Idempotence: coverage %, ambiguous counts; uniqueness enforced
- Purpose: Compile per-modality QC dashboards → HTML/PDF + governance summary.
- Outputs: `reports/qc/omics/omics_qc.html`/`.pdf`, `reports/governance/omics_summary.xlsx`
Goal: Join EEG derivatives with each omics block by participant_id to create a multi-block master matrix; add age-adjusted features and block balancing.
- Purpose: Align EEG + omics into `interim/features/master_matrix.parquet` with `block_metadata.json`.
- Config: alignment keys, block weights, missingness policy; Dask for scale
- Algorithms/libs: pandas, polars, Dask
- QC: feature counts, missingness, cross-block heatmaps
- Purpose: Compute age-adjusted z-scores or residuals per block (linear/GAM/LOESS)
- Algorithms/libs: statsmodels, pyGAM
- QC: distribution diagnostics per block
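The linear variant of age adjustment is just OLS residualization followed by z-scoring; GAM/LOESS variants swap in a smoother. A minimal sketch (`age_residualize` is a hypothetical helper):

```python
import numpy as np

def age_residualize(values: np.ndarray, age: np.ndarray) -> np.ndarray:
    """Regress a feature on age (intercept + linear term), return z-scored residuals."""
    design = np.column_stack([np.ones_like(age, dtype=float), age.astype(float)])
    beta, *_ = np.linalg.lstsq(design, values.astype(float), rcond=None)
    resid = values - design @ beta
    return (resid - resid.mean()) / resid.std()
```

By construction the result is uncorrelated with age, so downstream models cannot rediscover age as a trivial confound through this feature.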
- Purpose: Scale/weight blocks (variance / quantile / energy) to avoid dominance by high-dimensional omics
- Algorithms/libs: scikit-learn, mixOmics utilities
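One simple variance-based balancing scheme: scale each block so its total variance is 1, so a 20,000-feature transcriptomics block carries no more aggregate weight than a 200-feature EEG block. `balance_blocks` is an illustrative helper under that assumption:

```python
import numpy as np

def balance_blocks(blocks: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Scale each (samples × features) block to unit total variance."""
    balanced = {}
    for name, X in blocks.items():
        total_var = np.asarray(X, dtype=float).var(axis=0).sum()
        # Dividing by sqrt(total variance) scales variances by 1/total_var.
        balanced[name] = X / np.sqrt(total_var) if total_var > 0 else X
    return balanced
```

Quantile- or energy-based weighting follows the same pattern with a different per-block normalizer.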
- Purpose: Feature dictionary with provenance and ontology tags
- Outputs: `docs/features/*.{tsv,json,yaml}`
- Purpose: Export curated subsets (EEG-only, omics-only, task-specific) with privacy filters
- Outputs: `exports/features/{subset}.parquet`/`.json`
Goal: Compare supervised (DIABLO) and unsupervised (MOFA+) integration, graph fusion (SNFpy), plus baselines (XGBoost/RF/SVM) and optional early-fusion NN; strict CV; MLflow logging.
- Purpose: Leakage-safe (stratified/grouped) CV; nested repeats; seeds + fold manifests
- Outputs: `interim/cv/splits.pkl`, `leakage_checks.json`
- Purpose: Hydra dispatcher that runs DIABLO (R), MOFA+ (Python), SNF-based clustering, and baselines; per-fold MLflow logging + DVC checkpoints; SLURM fold-parallel.
- Inputs → Outputs: master matrix + config → `models/{model}/run-*/`, OOF predictions, `reports/modeling/{model}_fold_metrics.csv`
- Algorithms/libs: mixOmics/DIABLO, mofapy2 (MOFA+), snfpy, scikit-learn, XGBoost, PyTorch Lightning (optional)
- CLI:
  - Local → `python -m src.models.train_models model=diablo cv.repeats=5`
  - SLURM → `snakemake --profile workflows/profiles/slurm train_models model=xgb`
- QC & logs: per-fold PR-AUC/AUC/F1, confusion matrices; MLflow params/artifacts
- Compute: CPU/GPU as needed (GPU for NN); fold-parallel on cluster
- Idempotence & validation: split manifests frozen; preprocessing fit inside CV; DVC stage
- Purpose: Run DIABLO (multiblock sPLS-DA) per fold; export loadings, design matrix, performance.
- Outputs: `models/diablo/fold-*/loadings.tsv`, `plots/*.html`
- Purpose: Train MOFA+ factors; save factors/loadings + ELBO/variance explained.
- Outputs: `models/mofa/factors.parquet`, `models/mofa/loadings.parquet`, `reports/modeling/mofa_variance.html`
- Purpose: Build per-block affinities and run Similarity Network Fusion; cluster and export labels.
- Outputs: `models/snf/affinities/*.npy`, `models/snf/fused.npy`, `models/snf/labels.csv`
- Purpose: XGBoost/RF/SVM baselines with unified API; SHAP-ready artifacts.
- Outputs: `models/baselines/*`, OOF predictions, feature importance
- Purpose: Late-fusion stacking on OOF predictions (logistic/elastic-net meta-learner); calibration and uncertainty
- Purpose: Aggregate metrics across folds, compute CIs (e.g., bootstrap or DeLong), export cohort report
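The bootstrap CI option can be sketched as a percentile bootstrap over the AUC — resample subjects with replacement, recompute, take the outer quantiles. `bootstrap_auc_ci` is an illustrative helper, not the script's API:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the ROC AUC."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = []
    while len(aucs) < n_boot:
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # a valid resample must contain both classes
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

DeLong gives an analytic alternative for AUC specifically; the bootstrap generalizes to PR-AUC and F1 with the same resampling loop.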
Goal: Explain models and render EEG maps; summarize biology.
- Purpose: Unified entry: DIABLO loadings, MOFA factors, SHAP summaries, Haufe maps
- Outputs: `reports/interpretation/*` JSON + HTML
- Purpose: Global & local explanations using SHAP (tree/NN); interaction plots; background caching
- Algorithms/libs: SHAP, xgboost/torch
- Purpose: Compute the Haufe transform for linear EEG subspaces to obtain sensor-space, physiologically interpretable maps (per fold).
- Outputs: FIF derivatives + HTML topographies
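The Haufe transform itself is compact: a linear decoder's weights `w` become an activation pattern `a = Cov(X) w / Var(Xw)` (Haufe et al., 2014), which is the quantity that can be plotted as a topography. `haufe_pattern` is a minimal NumPy sketch:

```python
import numpy as np

def haufe_pattern(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Activation pattern for a linear decoder: a = Cov(X) w / Var(X w).

    Unlike the raw weights, the pattern is interpretable in sensor space:
    a large entry means the channel actually carries the decoded signal,
    not that it was merely used for noise suppression.
    """
    Xc = X - X.mean(axis=0)                 # center (samples × channels)
    cov = Xc.T @ Xc / (len(X) - 1)          # channel covariance
    y = Xc @ w                              # decoder output
    return cov @ w / y.var(ddof=1)
```

In the pipeline this would run per fold on the EEG block, with the resulting patterns written out as FIF derivatives and rendered as HTML topographies.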
- Purpose: Map feature importances to pathways (Reactome/KEGG/MSigDB); FDR-controlled enrichment; export ranked tables