SidSin0809/neurofusion-eeg-omics
neurofusion-eeg-omics

End-to-end, standards-first pipeline for EEG/iEEG × multi-omics biomarker discovery. Automates ingestion, robust EEG cleaning, rich derivatives, omics QC/batch correction, feature fusion, and model training with MLOps (configs, tracking, versioning). HPC-ready, reproducible, optionally privacy-preserving; supports local dev and clusters.

Highlights

Standards & metadata: BIDS/BIDS-Derivatives (EEG/iEEG), ISA-JSON (omics), sidecar provenance.

EEG/iEEG features: PSD, specparam/FOOOF, microstates, connectivity, complexity, PAC.

Omics: genomics, transcriptomics, proteomics, metabolomics; normalization, imputation, ComBat/RUV batch correction, ID mapping (HGNC/Ensembl/UniProt/HMDB).

Integration & modeling: DIABLO (mixOmics), MOFA+, SNF, baselines (XGBoost/RF/SVM), optional early-fusion NN, stacking/late fusion with strict CV.

MLOps: Hydra configs, DVC pipelines, MLflow tracking, Snakemake/Nextflow orchestration, containers (Docker/Singularity).

Environments: local (subset) and HPC (SLURM arrays), per-subject/session parallelism, optional GPU for deep models, deterministic seeds, idempotent stages with caching.

Repository Layout

neurofusion-eeg-omics/
├── README.md
├── LICENSE
├── pyproject.toml
├── requirements/
│   ├── base.txt
│   ├── eeg.txt
│   ├── omics.txt
│   └── dev.txt
├── environment/
│   ├── conda-lock.yml
│   └── renv.lock
├── containers/
│   ├── docker/
│   │   ├── base.Dockerfile
│   │   ├── eeg-omics.Dockerfile
│   │   └── r-mixomics.Dockerfile
│   └── singularity/
│       ├── eeg-omics.def
│       └── r-mixomics.def
├── conf/
│   ├── config.yaml
│   ├── paths.yaml
│   ├── data/
│   │   ├── bids.yaml
│   │   ├── omics.yaml
│   │   ├── isa.yaml
│   │   └── privacy.yaml
│   ├── eeg/
│   │   ├── preprocessing.yaml
│   │   ├── derivatives.yaml
│   │   ├── qc.yaml
│   │   └── reference_spaces.yaml
│   ├── omics/
│   │   ├── genomics.yaml
│   │   ├── transcriptomics.yaml
│   │   ├── proteomics.yaml
│   │   ├── metabolomics.yaml
│   │   ├── batch.yaml
│   │   └── id-map.yaml
│   ├── features/
│   │   ├── merge.yaml
│   │   ├── age-norm.yaml
│   │   ├── block-balance.yaml
│   │   └── export.yaml
│   ├── models/
│   │   ├── diablo.yaml
│   │   ├── mofa.yaml
│   │   ├── snf.yaml
│   │   ├── xgb.yaml
│   │   ├── rf.yaml
│   │   ├── svm.yaml
│   │   └── nn.yaml
│   ├── interpretation/
│   │   ├── shap.yaml
│   │   ├── haufe.yaml
│   │   └── pathway.yaml
│   ├── cv.yaml
│   ├── mlflow.yaml
│   ├── observability.yaml
│   ├── data-protection.yaml
│   ├── hydra/
│   │   ├── local-dev.yaml
│   │   ├── hpc-full.yaml
│   │   ├── cloud.yaml
│   │   └── smoke-test.yaml
│   └── hpc/
│       ├── slurm.yaml
│       ├── lsf.yaml
│       └── resources.yaml
├── src/
│   ├── __init__.py
│   ├── common/
│   │   ├── __init__.py
│   │   ├── io_utils.py
│   │   ├── schema_validation.py
│   │   ├── logging_utils.py
│   │   ├── mlflow_utils.py
│   │   ├── dvc_utils.py
│   │   ├── cache_utils.py
│   │   └── privacy_utils.py
│   ├── ingest/
│   │   ├── ingest_metadata.py
│   │   ├── validate_bids.py
│   │   ├── validate_isa.py
│   │   └── build_sample_map.py
│   ├── eeg/
│   │   ├── preprocess_eeg.py
│   │   ├── compute_derivatives.py
│   │   ├── eeg_qc_report.py
│   │   └── eeg_privacy_check.py
│   ├── omics/
│   │   ├── qc_genomics.py
│   │   ├── qc_transcriptomics.R
│   │   ├── qc_proteomics.py
│   │   ├── qc_metabolomics.py
│   │   ├── omics_batch_correct.R
│   │   ├── omics_id_map.py
│   │   └── omics_qc_report.py
│   ├── features/
│   │   ├── merge_features.py
│   │   ├── age_norm.py
│   │   ├── block_balance.py
│   │   ├── features_dictionary.py
│   │   └── export_feature_sets.py
│   ├── models/
│   │   ├── cv_splits.py
│   │   ├── train_models.py
│   │   ├── run_diablo.R
│   │   ├── run_mofa.py
│   │   ├── run_snf.py
│   │   ├── train_baselines.py
│   │   ├── stack_predictions.py
│   │   └── evaluate_metrics.py
│   └── interpretation/
│       ├── interpret_model.py
│       ├── shap_explain.py
│       ├── haufe_maps.py
│       └── pathway_summarize.py
├── workflows/
│   ├── Snakefile
│   ├── profiles/
│   │   ├── local
│   │   ├── slurm
│   │   └── hpc-gpu
│   ├── main.nf
│   ├── nextflow.config
│   └── slurm/
│       ├── template.sbatch
│       ├── array-preproc.sbatch
│       ├── array-omics.sbatch
│       └── gpu-models.sbatch
├── notebooks/
│   ├── exploratory/
│   ├── qa/
│   └── reports/
├── tests/
│   ├── unit/
│   ├── integration/
│   ├── regression/
│   ├── data/
│   │   ├── synthetic_eeg/
│   │   ├── golden_omics/
│   │   └── configs/
│   └── schemas/
├── reports/
│   ├── qc/
│   ├── modeling/
│   ├── interpretation/
│   └── governance/
├── docs/
│   ├── isa/
│   ├── bids/
│   ├── api/
│   └── operations/
├── scripts/
│   ├── bootstrap-data.sh
│   └── sync-mlflow.sh
├── data/ (gitignored)
└── mlruns/ (gitignored or remote)

Stage-wise Script Catalogue (local + HPC, standards-first)

Each script entry below documents: purpose, inputs → outputs (relative paths), CLI (local + SLURM), key Hydra config, algorithms/libs, QC & logs, compute, and idempotence & validation.


Stage 1 — Dataset ingestion & metadata alignment

Goal: Validate BIDS, validate ISA-JSON, and harmonize subject/sample IDs into a master sample map.

src/ingest/ingest_metadata.py

  • Purpose: Merge raw registries with BIDS participants.tsv and ISA to produce harmonized participant tables and an ISA skeleton.
  • Inputs → Outputs:
    inputs/raw_metadata/*.csv, bids/participants.tsv, docs/isa/templates/*.json, conf/paths.yaml
    interim/metadata/harmonized_participants.tsv, docs/isa/investigation.json, docs/isa/study.json, docs/isa/assay.json, reports/qc/metadata_diff.html
  • CLI:
    Local → python -m src.ingest.ingest_metadata --config-name=config metadata_csvs='[inputs/raw_metadata/subjects.csv,inputs/raw_metadata/sessions.csv]'
    SLURM → sbatch workflows/slurm/template.sbatch --wrap="python -m src.ingest.ingest_metadata hydra/launcher=submitit_slurm"
  • Key Hydra: data.metadata.raw_paths, ingest.participant_id_regex, ingest.merge.strategy, ingest.iso8601_columns, paths.docs.isa_root
  • Algorithms/libs: pandas, pandera, jsonschema, ISA-API for ISA-Tab/ISA-JSON read/write.
  • QC & logs: MLflow params (row counts, completeness, dupes), diff HTML artifact
  • Compute: CPU ≤2 cores, RAM ≤4 GB
  • Idempotence & validation: DVC tracks inputs/config; schema checks; deterministic merges.
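
  The merge step can be sketched with plain pandas (column names and ID patterns here are hypothetical, not the pipeline's actual schema):

  ```python
  import pandas as pd

  # Hypothetical inputs: a raw registry CSV and a BIDS participants.tsv.
  registry = pd.DataFrame({
      "subject": ["S001", "S002", "S003"],
      "age": [34, 27, 41],
  })
  participants = pd.DataFrame({
      "participant_id": ["sub-001", "sub-002"],
      "sex": ["F", "M"],
  })

  # Normalize IDs to the BIDS "sub-XXX" convention before merging.
  registry["participant_id"] = (
      registry["subject"].str.replace(r"^S", "sub-", regex=True)
  )

  # Outer merge with an indicator column so mismatches surface in the QC diff.
  merged = registry.merge(participants, on="participant_id",
                          how="outer", indicator=True)
  unmatched = merged[merged["_merge"] != "both"]
  print(len(unmatched))  # subjects present in only one registry
  ```

  The `indicator=True` column is what feeds the metadata diff report: rows marked `left_only`/`right_only` are the completeness failures logged to MLflow.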

src/ingest/validate_bids.py

  • Purpose: Run BIDS Validator + PyBIDS sanity checks on the dataset.
  • Inputs → Outputs: bids/ → reports/qc/bids_validator.json, reports/qc/bids_summary.html
  • CLI: Local → python -m src.ingest.validate_bids bids_root=bids/ strict=true · SLURM via submitit multirun
  • Key Hydra: data.bids.root, ingest.bids.strict, ingest.bids.max_issues
  • Algorithms/libs: BIDS Validator, PyBIDS.
  • QC & logs: error/warning counts, validator version
  • Compute: CPU 2, RAM 4 GB; parallel by subject
  • Idempotence & validation: Output hashed by validator JSON; hard-fail unless allow_errors=true.

src/ingest/validate_isa.py

  • Purpose: Validate ISA-JSON/ISA-Tab and ontology references.
  • Inputs → Outputs: docs/isa/*.json → reports/qc/isa_validation.json, reports/qc/isa_report.html
  • CLI: python -m src.ingest.validate_isa isa_root=docs/isa strict=true
  • Key Hydra: data.isa.root, ingest.isa.required_ontologies, ingest.isa.schema
  • Algorithms/libs: ISA-API validators.
  • QC/Compute/Idempotence: Lightweight; cached by JSON hash

src/ingest/build_sample_map.py

  • Purpose: Construct subject–session–modality map aligning EEG/iEEG (BIDS) with all omics IDs; track consent flags.
  • Inputs → Outputs: interim/metadata/harmonized_participants.tsv, inputs/omics/*manifest.tsv, bids/ → interim/mappings/sample_map.tsv, interim/mappings/subject_key.json, reports/qc/sample_map.html
  • CLI: python -m src.ingest.build_sample_map map_strategy=inner include_controls=false (arrayable per modality)
  • Key Hydra: ingest.mapping.strategy, ingest.mapping.key_columns, privacy.exclude_opt_out
  • Algorithms/libs: pandas, PyBIDS for BIDS lookups.
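
  A minimal sketch of the inner-join mapping with consent filtering, using hypothetical manifest and consent tables:

  ```python
  from functools import reduce

  import pandas as pd

  # Hypothetical per-modality manifests mapping participant_id → modality sample ID.
  manifests = {
      "eeg": pd.DataFrame({"participant_id": ["sub-001", "sub-002", "sub-003"],
                           "eeg_session": ["ses-01"] * 3}),
      "rna": pd.DataFrame({"participant_id": ["sub-001", "sub-003"],
                           "rna_sample": ["R1", "R3"]}),
  }
  consent = pd.DataFrame({"participant_id": ["sub-001", "sub-002", "sub-003"],
                          "opted_out": [False, False, True]})

  # map_strategy=inner: keep only subjects present in every modality,
  # then drop anyone with an opt-out consent flag (privacy.exclude_opt_out).
  sample_map = reduce(lambda a, b: a.merge(b, on="participant_id", how="inner"),
                      manifests.values())
  sample_map = sample_map.merge(consent, on="participant_id")
  sample_map = sample_map[~sample_map["opted_out"]].drop(columns="opted_out")
  print(sample_map["participant_id"].tolist())
  ```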

Stage 2 — EEG/iEEG preprocessing & derivatives

Goal: Clean EEG/iEEG with MNE, then compute derivatives: PSD, spectral parameterization (specparam/FOOOF), microstates, connectivity, complexity, and PAC.

src/eeg/preprocess_eeg.py

  • Purpose: Band-filter, resample, re-reference, ICA/ASR, interpolation, segmentation → BIDS-Derivatives FIF + JSON provenance.
  • Inputs → Outputs: BIDS EEG, interim/mappings/sample_map.tsv, montages, conf/eeg/preprocessing.yaml → derivatives/eeg-preproc/*_desc-preproc_eeg.fif + _preproc.json, reports/qc/eeg/*_preproc_report.html
  • CLI: Local (per subject): python -m src.eeg.preprocess_eeg subject=sub-001 · SLURM array via workflows/slurm/array-preproc.sbatch
  • Key Hydra: eeg.preprocessing.filter.{l_freq,h_freq,notch}, resample, reference, ica.method, artifact.removal
  • Algorithms/libs: MNE-Python, MNE-BIDS, optional PyPREP/autoreject.
  • QC & logs: bad-channel %, raw vs cleaned PSD, ICA comps; MLflow run with figures
  • Compute: 4–8 cores; 8–16 GB RAM
  • Idempotence & validation: Channel montage checks; DVC per subject.

src/eeg/compute_derivatives.py

  • Purpose: Unified CLI to compute PSD, specparam/FOOOF features, microstates, connectivity, complexity, and PAC.
  • Inputs → Outputs: Preproc FIF + configs → BIDS-compliant TSV/HDF5 + JSON sidecars; reports/qc/eeg/*_derivatives.html
  • CLI:
    python -m src.eeg.compute_derivatives subject=sub-001 eeg.derivatives='[psd,fooof,microstates,connectivity,complexity,pac]' fooof.max_n_peaks=6 microstates.n_states=4 connectivity.metrics='[coh,pli,pli2_unbiased]' pac.freq_pairs='[[6,10],[80,150]]'
    SLURM arrays across subjects/sessions
  • Key Hydra: fooof.*, microstates.n_states, connectivity.metrics, pac.method, complexity.metrics
  • Algorithms/libs: MNE, specparam/FOOOF, Pycrostates, mne-connectivity, Tensorpac.
  • QC & logs: FOOOF fit R², microstate GEV, connectivity sparsity, PAC comodulograms
  • Compute: 8–16 cores; 16 GB RAM; optional GPU for clustering
  • Idempotence & validation: --only-missing, NaN thresholds; DVC per derivative; connectivity computed with mne-connectivity.
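
  As an illustration of the simplest derivative, relative band power can be computed from a Welch PSD with scipy alone (the pipeline itself uses MNE; the signal parameters below are made up):

  ```python
  import numpy as np
  from scipy.signal import welch

  # Synthetic 10 Hz "alpha" oscillation plus noise: 2 s at 250 Hz.
  fs = 250
  t = np.arange(0, 2, 1 / fs)
  rng = np.random.default_rng(0)
  x = np.sin(2 * np.pi * 10 * t) + 0.1 * rng.standard_normal(t.size)

  f, psd = welch(x, fs=fs, nperseg=fs)  # 1 Hz frequency resolution

  def band_power(f, psd, lo, hi):
      """Integrate the PSD over [lo, hi) Hz via the trapezoid rule."""
      mask = (f >= lo) & (f < hi)
      return np.trapz(psd[mask], f[mask])

  total = band_power(f, psd, 1, 40)
  alpha_rel = band_power(f, psd, 8, 13) / total
  print(round(alpha_rel, 2))  # most power should sit in the alpha band
  ```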

src/eeg/eeg_qc_report.py

  • Purpose: Aggregate subject/cohort EEG QC → HTML/PDF dashboards with thresholds and flags.
  • Inputs → Outputs: derivative metrics + MLflow logs → reports/qc/eeg/index.html, reports/qc/eeg/flags.csv
  • Algorithms/libs: pandas, plotly, mne-report
  • Compute/Idempotence: Lightweight; incremental merges

Stage 3 — Multi-omics QC & harmonization

Goal: Per-modality normalization/QC, imputation, and batch correction (ComBat/RUV), with standardized identifiers.

src/omics/qc_genomics.py

  • Purpose: SNP QC, ancestry inference, kinship → filtered PLINK + dosage matrix.
  • Inputs → Outputs: inputs/omics/genomics/*.vcf{.gz} + sample map → derivatives/omics/genomics/qc/subset_plink.*, dosage.parquet, genomics_qc.html
  • CLI: thresholds for call rate, HWE p, MAF, and LD pruning; SLURM array by chromosome
  • Algorithms/libs: plink2, scikit-allel (optional Hail)
  • QC/Compute/Idempotence: MLflow call rate/het/PCs; DVC per chromosome

src/omics/qc_transcriptomics.R

  • Purpose: RNA-seq normalization & QC with edgeR/DESeq2/limma, batch preview.
  • Inputs → Outputs: raw counts + metadata → normalized counts (.tsv/.rds), QC plots
  • CLI: choose normalization (TMM/DESeq2/voom), CPM filters, batch variables
  • Algorithms/libs: edgeR, DESeq2, limma; sva for batch methods; BiocParallel.
  • QC/Compute/Idempotence: % kept, MV trends; parallel; DVC caches; biomaRt ID checks

src/omics/qc_proteomics.py

  • Purpose: Normalize proteomics (label-free/TMT), adaptive imputation, replicate concordance.
  • Inputs → Outputs: intensities + metadata → normalized matrix, QC metrics
  • CLI: transforms (log2/vsn), imputation (knn/bpca/min-det), batch variables
  • Algorithms/libs: pandas, pyComBat, sklearn KNNImputer
  • QC/Compute/Idempotence: CV%, replicate correlation; DVC; UniProt mapping coverage

src/omics/qc_metabolomics.py

  • Purpose: Normalize targeted/untargeted metabolomics; pooled-QC & drift correction.
  • Inputs → Outputs: vendor CSVs + metadata → normalized matrix, drift curves, QC metrics
  • CLI: imputation (knn/rf/bpca), batch (ComBat/RUV), scale (pareto/auto/range), pooled-CV threshold
  • Algorithms/libs: missForest/pyComBat; HMDB mapping
  • QC/Compute/Idempotence: pooled CVs; DVC; mapping coverage

src/omics/omics_batch_correct.R

  • Purpose: Apply ComBat (empirical Bayes) or RUV; emit diagnostics.
  • Inputs → Outputs: normalized matrices + covariates → batch_corrected.tsv, variance components + plots
  • CLI: method, covariates, reference batch
  • Algorithms/libs: sva::ComBat, RUVSeq, limma.
  • QC/Compute/Idempotence: MLflow variance ratios pre/post; DVC; covariate presence checks

src/omics/omics_id_map.py

  • Purpose: Standardize feature IDs (HGNC/Ensembl/UniProt/HMDB); crosswalk dictionaries.
  • Outputs: standardized matrices, feature_id_map.tsv, ambiguous_ids.tsv
  • QC/Idempotence: coverage %, ambiguous counts; uniqueness enforced

src/omics/omics_qc_report.py

  • Purpose: Compile per-modality QC dashboards → HTML/PDF + governance summary
  • Outputs: reports/qc/omics/omics_qc.html/.pdf, reports/governance/omics_summary.xlsx

Stage 4 — Feature fusion & alignment

Goal: Join EEG derivatives with each omics block by participant_id to create a multi-block master matrix; add age-adjusted features and block balancing.

src/features/merge_features.py

  • Purpose: Align EEG + omics into interim/features/master_matrix.parquet with block_metadata.json
  • Config: align keys, block weights, missingness policy; Dask for scale
  • Algorithms/libs: pandas, polars, Dask
  • QC: feature counts, missingness, cross-block heatmaps
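
  The block-prefixed join can be sketched as follows (feature names are hypothetical; the real script also handles missingness policy and Dask-scale tables):

  ```python
  import json

  import pandas as pd

  # Hypothetical EEG and omics feature tables keyed by participant_id.
  eeg = pd.DataFrame({"participant_id": ["sub-001", "sub-002"],
                      "alpha_power": [0.42, 0.37]})
  rna = pd.DataFrame({"participant_id": ["sub-001", "sub-002"],
                      "BDNF_expr": [7.1, 6.4]})

  blocks = {"eeg": eeg, "rna": rna}
  master = None
  block_meta = {}
  for name, df in blocks.items():
      feats = [c for c in df.columns if c != "participant_id"]
      # Prefix columns with the block name so block membership is recoverable.
      df = df.rename(columns={c: f"{name}__{c}" for c in feats})
      block_meta[name] = [f"{name}__{c}" for c in feats]
      master = df if master is None else master.merge(df, on="participant_id")

  print(master.shape)      # (2, 3): ID plus one feature per block
  print(json.dumps(block_meta))
  ```

  The `block_meta` mapping plays the role of block_metadata.json: downstream multiblock models (DIABLO, MOFA+) need to know which columns belong to which block.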

src/features/age_norm.py

  • Purpose: Compute age-adjusted z-scores or residuals per block (linear/GAM/LOESS)
  • Algorithms/libs: statsmodels, pyGAM
  • QC: distribution diagnostics per block
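
  A minimal linear residualization, the simplest of the three model families named above:

  ```python
  import numpy as np

  # Hypothetical: regress each feature on age, keep residuals as age-adjusted values.
  rng = np.random.default_rng(1)
  age = rng.uniform(20, 70, 200)
  feature = 0.5 * age + rng.standard_normal(200)  # age-dependent feature

  # Fit feature ~ 1 + age by least squares and subtract the prediction.
  X = np.column_stack([np.ones_like(age), age])
  beta, *_ = np.linalg.lstsq(X, feature, rcond=None)
  residual = feature - X @ beta

  # After residualization the age correlation should be ~0.
  print(round(np.corrcoef(age, residual)[0, 1], 6))
  ```

  In the pipeline the fit would be done inside CV folds (fit on train, apply to test) to avoid leakage; GAM/LOESS variants replace the design matrix with a smooth basis.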

src/features/block_balance.py

  • Purpose: Scale/weight blocks (variance / quantile / energy) to avoid dominance by high-dimensional omics
  • Algorithms/libs: scikit-learn, mixOmics utilities.
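
  One reasonable variance-based weighting scheme can be sketched in a few lines (an illustration, not necessarily the script's exact formula):

  ```python
  import numpy as np

  def balance_blocks(blocks):
      """Scale each block so its total variance sums to 1.

      This prevents a 20k-feature omics block from dominating a
      200-feature EEG block in concatenation-based fusion.
      """
      out = {}
      for name, X in blocks.items():
          Xc = X - X.mean(axis=0)           # center features
          total_var = Xc.var(axis=0).sum()  # sum of per-feature variances
          out[name] = Xc / np.sqrt(total_var)
      return out

  rng = np.random.default_rng(0)
  blocks = {"eeg": rng.normal(size=(50, 20)),
            "rna": rng.normal(size=(50, 5000)) * 10}
  balanced = balance_blocks(blocks)
  for name, X in balanced.items():
      print(name, round(X.var(axis=0).sum(), 3))  # both ≈ 1.0
  ```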

src/features/features_dictionary.py

  • Purpose: Feature dictionary with provenance and ontology tags
  • Outputs: docs/features/*.tsv/.json/.yaml

src/features/export_feature_sets.py

  • Purpose: Export curated subsets (EEG-only, omics-only, task-specific) with privacy filters
  • Outputs: exports/features/{subset}.parquet/.json

Stage 5 — Integration & modeling

Goal: Compare supervised (DIABLO) and unsupervised (MOFA+) integration, graph fusion (SNFpy), plus baselines (XGBoost/RF/SVM) and optional early-fusion NN; strict CV; MLflow logging.

src/models/cv_splits.py

  • Purpose: Leakage-safe (stratified/grouped) CV; nested repeats; seeds + fold manifests
  • Outputs: interim/cv/splits.pkl, leakage_checks.json
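
  A leakage check on grouped, stratified folds might look like this (the subject/session layout is synthetic):

  ```python
  import numpy as np
  from sklearn.model_selection import StratifiedGroupKFold

  # Hypothetical setup: multiple sessions per subject; grouping by subject
  # guarantees no subject appears in both train and test of any fold.
  rng = np.random.default_rng(0)
  subjects = np.repeat(np.arange(20), 3)    # 20 subjects × 3 sessions
  y = np.repeat(rng.integers(0, 2, 20), 3)  # subject-level labels

  cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
  for train, test in cv.split(np.zeros((len(y), 1)), y, groups=subjects):
      overlap = set(subjects[train]) & set(subjects[test])
      assert not overlap, "subject leakage across folds!"
  print("no subject-level leakage in any fold")
  ```

  The leakage_checks.json artifact would record exactly this kind of assertion per fold, alongside the seeds used.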

src/models/train_models.py

  • Purpose: Hydra dispatcher that runs DIABLO (R), MOFA+ (Python), SNF-based clustering, and baselines; per-fold MLflow logging + DVC checkpoints; SLURM fold-parallel
  • Inputs → Outputs: master matrix + config → models/{model}/run-*/, OOF predictions, reports/modeling/{model}_fold_metrics.csv
  • Algorithms/libs: mixOmics/DIABLO, mofapy2 (MOFA+), snfpy, scikit-learn, XGBoost, PyTorch Lightning (optional).
  • CLI:
    Local → python -m src.models.train_models model=diablo cv.repeats=5
    SLURM → snakemake --profile workflows/profiles/slurm train_models model=xgb
  • QC & logs: per-fold PR-AUC/AUC/F1, confusion matrices; MLflow params/artifacts
  • Compute: CPU/GPU as needed (GPU for NN); fold-parallel on cluster
  • Idempotence & validation: split manifests frozen; preprocessing fit inside CV; DVC stage.

src/models/run_diablo.R

  • Purpose: Run DIABLO (multiblock sPLS-DA) per fold; export loadings, design matrix, performance.
  • Outputs: models/diablo/fold-*/loadings.tsv, plots/*.html

src/models/run_mofa.py

  • Purpose: Train MOFA+ factors; save factors/loadings + ELBO/variance explained.
  • Outputs: models/mofa/factors.parquet, models/mofa/loadings.parquet, reports/modeling/mofa_variance.html

src/models/run_snf.py

  • Purpose: Build per-block affinities and run Similarity Network Fusion; cluster and export labels.
  • Outputs: models/snf/affinities/*.npy, models/snf/fused.npy, models/snf/labels.csv

src/models/train_baselines.py

  • Purpose: XGBoost/RF/SVM baselines with unified API; SHAP-ready artifacts.
  • Outputs: models/baselines/*, OOF preds, feature importance

src/models/stack_predictions.py

  • Purpose: Late-fusion stacking on OOF predictions (logistic/elastic-net meta-learner); calibration and uncertainty
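
  A toy version of the stacking step, with synthetic OOF probabilities (in practice the meta-learner is evaluated on held-out folds, not refit data):

  ```python
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Hypothetical OOF probability columns from three base models.
  rng = np.random.default_rng(0)
  y = rng.integers(0, 2, 300)
  # Base-model OOF predictions: noisy versions of the truth.
  oof = np.column_stack([
      np.clip(y + rng.normal(0, s, 300), 0, 1) for s in (0.4, 0.5, 0.6)
  ])

  # Meta-learner trained on OOF predictions only — never on in-fold outputs,
  # which would leak the base models' training data into the stack.
  meta = LogisticRegression().fit(oof, y)
  stacked = meta.predict(oof)
  print(round((stacked == y).mean(), 2))  # stacked accuracy on the OOF matrix
  ```

  An elastic-net meta-learner is the same call with `penalty="elasticnet"` and a saga solver; calibration would wrap `meta` in `CalibratedClassifierCV`.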

src/models/evaluate_metrics.py

  • Purpose: Aggregate metrics across folds, compute CIs (e.g., bootstrap or DeLong), export cohort report

Stage 6 — Interpretation & reporting

Goal: Explain models and render EEG maps; summarize biology.

src/interpretation/interpret_model.py

  • Purpose: Unified entry: DIABLO loadings, MOFA factors, SHAP summaries, Haufe maps
  • Outputs: reports/interpretation/* JSON + HTML

src/interpretation/shap_explain.py

  • Purpose: Global & local explanations using SHAP (tree/NN); interaction plots; background caching
  • Algorithms/libs: SHAP, xgboost/torch.

src/interpretation/haufe_maps.py

  • Purpose: Compute the Haufe transform for linear EEG subspaces to obtain sensor-space, physiologically interpretable maps (per fold).
  • Outputs: FIF derivatives + HTML topographies
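
  The transform itself is a one-liner on the filter weights w: A = Cov(X) w / Var(Xw). A numpy sketch:

  ```python
  import numpy as np

  def haufe_pattern(X, w):
      """Haufe et al. (2014): convert a linear backward filter w into an
      interpretable forward activation pattern A = Cov(X) @ w / Var(X @ w)."""
      Xc = X - X.mean(axis=0)
      cov_x = np.cov(Xc, rowvar=False)
      y_hat = Xc @ w
      return cov_x @ w / y_hat.var(ddof=1)

  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 8))  # 500 epochs × 8 hypothetical channels
  w = rng.normal(size=8)         # a decoder's learned filter weights
  A = haufe_pattern(X, w)
  print(A.shape)  # one pattern weight per channel
  ```

  The resulting A, not w, is what gets plotted as a topography: filter weights can assign large values to noise-cancelling channels, whereas the pattern reflects how the latent signal actually projects to the sensors.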

src/interpretation/pathway_summarize.py

  • Purpose: Map feature importances to pathways (Reactome/KEGG/MSigDB); FDR-controlled enrichment; export ranked tables
