Skip to content

MOKA Snakemake pipeline for GWAS incorporating user provided multi-omics data sources such as gene expression, neural network weights, transcription factor binding, and evolutionary conservation scores.

License

Notifications You must be signed in to change notification settings

davidenoma/moka

Repository files navigation

Snakemake

image

🌉 Multi-omics bridged Kernel Association test (MOKA) Pipeline

MOKA implements a Snakemake pipeline to automate data bridge kernel-based association tests.
This pipeline offers flexibility of GWAS analysis & visualizations with different multi-omics variant-specific weights.

moka-figure
🚀 Usage

1. Prepare data inputs

  • GWAS genotype files in PLINK format (.bed, .bim, .fam) [!required]
  • Variant-specific weights in CSV format (SNP_ID,Chromosome,Position,Weight) [!required for moka]
  • Gene regions file (provided in GRCh38 or hg38) , generated from Ensembl GFF3 annotations (e.g., Homo_sapiens.GRCh38.115.gff3.gz, Ensembl Release 115).
  • DisGeNET gene disease database reference file ( If disease external validation needed)

2. Install Snakemake

Follow the Snakemake Installation Guide:

conda create -c conda-forge -c bioconda -c nodefaults -n snakemake snakemake
conda activate snakemake
snakemake --help

3. Download and install moka

git clone https://github.com/davidenoma/moka
cd moka

4. Configure the pipeline parameters in the config.yaml file

This configuration controls paths to inputs and analysis settings:

  • genotype_prefix: Prefix for the genotype data files (without file extensions).
  • weights_type: Text descriptor indicating the source or type of biologically informed functional weights used in the analysis.
  • genotype_file_path: Directory path to the genotype data files.
  • weight_file: File path to the functional weight file applied in the association tests.
  • disgenet_reference_file: Path to the DisGeNET reference file containing disease-specific gene–disease associations https://disgenet.org [For gene disease associations only!].
  • spectral decomposition: Boolean flag to enable spectral (eigenvalue) decomposition and transformation of genotype and phenotype matrices. Default: TRUE.
  • is_binary: Boolean flag specifying the phenotype type (TRUE for binary traits; FALSE for quantitative traits). Default: TRUE.
  • Plink: Path to the PLINK executable, e.g."~/software/plink".

5. Running MOKA

We provide a demo example with configuration located at ./config/config.yaml. The pipeline executes the following steps:

Step 1. Kernel-based association testing

  • Integrates GWAS genotype data with the provided weights.
  • For the weight file, supports diverse data sources derived SNP-level weights.
  • Performs SNP-set kernel-based association tests to model the joint effect of multiple variants.
  • Execute one chromosome at a time.
  • Optionally applies decorrelation to account for population structure or relatedness.

Note: The initial run would take some time as the software installs the core dependencies requires from workflow/envs/moka.yaml.

Input: Preprocessed genotype data (./genotype_data/test_geno.fam test_geno.bim test_geno.bed ) and weight files (./weights/test_geno_weights.csv).

Output: Results of association tests under ./result_folder/

snakemake --cores <num_cores> --use-conda
<num_cores> are the number of cores to use e.g. 8 

Step 2. Merge results from all chromosomes to a single file
Input: Individual association test results.

Output: Merged association test results under ./result_folder/

snakemake --cores 1 merge_moka_results --use-conda

Step 3. Functional enrichment analysis

  • Significant genes from association testing are assessed for:
    • KEGG pathway enrichment
    • Gene Ontology (GO) enrichment
  • Helps identify biological processes and pathways underlying GWAS signals.

Input: Merged association test results.

Output: GO analysis results under ./output_plots/

snakemake --cores 1 go_analysis --use-conda

Output: KEGG pathway analysis results under ./output_plots/

snakemake --cores 1 kegg_pathway_analysis --use-conda

Step 4. Visualization

  • Generates publication-ready visual summaries: Manhattan plots

Input: Merged association test results.

Output: Manhattan plots with visual representations of association test results under ./output_plots/

snakemake --cores 1 manhattan_plots --use-conda

Step 5. External validation

  • Cross-references significant genes against the DisGeNET database.
  • Reports a validation ratio of overlapping associated genes, strengthening interpretation of GWAS findings.

Input: Merged association test results.

Output: Annotated association test results with DisGeNet database under ./output_plots/

snakemake --cores 1 disgenet_annotation_005 --use-conda

Additional function: generate_gene_regions

  • Generates gene region files with specified flanking size from GFF3 annotation for gene-based association testing. Uses config.flank_size from the config file.

How to execute:

snakemake --cores 1 generate_gene_regions --use-conda

Dependencies

All required Python and R packages, as well as other software dependencies, are specified in the provided envs.yaml files in the workflow/envs/ directory. These environments are automatically created and managed by Snakemake when you use the --use-conda flag. You do not need to install packages manually.

For more details, see the environment YAML files in workflow/envs/.

Liftover protocol

You much lift over to GRCh38 format check here: Liftover GWAS: [https://github.com/davidenoma/LiftOver]

📖 Additional Information

For more information on the MOKA pipeline and its usage, refer to the documentation provided in the repository or contact the project maintainers. david.enoma@ucalgary.ca

Publication reference

MOKA: A pipeline for multi-omics bridged SNP-set kernel association test https://academic.oup.com/g3journal/advance-article/doi/10.1093/g3journal/jkaf296/8384368

🐳 Docker

You can run the MOKA pipeline using the official Docker image:

docker pull davidenoma/moka-gwas

For usage instructions and examples, see https://hub.docker.com/r/davidenoma/moka-gwas.

Running on HPC Clusters with Restricted Conda

See docs/README_Snakemake_ARC.md for detailed setup instructions and ready-to-use scripts (run_smk.sh, run_smk.sbatch) to run Snakemake safely on clusters without admin privileges.

About

MOKA Snakemake pipeline for GWAS incorporating user provided multi-omics data sources such as gene expression, neural network weights, transcription factor binding, and evolutionary conservation scores.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors