
Pipeline for Analyzing DNA Methylation

Introduction

This repository contains two Snakemake workflows for analyzing DNA methylation: the first demultiplexes, trims, and aligns sequencing data; the second merges, deduplicates, and extracts methylation information.

Pipeline Overview
This pipeline consists of 2 Snakemake workflows.

1. Preprocessing Workflow:

Input: PE sequencing data 
1. Demultiplex + add index (fgbio DemuxFastqs)
2. Trim (Trim Galore)
3. Align (Bismark)

2. Methylation Calling Workflow:

Input: aligned BAM
1. Merge (by sample across sequencing runs, if desired)
2. Deduplicate (deduplicate_bismark)
3. Extract methylation info (bismark_methylation_extractor)



This workflow was developed by Emily Chao, using Ruslan Stroganov's scripts as the template.

If you want to know in detail how this workflow was created, click here.

If you do not have access to this document, please email Emily Chao ([email protected]).

I HIGHLY recommend using this document as a guide if you're a first-time user; its instructions are much more in-depth, with thorough explanations.

Getting Started
1. Pull all files from the repository:
srun --time=2:00:00 --mem=20G --partition=interactive --pty /usr/bin/bash -A [account]

cd /home/exacloud/gscratch/CEDAR/[user]
git clone https://github.com/ohsu-cedar-comp-hub/AMLClock.git
cd AMLClock

Run tree to ensure you have all the necessary files. Your output should look like this:

.
├── 01_preprocessing.smk
├── 02_methylation_calling.smk
├── EXAMPLE_INPUT
│   ├── SampleSheets
│   │   └── S1_SampleSheet.csv
│   ├── data_S1_R1.fastq.gz
│   └── data_S1_R2.fastq.gz
├── README.md
├── aml_clock.yaml
├── assets
│   ├── 01_config_breakdown.png
│   ├── 02_config_breakdown.png
│   └── cluster_breakdown.png
├── config
│   ├── 01_config.yaml
│   ├── 02_config.yaml
│   └── cluster
│       └── config.v8+.yaml
└── scripts
    ├── align.sh
    ├── change_ext.sh
    ├── cluster-status.py
    ├── deduplicate.sh
    ├── demux_w_id.sh
    ├── extract_methyl.sh
    ├── merge.sh
    ├── run_pipeline.sh
    └── trim.sh

  2. Set up the conda environment.
conda env create -f aml_clock.yaml
conda activate aml_clock
cd AMLClock
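
To confirm the environment is active and provides Snakemake, a quick optional check:

# Both should resolve from the aml_clock environment
which snakemake
snakemake --version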

NOTE: If your conda version is older than mine (conda 24.11.3), you may run into an issue creating the environment. Try the below:

conda config --set channel_priority flexible

If it still errors, please email me ([email protected]).

Running Preprocessing Workflow

From the Start
  1. Organize input data accordingly.

    Input data must be paired-end sequencing data ending with *_R1.fastq.gz and *_R2.fastq.gz. The filename of each input file must also include the sample ID as _S#_ somewhere; this is essential for matching against the expected sample sheet. All input data must be placed in a single folder.

    You must also have sample sheets to indicate which sample barcodes are expected per sample ID after demultiplexing. Sample sheets should follow the naming scheme S#_SampleSheet.csv and be placed in a SampleSheets directory inside your input data folder. The content of the sample sheet must follow fgbio's guidelines: https://fulcrumgenomics.github.io/fgbio/tools/latest/DemuxFastqs.html.

    NOTE: There is one key addition to the sample sheet: SAMPLE_INDEX. This column was added to help with identifying files; each index is unique per sample name.

    A full example is in the EXAMPLE_INPUT directory.

    NOTE: If you want to analyze your blank samples after demultiplexing, make sure they have a unique sample index as well. If a blank has no sample index, Snakemake will skip it in subsequent steps.
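
    For orientation, here is a hypothetical sketch of a sample sheet written from the shell. SAMPLE_INDEX is this pipeline's addition; the other columns and all values are placeholder assumptions, so treat fgbio's DemuxFastqs guidelines (linked above) and the file in EXAMPLE_INPUT as the authoritative format:

    # Hypothetical sample sheet; columns besides SAMPLE_INDEX are assumptions.
    # Verify against fgbio's guidelines and EXAMPLE_INPUT before using.
    cat > [input dir]/SampleSheets/S1_SampleSheet.csv <<'EOF'
    Sample_ID,Sample_Name,Library_ID,Sample_Barcode,SAMPLE_INDEX
    S1,patientA,lib01,ACGTACGT,1
    S1,patientB,lib01,TGCATGCA,2
    EOF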



    NOTE: If you have files ending in *_R1_{something}.fastq.gz, you can use change_ext.sh (located in scripts/) to rename them to the desired *_R1.fastq.gz format. The script removes the {something} portion (e.g., data_S1_R1_001.fastq.gz would become data_S1_R1.fastq.gz), so if you still want that part, add it back to the filename manually or use a different script.

    ./scripts/change_ext.sh --d=[absolute path to input data directory]
    
    

  2. Navigate to config/01_config.yaml. You will need to change the variables to match your setup.

    These are the variables you must change and their required values:

    1. raw_data: [absolute path to input data directory]
    2. manual: false

    You will also likely need to change the other variables, depending on the parameters you want for demultiplexing, trimming, and aligning.

    Refer to Config File Breakdown for more details.


  3. Navigate to config/cluster/config.v8+.yaml.

    You can change the variables as needed. Most often, you may want to change the Slurm account, where the output logs are written, and how the Slurm jobs are named.

    If desired, you can also change the default resources. You are also able to change the resource requests per rule in the smk file. More details in Changing Resource Requests.

    Refer to Cluster Config File Breakdown for more details.


  4. Perform a Snakemake dry run to confirm that your data will be run correctly.

    cd AMLClock
    configfile=[absolute path to 01_config.yaml]
    snakefile=[absolute path to 01_preprocessing.smk]
    
    snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile 
    

    Pay close attention to the output of this dry run and check that the files Snakemake expects to generate are correct.


  5. Now run this workflow using the launch script scripts/run_pipeline.sh:
    cd AMLClock
    sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile
    
    

NOTE: If you want to run the workflow from the start but stop at an earlier rule, customize the commands like this:

cd AMLClock

snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile --until [last rule to run]

sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile -u [last rule to run]
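
For example, to stop after trimming (assuming the trimming rule is named trim; check 01_preprocessing.smk for the exact rule name):

snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile --until trim

sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile -u trim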

  6. Check and monitor the workflow's progress.

You can see how the overall workflow is proceeding by going to slurm-[sbatchjobid].out

You can see individual job progress by going to the output directory you had set in config/cluster/config.v8+.yaml.
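
Standard Slurm commands also work for monitoring; for example:

# Follow the launcher's log as the workflow proceeds
tail -f slurm-[sbatchjobid].out

# List your pending and running jobs
squeue -u $USER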

From the Middle

This is for when you want to run the preprocessing workflow midway because the files from the first step (demultiplexing) were produced elsewhere or are not available to you.

For this workflow, there are 2 possible scenarios:

  1. I have demultiplexed files (with or without an index) that need trimming and alignment.
  2. I have files trimmed that need alignment.

I will address both below:

  1. Prepare your input.

    Your input must match the format of the rule that you are trying to start from.

    Start from trim:

    • Input files must be in [input dir]/Demuxed.

    • Input files must end in: *_R1.fastq.gz , *_R2.fastq.gz

    • NOTE: If your demuxed file names have no index and you don't have a sample sheet in hand, you will need to manually add an index so the names follow *_{INDEX}_R1.fastq.gz and *_{INDEX}_R2.fastq.gz. The index should be unique per sample name, so files with the same sample name share the same sample index (see the renaming sketch after this list).

    Start from align:

    • Input files must be in [input dir]/Trimmed.

    • Input files must end in: *_R1_val_1.fq.gz , *_R2_val_2.fq.gz
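
    A hypothetical renaming sketch for the no-index case (filenames made up; choose indexes that are unique per sample name):

    # Hypothetical: give sampleA's read pair the sample index 1
    mv sampleA_R1.fastq.gz sampleA_1_R1.fastq.gz
    mv sampleA_R2.fastq.gz sampleA_1_R2.fastq.gz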


  2. Adjust your config file in config/01_config.yaml as needed.

    These are the variables you must change and their required values:

    1. raw_data: [absolute path to input dir]
    2. manual: true

    Feel free to adjust the other parameters as desired for your trimming and/or aligning.

    Refer to Config File Breakdown for more details.


  3. Adjust your cluster config file in config/cluster/config.v8+.yaml as needed. If desired, you can also change the default resources. You are also able to change the resource requests per rule in the smk file. More details in Changing Resource Requests.

    Refer to Cluster Config File Breakdown for more details.

  4. Perform a dry run.

    cd AMLClock
    snakefile=[absolute path to 01_preprocessing.smk]
    configfile=[absolute path to 01_config.yaml]
    snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile 
    

    Pay close attention to the output Snakemake generates. You should see that the rules Snakemake would run are exactly the remaining rules you need.
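
    For instance, if you start from trim, the dry run should list only the trimming and alignment rules (plus all), not the demultiplexing rule; the exact rule names are in 01_preprocessing.smk.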


  5. Run the workflow now.

    cd AMLClock
    snakefile=[absolute path to 01_preprocessing.smk]
    configfile=[absolute path to 01_config.yaml]
    sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile
    
    
  6. Check and monitor the workflow's progress. You can see how the overall workflow is proceeding by checking slurm-[sbatchjobid].out, and individual job progress in the output directory you set in config/cluster/config.v8+.yaml.

Running the Methylation Calling Workflow

From the Start
This workflow is very similar to the Preprocessing Workflow.
  1. Organize input data accordingly.

    Input data must be aligned BAM files ending with *_R1_val_1_bismark_bt2_pe.bam.
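
    A quick, generic way to confirm your inputs match the expected pattern before running (plain shell, not part of the pipeline):

    ls [absolute path to input data directory]/*_R1_val_1_bismark_bt2_pe.bam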


  2. Navigate to config/02_config.yaml. You will need to change the variables to match your setup.

    These are the variables you must change and their required values:

    1. main_dir: [absolute path to input data directory]
    2. manual: false
    3. merge: true OR false

    NOTE: You specify here whether you want merging to occur. Merging occurs before deduplication and methylation extraction. It merges by sample name and sample index and returns {sample_name}_{sample_index}_R1_val_1_bismark_bt2_pe.bam.

    You will also likely need to change the other variables, depending on the parameters you want for deduplication and methylation extraction. Refer to Config File Breakdown for more details.

    NOTE: If you are running this workflow right after the preprocessing workflow, set main_dir to [absolute path to Aligned_paired directory].


  3. Navigate to config/cluster/config.v8+.yaml.

    You can change the variables as needed. Most often, you may want to change the Slurm account, where the output logs are written, and how the Slurm jobs are named.

    If desired, you can also change the default resources. You are also able to change the resource requests per rule in the smk file. More details in Changing Resource Requests.

    Refer to Cluster Config File Breakdown for more details.


  4. Perform a Snakemake dry run to confirm that your data will be run correctly.

    cd AMLClock
    configfile=[absolute path to 02_config.yaml]
    snakefile=[absolute path to 02_methylation_calling.smk]
    
    snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile 
    

    Pay close attention to the output of this dry run and check that the files Snakemake expects to generate are correct.


  5. Now run this workflow using the launch script scripts/run_pipeline.sh:
    cd AMLClock
    sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile
    
    

NOTE: If you want to run the workflow from the start but stop at an earlier rule, customize the commands like this:

cd AMLClock

snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile --until [last rule to run]

sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile -u [last rule to run]
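
For example, to stop after deduplication (assuming the rule is named deduplicate; check 02_methylation_calling.smk for the exact rule name):

snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile --until deduplicate

sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile -u deduplicate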

  6. Check and monitor the workflow's progress. You can see how the overall workflow is proceeding by checking slurm-[sbatchjobid].out, and individual job progress in the output directory you set in config/cluster/config.v8+.yaml.
From the Middle

This is for when you want to start running the methylation calling workflow mid-way.

The only scenario for this workflow is when you have deduplicated files but do not have the aligned files.

Example: I have deduplicated output files and only need to extract methylation information from them.

  1. Prepare your input.

    Start from deduplicate:

    • Input files must be in [input dir]/Deduplicated.

    • Input files must end in: *_R1_val_1_bismark_bt2_pe.deduplicated.bam.
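
    A quick, generic check that your inputs match the expected pattern (plain shell, not part of the pipeline):

    ls [input dir]/Deduplicated/*_R1_val_1_bismark_bt2_pe.deduplicated.bam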


  2. Adjust your config file in config/02_config.yaml as needed.

    These are the variables you must change and their required values:

    1. main_dir: [absolute path to input dir]
    2. manual: true
    3. merge: false

    Feel free to adjust the other parameters as desired for methylation calling.

    Refer to Config File Breakdown for more details.


  3. Adjust your cluster config file in config/cluster/config.v8+.yaml as needed.

    If desired, you can also change the default resources. You are also able to change the resource requests per rule in the smk file. More details in Changing Resource Requests.

    Refer to Cluster Config File Breakdown for more details.

  4. Perform a dry run.

    cd AMLClock
    snakefile=[absolute path to 02_methylation_calling.smk]
    configfile=[absolute path to 02_config.yaml]
    snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile 
    

    Pay close attention to the output Snakemake generates. You should see that the rules Snakemake would run are exactly the remaining rules you need. In this case, it should only run the all and get_methyl_info rules.


  5. Run the workflow now.

    cd AMLClock
    snakefile=[absolute path to 02_methylation_calling.smk]
    configfile=[absolute path to 02_config.yaml]
    sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile
    
    
  6. Check and monitor the workflow's progress. You can see how the overall workflow is proceeding by checking slurm-[sbatchjobid].out, and individual job progress in the output directory you set in config/cluster/config.v8+.yaml.

Config File Breakdown

Use this breakdown to help determine which variables you want to customize: see assets/01_config_breakdown.png (preprocessing) and assets/02_config_breakdown.png (methylation calling).

Cluster Config File Breakdown

Use this breakdown to help determine which variables you want to customize: see assets/cluster_breakdown.png.

Changing Resource Requests

The default resource requests can be seen by going to the cluster config file.

The breakdown is here: Cluster Config File Breakdown.

I chose the defaults and some rule-specific manual adjustments through experimenting with small UKCTOCs samples. For the full explanation, go here.

I've also manually adjusted the resource requests for some big rules in the smk files. To take a look, open the relevant smk file: 01_preprocessing.smk for the preprocessing workflow, or 02_methylation_calling.smk for the methylation calling workflow.

NOTE: If you want to do any resource adjusting, I suggest you adjust it manually first, before trying to change the defaults.

To do so:

  1. Navigate to the relevant smk file.
  2. Look for rule [target_rule] in the code. You can check whether the rule sets any resource parameters by looking for:
  • threads: (specifies the number of CPUs)
  • resources: (specifies time, mem_mb, gres, etc.)
  3. Feel free to adjust your resource requests by modifying the number in threads and the values in resources.

    If there are no 'threads:' or 'resources:' present under your target rule, you can add them.

    Example:

    rule target: 
        input: ..
        output: ..
        params: ..
        threads: 1
        resources: 
            time="04:00:00",
            gres="disk:1024", 
            mem_mb=4000  
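
    After editing an smk file, a quick dry run confirms the file still parses and the new requests are picked up:

    snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile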
    

Want to know if your resource requests were appropriate?

Refer to the SlurmJobAssessment tool here
