Pipeline Overview
This pipeline consists of 2 Snakemake workflows.
1. Preprocessing Workflow:
Input: PE sequencing data
1. Demultiplex + add index (fgbio DemuxFastqs)
2. Trim (Trim Galore)
3. Align (Bismark)
2. Methylation Calling Workflow:
Input: aligned bam
1. Merge (by sample across sequencing runs if desired)
2. Deduplicate (deduplicate_bismark)
3. Extract methylation info (bismark_methylation_extractor)
This workflow was developed by Emily Chao, using Ruslan Stroganov's scripts as the template.
If you want to know in detail how this workflow was created, click here.
If you do not have access to this document, please email Emily Chao ([email protected]).
I HIGHLY recommend using this document to guide you if you're a first-time user. The instructions there are much more in-depth, with thorough explanations.
Getting Started
1. Pull all files from repository:
srun --time=2:00:00 --mem=20G --partition=interactive --pty /usr/bin/bash -A [account]
cd /home/exacloud/gscratch/CEDAR/[user]
git clone https://github.com/ohsu-cedar-comp-hub/AMLClock.git
cd AMLClock
Run tree to ensure you have all the necessary files.
Your output should look like this:
.
├── 01_preprocessing.smk
├── 02_methylation_calling.smk
├── EXAMPLE_INPUT
│ ├── SampleSheets
│ │ └── S1_SampleSheet.csv
│ ├── data_S1_R1.fastq.gz
│ └── data_S1_R2.fastq.gz
├── README.md
├── aml_clock.yaml
├── assets
│ ├── 01_config_breakdown.png
│ ├── 02_config_breakdown.png
│ └── cluster_breakdown.png
├── config
│ ├── 01_config.yaml
│ ├── 02_config.yaml
│ └── cluster
│ └── config.v8+.yaml
└── scripts
├── align.sh
├── change_ext.sh
├── cluster-status.py
├── deduplicate.sh
├── demux_w_id.sh
├── extract_methyl.sh
├── merge.sh
├── run_pipeline.sh
└── trim.sh
2. Set up the conda environment.
conda env create -f aml_clock.yaml
conda activate aml_clock
cd AMLClock
NOTE: If your conda version is older than mine (conda 24.11.3), you may run into an issue. Try the following:
conda config --set channel_priority flexible
If it still errors, please email me ([email protected]).
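Before creating the environment, it can help to check what you are working with (output will vary by system):
# Show your conda version (this guide was written with conda 24.11.3)
conda --version
# Show the current channel_priority setting
conda config --show channel_priority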
From the Start (Preprocessing Workflow)
- Organize input data accordingly.
Input data must be paired-end sequencing data with filenames ending in *_R1.fastq.gz and *_R2.fastq.gz. Each filename must also include an _S#_ somewhere, i.e., the sample ID; this is essential for matching files to the expected sample sheet. All input data must be placed in a single folder.
You must also have sample sheets indicating which sample barcodes are expected per sample ID after demultiplexing. Sample sheets should follow the naming scheme S#_SampleSheet.csv and be placed in a SampleSheets directory inside your input data folder. The content of the sample sheet must follow fgbio's guidelines: https://fulcrumgenomics.github.io/fgbio/tools/latest/DemuxFastqs.html.
NOTE: There is one key addition to the sample sheet: SAMPLE_INDEX. This was added to help with identifying files; each index is unique per sample name.
A full example is in the EXAMPLE_INPUT directory.
NOTE: If you want to analyze your blank samples after demultiplexing, make sure you have a unique sample index for them as well. If your blank has no sample index provided, Snakemake will skip it in the subsequent steps.
NOTE: If you have files ending in *_R1_{something}.fastq.gz, you can use change_ext.sh (located in scripts/) to rename them to the desired format, *_R1.fastq.gz. This script removes the {something}; if you still want to keep it, add it to the filename manually or use a different script.
./scripts/change_ext.sh --d=[absolute path to input data directory]
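A minimal sanity check of the input layout before running (the bracketed path is a placeholder, as above):
# Every FASTQ should be paired and contain an _S#_ sample ID
ls [absolute path to input data directory]/*_S*_R1.fastq.gz
ls [absolute path to input data directory]/*_S*_R2.fastq.gz
# Every sample ID should have a matching sample sheet
ls [absolute path to input data directory]/SampleSheets/S*_SampleSheet.csv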
- Navigate to config/01_config.yaml. You will need to change the variables to match what you need.
These are the variables you must change and their required values:
- raw_data: [absolute path to input data directory]
- manual: false
You will also likely need to change the other variables, depending on the parameters you would like to use for demultiplexing, trimming, and aligning.
Refer to Config File Breakdown for more details.
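After editing, you can double-check the two required values (a minimal sketch; it assumes raw_data and manual are top-level keys in the YAML file):
grep -E '^(raw_data|manual):' config/01_config.yaml
# Expected to show something along the lines of:
# raw_data: [absolute path to input data directory]
# manual: false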
- Navigate to config/cluster/config.v8+.yaml.
You can change the variables as needed. Most often, you may want to change the Slurm account, where the output logs go, and how the Slurm jobs are named.
If desired, you can change the default resources, or adjust the resource requests per rule in the .smk file; more details in Changing Resource Requests.
Refer to Cluster Config File Breakdown for more details.
- Perform a Snakemake dry run to confirm that your data will be run correctly.
cd AMLClock
configfile=[absolute path to 01_config.yaml]
snakefile=[absolute path to 01_preprocessing.smk]
snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile
Pay close attention to the output of this dry run and check that the files Snakemake is expected to generate are correct.
- Now run this workflow using the launch script scripts/run_pipeline.sh.
cd AMLClock
sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile
NOTE: If you want to run the workflow from the start but end earlier, customize it like this:
cd AMLClock
snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile --until [last rule to run]
sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile -u [last rule to run]
- Check and monitor the workflow's progress.
You can see how the overall workflow is proceeding by going to slurm-[sbatchjobid].out
You can see individual job progress by going to the output directory you had set in config/cluster/config.v8+.yaml.
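A minimal monitoring sketch from the command line (the job ID and log directory are placeholders):
# Follow the overall workflow log
tail -f slurm-[sbatchjobid].out
# List your queued and running Slurm jobs
squeue -u $USER
# Inspect individual job logs in the directory set in config/cluster/config.v8+.yaml
ls [cluster log output directory]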
From the Middle (Preprocessing Workflow)
This is for when you want to run the preprocessing workflow mid-way and you do not have the files produced from the first step (demultiplexing) because they are elsewhere/not available to you.
For this workflow, there are 2 possible scenarios:
- I have demultiplexed files (with or without an index) that need trimming and alignment.
- I have trimmed files that need alignment.
I will address both below:
- Prepare your input.
Your input must match the expected input format of the rule you are starting from.
Start from trim:
- Input files must be in [input dir]/Demuxed.
- Input files must end in *_R1.fastq.gz and *_R2.fastq.gz.
- NOTE: If your demuxed file names have no index and you don't have a sample sheet in hand, you will need to manually add an index so that files are named *_{INDEX}_R1.fastq.gz and *_{INDEX}_R2.fastq.gz. Your index should be unique per sample name, so files with the same sample name should share the same sample index (a hypothetical renaming sketch follows this list).
Start from align:
- Input files must be in [input dir]/Trimmed.
- Input files must end in *_R1_val_1.fq.gz and *_R2_val_2.fq.gz.
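Here is a hypothetical renaming sketch for the no-index case noted under "Start from trim" (the sample name and index value are illustrative; adjust them to your own files):
cd [input dir]/Demuxed
idx=ACGTACGT
for f in Sample1_R1.fastq.gz Sample1_R2.fastq.gz; do
    mv "$f" "${f/_R/_${idx}_R}"
done
# Result: Sample1_ACGTACGT_R1.fastq.gz and Sample1_ACGTACGT_R2.fastq.gz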
- Adjust your config file in config/01_config.yaml as needed.
These are the variables you must change and their required values:
- raw_data: [absolute path to input dir]
- manual: true
Feel free to adjust the other parameters as desired for your trimming and/or aligning.
Refer to Config File Breakdown for more details.
- Adjust your cluster config file in config/cluster/config.v8+.yaml as needed. If desired, you can change the default resources, or adjust the resource requests per rule in the .smk file; more details in Changing Resource Requests.
Refer to Cluster Config File Breakdown for more details.
- Run a dry run now.
cd AMLClock
snakefile=[absolute path to 01_preprocessing.smk]
configfile=[absolute path to 01_config.yaml]
snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile
Pay close attention to the output Snakemake generates. You should see that the rules Snakemake would run are the remaining rules you need.
- Run the workflow now.
cd AMLClock
snakefile=[absolute path to 01_preprocessing.smk]
configfile=[absolute path to 01_config.yaml]
sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile
- Check and monitor the workflow's progress.
You can see how the overall workflow is proceeding by going to slurm-[sbatchjobid].out.
You can see individual job progress by going to the output directory you set in config/cluster/config.v8+.yaml.
From the Start (Methylation Calling Workflow)
This workflow is very similar to the Preprocessing Workflow.
- Organize input data accordingly.
Input data must be aligned BAM files with names ending in *_R1_val_1_bismark_bt2_pe.bam.
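A quick check that your BAMs follow the expected naming (the bracketed path is a placeholder):
ls [absolute path to input data directory]/*_R1_val_1_bismark_bt2_pe.bam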
- Navigate to config/02_config.yaml. You will need to change the variables to match what you need.
These are the variables you must change and their required values:
- main_dir: [absolute path to input data directory]
- manual: false
- merge: true OR false
NOTE: You specify here whether you want merging to occur. Merging occurs before deduplication and methylation extraction. It merges by sample name and sample index and returns {sample_name}_{sample_index}_R1_val_1_bismark_bt2_pe.bam.
You will also likely need to change the other variables, depending on the parameters you would like to use for deduplication and methylation extraction. Refer to Config File Breakdown for more details.
NOTE: If you are running this workflow right after running the preprocessing workflow, your config[main_dir] = [absolute path to Aligned_paired directory]
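After editing, you can double-check the required values (a minimal sketch; it assumes main_dir, manual, and merge are top-level keys in the YAML file):
grep -E '^(main_dir|manual|merge):' config/02_config.yaml
# Expected to show something along the lines of:
# main_dir: [absolute path to input data directory]
# manual: false
# merge: true   (or false, if you do not want merging)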
- Navigate to config/cluster/config.v8+.yaml.
You can change the variables as needed. Most often, you may want to change the Slurm account, where the output logs go, and how the Slurm jobs are named.
If desired, you can change the default resources, or adjust the resource requests per rule in the .smk file; more details in Changing Resource Requests.
Refer to Cluster Config File Breakdown for more details.
- Perform a Snakemake dry run to confirm that your data will be run correctly.
cd AMLClock
configfile=[absolute path to 02_config.yaml]
snakefile=[absolute path to 02_methylation_calling.smk]
snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile
Pay close attention to the output of this dry run and check that the files Snakemake is expected to generate are correct.
- Now run this workflow using the launch script scripts/run_pipeline.sh.
cd AMLClock
sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile
NOTE: If you want to run the workflow from the start but end earlier, customize it like this:
cd AMLClock
snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile --until [last rule to run]
sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile -u [last rule to run]
- Check and monitor the workflow's progress.
You can see how the overall workflow is proceeding by going to slurm-[sbatchjobid].out.
You can see individual job progress by going to the output directory you set in config/cluster/config.v8+.yaml.
From the Middle (Methylation Calling Workflow)
This is for when you want to start running the methylation calling workflow mid-way.
The only scenario for this workflow is when you have deduplicated files but do not have your aligned files.
Example: I have output files that have been deduplicated. I only need to extract methylation information from them.
- Prepare your input.
Start from deduplicate:
- Input files must be in [input dir]/Deduplicated.
- Input files must end in *_R1_val_1_bismark_bt2_pe.deduplicated.bam.
- Adjust your config file in config/02_config.yaml as needed.
These are the variables you must change and their required values:
- main_dir: [absolute path to input dir]
- manual: true
- merge: false
Feel free to adjust the other parameters as desired for methylation calling.
Refer to Config File Breakdown for more details.
- Adjust your cluster config file in config/cluster/config.v8+.yaml as needed.
If desired, you can change the default resources, or adjust the resource requests per rule in the .smk file; more details in Changing Resource Requests.
Refer to Cluster Config File Breakdown for more details.
- Run a dry run now.
cd AMLClock
snakefile=[absolute path to 02_methylation_calling.smk]
configfile=[absolute path to 02_config.yaml]
snakemake -n --profile config/cluster/ --configfile=$configfile -s $snakefile
Pay close attention to the output Snakemake generates. You should see that the rules Snakemake would run are the remaining rules you need. In this case, it should only run the all and get_methyl_info rules.
- Run the workflow now.
cd AMLClock
snakefile=[absolute path to 02_methylation_calling.smk]
configfile=[absolute path to 02_config.yaml]
sbatch scripts/run_pipeline.sh -c $configfile -s $snakefile
- Check and monitor the workflow's progress.
You can see how the overall workflow is proceeding by going to slurm-[sbatchjobid].out.
You can see individual job progress by going to the output directory you set in config/cluster/config.v8+.yaml.
Config File Breakdown
Use this breakdown to help determine which variables you would want to customize. The breakdown images are in the assets directory (01_config_breakdown.png and 02_config_breakdown.png).

Changing Resource Requests
The default resource requests can be seen by going to the cluster config file.
The breakdown is here: Cluster Config File Breakdown.
I've chosen the defaults and some rule-specific manual adjustments through experimentation with small UKCTOCs samples. For the full explanation, go here.
I've also manually adjusted the resource requests for some big rules in the smk files. To take a look, go to the smk file.
For preprocessing workflow, it will be 01_preprocessing.smk.
For methylation calling workflow, it will be 02_methylation_calling.smk.
NOTE: If you want to do any resource adjusting, I suggest you adjust it manually first, before trying to change the defaults.
To do so:
- Navigate to the relevant smk file.
- Look for rule [target_rule] in the code. You can see if there are any parameters for resources by looking for:
- threads: (specifies # of CPUs)
- resources: (specifies time, mem_mb, gres, etc.)
- Feel free to adjust your resource requests by modifying the number in threads and the values in resources.
If there are no 'threads:' or 'resources:' present under your target rule, you can add them.
Example:
rule target:
    input: ..
    output: ..
    params: ..
    threads: 1
    resources:
        time="04:00:00",
        gres="disk:1024",
        mem_mb=4000
Want to know if your resource requests were appropriate?
Refer to the SlurmJobAssessment tool here

