It may be that your process is expecting a specific filename, and the output of the previous process doesn't match it. This is not uncommon in nf-core pipelines, and is usually solved with something like the snippet below in a configuration file:

```groovy
process {
    withName: PICARD_MARKDUPLICATES {
        ext.prefix = { "output_${meta.id}" }
    }
}
```

A first step should be to check the task directory of the failed task and see which input files are linked there, if any. If you can share a minimal reproducible example or a publicly available pipeline for me to check, I can try to reproduce the issue on my side and work on a solution 😄
-
Hello,
I am running a variant calling pipeline using GATK4, written as a containerized Nextflow script for analyzing BGI sequencing data, which I submit as a Slurm job to our HPC cluster. I set my input to the reads I want analyzed; the workflow begins and completes step 1, but fails at step 2, saying that the `_aligned_reads.sam` file (the output of step 1) does not exist. The process/error is below:
executor > local (2)
[7b/56c3d5] process > align (1) [100%] 1 of 1 ✔
[83/d89167] process > markDuplicatesSpark (1) [ 0%] 0 of 1
[- ] process > getMetrics -
[- ] process > haplotypeCaller -
[- ] process > selectVariants -
[- ] process > filterSnps -
[- ] process > filterIndels -
[- ] process > bqsr -
[- ] process > analyzeCovariates -
[- ] process > snpEff -
[- ] process > qc -
Error executing process > 'markDuplicatesSpark (1)'

Caused by:
  Process `markDuplicatesSpark (1)` terminated with an error exit status (2)

Command executed:

  mkdir -p /scratch/projects/oleksyk-lab/gatk4/gatk_temp/furious_hamilton/
  gatk --java-options "-Djava.io.tmpdir=/scratch/projects/oleksyk-lab/gatk4/gatk_temp/furious_hamilton/" MarkDuplicatesSpark -I _aligned_reads.sam -M _dedup_metrics.txt -O _sorted_dedup.bam
  rm -r /scratch/projects/oleksyk-lab/gatk4/gatk_temp/furious_hamilton/

Command exit status:
  2

Command output:
  (empty)

Command error:
18:17:56.068 INFO ContextHandler - Started o.s.j.s.ServletContextHandler@51e0f2eb{/api,null,AVAILABLE,@Spark}
18:17:56.069 INFO ContextHandler - Started o.s.j.s.ServletContextHandler@aa794a3{/jobs/job/kill,null,AVAILABLE,@Spark}
18:17:56.069 INFO ContextHandler - Started o.s.j.s.ServletContextHandler@22cb8e5f{/stages/stage/kill,null,AVAILABLE,@Spark}
18:17:56.072 INFO ContextHandler - Started o.s.j.s.ServletContextHandler@5ca8c904{/metrics/json,null,AVAILABLE,@Spark}
18:17:56.076 INFO MarkDuplicatesSpark - Spark verbosity set to INFO (see --spark-verbosity argument)
18:17:56.118 INFO GoogleHadoopFileSystemBase - GHFS version: 1.9.4-hadoop3
WARNING 2023-10-09 18:17:56 SamReaderFactory Unable to detect file format from input URL or stream, assuming SAM format.
WARNING 2023-10-09 18:17:56 SamReaderFactory Unable to detect file format from input URL or stream, assuming SAM format.
18:17:56.286 INFO MemoryStore - Block broadcast_0 stored as values in memory (estimated size 1540.3 KiB, free 17.8 GiB)
18:17:56.593 INFO MemoryStore - Block broadcast_0_piece0 stored as bytes in memory (estimated size 68.4 KiB, free 17.8 GiB)
18:17:56.596 INFO BlockManagerInfo - Added broadcast_0_piece0 in memory on hpc-compute-p36.cm.cluster:44093 (size: 68.4 KiB, free: 17.8 GiB)
18:17:56.599 INFO SparkContext - Created broadcast 0 from broadcast at SamSource.java:78
18:17:56.719 INFO MemoryStore - Block broadcast_1 stored as values in memory (estimated size 188.3 KiB, free 17.8 GiB)
18:17:56.741 INFO MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 41.8 KiB, free 17.8 GiB)
18:17:56.742 INFO BlockManagerInfo - Added broadcast_1_piece0 in memory on hpc-compute-p36.cm.cluster:44093 (size: 41.8 KiB, free: 17.8 GiB)
18:17:56.742 INFO SparkContext - Created broadcast 1 from newAPIHadoopFile at SamSource.java:108
18:17:56.833 INFO BlockManagerInfo - Removed broadcast_1_piece0 on hpc-compute-p36.cm.cluster:44093 in memory (size: 41.8 KiB, free: 17.8 GiB)
18:17:56.837 INFO BlockManagerInfo - Removed broadcast_0_piece0 on hpc-compute-p36.cm.cluster:44093 in memory (size: 68.4 KiB, free: 17.8 GiB)
WARNING 2023-10-09 18:17:56 SamReaderFactory Unable to detect file format from input URL or stream, assuming SAM format.
WARNING 2023-10-09 18:17:56 SamReaderFactory Unable to detect file format from input URL or stream, assuming SAM format.
18:17:56.903 INFO MemoryStore - Block broadcast_2 stored as values in memory (estimated size 1540.3 KiB, free 17.8 GiB)
18:17:56.912 INFO MemoryStore - Block broadcast_2_piece0 stored as bytes in memory (estimated size 68.4 KiB, free 17.8 GiB)
18:17:56.913 INFO BlockManagerInfo - Added broadcast_2_piece0 in memory on hpc-compute-p36.cm.cluster:44093 (size: 68.4 KiB, free: 17.8 GiB)
18:17:56.914 INFO SparkContext - Created broadcast 2 from broadcast at SamSource.java:78
18:17:56.917 INFO MemoryStore - Block broadcast_3 stored as values in memory (estimated size 188.3 KiB, free 17.8 GiB)
18:17:56.927 INFO MemoryStore - Block broadcast_3_piece0 stored as bytes in memory (estimated size 41.8 KiB, free 17.8 GiB)
18:17:56.928 INFO BlockManagerInfo - Added broadcast_3_piece0 in memory on hpc-compute-p36.cm.cluster:44093 (size: 41.8 KiB, free: 17.8 GiB)
18:17:56.928 INFO SparkContext - Created broadcast 3 from newAPIHadoopFile at SamSource.java:108
18:17:56.974 INFO BlockManagerInfo - Removed broadcast_2_piece0 on hpc-compute-p36.cm.cluster:44093 in memory (size: 68.4 KiB, free: 17.8 GiB)
18:17:56.977 INFO BlockManagerInfo - Removed broadcast_3_piece0 on hpc-compute-p36.cm.cluster:44093 in memory (size: 41.8 KiB, free: 17.8 GiB)
18:17:56.978 INFO AbstractConnector - Stopped Spark@5cb6966{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
18:17:56.981 INFO SparkUI - Stopped Spark web UI at http://hpc-compute-p36.cm.cluster:4040
18:17:56.989 INFO MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
18:17:57.004 INFO MemoryStore - MemoryStore cleared
18:17:57.004 INFO BlockManager - BlockManager stopped
18:17:57.006 INFO BlockManagerMaster - BlockManagerMaster stopped
18:17:57.008 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
18:17:57.016 INFO SparkContext - Successfully stopped SparkContext
18:17:57.016 INFO MarkDuplicatesSpark - Shutting down engine
[October 9, 2023 at 6:17:57 PM EDT] org.broadinstitute.hellbender.tools.spark.transforms.markduplicates.MarkDuplicatesSpark done. Elapsed time: 0.06 minutes.
Runtime.totalMemory()=285212672
A USER ERROR has occurred: Failed to load reads from _aligned_reads.sam
Caused by: Input path does not exist: file:_aligned_reads.sam
More info on what I'm running is below.

Config file:

```groovy
// Required Parameters
params.reads = "/projects/oleksyk-lab/Kenneth/Golden_Standard/BGI/{E150016531_L01_75_1.fq.gz,E150016531_L01_75_2.fq.gz}"
params.ref = "/projects/oleksyk-lab/Kenneth/Golden_Standard/References/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"
params.outdir = "/scratch/projects/oleksyk-lab/gatk4"
params.snpeff_db = "GRCh38.105"
params.pl = "bgi"
params.pm = "dnbseq"

// Set the Nextflow working directory
// By default this gets set to params.outdir + '/nextflow_work_dir'
workDir = params.outdir + '/nextflow_work_dir'
```
Slurm script (DSL1 pipeline):

```bash
module load bwa
module load GATK
export NXF_VER=22.10.7
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
source activate nf-env
nextflow run main.nf -c goldstandardnextflow.config
```
I cannot find anyone with this error and I'm very confused as to why I am receiving it. Any help is greatly appreciated!!
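For reference, the error `Input path does not exist: file:_aligned_reads.sam` usually means the SAM file was never staged into the task directory, i.e. it was not declared as an input of `markDuplicatesSpark`; the empty prefix (nothing before the underscore) also suggests the variable used to build the filename is empty. A minimal DSL1 sketch of how the two steps are typically wired, with hypothetical channel and variable names (`read_pairs_ch`, `aligned_reads_ch`, `pair_id`):

```groovy
// Hedged DSL1 sketch: the SAM file must be declared as an input of the
// downstream process so Nextflow stages it into that task's work directory.
process align {
    input:
    tuple val(pair_id), file(reads) from read_pairs_ch

    output:
    // The filename is prefixed with pair_id; an empty pair_id would
    // produce exactly "_aligned_reads.sam"
    tuple val(pair_id), file("${pair_id}_aligned_reads.sam") into aligned_reads_ch

    script:
    """
    bwa mem ${params.ref} ${reads} > ${pair_id}_aligned_reads.sam
    """
}

process markDuplicatesSpark {
    input:
    // Without this declaration, the SAM file never appears in the task dir
    tuple val(pair_id), file(aligned_reads) from aligned_reads_ch

    output:
    tuple val(pair_id), file("${pair_id}_sorted_dedup.bam") into sorted_dedup_ch

    script:
    """
    gatk MarkDuplicatesSpark \\
        -I ${aligned_reads} \\
        -M ${pair_id}_dedup_metrics.txt \\
        -O ${pair_id}_sorted_dedup.bam
    """
}
```

Checking the failed task's work directory (the path under `83/d89167` in this run) for a symlinked `*_aligned_reads.sam` is the quickest way to confirm which of the two problems applies.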