nextflow pipeline with bedtools #4055

jml96 · 2023-06-27T15:54:48Z

Hi,
I created a pipeline that runs for each chromosome and uses bedtools.
When I call bedtools from a Singularity image (process.container='/path/to/singularity/image/singularity.img' and singularity.enabled=true), the results for most chromosomes are incomplete. I get the expected results when I change the nextflow.config file to enable conda instead (conda.enabled=true). Bedtools in the Singularity image was installed from continuumio/miniconda3.
An R package that is not available in conda is needed in one of the processes, and the solution I found was to use a Singularity image.
Does anybody know why the results are incomplete when I am using a singularity image?
Thank you.
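
For reference, a minimal nextflow.config sketch of one way to keep conda for the bedtools step while using the image only for the R step. The process names bedtools_step and r_step are hypothetical, and the sketch assumes that a process with no container directive runs outside the image even when Singularity is enabled. It is a possible workaround to test, not an explanation of the incomplete output.

```nextflow
// nextflow.config (sketch; process names below are hypothetical)
conda.enabled       = true
singularity.enabled = true

process {
    // Assumed name of the process that calls bedtools: run it in a conda env
    withName: 'bedtools_step' {
        conda = 'bioconda::bedtools'
    }

    // Assumed name of the process that needs the R package: run it in the image
    withName: 'r_step' {
        container = '/path/to/singularity/image/singularity.img'
    }
}
```

With this layout no process sets both directives, so there is no ambiguity about which environment a task uses.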

jml96 · 2023-07-04T08:28:58Z

Hi,
I realised that the incomplete output files (data from some chromosomes missing) only occur when I run the pipeline with multiple samples. At the moment, for the large majority of cases, I get the expected results (complete files) for a single sample. You can find below the generic script of my pipeline:

process process_1 {
  input:
  tuple val(val1), val(val2), path('input_file')

  output:
  tuple val(val1), val(val2), path('file_1.txt')

  shell:
  '''
  COMMAND TO PROCESS input_file > file_1.txt
  '''
}

process process_2 {
  input:
  tuple val(str1), val(str2), path('file_1.txt'), val(chr), path('reference_chr.txt')

  output:
  // output values must reuse the names declared in the input tuple
  tuple val(str1), val(str2), path('file_2.txt')

  shell:
  '''
  chr=!{chr}
  COMMAND TO PROCESS file_1.txt and reference_chr.txt > file_2.txt
  '''
}

process process_3 {
  input:
  // receives the grouped per-chromosome files; Nextflow stages them as file_21, file_22, ...
  tuple val(tum), val(norm), path('file_2')

  output:
  tuple val(tum), val(norm), path('file_3.txt')

  shell:
  '''
  COMMAND TO PROCESS file_2* > file_3.txt
  '''
}

workflow {
  // list_samples.txt format: sampleID1\tsampleID2\tpath_to_vcf_file
  data = Channel.fromPath('list_samples.txt').splitCsv(sep: '\t')
  // chr_reference.txt format: chr\tpath_to_reference_file
  chr_ref = Channel.fromPath('chr_reference.txt').splitCsv(sep: '\t')

  out_1 = process_1(data)
  out_1_comb = out_1.combine(chr_ref)   // one tuple per sample and chromosome

  out_2 = process_2(out_1_comb)
  out_2_comb = out_2.groupTuple(by: [0, 1])   // gather per-chromosome files for each sample pair
  out_3 = process_3(out_2_comb)
}

I would appreciate it if someone could spot the reason why my pipeline does not work for multiple files. I suspect that it could be linked to simultaneous access to the input files by the channels. I tried using flock -x, but it did not resolve the issue.
Thank you.

João
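
One way to check whether the channel logic itself drops chromosomes, independent of bedtools or the container, is to run the same combine / groupTuple chain on toy values and count what arrives per sample pair. This is a minimal sketch: the sample IDs, chromosome names, and file names are made up, and no process is executed.

```nextflow
// check_channels.nf (diagnostic sketch with hypothetical toy values)
workflow {
    // Stand-ins for the rows of list_samples.txt and chr_reference.txt
    samples = Channel.of(['T1', 'N1', 'a.vcf'], ['T2', 'N2', 'b.vcf'])
    chr_ref = Channel.of(['chr1', 'ref1.txt'], ['chr2', 'ref2.txt'])

    samples
        .combine(chr_ref)                                     // one tuple per sample and chromosome
        .map { s1, s2, vcf, chr, ref -> tuple(s1, s2, chr) }  // keep only the keys and the chromosome
        .groupTuple(by: [0, 1])                               // collect chromosomes per sample pair
        .view { s1, s2, chrs -> "${s1}/${s2}: ${chrs.size()} chromosomes" }
}
```

If every sample pair reports the full number of chromosomes here, the loss is more likely to happen inside the tasks than in the channel operators. The same view call can be attached to out_2_comb in the real workflow to count the grouped files.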

colindaven · 2023-08-08T12:11:19Z

It would be better to use channels to specify your file inputs, rather than hard-coding exact file names in your process inputs.

I commonly use something like this in the workflow, where params.input_genomes_path is set to something like input_genomes_path = "/data/*.fasta":

    input_genomes = Channel.fromPath(params.input_genomes_path, checkIfExists: true)

    // run processes

    // align proteins vs all genomes in list
    miniprot1(input_proteins_faa, input_genomes)

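Nextflow will then run the miniprot1 process once for each genome, whether there are 1 or 100, provided input_proteins_faa is a value channel (for example a single file). If both inputs are queue channels, Nextflow pairs their items one by one rather than forming every combination, so combine the channels first if you need all pairs.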
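
A self-contained sketch of that pattern, assuming a single protein file passed as a value channel so it is reused for every genome. The miniprot1 body here is only a placeholder that echoes its inputs, and params.input_proteins_path with its default is a hypothetical name, not taken from the original workflow.

```nextflow
// Hypothetical defaults; override on the command line as needed
params.input_genomes_path  = "/data/*.fasta"
params.input_proteins_path = "/data/proteins.faa"

// Placeholder stand-in for the real alignment process
process miniprot1 {
    input:
    path proteins
    path genome

    output:
    stdout

    script:
    """
    echo "align ${proteins} vs ${genome}"
    """
}

workflow {
    // Queue channel: one item per matching genome file
    input_genomes = Channel.fromPath(params.input_genomes_path, checkIfExists: true)

    // Value channel: the same protein file is paired with every genome
    input_proteins_faa = Channel.value(file(params.input_proteins_path))

    miniprot1(input_proteins_faa, input_genomes)
    miniprot1.out.view()
}
```

The process takes plain path inputs without fixed staging names, so each task sees the files under their original names.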
