Output from different fastq pairs mixing in final process #2336

dan-sprague · 2021-09-21T17:55:09Z


Hi everyone,

I have a question about an issue I have not been able to resolve on my own. I wrote a pipeline to analyze RNA-seq data. Each pair of FASTQ files is assembled and then passed into processes that run different analyses.

The final step in the pipeline takes 4 different outputs and passes them to a predictive model. What keeps happening, however, is that files from different pair_ids get mixed together. I'm not really sure how this is possible.

I have included a screenshot below. The general naming pattern is ${pair_id}_processname.tsv, and I underlined where an incorrect $pair_id is being passed into this process. If I run it again, the pair_ids seem to be shuffled essentially at random, i.e. the names won't be the same every time, and sometimes by luck it will run correctly.

Does anyone have any idea why this is happening? I'm sure I am doing something wrong...

#!/usr/bin/env nextflow

params.reads = "$baseDir/data/fastq/*_{1,2}.fastq"
params.viroidStructs = "$baseDir/data/seqStructs/xaaa*"
params.codeDir = ''
params.dataDir = ''



Channel
    .fromFilePairs( params.reads )
    .ifEmpty { error "Cannot find any reads matching: ${params.reads}" }
    .set { read_pairs_ch }



process assemble {
    
    cpus 48 


    input:
    tuple val(pair_id),path(reads) from read_pairs_ch

    output:
    tuple pair_id, "./${pair_id}_assembly/transcripts.fasta" into assembly_ch


    """
    rnaspades.py -1 ${reads[0]} -2 ${reads[1]} -o ./${pair_id}_assembly -t ${task.cpus}
    """

}

process reformat {

    input:
    tuple val(pair_id),path(assembly) from assembly_ch

    output:
    tuple pair_id, "./${pair_id}_shorties.fasta" into assembly_ch1,assembly_ch2,assembly_ch3

    """
    reformat.sh minlength=50 maxlength=700 in=${assembly} out=./${pair_id}_shorties.fasta

    """
}

process logodds {

    input:
    tuple val(pair_id),path(assembly) from assembly_ch1

    output:
    tuple pair_id, "./${pair_id}_scores.txt" into odds_ch

    """
    python ${params.codeDir}/src/hmmLogOdds.py -f $assembly -v ${params.dataDir}/models/pospiHMM.p -n ${params.dataDir}/models/neg_hmm.p -o ./${pair_id}_scores.txt

    """

}

process rnafold {

    cpus 24

    input:
    tuple val(pair_id),path(assembly) from assembly_ch2

    output:
    tuple pair_id, "./${pair_id}_structs_reformat.txt" into structures_ch
    tuple pair_id, "./${pair_id}_mfe.txt" into mfe_ch

    """
    RNAfold --jobs=${task.cpus} --noPS -i ${assembly} > ./${pair_id}_structs.txt
    awk '/^>/ {getline;getline;print \$1}' ./${pair_id}_structs.txt > ./${pair_id}_structs_reformat.txt
    python ${params.codeDir}/src/mfe.py -f ${pair_id}_structs.txt -o ./${pair_id}_mfe.txt
    """

}

process rnadistance {

    cpus 48

    input:
    tuple val(pair_id),path(structures) from structures_ch

    output:
    tuple pair_id,"./${pair_id}_distances.tsv" into distance_ch

    """

    for f in ${params.dataDir}/viroidStructs/* 
    do 
    name="\$(basename -- \${f})"
    cat \${f} ${structures} > \${name}_distance.txt
    done


    parallel -j${task.cpus} 'RNAdistance -Xf < {} > {}_distData.txt' ::: *_distance.txt
    for f in *_distData.txt; do awk '{print \$2}' \${f} > \${f}_fix.txt;done
    paste *_fix.txt > ${pair_id}_distances.tsv

    """



}


process lengths {

    input:
    tuple val(pair_id),path(assembly) from assembly_ch3

    output:
    tuple pair_id,"./${pair_id}_lengths.txt" into lengths_ch

    """

    python ${params.codeDir}/src/lengths.py -f ${assembly} -o ./${pair_id}_lengths.txt

    """


}

process predict {

    publishDir '/home/ubuntu/working/09_2021/09_13_2021/nemo/results'

    input:
    tuple val(pair_id),path(odds) from odds_ch
    tuple val(pair_id2),path(dist) from distance_ch
    tuple val(pair_id3),path(mfe) from mfe_ch
    tuple val(pair_id4),path(lengths) from lengths_ch

    output:
    path "./${pair_id}_predictions.tsv" into prediction

    """
    echo ${odds}
    echo ${lengths}
    echo ${mfe}
    echo ${dist}
    julia ${params.codeDir}/src/predict.jl --odds ${odds} --lengths ${lengths} --mfe ${mfe} --struct ${dist} --model ${params.codeDir}/ref/model_chain.jls --out ./${pair_id}_predictions.tsv
    """




}

Answered by dan-sprague

manuelesimi · 2021-09-21T18:12:10Z

Collaborator

A couple of comments here:

1. You should use the proper qualifier for each element of the output tuples in all your processes. For instance, this:
    tuple pair_id,"./${pair_id}_lengths.txt" into lengths_ch
   should be:
    tuple val(pair_id), path("${pair_id}_lengths.txt") into lengths_ch
2. More importantly, with your current implementation there is no guarantee that the first element of any channel belongs to the first pair. It depends on when tasks complete. If the second pair of reads is smaller than the first pair, it's very likely that the first element in the distance_ch channel belongs to the second pair instead of the first one. In other words, submission order does not determine which pair completes first in any step.
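
To make point 2 concrete, here is a minimal sketch in the same DSL1 syntax as the pipeline above. The sample IDs, sleep times, and the process name fakeWork are made up for illustration and do not come from the original pipeline.

```groovy
#!/usr/bin/env nextflow

// Two hypothetical samples: 'A' is submitted first but takes longer than 'B'.
Channel
    .of( ['A', 3], ['B', 1] )
    .set { samples_ch }

process fakeWork {

    input:
    tuple val(sample_id), val(seconds) from samples_ch

    output:
    val(sample_id) into finished_ch

    """
    sleep ${seconds}
    """
}

// Items reach the output channel as tasks finish, not as they were submitted.
finished_ch.view { "finished: $it" }
```

On a machine that can run both tasks at once, 'B' will typically be reported before 'A'. Because predict declares four separate input channels, it takes one item from each in arrival order, which is how files from different pairs can end up in the same task.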

1 reply

dan-sprague Sep 21, 2021
Author

Thanks, I've updated my code to reflect point 1 you made.

For point 2: This would make a lot of sense. I'm testing the code on the first 100 lines and first 100000 lines of the fastq files, so this is very much the case.

Is there a design pattern that would allow simultaneous execution of these different post-assembly processes, but still guarantee that they are collected by pair, rather than time to completion? Perhaps the collect operator would help, although I think it would just collect out-of-sync outputs... so maybe I need to redesign this?

manuelesimi · 2021-09-21T18:28:19Z

Collaborator

So, you have 4 channels, each emitting 2-element tuples whose first element is the pair ID.

The first idea that comes to mind is to mix the 4 channels into a single channel and then use groupTuple to obtain a tuple (pair_id, [file from P1, ..., file from P4]). There might be more elegant solutions, but this should work.
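
One candidate for a more direct route, sketched here as an untested alternative rather than what was used in this thread, is the join operator, which matches items from two channels on a key (the first tuple element by default). Channel and file names are taken from the pipeline above. The sketch assumes each channel emits exactly one tuple per pair_id and that the upstream outputs use val/path qualifiers.

```groovy
// Join the four result channels on pair_id (index 0, the default key).
// Each join appends the matched item, so every emitted tuple is
// [pair_id, odds, dist, mfe, lengths], whatever order the tasks finished in.
odds_ch
    .join(distance_ch)
    .join(mfe_ch)
    .join(lengths_ch)
    .set { predict_ch }

process predict {

    input:
    tuple val(pair_id), path(odds), path(dist), path(mfe), path(lengths) from predict_ch

    output:
    path "${pair_id}_predictions.tsv" into prediction

    """
    julia ${params.codeDir}/src/predict.jl --odds ${odds} --lengths ${lengths} --mfe ${mfe} --struct ${dist} --model ${params.codeDir}/ref/model_chain.jls --out ./${pair_id}_predictions.tsv
    """
}
```

With join, the position of each file in the tuple follows the order of the join calls, so no sorting by file name is needed.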

0 replies

dan-sprague · 2021-09-21T23:32:17Z

Author

Thanks for your help. mix was the answer, but additional work was required. Unfortunately, the mix operator does not emit items in a fixed order, and for this to work you need to know the order of the files. I don't know Groovy at all, so it took me many hours to figure out how to get the tuple (pair_id, [file 1, file 2, ... file N]) in a known order. The fix ended up being quite elegant, but the toSorted function is presumably found in the Groovy documentation and is nowhere in the Nextflow API docs. The other key realization was that the file paths, which appear as plain strings when you .view() them, turn out to have a .name attribute, which allows you to sort by the base name of the file. Otherwise it would be impossible, because the files are in the randomly named Nextflow work directories.

Anyway, here is the fix for anyone with similar issues:

odds_ch
     .mix(distance_ch,mfe_ch,lengths_ch)
     .groupTuple(by:[0])
     .map { a -> [a[0],a[1].toSorted { b -> b.name}]}
     .set { predict_ch }
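
For anyone wiring this into the final process, here is a sketch of how predict could consume predict_ch. The updated process was not posted in this thread, so the input declaration and the index mapping are assumptions. With the file names used in the pipeline above, sorting by name gives distances, lengths, mfe, scores, because all four files share the same pair_id prefix.

```groovy
process predict {

    input:
    // files is the name-sorted list produced by toSorted:
    //   files[0] = ${pair_id}_distances.tsv
    //   files[1] = ${pair_id}_lengths.txt
    //   files[2] = ${pair_id}_mfe.txt
    //   files[3] = ${pair_id}_scores.txt
    tuple val(pair_id), path(files) from predict_ch

    output:
    path "${pair_id}_predictions.tsv" into prediction

    """
    julia ${params.codeDir}/src/predict.jl --odds ${files[3]} --lengths ${files[1]} --mfe ${files[2]} --struct ${files[0]} --model ${params.codeDir}/ref/model_chain.jls --out ./${pair_id}_predictions.tsv
    """
}
```

The index mapping depends entirely on the file names, so renaming any output changes the order. As written, groupTuple also waits for all upstream channels to finish before emitting anything. Passing size: 4, as in groupTuple(by: 0, size: 4), should let it emit each group as soon as its four files have arrived.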

1 reply

manuelesimi Sep 21, 2021
Collaborator

My apologies, I could have sent you a small snippet with the channel manipulation I mentioned.

But it's good you managed it yourself anyway. You're on your way to becoming a Groovy master!
