nextflow-io
diff --git a/docs/nf4_science/genomics/03_modules.md b/docs/nf4_science/genomics/03_modules.md
Lines changed: 208 additions & 0 deletions
diff --git a/mkdocs.yml b/mkdocs.yml
Lines changed: 4 additions & 0 deletions
diff --git a/nf4-science/genomics/genomics-3.nf b/nf4-science/genomics/genomics-3.nf
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,208 @@
+# Part 3: moving code into modules
+
+In the first part of this course, you built a variant calling pipeline that was completely linear and processed each sample's data independently of the others.
+
+In the second part, we showed you how to use channels and channel operators to implement joint variant calling with GATK, building on the pipeline from Part 1.
+
+In this part, we'll show you how to convert the code in that workflow into modules. To follow this part of the training, you should have completed Part 1 and Part 2, as well as [Hello Modules](../../../hello_nextflow/hello_modules.md), which covers the basics of modules.
+
+---
+
+## 0. Warmup
+
+When we started developing our workflow, we put everything in one single code file.
+Now it's time to tackle **modularizing** our code, _i.e._ extracting the process definitions into modules.
+
+We're going to start with the same workflow as in Part 2, which we've provided for you in the file `genomics-3.nf`.
+
+!!! note
+
+     Make sure you're in the correct working directory:
+     `cd /workspaces/training/nf4-science/genomics`
+
+Let's try running that now.
+
+```bash
+nextflow run genomics-3.nf -resume
+```
+
+And it works!
+
+```console title="Output"
+ N E X T F L O W   ~  version 24.10.0
+
+Launching `genomics-3.nf` [gloomy_poincare] DSL2 - revision: 43203316e0
+
+executor >  local (7)
+[18/89dfa4] SAMTOOLS_INDEX (1)       | 3 of 3 ✔
+[30/b2522b] GATK_HAPLOTYPECALLER (2) | 3 of 3 ✔
+[a8/d2c189] GATK_JOINTGENOTYPING     | 1 of 1 ✔
+```
+
+As before, there will now be a `work` directory and a `results_genomics` directory inside your project directory.
+
+### Takeaway
+
+You're ready to start modularizing your workflow.
+
+### What's next?
+
+Move the Genomics workflow's processes into modules.
+
+---
+
+## 1. Move processes into modules
+
+As you learned in [Hello Modules](../../../hello_nextflow/hello_modules.md), you can create a module simply by copying the process definition into its own file, in any directory, and you can name that file anything you want.
+
+For reasons that will become clear later (in particular when we come to testing), in this training we'll follow the convention of naming the file `main.nf` and placing it in a directory structure named after the toolkit and the command.
+
+### 1.1. Create a module for the `SAMTOOLS_INDEX` process
+
+In the case of the `SAMTOOLS_INDEX` process, 'samtools' is the toolkit and 'index' is the command. So, we'll create a directory structure `modules/samtools/index` and put the `SAMTOOLS_INDEX` process definition in the `main.nf` file inside that directory.
+
+```bash
+mkdir -p modules/samtools/index
+touch modules/samtools/index/main.nf
+```
+
+Open the `main.nf` file and copy the `SAMTOOLS_INDEX` process definition into it, so you end up with something like this:
+
+```groovy title="modules/samtools/index/main.nf" linenums="1"
+#!/usr/bin/env nextflow
+
+/*
+ * Generate BAM index file
+ */
+process SAMTOOLS_INDEX {
+
+    container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'
+
+    publishDir params.outdir, mode: 'symlink'
+
+    input:
+        path input_bam
+
+    output:
+        tuple path(input_bam), path("${input_bam}.bai")
+
+    script:
+    """
+    samtools index '$input_bam'
+    """
+}
+```
+
+Then, remove the `SAMTOOLS_INDEX` process definition from `genomics-3.nf`, and add an import declaration for the module before the next process definition, like this:
+
+_Before:_
+
+```groovy title="genomics-3.nf" linenums="1" hl_lines="1"
+/*
+ * Call variants with GATK HaplotypeCaller
+ */
+process GATK_HAPLOTYPECALLER {
+```
+
+_After:_
+
+```groovy title="genomics-3.nf" linenums="1" hl_lines="1 2"
+// Include modules
+include { SAMTOOLS_INDEX } from './modules/samtools/index/main.nf'
+
+/*
+ * Call variants with GATK HaplotypeCaller
+ */
+process GATK_HAPLOTYPECALLER {
+```
+
+You can now run the workflow again, and it should still work the same way as before. If you supply the `-resume` flag, no new work should even need to be done:
+
+```bash
+nextflow run genomics-3.nf -resume
+```
+
+```console title="Re-used Output after moving SAMTOOLS_INDEX to a module"
+ N E X T F L O W   ~  version 24.10.0
+
+Launching `genomics-3.nf` [ridiculous_jones] DSL2 - revision: c5a13e17a1
+
+[cf/289c2d] SAMTOOLS_INDEX (2)       | 3 of 3, cached: 3 ✔
+[30/b2522b] GATK_HAPLOTYPECALLER (1) | 3 of 3, cached: 3 ✔
+[a8/d2c189] GATK_JOINTGENOTYPING     | 1 of 1, cached: 1 ✔
+```
+
+### 1.2. Create modules for the `GATK_HAPLOTYPECALLER` and `GATK_JOINTGENOTYPING` processes
+
+Repeat the same steps for the remaining processes. For each process, create a directory, create a `main.nf` file inside it, copy the process definition into that file, remove the definition from `genomics-3.nf`, and add an import declaration for the module. Once you're done, check that your modules directory structure is correct by running:
+
+```bash
+tree modules/
+```
+
+```console title="Directory structure"
+modules/
+├── gatk
+│   ├── haplotypecaller
+│   │   └── main.nf
+│   └── jointgenotyping
+│       └── main.nf
+└── samtools
+    └── index
+        └── main.nf
+
+5 directories, 3 files
+```
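+
+As a sketch of what one of these modules could look like, here is `modules/gatk/haplotypecaller/main.nf`, assuming the `GATK_HAPLOTYPECALLER` process definition is copied over unchanged from `genomics-3.nf`. The `GATK_JOINTGENOTYPING` module would follow the same pattern in `modules/gatk/jointgenotyping/main.nf`.
+
+```groovy title="modules/gatk/haplotypecaller/main.nf" linenums="1"
+#!/usr/bin/env nextflow
+
+/*
+ * Call variants with GATK HaplotypeCaller
+ * (process definition moved as-is from genomics-3.nf)
+ */
+process GATK_HAPLOTYPECALLER {
+
+    container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"
+
+    publishDir params.outdir, mode: 'symlink'
+
+    input:
+        tuple path(input_bam), path(input_bam_index)
+        path ref_fasta
+        path ref_index
+        path ref_dict
+        path interval_list
+
+    output:
+        path "${input_bam}.g.vcf"     , emit: vcf
+        path "${input_bam}.g.vcf.idx" , emit: idx
+
+    script:
+    """
+    gatk HaplotypeCaller \
+        -R ${ref_fasta} \
+        -I ${input_bam} \
+        -O ${input_bam}.g.vcf \
+        -L ${interval_list} \
+        -ERC GVCF
+    """
+}
+```
+
+Note that the module still refers to `params.outdir`, which is defined in the main workflow file, so that parameter declaration needs to stay in `genomics-3.nf`.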
+
+You should also have something like this in the main workflow file, after the parameters section:
+
+```groovy title="genomics-3.nf"
+include { SAMTOOLS_INDEX } from './modules/samtools/index/main.nf'
+include { GATK_HAPLOTYPECALLER } from './modules/gatk/haplotypecaller/main.nf'
+include { GATK_JOINTGENOTYPING } from './modules/gatk/jointgenotyping/main.nf'
+
+workflow {
+```
+
+### Takeaway
+
+You've practiced modularizing a workflow, with the genomics workflow as an example.
+
+### What's next?
+
+Test the modularized workflow.
+
+---
+
+## 2. Test the modularized workflow
+
+Let's run the fully modularized workflow now.
+
+```bash
+nextflow run genomics-3.nf -resume
+```
+
+And it works!
+
+```console title="Output"
+ N E X T F L O W   ~  version 24.10.0
+
+Launching `genomics-3.nf` [gloomy_poincare] DSL2 - revision: 43203316e0
+
+executor >  local (7)
+[18/89dfa4] SAMTOOLS_INDEX (1)       | 3 of 3 ✔
+[30/b2522b] GATK_HAPLOTYPECALLER (2) | 3 of 3 ✔
+[a8/d2c189] GATK_JOINTGENOTYPING     | 1 of 1 ✔
+```
+
+Yep, everything still works, including the resumability of the pipeline.
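+
+For reference, this is roughly what `genomics-3.nf` should look like at this point, assuming you left the parameter declarations and the `workflow` block exactly as they were in the starting file. Only the three process definitions have been replaced by `include` statements.
+
+```groovy title="genomics-3.nf"
+#!/usr/bin/env nextflow
+
+/*
+ * Pipeline parameters
+ */
+
+// Primary input (file of input files, one per line)
+params.reads_bam = "${projectDir}/data/sample_bams.txt"
+
+// Output directory
+params.outdir = "results_genomics"
+
+// Accessory files
+params.reference        = "${projectDir}/data/ref/ref.fasta"
+params.reference_index  = "${projectDir}/data/ref/ref.fasta.fai"
+params.reference_dict   = "${projectDir}/data/ref/ref.dict"
+params.intervals        = "${projectDir}/data/ref/intervals.bed"
+
+// Base name for final output file
+params.cohort_name = "family_trio"
+
+// Include modules (these replace the inline process definitions)
+include { SAMTOOLS_INDEX } from './modules/samtools/index/main.nf'
+include { GATK_HAPLOTYPECALLER } from './modules/gatk/haplotypecaller/main.nf'
+include { GATK_JOINTGENOTYPING } from './modules/gatk/jointgenotyping/main.nf'
+
+workflow {
+
+    // Create input channel from a text file listing input file paths
+    reads_ch = Channel.fromPath(params.reads_bam).splitText()
+
+    // Load the file paths for the accessory files (reference and intervals)
+    ref_file        = file(params.reference)
+    ref_index_file  = file(params.reference_index)
+    ref_dict_file   = file(params.reference_dict)
+    intervals_file  = file(params.intervals)
+
+    // Create index file for input BAM file
+    SAMTOOLS_INDEX(reads_ch)
+
+    // Call variants from the indexed BAM file
+    GATK_HAPLOTYPECALLER(
+        SAMTOOLS_INDEX.out,
+        ref_file,
+        ref_index_file,
+        ref_dict_file,
+        intervals_file
+    )
+
+    // Collect variant calling outputs across samples
+    all_gvcfs_ch = GATK_HAPLOTYPECALLER.out.vcf.collect()
+    all_idxs_ch = GATK_HAPLOTYPECALLER.out.idx.collect()
+
+    // Combine GVCFs into a GenomicsDB data store and apply joint genotyping
+    GATK_JOINTGENOTYPING(
+        all_gvcfs_ch,
+        all_idxs_ch,
+        intervals_file,
+        params.cohort_name,
+        ref_file,
+        ref_index_file,
+        ref_dict_file
+    )
+}
+```
+
+Because the process names and their inputs and outputs are the same, the calls inside the `workflow` block do not need to change.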
+
+### Takeaway
+
+You've practiced modularizing a workflow, and you've seen that it still works the same way as before.
+
+---
+
+## 3. Summary
+
+So, once again (assuming you followed [Hello Modules](../../../hello_nextflow/hello_modules.md)), you've done all this work and absolutely nothing has changed in how the pipeline works! This is a good thing, because it means that you've modularized your workflow without affecting its function. Importantly, you've laid a foundation for doing things that will make your code more modular and easier to maintain: for example, you can now add tests to your pipeline using the nf-test framework. This is what we'll be looking at in the next part of this course.
@@ -27,6 +27,8 @@ nav:
       - nf4_science/genomics/00_orientation.md
       - nf4_science/genomics/01_per_sample_variant_calling.md
       - nf4_science/genomics/02_joint_calling.md
+      - nf4_science/genomics/03_modules.md
+
   - Nextflow for RNAseq:
       - nf4_science/rnaseq/index.md
       - nf4_science/rnaseq/00_orientation.md
@@ -186,6 +188,8 @@ plugins:
         - advanced/index.md
         - advanced/orientation.md
         - side_quests/nf-test.md
+        - nf4_science/genomics/03_modules.md
+
   - i18n:
       docs_structure: suffix
       fallback_to_default: true
 
@@ -0,0 +1,148 @@
+#!/usr/bin/env nextflow
+
+/*
+ * Pipeline parameters
+ */
+
+// Primary input (file of input files, one per line)
+params.reads_bam = "${projectDir}/data/sample_bams.txt"
+
+// Output directory
+params.outdir = "results_genomics"
+
+// Accessory files
+params.reference        = "${projectDir}/data/ref/ref.fasta"
+params.reference_index  = "${projectDir}/data/ref/ref.fasta.fai"
+params.reference_dict   = "${projectDir}/data/ref/ref.dict"
+params.intervals        = "${projectDir}/data/ref/intervals.bed"
+
+// Base name for final output file
+params.cohort_name = "family_trio"
+
+/*
+ * Generate BAM index file
+ */
+process SAMTOOLS_INDEX {
+
+    container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'
+
+    publishDir params.outdir, mode: 'symlink'
+
+    input:
+        path input_bam
+
+    output:
+        tuple path(input_bam), path("${input_bam}.bai")
+
+    script:
+    """
+    samtools index '$input_bam'
+    """
+}
+
+/*
+ * Call variants with GATK HaplotypeCaller
+ */
+process GATK_HAPLOTYPECALLER {
+
+    container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"
+
+    publishDir params.outdir, mode: 'symlink'
+
+    input:
+        tuple path(input_bam), path(input_bam_index)
+        path ref_fasta
+        path ref_index
+        path ref_dict
+        path interval_list
+
+    output:
+        path "${input_bam}.g.vcf"     , emit: vcf
+        path "${input_bam}.g.vcf.idx" , emit: idx
+
+    script:
+    """
+    gatk HaplotypeCaller \
+        -R ${ref_fasta} \
+        -I ${input_bam} \
+        -O ${input_bam}.g.vcf \
+        -L ${interval_list} \
+        -ERC GVCF
+    """
+}
+
+/*
+ * Combine GVCFs into GenomicsDB datastore and run joint genotyping to produce cohort-level calls
+ */
+process GATK_JOINTGENOTYPING {
+
+    container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"
+    publishDir params.outdir, mode: 'copy'
+
+    input:
+        path all_gvcfs
+        path all_idxs
+        path interval_list
+        val cohort_name
+        path ref_fasta
+        path ref_index
+        path ref_dict
+
+    output:
+        path "${cohort_name}.joint.vcf"     , emit: vcf
+        path "${cohort_name}.joint.vcf.idx" , emit: idx
+
+    script:
+    def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ')
+    """
+    gatk GenomicsDBImport \
+        ${gvcfs_line} \
+        -L ${interval_list} \
+        --genomicsdb-workspace-path ${cohort_name}_gdb
+
+    gatk GenotypeGVCFs \
+        -R ${ref_fasta} \
+        -V gendb://${cohort_name}_gdb \
+        -L ${interval_list} \
+        -O ${cohort_name}.joint.vcf
+    """
+}
+
+workflow {
+
+    // Create input channel from a text file listing input file paths
+    reads_ch = Channel.fromPath(params.reads_bam).splitText()
+
+    // Load the file paths for the accessory files (reference and intervals)
+    ref_file        = file(params.reference)
+    ref_index_file  = file(params.reference_index)
+    ref_dict_file   = file(params.reference_dict)
+    intervals_file  = file(params.intervals)
+
+    // Create index file for input BAM file
+    SAMTOOLS_INDEX(reads_ch)
+
+    // Call variants from the indexed BAM file
+    GATK_HAPLOTYPECALLER(
+        SAMTOOLS_INDEX.out,
+        ref_file,
+        ref_index_file,
+        ref_dict_file,
+        intervals_file
+    )
+
+    // Collect variant calling outputs across samples
+    all_gvcfs_ch = GATK_HAPLOTYPECALLER.out.vcf.collect()
+    all_idxs_ch = GATK_HAPLOTYPECALLER.out.idx.collect()
+
+    // Combine GVCFs into a GenomicsDB data store and apply joint genotyping
+    GATK_JOINTGENOTYPING(
+        all_gvcfs_ch,
+        all_idxs_ch,
+        intervals_file,
+        params.cohort_name,
+        ref_file,
+        ref_index_file,
+        ref_dict_file
+    )
+}