|
| 1 | +# Part 3: moving code into modules |
| 2 | + |
| 3 | +In the first part of this course, you built a variant calling pipeline that was completely linear and processed each sample's data independently of the others. |
| 4 | + |
| 5 | +In the second part, we showed you how to use channels and channel operators to implement joint variant calling with GATK, building on the pipeline from Part 1. |
| 6 | + |
| 7 | +In this part, we'll show you how to convert the code in that workflow into modules. To follow this part of the training, you should have completed Part 1 and Part 2, as well as [Hello Modules](../../../hello_nextflow/hello_modules.md), which covers the basics of modules. |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## 0. Warmup |
| 12 | + |
| 13 | +When we started developing our workflow, we put everything in one single code file. |
| 14 | +Now it's time to tackle **modularizing** our code, _i.e._ extracting the process definitions into modules. |
| 15 | + |
| 16 | +We're going to start with the same workflow as in Part 2, which we've provided for you in the file `genomics-3.nf`. |
| 17 | + |
| 18 | +!!! note |
| 19 | + |
| 20 | + Make sure you're in the correct working directory: |
| 21 | + `cd /workspaces/training/nf4-science/genomics` |
| 22 | + |
| 23 | +Let's try running that now. |
| 24 | + |
| 25 | +```bash |
| 26 | +nextflow run genomics-3.nf -resume |
| 27 | +``` |
| 28 | + |
| 29 | +And it works! |
| 30 | + |
| 31 | +```console title="Output" |
| 32 | + N E X T F L O W ~ version 24.10.0 |
| 33 | + |
| 34 | +Launching `genomics-3.nf` [gloomy_poincare] DSL2 - revision: 43203316e0 |
| 35 | + |
| 36 | +executor > local (7) |
| 37 | +[18/89dfa4] SAMTOOLS_INDEX (1) | 3 of 3 ✔ |
| 38 | +[30/b2522b] GATK_HAPLOTYPECALLER (2) | 3 of 3 ✔ |
| 39 | +[a8/d2c189] GATK_JOINTGENOTYPING | 1 of 1 ✔ |
| 40 | +``` |
| 41 | + |
| 42 | +As before, there will now be a `work` directory and a `results_genomics` directory inside your project directory. |
| 43 | + |
| 44 | +### Takeaway |
| 45 | + |
| 46 | +You're ready to start modularizing your workflow. |
| 47 | + |
| 48 | +### What's next? |
| 49 | + |
| 50 | +Move the Genomics workflow's processes into modules. |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +## 1. Move processes into modules |
| 55 | + |
| 56 | +As you learned in [Hello Modules](../../../hello_nextflow/hello_modules.md), you can create a module simply by copying the process definition into its own file, in any directory, and you can name that file anything you want. |
| 57 | + |
| 58 | +For reasons that will become clear later (in particular when we come to testing), in this training we'll follow the convention of naming the file `main.nf` and placing it in a directory structure named after the toolkit and the command. |
| 59 | + |
| 60 | +### 1.1. Create a module for the `SAMTOOLS_INDEX` process |
| 61 | + |
| 62 | +In the case of the `SAMTOOLS_INDEX` process, 'samtools' is the toolkit and 'index' is the command. So, we'll create a directory structure `modules/samtools/index` and put the `SAMTOOLS_INDEX` process definition in the `main.nf` file inside that directory. |
| 63 | + |
| 64 | +```bash |
| 65 | +mkdir -p modules/samtools/index |
| 66 | +touch modules/samtools/index/main.nf |
| 67 | +``` |
| 68 | + |
| 69 | +Open the `main.nf` file and copy the `SAMTOOLS_INDEX` process definition into it, so you end up with something like this: |
| 70 | + |
| 71 | +```groovy title="modules/samtools/index/main.nf" linenums="1" |
| 72 | +#!/usr/bin/env nextflow |
| 73 | +
|
| 74 | +/* |
| 75 | + * Generate BAM index file |
| 76 | + */ |
| 77 | +process SAMTOOLS_INDEX { |
| 78 | +
|
| 79 | + container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464' |
| 80 | +
|
| 81 | + publishDir params.outdir, mode: 'symlink' |
| 82 | +
|
| 83 | + input: |
| 84 | + path input_bam |
| 85 | +
|
| 86 | + output: |
| 87 | + tuple path(input_bam), path("${input_bam}.bai") |
| 88 | +
|
| 89 | + script: |
| 90 | + """ |
| 91 | + samtools index '$input_bam' |
| 92 | + """ |
| 93 | +} |
| 94 | +``` |
| 95 | + |
| 96 | +Then, remove the `SAMTOOLS_INDEX` process definition from `genomics-3.nf`, and add an import declaration for the module before the next process definition, like this: |
| 97 | + |
| 98 | +_Before:_ |
| 99 | + |
| 100 | +```groovy title="genomics-3.nf" linenums="1" hl_lines="1" |
| 101 | +/* |
| 102 | + * Call variants with GATK HaplotypeCaller |
| 103 | + */ |
| 104 | +process GATK_HAPLOTYPECALLER { |
| 105 | +``` |
| 106 | + |
| 107 | +_After:_ |
| 108 | + |
| 109 | +```groovy title="genomics-3.nf" linenums="1" hl_lines="1 2" |
| 110 | +// Include modules |
| 111 | +include { SAMTOOLS_INDEX } from './modules/samtools/index/main.nf' |
| 112 | +
|
| 113 | +/* |
| 114 | + * Call variants with GATK HaplotypeCaller |
| 115 | + */ |
| 116 | +process GATK_HAPLOTYPECALLER { |
| 117 | +``` |
| 118 | + |
| 119 | +You can now run the workflow again, and it should still work the same way as before. If you supply the `-resume` flag, no new work should even need to be done: |
| 120 | + |
| 121 | +```bash |
| 122 | +nextflow run genomics-3.nf -resume |
| 123 | +``` |
| 124 | + |
| 125 | +```console title="Re-used Output after moving SAMTOOLS_INDEX to a module" |
| 126 | + N E X T F L O W ~ version 24.10.0 |
| 127 | + |
| 128 | +Launching `genomics-3.nf` [ridiculous_jones] DSL2 - revision: c5a13e17a1 |
| 129 | + |
| 130 | +[cf/289c2d] SAMTOOLS_INDEX (2) | 3 of 3, cached: 3 ✔ |
| 131 | +[30/b2522b] GATK_HAPLOTYPECALLER (1) | 3 of 3, cached: 3 ✔ |
| 132 | +[a8/d2c189] GATK_JOINTGENOTYPING | 1 of 1, cached: 1 ✔ |
| 133 | +``` |
| 134 | + |
| 135 | +### 1.2. Create modules for the `GATK_HAPLOTYPECALLER` and `GATK_JOINTGENOTYPING` processes |
| 136 | + |
| 137 | +Repeat the same steps for the remaining processes. For each process, create a directory with a `main.nf` file inside it, move the process definition from `genomics-3.nf` into that file, and add an import declaration for the module to `genomics-3.nf`. Once you're done, check that your modules directory structure is correct by running: |
| 138 | + |
| 139 | +```bash |
| 140 | +tree modules/ |
| 141 | +``` |
| 142 | + |
| 143 | +```console title="Directory structure" |
| 144 | +modules/ |
| 145 | +├── gatk |
| 146 | +│ ├── haplotypecaller |
| 147 | +│ │ └── main.nf |
| 148 | +│ └── jointgenotyping |
| 149 | +│ └── main.nf |
| 150 | +└── samtools |
| 151 | + └── index |
| 152 | + └── main.nf |
| 153 | + |
| 154 | +5 directories, 3 files |
| 155 | +``` |
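
If it helps, the directory setup for the two remaining modules can be done in one go, the same way as for `SAMTOOLS_INDEX`. This is only a minimal sketch: it assumes you are in the project directory and that you use the directory names shown in the tree above. It creates empty files, so you still need to copy each process definition into its `main.nf` yourself.

```shell
# Create one directory per remaining process, named <toolkit>/<command>
mkdir -p modules/gatk/haplotypecaller modules/gatk/jointgenotyping

# Create an empty main.nf in each; paste the matching process definition in afterwards
touch modules/gatk/haplotypecaller/main.nf
touch modules/gatk/jointgenotyping/main.nf
```

Running `tree modules/` after these commands should list both new `main.nf` files under `gatk`.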
| 156 | + |
| 157 | +You should also have something like this in the main workflow file, after the parameters section: |
| 158 | + |
| 159 | +```groovy title="genomics-3.nf" |
| 160 | +include { SAMTOOLS_INDEX } from './modules/samtools/index/main.nf' |
| 161 | +include { GATK_HAPLOTYPECALLER } from './modules/gatk/haplotypecaller/main.nf' |
| 162 | +include { GATK_JOINTGENOTYPING } from './modules/gatk/jointgenotyping/main.nf' |
| 163 | +
|
| 164 | +workflow { |
| 165 | +``` |
| 166 | + |
| 167 | +### Takeaway |
| 168 | + |
| 169 | +You've practiced modularizing a workflow, with the genomics workflow as an example. |
| 170 | + |
| 171 | +### What's next? |
| 172 | + |
| 173 | +Test the modularized workflow. |
| 174 | + |
| 175 | +--- |
| 176 | + |
| 177 | +## 2. Test the modularized workflow |
| 178 | + |
| 179 | +Let's try running that now. |
| 180 | + |
| 181 | +```bash |
| 182 | +nextflow run genomics-3.nf -resume |
| 183 | +``` |
| 184 | + |
| 185 | +And it works! |
| 186 | + |
| 187 | +```console title="Output" |
| 188 | + N E X T F L O W ~ version 24.10.0 |
| 189 | + |
| 190 | +Launching `genomics-3.nf` [gloomy_poincare] DSL2 - revision: 43203316e0 |
| 191 | + |
| 192 | +executor > local (7) |
| 193 | +[18/89dfa4] SAMTOOLS_INDEX (1) | 3 of 3 ✔ |
| 194 | +[30/b2522b] GATK_HAPLOTYPECALLER (2) | 3 of 3 ✔ |
| 195 | +[a8/d2c189] GATK_JOINTGENOTYPING | 1 of 1 ✔ |
| 196 | +``` |
| 197 | + |
| 198 | +Yep, everything still works, including the resumability of the pipeline. |
| 199 | + |
| 200 | +### Takeaway |
| 201 | + |
| 202 | +You've practiced modularizing a workflow, and you've seen that it still works the same way as before. |
| 203 | + |
| 204 | +--- |
| 205 | + |
| 206 | +## 3. Summary |
| 207 | + |
| 208 | +So, once again (assuming you followed [Hello Modules](../../../hello_nextflow/hello_modules.md)), you've done all this work and absolutely nothing has changed in how the pipeline works! This is a good thing, because it means that you've modularized your workflow without affecting its function. Importantly, you've laid a foundation for doing things that will make your code more modular and easier to maintain. For example, you can now add tests to your pipeline using the nf-test framework. This is what we'll be looking at in the next part of this course. |