Skip to content

Commit 0a6b45f

Browse files
authored
Merge branch 'master' into devcontainers-cleanup
2 parents 264a925 + 99df8a2 commit 0a6b45f

File tree

5 files changed

+425
-0
lines changed

5 files changed

+425
-0
lines changed
Lines changed: 208 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,208 @@
1+
# Part 3: moving code into modules
2+
3+
In the first part of this course, you built a variant calling pipeline that was completely linear and processed each sample's data independently of the others.
4+
5+
In the second part, we showed you how to use channels and channel operators to implement joint variant calling with GATK, building on the pipeline from Part 1.
6+
7+
In this part, we'll show you how to convert the code in that workflow into modules. To follow this part of the training, you should have completed Part 1 and Part 2, as well as [Hello Modules](../../../hello_nextflow/hello_modules.md), which covers the basics of modules.
8+
9+
---
10+
11+
## 0. Warmup
12+
13+
When we started developing our workflow, we put everything in one single code file.
14+
Now it's time to tackle **modularizing** our code, _i.e._ extracting the process definitions into modules.
15+
16+
We're going to start with the same workflow as in Part 2, which we've provided for you in the file `genomics-3.nf`.
17+
18+
!!! note
19+
20+
Make sure you're in the correct working directory:
21+
`cd /workspaces/training/nf4-science/genomics`
22+
23+
Let's try running that now.
24+
25+
```bash
26+
nextflow run genomics-3.nf -resume
27+
```
28+
29+
And it works!
30+
31+
```console title="Output"
32+
N E X T F L O W ~ version 24.10.0
33+
34+
Launching `genomics-3.nf` [gloomy_poincare] DSL2 - revision: 43203316e0
35+
36+
executor > local (7)
37+
[18/89dfa4] SAMTOOLS_INDEX (1) | 3 of 3 ✔
38+
[30/b2522b] GATK_HAPLOTYPECALLER (2) | 3 of 3 ✔
39+
[a8/d2c189] GATK_JOINTGENOTYPING | 1 of 1 ✔
40+
```
41+
42+
Like previously, there will now be a `work` directory and a `results_genomics` directory inside your project directory.
43+
44+
### Takeaway
45+
46+
You're ready to start modularizing your workflow.
47+
48+
### What's next?
49+
50+
Move the Genomics workflow's processes into modules.
51+
52+
---
53+
54+
## 1. Move processes into modules
55+
56+
As you learned in [Hello Modules](../../../hello_nextflow/hello_modules.md), you can create a module simply by copying the process definition into its own file, in any directory, and you can name that file anything you want.
57+
58+
For reasons that will become clear later (in particular when we come to testing), in this training we'll follow the convention of naming the file `main.nf`, and placing it in a directory structure named after the tool kit and the command.
59+
60+
### 1.1. Create a module for the `SAMTOOLS_INDEX` process
61+
62+
In the case of the `SAMTOOLS_INDEX` process, 'samtools' is the toolkit and 'index' is the command. So, we'll create a directory structure `modules/samtools/index` and put the `SAMTOOLS_INDEX` process definition in the `main.nf` file inside that directory.
63+
64+
```bash
65+
mkdir -p modules/samtools/index
66+
touch modules/samtools/index/main.nf
67+
```
68+
69+
Open the `main.nf` file and copy the `SAMTOOLS_INDEX` process definition into it, so you end up with something like this:
70+
71+
```groovy title="modules/samtools/index/main.nf" linenums="1"
72+
#!/usr/bin/env nextflow
73+
74+
/*
75+
* Generate BAM index file
76+
*/
77+
process SAMTOOLS_INDEX {
78+
79+
container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'
80+
81+
publishDir params.outdir, mode: 'symlink'
82+
83+
input:
84+
path input_bam
85+
86+
output:
87+
tuple path(input_bam), path("${input_bam}.bai")
88+
89+
script:
90+
"""
91+
samtools index '$input_bam'
92+
"""
93+
}
94+
```
95+
96+
Then, remove the `SAMTOOLS_INDEX` process definition from `genomics-3.nf`, and add an import declaration for the module before the next process definition, like this:
97+
98+
_Before:_
99+
100+
```groovy title="tests/main.nf.test" linenums="1" hl_lines="1"
101+
/*
102+
* Call variants with GATK HaplotypeCaller
103+
*/
104+
process GATK_HAPLOTYPECALLER {
105+
```
106+
107+
_After:_
108+
109+
```groovy title="genomics-3.nf" linenums="1" hl_lines="1 2"
110+
// Include modules
111+
include { SAMTOOLS_INDEX } from './modules/samtools/index/main.nf'
112+
113+
/*
114+
* Call variants with GATK HaplotypeCaller
115+
*/
116+
process GATK_HAPLOTYPECALLER {
117+
```
118+
119+
You can now run the workflow again, and it should still work the same way as before. If you supply the `-resume` flag, no new should even need to be done:
120+
121+
```bash
122+
nextflow run genomics-3.nf -resume
123+
```
124+
125+
```console title="Re-used Output after moving SAMTOOLS_INDEX to a module"
126+
N E X T F L O W ~ version 24.10.0
127+
128+
Launching `genomics-3.nf` [ridiculous_jones] DSL2 - revision: c5a13e17a1
129+
130+
[cf/289c2d] SAMTOOLS_INDEX (2) | 3 of 3, cached: 3 ✔
131+
[30/b2522b] GATK_HAPLOTYPECALLER (1) | 3 of 3, cached: 3 ✔
132+
[a8/d2c189] GATK_JOINTGENOTYPING | 1 of 1, cached: 1 ✔
133+
```
134+
135+
### 1.2. Create a modules for the `GATK_HAPLOTYPECALLER` and `GATK_JOINTGENOTYPING` processes
136+
137+
Repeat the same steps for the remaining processes. You'll need to create a directory for each process, and then create a `main.nf` file inside that directory, removing the process definition from the workflow's `main.nf` file and adding an import declaration for the module. Once you're done, check that your modules directory structure is correct by running:
138+
139+
```bash
140+
tree modules/
141+
```
142+
143+
```console title="Directory structure"
144+
modules/
145+
├── gatk
146+
│ ├── haplotypecaller
147+
│ │ └── main.nf
148+
│ └── jointgenotyping
149+
│ └── main.nf
150+
└── samtools
151+
└── index
152+
└── main.nf
153+
154+
5 directories, 3 files
155+
```
156+
157+
You should also have something like this in the main workflow file, after the parameters section:
158+
159+
```
160+
include { SAMTOOLS_INDEX } from './modules/samtools/index/main.nf'
161+
include { GATK_HAPLOTYPECALLER } from './modules/gatk/haplotypecaller/main.nf'
162+
include { GATK_JOINTGENOTYPING } from './modules/gatk/jointgenotyping/main.nf'
163+
164+
workflow {
165+
```
166+
167+
### Takeaway
168+
169+
You've practiced modularizing a workflow, with the genomics workflow as an example.
170+
171+
### What's next?
172+
173+
Test the modularised workflow.
174+
175+
---
176+
177+
## 2. Test the modularised workflow
178+
179+
Let's try running that now.
180+
181+
```bash
182+
nextflow run genomics-3.nf -resume
183+
```
184+
185+
And it works!
186+
187+
```console title="Output"
188+
N E X T F L O W ~ version 24.10.0
189+
190+
Launching `genomics-3.nf` [gloomy_poincare] DSL2 - revision: 43203316e0
191+
192+
executor > local (7)
193+
[18/89dfa4] SAMTOOLS_INDEX (1) | 3 of 3 ✔
194+
[30/b2522b] GATK_HAPLOTYPECALLER (2) | 3 of 3 ✔
195+
[a8/d2c189] GATK_JOINTGENOTYPING | 1 of 1 ✔
196+
```
197+
198+
Yep, everything still works, including the resumability of the pipeline.
199+
200+
### Takeaway
201+
202+
You've practiced modularizing a workflow, and you've seen that it still works the same way as before.
203+
204+
---
205+
206+
## 3. Summary
207+
208+
So, once again (assuming you followed [Hello Modules](../../../hello_nextflow/hello_modules.md)), you've done all this work and absolutely nothing has changed to how the pipeline works! This is a good thing, because it means that you've modularised your workflow without impacting its function. Importantly, you've laid a foundation for doing things that will make your code more modular and easier to maintain- for example, you can now add tests to your pipeline using the nf-test framework. This is what we'll be looking at in the next part of this course.

mkdocs.yml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,8 @@ nav:
2727
- nf4_science/genomics/00_orientation.md
2828
- nf4_science/genomics/01_per_sample_variant_calling.md
2929
- nf4_science/genomics/02_joint_calling.md
30+
- nf4_science/genomics/03_modules.md
31+
3032
- Nextflow for RNAseq:
3133
- nf4_science/rnaseq/index.md
3234
- nf4_science/rnaseq/00_orientation.md
@@ -186,6 +188,8 @@ plugins:
186188
- advanced/index.md
187189
- advanced/orientation.md
188190
- side_quests/nf-test.md
191+
- nf4_science/genomics/03_modules.md
192+
189193
- i18n:
190194
docs_structure: suffix
191195
fallback_to_default: true

nf4-science/genomics/genomics-3.nf

Lines changed: 148 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,148 @@
1+
#!/usr/bin/env nextflow
2+
3+
/*
4+
* Pipeline parameters
5+
*/
6+
7+
// Primary input (file of input files, one per line)
8+
params.reads_bam = "${projectDir}/data/sample_bams.txt"
9+
10+
// Output directory
11+
params.outdir = "results_genomics"
12+
13+
// Accessory files
14+
params.reference = "${projectDir}/data/ref/ref.fasta"
15+
params.reference_index = "${projectDir}/data/ref/ref.fasta.fai"
16+
params.reference_dict = "${projectDir}/data/ref/ref.dict"
17+
params.intervals = "${projectDir}/data/ref/intervals.bed"
18+
19+
// Base name for final output file
20+
params.cohort_name = "family_trio"
21+
22+
/*
23+
* Generate BAM index file
24+
*/
25+
process SAMTOOLS_INDEX {
26+
27+
container 'community.wave.seqera.io/library/samtools:1.20--b5dfbd93de237464'
28+
29+
publishDir params.outdir, mode: 'symlink'
30+
31+
input:
32+
path input_bam
33+
34+
output:
35+
tuple path(input_bam), path("${input_bam}.bai")
36+
37+
script:
38+
"""
39+
samtools index '$input_bam'
40+
"""
41+
}
42+
43+
/*
44+
* Call variants with GATK HaplotypeCaller
45+
*/
46+
process GATK_HAPLOTYPECALLER {
47+
48+
container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"
49+
50+
publishDir params.outdir, mode: 'symlink'
51+
52+
input:
53+
tuple path(input_bam), path(input_bam_index)
54+
path ref_fasta
55+
path ref_index
56+
path ref_dict
57+
path interval_list
58+
59+
output:
60+
path "${input_bam}.g.vcf" , emit: vcf
61+
path "${input_bam}.g.vcf.idx" , emit: idx
62+
63+
script:
64+
"""
65+
gatk HaplotypeCaller \
66+
-R ${ref_fasta} \
67+
-I ${input_bam} \
68+
-O ${input_bam}.g.vcf \
69+
-L ${interval_list} \
70+
-ERC GVCF
71+
"""
72+
}
73+
74+
/*
75+
* Combine GVCFs into GenomicsDB datastore and run joint genotyping to produce cohort-level calls
76+
*/
77+
process GATK_JOINTGENOTYPING {
78+
79+
container "community.wave.seqera.io/library/gatk4:4.5.0.0--730ee8817e436867"
80+
publishDir params.outdir, mode: 'copy'
81+
82+
input:
83+
path all_gvcfs
84+
path all_idxs
85+
path interval_list
86+
val cohort_name
87+
path ref_fasta
88+
path ref_index
89+
path ref_dict
90+
91+
output:
92+
path "${cohort_name}.joint.vcf" , emit: vcf
93+
path "${cohort_name}.joint.vcf.idx" , emit: idx
94+
95+
script:
96+
def gvcfs_line = all_gvcfs.collect { gvcf -> "-V ${gvcf}" }.join(' ')
97+
"""
98+
gatk GenomicsDBImport \
99+
${gvcfs_line} \
100+
-L ${interval_list} \
101+
--genomicsdb-workspace-path ${cohort_name}_gdb
102+
103+
gatk GenotypeGVCFs \
104+
-R ${ref_fasta} \
105+
-V gendb://${cohort_name}_gdb \
106+
-L ${interval_list} \
107+
-O ${cohort_name}.joint.vcf
108+
"""
109+
}
110+
111+
workflow {
112+
113+
// Create input channel from a text file listing input file paths
114+
reads_ch = Channel.fromPath(params.reads_bam).splitText()
115+
116+
// Load the file paths for the accessory files (reference and intervals)
117+
ref_file = file(params.reference)
118+
ref_index_file = file(params.reference_index)
119+
ref_dict_file = file(params.reference_dict)
120+
intervals_file = file(params.intervals)
121+
122+
// Create index file for input BAM file
123+
SAMTOOLS_INDEX(reads_ch)
124+
125+
// Call variants from the indexed BAM file
126+
GATK_HAPLOTYPECALLER(
127+
SAMTOOLS_INDEX.out,
128+
ref_file,
129+
ref_index_file,
130+
ref_dict_file,
131+
intervals_file
132+
)
133+
134+
// Collect variant calling outputs across samples
135+
all_gvcfs_ch = GATK_HAPLOTYPECALLER.out.vcf.collect()
136+
all_idxs_ch = GATK_HAPLOTYPECALLER.out.idx.collect()
137+
138+
// Combine GVCFs into a GenomicsDB data store and apply joint genotyping
139+
GATK_JOINTGENOTYPING(
140+
all_gvcfs_ch,
141+
all_idxs_ch,
142+
intervals_file,
143+
params.cohort_name,
144+
ref_file,
145+
ref_index_file,
146+
ref_dict_file
147+
)
148+
}

0 commit comments

Comments
 (0)