
Commit 76c76b0

abhi18av, mribeirodantas, ewels, pditommaso authored

Blog post: Analysing caching behavior of pipelines (#70)
Signed-off-by: Marcel Ribeiro-Dantas <[email protected]> Co-authored-by: Marcel Ribeiro-Dantas <[email protected]> Co-authored-by: Phil Ewels <[email protected]> Co-authored-by: Paolo Di Tommaso <[email protected]>
1 parent 65bab59 commit 76c76b0

File tree

5 files changed: +256 -1 lines changed

.gitignore

Lines changed: 2 additions & 1 deletion

```diff
@@ -1,5 +1,6 @@
 .idea
-output/docs/
+output/
 assets/docs/latest/en/
 stuff/
 .DS_Store
+output
```

assets/img/abhinav.jpg (7.91 KB)

assets/img/rnaseq-nf.base.png (40.8 KB)

assets/img/rnaseq-nf.modified.png (37.3 KB)
Lines changed: 254 additions & 0 deletions

title=Analyzing caching behavior of pipelines
date=2022-11-10
type=post
description=A guide to analysis of dump-hashes for troubleshooting caching behavior.
image=img/rnaseq-nf.fastqc.modified.png
tags=nextflow,cache
status=published
author=Abhinav Sharma
icon=abhinav.jpg
~~~~~~

The ability to resume an analysis (i.e. caching) is one of the core strengths of Nextflow. When developing pipelines, this allows us to avoid re-running unchanged processes by simply appending `-resume` to the `nextflow run` command. Sometimes, tasks may be repeated for reasons that are unclear. In these cases, it can help to look into the caching mechanism to understand why a specific process was re-run.

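For example, a typical edit-and-rerun cycle looks like this (the script path and profile below are illustrative):

```console
$ nextflow run ./main.nf -profile docker           # initial, complete run
$ nextflow run ./main.nf -profile docker -resume   # re-uses cached task results where possible
```
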
We have previously written about Nextflow's [resume functionality](https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html) as well as some [troubleshooting strategies](https://www.nextflow.io/blog/2019/troubleshooting-nextflow-resume.html) to gain more insight into its caching behavior.

In this post, we will take a more hands-on approach and highlight some strategies which we can use to understand what is causing a particular process (or processes) to re-run instead of using the cache from previous runs of the pipeline. To demonstrate this, we will introduce a minor change into one of the process definitions in the [nextflow-io/rnaseq-nf](https://github.com/nextflow-io/rnaseq-nf) pipeline and investigate how it affects the overall caching behavior compared to the initial execution of the pipeline.

### Local setup for the test

First, we clone the [nextflow-io/rnaseq-nf](https://github.com/nextflow-io/rnaseq-nf) pipeline locally:

```bash
$ git clone https://github.com/nextflow-io/rnaseq-nf
$ cd rnaseq-nf
```

In the examples below, we have used Nextflow `v22.10.0`, Docker `v20.10.8` and Java `v17 LTS` on macOS.

### Pipeline flowchart

The flowchart below can help in understanding the design of the pipeline and the dependencies between the various tasks.

![rnaseq-nf](/img/rnaseq-nf.base.png)

### Logs from initial (fresh) run

As a reminder, Nextflow generates a unique task hash (e.g. `22/7548fa…`) for each task in a workflow. The hash takes into account the complete file path, the last modified timestamp, the container ID and the content of the script section, among other factors. If any of these change, the task will be re-executed. Nextflow maintains a list of task hashes for caching and traceability purposes. You can learn more about task hashes in the article [Troubleshooting Nextflow resume](https://nextflow.io/blog/2019/troubleshooting-nextflow-resume.html).

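As a rough illustration of the principle (a sketch using `md5sum`; this is not how Nextflow actually computes task hashes), hashing the same set of task properties always yields the same digest, while changing any single component produces a completely different one, and hence a cache miss:

```bash
# Conceptual illustration only -- not Nextflow's real hashing code.
# Hashing the same "task properties" twice gives the same digest:
printf '%s\n' 'quay.io/nextflow/rnaseq-nf:v1.1' 'fastqc.sh "$sample_id" "$reads"' | md5sum

# Changing any single component (here, the script) yields an entirely different digest:
printf '%s\n' 'quay.io/nextflow/rnaseq-nf:v1.1' 'fastqc.sh "$sample_id" "$reads" -t 2' | md5sum
```
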
To have something to compare to, we first need to generate the initial hashes for the unchanged processes in the pipeline. We save these in a file called `fresh_run.log` and use them later on as the "ground truth" for the analysis. In order to save the process hashes we use the `-dump-hashes` flag, which prints them to the log.

**TIP:** We rely upon the [`-log` option](https://www.nextflow.io/docs/latest/cli.html#execution-logs) of the `nextflow` command line interface to supply a custom log file name instead of the default `.nextflow.log`.

```console
$ nextflow -log fresh_run.log run ./main.nf -profile docker -dump-hashes

[...truncated…]
executor > local (4)
[d5/57c2bb] process > RNASEQ:INDEX (ggal_1_48850000_49020000) [100%] 1 of 1 ✔
[25/433b23] process > RNASEQ:FASTQC (FASTQC on ggal_gut) [100%] 1 of 1 ✔
[03/23372f] process > RNASEQ:QUANT (ggal_gut) [100%] 1 of 1 ✔
[38/712d21] process > MULTIQC [100%] 1 of 1 ✔
[...truncated…]
```

### Edit the `FASTQC` process

After the initial run of the pipeline, we introduce a change in the `fastqc` module, hard-coding the number of threads which should be used to run the `FASTQC` process via Nextflow's [`cpus` directive](https://www.nextflow.io/docs/latest/process.html#cpus).

Here's the output of `git diff` on the contents of the `modules/fastqc/main.nf` file:

```diff
--- a/modules/fastqc/main.nf
+++ b/modules/fastqc/main.nf
@@ -4,6 +4,7 @@ process FASTQC {
     tag "FASTQC on $sample_id"
     conda 'bioconda::fastqc=0.11.9'
     publishDir params.outdir, mode:'copy'
+    cpus 2
 
     input:
     tuple val(sample_id), path(reads)
@@ -13,6 +14,6 @@ process FASTQC {
 
     script:
     """
-    fastqc.sh "$sample_id" "$reads"
+    fastqc.sh "$sample_id" "$reads" -t ${task.cpus}
     """
 }
```

### Logs from the follow-up run

Next, we run the pipeline again with the `-resume` option, which instructs Nextflow to rely upon the cached results from the previous run and only run the parts of the pipeline which have changed. As before, we instruct Nextflow to dump the process hashes, this time into a file called `resumed_run.log`.

```console
$ nextflow -log resumed_run.log run ./main.nf -profile docker -dump-hashes -resume

[...truncated…]
executor > local
[d5/57c2bb] process > RNASEQ:INDEX (ggal_1_48850000_49020000) [100%] 1 of 1, cached: 1 ✔
[55/15b609] process > RNASEQ:FASTQC (FASTQC on ggal_gut) [100%] 1 of 1 ✔
[03/23372f] process > RNASEQ:QUANT (ggal_gut) [100%] 1 of 1, cached: 1 ✔
[f3/f1ccb4] process > MULTIQC [100%] 1 of 1 ✔
[...truncated…]
```

## Analysis of cache hashes

From the summary of the command line output above, we can see that the `RNASEQ:FASTQC (FASTQC on ggal_gut)` and `MULTIQC` processes were re-run while the others were cached. To understand why, we can examine the hashes generated by the processes from the logs of the `fresh_run` and `resumed_run`.

For the analysis, we need to keep in mind that:

1. The time-stamps in the two log files are expected to differ and can safely be ignored; we therefore narrow the `grep` pattern down to the entries from the Nextflow `TaskProcessor` class.

2. The _order_ of the log entries isn't fixed, due to the underlying parallel dataflow model used by Nextflow. For example, in our case, `FASTQC` ran first in `fresh_run.log` but wasn't the first logged process in `resumed_run.log`.

### Find the process-level hashes

We can use standard Unix tools like `grep`, `cut` and `sort` to address these points and filter out the relevant information:

1. Use `grep` to isolate the log entries containing the `cache hash` string
2. Remove the prefix time-stamps using `cut -d '-' -f 3`
3. Remove the caching mode related information using `cut -d ';' -f 1`
4. Sort the lines based on process names using `sort` to have a standard order before comparison
5. Use `tee` to print the resultant strings to the terminal and simultaneously save them to a file

Now, let’s apply these transformations to the `fresh_run.log` as well as `resumed_run.log` entries.

- `fresh_run.log`

```console
$ cat ./fresh_run.log | grep 'INFO.*TaskProcessor.*cache hash' | cut -d '-' -f 3 | cut -d ';' -f 1 | sort | tee ./fresh_run.tasks.log

[MULTIQC] cache hash: 167d7b39f7efdfc49b6ff773f081daef
[RNASEQ:FASTQC (FASTQC on ggal_gut)] cache hash: 47e8c58d92dbaafba3c2ccc4f89f53a4
[RNASEQ:INDEX (ggal_1_48850000_49020000)] cache hash: ac8be293e1d57f3616cdd0adce34af6f
[RNASEQ:QUANT (ggal_gut)] cache hash: d8b88e3979ff9fe4bf64b4e1bfaf4038
```

- `resumed_run.log`

```console
$ cat ./resumed_run.log | grep 'INFO.*TaskProcessor.*cache hash' | cut -d '-' -f 3 | cut -d ';' -f 1 | sort | tee ./resumed_run.tasks.log

[MULTIQC] cache hash: d3f200c56cf00b223282f12f06ae8586
[RNASEQ:FASTQC (FASTQC on ggal_gut)] cache hash: 92478eeb3b0ff210ebe5a4f3d99aed2d
[RNASEQ:INDEX (ggal_1_48850000_49020000)] cache hash: ac8be293e1d57f3616cdd0adce34af6f
[RNASEQ:QUANT (ggal_gut)] cache hash: d8b88e3979ff9fe4bf64b4e1bfaf4038
```

### Inference from process top-level hashes

Computing a hash is a multi-step process, and various factors contribute to it, such as the inputs of the process, the platform, the time-stamps of the input files and more (as explained in the [Demystifying Nextflow resume](https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html) blog post). The change we made in the task-level `cpus` directive and the script section of the `FASTQC` process triggered a re-computation of the hashes:

```diff
--- ./fresh_run.tasks.log
+++ ./resumed_run.tasks.log
@@ -1,4 +1,4 @@
-[MULTIQC] cache hash: dccabcd012ad86e1a2668e866c120534
-[RNASEQ:FASTQC (FASTQC on ggal_gut)] cache hash: 94be8c84f4bed57252985e6813bec401
+[MULTIQC] cache hash: c5a63560338596282682cc04ff97e436
+[RNASEQ:FASTQC (FASTQC on ggal_gut)] cache hash: 54aa712db7c8248e7f31d5fb6535ff9d
 [RNASEQ:INDEX (ggal_1_48850000_49020000)] cache hash: 356aaa7524fb071f258480ba07c67b3c
 [RNASEQ:QUANT (ggal_gut)] cache hash: 169ced0fc4b047eaf91cd31620b22540
```

Even though we only introduced changes in `FASTQC`, the `MULTIQC` process was re-run since it relies upon the output of the `FASTQC` process. Any task that has its cache hash invalidated triggers a rerun of all downstream steps:

![rnaseq-nf after modification](/img/rnaseq-nf.modified.png)

### Understanding why `FASTQC` was re-run

We can see the full list of `FASTQC` process hashes within the `fresh_run.log` file:

```console
[...truncated…]
Nov-03 20:19:13.827 [Actor Thread 6] INFO nextflow.processor.TaskProcessor - [RNASEQ:FASTQC (FASTQC on ggal_gut)] cache hash: 54aa712db7c8248e7f31d5fb6535ff9d; mode: STANDARD; entries:
  1a0e496fef579b22998f099981b494f9 [java.util.UUID] a11bf24f-638a-42d6-8b50-48d3be637d54
  195c7faea83c75f2340eb710d8486d2a [java.lang.String] RNASEQ:FASTQC
  2bea0eee5e384bd6082a173772e939eb [java.lang.String] """
    fastqc.sh "$sample_id" "$reads" -t ${task.cpus}
    """

  8e58c0cec3bde124d5d932c7f1579395 [java.lang.String] quay.io/nextflow/rnaseq-nf:v1.1
  7ec7cbd71ff757f5fcdbaa760c9ce6de [java.lang.String] sample_id
  16b4905b1545252eb7cbfe7b2a20d03d [java.lang.String] ggal_gut
  553096c532e666fb42214fdf0520fe4a [java.lang.String] reads
  6a5d50e32fdb3261e3700a30ad257ff9 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/home/abhinav/rnaseq-nf/data/ggal/ggal_gut_1.fq, storePath:/home/abhinav/rnaseq-nf/data/ggal/ggal_gut_1.fq, stageName:ggal_gut_1.fq), FileHolder(sourceObj:/home/abhinav/rnaseq-nf/data/ggal/ggal_gut_2.fq, storePath:/home/abhinav/rnaseq-nf/data/ggal/ggal_gut_2.fq, stageName:ggal_gut_2.fq)]
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
  80a8708c1f85f9e53796b84bd83471d3 [java.util.HashMap$EntrySet] [task.cpus=2]
  f46c56757169dad5c65708a8f892f414 [sun.nio.fs.UnixPath] /home/abhinav/rnaseq-nf/bin/fastqc.sh
[...truncated…]
```

When we isolate and compare the log entries for `FASTQC` between `fresh_run.log` and `resumed_run.log`, we see the following diff:

```diff
--- ./fresh_run.fastqc.log
+++ ./resumed_run.fastqc.log
@@ -1,8 +1,8 @@
-INFO nextflow.processor.TaskProcessor - [RNASEQ:FASTQC (FASTQC on ggal_gut)] cache hash: 94be8c84f4bed57252985e6813bec401; mode: STANDARD; entries:
+INFO nextflow.processor.TaskProcessor - [RNASEQ:FASTQC (FASTQC on ggal_gut)] cache hash: 54aa712db7c8248e7f31d5fb6535ff9d; mode: STANDARD; entries:
   1a0e496fef579b22998f099981b494f9 [java.util.UUID] a11bf24f-638a-42d6-8b50-48d3be637d54
   195c7faea83c75f2340eb710d8486d2a [java.lang.String] RNASEQ:FASTQC
-  43e5a23fc27129f92a6c010823d8909b [java.lang.String] """
-    fastqc.sh "$sample_id" "$reads"
+  2bea0eee5e384bd6082a173772e939eb [java.lang.String] """
+    fastqc.sh "$sample_id" "$reads" -t ${task.cpus}
```

Observations from the diff:

1. We can see that the content of the script has changed, highlighting the new `-t ${task.cpus}` part of the command.
2. There is a new entry in the `resumed_run.log` showing that the content of the process-level `cpus` directive has been added to the hashed entries.

In other words, the diff from the log files confirms our edits.

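As a side note, per-process extracts like the `fresh_run.fastqc.log` and `resumed_run.fastqc.log` files referenced above can be produced with a `grep` pipeline similar to the one used earlier; here is one possible sketch (the `-A 15` context size is an assumption and may need adjusting to capture the full entry):

```bash
# Pull out the FASTQC cache-hash entry (plus the lines that follow it) from each log;
# -A 15 is a guess at how many entry lines follow the header line.
grep -A 15 'TaskProcessor.*FASTQC.*cache hash' fresh_run.log   > fresh_run.fastqc.log
grep -A 15 'TaskProcessor.*FASTQC.*cache hash' resumed_run.log > resumed_run.fastqc.log

# Compare the two extracts; the same approach works for the MULTIQC entries below.
diff -u fresh_run.fastqc.log resumed_run.fastqc.log
```
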
### Understanding why `MULTIQC` was re-run

Now, we apply the same analysis technique to the `MULTIQC` process in both log files:

```diff
--- ./fresh_run.multiqc.log
+++ ./resumed_run.multiqc.log
@@ -1,4 +1,4 @@
-INFO nextflow.processor.TaskProcessor - [MULTIQC] cache hash: dccabcd012ad86e1a2668e866c120534; mode: STANDARD; entries:
+INFO nextflow.processor.TaskProcessor - [MULTIQC] cache hash: c5a63560338596282682cc04ff97e436; mode: STANDARD; entries:
   1a0e496fef579b22998f099981b494f9 [java.util.UUID] a11bf24f-638a-42d6-8b50-48d3be637d54
   cd584abbdbee0d2cfc4361ee2a3fd44b [java.lang.String] MULTIQC
   56bfc44d4ed5c943f30ec98b22904eec [java.lang.String] """
@@ -9,8 +9,9 @@
 
   8e58c0cec3bde124d5d932c7f1579395 [java.lang.String] quay.io/nextflow/rnaseq-nf:v1.1
   14ca61f10a641915b8c71066de5892e1 [java.lang.String] *
-  cd0e6f1a382f11f25d5cef85bd87c3f4 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/home/abhinav/rnaseq-nf/work/03/23372f156e80deb4d7183c5f509274/ggal_gut, storePath:/home/abhinav/rnaseq-nf/work/03/23372f156e80deb4d7183c5f509274/ggal_gut, stageName:ggal_gut), FileHolder(sourceObj:/home/abhinav/rnaseq-nf/work/25/433b23af9e98294becade95db6bd76/fastqc_ggal_gut_logs, storePath:/home/abhinav/rnaseq-nf/work/25/433b23af9e98294becade95db6bd76/fastqc_ggal_gut_logs, stageName:fastqc_ggal_gut_logs)]
+  18966b473f7bdb07f4f7f4c8445be1f5 [nextflow.util.ArrayBag] [FileHolder(sourceObj:/home/abhinav/rnaseq-nf/work/03/23372f156e80deb4d7183c5f509274/ggal_gut, storePath:/home/abhinav/rnaseq-nf/work/03/23372f156e80deb4d7183c5f509274/ggal_gut, stageName:ggal_gut), FileHolder(sourceObj:/home/abhinav/rnaseq-nf/work/55/15b60995682daf79ecb64bcbb8e44e/fastqc_ggal_gut_logs, storePath:/home/abhinav/rnaseq-nf/work/55/15b60995682daf79ecb64bcbb8e44e/fastqc_ggal_gut_logs, stageName:fastqc_ggal_gut_logs)]
   d271b8ef022bbb0126423bf5796c9440 [java.lang.String] config
   5a07367a32cd1696f0f0054ee1f60e8b [nextflow.util.ArrayBag] [FileHolder(sourceObj:/home/abhinav/rnaseq-nf/multiqc, storePath:/home/abhinav/rnaseq-nf/multiqc, stageName:multiqc)]
   4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
   16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
```

Here, the highlighted lines show that the path of the `fastqc_ggal_gut_logs` input directory has changed as a result of `FASTQC` being re-run; consequently, `MULTIQC` gets a new cache hash and has to be re-run as well.

## Conclusion

Debugging the caching behavior of a pipeline can be tricky; however, a systematic analysis can help to uncover what is causing a particular process to be re-run.

When analyzing large datasets, it may be worth using the `-dump-hashes` option by default for all pipeline runs, so that there is no need to run the pipeline again just to obtain the hashes in the log file when a problem appears.

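One low-effort way to make this the default is a small shell wrapper; this is only a sketch (the function name `nfrun` and the log-file naming scheme are arbitrary) that passes `-dump-hashes` on every run and keeps a separate, timestamped log per run:

```bash
# Hypothetical convenience wrapper (not part of Nextflow): every run gets
# -dump-hashes and writes to its own timestamped log file instead of .nextflow.log.
nfrun() {
    nextflow -log "run_$(date +%Y%m%d_%H%M%S).log" run "$@" -dump-hashes
}

# Usage, mirroring the commands used earlier in this post:
nfrun ./main.nf -profile docker
nfrun ./main.nf -profile docker -resume
```
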
While this process works, it is not trivial. We would love to see some community-driven tooling for a better cache-debugging experience for Nextflow, perhaps an `nf-cache` plugin? Stay tuned for an upcoming blog post describing how to extend and add new functionality to Nextflow using plugins.
