---
title: "Testing Scripts"
format: html
---
## Developing and Testing Scripts
One of the harder things to understand is what can be run on a compute node versus the head node, and which file systems are accessible from a compute node.
Many of the issues you might run into come down to the mental model of how cluster computing works, and the best way to build that model is to test your code on a compute node.
Let's explore how we can do that. You should also review the material about using `screen` (@sec-screen).
### Testing code on a compute node {#sec-grabnode}
Fred Hutch users have the advantage of `grabnode`, a custom command that lets you request an interactive instance of a compute node. (Users on other SLURM systems can usually get an interactive session with `srun --pty` or `salloc`; see below.)
Why would you want to do this? A good part of this is about testing software and making sure that your paths are correct.
:::{.callout}
## Don't rely on `grabnode`/interactive mode for your batch work
We often see users request a multicore node with higher memory and do all of their processing on that single node.
This doesn't take advantage of the many machines available on the cluster, and is a suboptimal way to use it.
For interactive analysis, such as working in JupyterLab or RStudio, this is a valid way to work. But when you have tasks you can *scatter* across many nodes, requesting one high-spec node isn't the best approach.
The other disadvantage is that you may wait a very long time for that multicore node, whereas if you batch your work across many nodes, it will finish much faster.
:::
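The batch approach described above can be sketched as a SLURM job array. This is a hypothetical script, not FH-specific: the job name, resource requests, and file names are placeholders, and it assumes `samtools` is already on your path.

```shell
#!/bin/bash
#SBATCH --job-name=count-bams
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --array=1-22            # one array task per input file (hypothetical)

# Each array task processes its own input file, so the work scatters
# across many nodes instead of occupying one large interactive node.
BAM="chr${SLURM_ARRAY_TASK_ID}.bam"
samtools view -c "${BAM}" > "${BAM}.counts.txt"
```

Submitted once with `sbatch`, this runs 22 small tasks that the scheduler can place wherever capacity is free.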
### Grabbing an interactive shell on a worker
When you're testing code that's going to run on a worker node, you need to be aware of what the worker node sees.
It's also useful for estimating how long our tasks will run, since we can time a task on a representative dataset.
:::{.callout-note}
### For FH Users: `grabnode`
On the FH system, we can use a command called `grabnode`, which will let us request a node. It will ask us for our requirements (numbers of cores, memory, etc.) for our node.
```bash
tladera2@rhino01:~$ grabnode
```
`grabnode` will then ask us what kind of instance we want, in terms of CPUs, memory, and GPUs. Here, I'm grabbing a node with 8 cores and 8 GB of memory for 1 day, with no GPU.
```
How many CPUs/cores would you like to grab on the node? [1-36] 8
How much memory (GB) would you like to grab? [160] 8
Please enter the max number of days you would like to grab this node: [1-7] 1
Do you need a GPU ? [y/N]n
You have requested 8 CPUs on this node/server for 1 days or until you type exit.
Warning: If you exit this shell before your jobs are finished, your jobs
on this node/server will be terminated. Please use sbatch for larger jobs.
Shared PI folders can be found in: /fh/fast, /fh/scratch and /fh/secure.
Requesting Queue: campus-new cores: 8 memory: 8 gpu: NONE
srun: job 40898906 queued and waiting for resources
```
After a little bit, you'll arrive at a new prompt:
```
(base) tladera2@gizmok164:~$
```
Now you can test your batch scripts and make sure your file paths are correct. This is also helpful for profiling your job.
If you're doing interactive analysis that will span a few days, I recommend using `screen` or `tmux` (@sec-screen).
:::
:::{.callout}
## For Other HPC systems
On a SLURM system, the way to open an interactive shell on a node depends on the SLURM version. Check your version first:
```bash
srun --version
```
If you're on a version before 20.11, you can use `srun --pty bash` to open an interactive shell on a worker:
```bash
srun --pty bash
```
If the version is 20.11 or later, you can open an interactive shell on a worker with `salloc`:
```bash
salloc
```
:::
:::{.callout}
## Remember `hostname`
When you are doing interactive analysis, it's easy to forget which node you're working on. As a quick check, I use `hostname` (@sec-hostname) to remind myself whether I'm on `rhino`, `gizmo`, or inside an Apptainer container.
:::
## Testing code in a container {#sec-open-container}
In this section, we talk about testing scripts in a container using `apptainer`. We use `apptainer` (formerly Singularity) in order to run Docker containers on a shared HPC system. This is because Docker itself requires root-level privileges, which is not secure on shared systems.
In order to do our testing, we'll first pull the Docker container, map our bind point (so our container can access files outside of its file system), and then run scripts in the container.
Even if you aren't going to frequently use Apptainer in your work, I recommend trying an interactive shell in a container at least once or twice to learn about the container filesystem and conceptually understand how you connect it to the external filesystem.
I think the hardest thing about working with containers is wrapping your head around their indirectness. You are running software with its own internal filesystem, and the challenge is getting the container to read files from folders outside of its own filesystem, as well as writing output files back to those outside folders.
### Pulling a Docker Container
Let's pull a Docker container from the Docker registry. Note that we have to specify `docker://` when we pull the container, because Apptainer has its own native format, called SIF, and needs to know the container is coming from a Docker registry.
```bash
module load Apptainer/1.1.6
apptainer pull docker://biocontainers/samtools:v1.9-4-deb_cv1
```
### Opening a Shell in a Container with `apptainer shell`
When you're getting started, opening a shell using Apptainer can help you test out things like filepaths and how they're accessed in the container. It's hard to get an intuition for how file I/O works with containers until you can see the limited view from the container.
By default, Apptainer containers can see your current working directory and the files in it.
You can open an Apptainer shell in a container using `apptainer shell`. Remember to use `docker://` before the container name. For example:
```bash
module load Apptainer/1.1.6
apptainer shell docker://biocontainers/samtools:v1.9-4-deb_cv1
```
This will load the Apptainer module, and then open a Bash shell in the container using `apptainer shell`. Once you're in the container, you can test code, especially checking whether your files are visible to the container (see @sec-bindpaths). 90% of the issues with using Docker containers have to do with bind paths, so we'll talk about those next.
Once you're in the shell, you can take a look at where `samtools` is installed:
```bash
which samtools
```
Note that the container filesystem is isolated, and we need to explicitly build connections to it (called bind paths) to get files in and out. We'll talk more about this in the next section.
Once we're done testing scripts in our containers, we can exit the shell and get back into the node.
```bash
exit
```
### Using bind paths in containers {#sec-bindpaths}
One thing to keep in mind is that every container has its own filesystem. One of the hardest things to wrap your head around for containers is how their filesystems work, and how to access files that are outside of the container filesystem. We'll call any filesystems outside of the container *external filesystems* to make the discussion a little easier.
By default, the containers have access to your current working directory. We could make this where our scripts live (such as `/home/tladera2/`), but because our data is elsewhere, we'll need to specify that location (`/fh/fast/mylab/`) as well.
The main mechanism Apptainer provides for accessing the external filesystem is the *bind path*. Much like mounting a drive, we can bind directories from the external filesystem into the container at these bind points.
```{mermaid}
flowchart LR
B["External Directory\n/fh/fast/mydata/"]
B --read--> C
C --write--> B
A["Container Filesystem\n/mydata/"]--write-->C("--bind /fh/fast/mydata/:/mydata/")
C --read--> A
```
I think of bind paths as "tunnels" that give access to particular folders in the external filesystem. Once the tunnel is open, we can access data files, process them, and save them using the bind path.
Say my data lives in `/fh/fast/mydata/`. Then I can specify a bind point in my `apptainer shell` and `apptainer run` commands.
We can do this with the `--bind` option:
```bash
apptainer shell --bind /fh/fast/mydata:/mydata docker://biocontainers/samtools:v1.9-4-deb_cv1
```
Note that the source path in the bind syntax doesn't have a trailing slash (`/`). That is, it is:
```
--bind /fh/fast/mydata: ....
```
Rather than
```
--bind /fh/fast/mydata/: ....
```
Now our `/fh/fast/mydata/` folder will be available as `/mydata/` in the container. We can read and write files through this bind point. For example, I'd refer to the `.bam` file `/fh/fast/mydata/my_bam_file.bam` as:
```
samtools view -c /mydata/my_bam_file.bam
```
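If your data and your output destination live in different places, `--bind` accepts multiple comma-separated `source:dest` pairs. A sketch, with placeholder paths:

```shell
apptainer shell \
  --bind /fh/fast/mydata:/mydata,/fh/scratch/myresults:/results \
  docker://biocontainers/samtools:v1.9-4-deb_cv1
```

Inside the container, `/mydata/` and `/results/` then point at the two external directories.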
:::{.callout-note}
## Opening a Shell in a Docker Container with Docker
For the most part, we don't use `docker` on HPC systems, for security reasons. In short, members of the `docker` group essentially have root-level access to the machine, which is not good for security on a shared resource like an HPC cluster.
However, if you have admin level access (for example, on your own laptop), you can open up an interactive shell with `docker run -it`:
```bash
docker run -it biocontainers/samtools:v1.9-4-deb_cv1 /bin/bash
```
This will open a Bash shell much like `apptainer shell`. Note that volumes (the Docker equivalent of bind paths) are specified differently in Docker (with the `-v`/`--volume` option) than in Apptainer.
:::
:::{.callout-note}
## WDL makes this way easier
A major point of failure with Apptainer scripting is when our scripts aren't using the right bind paths, and it becomes even more complicated when you are running multiple steps.
This is one reason we recommend writing WDL workflows and using a {{<glossary "workflow manager">}} (such as {{<glossary "Cromwell">}} or Sprocket) to run them. You don't have to worry about setting up your bind points correctly, because the workflow manager handles them.
:::
### Testing in the Apptainer Shell
Now that we have a bind point, we can test our script in the shell. For example, we can check that we're invoking `samtools` correctly and that our bind points work.
```bash
samtools view -c /mydata/my_bam_file.bam > /mydata/bam_counts.txt
```
Again, trying out scripts in the container is the best way to understand what the container can and can't see.
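Once a command works interactively, you can also run it as a one-off with `apptainer exec`, which runs a single command in the container and then exits (the paths here are placeholders):

```shell
apptainer exec --bind /fh/fast/mydata:/mydata \
  docker://biocontainers/samtools:v1.9-4-deb_cv1 \
  samtools view -c /mydata/my_bam_file.bam
```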
### Exiting the container when you're done
You can `exit` the container like any other shell you open. To confirm you're back on the node, check `hostname`.
### Testing outside of the container
Let's take everything that we learned and put it in a script that we can run on the HPC:
```bash
#!/bin/bash
# Script to run `samtools view -c` on an input file.
# Usage: ./run_sam.sh <my_bam_file.bam>
# Outputs a count file: my_bam_file.bam.counts.txt
module load Apptainer/1.1.6
apptainer run --bind /fh/fast/mydata:/mydata \
  docker://biocontainers/samtools:v1.9-4-deb_cv1 \
  samtools view -c "/mydata/$1" > "/mydata/$1.counts.txt"
#apptainer cache clean
module purge
```
We can run this script with:
```
./run_sam.sh chr1.bam
```
And it will output a file called `chr1.bam.counts.txt`.
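To scatter this over many files instead of running it serially, one option is to submit one job per input with `sbatch --wrap` (a sketch; the file names are placeholders):

```shell
# Submit one small job per BAM file; the scheduler spreads them across nodes
for bam in chr1.bam chr2.bam chr3.bam; do
  sbatch --wrap="./run_sam.sh ${bam}"
done
```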
:::{.callout}
## Apptainer Cache
[The Apptainer cache](https://apptainer.org/docs/user/1.0/build_env.html) is where your Docker images live after they are translated to Apptainer's native `.sif` format.
You can see what's in your cache with:
```
apptainer cache list
```
By default the cache lives at `~/.apptainer/cache`.
If you need to clear out the cache, run:
```
apptainer cache clean
```
There are a number of environment variables (@sec-environment) that can be set, including login tokens for pulling from a private registry. [More information is here](https://apptainer.org/docs/user/1.0/build_env.html#environment-variables).
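For example, if your home directory quota is tight, you can relocate the cache to scratch space by setting `APPTAINER_CACHEDIR` before pulling images (the path below is a placeholder):

```shell
# Relocate the Apptainer cache (placeholder path)
export APPTAINER_CACHEDIR=/fh/scratch/mylab/apptainer-cache
```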
:::