01_assignment.qmd: 4 additions, 1 deletion
@@ -20,4 +20,7 @@ Try running it:
 
 `./run_this.sh`
 
-3. Take a look at `scripts/week1/rnorm.R` or `scripts/week1/random_num.py`. Load up the `fhR` or `fhPython` modules on `rhino` using `module load`. Run it on the command line with `Rscript` or `python3`. Did you need to make this script executable?
+3. Take a look at `scripts/week1/rnorm.R` or `scripts/week1/random_num.py`. Load up the `fhR` or `fhPython` modules on `rhino` using `module load`. Run it on the command line with `Rscript` or `python3`.
+
+Did you need to make this script executable before you ran it?
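The steps in this exercise boil down to a few commands; a minimal sketch (module and script names come from the exercise text, and `module load` availability depends on the cluster):

```bash
module load fhR                 # or: module load fhPython
Rscript scripts/week1/rnorm.R   # or: python3 scripts/week1/random_num.py
module purge                    # unload modules when done
```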
03_batch.qmd: 11 additions, 70 deletions
@@ -41,7 +41,7 @@ If we run this in our repository, we get something similar to this:
 214 1314 8001 miscellaneous.qmd
 ```
 
-The `*.qmd` (the wildcard operator, also known as a glob) can be used in various ways. For example, if our files are in a folder called `raw_data/`, we could specify:
+The `*.qmd` (the wildcard operator, also known as a {{<glossary "glob">}}) can be used in various ways. For example, if our files are in a folder called `raw_data/`, we could specify:
 
 ```
 for file in raw_data/*.fq
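As a self-contained sketch of such a glob loop (the `raw_data/` folder and `.fq` file names are hypothetical; here we just count lines in each matched file):

```bash
#!/bin/bash
shopt -s nullglob        # if nothing matches the glob, skip the loop entirely
for file in raw_data/*.fq
do
  echo "Processing ${file}"
  wc -l "${file}"
done
```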
@@ -77,6 +77,11 @@ module purge
 
 See page 12 in Bite Size Bash.
 
+:::{.callout}
+## Selecting files with complicated patterns: Regular Expressions
+
+:::
+
 ### Using file manifests
 
 One approach that I use a lot is using file manifests to process multiple sets of files. Each line of the file manifest will contain all of the related files I need to process.
@@ -106,6 +111,8 @@ unset IFS #<2>
 1. Change IFS to be `""` (no space), to process a file line by line.
 2. Reset IFS to original behavior.
 
+
+
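A runnable sketch of the manifest pattern described above (the `manifest.txt` contents are hypothetical; each line lists one sample's files):

```bash
#!/bin/bash
# Create a tiny example manifest: one sample's files per line (hypothetical names).
printf 'sampleA_R1.fq sampleA_R2.fq\nsampleB_R1.fq sampleB_R2.fq\n' > manifest.txt

IFS=""                    #<1> set IFS to "" so each line is read whole
while read -r line
do
  echo "Sample files: ${line}"
done < manifest.txt
unset IFS                 #<2> restore default word-splitting behavior
```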
 ## Batching on HPC
 
 Now we can start to do more advanced things on the HPC: use one machine to process each file.
@@ -149,6 +156,7 @@ We are able to set some configuration on running our jobs.
 
 Much more information about the kinds of directives that you can specify in a SLURM script is available here: <https://www.osc.edu/supercomputing/batch-processing-at-osc/slurm_directives_summary>
 
+The most important directive you should be aware of is how
 :::
 
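To make the directive syntax concrete, here is a sketch of how several directives might combine at the top of a batch script (the directive names are standard SLURM; the job name, resource values, and commands below are hypothetical examples, not from the course materials):

```bash
#!/bin/bash
#SBATCH --job-name=count_qmd      # name shown in the queue (hypothetical)
#SBATCH --ntasks=1                # a single task
#SBATCH --cpus-per-task=4         # CPUs available to that task
#SBATCH --mem=8G                  # memory for the whole job
#SBATCH --time=01:00:00           # wall-clock limit, HH:MM:SS

# The job's actual work goes below the directives.
module load fhR
Rscript scripts/week1/rnorm.R
module purge
```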
### Job Arrays
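The job-array idea can be sketched like this: each sub job reads its `SLURM_ARRAY_TASK_ID` and uses it to pick one line of a file manifest (the array range and `manifest.txt` contents are hypothetical):

```bash
#!/bin/bash
#SBATCH --array=1-10              # launch sub jobs with indices 1 through 10 (hypothetical)
#SBATCH --ntasks=1

# Provide a default index so the sketch also runs outside of SLURM.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-1}
printf 'one.bam\ntwo.bam\nthree.bam\n' > manifest.txt   # hypothetical manifest

# Each sub job pulls the manifest line matching its array index.
FILE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifest.txt)
echo "Sub job ${SLURM_ARRAY_TASK_ID} processing ${FILE}"
```

Submitting such a script with `sbatch` yields one parent job ID, which is what `scancel` takes to cancel every sub job at once.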
@@ -235,73 +243,6 @@ scancel 26328834
 
 This will cancel all sub jobs related to the parent job.
 
-## Containers
-
-We already learned about software modules (@sec-modules). There is an alternative way to use software: using a {{<glossary "container">}}.
-
-### What is a Container?
-
-A container is a self-contained unit of software. It contains everything needed to run the software on a variety of machines. If you have the container software installed on your machine, it doesn't matter whether it is MacOS, Linux, or Windows - the container will behave consistently across different operating systems and architectures.
-
-The container has the following contents:
-
-- **Software** - The software we want to run in a container. For bioinformatics work, this is usually something like an aligner like `bwa`, or utilities such as `samtools`
-- **Software Dependencies** - various software packages needed to run the software. For example, if we wanted to run `tidyverse` in a container, we need to have `R` installed in the container as well.
-- **Filesystem** - containers have their own isolated filesystem that can be connected to the "outside world" - everything outside of the container. We'll learn more about customizing these with bind paths (@sec-bindpaths).
-
-In short, the container has everything needed to run the software. It is not a full operating system, but a smaller mini-version that cuts out a lot of cruft.
-
-Containers are {{< glossary "ephemeral">}}. They leverage the file system of their host to manage files. These are called both *Volumes* (the Docker term) and *Bind Paths* (the apptainer term).
-
-### Docker vs. Apptainer
-
-There are two basic ways to run Docker containers:
-
-1. Using the Docker software
-2. Using the Apptainer software (for HPC systems)
-
-In general, Docker is used on systems where you have a high level of access to the system. This is because `docker` uses a special user group called `docker` that has essentially root level privileges. This is not something to be taken lightly.
-
-This is not the case for HPC systems, which are shared and granting this level of access to many people is not practical. This is when we use {{< glossary "Apptainer">}} (which used to be called Singularity), which requires a much lower level of user privileges to execute tasks. For more info, see @sec-open-container.
-
-:::{.callout-warning}
-## Be Secure
-
-Before we get started, security is always a concern when running containers. The `docker` group has elevated status on a system, so we need to be careful that when we're running them, these containers aren't introducing any system vulnerabilities. Note that on HPC systems, the main mechanism for running containers is `apptainer`, which is designed to be more secure.
-
-These are mostly important when running containers that are web-servers or part of a web stack, but it is also important to think about when running jobs on HPC.
-
-Here are some guidelines to think about when you are working with a container.
-
-- **Use vendor-specific Docker Images when possible**.
-- **Use container scanners to spot potential vulnerabilities**. DockerHub has a vulnerability scanner that scans your Docker images for potential vulnerabilities. For example, the WILDS Docker library employs a vulnerability scanner and the containers are regularly patched to prevent vulnerabilities.
-- **Avoid kitchen-sink images**. One issue is when an image is built on top of many other images. It makes it really difficult to plug vulnerabilities. When in doubt, use images from trusted people and organizations. At the very least, look at the Dockerfile to see that suspicious software isn't being installed.
-:::
-
-### Common Containers for Bioinformatics
-
-- GATK (the genome analysis toolkit) is one common container that we can use for analysis.
-
-### The WILDS Docker Library
-
-The Data Science Lab has a set of Docker containers for common Bioinformatics tasks available in the [WILDS Docker Library](https://hub.docker.com/u/getwilds). These include:
-
-- `samtools`
-- `bcftools`
-- `manta`
-- `cnvkit`
-- `deseq2`
-
-Among many others. Be sure to check it out before you start building your own containers.
-
-### Pulling a Docker Container
-
-Let's pull a docker container from the Docker registry. Note we have to specify `docker://` when we pull the container, because Apptainer has its own internal format called SIF.
04_containers_workflows.qmd: 6 additions, 162 deletions
@@ -1,5 +1,5 @@
 ---
-title: "Containers and Workflows"
+title: "Workflows"
 ---
 
 ## Learning Objectives
@@ -19,9 +19,9 @@ A good workflow manager will allow you to:
 - Restart failed subjobs in the workflow
 - Allow you to customize where intermediate and final outputs go
 - Swap and customize modules in your workflow
-- Adapt to different architectures
+- Adapt to different computing architectures (HPC/cloud/etc)
 
-Many bioinformaticists have used workflow managers to process and manage hundreds or thousands of files at a time.
+Many bioinformaticists have used workflow managers to process and manage hundreds or thousands of files at a time. They are well worth learning.
 
 Here is an overview of some of the common bioinformatics workflow managers. We will be using `cromwell`, which runs WDL files.
 
@@ -31,7 +31,7 @@ Here is an overview of some of the common bioinformatics workflow managers. We w
 |Sprocket | WDL|Made for HPC Jobs|
 |MiniWDL|WDL|Used for local testing of workflows|
 |DNANexus|WDL/CWL|Used for systems such as AllOfUs|
-|Nextflow|.nf files|Owned by seqera|
+|Nextflow|`.nf` files|Owned by seqera|
 |Snakemake|make files||
 
@@ -301,163 +301,7 @@ workflow SRA_STAR2Pass {
 ```
 
 
-## Developing and Testing out scripts
+## Where Next?
 
-One of the hard things to understand is what can be run on a compute node versus the head node, and what file systems are accessible via a compute node.
+Now that you understand the basics of working with Bash and WDL, you are ready to start working with WDL workflows.
 
-A lot of the issues you might have are because you need to understand the mental model of how cluster computing works. And the best way to understand that is to test your code on a compute node.
-
-Let's explore how we can do that.
-
-### Testing code on a compute node {#sec-grabnode}
-
-Fred Hutch users have the advantage of `grabnode`, which is a custom command that lets you request an interactive instance of a compute node.
-
-Why would you want to do this? A good part of this is about testing software and making sure that your paths are correct.
-
-:::{.callout}
-## Don't rely on `grabnode`/interactive mode for your work
-
-We see users that will request a multicore node with higher memory, and do their processing on that node.
-
-This doesn't take advantage of all of the machines that are available on a cluster, and thus is a suboptimal way to utilize the cluster.
-
-The other disadvantage is that you may be waiting a very long time to get that multicore node, whereas if you batch across a bunch of nodes, you will get your work done much faster.
-:::
-
-## Working with containers {#sec-containers}
-
-I think the hardest thing about working with containers is wrapping your head around the indirectness of them. You are running software with its own internal filesystem and the challenges are getting the container to read files in folders/paths outside of its own filesystem, as well as outputting files into those outside folders.
-
-### Testing code in a container {#sec-open-container}
-
-In this section, we talk about testing scripts in a container using `apptainer`. We use `apptainer` (formerly Singularity) in order to run Docker containers on a shared HPC system. This is because Docker itself requires root-level privileges, which is not secure on shared systems.
-
-In order to do our testing, we'll first pull the Docker container, map our bind point (so our container can access files outside of its file system), and then run scripts in the container.
-
-Even if you aren't going to frequently use Apptainer in your work, I recommend trying an interactive shell in a container at least once or twice to learn about the container filesystem and conceptually understand how you connect it to the external filesystem.
-
-### Pulling a Docker Container
-
-Let's pull a docker container from the Docker registry. Note we have to specify `docker://` when we pull the container, because Apptainer has its own internal format called SIF.
-### Opening a Shell in a Container with `apptainer shell`
-
-When you're getting started, opening a shell using Apptainer can help you test out things like filepaths and how they're accessed in the container. It's hard to get an intuition for how file I/O works with containers until you can see the limited view from the container.
-
-By default, apptainers can see your current directory and navigate to the files in it.
-
-You can open an Apptainer shell in a container using `apptainer shell`. Remember to use `docker://` before the container name. For example:
-This will load the `apptainer` module, and then open a Bash shell in the container using `apptainer shell`. Once you're in the container, you can test code, especially seeing whether your files can be seen by the container (see @sec-bindpaths). 90% of the issues with using Docker containers have to do with bind paths, so we'll talk about that next.
-
-Once you're in the shell, you can take a look at where `samtools` is installed:
-
-```bash
-which samtools
-```
-
-Note that the container filesystem is isolated, and we need to explicitly build connections to it (called bind paths) to get files in and out. We'll talk more about this in the next section.
-
-Once we're done testing scripts in our containers, we can exit the shell and get back into the node.
-
-```bash
-exit
-```
-
-:::{.callout-note}
-## Opening a Shell in a Docker Container with Docker
-
-For the most part, due to security reasons, we don't use `docker` on HPC systems. In short, the `docker` group essentially has root-level access to the machine, and it's not good for security on a shared resource like an HPC.
-
-However, if you have admin level access (for example, on your own laptop), you can open up an interactive shell with `docker run -it`:
-
-```bash
-docker run -it biocontainers/samtools:v1.9-4-deb_cv1 /bin/bash
-```
-This will open a bash shell much like `apptainer shell`. Note that volumes (the docker equivalent of bind paths) are specified differently in Docker compared to Apptainer.
-:::
-
-:::{.callout-note}
-## WDL makes this way easier
-
-A major point of failure with Apptainer scripting is when our scripts aren't using the right bind paths. It becomes even more complicated when you are running multiple steps.
-
-This is one reason we recommend writing WDL Workflows and a {{<glossary "workflow manager">}} (such as {{<glossary "Cromwell">}} or Sprocket) to run your workflows. You don't have to worry that your bind points are set up correctly, because they are handled by the workflow manager.
-:::
-
-### Testing in the Apptainer Shell
-
-Ok, now we have a bind point, so now we can test our script in the shell. For example, we can see if we are invoking `samtools` in the correct way and that our bind points work.
-Again, trying out scripts in the container is the best way to understand what the container can and can't see.
-
-### Exiting the container when you're done
-
-You can `exit`, like any shell you open. You should be out of the container. Confirm by using `hostname` to make sure you're out of the container.
-
-### Testing outside of the container
-
-Let's take everything that we learned and put it in a script that we can run on the HPC:
-
-```bash
-#!/bin/bash
-# Script to samtools view -c an input file:
-# Usage: ./run_sam.sh <my_bam_file.bam>
-# Outputs a count file: my_bam_file.bam.counts.txt
-module load Apptainer/1.1.6
-apptainer run --bind /fh/fast/mydata:/mydata docker://biocontainers/samtools:v1.9-4-deb_cv1 samtools view -c /mydata/$1 > /mydata/$1.counts.txt
-#apptainer cache clean
-module purge
-```
-
-We can use this script by the following command:
-
-```
-./run_sam.sh chr1.bam
-```
-
-And it will output a file called `chr1.bam.counts.txt`.
-
-:::{.callout}
-## Apptainer Cache
-
-[The apptainer cache](https://apptainer.org/docs/user/1.0/build_env.html) is where your docker images live. They are translated to the native apptainer `.sif` format.
-
-You can see what's in your cache by using
-
-```
-apptainer cache list
-```
-
-By default the cache lives at `~/.apptainer/cache`.
-
-If you need to clear out the cache, you can run
-
-```
-apptainer cache clean
-```
-
-to clear out the cache.
-
-There are a number of environment variables (@sec-environment) that can be set, including login tokens for pulling from a private registry. [More information is here](https://apptainer.org/docs/user/1.0/build_env.html#environment-variables).
-:::
-
-### More Info
-- [Carpentries Section on Apptainer Paths](https://hsf-training.github.io/hsf-training-singularity-webpage/07-file-sharing/index.html) - this is an excellent resource if you want to dive deeper into understanding container filesystems and bind points.
-- [Apptainer Documentation on Bind Paths](https://apptainer.org/docs/user/main/bind_paths_and_mounts.html#bind-examples). There are a lot of good examples here on how to set up bind paths.
-- [More about bind paths and other options](https://apptainer.org/docs/user/main/bind_paths_and_mounts.html).