
Docker User Guide

Bryan Quach edited this page Jan 25, 2024 · 6 revisions

Introduction


This page serves as an end-user reference guide for setting up Docker on a local machine or an Amazon EC2 instance and running Docker containers using images pulled from DockerHub. This document is intended as a quick-start guide. More comprehensive information can be found through the official Docker documentation.

What is Docker?

Docker is a software platform for containers, more technically known as operating-system level virtualization. Containers are virtual environments that run like other processes on a host system. They are similar to virtual machines but more lightweight and less demanding on host system CPU and memory resources. A container houses all the software and library dependencies needed to run a specific application, software, or script.

Benefits of containers

  • Flexible: Everything from simple scripts to elaborate applications can be containerized.
  • Lightweight: Containers use and share host resources without incorporating demanding intermediary resource management layers between the container and the host.
  • Interchangeable: Applications built on containers can be upgraded and updated on-the-fly.
  • Portable: You can build container images locally, deploy them to the cloud, and run them on any computer that has Docker installed regardless of the host operating system.
  • Scalable: Multiple copies of a Docker container can be created quickly and used simultaneously.
  • Stackable: Docker containers can be used together as interacting entities to form a composite application or software.

Terminology: image vs. container

This guide will use the terms image and container extensively. As defined in the Docker documentation:

A container is launched by running an image. An image is an executable package that includes everything needed to run an application--the code, a runtime, libraries, environment variables, and configuration files.

A container is a runtime instance of an image--what the image becomes in memory when executed.


Installing Docker


Windows and Mac

Docker Community Edition (CE) can be installed on Windows or Mac OS using the official installers for both operating systems.

  • Download the Windows installer here and follow the installation guide.
  • Download the Mac OS installer here and follow the installation guide.

Linux

Regardless of the Linux distribution, Docker CE can be installed using a package manager. See below for the command-line installation steps for the main distribution families.

Amazon Linux

# Step 1: Install Docker CE
sudo yum install docker

# Step 2: Start Docker
sudo service docker start

The setup up to this point allows Docker commands to be executed only if sudo is prepended. To allow non-sudo users to call Docker commands, execute the following line, replacing <user> with your username (if using an EC2 instance, the default username is ec2-user):

sudo usermod -aG docker <user>

Exit your command-line session and create a new one. If Docker is successfully installed and initialized, the following commands should print the Docker version and list your local images:

docker --version
docker image ls

RHEL and CentOS

# Step 1: Install device mapper and management dependencies
sudo yum install -y yum-utils device-mapper-persistent-data lvm2

# Step 2: Add Docker repo to package manager
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum update

# Step 3: Install Docker CE
sudo yum install docker-ce

# Step 4: Start Docker
sudo systemctl start docker

The setup up to this point allows Docker commands to be executed only if sudo is prepended. To allow non-sudo users to call Docker commands, execute the following line, replacing <user> with your username:

sudo usermod -aG docker <user>

Exit your command-line session and create a new one. If Docker is successfully installed and initialized, the following commands should print the Docker version and list your local images:

docker --version
docker image ls

Debian and Ubuntu

# Step 1: Allow installations over HTTPS
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

# Step 2: Enable secure access to Docker repo
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

# Step 3: Add Docker repo to package manager
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update

# Step 4: Install Docker CE
sudo apt-get install docker-ce

If step 3 throws an error, execute the command lsb_release -cs and verify that a string is returned. The command component $(lsb_release -cs) should expand to the codename of your Debian or Ubuntu release (trusty, cosmic, bionic, jessie, buster, etc.). If it does not, replace $(lsb_release -cs) with the codename directly. For Ubuntu 18.04 Bionic Beaver the command would become:

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"

Alternatively, lsb_release may not be installed, in which case step 3 may work after installing it with the package manager:

sudo apt-get install lsb-release

Once steps 1-4 are completed, Docker commands can be executed with sudo prepended. To allow non-sudo users to call Docker commands, execute the following line, replacing <user> with your username:

sudo usermod -aG docker <user>

Exit your command-line session and create a new one. If Docker is successfully installed and initialized, the following commands should print the Docker version and list your local images:

docker --version
docker image ls
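Beyond checking the version, Docker's official hello-world image provides a quick end-to-end smoke test of the daemon, image pulls, and container execution:

```shell
# Pull and run Docker's official test image; it prints a greeting and exits.
docker run --rm hello-world
```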

Retrieving a Docker image from DockerHub


DockerHub is a Docker supported service for storing Docker images. Much like Git and GitHub provide a convenient way to version control and share code, DockerHub allows for versioning and sharing of Docker images.

Currently, images created specifically for RTI omics analyses are available at

All versions of an image are stored as a single repository. An image can be downloaded from DockerHub to your machine using the command:

docker pull <username>/<image_repository>:<image_tag>

In the above command, <username> is the repository owner, <image_repository> is the name of the repository, and <image_tag> is the specific version of the image that you want to retrieve. As an example, to download the PLINK v1.9 image, use the following command-line command:

docker pull rtibiocloud/plink:v1.9_178bb91

**Note:** Docker images do not need to be pulled prior to calling docker run (explained in Running a Docker container). If the image is not found locally, then Docker will search DockerHub and pull the image from there.


Running a Docker container

Non-interactive mode


Base command

The basic command to run a non-interactive Docker container is

docker run --rm -t <username>/<image_repository>:<image_tag> <cmd>

where <username> is the repository owner, <image_repository> is the name of the repository, and <image_tag> is the specific version of the image.
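For example, substituting the placeholders with the PLINK image referenced elsewhere in this guide (using plink --help as the <cmd> is an illustrative choice; it assumes the plink executable is on the container's PATH):

```shell
# Print PLINK's usage text; the container is removed afterward via --rm
docker run --rm -t rtibiocloud/plink:v1.9_178bb91 plink --help
```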

Specifying hardware resource constraints and host file system access

Sometimes a Docker container may need to be run with specific CPU and memory constraints and/or with access to data on the host system. This can be achieved by adding additional options:

docker run \
  --rm \
  --memory=<mem_value><size_suffix> \
  --cpus=<cpu_value> \
  --mount type=bind,src=<host_path>,dst=<internal_docker_path> \
  -t <username>/<image_repository>:<image_tag> \
  <cmd>

Here <mem_value> is a number followed by a size specifier <size_suffix> which is typically m for megabytes and g for gigabytes. The <cpu_value> specifies the number of CPU units desired. More sophisticated resource management options are detailed here.

For data access <host_path> and <internal_docker_path> are directories on the host machine and within the Docker container respectively. For example, let <host_path> be /shared/data and <internal_docker_path> be /usr/data. All files and subdirectories within /shared/data will be visible inside the Docker container under /usr/data. Any files written inside the Docker container to /usr/data will become accessible on the host machine under /shared/data. If the --mount option is not specified, then Docker will not have access to data on your host machine.

The Docker image specification <username>/<image_repository>:<image_tag> follows the same conventions as described for pulling images.

The <cmd> value is a string representing a command-line call that Docker makes inside the container. Containers have a default command, but rtibiocloud Docker containers are generally set up around a single tool/software/package. This means that <cmd> will typically be a call to a program executable followed by the options for that executable.

Example

The code snippet below is an example of converting a SAM file to BAM format by running the SAMtools view command. SAMtools is run using tag/version v1.9_140a84e of a container available on DockerHub through user rtibiocloud and container repository samtools. The following are the specifications for the example:

  • Input file: example.sam
  • Input file path on host machine: /usr/data/
  • Output file name: example.bam
  • 1 CPU
  • 1 GB of memory
# Docker example command
docker run \
  --rm \
  --memory=1g \
  --cpus=1 \
  --mount type=bind,src=/usr/data,dst=/data \
  -t rtibiocloud/samtools:v1.9_140a84e \
  samtools view -b -o /data/example.bam /data/example.sam

In this example the input data is visible to the Docker container through /data and the output file is written there as well. On the host machine the output file will be visible under /usr/data.

Interactive mode


The basic command to run an interactive Docker container is

docker run -i -t <username>/<image_repository>:<image_tag> /bin/bash

The placeholder representations have the same definition as described for the non-interactive mode. Also similar to the non-interactive mode command, the same resource constraints and data access options can be added. Here /bin/bash is the command that the container executes which will start an interactive bash session within your command-line terminal.
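For instance, an interactive session against the SAMtools image used earlier in this guide, combined with the same resource and mount options as the non-interactive form (the paths shown are illustrative):

```shell
# Start an interactive bash session inside the container with 1 CPU,
# 1 GB of memory, and host directory /shared/data mounted at /usr/data
docker run \
  -i -t \
  --rm \
  --memory=1g \
  --cpus=1 \
  --mount type=bind,src=/shared/data,dst=/usr/data \
  rtibiocloud/samtools:v1.9_140a84e \
  /bin/bash
```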

Running x86_64 images on ARM64 machines


Some cloud computing machines and newer Macs have ARM64 processors. Docker images built only for the x86_64 architecture can still be run on these machines using emulation (Rosetta on Apple silicon). If your Docker version is 4.3.0 or later, this support is built into the install. Emulation is enabled for docker run using the argument --platform=linux/amd64.
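For example, to run the PLINK image under emulation on an ARM64 machine (plink --help as the command is an illustrative choice):

```shell
# Force the x86_64 variant of the image to run under emulation
docker run --rm --platform=linux/amd64 -t rtibiocloud/plink:v1.9_178bb91 plink --help
```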

Docker container conventions for rtibiocloud


Different users implement different conventions for how containers are constructed. A well-maintained image on DockerHub will include information about the container's contents and how to run it for its designed purpose. For rtibiocloud images, the current design practice we employ is to store executables within the /opt directory of the container. This directory is also included in the PATH environment variable, so executables can be called without specifying their full paths. The default command for rtibiocloud containers will print the usage guide for the main executable or list the executables available within /opt.
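To see this convention in practice, a container's /opt directory can be listed directly (shown here with the SAMtools image used earlier; the assumption is that any rtibiocloud image behaves similarly):

```shell
# List the executables bundled in /opt of an rtibiocloud image
docker run --rm -t rtibiocloud/samtools:v1.9_140a84e ls /opt
```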


Exploratory Data Analysis Using Docker

Using Non-rtibiocloud Docker Images


As a general practice, rtibiocloud Docker images are designed to be modular and typically contain only a single software tool or a small, related set of tools. For exploratory analyses, you may run into cases where no rtibiocloud Docker image exists that suits your needs. In these cases, it is recommended to find and use a publicly available Docker image. Public Docker images can be searched through DockerHub. If the software you need does not exist as a Docker image, you can save time by building upon a pre-existing base image, as opposed to creating a Docker image from scratch. See the section on creating a temporary Docker image from a pre-existing image for more details.

Extending a Pre-existing Docker Image


When conducting exploratory analyses, it may be more efficient to experiment with using a software tool before officially integrating it into the analysis infrastructure and deploying it for production within a reproducible research environment. Consider the following use case:

  • You are interested in using a new algorithm, easySolveR, implemented as a Bioconductor R package to analyze your data.
  • The analysis also requires several other R packages for data wrangling.
  • Another tool, genoIngestR, that functions as a standalone Linux command-line program is needed to generate results that feed into easySolveR.
  • These tools do not all currently exist within a single rtibiocloud Docker image.

For this use case, you can extend the r-base Docker image, which already has R installed within a Linux operating system. To do this via a command line terminal:

  1. Create your own free DockerHub account. This will allow you to keep your Docker images in a personal repository for retrieval later if needed. Do not include sensitive data on your images. These images will be publicly accessible via DockerHub.

  2. Pull the r-base Docker image from DockerHub and start an interactive bash session. For R version 4.0.2 this can be done by executing

    docker run --rm -it r-base:4.0.2 /bin/bash
  3. Install the software tools needed for your analysis (e.g., easySolveR, genoIngestR, R packages).

  4. Retrieve the container ID for your updated Docker container. To do this while your Docker container is still running, open another command line terminal (on the same computer where your active Docker container is running) and list your active containers using the command docker container ls.

  5. Save the Docker container as an image. The command to do this is

    docker commit <container ID> <username>/<image_repository>:<image_tag>

    The <container ID> is the ID for the container that you updated, <username> is the name of your DockerHub account, and <image_repository>:<image_tag> is the desired name and version identifier for your newly created image. Note: This step will only save the image locally. If this image is created on an AWS EC2 instance that is then terminated, your image will be deleted as well (unless it is pushed to your DockerHub).

  6. To store your image for future retrieval, log in to DockerHub through the command line using the docker login command, then push the image to your DockerHub using docker push <username>/<image_repository>:<image_tag>. Note: This will make your image publicly findable and usable via DockerHub. Do not include sensitive data in your Docker image.
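The steps above can be sketched as a single command sequence (the image name, tag, and container ID remain placeholders to fill in with your own values):

```shell
# Terminal 1: pull the base image and customize it interactively (steps 2-3)
docker run --rm -it r-base:4.0.2 /bin/bash
#   ...inside the container, install easySolveR, genoIngestR, and the R packages...

# Terminal 2, on the same host, while the container is still running:
docker container ls                                       # step 4: note the CONTAINER ID
docker commit <container ID> <username>/<image_repository>:<image_tag>   # step 5: save as an image
docker login                                              # step 6: authenticate to DockerHub
docker push <username>/<image_repository>:<image_tag>
```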

Using custom scripts with Docker images


This section describes a solution for situations where you have a Docker image with the desired software installed, but you do not want to use the Docker image interactively. This is a common scenario when you want to submit the analysis as a job through a job scheduler like SGE or SLURM. It is also common when you want to execute multiple iterations of an analysis with slight modifications to the analysis parameters, but the software you are using is not neatly packaged as a command line executable (e.g., the software is an R or Python package, and the analysis is a series of data processing steps and function calls using the package library). In these cases, you often resort to writing a custom script to do the analysis.

If the analysis will be done repeatedly and can be easily abstracted, a long-term solution would be to create a version-controlled, well-designed command line script that can be integrated into an official rtibiocloud Docker image. However, this is often not feasible or necessary. The steps below outline an example of a quick workaround when you want to run a custom script that uses the software installed within a Docker image.

  1. Retrieve a Docker image containing all required software for your analysis. If one is not available, extend a pre-existing Docker image. For this example, the R version 4.0.2 r-base Docker image will be used.

  2. Determine the specifications for how your Docker host machine and Docker container file systems will be linked. This step is simply a mental note for the next step. Details of how host machine and Docker container file systems can be linked are described under the subsection "Specifying hardware resource constraints and host file system access" within the "Running a Docker container" section. For this example, we will associate the host machine directory /shared with the Docker container file system directory /tmp.

  3. Implement your custom script. If you write data/results to a file or hard-code data files into your script, be sure to do the following:

    • Use full, absolute paths to your data files and for your script's output files.
    • Specify your absolute paths in the script in relation to how the Docker container will see them. For a file on the host machine at /shared/data/genotypes.vcf, the absolute path specified in your script would be /tmp/data/genotypes.vcf. Likewise, a file output to /tmp/ will be accessible on the host machine at /shared, even after the Docker container stops running.
    • Store your data files within the linked directory on the host system. In this example, that directory would be within /shared (or a subdirectory of /shared).

    Write your script to a file saved within the /shared directory on the host machine. In this example, we will write a script called hello_world.R saved at host directory location /shared/scripts/hello_world.R:

    # hello_world.R - The world's best example script 
    
    # Print the user-specified string
    # Args:
    #   something: A string.
    # Returns: null
    say.something <- function(something){
        print(something)
        return(NULL)
    }
    
    say.something("Hello World!")
  4. Run your script within a Docker container. In this example, we will run hello_world.R using the r-base:4.0.2 Docker image, making sure to link the host file system with the Docker file system using the --mount option. CPU and memory specification options are included here as well.

    docker run \
      --rm \
      --memory=1g \
      --cpus=1 \
      --mount type=bind,src=/shared,dst=/tmp \
      -t r-base:4.0.2 \
       Rscript /tmp/scripts/hello_world.R

    Successful execution of this script will print the following to STDOUT:

    [1] "Hello World!"
    NULL
    

    When redirecting STDOUT to a file with >, use the host machine's directory structure for the log path, since the redirection is performed by the host shell rather than the container:

    docker run \
      --rm \
      --memory=1g \
      --cpus=1 \
      --mount type=bind,src=/shared,dst=/tmp \
      -t r-base:4.0.2 \
       Rscript /tmp/scripts/hello_world.R > /shared/hello_world.log
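When the analysis is instead submitted through a job scheduler as mentioned at the start of this section, the same command can be wrapped in a batch script. A minimal SLURM sketch, assuming the compute nodes have Docker installed and the resource directives match your cluster's conventions:

```shell
#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G

# The docker run call is identical to the one used above; the scheduler
# only decides where and when it executes.
docker run \
  --rm \
  --memory=1g \
  --cpus=1 \
  --mount type=bind,src=/shared,dst=/tmp \
  -t r-base:4.0.2 \
  Rscript /tmp/scripts/hello_world.R > /shared/hello_world.log
```

Submit the script with sbatch; for SGE, the header lines would use #$ directives instead.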